Order-of-Magnitude Thinking: April 2015

Thursday, April 16, 2015

Refining Order-of-Magnitude Estimates with Monte Carlo Simulation

I recently showed how to use order-of-magnitude thinking and interval estimates to identify which of four potential threats to the continuity of a hypothetical business in the San Francisco area would actually concern a business continuity planner. They were Earthquake and Pandemic. This was the result of literally multiplying the worst-case values for frequency of occurrence and loss magnitude. (This is “honest math,” not multiplying red times green to get yellow, because we started with actual numbers.)

An interval estimate, such as a probable frequency of earthquakes between once every hundred years and once every ten years (that is, between 0.01 and 0.1 times per year), is another way of saying we are uncertain what the actual value would turn out to be if we had perfect information. There is some number that, if we knew it, would lie between 0.01 and 0.1. We can model our belief about this number as a random variable with some probability distribution between those two limits.

But which of the infinite number of probability distributions should we use? Since I am completely unsure which number it would be, or even what it would be near, I’ll choose a uniform distribution, so that it is equally likely to be anywhere in the range. I did this for all four quantities – the loss event frequencies and the loss event magnitudes of Earthquake and Pandemic.

What I now do is randomly pick, according to each probability distribution, numbers for loss event frequency and loss magnitude, and multiply them together (“honest math”) to get the annualized loss expectancy (ALE) for that combination of frequency and magnitude. That gives me one data point for what the ALE could be. If I did that a jillion times, I’d get good coverage of the whole range of frequency and magnitude, and so get a whole population of ALEs that could occur consistent with my estimates. If we plotted the distribution of ALEs, we’d have a complete description of the risk of that BC threat. That is exactly what we mean by “risk.”
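As a rough sketch of the procedure just described, here is how the sampling might look in Python. The function name `simulate_ale`, the seed, and the use of Python's standard `random` module are my own choices for illustration. The Earthquake bounds are the ones given in these notes (0.01 to 0.1 events per year, $200K to $4M per event).

```python
import random

def simulate_ale(freq_low, freq_high, loss_low, loss_high, trials, rng):
    """Draw `trials` ALE values: a uniform frequency times a uniform loss magnitude."""
    return [
        rng.uniform(freq_low, freq_high) * rng.uniform(loss_low, loss_high)
        for _ in range(trials)
    ]

rng = random.Random(2015)  # fixed seed (an arbitrary choice) so the run is repeatable

# Earthquake: once per 100 years to once per 10 years, $200K to $4M per event.
earthquake_ale = simulate_ale(0.01, 0.1, 200_000, 4_000_000, 1_000, rng)

print(f"trials:             {len(earthquake_ale)}")
print(f"smallest ALE drawn: ${min(earthquake_ale):,.0f}")
print(f"largest ALE drawn:  ${max(earthquake_ale):,.0f}")
```

Every draw lands between the low-times-low and high-times-high corners ($2K and $400K here), and most land well inside them.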

See all that stuff in the previous paragraph? That’s a Monte Carlo simulation. You know – Monte Carlo – that’s the place where they spin roulette wheels to generate random numbers. At least they are random in honest casinos.

I did that for Earthquake and Pandemic. Here is what I got for the simulations of ALE for each. Each chart summarizes the results of 1,000 simulations. The top charts are the frequency histograms; the bottom charts are the cumulative probabilities. If I really did a jillion, the lines would be nice and smooth.

Now here’s the point. We may say, using our management judgment, that the 95% point for loss expectancy (or some other point) is our benchmark for how we will assess risk. For Earthquake, the 95% point is about $285K of ALE, almost 30% less than the worst case of $400K. For Pandemic, the 95% point is $390K versus a maximum of $528K, or 27% lower than the worst case. Of course the comparisons are even more dramatic for the 90% and 80% points.
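To make the 95% point concrete, here is a minimal sketch that reads it off a simulated population of ALEs. The helper `percentile`, the seed, and the 100,000 trials are my assumptions (more trials than the 1,000 behind the charts, to steady the estimate). The Pandemic bounds use 1/3 per year and $1.6M, so the computed worst case comes out slightly above the $528K quoted above.

```python
import random

def percentile(values, fraction):
    """Return the sample value below which roughly `fraction` of the samples fall."""
    ordered = sorted(values)
    index = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]

rng = random.Random(42)  # arbitrary fixed seed for repeatability
TRIALS = 100_000

# name: (frequency low, frequency high, loss low, loss high)
threats = {
    "Earthquake": (0.01, 0.1, 200_000, 4_000_000),
    "Pandemic": (1 / 30, 1 / 3, 180_000, 1_600_000),
}

p95 = {}
for name, (f_lo, f_hi, l_lo, l_hi) in threats.items():
    ales = [rng.uniform(f_lo, f_hi) * rng.uniform(l_lo, l_hi) for _ in range(TRIALS)]
    p95[name] = percentile(ales, 0.95)
    worst_case = f_hi * l_hi
    print(f"{name}: 95% point ${p95[name]:,.0f} vs worst case ${worst_case:,.0f}")
```

Changing `0.95` to `0.90` or `0.80` gives the other benchmark points mentioned above.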

The Upshot. The net of it all is that by using some pretty simple Monte Carlo simulations we can get a more realistic picture of our risk than the-worst-times-the-worst, but still as conservative as we like.

The Total Risk. In BCP parlance, the total risk assessment (TRA) is simply the list of the conceivable threats with their likelihoods, loss magnitudes, and some kind of judgment combining the two. It’s more like an inventory than a total. But we are more sophisticated than that. We know that risk is the probability distribution of annual loss expectancy, not some fake-math multiplication of red times green. With the probability distributions of ALE for Earthquake and Pandemic in hand, we simply use Monte Carlo to get the probability distribution of the sum, which is the total risk. I’ve done that for Earthquake and Pandemic, and also for the two threats that are not so interesting, Blizzard and Aviation Accident. Here is the cumulative probability of annual loss expectancy for all four threats:

Here we see that, supposing these four threats are the only ones we need be concerned with, and that they are independent of each other, there is a 95% chance that total ALE is $550K or less. This is less than the sum of the 95% points for the individual threats, and a whopping 40% less than the $938K total of the maxima because, again, if you are having a bad year on one threat you are unlikely to have a bad year on another.
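The total can be sketched the same way: draw one ALE per threat, add them up, and repeat. This is only an illustration under stated assumptions. The threats are sampled independently, the Aviation Accident loss range is taken to equal the Earthquake range (the notes only call the impact "comparable"), and the names, seed, and trial count are mine.

```python
import random

def percentile(values, fraction):
    """Return the sample value below which roughly `fraction` of the samples fall."""
    ordered = sorted(values)
    index = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]

rng = random.Random(7)  # arbitrary fixed seed for repeatability
TRIALS = 100_000

# name: (frequency low, frequency high, loss low, loss high)
threats = {
    "Blizzard": (1 / 100, 1 / 30, 30_000, 300_000),
    "Earthquake": (1 / 100, 1 / 10, 200_000, 4_000_000),
    "Aviation accident": (1e-7, 1e-5, 200_000, 4_000_000),  # loss range assumed
    "Pandemic": (1 / 30, 1 / 3, 180_000, 1_600_000),
}

def draw_ale(bounds):
    """One independent draw of frequency times magnitude for a single threat."""
    f_lo, f_hi, l_lo, l_hi = bounds
    return rng.uniform(f_lo, f_hi) * rng.uniform(l_lo, l_hi)

per_threat = {name: [draw_ale(b) for _ in range(TRIALS)] for name, b in threats.items()}

# Trial i of the total is the sum of trial i of every threat.
totals = [sum(draws) for draws in zip(*per_threat.values())]

p95_total = percentile(totals, 0.95)
sum_of_p95 = sum(percentile(v, 0.95) for v in per_threat.values())
sum_of_maxima = sum(f_hi * l_hi for _, f_hi, _, l_hi in threats.values())

print(f"95% point of total ALE:   ${p95_total:,.0f}")
print(f"sum of per-threat 95%s:   ${sum_of_p95:,.0f}")
print(f"sum of worst-case maxima: ${sum_of_maxima:,.0f}")
```

The ordering of the three printed figures is the thing to look at: the 95% point of the sum sits below the sum of the individual 95% points, which in turn sits below the sum of the maxima.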

Monte Carlo simulation lets us easily get a deeper and more realistic analysis of multiple factors, and see them in context, than traditional methods allow. And it’s not that hard.

Monday, April 13, 2015

Business Continuity Examples of Order-of-Magnitude Thinking

In a previous note I showed how to use order-of-magnitude thinking to quickly narrow down a highly uncertain number to a workable range. I used the rather artificial example of the number of pages in the Christian Bible (equally applicable to Gone with the Wind or Harry Potter Meets Dracula). Here I show a real-life example from business continuity planning.

The Challenge. How in the world can a conscientious business continuity analyst possibly come up with the dozens of estimates needed for a competent total risk assessment (TRA), which is just the first step in a business continuity plan? This note shows with concrete examples how order-of-magnitude thinking and interval estimates can make fast work of this task, and still get a result that is both sensible and defensible.

Taking inventory of the possible threats to business continuity is one of the first steps in making a business continuity plan (BCP). (I use the term “threat” to align with the FAIR taxonomy on risk, although “hazard” would suit too.) Often this starts with somebody’s long list of threats. These lists are commonly of the one-size-fits-all sort, without regard to any particular circumstances, and so comprise a vast variety of threats, many of which would not apply. The analyst is then charged to assess the probability, or probable frequency of occurrence, and the probable loss to the business if each threat were to materialize. She may be on the defensive to explain why the risk of a typhoon can be ignored. Finally the analyst is to somehow combine the probability with the magnitude of loss to come up with a loss expectancy estimate for each of these several dozen threats. And that’s just table stakes for a BCP.

I’ll demonstrate the method with four representative threats for a hypothetical software development business located in the San Francisco Bay Area:

- Blizzard
- Earthquake
- Aviation accident, and
- Pandemic.

They represent the general categories of meteorological, geological, technological, and medical threats. I’ll give my personal (therefore subjective) estimates for probable frequency of occurrence and also subjective estimates for the dollar impact on this hypothetical business if each threat were to occur. In all cases I’ll give a rough range from low to high. Finally I’ll use half-orders of magnitude, that is, numbers like 1, 3, 10, 30, 100, etc., because I believe this is close enough for a first cut. The second cut comes, well, second.

Blizzard. Snow is very unlikely in the Bay Area except at the highest elevations, but I realize that a snow event big enough to impact the business could occur, so I’ll estimate the frequency to be between once in 30 years and once in 100 years. If such an event were to occur, I feel it is highly likely it would not last more than a day. Since this business is all knowledge work, the business impact would mostly be loss of people productivity. Suppose this business has 300 people and the average total compensation is $150K / year. I also assume that the value lost is reasonably approximated by the replacement cost of the work. One day of lost productivity out of 250 working days per year is roughly $200K ($150K x 300 / 250). (If your software engineers work 80 hours a week, scale accordingly. Your mileage may vary.) Even in this event probably most people would work at home, which they often do anyway, so the loss may be more like half a day, or $100K. With these numbers in mind I estimate the conditional impact between $30K and $300K. (In fact, a short search of historical records shows that snow has accumulated on the streets of San Francisco in historical times, so a frequency of once a century is reasonable.)

Earthquake. This is earthquake country, no doubt about it. As a casual reader of the local papers I am aware of geologists’ estimates that the next Big One will likely occur within 30 years, so I’ll put the probable frequency in the range of once in 10 years to once in 100 years. Notice that I am giving wide latitude – half an order of magnitude either way – to the consensus number, in recognition of the uncertainty. But if the Big One were to occur, the business would effectively be shut down for some time. The question is, how long? The Loma Prieta quake in 1989 took most people one to a few days to get back on their feet. That’s the low end. The high end may be 10 to 20 days, so again using half-order-of-magnitude thinking I’ll estimate an impact of 1 to 30 days, or $200K to $4M. This may seem like a uselessly wide range, but stay tuned.

(Notice that I am ignoring a lot of detail here at the high end. What about loss of revenue and penalties for missed delivery dates? What if the firm is driven into bankruptcy? We’ll get to that later.)

Aviation Accident. There are several airports in the area, both large commercial and small general aviation. An accident in the flight path could plausibly affect almost any building in the Bay Area. If this were to happen I judge the impact to be comparable to an earthquake – damage could range from minimal to catastrophic. However I can only think of a few cases in the United States in the past two or so decades of an aviation accident impacting people on the ground, aside from terrorism (which is a different threat). If there have been say 10 such cases in 10 years, spread over what must be more than a million buildings, the probable frequency is something like one in one million per year. I could easily be off by an order of magnitude either way, so I’ll put the frequency at 1 in 100,000 to 1 in 10 million.

Pandemic has attracted much attention from BC planners in the last few years so it is worth a look. Given the news coverage of Ebola, I am going to estimate the probable frequency at between once in three years and once in 30 years. The impact on the business would again be loss of productivity. In the optimistic case only a few people, say 10, would be personally affected, assuming public health resources are effectively mobilized and people cooperate to prevent the spread. In the pessimistic case 30% of the staff may not be able to work for several weeks, say 30 days. I’ll assume unaffected people can work from home if necessary with no productivity impact. Multiplying it out I get an impact range of roughly $180K (10 people x $150K x 30/250) to $1.6M (30% x 300 x $150K x 30/250).
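The productivity arithmetic behind the Blizzard and Pandemic impacts can be checked in a few lines. This is just a sketch of the replacement-cost assumption stated above. The constant and function names are mine, while the inputs (300 people, $150K per year, 250 working days) are the ones used in this note.

```python
HEADCOUNT = 300
COMPENSATION = 150_000  # average total compensation, dollars per person per year
WORKING_DAYS = 250      # working days per year

def productivity_loss(people, days):
    """Replacement cost of `people` each losing `days` working days."""
    return people * COMPENSATION * days / WORKING_DAYS

blizzard_full_day = productivity_loss(HEADCOUNT, 1)    # 180,000: "roughly $200K"
blizzard_half_day = productivity_loss(HEADCOUNT, 0.5)  # 90,000: "more like $100K"
pandemic_low = productivity_loss(10, 30)               # 180,000
pandemic_high = productivity_loss(round(0.30 * HEADCOUNT), 30)  # 1,620,000

for label, value in [
    ("Blizzard, full day", blizzard_full_day),
    ("Blizzard, half day", blizzard_half_day),
    ("Pandemic, optimistic", pandemic_low),
    ("Pandemic, pessimistic", pandemic_high),
]:
    print(f"{label:<22} ${value:,.0f}")
```

The exact figures are then rounded to half-orders of magnitude or convenient round numbers, which is all the precision a first cut needs.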

We’ve done all the spadework, so now we can put the results together.

To compute annual loss expectancy I’ve simplistically multiplied the lows by the lows and the highs by the highs. This could be overly pessimistic in the case of the highs because it assumes the highest frequency occurs together with the highest loss, which is probably not the case. In fact, more-frequent losses tend to be the lower-magnitude ones. We could improve on this with a Monte Carlo simulation but for a first cut the table is good enough.
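The lows-times-lows and highs-times-highs calculation can be reproduced from the ranges given above. Treat this as a sketch rather than the original table. The Aviation Accident impact is assumed equal to the Earthquake range, since the text only calls it comparable, and 1/3 and 1/30 are used exactly instead of rounded decimals, so a figure or two may differ slightly from a hand-rounded version.

```python
# name: ((frequency low, frequency high) per year, (impact low, impact high) in dollars)
threats = {
    "Blizzard": ((1 / 100, 1 / 30), (30_000, 300_000)),
    "Earthquake": ((1 / 100, 1 / 10), (200_000, 4_000_000)),
    "Aviation accident": ((1e-7, 1e-5), (200_000, 4_000_000)),  # impact range assumed
    "Pandemic": ((1 / 30, 1 / 3), (180_000, 1_600_000)),
}

ale = {}
print(f"{'Threat':<18} {'ALE low':>14} {'ALE high':>14}")
for name, ((f_lo, f_hi), (i_lo, i_hi)) in threats.items():
    # Honest multiplication: events per year times dollars per event.
    ale[name] = (f_lo * i_lo, f_hi * i_hi)
    low, high = ale[name]
    print(f"{name:<18} {low:>14,.2f} {high:>14,.2f}")
```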

Please note that the calculation of annual loss expectancy is an honest multiplication. The method avoids the fake math of “multiplying” a “low” frequency by a “high” impact to get a “medium” loss expectancy, and the like.

Notice also that the annual loss expectancies fall naturally into two categories, the ones that seem safe to ignore and the ones we need to pay attention to. Also the threats in the two categories do seem to accord with intuition.

Benefits. This analysis has done several things for us:

- it focuses the BC planning where it really ought to go
- it shows where we may need to take a second cut
- it provides reasonable justification for what we decide to ignore
- it refines our intuition (and can alert us to blind spots), and
- it makes efficient use of our time.

Not a bad deal.

Wednesday, April 1, 2015

What a Risk Decision Actually Is

In this note I’ll dissect and expose exactly what is meant by making a decision among risky alternatives, and what we should expect the management of an organization to be able to do in making these decisions.

In a previous note I proposed the following definition:

Risk Decision. A decision by the leadership of an organization to accept an option having a given risk function in preference to another, or in preference to taking no action. I assume that competent leadership of any organization worth its pay can make such a decision, at the appropriate level of seniority.

The term is shorthand for a decision between alternatives, at least one of which has a probability of loss. (Usually in cyber risk we are concerned with losses, but all the ideas extend naturally to upside or opportunity risk. Few people and fewer organizations take on risk without some expectation of advantage, if only cost avoidance.)

The definition depends on the idea of a risk function (AKA “the risk” of something) as:

The probability distribution of loss magnitudes for some stated period of time, such as one year. This is what I think most people really mean when they speak of the “risk” of something.

I like to think of the risk function in terms of its loss exceedance curve, the probability distribution that a particular loss magnitude will be exceeded, for the given time frame, as a function of the loss magnitude. The nearby graphic illustrates two possible loss exceedance curves for a “before” and “after” assessment of an investment which is supposed to reduce risk.

These curves are the final quantitative result of a risk analysis of a particular scenario. The decision problem is whether to invest in the control or not. (It may be a web application firewall, for instance.) The analysis says, for instance, that investing in the control will reduce the chance of annual loss greater than $40K from 95% to 20%. Sounds pretty good!

Of course there is more to it. Management needs to know how much the control will cost. Costing out a control, including recurring and non-recurring costs, cost of capital, staff support, all in, is a well-established discipline compared to risk analysis, so let’s assume it has been done. Suppose the price tag is $20K. Management has to decide if the reduction in risk is worth the cost.

There has been much agonizing in the literature about how a rational actor can consistently choose among risk functions. The most prominent approach is von Neumann-Morgenstern utility. Its main result is that, given any risk function, a rational actor can assign a number with his personal utility function such that more-preferred risk functions always have higher numbers than less-preferred ones. It’s a nifty but impractical result for several reasons. For one thing, it turns out to be hard to estimate a person’s utility function. And if it’s hard for the average person, you will not get many a CEO to sit still for the exercise. For another, risk decisions, especially big ones, are often made jointly by multiple stakeholders, like the CIO, CFO and CEO, for good reasons. Getting a utility function for a committee is even harder. Finally, senior managers have an understandable need to “do a gut check” and personally engage with big decisions. They are not going to delegate the decision to a formula, nor should they.

So I assume that, given two risk functions, leadership can and will know which they prefer. Making risk decisions is what they are paid to do. This is the reason for my definition of a “risk decision.”

The definition has some immediate implications. The first is that through a series of pair-wise comparisons leadership can set any set of risk functions in order from most-preferred to least-preferred. On one end, the reaction is, “This is great! Where do I sign?” At the other it’s “Over my dead body.” In between there is a zone of indifference where management thinks “I don’t really care one way or the other.”

Next, having in principle ranked a bunch of risk functions, management will say that there are some it just would not choose if it had the option not to. So there is a notion of “this far and no further” in the pursuit of our goals. This is the basis of the definition of:

Risk Appetite. The worst (least-preferred) set of probability distributions of loss magnitudes that the management of an organization is willing to voluntarily accept in the pursuit of its objectives.

In other words, in our ranking scheme, these are the ones just a little better than unacceptable, if we have a choice.

But what if management doesn’t have a choice? Threats can be discovered that we would not actively accept in the furtherance of our objectives. Some we can live with even if we prefer not to. The worst (least-preferred) risk functions that we are willing to tolerate if imposed upon us lead to:

Risk Tolerance. The set of least-preferred probability distributions of loss magnitudes that the management of an organization is willing to accept when presented with them involuntarily.

Risk Tolerance is by definition greater than Risk Appetite (it includes more probability distributions of losses). The key is involuntariness.

So we have three sets of risk functions: those we are willing to choose in pursuing our objectives, those we are willing to accept but not opt for, and those we cannot abide. And within those sets there may well be ones that we have about the same preferences for even if their risk functions differ.

What if a loss exposure (aka risk function for a scenario) is discovered that is worse than our risk tolerance? Well then it is by definition intolerable and we have to do something to mitigate or avoid it. A threat of this nature is almost by definition an existential threat to the organization – it threatens the ability of the organization to achieve its goals or perhaps even survive. But that’s another topic: business continuity planning.

A Plea for a New Taxonomy for Cyber Risk

Despite much ink spilled on the subject, the vocabulary of cyber risk continues to be muddled. Just consider that the word “risk” itself has multiple meanings, indiscriminately applied. A construction as absurd as “that risk is a high risk” is perfectly possible in today’s vocabulary. This note is the first of a series of contributions to solve this problem.

The root of the problem is that people try too hard to reconcile specialist meanings with ordinary language. For literature and poetry, the multiple meanings of English words and the ambiguities of syntax are often a useful and sometimes a wonderful thing. But for practical affairs of science, engineering, business and law, it’s a breeding ground for problems.

Other professions have solved the problem in various ways. Mathematics precisely defines ordinary words like “group” and “function” to have special meanings. The relatively closed nature of the profession prevents misunderstandings among outsiders. This method would not work well in cyber risk, as the word “risk” itself shows, because the specialists have to communicate with non-specialists all the time. We can’t appropriate ordinary words to mean something special only when we talk amongst ourselves.

A variant of the ordinary-word method is to put common words to new meanings, as was done with “motor” and “inductor” in the nineteenth century, and then rely on the obviously new context to prevent misinterpretation.

Another way out is to create new words that ordinary people won’t use. Biology and medicine are famous for this. If you mean a specific kind of mosquito or muscle, it’s anopheles or biceps brachii. When you want to make sure that outsiders are kept outside and suitably intimidated, a dead language is perfect! But that’s the trouble: arcane words are a barrier to communication, and that’s the last thing we need in cyber risk.

We can create new words out of whole cloth, instead of stealing from Aristotle and Virgil. “Cybernetics” and “cryogenics” are examples that do not prevent communication with lay persons. Technology is a rich source of neologisms, as witness “transistor,” “diode,” and “automobile.”

The last way out of the swamp of confusion, one that I find very attractive, is the noun phrase. Here you put together a few ordinary words in an improbable juxtaposition, such as “integrated circuit,” “tensile strength,” or “coefficient of thermal expansion.” This seems to be the best solution. The reader needn’t have studied Latin or Greek, she can easily see that something special is meant, and even the non-specialist can get a sense of what the special meaning is.

To get this movement kicked off for cyber risk, I’ll propose some of my own definitions. I build on the excellent foundation of the FAIR taxonomy (Factor Analysis for Information Risk), which you can find on The Open Group website.

First let’s agree to use “risk” by itself only as a lay term, and otherwise regard it as a four-letter word not to be used in polite conversation. And when we use it in a lay context, let “a risk” mean “a loss event scenario,” as advised by Freund & Jones (“Measuring and Managing Information Risk,” p. 353). Notice the “a”.

Here are a few related terms and my proposed definitions.

Risk Function. The probability distribution of loss magnitudes for some stated period of time, such as one year. This is what I think most people really mean when they speak of the “risk” of something.

Loss Exceedance. The probability distribution that a particular loss magnitude will be exceeded, for the given time frame, as a function of the loss magnitude. It is the “tail distribution” of the risk function. This is a standard term in the insurance industry (from which we can learn much). The loss exceedance function has some nice properties which give it intuitive appeal.
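To illustrate the definition, here is a minimal sketch of an empirical loss exceedance curve built from simulated annual losses. The loss samples are purely hypothetical (a lognormal with a $40K median, chosen only for illustration and not taken from any analysis in these notes), and the function name `exceedance_probability` is mine.

```python
import math
import random

def exceedance_probability(losses, threshold):
    """Fraction of simulated annual losses strictly greater than `threshold`."""
    return sum(1 for loss in losses if loss > threshold) / len(losses)

rng = random.Random(1)  # arbitrary fixed seed for repeatability

# Hypothetical annual losses: lognormal with median $40K (illustrative only).
losses = [rng.lognormvariate(math.log(40_000), 1.0) for _ in range(20_000)]

thresholds = [0, 10_000, 20_000, 40_000, 80_000, 160_000, 320_000]
curve = [(t, exceedance_probability(losses, t)) for t in thresholds]

for threshold, probability in curve:
    print(f"P(annual loss > ${threshold:>7,}) = {probability:.3f}")
```

Two of the nice properties show up directly: the curve starts at 1 for a threshold of zero and never rises as the threshold grows.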

And finally, to settle the age-old dispute about the difference between risk appetite and risk tolerance:

Risk Tolerance. The set of least-preferred probability distributions of loss magnitudes that the management of an organization is willing to accept when presented with them involuntarily. Risk Tolerance is by definition greater than Risk Appetite (it includes more probability distributions of losses). The key proviso here is involuntariness.

I’ll have more to offer later about notions like attack, attack surface, attack vector, exploit, flaw, and vulnerability.

Order of Magnitude Thinking

We often run into the problem of estimating a number about which we seemingly have no idea. For example, how many severe defects probably remain undiscovered in software that is now being submitted for deployment to production? The answers I have gotten to this question have been (a) “none, because QA would already have found them and we fixed them,” (b) “we cannot know until we deploy to production and wait 30 days for new bug reports,” and (c) “I have no earthly idea.”

Surely we can do better than this!

We can put some reasonable and often useful bounds on estimates of highly uncertain numbers using order-of-magnitude thinking. I have used this technique on my own estimates and in querying colleagues for many years and usually found the results to be illuminating and useful. You need only some familiarity with the subject. (There are problems, of course, about which we may have no earthly idea, such as the number of neutrinos passing through our body in one second. It’s huge!)

Let’s consider the number of pages in some random edition of the Christian Bible. Which edition hardly matters, as we shall see.

The first step is to take an impossibly low number and an impossibly high number to bracket the range. It’s easy if we limit ourselves to powers of 10. Each power of 10 is one order of magnitude.

Could the Bible be as short as one page? Certainly not. We know it is a rather hefty book. Ten pages? No. A hundred? Again, no. A thousand? Well, I would be unwilling to bet that it’s more than a thousand pages.

How about the other end? Could it be a million pages? No. A hundred thousand? No. Ten thousand? No – that would be at least 4 or 5 very big books, and we know it’s only one book. A thousand? Again I am unwilling to bet it’s less than a thousand.

So we already know that some edition of the Bible is almost certainly between 100 and 10,000 pages. We may now feel, based on this preliminary ranging and other experience, that the answer is in the neighborhood of 1000 pages. So can we narrow the range a bit more?

Instead of limiting ourselves to powers of 10, let’s tighten it up a bit to a half a power of 10, that is, the square root of 10. We’ll use 3 for convenience.

Now, do we think our Bible is more than 300 pages? Yes. We are pretty confident it is several hundred pages. Is it less than 3000? Again, yes. It’s long but not that long. So we have succeeded in tightening the range from two orders of magnitude, 100 to 10,000, to one order of magnitude. Progress!

We could stop here, feeling that we now have enough information for whatever the purpose is. (You need to know how much precision you need. This exercise can help you think about that.)

Or we could try to narrow it further. Having moved from powers of 10 to powers of the square root of 10 (roughly 3), we could next try powers of 2 – “binary orders of magnitude” – and so potentially narrow the range to 500 to 2000 pages. Of course you can go as far as you like with this procedure, until you are uncomfortable with narrowing the range any further.
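The candidate values at each level of refinement are easy to list mechanically, which can help when walking a colleague through the questions. The sketch below is my own illustration. The function name `ladder` is made up, and the 1-2-5 steps stand in for the binary steps because they land on rounder numbers.

```python
def ladder(low, high, mantissas):
    """List mantissa * 10**k values within [low, high], in ascending order.

    Assumes positive integer bounds and mantissas sorted ascending within 1..9.
    """
    rungs = []
    decade = 1
    while decade <= high:
        for mantissa in mantissas:
            value = mantissa * decade
            if low <= value <= high:
                rungs.append(value)
        decade *= 10
    return rungs

powers_of_ten = ladder(100, 10_000, (1,))       # [100, 1000, 10000]
half_orders = ladder(100, 10_000, (1, 3))       # [100, 300, 1000, 3000, 10000]
one_two_five = ladder(100, 10_000, (1, 2, 5))   # [100, 200, 500, 1000, 2000, 5000, 10000]

print("powers of 10:", powers_of_ten)
print("half orders: ", half_orders)
print("1-2-5 steps: ", one_two_five)
```

At each level you ask the same two questions ("could it be less than this rung? more than that one?") and stop when you are no longer willing to bet.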

This procedure is quick and often yields useful insights into the probable magnitude and into the extent of our uncertainty. It is surprising how often the result is “good enough.” And it may quickly guide us to which among several highly uncertain numbers it is worth the effort to research more carefully. As Doug Hubbard says, you know more than you think you do, what you do know is often good enough, and it is usually only one or two numbers among many that are worth buying more precision about.

Post script: This method is inspired by the scale knobs on many kinds of electronic test equipment, which often have to accommodate huge ranges. A voltmeter may need to measure from millivolts or microvolts to 1000 volts – 6 to 9 orders of magnitude. They have range settings using the 1-2-5-10 scheme, for 1, 2, 5 and 10 millivolts of sensitivity, and so on up the scale. A useful way of thinking!