Tuesday, May 12, 2015

Threat Capability and Resistance Strength: A Weight on a Rope

Threat Capability and Resistance Strength in the FAIR taxonomy are among the more abstract and difficult concepts to get a firm grasp on.  The standard seeks to fix ideas with the analogy of a weight on a rope.  This note models that analogy in detail and uses it to explore these concepts.

The FAIR taxonomy [1] uses the term “vulnerability” in a special way that differs significantly from how it is used by CERT and many network and software scanners.  “Vulnerability” in FAIR is “the probability that a threat event will become a loss event.” The usual meaning of “vulnerability” in information security is a flaw or suboptimal configuration in software or hardware. The taxonomy breaks Vulnerability into two component drivers, Threat Capability and Resistance Strength.  (I’ll use initial capitals to make it clear where FAIR-defined words are meant.  I’ll also use the standard abbreviations Vuln, TCap, and RS.)  Note that since Vulnerability is a probability, it is a number between 0 and 1, or 0% and 100%.

Threat Capability is defined as “the probable level of force that a threat agent is capable of applying against an asset,” leaving it to the analyst to identify what kind of “force” is to be considered for the scenario at hand, and how to quantify it.  “Probable level” is a hint that TCap is a probability distribution, though it could be a single number in a simple case. Resistance Strength is defined as “the strength of a control as compared to a baseline unit of force.”  The accompanying discussion in the standard emphasizes that RS is to be measured on the same scale as TCap, which is helpful to the extent that one understands what “force” means for the TCap.  To help fix ideas for all three concepts, the standard offers the example of a weight (the Threat Agent) on a rope (which is a control that protects an asset – maybe your toes beneath the weight).  The force is gravity, the measure of force is pounds-force or Newtons, and the Resistance Strength is the tensile strength of the rope, and so it too is measured in pounds or Newtons.  The Vulnerability is then the probability that a specific weight, or population of possible weights, will exceed the tensile strength of the rope.

Let us model this scenario to see if it can help us understand these three ideas better.  First we define the scenario.

Scenario Description

Purpose:  To assess the risk posed by weights on a construction site being hoisted over a partially-completed building.

Assets:  A building under construction, materials and equipment on the site, life safety of the workers.

Threats:  Heavy construction materials, such as steel beams and loads of wet concrete to be hoisted.

Threat Event:  A load being hoisted over the building or the site.

Loss Types:  Structural integrity of the building, availability of the building on the site for further work, availability of the building for delivery to the owner on the contracted date (using the C-I-A loss categories).

Risk Scenario:  A construction load being hoisted into position breaks its rope (Threat Event) and crashes into the building or the site, damaging the building, materials, and equipment, and causing injury or loss of life (Loss Event).

Threat Community:  The set of loads planned to be hoisted, ranging from a very light load to 35 kilonewtons (7875 pounds of force to us Yanks), with an uncertainty of +/- 5 kN (one standard deviation).

Threat Agent:  The specific member of the Threat Community we’ll start with is the maximum weight of 35 kN +/- 5 kN.

Control:  A steel rope with a specified tensile strength of 40 kN (about 9000 pounds), with an uncertainty of +/- 3 kN (one standard deviation).  We’ll assume the specification is one standard deviation lower than the mean breaking strength of 43 kN.

Analysis

The problem is to determine how likely it is that the load exceeds the strength of the rope, or in FAIR terms the probability that a Threat Event becomes a Loss Event.  That is precisely the FAIR Vulnerability.  In any given hoisting operation, we have a load of uncertain weight imposing a force on a rope of uncertain tensile strength.  If the load exceeds the rope strength, the rope breaks and we have a Loss Event.  We need to determine how likely it is (the probability) that the uncertain load will exceed the uncertain tensile strength.


Like a B-minus sociology student, we shall naively assume that all probability distributions are normal (Gaussian), and casually ignore the infinitesimal probabilities of negative weights and negative tensile strengths.  Given that, here is the probability distribution of the biggest planned load (Threat Agent).



The density function peaks at 35 kN, which is also the 50% point on the cumulative distribution, as it should.

The tensile strength has a similar probability distribution, but I find it more natural to think of it in terms of its cumulative distribution – that is, what is the probability of breaking at or below any given load – rather than its density function.  Here it is:



Notice that the cumulative curve is a similar shape to the one for the load but shifted a bit to the right (we should hope that the strength is at least a bit greater than the load). 

Here is what we do to figure the Vulnerability.  (Plus one point if you smell a Monte Carlo simulation coming.)

Procedure

1. Generate a random variable according to the density function of the load (normal, mean 35 kN, standard deviation 5 kN).
2. For each realization of the load random variable, look up the probability of the rope breaking, and record it.  For 40 kN, it is 0.16.
3. Do this a bunch of times, say 1000.
4. Average the thousand probabilities you got in step 2.
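If you want to try this yourself, here is a minimal sketch of the procedure in Python.  The distribution parameters are the ones assumed in the scenario above; the use of NumPy and SciPy, and the variable names, are my own choices, not part of the FAIR standard.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 1000

# Step 1: sample the load (Threat Agent): normal, mean 35 kN, sd 5 kN.
loads = rng.normal(loc=35.0, scale=5.0, size=n)

# Step 2: for each sampled load, look up the probability that the rope
# breaks at or below that load (strength: normal, mean 43 kN, sd 3 kN).
p_break = norm.cdf(loads, loc=43.0, scale=3.0)

# Steps 3 and 4: average the recorded probabilities to get the Vulnerability.
print(f"Vulnerability ~ {p_break.mean():.3f}")   # roughly 0.08; varies run to run

# The alternative procedure mentioned below: sample both load and strength
# and average the indicator "load exceeds strength"; it converges to the
# same answer.
strengths = rng.normal(loc=43.0, scale=3.0, size=n)
print(f"Indicator method ~ {(loads > strengths).mean():.3f}")
```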


The answer is a single number, the probability of the rope breaking, averaged over the probable load weights for the given load (Threat Agent) and rope strengths.  This is the Vulnerability, the probability for this load size (Threat Agent) that a Threat Event becomes a Loss Event.  The number I got was 0.079. (There will be some run-to-run variation in a MC simulation.)

(Another procedure is to generate two random variables, one for load and one for strength.  You record a 1 if load is greater than the strength and 0 otherwise.  The average of the 1’s and 0’s is the answer.  This is the method Jack Jones uses in his video on the CXOWARE web site.  It gives the same answer but I find the procedure above easier to understand. It can be shown that the two procedures are equivalent.)

Vulnerability for Various Threat Agents

We could repeat the analysis for a whole range of loads we see lying around on the construction site.  In FAIR words, there are other Threat Agents in the Threat Community, and they have different Threat Capabilities.  After putting away my steel-toed work boots, I did that.  Here’s what I got.  Each dot represents the probability of breaking for a load whose mean size is shown on the x axis.  The standard deviations of all the loads are 5.0 kN.
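Here is a hedged sketch of how such a sweep might be coded, reusing the simulation above over a grid of mean load sizes.  The grid itself is my choice for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def vulnerability(mean_load, sd_load=5.0, mean_rs=43.0, sd_rs=3.0, n=10_000):
    """Vulnerability (P that a Threat Event becomes a Loss Event) for one TA."""
    loads = rng.normal(mean_load, sd_load, size=n)
    return norm.cdf(loads, loc=mean_rs, scale=sd_rs).mean()

for mean_load in range(20, 56, 5):   # candidate mean load sizes in kN
    # For two normals there is also a closed form:
    # norm.cdf((mean_load - 43) / (5**2 + 3**2) ** 0.5)
    print(mean_load, "kN:", round(vulnerability(mean_load), 3))
```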



We see that the probability of failure (Threat Event becomes a Loss Event) increases with the load (that’s reassuring) and gets pretty high as we approach the specified tensile strength of the rope of 40 kN (that is too). 

This set of points looks an awful lot like the curve for the rope, but it’s not the same.  Here are both sets of data plotted on the same chart.



For small mean loads (TAs), the Vulnerability (the dots) is greater than the probability of the rope breaking at exactly that mean load (the curve).  Why?  Because a load that averages, say, 35 kN has some probability of being more than 35 kN, and a heavier load has a bigger break probability.  The opposite is true for large average loads, above about 45 kN.  The curve for the dots is flatter than the curve for the rope because the Vuln includes the uncertainty in the sizes of the loads as well as the uncertainty in the rope strength.

Vulnerability for the Threat Community

Each dot in the previous chart represents a specific member of the Threat Community, a specific Threat Agent.  In our scenario, it is a load or group of loads with a certain average weight and a certain standard deviation.   The dot is the Vulnerability for that load size (TA).
 
Now suppose we want to generalize to the whole Threat Community.  After all, the job is to finish the building, not just to hoist one kind of load.  In surveying the job we might see that there are 50 or 100 kinds of small loads, and only a few of the very largest loads.  In that case we would do this:

  1. Take a census of loads to be hoisted.  This is the Threat Community.
  2. Classify them into a reasonable number of relatively homogeneous subsets.  Each is a Threat Agent.  Estimate their means and standard deviations.  Count the number in each subset.
  3. For each TA, do the MC simulation like we did above for the 35 kN load, and so get the probability of failure (Vulnerability, conditional for that particular TA).
  4. Compute the weighted sum of these conditional Vulnerabilities.  The weights are relative frequencies of occurrence of the various TAs (subsets).  Each hoisting job counts as one.  (See the sketch just below.)
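Here is a minimal sketch of that roll-up.  The census below (subsets, counts, means, and standard deviations) is entirely made up for illustration; only the rope parameters come from the scenario.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)

# Hypothetical census: (mean kN, sd kN, number of hoists) for each TA subset.
threat_community = [
    (10.0, 2.0, 100),   # many small loads
    (20.0, 3.0,  50),
    (30.0, 5.0,  20),
    (35.0, 5.0,   5),   # a few of the very largest loads
]

def vulnerability(mean_load, sd_load, mean_rs=43.0, sd_rs=3.0, n=10_000):
    loads = rng.normal(mean_load, sd_load, size=n)
    return norm.cdf(loads, loc=mean_rs, scale=sd_rs).mean()

total_hoists = sum(count for _, _, count in threat_community)
vuln_tc = sum(count / total_hoists * vulnerability(m, s)
              for m, s, count in threat_community)
print(f"Vulnerability for the whole Threat Community ~ {vuln_tc:.3f}")
```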


The weighted sum is the Vulnerability for the entire Threat Community.  It is just a number, like 0.01 or 0.50 or 0.97.  Unlike Threat Event Frequency or Annual Loss Expectancy itself, it is not a distribution.

What do you expect to find?  You expect that the Vulnerability for the entire TC is less than the Vulnerability for the worst-case TA.  This may be confusing.  As risk managers, what should we plan for, the entire TC (which gives us a happier number) or the worst-case TA?  Well, that depends on your scenario.  Obviously if your scenario is a mix of TAs you expect to encounter, the Vuln is going to be lower than for the worst-case TA.  You think, in your risk-averse mind, “Gosh, I need to plan for the worst case.”  But now is the time to think carefully (well, again, not for the first time!).  This is the root of disagreements about whether risk should be assessed based on the worst case or the whole range of expected possibilities.  (Another problem with “worst case” is that it is usually ill-defined, if defined at all.  There is practically no limit to how bad a worst case can be.  Leaving it to the analyst will lead to uncontrolled biases, inability to compare results, and lack of reproducibility.)

Yes, you need to be aware of, and understand the consequences of, the (plausible) worst case.  But that is not an accurate description of your expected overall experience.  Yes, the worst-case could happen, and sooner or later it will happen, and it needs to be accounted for in the analysis, but it is a mistake to over-weight it.

How do we properly weight the worst case with all of the non-worst cases?  The answer is with Threat Event Frequency.  If the scenario is the worst-case TA, then the TEF is presumably lower than if the scenario is for the entire Threat Community.  If the scenario is for the entire Threat Community, not just the worst-case TA, then the worst-case TA will be in there, with its appropriate weight, along with all the lesser TAs in the TC.  In the end, when you roll the results up to the Annual Loss Expectancy, the worst-case TA will be in there, appropriately weighted.  In other words, Yeah, it could happen, but not that often.

Which scenario to choose for analysis depends on what you need to know for making decisions.  In the case of our construction site, it may well be that the scenario that management needs to understand is the worst plausible TA (who cares about the lesser ones?).  In another situation, maybe it is a broader Threat Community.  What you get depends on what you want, all of which goes to show how critical it is to define the scenario carefully, and get agreement it is the right one.

Safety Factors

Nobody in his (or her) right mind would, I hope, even consider hoisting a 40 kN load on a 43 kN rope, or even a much smaller load.  In fact I am sure there are workplace safety regulations about that.

Now suppose you are a regulator whose job it is to place a limit on the permissible load for a certain-size rope.  Limits are commonly stated as the safety factor, the ratio of the rope strength (e.g.) to the permissible load (RS to TCap in FAIR terms).  How do you do that?  One way is to use the method described above as a first step to quantify the probability of failure.  You would need reams of data on material testing.  But it’s only an initial step because setting final rules will of course be as much a values-driven and political process as a technical one.  Nevertheless it is interesting to think how such things can be done, and what kind of logic underlies safety factors of 1.5, 2, or more. 
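As a toy illustration only (not a substitute for material-testing data or for the regulatory process), here is a sketch of the failure probability implied by a given safety factor when both load and strength are assumed normal.  The coefficients of variation below are my own assumptions for illustration, not data.

```python
from math import sqrt
from scipy.stats import norm

def failure_probability(safety_factor, cv_load=0.15, cv_strength=0.07):
    """P(load > strength) for normal load and strength, where
    safety_factor = mean_strength / mean_load.  The coefficients of
    variation are illustrative assumptions only."""
    mean_load, mean_strength = 1.0, safety_factor
    sd_load = cv_load * mean_load
    sd_strength = cv_strength * mean_strength
    z = (mean_load - mean_strength) / sqrt(sd_load**2 + sd_strength**2)
    return norm.cdf(z)

for sf in (1.1, 1.5, 2.0, 3.0):
    print(f"safety factor {sf}: P(failure) ~ {failure_probability(sf):.1e}")
```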

What would it mean to our industry if a safety factor (RS/TCap) of 1.5 or 2 were required by regulation?

Further Questions

If the analysis of the rope example aids your understanding of TCap and RS in cyber risk, it nevertheless raises some other questions.  How can we understand “force” in cyber risk?  What exactly are TCap and RS?  And what exactly is the Threat Community, on which the whole analysis hinges?  I’ll address some of these questions in future notes.

However, if nothing else is clear, I hope you believe now that FAIR is applicable much more broadly than only to information risk.  In fact it can be applied to any risk scenario whose losses can be quantified in a single number, commonly dollars.  Multi-dimensional risk is a whole different beast. 

References:

[1] The Open Group, Risk Taxonomy (O-RT), Version 2.0, Document Number C13K.

Sunday, May 3, 2015

A FAIR Telescope for Cyber Risk



“Imagine what it must have been like to look through the first telescopes or the first microscopes, or to see the bottom of the sea as clearly as if the water were made of gin.”

So the estimable science writer Matt Ridley begins today’s column (Wall Street Journal, May 2, 2015, p. C1) on how DNA sequencing, now so cheap and fast, has begun to illuminate the early history of humankind, with its many migrations, near-extinctions, and assimilations.

The history of science is in no small measure the result of the progress in the technologies of observation.  The virtuous cycle of improved engineering and fabrication to improved observation to scientific advance, and back to improved engineering and fabrication, has profoundly affected all three, as well as our civilization and well-being.

Or, to follow Ridley, imagine the reaction of Louis Pasteur on seeing germs through a microscope.  So too does Fagan-style inspection of software enable its users to see the many “germs” that are defects in code.  (I have used it on all manner of business documents.  The results are inevitably sobering.)

It is almost trite now to say “you cannot manage what you cannot measure.”  But equally you cannot measure what you cannot see.

Analysis and management of operational risk, in particular cyber risk, now has such a microscope, Factor Analysis of Information Risk, or FAIR.  Thanks to the FAIR taxonomy, we now have a vocabulary and a means of identifying and making useful distinctions among the main words we use to describe operational risk.  This allows us to make repeatable and useful measurements of risk and its components, such as threat event frequency and loss magnitudes.


Now that we have precisely defined what we are talking about, we can manage risk better than ever before.

Photo credit "ALMA and a Starry Night" by ESO/B. Tafreshi (twanight.org) - http://www.eso.org/public/images/potw1238a/. Licensed under CC BY 4.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:ALMA_and_a_Starry_Night.jpg#/media/File:ALMA_and_a_Starry_Night.jpg

Thursday, April 16, 2015

Refining Order-of-Magnitude Estimates with Monte Carlo Simulation

I recently showed how to use order-of-magnitude thinking and interval estimates to identify which of four potential threats to the continuity of a hypothetical business in the San Francisco area would actually concern a business continuity planner.  They were Earthquake and Pandemic.  This was the result of literally multiplying the worst-case values for frequency of occurrence and loss magnitude.  (This is “honest math,” not multiplying red times green to get yellow, because we started with actual numbers.)

When we have an interval estimate, such as for the probable frequency of occurrence of earthquakes being between once every hundred years and once every ten years, that is between 0.01 and 0.1 times per year, it is another way of saying we are uncertain what the actual value would turn out to be if we had perfect information.  There is some number that if we knew it would be between 0.01 and 0.1.  We can model our belief about this number as a random variable with some probability distribution between those two limits. 

But which of the infinite number of probability distributions should we use?  Since I am completely unsure which number it would be, or even what it would be near, I’ll choose a uniform distribution, so that it is equally likely to be anywhere in the range.  I did this for all four quantities – the loss event frequencies and the loss event magnitudes of Earthquake and Pandemic. 

What I now do is randomly pick, according to each probability distribution, numbers for loss event frequency and loss magnitude, and multiply them together (“honest math”) to get the annualized loss expectancy (ALE) for that combination of frequency and magnitude.  That gives me one data point for what the ALE could be.  If I did that a jillion times, I’d get good coverage of the whole range of frequency and magnitude, and so get a whole population of ALEs that could occur consistent with my estimates.  If we plotted the distribution of ALEs, we’d have a complete description of the risk of that BC threat.  That is exactly what we mean by “risk.”

See all that stuff in the previous paragraph?  That’s a Monte Carlo simulation.  You know – Monte Carlo – that’s the place where they spin roulette wheels to generate random numbers.  At least they are random in honest casinos. 
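Here is a minimal sketch of that simulation for Earthquake, using the ranges from the earlier note (frequency between 0.01 and 0.1 per year, impact between $200K and $4M, both uniform).  The seed and the trial count are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(2015)
n = 1000   # the charts below summarize 1,000 simulations per threat

# Earthquake: frequency uniform between once in 100 years and once in 10
# years; impact uniform between $200K and $4M (estimates from the earlier note).
freq = rng.uniform(0.01, 0.1, size=n)            # events per year
loss = rng.uniform(200_000, 4_000_000, size=n)   # dollars per event

ale = freq * loss                                # annualized loss expectancy
print(f"Mean ALE:      ${ale.mean():,.0f}")
print(f"95% point ALE: ${np.percentile(ale, 95):,.0f}")
```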

I did that for Earthquake and Pandemic.  Here is what I got for the simulations of ALE for each.  Each chart summarizes the results of 1,000 simulations.  The top charts are the frequency histograms; the bottom charts are the cumulative probabilities.  If I really did a jillion, the lines would be nice and smooth.




Now here’s the point.  We may say, using our management judgment, that the 95% point for loss expectancy (or some other point) is our benchmark for how we will assess risk.  For Earthquake, the 95% point is about $285K of ALE, almost 30% less than the worst case of $400K.  For Pandemic, the 95% point is $390K vs. a max of $528K, or 27% lower than the worst case.  Of course the comparisons are even more dramatic for the 90% and 80% points.

The Upshot.  The net of it all is that by using some pretty simple Monte Carlo simulations we can get a more realistic picture of our risk than the-worst-times-the-worst, but still as conservative as we like. 

The Total Risk.  In BCP parlance, the total risk assessment (TRA) is simply the list of the conceivable threats with their likelihoods, loss magnitudes, and some kind of judgment combining the two.  It’s more like an inventory than a total.  But we are more sophisticated than that.  We know that risk is the probability distribution of annual loss expectancy, not some fake-math multiplication of red times green.  With the probability distributions of ALE for Earthquake and Pandemic in hand, we simply use Monte Carlo to get the probability distribution of the sum, which is the total risk.  I’ve done that for Earthquake and Pandemic, and also for the two threats that are not so interesting, Blizzard and Aviation Accident.  Here is the cumulative probability of annual loss expectancy for all four threats:


Here we see that, supposing these four threats are the only ones we need be concerned with, and that they are independent of each other, there is a 95% chance that total ALE is $550K or less.  This is less than the sum of the 95% points for the individual threats, and a whopping 40% less than the $938K total of the maxima because, again, if you are having a bad year on one threat you are unlikely to have a bad year on another. 
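For completeness, here is a hedged sketch of how that roll-up can be coded, assuming the four threats are independent.  The per-threat ranges come from the April 13 note, and the uniform-distribution assumption is the same simplification as above.

```python
import numpy as np

rng = np.random.default_rng(2015)
n = 1000

# (low, high) frequency per year and (low, high) loss per event, taken
# from the April 13 note; all four pairs modeled as uniform distributions.
threats = {
    "Blizzard":          ((1 / 100, 1 / 30), (30_000, 300_000)),
    "Earthquake":        ((0.01, 0.1),       (200_000, 4_000_000)),
    "Aviation accident": ((1e-7, 1e-5),      (200_000, 4_000_000)),
    "Pandemic":          ((1 / 30, 1 / 3),   (180_000, 1_600_000)),
}

total_ale = np.zeros(n)
for name, ((f_lo, f_hi), (m_lo, m_hi)) in threats.items():
    ale = rng.uniform(f_lo, f_hi, n) * rng.uniform(m_lo, m_hi, n)
    total_ale += ale
    print(f"{name:18s} 95% point: ${np.percentile(ale, 95):,.0f}")

print(f"{'Total':18s} 95% point: ${np.percentile(total_ale, 95):,.0f}")
```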


Monte Carlo simulation allows us to easily get deeper and more-realistic analysis of multiple factors, and see them in context, than the traditional methods.  And it’s not that hard.

Monday, April 13, 2015

Business Continuity Examples of Order-of-Magnitude Thinking

In a previous note I showed how to use order-of-magnitude thinking to quickly narrow down a highly uncertain number to a workable range.  I used the rather artificial example of the number of pages in the Christian Bible (equally applicable to Gone with the Wind or Harry Potter Meets Dracula).  Here I show a real-life example from business continuity planning.

The Challenge. How in the world can a conscientious business continuity analyst possibly come up with the dozens of estimates needed for a competent total risk assessment (TRA), which is just the first step in a business continuity plan? This note shows with concrete examples how order-of-magnitude thinking and interval estimates can make fast work of this task, and still get a result that is both sensible and defensible.

Taking inventory of the possible threats to business continuity is one of the first steps in making a business continuity plan (BCP). (I use the term “threat” to align with the FAIR taxonomy on risk, although “hazard” would suit too.)  Often this starts with somebody’s long list of threats.  These lists are commonly of the one-size-fits-all sort, without regard to any particular circumstances, and so comprise a vast variety of threats, many of which would not apply.  The analyst is then charged to assess the probability, or probable frequency of occurrence, and the probable loss to the business if each threat were to materialize.  She may be on the defensive to explain why the risk of a typhoon can be ignored.  Finally the analyst is to somehow combine the probability with the magnitude of loss to come up with a loss expectancy estimate for each of these several dozen threats.  And that’s just table stakes for a BCP.

I’ll demonstrate the method with four representative threats for a hypothetical software development business located in the San Francisco Bay Area: 
  • Blizzard
  • Earthquake
  • Aviation accident, and
  • Pandemic.

They represent the general categories of meteorological, geological, technological, and medical threats.  I’ll give my personal (therefore subjective) estimates for probable frequency of occurrence and also subjective estimates for the dollar impact on this hypothetical business if each threat were to occur.  In all cases I’ll give a rough range from low to high.  Finally I’ll use half-orders of magnitude, that is, numbers like 1, 3, 10, 30, 100, etc., because I believe this is close enough for a first cut.  The second cut comes, well, second.

Blizzard.  Snow is very unlikely in the Bay Area except at the highest elevations, but I realize that a snow event big enough to impact the business could occur, so I’ll estimate the frequency to be between once in 30 years and once in 100 years.  If such an event were to occur, I feel it is highly likely it would not last more than a day.  Since this business is all knowledge work, the business impact would mostly be loss of people productivity.  Suppose this business has 300 people and the average total compensation is $150K / year.  I also assume that the value lost is reasonably approximated by the replacement cost of the work.  One day of lost productivity out of 250 working days per year is roughly $200K ($150K x 300 / 250).  (If your software engineers work 80 hours a week, scale accordingly.  Your mileage may vary.)  Even in this event probably most people would work at home, which they often do anyway, so the loss may be more like half a day, or $100K.  With these numbers in mind I estimate the conditional impact between $30K and $300K.  (In fact, a short search of historical records shows that snow has accumulated on the streets of San Francisco in historical times, so a frequency of once a century is reasonable.)

Earthquake.  This is earthquake country, no doubt about it.  As a casual reader of the local papers I am aware of geologists’ estimates that the next Big One will likely occur within 30 years, so I’ll put the probable frequency in the range of 10 to 100 years.  Notice that I am giving wide latitude – half an order of magnitude – to the consensus number, in recognition of the uncertainty.  But if the Big One were to occur, the business would effectively be shut down for some time.  The question is, how long?  The Loma Prieta quake in 1989 took most people one to a few days to get back on their feet.  That’s the low end.  The high end may be 10 to 20 days, so again using half-order-of-magnitude thinking I’ll estimate an impact of 1 to 30 days, or $200K to $4M.  This may seem like a uselessly wide range, but stay tuned.

(Notice that I am ignoring a lot of detail here at the high end.  What about loss of revenue and penalties for missed delivery dates?  What if the firm is driven into bankruptcy?  We’ll get to that later.)

Aviation Accident.  There are several airports in the area, both large commercial and small general aviation.  An accident in the flight path could plausibly affect almost any building in the Bay Area.  If this were to happen I judge the impact to be comparable to an earthquake – damage could range from minimal to catastrophic.  However I can only think of a few cases in the United States in the past two or so decades of an aviation accident impacting people on the ground, aside from terrorism (which is a different threat).  If there have been say 10 such cases in 10 years, spread over what must be more than a million buildings, the probable frequency is something like one in one million per year.  I could easily be off by an order of magnitude either way, so I’ll put the frequency at 1 in 100,000 to 1 in 10 million.

Pandemic has attracted much attention from BC planners in the last few years so it is worth a look.  Given the news coverage of Ebola, I am going to estimate the probable frequency between one in three years and one in 30 years.  The impact on the business would again be loss of productivity.  In the optimistic case only a few people, say 10, would be personally affected, assuming public health resources are effectively mobilized and people cooperate to prevent the spread.  In the pessimistic case 30% of the staff may not be able to work for several weeks, say 30 days.  I’ll assume unaffected people can work from home if necessary with no productivity impact.  Multiplying it out I get an impact range of roughly $180K (10 people x $150K x 30/250) to $1.6M (30% x 300 x $150K x 30/250).

We’ve done all the spadework, so now we can put the results together.



To compute annual loss expectancy I’ve simplistically multiplied the lows by the lows and the highs by the highs.  This could be overly pessimistic in the case of the highs because it assumes the highest frequency occurs together with the highest loss, which is probably not the case.  In fact, more-frequent losses tend to be the lower-magnitude ones.  We could improve on this with a Monte Carlo simulation but for a first cut the table is good enough.
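Here is a sketch of that table calculation, using the ranges estimated above.  The ALE figures it prints may differ slightly from the original table if the underlying estimates were rounded differently.

```python
# (low, high) frequency per year and (low, high) impact in dollars, taken
# from the estimates in the paragraphs above.
threats = {
    "Blizzard":          ((1 / 100, 1 / 30), (30_000, 300_000)),
    "Earthquake":        ((1 / 100, 1 / 10), (200_000, 4_000_000)),
    "Aviation accident": ((1e-7, 1e-5),      (200_000, 4_000_000)),
    "Pandemic":          ((1 / 30, 1 / 3),   (180_000, 1_600_000)),
}

for name, ((f_lo, f_hi), (m_lo, m_hi)) in threats.items():
    ale_lo, ale_hi = f_lo * m_lo, f_hi * m_hi   # lows x lows, highs x highs
    print(f"{name:18s} ALE ${ale_lo:>9,.0f} to ${ale_hi:>9,.0f}")
```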

Please note that the calculation of annual loss expectancy is an honest multiplication.  The method avoids the fake math of “multiplying” a “low” frequency by a “high” impact to get a “medium” loss expectancy, and the like. 


Notice also that the annual loss expectancies fall naturally into two categories, the ones that seem safe to ignore and the ones we need to pay attention to. Also the threats in the two categories do seem to accord with intuition.  

Benefits.  This analysis has done several things for us:  
  1. it focuses the BC planning where it really ought to go
  2. it shows where we may need to take a second cut
  3. it provides reasonable justification for what we decide to ignore
  4. it refines our intuition (and can alert us to blind spots), and 
  5. it makes efficient use of our time.  

Not a bad deal.  

Wednesday, April 1, 2015

What a Risk Decision Actually Is

In this note I’ll dissect and expose exactly what is meant by making a decision among risky alternatives, and what we should expect the management of an organization to be able to do in making these decisions.

In a previous note I proposed the following definition:
Risk Decision.  A decision by the leadership of an organization to accept an option having a given risk function in preference to another, or in preference to taking no action.  I assume that competent leadership of any organization worth its pay can make such a decision, at the appropriate level of seniority.
The term is shorthand for a decision between alternatives, at least one of which has a probability of loss. (Usually in cyber risk we are concerned with losses, but all the ideas extend naturally to upside or opportunity risk.  Few people and fewer organizations take on risk without some expectation of advantage, if only cost avoidance.)
The definition depends on the idea of a risk function (AKA “the risk” of something) as:
The probability distribution of loss magnitudes for some stated period of time, such as one year.  This is what I think most people really mean when they speak of the “risk” of something.
I like to think of the risk function in terms of its loss exceedance curve, the probability distribution that a particular loss magnitude will be exceeded, for the given time frame, as a function of the loss magnitude.  The nearby graphic illustrates two possible loss exceedance curves for a “before” and “after” assessment of an investment which is supposed to reduce risk.  


These curves are the final quantitative result of a risk analysis of a particular scenario.  The decision problem is whether to invest in the control or not.  (It may be a web application firewall, for instance.)  The analysis says, for instance, that investing in the control will reduce the chance of annual loss greater than $40K from 95% to 20%.  Sounds pretty good!
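To illustrate how such an exceedance figure can be read off a simulation, here is a minimal sketch.  The lognormal loss samples are synthetic stand-ins whose parameters I chose so the sketch roughly reproduces the 95% and 20% figures quoted above; they are not the actual analysis behind the chart.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic annual-loss samples standing in for the "before" and "after"
# analyses; a real FAIR analysis would produce these, not a lognormal guess.
before = rng.lognormal(mean=11.6, sigma=0.6, size=10_000)
after = rng.lognormal(mean=10.1, sigma=0.6, size=10_000)

threshold = 40_000
for label, losses in (("before the control", before), ("after the control", after)):
    p_exceed = (losses > threshold).mean()   # loss exceedance at $40K
    print(f"P(annual loss > ${threshold:,}) {label}: {p_exceed:.0%}")
```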
Of course there is more to it.  Management needs to know how much the control will cost.  Costing out a control, including recurring and non-recurring costs, cost of capital, staff support, all in, is a well-established discipline compared to risk analysis, so let’s assume it has been done.  Suppose the price tag is $20K.  Management has to decide if the reduction in risk is worth the cost.
There has been much agonizing in the literature about how a rational actor can consistently choose among risk functions.  The most prominent approach is von Neumann-Morgenstern utility.  Its main result is that, given any risk function, a rational actor can assign a number with his personal utility function such that more-preferred risk functions always have higher numbers than less-preferred ones.  It’s a nifty but impractical result for several reasons.  For one thing, it turns out to be hard to estimate a person’s utility function.  And if it’s hard for the average person, you will not get many CEOs to sit still for the exercise.  For another, risk decisions, especially big ones, are often made jointly by multiple stakeholders, like the CIO, CFO and CEO, for good reasons.  Getting a utility function for a committee is even harder.  Finally, senior managers have an understandable need to “do a gut check” and personally engage with big decisions.  They are not going to delegate the decision to a formula, nor should they. 
So I assume that, given two risk functions, leadership can and will know which they prefer.  Making risk decisions is what they are paid to do.  This is the reason for my definition of a “risk decision.”
The definition has some immediate implications.  The first is that through a series of pair-wise comparisons, leadership can set any set of risk functions in order from most-preferred to least-preferred.  On one end, the reaction is, “This is great!  Where do I sign?”  At the other it’s “Over my dead body.”  In between there is a zone of indifference where management thinks “I don’t really care one way or the other.” 
Next, having in principle ranked a bunch of risk functions, management will say that there are some I just would not choose if I had the option not to.  So there is a notion of “this far and no further” in the pursuit of our goals.  This is the basis of the definition of:
Risk Appetite.  The worst (least-preferred) set of probability distributions of loss magnitudes that the management of an organization is willing to voluntarily accept in the pursuit of its objectives.
In other words, in our ranking scheme, these are the ones just a little better than unacceptable, if we have a choice.
But what if management doesn’t have a choice?  Threats can be discovered that we would not actively accept in the furtherance of our objectives.  Some we can live with even if we prefer not to.  The worst (least-preferred) risk functions that we are willing to tolerate if imposed upon us lead to:
Risk Tolerance.  The set of least-preferred probability distributions of loss magnitudes that the management of an organization is willing to accept when presented with them involuntarily. 
Risk Tolerance is by definition greater (it includes more probability distributions of losses) than Risk Appetite.  The key is involuntariness.
So we have three sets of risk functions:  those we are willing to choose in pursuing our objectives, those we are willing to accept but not opt for, and those we cannot abide.  And within those sets there may well be ones that we have about the same preferences for even if their risk functions differ. 
What if a loss exposure (aka risk function for a scenario) is discovered that is worse than our risk tolerance?  Well then it is by definition intolerable and we have to do something to mitigate or avoid it. A threat of this nature is almost by definition an existential threat to the organization – it threatens the ability of the organization to achieve its goals or perhaps even survive.  But that’s another topic:  business continuity planning.

A Plea for a New Taxonomy for Cyber Risk

Despite much ink spilled on the subject, the vocabulary of cyber risk continues to be muddled.  Just consider that the word “risk” itself has multiple meanings, indiscriminately applied.  An absurd construction such as “that risk is a high risk” is perfectly possible in today’s vocabulary.  This note is the first of a series of contributions to solve this problem.

The root of the problem is that people try too hard to reconcile specialist meanings with ordinary language.  For literature and poetry, the multiple meanings of English words and the ambiguities of syntax are often a useful and sometimes a wonderful thing.  But for practical affairs of science, engineering, business and law, it’s a breeding ground for problems. 

Other professions have solved the problem in various ways.  Mathematics precisely defines ordinary words like “group” and “function” to have special meanings.  The relatively closed nature of the profession prevents misunderstandings among outsiders.  This method would not work well in cyber risk, as the word “risk” itself shows, because the specialists have to communicate with non-specialists all the time.  We can’t appropriate ordinary words to mean something special only when we talk amongst ourselves.

A variant of the ordinary-word method is to put common words to new meanings, as “motor” and “inductor” were in the nineteenth century, and then rely on the obviously new context to prevent misinterpretation.

Another way out is to create new words that ordinary people won’t use.  Biology and medicine are famous for this.  If you mean a specific kind of mosquito or muscle, it’s anopheles or biceps brachii.  When you want to make sure that outsiders are kept outside and suitably intimidated, a dead language is perfect!  But that’s the trouble:  arcane words are a barrier to communication, and that’s the last thing we need in cyber risk.

We can create new words out of whole cloth, instead of stealing from Aristotle and Virgil.  “Cybernetics” and “cryogenics” are examples that do not prevent communication with lay persons.  Technology is a rich source of neologisms, as witness “transistor,” “diode,” and “automobile.”

The last way out of the swamp of confusion, one that I find very attractive, is the noun phrase.  Here you put together a few ordinary words in an improbable juxtaposition, such as “integrated circuit,” “tensile strength,” or “coefficient of thermal expansion.”  This seems to be the best solution.  The reader needn’t have studied Latin or Greek, she can easily see that something special is meant, and even the non-specialist can get a sense of what the special meaning is.

To get this movement kicked off for cyber risk, I’ll propose some of my own definitions.  I build on the excellent foundation of the FAIR taxonomy (Factor Analysis for Information Risk), which you can find on The Open Group website.

First let’s agree to use “risk” by itself only as a lay term, and otherwise regard it as a four-letter word not to be used in polite conversation.  And when we use it in a lay context, let “a risk” mean “a loss event scenario,” as advised by Freund & Jones (“Measuring and Managing Information Risk,” p. 353).  Notice the “a”.

Here are a few related terms and my proposed definitions.

Risk Function.  The probability distribution of loss magnitudes for some stated period of time, such as one year.  This is what I think most people really mean when they speak of the “risk” of something.

Loss Exceedance.  The probability distribution that a particular loss magnitude will be exceeded, for the given time frame, as a function of the loss magnitude.  It is the “tail distribution” of the risk function.  This is a standard term in the insurance industry (from which we can learn much).  The loss exceedance function has some nice properties which give it intuitive appeal.

Risk Decision.  A decision by the leadership of an organization to accept an option having a given risk function in preference to another, or in preference to taking no action.  I assume that competent leadership of any organization worth its pay can make such a decision, at the appropriate level of seniority.

Risk Appetite.  The worst (least-preferred) set of probability distributions of loss magnitudes that the management of an organization is willing to voluntarily accept in the pursuit of its objectives.  The key idea here is voluntariness.

And finally, to settle the age-old dispute about the difference between risk appetite and risk tolerance:

Risk Tolerance.  The set of least-preferred probability distributions of loss magnitudes that the management of an organization is willing to accept when presented with them involuntarily.  Risk Tolerance is by definition greater (it includes more probability distributions of losses) than Risk Appetite.  The key proviso here is involuntariness.

I’ll have more to offer later about notions like attack, attack surface, attack vector, exploit, flaw, and vulnerability.

Order of Magnitude Thinking


We often run into the problem of estimating a number about which we seemingly have no idea.  For example, how many severe defects probably remain undiscovered in software that is now being submitted for deployment to production?  The answers I have gotten to this question have been (a) “none, because QA would already have found them and we fixed them,” (b) “we cannot know until we deploy to production and wait 30 days for new bug reports,” and (c) “I have no earthly idea.”

Surely we can do better than this!

We can put some reasonable and often useful bounds on estimates of highly uncertain numbers using order-of-magnitude thinking.  I have used this technique on my own estimates and in querying colleagues for many years and usually found the results to be illuminating and useful.  You only need at least some familiarity with the subject.  (There are problems, of course, about which we may have no earthly idea, such as the number of neutrinos passing through our body in one second.  It’s huge!) 

Let’s consider the number of pages in some random edition of the Christian Bible.  Which edition hardly matters, as we shall see.

The first step is to take an impossibly low number and an impossibly high number to bracket the range.  It’s easy if we limit ourselves to powers of 10.  Each power of 10 is one order of magnitude.

Could the Bible be as short as one page?  Certainly not.  We know it is a rather hefty book.  Ten pages? No.  A hundred?  Again, no.  A thousand?  Well, I would be unwilling to bet that it’s more than a thousand pages.

How about the other end?  Could it be a million pages?  No.  A hundred thousand?  No.  Ten thousand?  No – that would be at least 4 or 5 very big books, and we know it’s only one book.  A thousand?  Again I am unwilling to bet it’s less than a thousand.

So we already know that some edition of the Bible is almost certainly between 100 and 10,000 pages.  We may now feel, based on this preliminary ranging and other experience, that the answer is in the neighborhood of 1000 pages.  So can we narrow the range a bit more?

Instead of limiting ourselves to powers of 10, let’s tighten it up a bit to a half a power of 10, that is, the square root of 10.  We’ll use 3 for convenience. 

Now, do we think our Bible is more than 300 pages?  Yes.  We are pretty confident it is several hundred pages.  Is it less than 3000?  Again, yes.  It’s long but not that long.  So we have succeeded in tightening the range from two orders of magnitude, 100 to 10,000, to one order of magnitude.  Progress!

We could stop here, feeling that we now have enough information for whatever the purpose is.  (You need to know how much precision you need.  This exercise can help you think about that.) 

Or we could try to narrow it further.  Having moved from powers of 10 to powers of the square root of 10 (roughly 3), we could next try powers of 2 – “binary orders of magnitude” – and so potentially narrow the range to 500 to 2000 pages.  Of course you can go as far as you like with this procedure, until you are uncomfortable with narrowing the range any further. 
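Purely as an illustration, here is a tiny sketch that prints the successive brackets around a working guess of 1,000 pages.  (The 300-to-3,000 bracket above comes from rounding the square root of 10 down to 3.)

```python
from math import sqrt

guess = 1_000   # working estimate of the number of pages

# Successive refinement factors: a full order of magnitude, half an order
# of magnitude (the square root of 10, about 3.16), and a binary order.
for label, factor in (("power of 10", 10.0),
                      ("square root of 10", sqrt(10)),
                      ("power of 2", 2.0)):
    low, high = guess / factor, guess * factor
    print(f"{label:18s} bracket: {low:,.0f} to {high:,.0f} pages")
```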

This procedure is quick and often yields useful insights to probable magnitude, and to the extent of our uncertainty.  It is surprising how often the result is “good enough.”  And it may quickly guide us to which among several highly uncertain numbers it is worth the effort to research more carefully.  As Doug Hubbard says, you know more than you think you do, what you do know is often good enough, and it is usually only one or two numbers among many that are worth buying more precision about.
Post script: This method is inspired by the scale knobs on many kinds of electronic test equipment, which often have to accommodate huge ranges.  A voltmeter may need to measure from millivolts or microvolts to 1000 volts – 6 to 9 orders of magnitude.  They have range settings using the 1-2-5-10 scheme, for 1, 2, 5 and 10 millivolts of sensitivity, and so on up the scale.  A useful way of thinking!