Monday, April 13, 2015

Business Continuity Examples of Order-of-Magnitude Thinking

In a previous note I showed how to use order-of-magnitude thinking to quickly narrow down a highly uncertain number to a workable range.  I used the rather artificial example of the number of pages in the Christian Bible (equally applicable to Gone with the Wind or Harry Potter Meets Dracula).  Here I show a real-life example from business continuity planning.

The Challenge. How in the world can a conscientious business continuity analyst possibly come up with the dozens of estimates needed for a competent total risk assessment (TRA), which is just the first step in a business continuity plan? This note shows with concrete examples how order-of-magnitude thinking and interval estimates can make fast work of this task, and still get a result that is both sensible and defensible.

Taking inventory of the possible threats to business continuity is one of the first steps in making a business continuity plan (BCP). (I use the term “threat” to align with the FAIR taxonomy on risk, although “hazard” would suit too.)  Often this starts with somebody’s long list of threats.  These lists are commonly of the one-size-fits-all sort, compiled without regard to any particular circumstances, and so comprise a vast variety of threats, many of which do not apply.  The analyst is then charged with assessing the probability, or probable frequency of occurrence, of each threat and the probable loss to the business if it were to materialize.  She may find herself on the defensive, having to explain why the risk of a typhoon can be ignored.  Finally the analyst must somehow combine the probability with the magnitude of loss to come up with a loss expectancy estimate for each of these several dozen threats.  And that’s just table stakes for a BCP.

I’ll demonstrate the method with four representative threats for a hypothetical software development business located in the San Francisco Bay Area: 
  • Blizzard
  • Earthquake
  • Aviation accident, and
  • Pandemic.

They represent the general categories of meteorological, geological, technological, and medical threats.  I’ll give my personal (therefore subjective) estimates for probable frequency of occurrence and also subjective estimates for the dollar impact on this hypothetical business if each threat were to occur.  In all cases I’ll give a rough range from low to high.  Finally I’ll use half-orders of magnitude, that is, numbers like 1, 3, 10, 30, 100, etc., because I believe this is close enough for a first cut.  The second cut comes, well, second.
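To make the half-order-of-magnitude scale concrete, here is a minimal sketch that snaps any positive number to the nearest value in the series 1, 3, 10, 30, 100.  The function name `half_order` is my own, and the sketch assumes that 3 stands in for the square root of 10 (about 3.16), so the rounding is done on a logarithmic scale.

```python
import math

def half_order(x: float) -> float:
    """Snap a positive number to the nearest half-order of magnitude (1, 3, 10, 30, ...).

    Assumption: 3 is treated as shorthand for sqrt(10), so rounding
    happens in log space, in steps of half a decade.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    # Number of half-decades, rounded to the nearest whole step.
    steps = round(2 * math.log10(x))
    decade, half = divmod(steps, 2)
    return (3 if half else 1) * 10.0 ** decade

for value in (7, 45, 250, 2_000):
    print(value, "->", half_order(value))
```

On a log scale the break points sit at about 1.8 and 5.6 within each decade, so 7 snaps up to 10 while 45 snaps down to 30.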

Blizzard.  Snow is very unlikely in the Bay Area except at the highest elevations, but I realize that a snow event big enough to impact the business could occur, so I’ll estimate the frequency to be between once in 30 years and once in 100 years.  (In fact, a short search of historical records shows that snow has accumulated on the streets of San Francisco in historical times, so a frequency of once a century is reasonable.)  If such an event were to occur, I feel it is highly likely it would not last more than a day.  Since this business is all knowledge work, the business impact would mostly be loss of people productivity.  Suppose this business has 300 people and the average total compensation is $150K / year.  I also assume that the value lost is reasonably approximated by the replacement cost of the work.  One day of lost productivity out of 250 working days per year is roughly $200K ($150K x 300 / 250).  (If your software engineers work 80 hours a week, scale accordingly.  Your mileage may vary.)  Even in this event most people would probably work at home, which they often do anyway, so the loss may be more like half a day, or $100K.  With these numbers in mind I estimate the conditional impact between $30K and $300K.
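The productivity arithmetic above can be written out as a short sketch.  The headcount, compensation, and working-day figures come from the text; the function and constant names are hypothetical, and the replacement-cost assumption is the same one stated above.

```python
HEADCOUNT = 300          # people in the hypothetical business
AVG_COMP = 150_000       # average total compensation, dollars per year
WORKING_DAYS = 250       # working days per year

def productivity_loss(people: float, days: float) -> float:
    """Dollar value of lost work, valued at replacement cost (an assumption)."""
    return people * AVG_COMP * days / WORKING_DAYS

full_day = productivity_loss(HEADCOUNT, 1)    # everyone idle for one day
half_day = productivity_loss(HEADCOUNT, 0.5)  # most people work from home

print(f"full day: ${full_day:,.0f}")
print(f"half day: ${half_day:,.0f}")
```

The exact results are $180,000 and $90,000, which the text rounds to $200K and $100K.  At this level of precision the rounding makes no difference.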

Earthquake.  This is earthquake country, no doubt about it.  As a casual reader of the local papers I am aware of geologists’ estimates that the next Big One will likely occur within 30 years, so I’ll put the probable frequency in the range of once in 10 years to once in 100 years.  Notice that I am giving wide latitude, half an order of magnitude either way, to the consensus number, in recognition of the uncertainty.  But if the Big One were to occur, the business would effectively be shut down for some time.  The question is, how long?  After the Loma Prieta quake in 1989, most people took one to a few days to get back on their feet.  That’s the low end.  The high end may be 10 to 20 days, so again using half-order-of-magnitude thinking I’ll estimate an impact of 1 to 30 days, or $200K to $4M.  This may seem like a uselessly wide range, but stay tuned.

(Notice that I am ignoring a lot of detail here at the high end.  What about loss of revenue and penalties for missed delivery dates?  What if the firm is driven into bankruptcy?  We’ll get to that later.)

Aviation Accident.  There are several airports in the area, both large commercial and small general aviation.  An accident in the flight path could plausibly affect almost any building in the Bay Area.  If this were to happen I judge the impact to be comparable to an earthquake: damage could range from minimal to catastrophic.  However, I can think of only a few cases in the United States in the past two or so decades of an aviation accident impacting people on the ground, aside from terrorism (which is a different threat).  If there have been, say, 10 such cases in 10 years, spread over what must be more than a million buildings, the probable frequency for any one building is something like one in one million per year.  I could easily be off by an order of magnitude either way, so I’ll put the frequency at 1 in 100,000 to 1 in 10 million per year.
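The frequency estimate for the aviation accident is a simple rate calculation, sketched here.  The case count, time span, and building count are the rough guesses from the text, not researched figures, and the variable names are my own.

```python
cases = 10               # guessed ground-impact accidents (from the text)
years = 10               # over this many years
buildings = 1_000_000    # guessed number of exposed buildings

# Chance per year that any one given building is hit.
per_building_per_year = cases / years / buildings

# Allow an order of magnitude of error either way.
freq_low = per_building_per_year / 10
freq_high = per_building_per_year * 10

print(f"central estimate: 1 in {1 / per_building_per_year:,.0f} per year")
print(f"range: 1 in {1 / freq_low:,.0f} to 1 in {1 / freq_high:,.0f} per year")
```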

Pandemic has attracted much attention from BC planners in the last few years so it is worth a look.  Given the news coverage of Ebola, I am going to estimate the probable frequency between once in three years and once in 30 years.  The impact on the business would again be loss of productivity.  In the optimistic case only a few people, say 10, would be personally affected, assuming public health resources are effectively mobilized and people cooperate to prevent the spread.  In the pessimistic case 30% of the staff may not be able to work for several weeks, say 30 days.  I’ll assume unaffected people can work from home if necessary with no productivity impact.  Multiplying it out I get an impact range of roughly $180K (10 people x $150K x 30/250) to $1.6M (30% x 300 x $150K x 30/250).

We’ve done all the spadework, so now we can put the results together.
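As a sketch of how the pieces combine, the following recomputes an annual loss expectancy range for each threat from the frequency and impact ranges estimated above.  Two things are assumptions here: the aviation accident reuses the earthquake's dollar range, since the text only says the impacts are comparable, and the names `threats` and `ale_range` are my own.

```python
# (frequency low, high) in events per year; (impact low, high) in dollars.
threats = {
    "Blizzard":          ((1 / 100, 1 / 30),             (30_000, 300_000)),
    "Earthquake":        ((1 / 100, 1 / 10),             (200_000, 4_000_000)),
    # Assumption: impact range borrowed from the earthquake estimate.
    "Aviation accident": ((1 / 10_000_000, 1 / 100_000), (200_000, 4_000_000)),
    "Pandemic":          ((1 / 30, 1 / 3),               (180_000, 1_600_000)),
}

def ale_range(freq, impact):
    """Annual loss expectancy range: low x low and high x high."""
    return freq[0] * impact[0], freq[1] * impact[1]

results = {name: ale_range(freq, impact) for name, (freq, impact) in threats.items()}

print(f"{'Threat':<18} {'ALE low':>14} {'ALE high':>14}")
for name, (low, high) in results.items():
    print(f"{name:<18} {low:>14,.2f} {high:>14,.2f}")
```

Under these inputs the blizzard comes out at roughly $300 to $10K per year, the earthquake at $2K to $400K, the aviation accident at a few cents to $40, and the pandemic at $6K to about $530K.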

To compute annual loss expectancy I’ve simplistically multiplied the lows by the lows and the highs by the highs.  This could be overly pessimistic in the case of the highs because it assumes the highest frequency occurs together with the highest loss, which is probably not the case.  In fact, more-frequent losses tend to be the lower-magnitude ones.  We could improve on this with a Monte Carlo simulation but for a first cut the table is good enough.
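For the curious, a Monte Carlo version might look like the sketch below, shown for the earthquake.  It rests on assumptions the text does not make: frequency and impact are each drawn log-uniformly from their ranges, and the two draws are independent.  Independence in particular does not capture the tendency of frequent losses to be smaller, so treat this as an illustration of the mechanics only.  The function names are hypothetical.

```python
import math
import random

def log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value uniformly on a log scale between low and high (an assumption)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def simulate_ale(freq, impact, trials=100_000, seed=0):
    """Sample frequency x impact many times and summarize the results."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    samples = sorted(
        log_uniform(rng, *freq) * log_uniform(rng, *impact)
        for _ in range(trials)
    )
    return {
        "p05": samples[int(0.05 * trials)],
        "median": samples[trials // 2],
        "mean": sum(samples) / trials,
        "p95": samples[int(0.95 * trials)],
    }

# Earthquake: once in 100 to once in 10 years, $200K to $4M per event.
earthquake = simulate_ale((1 / 100, 1 / 10), (200_000, 4_000_000))
for label, value in earthquake.items():
    print(f"{label:>6}: ${value:,.0f}")
```

Every sample lies between the low-times-low and high-times-high corners, and most of them fall well inside, which is why the simple high-times-high figure tends to overstate the loss.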

Please note that the calculation of annual loss expectancy is an honest multiplication.  The method avoids the fake math of “multiplying” a “low” frequency by a “high” impact to get a “medium” loss expectancy, and the like. 

Notice also that the annual loss expectancies fall naturally into two categories, the ones that seem safe to ignore and the ones we need to pay attention to. Also the threats in the two categories do seem to accord with intuition.  

Benefits.  This analysis has done several things for us:  
  1. it focuses the BC planning where it really ought to go
  2. it shows where we may need to take a second cut
  3. it provides reasonable justification for what we decide to ignore
  4. it refines our intuition (and can alert us to blind spots), and 
  5. it makes efficient use of our time.  

Not a bad deal.  
