Archives of Personal Papers ex libris Ludwig Benner, Jr.


This is a paper presented at a System Safety Society conference by Mr. Rimson.


Mishap Investigations: Tools for Evaluating the Quality of
System Safety Program Performance

Ira J. Rimson, P.E. (SM1768)
the validata corporation
8223 Mosquero Ave. N.E.
Albuquerque, NM 87111
Ludwig Benner, Jr., P.E. (FM0359)
Ludwig Benner & Associates
12101 Toreador Lane
Oakton, VA 22124

Abstract

System safety practitioners use predictive analyses to identify and rank risks posed by new technical initiatives. While systems are being developed, system safety techniques use known and assumed data to predict destabilizing scenarios, enabling managers to select actions which help optimize specified project goals.

After systems start operating, anomalies, deviations, disruptions, failures, accidents and catastrophes begin to occur. These real world mishaps offer opportunities for system safety practitioners to validate the quality of their predictive performance.

System safety practitioners have not taken advantage of these opportunities to evaluate the performance of their chosen methodologies. Lacking evaluation data, the profession has neglected its responsibility to identify and initiate improvements to methodologies and techniques. The resulting quality assurance void misdirects the development and implementation of optimal preventive actions by creating an illusion of understanding.

The authors assert a pressing need for introducing robust quality evaluations of the system safety risk prediction process as precursors to improving its scientific validity.

Characteristics of High Risk and Complex Systems

Most high-risk systems have some special characteristics, beyond their toxic or explosive or genetic dangers, that make accidents in them inevitable, even "normal". This has to do with the way failures can interact and the way the system is tied together. It is possible to analyze these special characteristics and in doing so gain a better position to argue that certain technologies should be abandoned, and others, which we cannot abandon because we have built much of our society around them, should be modified. Risk will never be eliminated from high-risk systems, and we will never eliminate more than a few systems at best. At the very least, however, we might stop blaming the wrong people and the wrong factors, and stop trying to fix the systems in ways that only make them riskier.

(Perrow, @4)


[The Challenger] case extends Perrow's notion of system to include aspects of both environment and organization that affect the risk assessment process. . . [I]nterpretation of the signals is subject to errors shaped by a still wider system that includes history, competition, scarcity, bureaucratic procedures, power, rules and norms, hierarchy, culture, and patterns of information.

(Vaughan, @415)

The System Safety Process has been described as encompassing the following steps applied over a system's life-cycle:
  1. Understanding the system
  2. Identifying the potential hazards associated with the system
  3. Developing the means for controlling identified hazards adequately
  4. Implementing the hazard controls
  5. Verifying the effectiveness of the implemented hazard controls
  6. Repeating the process at varying levels of detail.

(SSA Handbook @1-2)

Neither Perrow's "normal accidents" nor the environmental/organizational factors which Vaughan identifies in the Challenger accident process are identifiable by traditional predictive analyses. Neither can be perceived within current system safety paradigms before a mishap [1]. Once risk acceptance decisions are implemented and the system has been launched into the real world, the effectiveness of system safety's pro forma efforts is measured by the occurrence and severity of the mishaps which were not predicted [2].

To assess those measures accurately, we need to know what happened. If we want the ability to evaluate the effectiveness of system safety's prevention efforts, we must change from traditional methodologies to those which can identify both the precursors and the results of unanticipated "normal accidents," and provide replicable data on which to design reality-based responses to operational hazards.

Few system safety practitioners or their managers understand that Step 5 in the process cited above — verifying the effectiveness of implemented hazard controls — can also provide a functional audit of the quality of the preceding analytical process. [3] That is why robust investigation is essential to system safety program success.

In the rush to "do system safety", it is too often the case that inadequate regard is given to the important business of selecting, with rational care, the particular analytical method to be used — whether that method might be a type or a technique. Methods are selected on the basis of their current popularity, their fancied potency, or the affection developed for them by the individual practitioner, rather than on the basis of their worth at dealing meaningfully with the real technical issues at hand.

(P. L. Clemens, in System Safety Analysis Handbook @xiv)

Attempts to evaluate operational mishaps by predictive analyses [4] fail because they do not provide the specific data needed to analyze the mishap process: Who did What, When, and Why. The explanatory potential of predictive analyses is limited by what is known; their results do not account for the kind of unpredictable interactions among elaborate structures and intricate environmental influences which characterize complex systems and modern technology. If we are to discover what happened, probabilistic cataloging of random interactions among events must yield to deterministic discrimination of specific happenings. [5]
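
To make the contrast concrete, the sketch below shows one minimal way a deterministic mishap description could be recorded as "Who did What, When" event blocks, each tied to the data that supports it. The Python class, field names and sample events are illustrative assumptions only, not the notation of any published method:

    # Illustrative sketch only: the class, field names and sample events are
    # assumptions for this illustration, not a published system safety data standard.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EventBlock:
        """One deterministic building block of a mishap description."""
        actor: str    # Who: the person or object that acted
        action: str   # What: the specific behavior observed or documented
        time: float   # When: timestamp or sequence position
        basis: str    # the source data supporting this event

    # A mishap description is an ordered set of such blocks, each traceable to data,
    # rather than a probability assigned to a class of failures.
    mishap = [
        EventBlock("relief valve", "stuck open", 120.0, "post-event teardown"),
        EventBlock("operator A", "closed block valve", 145.0, "control-room log"),
    ]
    for e in sorted(mishap, key=lambda e: e.time):
        print(f"t={e.time:>6}: {e.actor} {e.action}  [{e.basis}]")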

Probabilism vs. Determinism

Why are traditional predictive analyses inappropriate for evaluating real-world events? Simply put, because predictive analyses are designed to estimate probabilities for as many alternate scenarios as imagination can generate. A Fault Tree should contain every possible failure to which its chosen "Top Event" is susceptible. [6] Some of those scenarios will happen, some can happen but probably won't, and some can't happen at all. Predictive analyses aren't fussy, because their purpose is to support management decisions for mitigating all risks to an acceptable level. [7]

Likewise, "Top Events" in Fault Trees aren't particular about what specific failure modes initiate their demise. Like animal-rights activists who argue that "a pig is a dog is a boy", in the initiation of a generator-induced electrical failure, a driveshaft is a seal is a bearing. Predictive analyses tell us what might happen; they don't explain what did happen. Once a mishap occurs, the probability of that mishap's process and outcome's occurring is 1.0; the probability of any other confluence of events producing that outcome is zero. [8] Statistically derived probabilities aggregated from prior data become meaningless once something happens. Furthermore, each false lead (P=0) of the probabilistic methodology must be explored to determine that it did not occur, leading to investigation by elimination, data restriction, and excessive expenditures of time and money. [9] (See Presidential Commission Report and Vaughan @53)


Proofs of the Pudding

From Rogers (1971) and Hammer (1972), through Johnson (1980), Roland & Moriarty (1983) to Stephenson (1991), system safety practitioners have been unwilling to acknowledge the existence of methodological failures.

And there have been failures aplenty, most significantly within professional domains which were supposedly the most accepting of system safety practices:


NASA: Challenger; Hubble telescope
US Navy: USS Iowa; Aegis system; F-14 engines; F/A-18
Commercial Aviation: Fly-by-Wire; Glass Cockpits; ATR-42/-72 icing
Railroads: NJ & Maryland commuter rail; DC Metro
US Army: Patriot missile system; UH-60 control system
USAF: AWACS/UH-60 shootdown; C-5A pressure doors

Few, if any, substantive changes have been initiated in system safety technologies since their inception. [10] Each of the disasters cited, and many others, represents a lost opportunity to discover flaws in the underlying system safety methodology and to apply the lessons learned to improving future identification of risks for elimination and control. In all these cases, failures have been viewed as component failures, not as failures of the methodology.

Has anyone ever asked, "Why didn't our System Safety Program capture this potential failure?" If so, the response failed to inspire any changes to the "conventional wisdom".

Data Utilization

Potential pitfalls for practitioners applying predictive analyses lie in attempting to apply general data to specific system applications, especially if the demand exhausts the available data, and technical experts resort to intuition. [11] Input data accepted uncritically can result in an illusion of understanding. (Fischhoff & Merz @177) Uncritical acceptance of faulty data was instrumental in the fateful Challenger launch decision:

Even Safety, Reliability & Quality Assurance, the one regulatory unit with personnel who worked closely with the NASA engineers on a daily basis, did not challenge the work group's definition of the situation. Their job was to review, not to produce, data and conduct tests as an independent check. Dependent on the work group for information and its interpretation, they became enculturated. They reviewed the engineering analysis and agreed.
(Vaughan @392)

Rigorous investigation provides data from which to construct a model of the operating system mishap process, which may be compared with a model of the planned system operational process. [12] These comparisons depict specific points at which reality diverges from prediction. [13] Multi-layered overlays are effective tracking tools for evaluating system evolution; e.g., (1) original planned configuration, (2) "changed-to" configuration, and (3) actual operational configuration.
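
One minimal way to picture such a comparison is sketched below: the overlay layers are represented as ordered actor-action sequences, and each later layer is checked step by step against the original plan to flag the points where reality diverges from prediction. The example sequences and the exact-match comparison rule are illustrative assumptions only:

    # Sketch only: the three layers, their contents, and the exact-match comparison
    # rule are simplifying assumptions, not a prescribed overlay format.
    planned    = [("pump A", "start"), ("valve 3", "open"),  ("operator", "verify flow")]
    changed_to = [("pump A", "start"), ("valve 3", "open"),  ("operator", "log flow")]
    actual     = [("pump A", "start"), ("valve 3", "stuck"), ("operator", "log flow")]

    layers = {"changed-to": changed_to, "actual": actual}

    # Compare each later layer, step by step, against the original plan and flag
    # the points at which reality diverges from prediction.
    for name, layer in layers.items():
        for step, (plan, real) in enumerate(zip(planned, layer), start=1):
            if plan != real:
                print(f"{name}: step {step} diverges; planned {plan}, observed {real}")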

The Challenge

System safety's current "win-loss" ratio is uninspiring when measured by the outcomes of its efforts. Vaughan's deconstruction of the Challenger accident and its investigation reports demonstrates that traditional investigative techniques, even when blessed by Presidential and Congressional sponsors, are ineffective for identifying methodological inadequacies.

We propose the following fourfold challenge to the system safety profession, with the objective of improving the validity and utility of the benefits it offers to science, technology and society:

  1. Change system safety perspective from its current commodity orientation to a contextual orientation; i.e., broaden the focus on things to include the environmental, organizational, social and societal circumstances within which they operate. Focus on the decision process rather than on the decision makers and their decisions. [14]
  2. Expand the system safety investigative methodology beyond mere generation of prevention data; add goals for assessing the methodological insufficiencies which failed to identify and/or prevent the current mishap;
  3. Establish new objectives which incorporate system safety process assessments into Continuous Quality Improvement programs to enhance the validity and utility of system safety prevention initiatives; and

  4. Recognize the limitations of predictive analyses as paradigms for mishap investigation; as investigative techniques they are:
    1. Inefficient - requiring conclusion by elimination;
    2. Ineffective - dealing only with known factors and lacking the capability to identify unknowns;
    3. Subjective - limited by investigators' judgment; and
    4. Misdirected - focused on individual failure points rather than on the context within which the object system operates, and how that context influences its behavior and operation.

The system safety profession has incomparable benefits to bring to its clients in particular, and to society and civilization in general. It must first acknowledge its inadequacies, then take action to correct them. To paraphrase the words of Yogi Berra:

If we keep doing what we did, we're gonna keep getting what we got!




									
References
  1. Fischhoff, Baruch and J. F. Merz. "The Inconvenient Public: Behavioral Research Approaches to Reducing Product Liability Risks". Product Liability and Innovation. Washington: National Academy Press, 1994.
  2. _____. Guidelines for Investigating Chemical Process Accidents. New York: Center for Chemical Process Safety of the American Institute of Chemical Engineers, 1993.
  3. Hammer, Willie. Handbook of System and Product Safety. Englewood Cliffs, NJ: Prentice-Hall, 1972.
  4. Johnson, William G. MORT Safety Assurance Systems. New York: Marcel Dekker, 1980.
  5. Kitfield, James. "Crisis of Conscience". Government Executive Magazine, October 1995, pp. 14-24.
  6. Perrow, Charles. Normal Accidents. New York: Basic Books, 1984.
  7. _____. Report to the President by the Presidential Commission on the Space Shuttle Challenger Accident, Volume 1 of 5. Washington: U.S. Government Printing Office, 1986.
  8. Reason, James. Human Error. Cambridge: Cambridge University Press, 1990.
  9. Rogers, William P. Introduction to System Safety Engineering. New York: John Wiley, 1971.
  10. Roland, H. E. and B. Moriarty. System Safety Engineering and Management. New York: John Wiley, 1983.
  11. Stephenson, Joe. System Safety 2000. New York: Van Nostrand Reinhold, 1991.
  12. System Safety Analysis Handbook. System Safety Society, 1993.
  13. Thompson, Mark. "Way, Way Off in the Wild Blue Yonder". Time Magazine, 145:22; May 29, 1995, pp. 32-33.
  14. Vaughan, Diane. The Challenger Launch Decision. Chicago: University of Chicago Press, 1996.


[1] None of the contents of the current System Safety Analysis Handbook address these issues directly by name.
[2] Interestingly, measurement of the aggregate "success" of these efforts is not addressed directly by the contents of the Handbook; the monitoring of the effectiveness of prevention actions is addressed on p. 3-247 (Purpose).

[3] The authors know of only one paper in the SSS literature dealing with this analysis issue (HP26:1, "Safety Training's Achilles' Heel", p. 6). It would be helpful to members for someone with examples from SSS audits to share them.

[4] Primarily logic tree-based analysis methods which support PRAs.

[5] To describe and explain what happened, causal relationships among actions and influences must be established deterministically, rather than probabilistically.

[6] If you overlook any options, you upset the rest of the relative probabilities.

[7] Management acts on probabilistic ranking of estimated relative risks.

[8] The fact that an investigator does not understand what happened does not change the probabilities.

[9] Interestingly, the SS Analysis Handbook is relatively silent on the issue of the efficiency of any analysis or investigation methods.

[10] The dominance of logic tree-based approaches is apparent in most of the literature cited, and the references in the SS Analysis Handbook.

[11] SS practitioners take descriptions of the systems they analyze in whatever form they find available. Only one analysis method described in the SS Analysis Handbook holds itself out to define the system in a way that permits a systematic system safety analysis (@p 3-247 Purpose [STEP]). Another (@p 3-79 [ETBA]) calls for defining system operation in terms that permit the analyst to trace energy flows in the system. Others reference the system in terms of task analysis formats (@p 3-33 Method [CPQRA]) or "identify all major components, functions and processes" (@p 3-111 Method [FMEA]) without specifying how these identified elements are to be presented so the analyst can work through the process interactions progressively and systematically to find potential adverse influences and anomalies.

[12] A model describing interactions in the form of who did what is applicable to both predictive and retrospective analyses.

[13] To be compared, specific points must be identified as specific actions by specific people or objects in both the predictive and retrospective analyses.

[14] The decision process involves interactions affected by environmental and organizational influences.