
Integrated Risk-Management Matrices

By Nathanael Ince, Pinnacle Advanced Reliability Technologies

An overview of the tools available to reliability professionals for making their organizations best-in-class

Since the 1960s, process facility operators have made concerted efforts to improve the overall reliability and availability of their plants. From reliability theory to practical advancements in non-destructive examination and condition-monitoring techniques, the industry has significantly evolved and left key operations personnel with more tools at their disposal than ever before. However, this deeper arsenal of tools, coupled with more stringent regulatory scrutiny and internal business pressure, introduces a heightened expectation of performance. Now, more than ever, companies recognize that best-in-class reliability programs not only save lives but increase the bottom line. These programs are also one of the foremost levers for C-level personnel to pull when trying to contend in a competitive environment.


With this in mind, a best-in-class reliability organization combines state-of-the-art theory, software and condition-monitoring techniques with a strong collaboration of departments and associated personnel. An independent risk-based inspection (RBI) program or reliability-centered maintenance (RCM) program no longer suffices as cutting-edge. Rather, the inspection department (power users of RBI) and maintenance department (power users of RCM) are integrating with process, operations, capital projects and other teams to form an overall reliability work process for the success of the plant.

To highlight reliability’s growing prominence within process facilities, this article addresses the following:

  • A brief history of reliability practices in the 20th and 21st centuries
  • Examples of current reliability program tools
  • A characterization of three different risk-mitigation applications that are currently applied in process facilities
  • The case for ensuring these risk mitigation frameworks are working together
  • The value of key performance indicators (KPIs) in providing transparency and accountability to the effectiveness of these risk mitigation frameworks


Reliability, historically

Process reliability can be defined in a variety of ways, and the concept has come a long way since the early 20th century. From the 1920s to the 1950s, the definition of reliability moved from “repeatability” (how many times the same result could be repeated) to “dependability” (hours of flight time for an engine), and then to a specific, repeatable result expected for a given duration of time.

Through the industrialization of the 1950s, reliability’s evolving definition was still very much focused on design and not as much on operations or maintenance. Then in the 1960s, the airline industry introduced the concept of reliability-centered maintenance (RCM), pushing the idea that the overall reliability of a system included not only the design, but also the operations and maintenance of that system. In other words, reliability engineering was now stretching into other departments, recognizing that the overall risk of failure was tied to multiple aspects of the asset’s lifecycle. As a result, several different departments and individuals had to cooperate to attain reliability.

The concept of RCM pushed through some industries quicker than others. While it started with the airlines, it flowed quickly into power generation, petrochemical and petroleum-refining operations thereafter.

Fast-forward to 1992, and another facet, called process-safety management (PSM), was introduced into the reliability picture. In response to a growing perception of risk related to hazardous processes, the Occupational Safety and Health Administration (OSHA) issued its Process Safety Management of Highly Hazardous Chemicals standard, 29 CFR 1910.119, which includes the following 14 required elements:

  • Process-safety information
  • Process hazard analysis
  • Operating procedures
  • Training
  • Contractors
  • Mechanical integrity
  • Hot work
  • Management of change
  • Incident investigation
  • Compliance audits
  • Trade secrets
  • Employee participation
  • Pre-startup safety review
  • Emergency planning & response

The intent of the regulation was to limit the overall risk related to dangerous processes, and to raise the bar for compliance at facilities with these “covered” processes. From that point on, fulfilling these 14 elements was a legal requirement, and ignoring them, or showing negligence toward them in the event of a release, carried the possibility of criminal liability. In other words, if those responsible in the event of a release were found to be negligent in these items, they could go to jail. The other business implication of this standard was that more individuals and departments now had a part to play in reliability and overall process safety.

In the 1950s, reliability was confined to designing equipment that could last a certain time and having a non-certified inspector make general observations. By the mid-1990s, it had become a much more complex, integrated and accelerated science.


Reliability today

With the greater expectation on today’s programs, department managers (including reliability, mechanical-integrity or maintenance managers) face a powerful, but often intimidating array of tools available to them for improving their reliability programs. Examples are listed in Table 1.


While this only represents a subset of the options available to the manager, all of these activities aim at doing the following:

  1. Reducing the risk of unplanned downtime.
  2. Limiting safety and environmental risk.
  3. Ensuring compliance with regulatory standards.
  4. Accomplishing items one through three at the lowest possible cost.

To summarize, the goal of these managers is to put in place and execute a plan that identifies and mitigates risks as efficiently as possible. To do that, one has to systematically identify those risks, as well as the level to which each must be mitigated. If this is done correctly, the design, inspections, preventative maintenance, operational strategies, and other program facets should all be aligned toward attaining all four of these aims.


Risk-mitigation approaches

Since the 1960s, substantial effort has gone into determining how best to characterize both downtime and loss-of-containment risk in a facility so that appropriate and targeted mitigation actions can be taken at the right time. Out of that effort, three risk identification and mitigation frameworks are now in common use in process facilities: process hazard analysis (PHA), risk-based inspection (RBI), and reliability-centered maintenance (RCM). Let’s briefly characterize each.

PHA. The PHA came out of OSHA’s PSM standard and is one of the 14 elements listed above. Every five years, subject matter experts come together for a couple of weeks and identify the major events that could happen at different “nodes” in a unit. The general idea is to use guidewords to systematically focus the team on the identification of process deviations that can lead to undesirable consequences, the risk ranking of those deviations, and the assignment of actions to lower either the probability of those failures or the consequence if the failures do occur. A PHA does not identify maintenance strategies or detailed corrosion mitigation or identification strategies; it focuses on safety rather than unit reliability. In the end, the major deliverable is a set of actions that have to be closed out to ensure compliance with the PSM standard. Typically, this process is owned and facilitated by the PSM manager or department.

RBI. RBI arose from an industry study in the 1990s that produced API (American Petroleum Institute) 580 and 581, which describe a systematic risk identification and mitigation framework that focuses only on loss of containment. For this reason, when an equipment item or piping segment (typically called “piping circuit”) is evaluated, the only failure that is of concern to the facility is the breach of the pressure boundary.

As an example, the only failure mode evaluated on a pump would typically be a leak in the casing or the seal. The consequence of those losses can be business, safety or environmental, and while a variety of software packages and spreadsheets can be used to accomplish the exercise, the deliverable is an RBI plan targeting the mitigation of loss-of-containment events.

In addition, a best-in-class RBI program will not just be a systematic re-evaluation of that plan every five or ten years, but an ongoing management strategy that updates the framework whenever the risk factors change. Therefore, if an equipment item’s material of construction is changed, insulation is added to an asset, or a piece of equipment is moved to a different location, a re-evaluation of the asset’s loss-of-containment risk and an associated update of the RBI plan would be appropriate. Typically, this process is owned and facilitated by the inspection or mechanical integrity manager or department.

RCM. As mentioned earlier, RCM came out of the aviation industry, where the focus was on identifying a proactive maintenance strategy that would ensure reliability and that performance goals were met. While this has been loosely codified in SAE (Society of Automotive Engineers) JA1011, there are a variety of methods and approaches, and therefore RCM isn’t as controlled as RBI.

However, much like RBI, the RCM study itself aims at identifying the different failure modes of an asset, the effects of those failure modes, and the probabilities of those failure modes occurring at any given time. Once the potential failure causes are identified, strategies are recommended that mitigate the failure mode to acceptable levels. Unlike RBI, RCM accounts for all failure modes relating to loss of function, including loss of containment (although it typically outsources this exercise to the RBI study), and the end deliverable is a set of predictive maintenance, preventative maintenance, and operator activities that lower loss-of-function risks to acceptable levels. Typically, this process is owned and facilitated by the maintenance or reliability manager or department.


How do we measure risk?

It’s not uncommon for a single facility to run PHA, RBI and RCM at once, which raises the question: which one is right? To find the answer, let’s briefly discuss risk matrices. A risk matrix is a tool that allows one to associate individual assets, failure modes or situations with specific levels of risk. There is both a probability of an asset failing and a consequence of an asset failing, and each is represented by one axis on the matrix. Risk is the product of the probability of failure and the consequence of failure, and is represented by the location of the asset on the matrix. What’s interesting is that many facilities that use multiple risk frameworks also use multiple risk matrices. This again raises the question: which one is right?
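The relationship described above, risk as the product of probability and consequence, can be sketched in a few lines of Python. This is a minimal illustration only: the 5 × 5 category boundaries, the units (failures per year, dollars per event) and the function names are assumptions made for the example, not values taken from any standard or from a particular facility.

```python
from bisect import bisect_right
from typing import Tuple

# Hypothetical category boundaries (assumptions for illustration only).
# Probability of failure (POF) in events per year; consequence of failure (COF)
# in dollars per event. Four boundaries split each axis into five categories.
POF_BOUNDS = [1e-4, 1e-3, 1e-2, 1e-1]
COF_BOUNDS = [1e4, 1e5, 1e6, 1e7]


def risk(pof_per_year: float, cof_dollars: float) -> float:
    """Risk as probability times consequence, in dollars per year."""
    return pof_per_year * cof_dollars


def matrix_cell(pof_per_year: float, cof_dollars: float) -> Tuple[int, int]:
    """Return (probability category, consequence category), each from 0 to 4."""
    pof_category = bisect_right(POF_BOUNDS, pof_per_year)
    cof_category = bisect_right(COF_BOUNDS, cof_dollars)
    return pof_category, cof_category


# Made-up numbers: a rare, severe event and a frequent, minor one.
rare_severe = (5e-5, 2e7)
frequent_minor = (0.2, 5e3)
for pof, cof in (rare_severe, frequent_minor):
    print(matrix_cell(pof, cof), risk(pof, cof))
```

With these made-up inputs, the two events land in opposite corners of the matrix yet carry roughly the same risk of about 1,000 dollars per year, which is why a single, consistent scale matters when comparing them.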

Figure 1 is a risk matrix that is much larger than the typical 4 × 4 or 5 × 5 risk matrix, but it shows each of the previously discussed risk frameworks on one larger matrix. The probability of failure is on the horizontal axis, and the consequence of failure is shown on the vertical axis.

Figure 1.  This graphical “consequence-of-failure” risk matrix shows the areas covered by process hazard analysis (PHA), risk-based inspection (RBI) and reliability centered maintenance (RCM)


As shown, the matrix reveals the following characterization of each of the three risk-mitigation frameworks:

PHA — High consequence of failure events but lower probability that they will happen (an example would be an overpressure on a column with insufficient relief-systems capacity)

RBI — Medium consequence of failure events (loss of containment) and a medium probability that they will happen (an example would be a two-inch diameter leak of a flammable fluid from a drum)

RCM — Low consequence of failure events (loss of function) but a higher probability that they will happen (an example would be a rotor failure on a pump)

While each of these frameworks generally operates in a different area of the matrix, all three are still standardized to a consistent measure of risk. The need to bring all three risk-management tools into one standard matrix is twofold:

  • Making sure the data, calculations and actions coming from one study are properly informing the other studies.
  • Ensuring that the actions produced by each framework are prioritized appropriately, as determined by their risk.

Facilities and programs commonly fail to make sure the three frameworks communicate with one another. Many facilities spend millions of dollars building out and managing these frameworks, yet the frameworks often overlap, and data gathered for one could be used for another. As an example, an inspection department representative should be present at the PHA to ensure the RBI study is aiding the PHA effort.


In addition, prioritizing risk across the frameworks is another challenge. A plant manager is not concerned with each individual risk framework so much as with a prioritized list of actions and those actions’ projected return on investment (whether it is a reduction of risk, a reduction of cost, or a reduction of compliance fines). The objective of the integrated, organization-wide risk-mitigation system should be that all possible failures are identified, assessed, properly mitigated (whether through design, maintenance, inspection, or operations) and monitored in order of priority with an expected amount of return. If a consistent risk matrix is used effectively, it can inform single-asset or system decisions and help ensure that reliability value is being driven consistently across the facility.
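One way to picture such a prioritized list is sketched below. It is only an illustration under stated assumptions: the `Action` record, its fields, the "risk reduced per dollar spent" ratio used as a stand-in for return on investment, and all of the sample numbers are hypothetical. The sketch also assumes that every framework reports risk on the same scale (here, dollars per year) and that every action has a cost greater than zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    framework: str      # "PHA", "RBI" or "RCM"
    description: str
    risk_before: float  # dollars per year, on the shared risk scale
    risk_after: float   # dollars per year once the action is complete
    cost: float         # dollars to implement (assumed to be > 0)

    @property
    def risk_reduction(self) -> float:
        return self.risk_before - self.risk_after

    @property
    def return_ratio(self) -> float:
        """Risk reduced per dollar spent (a simple stand-in for ROI)."""
        return self.risk_reduction / self.cost


def prioritize(actions: List[Action]) -> List[Action]:
    """Merge actions from all frameworks into one list, best return first."""
    return sorted(actions, key=lambda a: a.return_ratio, reverse=True)


# Made-up actions, one from each framework.
actions = [
    Action("RCM", "Add vibration monitoring to a pump", 50_000, 10_000, 20_000),
    Action("PHA", "Increase relief capacity on a column", 200_000, 20_000, 30_000),
    Action("RBI", "Inspect a drum for thinning", 90_000, 30_000, 15_000),
]
for a in prioritize(actions):
    print(a.framework, a.description, a.return_ratio)
```

Because every action is scored on the same scale, the sorted list does not care which study produced an item. A real program would likely weigh compliance obligations and safety thresholds separately rather than rely on a single ratio.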


KPIs and risk

A good set of key performance indicators (KPIs) is also needed to help identify root causes and guide programmatic decisions. Once the systematic risk-management, production-loss and enterprise-resource-planning (ERP) systems are properly set up, roll-up KPIs can be reported regularly to reveal the overall trend of the reliability program and to drive specific initiatives with targeted results (risk reduction, cost reduction or compliance satisfaction).

For example, at any point in time, the plant (or group of plants) could see the total risk of loss-of-containment events across its units and assets, the total risk of loss-of-function events across its units and assets, the total planned and unplanned downtime across the plant with associated causes, and the total cost associated with running those programs, broken out by activity, area and other helpful specifics. When one or more of those roll-up KPIs reveal concerns, sub-KPIs should be accessible to explore the root cause of those risks, downtime or costs. It’s from this KPI drill-down, empowered by synthesized risk frameworks, that targeted initiatives and actions can be driven.
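The roll-up and drill-down idea can be illustrated with a small sketch. Everything here is hypothetical: the record fields, unit and asset names, and numbers are made up, and in practice the records would come from the risk-management, production-loss and ERP systems rather than a hard-coded list.

```python
from collections import defaultdict

# Made-up records, one per asset and failure type (illustration only).
records = [
    {"unit": "Crude", "asset": "P-101", "kind": "loss_of_function",
     "risk": 40_000, "downtime_h": 12, "cost": 8_000},
    {"unit": "Crude", "asset": "V-201", "kind": "loss_of_containment",
     "risk": 150_000, "downtime_h": 0, "cost": 20_000},
    {"unit": "Reformer", "asset": "E-301", "kind": "loss_of_containment",
     "risk": 60_000, "downtime_h": 30, "cost": 15_000},
]


def roll_up(rows, metric, by=None):
    """Total a metric across all rows (by=None) or grouped by one field."""
    if by is None:
        return sum(r[metric] for r in rows)
    totals = defaultdict(float)
    for r in rows:
        totals[r[by]] += r[metric]
    return dict(totals)


def drill_down(rows, **filters):
    """Keep only the rows whose fields match every given filter value."""
    return [r for r in rows if all(r[k] == v for k, v in filters.items())]


# Plant-wide roll-up KPIs.
print(roll_up(records, "risk", by="kind"))
print(roll_up(records, "downtime_h"))

# Drill into one unit to see which assets drive its risk.
print(roll_up(drill_down(records, unit="Crude"), "risk", by="asset"))
```

The same two functions serve both the top-level report and the sub-KPIs, which only works if risk, downtime and cost are recorded consistently across the frameworks that feed them.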



Reliability programs have come a long way in 100 years, and reliability professionals have more tools than ever at their disposal to increase overall plant availability and process safety. To drive systematic improvements in plant reliability with all these different tools, it is essential for facilities to get the data-management strategy right, to unify their approach to measuring, reporting and mitigating risk, and to roll it up in a KPI framework that combines risk, cost and compliance reports.



Nathanael Ince is client solutions director, supporting the Solutions Department of Pinnacle Advanced Reliability Technologies (One Pinnacle Way, Pasadena, TX 77504; Phone: +1-281-598-1330; Email: nathanael.ince@pinnacleart.com). In this capacity, he works closely with his team of solutions engineers to ensure the department is building and implementing the best asset integrity and reliability programs for PinnacleART’s clients. With more than eight years on the PinnacleART team, Ince is an expert source on mechanical integrity, including proper assessment and implementation of risk-based mechanical-integrity programs. Ince has a B.S.M.E. degree from Texas A&M University.
