7X24 MAGAZINE SPRING 2015
Risk, Reliability, and Availability in Critical Facilities
A critical facility is one where failures are to be avoided to
the maximum extent possible. Failure of “mission-critical”
facilities jeopardizes the operations and sometimes the
existence of the associated enterprise.
It is not possible or desirable to build a number of critical
facilities, observe their performance, and address any
shortcomings in later versions. Critical facilities such as
nuclear power plants must demonstrate that by design
they are extremely unlikely to fail in a manner that
jeopardizes human health and safety.
Data center failures are typically measured in terms of
financial losses. As our society and its infrastructure grow
ever more dependent on the Internet and data centers,
data center failures will inevitably produce human
casualties. Today a failure in a major hospital data center
could compromise patient care and negate many claimed
benefits of centralized digital patient records.
Risk is the product of probability and consequences. If the
consequence of a data center outage is $1 million, the risk
to the associated enterprise is proportional to the
probability of an outage multiplied by $1 million. If the
data center anticipates one failure per 20 years, the
expected loss per year is approximately $50,000 (1/20 x
$1,000,000). If there is a failure every year, the expected
loss is $1 million per year.
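The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `expected_annual_loss` and its parameters are hypothetical, and the sketch assumes that "one failure per N years" can be treated as an average annual failure frequency of 1/N.

```python
def expected_annual_loss(outage_cost: float, years_between_failures: float) -> float:
    """Risk per year = (failures per year) x (consequence of one failure).

    Assumption: failures occur at an average rate of one per
    `years_between_failures` years.
    """
    failures_per_year = 1.0 / years_between_failures
    return failures_per_year * outage_cost


# The two cases from the text: a $1 million outage once per 20 years,
# and a $1 million outage every year.
print(f"${expected_annual_loss(1_000_000, 20):,.0f} per year")
print(f"${expected_annual_loss(1_000_000, 1):,.0f} per year")
```

Doubling either the outage cost or the failure frequency doubles the expected loss, which is why both terms matter when comparing facilities.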
Availability is the fraction of time that a system or asset is
available for use, or is operating. Availability can be a
useful metric for planning revenues and fleet
management, but availability cannot be used to evaluate
risk.
This inherent limit of availability is demonstrated as
follows. In one year, critical facility 1 has a single outage
lasting 24 hours. In the same period, critical facility 2
experiences ten outages of 2.4 hours each. Both facilities
are unavailable for 24 hours of the year (8760 hours).
Both have the same availability: (8760 - 24)/8760, or
about 99.73%.
The two facilities have very different reliability. Critical
facility 2 fails ten times as often as critical facility
1. Data center owners, who typically experience
substantial losses even after brief outages, tend to be
much more interested in how likely a facility is to fail than
in average uptime.
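The comparison of the two facilities can be reproduced with a short Python sketch. The outage logs below are the hypothetical ones from the example, and the names `availability`, `facility_1`, and `facility_2` are illustrative assumptions rather than part of any standard tool.

```python
HOURS_PER_YEAR = 8760


def availability(outage_hours: list[float]) -> float:
    """Fraction of the year the facility was available."""
    return (HOURS_PER_YEAR - sum(outage_hours)) / HOURS_PER_YEAR


# Hypothetical one-year outage logs from the example in the text.
facility_1 = [24.0]        # one outage lasting 24 hours
facility_2 = [2.4] * 10    # ten outages of 2.4 hours each

for name, outages in (("facility 1", facility_1), ("facility 2", facility_2)):
    print(f"{name}: availability = {availability(outages):.4%}, "
          f"failures = {len(outages)}")
```

Availability comes out the same for both logs, while the failure count differs by a factor of ten. The availability figure alone cannot distinguish the two cases.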
CLASS AS A MEASURE OF CRITICAL
FACILITY PERFORMANCE
Reliability is the probability that a system will operate for
the specified period of time (or number of trials), called
the mission. The highest possible reliability is 1; this is not
achievable by human-made systems.
Reliability can be expressed as 1 – (probability of failure).
A synonym for probability of failure is unreliability.
MTech has been calculating reliability, unreliability,
availability, and risk for 17 years. We have observed that
many of our clients find these terms novel and initially
confusing.
We propose to call the percentage unreliability for a
1-year mission the critical facility Class. A Class 10 facility
has a 10% chance of failure per year. A Class 1 facility has a
1% chance of failure per year. A Class 0.1 system has only a
0.1% probability of failure per year, one chance in 1000.
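The relationship between Class, unreliability, and reliability for a 1-year mission can be written as a pair of conversions. This is a minimal sketch; the function names `facility_class` and `reliability_from_class` are hypothetical and chosen only for illustration.

```python
def facility_class(annual_failure_probability: float) -> float:
    """Class = unreliability for a 1-year mission, expressed as a percentage."""
    if not 0.0 <= annual_failure_probability <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    return 100.0 * annual_failure_probability


def reliability_from_class(class_rating: float) -> float:
    """Reliability = 1 - (probability of failure) for a 1-year mission."""
    if not 0.0 <= class_rating <= 100.0:
        raise ValueError("a Class rating must lie between 0 and 100")
    return 1.0 - class_rating / 100.0


# The Class ratings named in the text.
for class_rating in (10, 1, 0.1):
    print(f"Class {class_rating}: "
          f"1-year reliability = {reliability_from_class(class_rating):.3f}")
```

A lower Class number means a more reliable facility, so the scale reads in the opposite direction from reliability itself.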
The unattainable class is Class 0, with no chance of failure.
Gravity is a Class 0 system, but there are no Class 0
systems made by mankind. Like absolute zero
temperature, Class 0 is a limit that can be approached but
lies beyond the abilities of present-day technology.
MTech uses fault tree analysis and associated tools to
develop mathematical models of critical facilities. We use
these models to calculate facility unreliability, sensitivity to
component performance, and risk. Our clients include
data centers, producers of equipment marketed to data
centers, oil and gas facilities, nuclear power plants, and
makers of energy storage products.
Our experience over the past 17 years is that most data
centers could be Class 5 or perhaps Class 2 facilities. A
few are Class 1, and few if any would realistically meet a
Class 0.1 standard.
The fact that nearly all critical facilities are exposed to
external hazards beyond their control can be reflected in
the Class rating. Many disaster recovery plans use 100-
year events as the threshold for activation. It is deemed
too expensive to engineer effective facility protection for
the 100-year fire, flood, or earthquake. Instead, the disaster
recovery plan is executed should such an event occur.
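To see how such thresholds bear on a Class rating, the following Python sketch estimates the yearly chance that at least one external hazard occurs. It rests on two simplifying assumptions that the text does not state: a T-year event has an annual probability of 1/T, and the hazards are statistically independent. The function name `annual_probability_of_any` and the hazard table are hypothetical.

```python
def annual_probability_of_any(return_periods_years) -> float:
    """Probability that at least one hazard occurs in a given year.

    Assumptions (illustrative only): a T-year event has annual
    probability 1/T, and the hazards are independent of one another.
    """
    probability_of_none = 1.0
    for period in return_periods_years:
        probability_of_none *= 1.0 - 1.0 / period
    return 1.0 - probability_of_none


# Hypothetical example: three hazards, each treated as a 100-year event.
hazards = {"fire": 100, "flood": 100, "earthquake": 100}
p_any = annual_probability_of_any(hazards.values())
print(f"Annual probability of at least one event: {100 * p_any:.2f}%")
```

Under these assumptions, unprotected external hazards alone contribute a few percent per year to the chance of failure, regardless of how reliable the facility's internal systems are.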
This suggests diminishing returns for investing in facilities
with Class ratings much lower than 1. Even if 250-year
events are set as the threshold, three independent