7X24 MAGAZINE SPRING 2015

Risk, Reliability, and Availability in Critical Facilities

A critical facility is one where failures are to be avoided to the maximum extent possible. Failure of “mission-critical” facilities jeopardizes the operations and sometimes the existence of the associated enterprise.

It is not possible or desirable to build a number of critical facilities, observe their performance, and address any shortcomings in later versions. Critical facilities such as nuclear power plants must demonstrate that by design they are extremely unlikely to fail in a manner that jeopardizes human health and safety.

Data center failures are typically measured in terms of financial losses. As our society and its infrastructure grow ever more dependent on the Internet and data centers, data center failures will inevitably produce human casualties. Today, a failure in a major hospital data center could compromise patient care and negate many claimed benefits of centralized digital patient records.

Risk is the product of probability and consequences. If the consequence of a data center outage is $1 million, the risk to the associated enterprise is proportional to the probability of an outage multiplied by $1 million. If the data center anticipates one failure per 20 years, the expected loss per year is $50,000 (1/20 × $1,000,000). If there is a failure every year, the expected loss is $1 million per year.
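
The arithmetic above can be sketched in a few lines of Python. The function name `expected_annual_loss` and its parameters are illustrative assumptions, not part of any MTech tool; the sketch simply treats risk as outage frequency multiplied by consequence.

```python
def expected_annual_loss(consequence_usd: float, failures_per_year: float) -> float:
    """Return the expected loss per year: outage frequency times consequence."""
    if consequence_usd < 0 or failures_per_year < 0:
        raise ValueError("consequence and frequency must be non-negative")
    return consequence_usd * failures_per_year


# The two cases from the text: one failure per 20 years, and one per year.
rare = expected_annual_loss(1_000_000, 1 / 20)
frequent = expected_annual_loss(1_000_000, 1.0)
print(f"one failure per 20 years: ${rare:,.0f} per year")
print(f"one failure per year:     ${frequent:,.0f} per year")
```

The same function can be reused with a different consequence figure to compare facilities whose outages cost more or less than $1 million.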

Availability is the average fraction of time that a system or asset is operating or available for use. Availability can be a useful metric for planning revenues and fleet management, but availability alone cannot be used to evaluate risk.

This inherent limit of availability is demonstrated as follows. In one year, critical facility 1 has a single outage lasting 24 hours. In the same period, critical facility 2 experiences ten outages of 2.4 hours each. Both facilities have 24 hours during the year (8760 hours) when they are not available. Both have the same availability: (8760 - 24)/8760, or about 99.73%.

The two facilities have very different reliability. Critical facility 2 failed ten times as often as critical facility 1. Data center owners, who typically experience substantial losses even after brief outages, tend to be much more interested in how likely a facility is to fail than in average uptime.
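
The comparison can be checked with a short Python sketch. Durations are held in whole minutes (an implementation choice made here to avoid floating-point rounding when summing 2.4-hour outages), and the names `availability`, `facility_1`, and `facility_2` are illustrative only.

```python
MINUTES_PER_YEAR = 8760 * 60  # 8760 hours, as in the text


def availability(outage_minutes: list[int]) -> float:
    """Fraction of the year the facility was up, given outage durations in minutes."""
    return (MINUTES_PER_YEAR - sum(outage_minutes)) / MINUTES_PER_YEAR


facility_1 = [24 * 60]   # one 24-hour outage
facility_2 = [144] * 10  # ten outages of 2.4 hours (144 minutes) each

for name, outages in (("facility 1", facility_1), ("facility 2", facility_2)):
    print(f"{name}: availability {availability(outages):.4%}, {len(outages)} outage(s)")
```

Both facilities report the same availability, while the outage counts differ by a factor of ten. The outage count is the figure that availability hides.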

CLASS AS A MEASURE OF CRITICAL FACILITY PERFORMANCE

Reliability is the probability that a system will operate without failure for a specified period of time (or number of trials), called the mission. The highest possible reliability is 1; this is not achievable by human-made systems.

Reliability can be expressed as 1 – (probability of failure). A synonym for probability of failure is unreliability.

MTech has been calculating reliability, unreliability, availability, and risk for 17 years. We have observed that many of our clients find these terms novel and initially confusing.

We propose to call the unreliability for a 1-year mission, expressed as a percentage, the critical facility Class. A Class 10 facility has a 10% chance of failure per year. A Class 1 facility has a 1% chance of failure per year. A Class 0.1 system has only a 0.1% probability of failure per year, one chance in 1000.

The unattainable class is Class 0, with no chance of failure. Gravity is a Class 0 system, but there are no Class 0 systems made by mankind. Like the temperature absolute zero, Class 0 is a physical possibility but beyond the abilities of present-day technology.
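
As a minimal sketch, the Class definition reduces to a one-line conversion. The function names below are hypothetical. The second function adds an assumption that the article does not make: a constant failure rate (the exponential model), which is one common way to turn an expected failure frequency into a 1-year probability of failure.

```python
import math


def facility_class(annual_unreliability: float) -> float:
    """Class = probability of failure over a 1-year mission, as a percentage."""
    if not 0.0 <= annual_unreliability <= 1.0:
        raise ValueError("unreliability must be a probability between 0 and 1")
    return 100.0 * annual_unreliability


def annual_unreliability_from_rate(failures_per_year: float) -> float:
    """ASSUMPTION: constant failure rate (exponential model), 1-year mission."""
    return 1.0 - math.exp(-failures_per_year)


# The three examples from the text, given as annual probabilities of failure.
for p in (0.10, 0.01, 0.001):
    print(f"unreliability {p:.3f} -> Class {facility_class(p):g}")

# Under the exponential assumption, one expected failure per 20 years:
p_20 = annual_unreliability_from_rate(1 / 20)
print(f"one failure per 20 years -> Class {facility_class(p_20):.2f}")
```

Under that assumed model, a facility expecting one failure per 20 years comes out slightly below Class 5, because the chance of at least one failure in a year is a little less than the expected number of failures.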

MTech uses fault tree analysis and associated tools to develop mathematical models of critical facilities. We use these models to calculate facility unreliability, sensitivity to component performance, and risk. Our clients include data centers, producers of equipment marketed to data centers, oil and gas facilities, nuclear power plants, and makers of energy storage products.

Our experience over the past 17 years is that most data centers could be Class 5 or perhaps Class 2 facilities. A few are Class 1, and few if any would realistically meet a Class 0.1 standard.

The fact that nearly all critical facilities are exposed to external hazards beyond their control can be reflected in the Class rating. Many disaster recovery plans use 100-year events as the threshold for activation. It is deemed too expensive to engineer effective facility protection for the 100-year fire, flood, or earthquake. The disaster recovery plan is executed should one occur.
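
One way to see how external hazards feed into a Class rating is to convert return periods into annual probabilities. The sketch below is illustrative only: it assumes an N-year event has a 1/N chance of occurring in any given year and that the hazards are independent of one another. The function names and the hazard list are hypothetical.

```python
import math


def annual_probability(return_period_years: float) -> float:
    """Annual chance of an event that occurs on average once per return period."""
    if return_period_years <= 0:
        raise ValueError("return period must be positive")
    return 1.0 / return_period_years


def combined_annual_probability(return_periods: list[float]) -> float:
    """ASSUMPTION: hazards are independent. Chance that at least one occurs in a year."""
    return 1.0 - math.prod(1.0 - annual_probability(t) for t in return_periods)


# Hypothetical hazard list using the 100-year threshold mentioned in the text.
hazards = {"fire": 100, "flood": 100, "earthquake": 100}

for name, period in hazards.items():
    print(f"{name}: {100 * annual_probability(period):.2f}% per year")

combined = combined_annual_probability(list(hazards.values()))
print(f"at least one of the three: {100 * combined:.2f}% per year")
```

Under these assumptions, each unprotected 100-year hazard alone contributes about one percentage point to the facility's Class, and several hazards together contribute more.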

This suggests diminishing returns for investing in facilities with Class ratings much lower than 1. Even if 250-year events are set as the threshold, three independent