The Availability and Resilience Perspective

In the traditional data processing model of system availability, computers supported the mainstream business of the organization during the day (typically 9 A.M. to 5:30 P.M., Monday through Friday) by capturing orders, cash withdrawals, or other sorts of transactions. Then the computers reverted to batch mode during the night to perform tasks such as reconciliation, consolidation, and exchange of information with other systems.

Although we still see this model in some organizations, in recent years there has been a significant change in the way that most companies carry out their business, driven to a large extent by the Internet and the global operations of large organizations. The business day has in general become longer, often extending into the weekend (the traditional preserve of huge, long-running batch jobs), and near-continuous operation has become the norm in many places.

Today’s requirement for many systems, therefore, is to be available for much, if not all, of the twenty-four-hour cycle. With the improved reliability of hardware and, to a lesser extent, software, many expect that failures will be few and far between and that, where they do occur, recovery will be prompt, effective, and largely automated. As the large number of Web-site failures in the early years of Internet e-commerce showed, any system exposed directly to your customers must be up and running; if it isn’t, your company’s reputation will suffer.

This business environment means that getting your availability characteristics wrong can be very expensive. However, increased online availability comes at a cost, whether in terms of more hardware, increased software sophistication, or redundancy in your telecommunications network.

Desired Quality

The ability of the system to be fully or partly operational as and when required and to effectively handle failures that could affect system availability

Applicability

Any system that has complex or extended availability requirements, complex recovery processes, or a high profile (e.g., is visible to the public)

Concerns

classes of service
planned downtime
unplanned downtime
time to repair
disaster recovery
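
The relationship between unplanned downtime and time to repair can be made concrete with a small calculation. The sketch below uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR), to turn a failure rate and a repair time into an expected annual downtime figure. The function names and the sample numbers are illustrative assumptions, not values taken from the text.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair (MTTR), both in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected unplanned downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical component: fails on average every 999 hours, takes 1 hour to repair.
a = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"availability: {a:.3%}")
print(f"expected downtime: {annual_downtime_hours(a):.2f} hours/year")
```

Halving the time to repair roughly halves the expected downtime for the same failure rate, which is why time to repair is listed as a concern in its own right. Planned downtime is not covered by this formula and has to be budgeted separately.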

Activities

capture the availability requirements
produce the availability schedule
estimate platform availability
estimate functional availability
assess against the requirements
rework the architecture
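
One common way to approach the "estimate platform availability" activity is to combine per-component availability figures. The sketch below assumes component failures are independent, which real platforms rarely satisfy, so the result is best treated as an optimistic upper bound. All names and figures are hypothetical and chosen only for illustration.

```python
from math import prod


def serial(avails):
    """Availability of components that must ALL be up (a chain of dependencies)."""
    return prod(avails)


def redundant(avails):
    """Availability of replicas where ANY one being up is sufficient."""
    return 1.0 - prod(1.0 - a for a in avails)


# Hypothetical platform: two replicated web servers, one database, one network link.
web_tier = redundant([0.99, 0.99])   # replication raises the tier above either server
database = 0.999                     # a single instance: a single point of failure
network = 0.9995

platform = serial([web_tier, database, network])
print(f"web tier: {web_tier:.4%}")
print(f"platform: {platform:.4%}")
```

The serial result is always lower than the weakest component in the chain, so the estimate can be compared directly with the stated requirement and used to decide where the architecture needs rework.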

Tactics

select fault-tolerant hardware
use high-availability clustering and load balancing
log transactions
apply software availability solutions
select or create fault-tolerant software
design for failure
allow for component replication
relax transactional consistency
identify backup and disaster recovery solutions

Pitfalls

single point of failure
cascading failure
unavailability through overload
overambitious availability requirements
ineffective error detection
overestimation of component resilience
overlooked global availability requirements
incompatible technologies

