Software Systems Architecture

The Availability and Resilience Perspective

In the traditional data processing model of system availability, computers supported the mainstream business of the organization during the day (typically 9 A.M. to 5:30 P.M., Monday through Friday) by capturing orders, cash withdrawals, or other sorts of transactions. Then the computers reverted to batch mode during the night to perform tasks such as reconciliation, consolidation, and exchange of information with other systems.

Although we still see this model in some organizations, in recent years there has been a significant change in the way that most companies carry out their business, driven to a large extent by the Internet and the global operations of large organizations. The business day has in general become longer, often extending into the weekend (the traditional preserve of huge, long-running batch jobs), and near-continuous operation has become the norm in many places.

Today’s requirement for many systems, therefore, is to be available for much, if not all, of the twenty-four-hour cycle. With the improved reliability of hardware and, to a lesser extent, software, many people expect that failures will be few and far between and that, where they do occur, recovery will be prompt, effective, and largely automated. As the large number of Web-site failures in the early years of Internet e-commerce showed, any system exposed directly to your customers must be up and running; if it isn’t, your company’s reputation will suffer.

This business environment means that getting your availability characteristics wrong can be very expensive. However, increased online availability comes at a cost, whether in terms of more hardware, increased software sophistication, or redundancy in your telecommunications network.

Desired Quality: The ability of the system to be fully or partly operational as and when required and to effectively handle failures that could affect system availability.
Applicability: Any system that has complex or extended availability requirements, complex recovery processes, or a high profile (e.g., is visible to the public).
Concerns:
  • classes of service
  • planned downtime
  • unplanned downtime
  • time to repair
  • disaster recovery
Activities:
  • capture the availability requirements
  • produce the availability schedule
  • estimate platform availability
  • estimate functional availability
  • assess against the requirements
  • rework the architecture
Architectural Tactics:
  • select fault-tolerant hardware
  • use high-availability clustering and load balancing
  • log transactions
  • apply software availability solutions
  • select or create fault-tolerant software
  • design for failure
  • allow for component replication
  • relax transactional consistency
  • identify backup and disaster recovery solutions
Problems and Pitfalls:
  • single point of failure
  • cascading failure
  • unavailability through overload
  • overambitious availability requirements
  • ineffective error detection
  • overestimation of component resilience
  • overlooked global availability requirements
  • incompatible technologies
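
The "estimate platform availability" and "assess against the requirements" items listed above can be illustrated with some simple arithmetic. The sketch below is not taken from the source. It assumes that component failures are independent, that components which must all be up combine in series (availabilities multiply), and that redundant components combine in parallel (the group is down only when every member is down). All function names, component names, and figures are hypothetical and chosen purely for illustration.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def series_availability(availabilities):
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Redundant components: the group fails only if all members fail."""
    return 1 - prod(1 - a for a in availabilities)


def annual_downtime_hours(availability):
    """Expected hours of unavailability per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


def meets_requirement(estimated, required):
    """Compare an estimate against the stated availability requirement."""
    return estimated >= required


# Hypothetical platform: two redundant web servers, one database, one network link.
web_tier = parallel_availability([0.99, 0.99])
database = 0.999   # assumed figure
network = 0.9995   # assumed figure
platform = series_availability([web_tier, database, network])

required = 0.999   # assumed requirement ("three nines")

print(f"Estimated platform availability: {platform:.4%}")
print(f"Expected downtime per year: {annual_downtime_hours(platform):.1f} hours")
print(f"Meets requirement: {meets_requirement(platform, required)}")
```

With these assumed figures the redundant web tier is no longer the weak point; the single database and network link dominate the estimate, which falls short of the assumed requirement. That kind of result is what would prompt the "rework the architecture" activity, for example by removing a single point of failure. Real estimates would also need to account for correlated failures, planned downtime, and time to repair, which this sketch ignores.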
