Introduction
We are now at a point where companies are as reliant on their computer systems and tech stacks as they are on almost any other part of their business. We have come from a time when these systems were supplementary, where they simply added value, to a point where they are the foundation of almost everything we do. It is no surprise, then, that great care must be taken when building modern computer systems or software applications to ensure that they will always be there for the people who need them. For our customers, our employees and ourselves: modern systems must be highly available.
What does this mean?
Availability is a term used to describe the ability of a system or service to be ready and able to serve its purpose on demand.
Exactly what this means varies by the type and purpose of the system in question. For example, if we are speaking about a web application, we would likely expect it to be available for users to browse on their device of choice and carry out the tasks for which it was intended. For a flight web app, this may be to book a flight or search for available flights. For a banking app, we would expect the user to be able to access it and view their balance, transfer money or access statements.
The actions and features of the system or application are decided during its design. However, just because a system or service can do something doesn’t mean that it always will. Our job as solutions architects and engineers is to ensure that these systems can continue to provide all of their features and actions regardless of circumstances, adverse factors, or scale - and this is a challenge.
Availability is often measured as the percentage of a given period for which the system has been completely functional. These days it is unusual to see a target of less than 99% for important systems and you will often see ambitious targets of 99.999% (sometimes referred to as five 9s), which allows for only about 5.26 minutes of downtime per year!
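To make these percentages concrete, the short sketch below converts an availability target into a yearly downtime budget. The function name and the choice of a 365.25-day year are assumptions made for illustration; they are not part of any standard.

```python
# Minutes in an average year (365.25 days, so leap years are averaged in).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Print the budget for a few commonly quoted targets.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

Each extra 9 cuts the budget by a factor of ten, which is why moving from 99.9% to 99.999% is so much harder than it sounds.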
These targets may seem very achievable for a small-scale system or application which serves only a few users, but they can be very hard to realise with a large-scale distributed system serving 100 million users per month. The reason for this is primarily the required complexity of the system. The more complex a system must become, the more points of failure it introduces. Furthermore, the increasing number of requests creates more opportunities for a problem to occur, and it forces the system to scale.
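One way to see why complexity hurts is to multiply component availabilities together. The sketch below assumes, purely for illustration, that every component is required to serve a request and that components fail independently; real systems rarely satisfy either assumption exactly, and the function name is hypothetical.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities
    (expressed as fractions between 0 and 1) simply multiply.
    """
    return prod(component_availabilities)


# One component at 99.9% versus a chain of ten such components.
single = serial_availability([0.999])
chain_of_ten = serial_availability([0.999] * 10)
print(f"one component:  {single:.4%}")
print(f"ten components: {chain_of_ten:.4%}")
```

Under these assumptions, ten components that each meet 99.9% give a chain that sits only just above 99%, so every dependency added to the critical path lowers the ceiling for the whole system.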
How can we achieve it?
To achieve high availability, we must consider it from the absolute beginning. Too often I see designs being built around features without consideration for how available they need to be (and all of the things that come with that). There should be a target and a clear path to reaching it. There should be resilience and auto healing built in and there should be data - because without data we don’t know anything.
Some considerations when designing and building systems:
Remove single points of failure (SPOF). The system is only as robust as its weakest point. Sometimes it can be very hard to engineer out single points of failure, but it is necessary to ensure the system is fit for purpose. It is usually easier to design a system without any SPOF than to remove them from an existing system. Sometimes you must increase the project budget to accommodate this requirement, but by reducing future outages this will usually pay for itself.
Build for failure. Never assume the system will never fail - it will. When designing a system, I start by thinking of all the ways it could fail and everything that could go wrong, and then I try to eliminate these possibilities. Thinking of as many future issues as possible makes it much easier to anticipate problems in the design and ultimately leads to a stronger system.
Use redundancy when appropriate. Redundancy is when a system has a backup component or service which can be used if the primary is in an undesirable state. It can be implemented in many different ways depending on the type of system and is usually active/active or active/standby in nature. Active/active systems use all resources to some extent during normal operation and usually load balance between themselves, whereas active/standby (also called active/passive) systems let the active system do the work while the standby waits, ready to take over in the event of a failure. Redundancy is important in almost all systems. Container-based systems can also make use of it by load balancing across containers or pods. Capacity planning should be considered when planning for redundancy.
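The effect of redundancy can be estimated with a small calculation. This sketch assumes that replicas fail independently and that failover is instant and flawless; both are simplifying assumptions, and the function name is hypothetical.

```python
def redundant_availability(replica_availability: float, replicas: int) -> float:
    """Availability of a group that is up while at least one replica is up.

    The group is down only when every replica is down at the same time,
    which (assuming independence) has probability (1 - a) ** n.
    """
    all_down = (1 - replica_availability) ** replicas
    return 1 - all_down


# Compare one, two and three replicas that are each 99% available.
for n in (1, 2, 3):
    print(f"{n} replica(s): {redundant_availability(0.99, n):.6%}")
```

In practice, shared dependencies such as a common network, region or deployment pipeline mean replicas do not fail independently, so real gains are smaller than this model suggests.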
Plan capacity requirements. It is important when building a system to ensure you have enough computing capacity for the load you expect to receive. You should also have enough capacity to provide redundancy in case of failure. Planning for capacity can be difficult with new systems, because there usually isn’t any data available. Therefore, educated predictions must be made and the system must be built to scale rapidly in case your prediction is off. Cloud-based systems scale very well (they have elasticity built in), whereas on-premises systems can be more difficult to scale.
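As a rough illustration of combining expected load with redundancy, the sketch below sizes a fleet for a predicted peak and then adds spare instances. The parameter names and the example figures are invented for illustration only.

```python
from math import ceil


def instances_needed(peak_requests_per_second: float,
                     requests_per_second_per_instance: float,
                     spare_instances: int = 1) -> int:
    """Instances required to serve the peak, plus spares for redundancy."""
    for_load = ceil(peak_requests_per_second / requests_per_second_per_instance)
    return for_load + spare_instances


# Hypothetical prediction: 1000 requests/s at peak, 300 requests/s per instance.
print(instances_needed(1000, 300))                      # one spare by default
print(instances_needed(1000, 300, spare_instances=2))   # tolerate two failures
```

With the spare in place, losing a single instance at peak still leaves enough capacity to serve the predicted load.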
Ensure scalability. Systems should ideally have elasticity as a primary design property. That means they should be able to scale up and down rapidly as load changes, without any noticeable impact on the service. Scaling can be described as vertical or horizontal. Vertical scaling is usually less desirable and involves giving one system more resources (it can be useful for monolithic systems). Horizontal scaling is usually preferred and involves provisioning more instances of a system (this could be more EC2 instances, more containers etc).
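A horizontal scaling decision can be sketched as a simple proportional rule: keep enough instances that average utilisation returns to a target. The rule, the function name and the numbers below are illustrative assumptions, not the behaviour of any particular autoscaler.

```python
def desired_instances(current: int, load_percent: int, target_percent: int,
                      minimum: int, maximum: int) -> int:
    """Return how many instances bring utilisation back to the target.

    Percentages are integers so that ceiling division stays exact and
    avoids floating-point rounding surprises.
    """
    # Ceiling of (current * load_percent) / target_percent.
    proportional = -(-(current * load_percent) // target_percent)
    # Never go below the floor or above the ceiling of the fleet size.
    return max(minimum, min(maximum, proportional))


# Four instances running hot at 90% with a 60% target: scale out.
print(desired_instances(4, 90, 60, minimum=2, maximum=10))
# Ten instances idling at 20% with a 60% target: scale in.
print(desired_instances(10, 20, 60, minimum=2, maximum=10))
```

The minimum and maximum bounds matter as much as the rule itself: the floor protects redundancy when load is low, and the ceiling protects the budget when a metric misbehaves.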
Promote self-healing. Some systems can be built to support self-healing. This is common in container-based systems, where containers are monitored for a healthy state. If a container is found to be unhealthy, it can be terminated. A scaling policy can be configured to enforce a minimum number of containers; once it detects that the count has dropped to n - 1, it creates a new instance to replace the terminated container. This allows the system to resolve issues itself and replace broken components.
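The self-healing behaviour described above can be sketched as a small reconciliation step: drop unhealthy containers, then top the fleet back up to the minimum. This is a toy in-memory model; the `Container` class, the `reconcile` function and the naming of replacements are all hypothetical, and a real orchestrator would do this for you.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Container:
    name: str
    healthy: bool = True


# Counter used only to give replacement containers unique names.
_replacement_ids = count(1)


def reconcile(containers, minimum):
    """Terminate unhealthy containers and restore the minimum count."""
    # Keep only the containers that passed their health check.
    survivors = [c for c in containers if c.healthy]
    # Create fresh containers until the minimum is satisfied again.
    while len(survivors) < minimum:
        survivors.append(Container(name=f"replacement-{next(_replacement_ids)}"))
    return survivors


fleet = [Container("a"), Container("b", healthy=False), Container("c")]
fleet = reconcile(fleet, minimum=3)
print([c.name for c in fleet])
```

Running this step on a schedule, or whenever a health check fails, is what lets the fleet converge back to its desired state without anyone being paged.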
Know what healthy means. Through monitoring and data collection, learn what the system should look like when it is in a healthy state. With new systems this may involve predictions at first, but data can then be gathered and the definitions can gradually evolve. Only by really understanding how the system should work are we able to know when something is wrong.
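One simple way to turn collected data into a definition of healthy is to build a baseline from past measurements and flag values that stray too far from it. The sketch below uses the mean and standard deviation with a threshold of three standard deviations; that threshold, the function names and the sample latencies are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarise healthy measurements as (mean, sample standard deviation).

    Needs at least two samples, because stdev is undefined for fewer.
    """
    return mean(samples), stdev(samples)


def is_unhealthy(value, baseline, tolerance=3.0):
    """Flag a value more than `tolerance` standard deviations from the mean."""
    centre, spread = baseline
    return abs(value - centre) > tolerance * spread


# Hypothetical response times (ms) gathered while the system was healthy.
baseline = build_baseline([100, 102, 98, 101, 99])
print(is_unhealthy(101, baseline))  # within the normal range
print(is_unhealthy(130, baseline))  # far outside the normal range
```

As more data arrives, the baseline can be rebuilt so that the definition of healthy evolves with the system.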
Collect (the right) data. Many of the above points rely on some kind of data. Ensure that proper monitoring and log collection are established from the beginning and collect as much of the right data as possible. Whilst it’s better to have all of the data than none of it, too much irrelevant data will make it hard to process and analyse. Decide which data will be valuable and collect it automatically. Routinely revisit the data and adjust the system based on what it is telling you.
Summary
Hopefully this post has given you some ideas for where to start when building highly available systems. This list of considerations is by no means complete, and whilst many of the points in this article should probably be considered for all systems, it is important to remember that there is rarely one correct solution in solutions architecture. Decide what is right for the system you are building and use data to drive decisions where possible. I frequently revisit designs and make changes based on new insight, and I think that is perfectly normal. Modern systems are never finished.