Headline-grabbing IT failures such as the one recently suffered by BA should make us pause and reflect on how they happen. IT is a critical part of most companies, but it is often neglected, seen as a cost rather than a key system or an enabler. It’s easy to focus on the ‘sexy’ parts of IT that deliver new products or generate income and to neglect the boring aspects.
Infrastructure, and especially DR (Disaster Recovery), certainly falls into the boring category for most companies, yet it’s always surprising how large the organisations are that find themselves in this position.
DR is the process of recovering IT systems following a disaster. Most people think of a disaster as a flood, an earthquake or a fire, but in truth it covers any scenario where the primary infrastructure is unavailable. A common scenario is power failure (such as BA reported), and most data centres have backup generators and UPS (Uninterruptible Power Supplies) to ensure the power keeps flowing. Also common is a network outage, either through device failure or WAN links being broken (underground fibres being dug up, for example). Something more mundane, but much more common, is a single point of failure in IT equipment: I’ve had to manage plenty of recoveries from storage devices that were supposed to be fully resilient. DR is designed to protect against anything that takes out a significant element of IT.
A common misconception about DR results from how people concentrate on the flood or the power cut and assume they’ll never actually need to use DR. They miss the more likely causes, and as a result DR loses credibility as a requirement, and therefore focus within the IT organisation. I’ve worked in environments where people openly mocked the ability to use DR, in spite of the millions of pounds spent designing and installing it.
All significant operations, especially those on the scale of BA, should have a comprehensive risk-based model covering all scenarios. This helps the business understand:
– what it is you’re protecting
– what the value of the asset is (losses/hour)
– how likely it is to fail
– what mitigation you should have in place (DR)
A key outcome is that you therefore understand all the reasons to have DR in place, not just the act-of-god style disasters.
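As a minimal sketch, the risk model above boils down to an expected-annual-loss figure per scenario that you can weigh against the cost of the mitigation. All figures here are invented for illustration, not BA’s numbers:

```python
# Hypothetical figures for illustration only -- substitute your own
# asset values, outage durations and likelihoods.
def expected_annual_loss(loss_per_hour, outage_hours, annual_probability):
    """Expected yearly cost of one scenario: value at risk x likelihood."""
    return loss_per_hour * outage_hours * annual_probability

scenarios = {
    # scenario: (losses/hour in GBP, typical outage length, chance per year)
    "data centre power failure": (100_000, 24, 0.02),
    "WAN fibre cut":             (100_000, 8,  0.10),
    "storage array failure":     (100_000, 12, 0.20),
}

total = sum(expected_annual_loss(*v) for v in scenarios.values())
# If DR costs less per year than this figure, the business case is made --
# and notice that the mundane failures dominate the 'act-of-god' ones.
print(f"Expected annual loss without DR: £{total:,.0f}")
```

Notice that in this made-up example the unglamorous storage failure contributes far more expected loss than the dramatic power failure, which is exactly the point about missing the more likely causes.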
In BA’s case we don’t know much more than that the primary cause was a power failure at a primary data centre. We can speculate that whilst the data centre probably had resilient power feeds, UPS and generators, the failure was inside the building: a power distribution unit, or the UPS itself, for example. I’ve seen exactly that during a planned maintenance window, where the power switch itself failed, taking out the grid feed, the UPS and the generator. There is an important difference between a Tier 2 and a Tier 3 DC.
What we do know is that the system took many hours to recover, so whatever DR systems were in place didn’t work. The chances are, as recent articles have speculated, that the process itself was poorly understood. In a cost-focused organisation it seems credible that management were focused on reducing the cost of IT at the expense of the boring bits. Or perhaps delivery deadlines focused people on getting new projects shipped. Unfortunately, it’s not uncommon for DR to be deemed insufficiently important to block acceptance of a new system into service. Organisations ‘accept the risk’ because they’re under pressure from things that seem more important at the time.
If the DR process was poorly understood, and it didn’t work, it’s almost certain that it hadn’t been tested recently. It’s absolutely critical for DR (and for backups, for the same reason) that the process is tested regularly. This is because:
– people will understand how to do it
– you find out which aspects don’t work and fix them
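As a trivial illustration of the principle, here is a sketch of a restore test: back up, actually restore, and prove the result matches. The real DR equivalent is restoring whole systems into a real environment, but the habit is the same — untested backups are just hope. File names are invented:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    """Fingerprint a file so we can compare original and restored copies."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    primary = Path(tmp) / "bookings.db"
    backup = Path(tmp) / "bookings.db.bak"
    restored = Path(tmp) / "restored.db"

    primary.write_text("flight data")   # stands in for the live system
    shutil.copy(primary, backup)        # the backup job everyone runs

    # The step many organisations skip: actually perform the restore
    # and prove the result matches the original.
    shutil.copy(backup, restored)
    restore_ok = checksum(restored) == checksum(primary)

print("restore test passed" if restore_ok else "restore test FAILED")
```

Run something of this shape on a schedule, not once at go-live, and the two bullet points above come for free.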
Everything we’ve talked about so far has focused on DCs and traditional infrastructure, so how does it relate to Public Cloud?
Firstly, you still have to plan DR. Some services are location-specific, some include the DR option as a tick box, and some are seamless. Work out which services require little or no thought, and which require your design to include DR.
– An EC2 instance is located in a specific AWS Availability Zone (for an AZ, think physical DC and it’s a good approximation). If the AZ fails (and they have before), your EC2 instance will disappear.
– An RDS instance has the option to choose Multi-AZ. When you tick this option, and somewhat dependent on the RDBMS you selected, the RDS service handles the DR between Availability Zones in the same region. It’s important to understand the recovery mechanism though: you’ll get an RPO of around 5 minutes, and an RTO that depends on the size and type of the instance.
– S3 is fully managed from a DR perspective. RPO is near zero (depending on the point at which you consider the object to be stored), and RTO is near zero.
So if your solution uses all three, you have to plan for EC2 instances, think about RDS data loss and system recovery, and you can more or less forget about S3.
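That planning split can be sketched as a simple table of who owns DR for each service in the example stack. The classification reflects the discussion above; the figures are the approximate ones quoted, not guarantees:

```python
# Who owns DR for each service in the example stack?
# "you" means your design must handle it; "provider" means it's a
# tick-box or fully managed by AWS.
dr_plan = {
    "EC2": {"owner": "you",
            "note": "instance dies with its AZ; replace it or run multi-AZ"},
    "RDS": {"owner": "provider",
            "note": "Multi-AZ option; ~5 min RPO, RTO varies with size/type"},
    "S3":  {"owner": "provider",
            "note": "fully managed; RPO and RTO near zero"},
}

# The services that need explicit DR design work from you:
needs_design_work = [svc for svc, v in dr_plan.items() if v["owner"] == "you"]
print(needs_design_work)
```

Doing this exercise per service, before go-live, is exactly the ‘work out which services require thought’ step.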
Here are three cloud-enabled patterns for DR that bring immediate improvements over traditional systems:
Active/active across AZs
Big, traditional systems, such as those we suspect are in place at BA, run in only one DC and require a lengthy process to restore service at another site. Why not design your solution to be active/active? This solves more than one problem:
– you’ve solved your HA (High Availability) requirement to guard against individual component failure
– you’ve solved your DR requirement to be resilient across locations (or availability zones)
– you’re permanently testing the DR process, so you have a much higher confidence that it works
DR by DevOps
One method of ensuring DR always works in a traditional environment is to deploy new releases to the passive site and use the DR process as the cutover. During the release window you effectively don’t have DR, but the risk window is short, and the improvement in reliability makes it a good pattern in most cases. You could follow the same pattern in Public Cloud, but why not automate it through DevOps? The DR process then becomes a button press in your favourite automation tool (we like Jenkins, for example).
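The release-is-the-DR-test pattern can be sketched as a small pipeline. In practice each step would shell out to your deployment tooling from a Jenkins (or similar) job; here the steps just record what they would do so the flow is visible, and all the names are hypothetical:

```python
# Sketch of the "deploy to passive, then cut over" release pattern.
log = []

def deploy(site, version):
    log.append(f"deploy {version} to {site}")

def smoke_test(site):
    log.append(f"smoke test {site}")
    return True  # assume the tests pass for this sketch

def cut_over(to_site):
    log.append(f"cut over to {to_site}")

def release(version, active="site-A", passive="site-B"):
    deploy(passive, version)      # the passive site gets the new build
    if smoke_test(passive):       # prove it works before it takes traffic
        cut_over(passive)         # the DR cutover *is* the release
        return passive, active    # roles swap: old active is new passive
    raise RuntimeError("release aborted; active site untouched")

active, passive = release("v2.0")
print(active)  # the former passive site is now live -- and the DR
               # process has just been exercised for real
```

Every release then doubles as a full DR rehearsal, which is the strongest possible answer to the ‘was it tested recently?’ question.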
Software Defined DR
Even better, why not use a blue/green deployment method? Instead of accepting the risk window whilst you deploy to the passive site, simply build an entirely new system using software-defined infrastructure and cut across once it’s ready and tested. You can even use this across regions. Why not have primary in the UK (AWS EU-WEST-2) and DR in Dublin (AWS EU-WEST-1)?
For the fully paranoid (risk-averse), at LayerV we have built solutions across different Public Cloud providers, simply to deal with the concentration risk of a single supplier, both technically and contractually.
At LayerV we’re experienced at designing, building and operating solutions that accurately match business risk, and that take maximum advantage of Public Cloud as an enabler for smarter, cheaper and more effective DR solutions.