Today's EC2 / EBS Outage: Lessons learned

Today Britain woke to the news that Amazon Web Services had suffered a major outage in its US East facility. This affected Heroku, Reddit, Foursquare, Quora and many more well-known internet services hosted on EC2. The cause of the outage appears to have been a case of so-called 'auto-immune disease': Amazon's automated processes began remirroring a large number of EBS volumes, which had the knock-on effect of significantly degrading EBS (and thus RDS) performance and availability across multiple availability zones. Naturally the nay-sayers were out in force, decrying cloud-based architectures as doomed to failure from the very start. As the dust starts to settle, we attempt to distill some lessons from the outage.

Expect downtime

The first and most obvious point to make is that downtime is inevitable. Clouds fail. Datacenters fail. Disasters happen. People trying to draw a causal link between deploying to the cloud and failure in general are missing the point.

What matters is how you respond to downtime. At Atalanta Systems we challenge our clients to switch off machines at random. If their architecture isn't built to withstand failure, we've failed in helping them. Incidentally, we've been doing this for years, long before anyone ever mentioned 'chaos monkeys'.

Especially in a cloudy world, expect failure - EC2 instances can and will randomly crash. Expect this, and you won't be disappointed. From day one, expect hardware problems, expect network problems, expect your availability zone to break.

Now, of course, there's a big difference between switching off a few machines or pulling a few cables and losing a whole datacenter. However, we have to expect downtime, and we have to be ready for it. Here are a few suggestions:

Use Amazon's built-in availability mechanisms

Don't treat AWS like a traditional datacenter. Amazon provides up to four availability zones per region, and a range of free and paid-for tools for using them. Techniques for taking advantage of these features range from something as simple as manually remapping an elastic IP address to an instance in a different zone, through to multi-availability-zone RDS instances that replicate database updates across zones.
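Neither technique needs much code. As a rough illustration only - the identifiers below are hypothetical, and boto3 (the current Python SDK) is just one way of driving the API - remapping an elastic IP and requesting a Multi-AZ database look something like this:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
rds = boto3.client("rds", region_name="us-east-1")

# Point an existing Elastic IP at a standby instance in another availability zone.
ec2.associate_address(
    AllocationId="eipalloc-0123456789abcdef0",  # hypothetical allocation
    InstanceId="i-0fedcba9876543210",           # standby instance in a different AZ
    AllowReassociation=True,
)

# Ask RDS to keep a synchronous standby in a second AZ and fail over automatically.
rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    DBInstanceClass="db.m5.large",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me",
    AllocatedStorage=100,
    MultiAZ=True,
)
```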

Make use of autoscaling groups, and deploy in more than two availability zones. Latency between zones is minimal, autoscaling groups can span availability zones, and they can be configured to trigger based on utilisation. People maintaining that it costs twice as much to run a highly available infrastructure in AWS simply haven't read the documentation. Take care, though, to avoid the classic trap of running three web servers at 60% utilisation: when one fails, the remaining two are pushed over capacity and fail shortly afterwards.
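To make the 'span your zones and scale on utilisation' point concrete, here's a minimal boto3 sketch, assuming a launch configuration called web-lc already exists (the names are hypothetical):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Spread the group across three AZs so losing one zone leaves two-thirds
# of the fleet running while replacements are launched elsewhere.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-lc",  # hypothetical, created separately
    MinSize=3,
    MaxSize=9,
    DesiredCapacity=3,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)

# Trigger on utilisation: keep average CPU around 50%, so spare headroom
# exists when a zone (and a third of the fleet) disappears.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```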

Size your infrastructure to include headroom for load spikes, and to be able to sustain a complete AZ failure. For any business whose downtime can be measured in tens of pounds per minute (which includes even small startups), it's cheaper to build in the availability than to suffer the outage.
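The sizing sum itself is simple enough to script. Here's a back-of-the-envelope sketch (illustrative numbers only) of how many instances you need so that the survivors of a single-zone failure still sit at or below a target utilisation:

```python
import math

def instances_needed(peak_load_units: float,
                     capacity_per_instance: float,
                     zones: int,
                     target_utilisation: float = 0.6) -> int:
    """Size a fleet spread evenly across `zones` AZs so that, after losing
    one whole zone, the remaining instances stay at or below the target."""
    surviving_fraction = (zones - 1) / zones
    usable_per_instance = capacity_per_instance * target_utilisation
    total = peak_load_units / (usable_per_instance * surviving_fraction)
    # Round up to a multiple of the zone count so the fleet stays balanced.
    return math.ceil(total / zones) * zones

# e.g. a 900 req/s peak, 100 req/s per instance, three AZs -> 24 instances,
# i.e. 8 per zone; lose a zone and the remaining 16 run at ~56% utilisation.
print(instances_needed(900, 100, 3))
```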

The problem with today's outage is that it appears to have impacted multiple availability zones. The full explanation for this has not yet been forthcoming, but it does serve to highlight that if availability really matters to you, you need to consider using multiple regions. Amazon has regions on the US East coast, the US West coast, in Western Europe, and two in Asia. Backing up to S3 from one region enables restore into another. CloudWatch triggers can be used to launch new instances in a different region, or even a full stack via CloudFormation. We have clients doing this on the East and West coasts, without spending outrageous amounts of money.
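For example - the bucket name, AMI and paths below are all hypothetical - backing up into S3 from US East and restoring onto fresh capacity in US West might look like this with boto3:

```python
import boto3

# Nightly backup from the primary region into S3.
s3 = boto3.client("s3")
s3.upload_file("/var/backups/app-db.dump", "example-backups", "nightly/app-db.dump")

# During a failover, launch replacement capacity in another region...
ec2_west = boto3.client("ec2", region_name="us-west-1")
ec2_west.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI registered in us-west-1
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)

# ...and, on the new instance, pull the same backup back down from S3.
s3.download_file("example-backups", "nightly/app-db.dump", "/var/restore/app-db.dump")
```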

The bottom line is that one of the key benefits of using AWS is the geographic spread it enables, together with its monitoring, scaling and load-balancing capabilities. Look into using these - if you're not at least exploring these areas, you're doing the equivalent of buying an iPhone and only ever using it for text messages.

Think about your use of EBS

It's not the first time there have been problems with EBS - only last month, Reddit was down for most of the day because of EBS-related issues. Here are a few things to consider when thinking about using EBS in your setup:

  • EBS is not a SAN

EBS is network-accessible block storage. It's more like a NetApp than a fibre-based storage array; treat it as such. Don't expect to be able to use EBS effectively if your network is saturated. Also be aware that EBS (and the whole of AWS) is built on commodity hardware, and as such is not going to behave in the same way as a NetApp. You're going to struggle to get the kind of performance you'd get from a commercial SAN or NAS with battery-backed cache, but EBS is considerably cheaper.

  • EBS is multi-tenant

Remember that you're sharing disk space and I/O with other people. Design with this in mind. Deploy large volumes, even if you don't need the space, to minimise contention. Consider using lots of volumes and building up your own RAID 10 or RAID 6 from EBS volumes (there's a sketch of this after the list). Think of it as a way to get as many spindles as you can, spread across as many disk providers as possible. Wherever possible, avoid relying on a single EBS volume - as Reddit found to their cost last month, that is not the right way to use EBS.

  • Don't use EBS snapshots as a backup

EBS snapshots are a very handy feature, but they are not backups. Although they are available to different availability zones in a given region, you can't move them between regions. If you want backups of your EBS-backed volumes, by all means use a snapshot as part of your backup strategy, but then actually do a backup - either to S3 (we use duplicity) or to another machine in a different region (we back up to EBS volumes in US-EAST). Don't be afraid of bandwidth charges - run the numbers through the AWS Simple Monthly Calculator; they're not as terrifying as you might fear.

  • Consider not using EBS at all

In many cases, EBS volumes are not needed. Instance storage scales to 1.7TB and, although ephemeral, doesn't seem to suffer from the kinds of problems many have been experiencing with EBS. If this fits your architecture, give it some thought.
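Here's the multi-volume idea from the second bullet above as a rough boto3 sketch (instance ID, sizes and device names are hypothetical): create several volumes, attach them, and assemble the RAID set on the instance itself.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Several smaller volumes rather than one big one, to spread I/O across
# more of EBS's underlying disks.
volume_ids = []
for _ in range(4):
    vol = ec2.create_volume(AvailabilityZone="us-east-1a", Size=200)  # size in GiB
    volume_ids.append(vol["VolumeId"])

for i, vol_id in enumerate(volume_ids):
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
    ec2.attach_volume(
        VolumeId=vol_id,
        InstanceId="i-0fedcba9876543210",      # hypothetical instance
        Device="/dev/sd" + chr(ord("f") + i),  # /dev/sdf, /dev/sdg, ...
    )

# Then, on the instance itself, build the array, e.g.:
#   mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[f-i]
```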

Consider building towards a vendor-neutral architecture

We're big fans of AWS. But today's outage raises questions about the wisdom of tying your infrastructure to one cloud provider. Heroku is an interesting example. Heroku's infrastructure piggy-backs on AWS, which meant that many of the applications it hosts were unavailable. Worse, access to the Heroku API was affected, so users were stuck.

Architecting across multiple vendors is difficult, but not impossible. Cloud abstraction tools like Fog, and configuration management frameworks such as Chef make the task easier.

Patterns within the application architecture can also help. If a decision has been made to use an AWS-specific tool or API, consider writing a lightweight wrapper around the AWS service, and try to build in and test an alternative provider's API, or your own implementation, or at least provide the capability of plugging one in. This reduces lock-in, and makes it much easier to deploy your systems to a different cloud should the requirement arise.
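In Python, for instance, such a wrapper can be as small as an interface with one AWS-backed implementation and one plain alternative. Everything below is an illustrative sketch rather than a prescribed design:

```python
from abc import ABC, abstractmethod
from pathlib import Path

class ObjectStore(ABC):
    """The application codes against this interface, never against boto3 directly."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3ObjectStore(ObjectStore):
    """AWS-backed implementation."""

    def __init__(self, bucket: str):
        import boto3
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

class LocalObjectStore(ObjectStore):
    """Filesystem-backed alternative: useful in tests, and proof the seam exists."""

    def __init__(self, root: str):
        self._root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()
```

Swapping providers then means writing one more class, rather than hunting AWS calls through the whole codebase.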

That said, I hold the view that if a client is already committed to AWS, then for a smaller investment they can probably make use of Amazon's five regions, and design their systems around the ability to move between regions in the very rare case where multiple availability zones are impacted.

Have a DR plan, and practice it

Part of planning for failure is knowing what to do when disaster strikes. When you've been paged at 3am and told that the whole site is down, and your hosting provider has no estimated time to recovery, the last thing you want to do is think. You should be on autopilot: everyone knows what to do, it's written down, it's been rehearsed, and as much of it as possible is automated.

I encourage my engineers to write the plan down, somewhere accessible (and not only on the wiki that just went down). Have fire drills - pick a day, and run through the process of bringing up the DR systems, and recovering from backup. Follow the process - and improve it if you can.

Testing restores is the critical part of the process. Know how long it takes to restore your systems. If you have vast datasets that take hours to import, at least you know this in advance, and when and if you need to put the recovery plan into action, you can set expectations. Remember, though, your backups mean nothing if you haven't verified you can restore them. Make it a habit. When you need to do it for real, you'll be grateful you drilled yourself and your team.

Infrastructure as code is hugely relevant

One of the great enablers of the infrastructure-as-code paradigm is the ability to rebuild the business from nothing more than a source code repository, some new compute resource (virtual or physical) and an application data backup. In the case of multi-region failover, you might find that your strategy is to keep a database running, but deploy a stack, provisioned with your configuration management tool, on demand. We've tested this with CloudFormation and Chef, and can bring up a simple site in five or ten minutes, and a multi-tier architecture with dozens of nodes within 30 minutes. The bottleneck is almost always the data restore - so work out ways to reduce the time taken to do this, and practice, practice, practice.
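As an illustration of the stack-on-demand approach - the template URL, stack name and parameters below are hypothetical - creating the stack in a second region is a single API call, after which the configuration management tool converges the nodes:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-west-1")

# Bring up the whole application stack in the failover region from a template
# kept in source control; Chef (or whatever you use) configures the nodes
# once they boot.
cfn.create_stack(
    StackName="app-dr",
    TemplateURL="https://example-bucket.s3.amazonaws.com/templates/app-stack.json",
    Parameters=[
        {"ParameterKey": "Environment", "ParameterValue": "dr"},
    ],
)

# Wait for the stack before kicking off the data restore - which, as noted
# above, is almost always the slow part.
cfn.get_waiter("stack_create_complete").wait(StackName="app-dr")
```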

Many people reading this will be in a position where they already have an infrastructure in place that either isn't managed with a framework such as Chef, or is only partially so. If you take nothing else from today's issues, make it a priority to get to the stage where you can rebuild your whole infrastructure from a git repo and a backup. The cloud is great for this - you can practice spinning your systems up in a different region, or a different zone, as many times as you like, until you're happy with it.

The cloud (and AWS) is still great

Sadly today has brought out the worst kinds of smugness and schadenfreude from people using other cloud providers, or traditional infrastructures. These people have very short memories: Joyent, Rackspace, Savvis - all of these providers have had large and public outages. As we've already said, outages are part of life - get used to them.

Some commentators have suggested that AWS has inherent weaknesses by offering platform services beyond the basic resource provision that a simpler provider such as Linode offers. Linode is a great provider, and we've used them for years. However, I'm not sure it's as simple as that. If you've decided to deploy your application in the cloud, and you need flexible, scalable, persistent storage, or a highly available relational database, or an API-driven SMTP service, you have a choice. You can spend your time, and your developers' time, building your own and making it enterprise-ready, or you can trust some of the best architects in the world to build one for you. Sometimes making your own is the better choice, but you don't get it for free: you'll be paying for the extra machines to support it, and the staff to administer it. Personally, I'm unconvinced that trying to build and manage these ancillary systems delivers value for the organisation.

Yes, today's outage is hugely visible. Yes, it's had a massive impact on some businesses. That doesn't make the cloud bad, or dangerous. Quora made a great point by serving a maintenance page with a cute YouTube video and the following error message: “We’d point fingers, but we wouldn’t be where we are today without EC2.”

Using the cloud as part of your IT strategy is about much more than reliability. Not that EC2's reliability is bad - EC2 offers a 99.95% SLA, which is equivalent to the best managed hosting providers. The US East region that suffered so much today had a 100% record between 2009 and 2010. It should, of course, be noted that, strictly speaking, today's issues were with EBS, which doesn't attract an SLA. Be wary of SLAs and figures - they can be misleading.

Making use of the cloud is about flexibility, control and scalability. It's about a different way of thinking about provisioning infrastructure, one that encourages better business agility and caters for unpredictable business growth. Yes, you might get better availability from traditional hardware in a managed hosting facility, but even then outages happen, and more often than not they take many hours to recover from.

The cloud is about being able to spin up complete systems in minutes. The cloud is about being able to triple the size of your infrastructure in days, when your product turns out to be much more popular than you imagined. Similarly, it's about being able to shrink to something tiny, and still survive, if you misjudge the market. The cloud is about the ability to change how your infrastructure works, quickly, without worrying about sunk cost in switches or routers that you thought you might need. The cloud is about the ease with which we can provide a development environment that mirrors production, within 30 minutes, and then throw it away again. The cloud is about being able to add capacity for a big launch, and then take it away again with a mere API call. I could go on...

One, albeit major, outage in one region of one cloud vendor doesn't mean the cloud was a big con, a waste of time, a marketing person's wet dream. The emperor isn't naked, and the nay-sayers are simply enjoying their day of 'I told you so'. The cloud is here to stay, and brings with it huge benefits to the IT industry. However, it does require a different approach to building systems. The cloud is not dead - it's still great.

Summary

Today has been a tough day for businesses affected by the EC2 outage. We can take the following high-level lessons away from today:

  • Expect, and design for, downtime
  • Have a DR plan, and practice it until it's second nature
  • Make it your priority to build your infrastructure as code, and to be able to rebuild it from scratch, from nothing more than a source code repository and a backup
  • The cloud is still great
