How Not to Fail at the Cloud

Let’s reflect on the recent outages at Amazon that crippled Pinterest, Heroku, and others.

One of the promises of PaaS is productivity. Services like Heroku increase productivity because they abstract the underlying infrastructure away from the user.

1: Choose Wisely

The main lesson from this outage is that relying on the provider to carry all of your operations isn’t always a safe bet. When we move to a PaaS provider, we still need to understand how it handles disaster recovery, high availability, and scaling. A Heroku-like PaaS also forces you into a lowest-common-denominator approach to continuous availability. In reality, there are many trade-offs between scalability, performance, and high availability, and the right balance tends to be application specific, so a lowest-common-denominator compromise may not match what your application actually needs.

PaaS is meant to provide higher productivity for running our apps in the cloud by abstracting the details of how we run our application (the operations) away from the application developer. The black-box approach of many public PaaS offerings, such as Heroku, often takes this abstraction to an extreme.

There is often a close coupling between what the application does and the way we run it. A new class of platform, such as juju, offers an open source alternative that gives you more control over the underlying PaaS platform. Juju uses an open model for its Charms that can integrate with Puppet, Chef/Chef-Solo, bcfg2, or almost any other configuration management tool, so you can customize and control your operations and create your own PaaS without affecting developer productivity.

2: Database Availability

One thing most public PaaS offerings don’t adequately address is database high availability, which is admittedly a tough area, particularly in the event of a data center or availability zone failure, as in the present case. To deal with database availability, it is necessary to ensure either real-time synchronization of the database across sites or a hot spare with “continuous backup”.
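To make the idea of real-time synchronization concrete, here is a toy, in-memory sketch. It is not how any particular database or PaaS does it; the `Site` and `ReplicatedStore` classes and the zone names are made up for illustration, and a real system would also have to handle a replica being unreachable during a write.

```python
class Site:
    """A toy stand-in for a database at one site (hypothetical, in-memory)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


class ReplicatedStore:
    """Keeps a primary and a standby site in step on every write."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def write(self, key, value):
        # Synchronous replication: the write counts as done only after
        # both sites hold the new value.
        self.primary.rows[key] = value
        self.standby.rows[key] = value

    def read(self, key):
        return self.primary.rows[key]

    def fail_over(self):
        # Promote the standby; the old primary becomes the new standby.
        self.primary, self.standby = self.standby, self.primary


store = ReplicatedStore(Site("zone-a"), Site("zone-b"))
store.write("user:1", "alice")
store.fail_over()  # simulate losing zone-a
print(store.primary.name, store.read("user:1"))  # prints "zone-b alice"
```

Because every write reaches the standby before it is considered complete, the promoted site can serve the same data right after the switch.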

3: Coping with Failure, Avoiding a Single Point of Failure

The general lesson from this and previous failures is not new, and, to be fair, it is not specific to AWS or to any cloud service. Failures are inevitable, and they often happen when and where we least expect them. Instead of trying to prevent failure from happening, we should design our systems to cope with it. The methods for dealing with failure are not new either: use redundancy, don’t rely on a single point of failure (including a single data center or even a single data center provider), and automate the fail-over process.
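As a rough illustration of automated fail-over, the sketch below walks a priority-ordered list of redundant sites and picks the first one that passes a health check. The site names and the `simulated_check` function are hypothetical; in practice the check would be a real probe (an HTTP request, a TCP connect, and so on) and the result would drive DNS or a load balancer.

```python
from typing import Callable, Sequence


def first_healthy(sites: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first site, in priority order, that passes its health check."""
    for site in sites:
        try:
            if is_healthy(site):
                return site
        except Exception:
            # A health check that raises is treated the same as a failed site.
            continue
    raise RuntimeError("no healthy site available")


# Hypothetical sites: two zones at one provider, plus a second provider.
SITES = ["us-east-primary", "us-west-standby", "other-provider-standby"]


def simulated_check(site: str) -> bool:
    # Pretend the primary zone is down.
    return site != "us-east-primary"


active = first_healthy(SITES, simulated_check)
print(active)  # prints "us-west-standby"
```

Listing a site at a second provider last means the same loop also covers the case where an entire provider is unavailable.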

Haven’t Learned from Past Lessons?

The question that comes out of this is not necessarily how to deal with failures, but rather: why are we failing to implement the lessons?

Assuming that the people running these systems are among the best in the industry makes this question even more interesting. We’re giving up responsibility when we move to the cloud: we often assume that we’re outsourcing our data center operations completely, including our disaster recovery procedures. The truth is that when we move to the cloud we’re only outsourcing the infrastructure, not our operations, and the responsibility for how we use this infrastructure remains ours.

Implementing Past Lessons in the Cloud

We need to assume full responsibility for our applications’ disaster recovery procedures in the cloud, just as if we were running our own data center.

The hard part in the cloud is that we often have less visibility into, control over, and knowledge of the infrastructure, which affects our ability to protect our applications, and each of their sub-components, from failure. On the other hand, the cloud enables us to spawn new instances easily in various data center locations, a.k.a. Availability Zones.

And so, most failures can be addressed by moving from the failed system to a completely different system, regardless of the root cause of the failure. The first lesson, therefore, is that in the cloud it is easier to implement disaster recovery by moving application traffic to a completely different, redundant system in a snap than by trying to protect every component of the application from failure.

If we’re willing to tolerate a short window of downtime, we can even use an on-demand backup site rather than pay the constant cost and overhead of maintaining a hot backup site.

What do we need to build such a solution?


We need a consistent way to launch, configure, manage, and scale a primary environment as well as redundant environments that are ready to take over in case of failure. Specifically, we need the ability to deploy the application and then orchestrate its services and configuration in a consistent way across sites, on demand.
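One way to picture "consistent across sites" is a single description of the application from which the steps for every site are generated. The sketch below is only that, a picture: `Service`, `APP`, and `plan` are names invented for this example and are not juju commands or APIs, and the "relate each service to the next one" rule is a simplification.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Service:
    name: str
    units: int


# One definition of the application, shared by every site (hypothetical values).
APP = [Service("web", 2), Service("db", 1)]


def plan(site: str, services: Sequence[Service]) -> List[str]:
    """Return the ordered steps to bring the application up at one site."""
    steps = [f"{site}: deploy {s.name} with {s.units} unit(s)" for s in services]
    # Connect each service to the next one in the list (web to db here).
    for a, b in zip(services, services[1:]):
        steps.append(f"{site}: relate {a.name} to {b.name}")
    return steps


primary = plan("primary", APP)
standby = plan("standby", APP)

for step in primary + standby:
    print(step)
```

Since both plans come from the same `APP` definition, the standby cannot drift from the primary, and it can be generated on demand when a backup site is needed.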

How to get started?

In an upcoming post I’ll take a real-world webapp from a traditional one- or two-VPS “mom and pop” setup, which will likely resemble many readers’ current setups at home or work, and show in detail how I transform it with juju to apply the ideas above. Hopefully this will give you a better look at some of the things you may encounter when migrating an existing setup, as opposed to setting up a new project from scratch.

