When the cloud goes down: Software design for failure


Cloud computing & HA

A recent outage of Amazon’s Virginia-based cloud left several companies completely offline for hours, days or worse. The world was shocked, as it had always assumed the cloud would simply be there. One can only imagine the damage done to these businesses in terms of missed revenue and bad publicity. Yet not every company serving its customers from that same Virginia-based cloud was hit. How did they manage to handle these problems?

The great thing about cloud computing is that you don’t have to worry about the availability of the underlying platform that hosts your applications. Or at least you think you don’t. Most people assume availability is something we used to worry about in a traditional, non-cloud set-up, and that nowadays the cloud provider takes care of it. This is a fundamental mistake, as we have seen with the recent outage at Amazon, the world’s largest cloud provider. Another example is the recent bankruptcy of the Dutch cloud provider iTricity. Their strategy was to deliver IT as electricity, but we all know that even electricity coming from a wall socket is never guaranteed. That’s why we have emergency power supplies in data centers and hospitals. It’s funny that we don’t always take the same kind of measures for the applications we run in the cloud. We probably just assume the cloud already takes care of this. Although there are numerous technical solutions for implementing high availability (HA), this post focuses on the fundamental idea of how to deal with a cloud computing world that has proven unpredictable when it comes to HA.

Cloud Software Design Principles

The companies that had no major problems during Amazon’s recent outage did something different from the others. It’s not rocket science: they simply assumed the cloud would be unavailable at some point and had already implemented a way of dealing with that. Some of them made sure all data was continuously replicated to a different (cloud-based) location and were able to deploy resources quickly to get their applications up and running again. This is great, but it needs manual intervention and still means downtime, because it takes time to start everything in a different location. Others took cloud HA to the next level and made the application itself aware of the different cloud locations. Perfect HA software never trusts the platform it is running on and has built-in mechanisms to move to a different location automatically when needed, without downtime.
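To make the idea of location-aware software a bit more concrete, here is a minimal Python sketch of a client that tries a list of locations in order and moves on when one of them fails. The location names and the `fetch` callable are hypothetical placeholders and not part of any real provider API. A production setup would also need health checks, timeouts and replicated data.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllLocationsFailed(Exception):
    """Raised when no location could serve the request."""


def call_with_failover(locations: Sequence[str], fetch: Callable[[str], T]) -> T:
    """Try each location in order and return the first successful result.

    `fetch` is a hypothetical callable that performs the request against one
    location and raises an exception if that location is unavailable.
    """
    errors = {}
    for location in locations:
        try:
            return fetch(location)
        except Exception as exc:  # assumption: any error means "location is down"
            errors[location] = exc
    raise AllLocationsFailed(f"all locations failed: {errors}")


def demo_fetch(location: str) -> str:
    # Simulated outage: the hypothetical "us-east" location is down.
    if location == "us-east":
        raise ConnectionError("us-east unreachable")
    return f"served from {location}"
```

Calling `call_with_failover(["us-east", "eu-west"], demo_fetch)` skips the simulated outage and is served from the second location, without anyone having to intervene manually.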

The following are key design principles for software engineering in the cloud:

  • don’t trust the underlying platform
  • assume things will go wrong
  • design for failure, to automatically handle platform failures
  • make the application aware of the underlying platform, so it can manage its own platform resources if things go wrong
  • make sure you know where your application physically runs, even if it is cloud based
  • implement auto provisioning to dynamically scale capacity. Automatically failing over services to another location may require more resources than usual. You don’t want a lack of capacity to take down that environment too
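The capacity point in the last principle can be illustrated with a little arithmetic. The sketch below assumes the load is spread evenly and that the locations that survive must carry all of it. The function name and its parameters are made up for this example.

```python
import math


def instances_per_location(total_instances: int, locations: int,
                           tolerated_failures: int = 1) -> int:
    """Instances each location must be able to run so that the surviving
    locations can still carry the full load.

    Assumes the load is spread evenly over the locations that remain.
    """
    survivors = locations - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one location must survive")
    return math.ceil(total_instances / survivors)
```

For example, an application that normally needs 12 instances spread over 3 locations runs 4 per location, but `instances_per_location(12, 3)` shows each location must be able to scale to 6 if one of them fails. With only 2 locations, each one has to be able to carry the full load on its own.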

The ideas described here assume you have at least two completely independent, geographically distributed locations. This means that a problem in one location cannot disturb the other location in any way. There can be no dependencies between them. You also want to check the supporting cloud services you rely on, such as DNS or email. Do they depend on the availability of a single supplier? Are they in the same locations you are in? You need to make sure these services also run in redundant locations without any dependencies between them.
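A simple way to start answering these questions is to keep an inventory of where each service runs and check it for single points of failure. The following Python sketch assumes a hand-maintained inventory of (supplier, location) pairs per service. The service, supplier and location names are invented for illustration.

```python
from typing import Dict, List, Tuple

Deployment = Tuple[str, str]  # (supplier, location), both hypothetical labels


def find_single_points_of_failure(
    inventory: Dict[str, List[Deployment]]
) -> Dict[str, List[str]]:
    """Report services that depend on a single supplier or a single location."""
    findings: Dict[str, List[str]] = {}
    for service, deployments in inventory.items():
        suppliers = {supplier for supplier, _ in deployments}
        locations = {location for _, location in deployments}
        problems = []
        if len(suppliers) < 2:
            problems.append("single supplier")
        if len(locations) < 2:
            problems.append("single location")
        if problems:
            findings[service] = problems
    return findings


# Example inventory with made-up names.
example_inventory = {
    "dns": [("supplier-a", "virginia")],
    "email": [("supplier-a", "virginia"), ("supplier-a", "dublin")],
    "app": [("supplier-a", "virginia"), ("supplier-b", "dublin")],
}
```

Running `find_single_points_of_failure(example_inventory)` flags `dns` for both problems and `email` for depending on one supplier, while `app` passes. Note that this only checks what the inventory says. Hidden dependencies between locations still have to be verified with the suppliers themselves.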

New trend in Cloud software design

We have traditionally focused on building highly available infrastructure in data centers. With cloud computing we no longer have control at the infrastructure level, but we should never assume the infrastructure will always be available. A new trend is emerging: the focus is no longer on the infrastructure being HA, it’s the software that is architected and built to be highly available. The software can now handle infrastructure failures and even expects them to happen. This fundamental shift in thinking will require closer cooperation between development and operations (DevOps) in software companies.

Other Posts:

http://blog.rightscale.com/2011/05/02/aws-outage-follow-up-if-you-wanted-details-you-got-details/

About Sander Koers
As an IT professional I have architected and implemented several cloud infrastructures for large enterprise customers. My goal and passion is to achieve 100% customer satisfaction in mission-critical IT solutions. Translating (customer) business requirements into solutions that both the customer and the team are proud of is always the ultimate goal.
