People Using Amazon Cloud: Get Some Cheap Insurance At Least

Posted by Bob Warfield on April 23, 2011

I’m reading through Twitter streams, Amazon Forums, and other news sources trying to get a sense of how users are responding and what their problems are. It’s pretty appalling out there. B2B companies admitting they have no recent backups and just have to wait for it to come back online. A company that claims patients’ lives are at stake because it does cardiac monitoring based in the Amazon Cloud, and that is desperately seeking assistance. The list goes on.

There’s some basic insurance any company using the Amazon Cloud needs to take out first chance they get. It’s not hard, it’s not expensive, it’s not push-a-button hot failover to multiple Clouds, and it won’t fix your problems if you’re caught in the current outage. But it will at least give you a little more maneuvering room. Many of the accounts I’m reading boil down to a lack of options other than waiting because they have no accessible backup data. In other words, they’d love to bring up their sites again in another Amazon Region, but they can’t because they’re missing access to a reasonably current data backup, or the Amazon Machine Images are all in the affected region, or issues along those lines.

Companies need the Cloud equivalent of offsite backup. At a minimum, you need to be sure you can get access to a backup of your infrastructure–all the AMIs and data needed to restart. Storage is cheap. Heck, if you’re totally paranoid, turn the tables and back up the Cloud to your own datacenter, which consists of just the backup infrastructure. At least that way you’ve always got the data. Yes, there will be latency issues and that data will not be up to the minute. But look at all that’s happened. Suppose you could’ve spun up in another region having lost 2 hours of data. Not good, not good at all. But is it really worse than waiting over 24 hours, or would you be feeling blessed about now if you could’ve done it 2 hours into the emergency? These are the kind of trade-offs to be thinking about for disaster recovery. It’s chewing gum and baling wire until you get an architecture that’s more resilient, but it sure beats not having any choices and waiting.
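As a rough sketch of what staging AMIs and data in a second region can look like, the Python below assumes an EC2 client from the boto3 library, created for the destination region, and relies on its `copy_image` and `copy_snapshot` calls. It is a present-day illustration, not anything described in this post: the function name, the `dr-copy` naming scheme, and every ID are hypothetical.

```python
from datetime import datetime, timezone


def copy_backups_to_region(dest_ec2, source_region, image_ids, snapshot_ids):
    """Copy AMIs and EBS snapshots into the region that dest_ec2 talks to.

    Assumption: dest_ec2 is a boto3 EC2 client for the DESTINATION region,
    e.g. boto3.client("ec2", region_name="us-west-2").
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    copied = {"images": {}, "snapshots": {}}

    for image_id in image_ids:
        # copy_image is called on the destination client and pulls from the source region.
        resp = dest_ec2.copy_image(
            Name=f"dr-copy-{image_id}-{stamp}",  # hypothetical naming scheme
            SourceImageId=image_id,
            SourceRegion=source_region,
        )
        copied["images"][image_id] = resp["ImageId"]

    for snapshot_id in snapshot_ids:
        resp = dest_ec2.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot_id,
            Description=f"DR copy of {snapshot_id} taken {stamp}",
        )
        copied["snapshots"][snapshot_id] = resp["SnapshotId"]

    # Map of source ID -> ID of the copy in the destination region.
    return copied
```

Run on a schedule, something like this keeps a restartable copy outside the affected region. How stale that copy is allowed to get is exactly the 2-hours-versus-24-hours trade-off above.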

Another thing: make sure you test your backups. Do they restore? Can you go through the exercise of spinning up in another region to see that it works? Don’t just test once and forget about it. Pick an interval and retest. Make it routine so you know it works.
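A restore drill can be partly automated. The sketch below is one hypothetical way to do it in Python: it assumes you record a SHA-256 checksum and a timestamp when each backup is taken, then checks a restored copy against both. The function names and the idea of a maximum backup age are illustrative assumptions, not a prescribed procedure.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def sha256_of(data: bytes) -> str:
    """Checksum recorded at backup time and recomputed after a test restore."""
    return hashlib.sha256(data).hexdigest()


def check_backup(taken_at, expected_sha256, restored_bytes, max_age, now=None):
    """Return a list of problems; an empty list means the drill passed."""
    now = now or datetime.now(timezone.utc)
    problems = []

    # Is the backup recent enough to be worth failing over to?
    age = now - taken_at
    if age > max_age:
        problems.append(f"backup is {age} old, limit is {max_age}")

    # Does what came back from the restore match what went in?
    if sha256_of(restored_bytes) != expected_sha256:
        problems.append("restored data does not match the checksum recorded at backup time")

    return problems
```

Wire a check like this into whatever runs on your chosen interval, and alert on a non-empty result, so the retest happens whether or not anyone remembers it.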

Staging all the data to other locations is not that expensive compared to continuously running dual failover infrastructure. That’s one of the beauties of elasticity.

There’s a lot of grumbling about how hard it is to failover to other regions and how expensive. Nothing is harder than explaining to your customers why your site is down. But at least get some cheap insurance in place so you have options the next time this happens. And there will be a next time, no matter whether it is Amazon, some other Cloud provider, or your own datacenter. There is always a next time.

While you’re at it, consider some other cheap insurance:

– Do you have a way to communicate with your customers when your site is down? An ops blog that you’re sure is hosted in a different cloud is cheap and cheerful.

– Can you at least get your web site home page showing? Think about how to get DNS access and a place to host that don’t rely 100% on one Cloud provider.

– Is there something about your app that would make partial access in an outage valuable? For example, on a customer service app, being able to log trouble tickets as email during an outage or scheduled downtime would be extremely helpful. Mail is cheap and easy to offer as alternate infrastructure, and it is also easy to imagine piping the email messages through a converter that would file them as tickets when the site came back up. It’s not hard to imagine being able to queue many kinds of transaction this way in an emergency. What are the key limited-functionality areas your users will want to have access to in an emergency?

– For some apps, it is easier to provide high availability for reading than for writing. Can you arrange that in an emergency, reading is still possible, just not writing or creating new objects? Customers are a lot more tractable if they know they still have access to their data, but just can’t create new data for a while. For example, a bookmarking site that lets me access my bookmarks but not create new ones during an outage is much less threatening than one that just brings up its Fail Whale equivalent on me.
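The last two ideas, read-only access during an outage and email that gets filed as tickets afterwards, can be sketched together. This is a toy Python illustration using only the standard library: the `TicketStore` class, the `read_only` flag, and `file_queued_emails` are hypothetical names, and the converter assumes plain-text, single-part messages.

```python
from email import message_from_string


class ReadOnlyError(Exception):
    """Raised when a write is attempted while the store is in read-only mode."""


class TicketStore:
    def __init__(self):
        self.tickets = []
        self.read_only = False  # flip to True during an outage

    def list_tickets(self):
        # Reads keep working in read-only mode.
        return list(self.tickets)

    def create(self, sender, subject, body):
        if self.read_only:
            raise ReadOnlyError("site is in read-only mode; please email support instead")
        ticket = {"id": len(self.tickets) + 1, "from": sender, "subject": subject, "body": body}
        self.tickets.append(ticket)
        return ticket


def file_queued_emails(store, raw_emails):
    """After recovery, convert emails received during the outage into tickets."""
    filed = []
    for raw in raw_emails:
        msg = message_from_string(raw)
        # Assumption: single-part plain text, so get_payload() returns a str.
        body = msg.get_payload().strip()
        filed.append(store.create(msg["From"], msg["Subject"] or "(no subject)", body))
    return filed
```

The point is the shape rather than the details: customers can still see their existing data, and whatever they send during the outage is queued and lands in the system once writes are back.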

Welcome to the world of Disaster Recovery. Disasters have a User Experience too. Have you planned your customers’ Disaster UX yet?

This entry was posted on April 23, 2011 at 5:55 pm and is filed under amazon, cloud.

12 Responses to “People Using Amazon Cloud: Get Some Cheap Insurance At Least”

mihaahronovitz said

April 23, 2011 at 11:20 pm
Learning from the AWS incident, we need to develop processes that make it easy to run clouds on different providers and mitigate the risk that one of them fails. The CloudSigma and Strategic Blue press release is a significant announcement, particularly in the context of the AWS failure http://bit.ly/g09yRl%20

“Now customers of CloudSigma can choose to interface directly or through Strategic Blue for their cloud billing arrangements, depending on their preference. The partnership will complement the pre-pay billing model available directly from CloudSigma with invoice payment terms available through Strategic Blue. Strategic Blue “solution transforms multi-cloud infrastructure deployment from a potentially complicated management issue to a simple billing relationship with one party, Strategic Blue.”

The next trend to watch is how cloud providers will develop “easier ways” to migrate to other clouds. We need a magic list of cloud providers that customers may choose from for multiple hosting as a defense against cloud failure. I am paraphrasing the Gartner Magic Quadrant for cloud providers

http://bit.ly/dNayOS

You write: For some apps, it is easier to provide high availability for reading than for writing. There are 5 lessons to be learned:

Lesson 1: Both Cloud and Dedicated Computing Have Single Points of Failure
Lesson 2: Size is No Protection from Outages without Redundancy
Lesson 3: All Data Centers Are Not Equal
Lesson 4: The Price-Performance-Reliability Metric
Lesson 5: Achieving a highly robust set-up is cheaper and easier in the Cloud

See the blog post on this subject at
http://www.cloudsigma.com/en/blog/2011/04/23/21-cloud-outages-lessons-learned

- Bob Warfield said
  
  April 24, 2011 at 1:08 am
  miha, in this case, even just architecting for failure of a region would’ve made all the difference between businesses like Netflix and Twilio that stayed up, and those that didn’t. This particular post of mine is advocating a level of basic prudence to at least have control of all your data no matter what. Folks that haven’t taken that step are a long way from being ready to start worrying about multi-cloud architectures.
  
  Cheers,
  
  BW
  
  - mihaahronovitz said
    
    April 24, 2011 at 4:00 am
    I wonder whether Netflix gets the credit for not being affected by the AWS failure because they were smart (which they are), or because they were lucky, because they used the Western California AWS locations. If you saw the movie Match Point, by Woody Allen: if one survives and is not caught, one automatically becomes a genius
  - Bob Warfield said
    
    April 24, 2011 at 6:24 am
    No miha, they architected to run in 3 regions and stay up if they lost any 1.
  - mihaahronovitz said
    
    April 24, 2011 at 4:29 pm
    In that case – I know the people at Netflix, like Adrian Cockcroft – Netflix made an extraordinary point for all other companies porting to Amazon. But the quality of the Netflix team is hard for other mainstream companies porting to AWS to match. Yet people like Zencoder and Quora have brilliant engineering leadership, just like Netflix, and went down. Actually the Zencoder blog is a must-read: http://bit.ly/fiehlU . How come they went down and Amazon didn’t? Perhaps because it is very expensive to have three locations of the size of Netflix, instead of one.
    
    Perhaps you can have a look at affordable IaaS providers who are both easy to use and lower cost. See this video of cloudsigma.com http://bit.ly/g3UuTN . CloudSigma will launch in the US at four locations in June 2011
schlafly said

April 24, 2011 at 12:10 am
You’re right, it is foolish to be so dependent on the Amazon Cloud. But I am guessing that to a lot of businesses, the main appeal of the Cloud is that they do not have to worry about backups and database outages and all those other problems. I think that this is a disaster for Amazon.

Bob Warfield said

April 24, 2011 at 1:05 am
Roger, any business that outsources and assumes they gave up responsibility along with the work is going to get what they get. Another Cloud provider, which will eventually have an outage too, and more disappointments. Businesses are responsible to their customers for delivery, regardless of who they sublet it to. The smart businesses understand that reality and will deal with it.

Let’s hope most of the businesses caught in this mishap just weren’t quite pessimistic enough and will take care of their architectural shortcomings.

Unfortunately, I think the Tech World has also gotten pretty complacent about doing things on the cheap Consumer Internet style. There are drawbacks to that approach as we’re seeing.

Cheers,

BW

Cloud Failure, FUD, and The Whole AWS Oatage… « Composite Code said

April 25, 2011 at 4:57 am
[…] Bob Warfield – People Using Amazon Cloud: Get Some Cheap Insurance At Least […]

Cloud Failure, FUD, and The Whole AWS Oatage… said

April 25, 2011 at 2:29 pm
[…] Bob Warfield – People Using Amazon Cloud: Get Some Cheap Insurance At Least […]

A Collection of Web Posts About the Amazon Cloud Outage | 酷壳 - CoolShell.cn said

April 27, 2011 at 2:50 pm
[…] People Using Amazon Cloud: Get Some Cheap Insurance At Least by Bob Warfield […]

CIO analysis: Examining Amazon's cloud failure | ZDNet said

May 2, 2011 at 3:29 pm
[…] don’t actually test their systems thoroughly before an outage occurs. Veteran technologist, Bob Warfield, offers suggestions for implementing disaster recovery plans in a cloud environment: Companies need […]

A Collection of Web Posts About the Amazon Cloud Outage | multiprocess said

November 10, 2013 at 8:44 am
[…] People Using Amazon Cloud: Get Some Cheap Insurance At Least by Bob Warfield […]



SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.
