SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

How Much Uptime Does Your Application Need?

Posted by Bob Warfield on May 25, 2010

I was having an interesting discussion with a friend recently about Amazon’s spot instance pricing for EC2 that prompted me to write this post.  Let me walk you through how it went, because I think it replays some of the thinking that all startups should go through about their online service.  BTW, hat tip to Geva Perry for putting me on to the spot instances, which I hadn’t noticed.

To begin with, what is Amazon’s spot instance pricing? A spot instance is one where you bid, sort of auction style, what you’re willing to pay for an EC2 instance. Amazon publishes the spot price, which moves up or down based on the availability of unused capacity. For the most part, it is quite a bit less than the normal retail pricing for EC2, so one can get EC2 instances for considerably less. I think this is a very cool model. But there is a catch. If the spot price goes above your bid, your EC2 instance will be immediately shut down. Note that you don’t pay your bid price either. Instead, you pay the spot price, which by definition is at or below your bid right up until they shut you down for having bid too little.
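To make the mechanics concrete, here is a minimal sketch of the bidding rule as described above. It is a toy model, not Amazon's billing logic: the hourly granularity, the `simulate_spot_instance` function, and the prices in the usage example are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SpotResult:
    hours_run: int      # full hours completed before shutdown (or end of data)
    total_cost: float   # sum of the spot prices actually paid
    terminated: bool    # True if the spot price rose above the bid


def simulate_spot_instance(bid: float, hourly_spot_prices: Sequence[float]) -> SpotResult:
    """Toy model: each hour you pay the spot price, never the bid.

    Assumption: prices are sampled once per hour, and the instance is shut
    down as soon as the spot price goes above the bid.
    """
    total_cost = 0.0
    hours_run = 0
    for spot_price in hourly_spot_prices:
        if spot_price > bid:
            return SpotResult(hours_run, total_cost, terminated=True)
        total_cost += spot_price
        hours_run += 1
    return SpotResult(hours_run, total_cost, terminated=False)


if __name__ == "__main__":
    # Hypothetical prices in dollars per hour.
    result = simulate_spot_instance(bid=0.05, hourly_spot_prices=[0.03, 0.04, 0.06, 0.03])
    print(result)
```

In this made-up run the instance is billed for two hours at the spot price and is shut down in the third hour, when the price climbs past the bid.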

Interesting model, no?  I was remarking to my friend about how excited I was about it.  You see, cost to deliver the service really matters for Cloud and SaaS businesses, and here was Amazon blazing an interesting new trail with the potential for interesting savings.

My friend’s reaction was pretty negative.  He couldn’t imagine dedicating a production application to an opportunistic model like this where his app could go down at any moment due not to some catastrophic failure, but to the vagaries of this odd marketplace.  It was a brief exchange, but I almost wondered if he thought it unsavory to do that to his users.

The thing is, most developers and businesses automatically assume that whatever software they are offering has to be up 24×7, five “9’s”, and with 100% efficacy.   But is that really true? 

No doubt it is for some apps, but probably not for a lot of apps. How do you know what is needed for your application? Your customers will tell you, but not because you asked. If you do ask, of course they want 100% uptime. They may even insist on it contractually. However, your service has almost certainly gone down at some point. They all do, sooner or later. It has probably experienced outages so short that nobody noticed, or perhaps very few did. BTW, read the fine print carefully on those five “9’s” Cloud SLAs. They’re all riddled with exclusions. For example, they may exclude outages caused by their Internet service provider. They may exclude outages lasting less than a certain length of time, or outages that had a certain amount of lead time warning. The providers behind those SLAs have spent time understanding what their customers really demand and what they can really deliver.

Getting back to the story at hand, you’ve probably had outages. Let’s say you had a 10 minute outage. How many of your customers noticed and called or wrote? At Helpstream, when this happened, and it was seldom, a 10 minute outage would generally prompt contact from 1 to 3 customers. This despite having nearly 2 million seats using the software. The contact would boil down to, “What’s going on, we can’t access the service?” Usually, the service was back up before we could respond with an answer, and the customer had already lost interest in the outage.

Does this mean you can bombard customers with 10 minute outages with impunity?  No, absolutely not.  But in this case, a rare 10 minute outage was not causing a lot of trouble.  Now look at spot pricing and Amazon.  It typically took us 3 or 4 minutes to spin up a new server on EC2.  So, if we had to convert all servers from spot price to normal instance pricing, that could be done in well under 10 minutes of outage.  Consider that the worst case and consider that it might happen once every six months or so.  What are some better cases?
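For a sense of scale, here is a back-of-the-envelope sketch comparing that worst case against a five “9’s” target. The function names are mine, and a 365-day year is assumed for simplicity.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(outage_minutes: float, outages_per_year: float) -> float:
    """Availability implied by outages of a fixed length at a yearly rate."""
    downtime = outage_minutes * outages_per_year
    return 100 * (1 - downtime / MINUTES_PER_YEAR)


if __name__ == "__main__":
    # Five nines allows a little over 5 minutes of downtime per year.
    print(f"five nines budget: {downtime_budget_minutes(99.999):.2f} min/year")
    # One 10 minute outage every six months is two outages per year.
    print(f"10 min twice a year: {availability_percent(10, 2):.4f}% available")
```

Five “9’s” leaves a budget of about 5.3 minutes a year, so a single 10 minute outage already blows it. Two such outages a year works out to roughly 99.996% availability, which is still better than four “9’s”.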

There are many. You could envision cases where only part of the service was down, or perhaps where the service simply got a little slower because some, but not all, instances were using spot pricing on a highly distributed “scale out” architecture. While you may have an app that couldn’t tolerate a 10 minute service outage, how many couldn’t tolerate being slower for 10 minutes out of every 6 months?

Or, consider an opportunistic scenario for giving customers a better experience on a mission critical application. Take my old alma mater, Callidus Software. Callidus offers a transactional piece of enterprise software that computes sales compensation for some of the world’s largest companies. Take my word for it, payroll is mission critical. The transaction volumes on the system are huge, and a compensation payroll run can easily take hours. We used a grid computing array of worker processors, up to 120 CPUs, to compute a typical run. These were restartable if one or more died during the run. Can you see where this is going? If the whole grid were done with spot priced instances, and the whole grid went down for 10 minutes, it would simply delay the overall run for 10 minutes or so. Not great, but not the end of the world. Now suppose Callidus offered 50% faster payroll processing because it could afford to throw more CPUs at the problem. Buried in the fine print is a disclaimer that sometimes, even as often as once a quarter, your payroll run could be delayed 10 minutes. Yet overall, your payroll would be processed 50% faster. Does anyone care about the rare 10 minute delay? My sense is they would not.
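Here is a rough sketch of that tradeoff. The numbers are hypothetical: I am assuming a 4 hour baseline run, and reading “50% faster” as 1.5 times the throughput, so the run takes two thirds as long. Neither figure comes from Callidus.

```python
def run_minutes(baseline_minutes: float, speedup: float, delay_minutes: float = 0.0) -> float:
    """Wall-clock time for a run sped up by `speedup`, plus any outage delay."""
    return baseline_minutes / speedup + delay_minutes


def break_even_baseline(speedup: float, delay_minutes: float) -> float:
    """Baseline run length above which the delayed fast run still finishes first.

    Solves baseline / speedup + delay = baseline for baseline.
    """
    return delay_minutes * speedup / (speedup - 1)


if __name__ == "__main__":
    baseline = 240.0  # hypothetical 4 hour payroll run, in minutes
    speedup = 1.5     # assumed meaning of "50% faster"
    print("normal fast run:", run_minutes(baseline, speedup), "minutes")
    print("fast run with a 10 minute delay:", run_minutes(baseline, speedup, 10), "minutes")
    print("break-even baseline:", break_even_baseline(speedup, 10), "minutes")
```

Under those assumptions, even the unlucky run that eats the 10 minute delay finishes well ahead of the slower baseline, and that holds for any run longer than about half an hour.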

My hobby involves machine tools, and machinists are constantly upset with drawings that specify too much precision in the part to be machined. It’s easy for a young engineer to call out every feature on the drawing with many digits of precision. But each digit costs a lot more money to guarantee because it means a lot more work for the machinist. Uptime is the same sort of thing. Specifying or offering SLAs that are excessively harsh is very common in the IT world, but there is a cost associated with it. The moral of the story is, don’t just assume every service and every aspect of every service has to have five “9’s” of availability. Consider what you can do by trading a little of that away for the customers or for your bottom line. Chances are, you can offer something that customers will like at a lower delivery cost. Assuming your competitors haven’t thought of it, that might just be the edge you’re looking for.
