SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

Amazon Ran Out of Capacity

Posted by Bob Warfield on February 18, 2008

As I suggested in my original post on the topic, Amazon’s recent S3 outage was due to running out of capacity.  Specifically, they ran out of authentication capacity.  In part, this problem was due to the fact that Amazon wasn’t monitoring exactly this part of their capacity envelope very well.  High Contrast has the Amazon quote telling us that it was also due to just a few customers radically increasing their load on the system in an unpredictable way:

the surge was caused by at least one very large customer plus several other customers suddenly and unexpectedly increasing their usage. 

So far, most of the pundits are in something of a denial mode.  They argue that nothing really new and interesting is happening here.  All services go down, including the electric company.  Vinnie Merchandani says corporate data centers have been going down far more often than 99.999% uptime allows since forever.  Folks like Nick Carr seem to feel the biggest issue in this outage was that users didn’t have timely information, and that Amazon is fixing that.

This all misses a bigger point.  What these writers are doing is attempting to apply the old standards and methods to the new world of Cloud Computing.  The trouble is, there is something genuinely new at work here that goes beyond the inevitability of some outages and the need to be more transparent with customers about what is going on.  The problem Amazon and other would-be cloud platform purveyors face is predictability.  The world they deal in is radically less predictable than the corporate data centers of old, because the Internet today has much lower friction and higher connectivity between web sites, which makes load spikes increasingly sudden and intense.  The low-friction web enables a cascade-of-dominoes effect; the Internet simply wasn’t this twitchy in the past.

The premise of any large computing infrastructure is that by sharing the load across many customers (and in Amazon’s case, sharing excess capacity from their core retail business), we enable headroom for such load spikes.  But how realistic is that concept?
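To make that concrete, here is a minimal Python sketch with entirely made-up tenant counts, spike sizes, and probabilities (none of this is based on Amazon’s numbers), comparing how much headroom the shared pool needs when tenants spike independently versus when one shared event makes them all spike at once:

```python
# A minimal, throwaway sketch (no Amazon data; tenant counts, spike sizes, and
# probabilities are invented) comparing the headroom a shared infrastructure
# needs when tenants spike independently versus when one shared event makes
# them spike together.
import random

def aggregate_load(num_tenants=1000, baseline=1.0, spike_size=5.0,
                   spike_prob=0.01, correlated=False):
    """Total load for one time step, in arbitrary 'baseline units'."""
    if correlated:
        # A single shared event (a big news story) decides whether *every*
        # tenant spikes at once.
        event = random.random() < spike_prob
        return num_tenants * (baseline + (spike_size if event else 0.0))
    # Otherwise each tenant spikes on its own schedule, and spikes mostly
    # average out across the pool.
    return sum(baseline + (spike_size if random.random() < spike_prob else 0.0)
               for _ in range(num_tenants))

if __name__ == "__main__":
    random.seed(42)
    steps = 10_000
    for correlated in (False, True):
        peak = max(aggregate_load(correlated=correlated) for _ in range(steps))
        label = "correlated" if correlated else "independent"
        print(f"{label:11s} peak load: {peak:7.0f} "
              f"(baseline for 1000 tenants is 1000)")
```

With independent spikes, the aggregate peak sits only a modest percentage above the baseline and sharing works beautifully; with correlated spikes, the peak is several times the baseline, and the shared headroom evaporates exactly when everyone needs it.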

Consider this Alexa plot of CNN and Flickr traffic over time:

[Alexa chart: CNN vs. Flickr traffic over time]

Do these two curves look predictable to you?  Take CNN, for example.  Handling the big spikes requires 2-3x the normal capacity.  Flickr is a little less crazy, except for one massive event that involved a doubling of traffic in a very short time.  This latter event was permanent in its effect, so if you were counting on temporarily borrowing some headroom, you would have had to keep it in place indefinitely and grow from there.  Ironically, that chart was brought to my attention at the Amazon Startup Project, where it was used to sell the idea that Amazon Web Services gives a startup headroom it could never afford to purchase on its own.

These charts are displaying non-linear behaviour, the hardest of all phenomena to predict.  This non-linearity is becoming more and more common because the Internet has become extremely viral.  It is crosslinked, the very meaning of the word “web”, and messages travel along the links with almost no friction.  Viral has become a virtue, and much of the current innovation is focused on making the viral spread of information more likely.  Social Networks are all about such behaviour.

Take a look again at those CNN spikes.  Now let’s imagine your cloud computing infrastructure is hosting a bunch of different blogging, micro-blogging, video, photo sharing, and other social sites.  The CNN spikes no doubt represent something newsworthy happening.  The greatest likelihood is that each spike will be echoed at some level across all of the sites that are in the business of spreading information.  Friction has been lowered to the point where it is almost non-existent when it comes to the spread of memes on the Internet.  We have major spikes from world events, such as the assassination of a world leader.  On the Internet, we can have major spikes from moments as inane as Scoble shedding tears of delight over new Microsoft secret software.  And the whole thing is wired together.  That one tear on Scoble’s cheek breeds a thousand or more accounts, ranging from poking fun to trying to guess what this secret software is.  There is a ravenous beast poised over the keyboard, waiting for something interesting to pass on to its network of other ravenous beasts.
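That ravenous-beast dynamic is essentially a branching process, and a toy simulation shows why removing friction turns blips into spikes.  This is a hypothetical sketch with invented follower counts and pass-on probabilities, not a model of any real network:

```python
# A toy branching-process sketch of meme spread. Friction shows up as the
# probability that a reader does NOT pass the story on; once the average number
# of follow-on posts per post crosses 1.0, cascades stop fizzling and explode.
import random

def cascade_size(followers_per_post=3, pass_on_prob=0.3, cap=50_000):
    """Total posts spawned by one seed post before the cascade dies out."""
    frontier, total = 1, 1
    while frontier and total < cap:
        # Each current post is seen by `followers_per_post` readers, and each
        # reader re-posts with probability `pass_on_prob`.
        new_posts = sum(1 for _ in range(frontier * followers_per_post)
                        if random.random() < pass_on_prob)
        total += new_posts
        frontier = new_posts
    return min(total, cap)

if __name__ == "__main__":
    random.seed(7)
    for p in (0.2, 0.3, 0.4):   # lower friction = higher pass-on probability
        sizes = sorted(cascade_size(pass_on_prob=p) for _ in range(100))
        print(f"pass-on probability {p:.1f}: median cascade {sizes[50]:>6} posts, "
              f"largest {sizes[-1]:>6}")
```

Below one re-post per post, cascades fizzle out; just above it, the occasional cascade runs away, and that runaway cascade is the load spike the shared infrastructure has to absorb.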

This is decidedly non-linear behaviour, and impossible to predict.  The answer is that major cloud computing infrastructure providers will need considerable excess capacity available on tap at all times to avoid outages.  Take Amazon.  Bandwidth to their web services now exceeds the total traffic to all of their other properties.  What might once have been a nice remaindering business, reselling their excess capacity, is now driving the need for more capacity.  They have just a few choices.  They can invest in a lot more hardware and lower the margins on their business, or they can implement strategies that limit the availability of the service to some customers.  It strains credulity to think they’ll limit capacity to their retail business.  So how will they decide?  Tiered pricing of some kind?
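Purely for illustration, here is one hypothetical shape such a strategy could take: an admission policy that, when a spike pushes demand past capacity, sheds requests from the lowest pricing tier first.  The tier names, capacity figure, and request mix are all invented:

```python
# A hypothetical tiered admission policy (all numbers and tier names invented):
# when aggregate demand exceeds capacity, serve the higher-paying tiers in full
# and shed load from the cheapest tier first.
CAPACITY = 10_000                                         # requests/sec the imaginary service can absorb
TIER_PRIORITY = {"premium": 0, "standard": 1, "free": 2}  # lower number = served first

def admit(requests_per_tier):
    """Return how many requests per tier to serve under the capacity cap."""
    admitted, remaining = {}, CAPACITY
    for tier in sorted(requests_per_tier, key=TIER_PRIORITY.get):
        served = min(requests_per_tier[tier], remaining)
        admitted[tier] = served
        remaining -= served
    return admitted

if __name__ == "__main__":
    spike = {"premium": 4000, "standard": 5000, "free": 3000}
    print(admit(spike))   # {'premium': 4000, 'standard': 5000, 'free': 1000}
```

The same idea maps directly onto tiered pricing: customers who pay for guaranteed capacity keep running through the spike, while everyone else gets throttled rather than the whole service toppling over.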

Think in terms of other unexpected networked events.  I’m reminded of financial markets and the law of unintended consequences.  Look at today’s housing market.  Remember Long Term Capital, a hedge fund staffed with Nobel Laureates who had mathematical proofs they would keep making money, right up until they unpredictably went bankrupt.  BTW, this sort of thing used to happen with the electrical grid too.  In both cases, the financial markets and the electrical grid, elaborate means were put in place to artificially inject friction and damp the machine’s oscillations before it could destroy itself.  There are elaborate rules in the stock exchanges about shorting stocks that are falling.  They inject a form of friction back into those markets to prevent total free fall.

Perhaps this points the way to new technology for Cloud Computing infrastructure.  A gentle injection of the right kind of friction, at the right point and for a limited time, might prevent sudden, massive spikes and outages.  It’s an area ripe for innovation.  Meanwhile, Amazon could sorely use some competition.  If a customer could contract for emergency capacity from elsewhere, or even better, if Cloud Computing providers could share slack capacity as the electrical companies do, it would be tremendously helpful when the inevitable load spikes arrive.
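As a sketch of what injecting friction might look like in software (not a description of anything Amazon actually does), a simple token-bucket throttle leaves normal traffic untouched but converts a vertical spike into a bounded burst plus a steady drain.  The rate and burst numbers here are arbitrary:

```python
# A minimal token-bucket throttle: a software analogue of the exchanges'
# circuit-breaker rules. Normal traffic never notices it; a sudden spike is
# clipped to a bounded burst plus a sustainable steady rate.
import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec        # sustained requests/sec allowed through
        self.capacity = burst           # how big a burst gets through before slowing
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, never above the burst cap.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                    # caller should delay or shed this request

if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=100, burst=50)
    admitted = sum(bucket.allow() for _ in range(1000))  # a 1000-request burst, all at once
    print(f"admitted {admitted} of 1000 instantaneous requests")
```

The parallel to the short-selling rules is deliberate: a little friction, applied at the right point and only under stress, keeps a spike from becoming a free fall.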

3 Responses to “Amazon Ran Out of Capacity”

  1. stuartcharlton said

It seems cloud computing is just one part of a broader global trend that your examples highlight. We’re on a path to more Black Swans.

    Curious considering how many in sales or investment demand predictable results, leading to all sorts of delusional behaviour to hide the underlying reality.

  2. charlie262 said

    Bob–

    Very interesting article.

But how can Amazon, or any single corporate or political entity, ever have the capacity to provide web and database services to a substantial portion (say, > 10%) of the world’s IP and database infrastructure? The Internet is fundamentally decentralized. That was its original design center, and the value of the design proves itself every day as routers fail and undersea cables are cut--and the Internet by and large keeps working. I don’t see how these Amazon offerings can ever provide more than a fraction of the database/web services needed by today’s SaaS users and web surfers. There are too many of them--one company can never have enough servers or bandwidth.

    Somewhat off topic, I once read a science fiction book about a world in which teleportation was cheap and effective–and people would show up by the millions wherever on Earth something moderately interesting was going on. The trick was getting there before the place got too crowded.

  3. smoothspan said

Not off topic at all, Charlie. Your teleportation example is the removal of another form of friction, and the result is the same. It’s actually very analogous to my writings about how the blogs that show up in places like Techmeme are “already too crowded.” The interesting stuff is just off the edge of the radar.

As to whether any single corporate or political entity can have the capacity, I don’t doubt that will happen. For example, wouldn’t you expect that Intel manufactures 10% of the MIPS in use already? Microsoft is way over 10% of the OS cycles for PCs, I would bet, and that’s worldwide. I wonder if any single telco handles 10% of the world’s calls? Probably not. Does any oil company refine 10% of the world’s oil?

The larger question is whether it’s a good idea for that sort of thing to happen. The radical reduction of friction leads to more rapid centralization wherever there is any advantage to centralizing. To avoid centralization, you have to eliminate the advantages of centralizing. The web infrastructure is extremely successful at decentralizing web page delivery and a few other services. A lot more decentralized protocols will be needed if SaaS and similar complex distributed applications are to be successfully decentralized.

    These things are pendulums. My guess is we go heavily central before we decentralize again.

    Cheers,

    BW
