As I was driving along pondering the imponderables, I suddenly realized the folks talking about the Multicore Crisis have gotten it all wrong. For those who haven’t heard of it, the Multicore Crisis is basically concern about what happens as chipmakers shift from being able to deliver ever-faster clock speeds according to Moore’s Law to delivering ever more processor cores on the same chip. The crisis comes about because it’s much harder to write truly parallel software than it is to just let the chip get faster and run conventional software twice as fast every 18-24 months. No lesser folks than Microsoft’s Craig Mundie have proclaimed that we are 10 years away from having the proper languages and other tools to efficiently harness the hardware that will exist in a multicore world.
Some of the pundits in the blogosphere have argued that we have plenty of time to get ready for the Multicore Crisis, and that all the hubbub today is just hype and hand-wringing. They will do projections that say it’s easy with a couple of cores to just give one to the OS, save the other for the app, and see an immediate speedup. By the time there are enough cores on a chip that this quits working, 10 years will have gone by and we’ll have all those great new tools needed to harness the big chips. There are some pretty good rebuttals for this already, BTW.
Never mind that quad-core chips have already shipped, that motherboards are cheaply available to put two of these together in a “V8” 8-core configuration, and that 8-core chips are nearly here from Intel and already here from Sun. Never mind that Intel has an 80-core chip in their labs and there are startups looking at 64 cores in the relatively near term. Let’s also forget that with 4 cores shipped now and 8 cores due out next year we will see 64 cores in more like 6 years than 10, according to standard Moore’s Law rates. Despite all that, it’s all going to be okay. Really!
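For anyone who wants to check my arithmetic on that 64-core projection, here is a back-of-the-envelope sketch in Python. The function name is my own invention for illustration, and the only assumption is that core counts double once per Moore’s Law period, starting from the 4 cores shipping today.

```python
import math


def years_to_reach(target_cores, current_cores, months_per_doubling):
    """Years until core counts reach target_cores.

    Assumes one doubling of cores per Moore's Law period. This is a
    rough projection, not a roadmap from any chipmaker.
    """
    doublings = math.log2(target_cores / current_cores)
    return doublings * months_per_doubling / 12


# Going from 4 cores to 64 cores takes 4 doublings.
for months in (18, 24):
    years = years_to_reach(64, 4, months)
    print(f"{months}-month doubling: {years:.0f} years to 64 cores")
```

At the 18-month end of the range this gives 6 years, and even at the slow 24-month end it gives 8, which is still short of the 10 years the tools are supposed to take.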
Here is my problem with all this back and forth: we’ve already hit the Multicore Brick Wall without leaving skid marks and most people just don’t realize it! I hear the crowd out there now, beyond the klieg lights, grumbling in the dark, “What’s he on about now?” Patience please. The lesson of multicore is that someday we will expect software to scale linearly. That Alpha Geek Speak means if I double the number of available cores, I want my software to run twice as fast. Hallelujah! I’m back to getting twice the speed every 18-24 months just like in the heyday of Moore’s Law. In the post-clockspeed-doubling world that’s coming, this will be a requirement or all computing progress grinds to a halt (that means the money stops: true crisis), or so say the Multicored Chicken Littles.
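To put a number on “scales linearly,” here is a minimal sketch of how you might score a program against that ideal. The function names and the timings are hypothetical, made up purely for illustration. A score of 1.0 means doubling the cores really did double the speed.

```python
def speedup(time_on_one_core, time_on_n_cores):
    """How many times faster the job ran on n cores than on one."""
    return time_on_one_core / time_on_n_cores


def scaling_efficiency(time_on_one_core, time_on_n_cores, n_cores):
    """Fraction of ideal linear scaling achieved (1.0 is perfectly linear)."""
    return speedup(time_on_one_core, time_on_n_cores) / n_cores


# Hypothetical timings in seconds, not measurements of any real system.
print(scaling_efficiency(100.0, 50.0, 2))  # twice as fast on 2 cores: 1.0
print(scaling_efficiency(100.0, 80.0, 2))  # only 1.25x faster: 0.625
```

Anything noticeably below 1.0 means you are paying for cores your software can’t use.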
Linear Scalability is hard to do, but ironically, it is nothing new. Guess what? We’ve already been fighting with “scalability” for a long time. Can you see where I’m going with this? Let me give you some examples.
Once upon a time eBay was plagued by terrible outages. Analysts stated that this was due to eBay’s failure to build a redundant, scalable web architecture. One of my startups was located on eBay’s campus in Campbell, and the story we heard at the local Starbucks was interesting. It seems eBay had built out their original architecture around the idea of running a 3rd party search engine on a mainframe. Eventually, they reached a point where they had purchased the largest mainframe Sun had to offer. Unfortunately, being a Red Shifted business, they were growing at a rate faster than Moore’s Law, and hence faster than Sun could provide them more powerful machines! Or, as eBay themselves put it in a presentation on their architectural evolution, “By November 1999, the database servers approached their limits of physical growth.”
In August of 1999, Meg Whitman hired Maynard Webb on the heels of all this to fix it. The fix (despite many protestations that at least some of the problem was due to issues with eBay’s vendors like Sun) boiled down to rearchitecting the very fabric of eBay to allow for:
“clustering the servers for greater availability, dividing the workload among its Oracle databases”
Wow! Deja Vu all over again. They needed to find a way to harness more cores to keep up with the load: eBay had a Multicore Crisis in 1999!
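To make “dividing the workload among its databases” concrete, here is a minimal sketch of one common way to do it: hash each record’s key and use the result to pick a database. I have no inside knowledge of how eBay actually split their data, so treat this as an illustration of the general idea only. The names shard_for and partition are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard for a key using a stable hash.

    hashlib is used instead of the built-in hash() so the same key maps
    to the same shard across processes and restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys into num_shards lists, one list per database."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Hypothetical item IDs spread over four databases.
items = [f"item-{i}" for i in range(1000)]
for number, shard in enumerate(partition(items, 4)):
    print(f"database {number}: {len(shard)} items")
```

The catch with a plain modulo scheme like this is that changing the number of databases moves most keys to a different shard, which is part of why rearchitecting a live site is so painful.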
When I worked for Oracle, we used to employ the Multicore Crisis to make sure our server won the benchmarks against competitors. It was easy. Just insist on running the benchmark on a server that had more CPUs than Microsoft SQL Server could utilize. If Oracle could run 2x the CPUs and keep them all efficiently humming away, we would run 2x as fast on the same hardware. As I recall, at first SQL Server could utilize just 4 cores. At some point, and after a lot of pain, they upped it to 8. I’ve worked on big Enterprise projects where we successfully harnessed well over 100 CPUs.
Which brings me to my last company, Callidus Software. We used scalability as a powerful competitive weapon. We had built a grid computing infrastructure to run our incentive compensation software. The competition literally had to throw in the towel at certain volume levels. Beyond here there be scalability dragons. There’s nothing quite like competing in a deal where you know your competition can’t produce a single happy reference at the volume levels the prospect requires.
More recently, the Skype VOIP service was down for an extended time due to what was basically a scaling problem. Microsoft forced some updates through to Windows users, Windows had to reboot (what else is new), and suddenly there were millions of rebooted machines trying to log onto Skype all at the same time. Skype’s explanation was:
Our software’s peer-to-peer network management algorithm was not tuned to take into account a combination of high load and supernode rebooting.
Consider the costs to businesses that depend on Skype. Looking closer to home at eBay, Skype’s owner, investors saw a loss of $1B in market value as the drama unfolded. A Multicore Crisis can be really bad for your business! As more and more of the computing world turns to centralized models like SaaS and Web 2.0, it becomes more important than ever to solve the Multicore Crisis, or at least the Scalability Crisis, for these businesses to succeed.
If we want to move beyond this, SaaS and Web 2.0 sites have to be architected for massive scalability, particularly if they’re built on cost-effective Lintel (Linux on commodity Intel boxes) architectures like so many of these sites are. In addition, companies need to invest in utility computing at the hosting end so they can rapidly increase (or decrease) the hardware they have online when demand hits. One example of utility computing is Amazon’s EC2 and S3 services, which let you dynamically provision a machine in their data center in about 10 minutes.
Have you ever encountered massive outages on a new and rapidly growing service? Perhaps a newly minted Web 2.0 startup? Perhaps you’ve been really unlucky and encountered the problem as your company tried to install a mission-critical piece of Enterprise Software. Post a comment here to share your experiences. I know many of you have already had a Multicore Crisis, and now you know what to look for.
For those who are thinking you’ll worry about the Multicore Crisis in 10 years when it’s an easy problem to solve, remember:
You’ve already had a Multicore Crisis and just didn’t know it!
A Picture of the Multicore Crisis: See a timeline of it unfolding.