SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

Does Data Volume Trump the Multicore Crisis?

Posted by Bob Warfield on August 14, 2007

Bill de hÓra says Data Trumps Processing.  His blog post elaborates that he sees the scaling of the DBMS as a far bigger problem for most software developers than the multicore crisis (how to effectively utilize the many cores that are appearing on chips).  I saw the post linked to on Making It Stick, which talks a lot about multicore.  Since I’m intrigued by all things multicore, I took a look, but came away in fairly strong disagreement with the proposition.  Let me walk you through the two reasons I am not so concerned about data volume.

First, we have to ask where the data volumes are coming from and at what rate they’re growing.  In the Enterprise world, data volumes are largely a function of transactions.  That’s not the only source, but it is by far the biggest.  Here’s a news flash:  transaction volumes do not grow at anything like the Moore’s Law rate of doubling every 18 months, except for relatively small companies that start with low volumes and grow until they’ve nearly saturated their markets.  My old company Callidus Software worked with data volumes at the extreme high end of the Enterprise scale when calculating sales commissions.  We paid the sales commissions for companies like Allstate, Sprint Nextel, and Washington Mutual Bank.  To pay those commissions we had to analyze every transaction coming in.  Terabytes were routine stuff.  To make matters worse, we kept both a highly normalized transaction processing store as well as a denormalized data mart for reporting and analysis.
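To see how lopsided that race is, here is a back-of-the-envelope sketch. The 15% annual transaction growth figure is purely a hypothetical stand-in for a mature enterprise; the point is the shape of the curves, not the exact numbers.

```python
# Illustrative arithmetic (hypothetical growth rates): core counts doubling
# every 18 months per Moore's Law, versus enterprise transaction volume
# growing at an assumed 15% per year.
years = 6
moores_law_factor = 2 ** (years * 12 / 18)   # doubling every 18 months
transaction_factor = 1.15 ** years           # assumed 15% annual growth

print(f"Compute capacity grows {moores_law_factor:.1f}x in {years} years")   # 16.0x
print(f"Transaction volume grows {transaction_factor:.1f}x in {years} years")  # 2.3x
```

Under those assumptions, hardware capacity outruns transaction volume by roughly 7x over six years, which is why a merely transactional data load tends not to be the binding constraint.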

There are potential sources of data that grow at Moore’s Law rates.  We can talk about data acquisition from sensors, or even about analyzing the web itself, but for the most part, we start with a volume and deal with it.  Also, let’s leave aside data volumes that are high in bytes, but not so high in terms of rows and columns.  Media sharing generates a lot of bytes, but the complexity and scalability issues come about when you have equivalent terabytes of structured data without the padding.

The second reason I’m less concerned is that the tools are set up to deal with this problem.  Oracle scales beautifully to very large numbers of cores.  That’s always been one of their competitive weapons, and when I used to be a VP there, we managed benchmarks against the competition by simply lining up more cores than they could handle (SQL Server was the usual victim) and then crushing the numbers by using all the cores.  To make matters even more one-sided, the data world already uses an inherently parallel language: SQL.  They’re not trying to hack Java or C++ to add parallelism after the fact.  This frees the DBMS vendors to do all sorts of dirty work under the covers without threatening the legacy code base.  And the vendors have done a good job, or at least Oracle has, and DB2 is also quite good in this respect.  They will continue to get better.
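The reason SQL frees the vendors this way is that it is declarative: a query states what to compute, not how. A minimal sketch of that idea, using Python’s sqlite3 purely as a stand-in engine (table and column names are made up for illustration):

```python
import sqlite3

# A declarative SQL aggregate: it states WHAT to compute, not HOW.
# An engine like Oracle is free to partition the scan and aggregation
# across many cores, and the query text never changes. (Hypothetical
# schema; sqlite3 is used here only to run the example.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (rep_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 100.0), (1, 250.0), (2, 75.0), (2, 125.0)],
)

# The equivalent imperative code would dictate a serial loop; this leaves
# the execution strategy entirely to the DBMS.
rows = conn.execute(
    "SELECT rep_id, SUM(amount) FROM transactions GROUP BY rep_id ORDER BY rep_id"
).fetchall()
print(rows)  # [(1, 350.0), (2, 200.0)]
```

Contrast that with Java or C++, where the program spells out exactly one execution order and any parallelism has to be retrofitted by hand.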

Now where might the data side be a problem?  First, Oracle and DB2 are not necessarily answers people like to hear.  They are relatively expensive.  Even SQL Server is not cheap and isn’t very “Web” fashionable.  What the world wants is for their LAMP-based implementation to scale out to vast proportions.  In other words, MySQL and the other Open Source players.  Here we may have a problem.  Those products are great, but they’re not necessarily set up to scale in this way, at least not to hundreds of CPUs.  That leaves programmers holding the bag and having to code around these issues on those platforms.  If someone wanted to perform a pious task, it would be to address these scalability issues for these platforms.  After all, as I’ve mentioned, it can be tackled deep down in the bowels of the code without messing about with the code that accesses the DBMS too much.

A second issue is DBMS skills.  They’re not hard, certainly not as hard as writing fine-grained parallel code in a conventional language, but they are relatively hard to come by.  I guess it isn’t a sexy enough field for a lot of great programmers.  Bill de hÓra mentions in his post that “Every web 2.0 scaling war story I’ve heard indicates RDBMS access becomes the fundamental design challenge.”  I hope you’re not too surprised, but I agree with Bill, and would add a corollary that every Enterprise scaling war story I’ve heard indicates RDBMS access is the fundamental design challenge.  Where Bill and I might differ is that in the majority of cases I’ve seen, it has been due to the database developers doing something silly that was not that hard to fix.  Sometimes even the smart guys make a mistake in not understanding the usage patterns of the system very well.  DBMS programming turns out to be an iterative process.  If you rush your Facebook applet out without much testing (load tests are expensive), run it against MySQL, and it hits 1 million users a week later, you’re not going to get much sleep for a while.
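The classic example of a silly-but-easy-to-fix mistake is querying on a column with no index, which forces a full table scan. A small sketch of the iteration loop, again using sqlite3 as a stand-in and a made-up schema:

```python
import sqlite3

# A common easy-to-fix mistake: filtering a large table on an unindexed
# column forces a full scan. (Hypothetical schema; sqlite3 stands in
# for whatever RDBMS you actually run.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = ?", (42,)
).fetchone()
print(plan[-1])  # a full scan, e.g. "SCAN events"

# The fix is often this simple: add an index that matches the access pattern.
conn.execute("CREATE INDEX idx_events_user ON events(user_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = ?", (42,)
).fetchone()
print(plan[-1])  # now an index search, e.g. "SEARCH events USING INDEX idx_events_user"
```

Checking the query plan against realistic usage patterns before launch is exactly the kind of iteration that gets skipped when the applet is rushed out the door.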

Then there is a category where the DBMS forces you closer to multicore.  Bill alludes to this too when he comments that “things like joins, 3nf, triggers, and integrity constraints have to go – in other words, key features of RDBMSes, the very reasons you’d decide to use one, get in the way. The RDBMS is reduced to an indexed filesystem.”  These are all areas where folks who are not DBMS gods fear to tread, especially when the performance stakes are high.  Eventually we come to grips with the idea that the DBMS is solving a communication problem with a member of the computing ecosystem that is as remote in round-trip times as the Age of the Dinosaurs on Earth.  I’m referring to the disk drive, of course.  The CPUs are shielded behind two layers of cache, a bunch of RAM, disk caches in the OS or app, more caches on the disk controller, RAID striping, and who knows what all else.  It ain’t enough!  So when the going gets tough, you use the DBMS just exactly the way Bill says, as an indexed file system.  Guess what?  Building your own joins across multiple CPUs, implementing integrity constraints when the data is fed in or modified, and so on are all easier than writing fine-grained parallel code in Java.  Most systems will find this only has to be done for a relatively small fraction of their overall schema.
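The “indexed file system” pattern looks roughly like this sketch: the join becomes a hash lookup in application code, and the integrity constraint is enforced at ingest time instead of by the DBMS. The data and record shapes are invented for illustration.

```python
# Sketch of the pattern Bill describes: the join and the integrity
# constraint move out of the DBMS into application code.
# (Customer IDs and records are hypothetical.)

customers = {101: "Allstate", 102: "Sprint Nextel"}   # id -> name

def insert_order(orders, order_id, customer_id, amount):
    # Integrity constraint enforced when the data is fed in,
    # instead of by a foreign-key constraint in the DBMS.
    if customer_id not in customers:
        raise ValueError(f"unknown customer {customer_id}")
    orders.append({"id": order_id, "customer_id": customer_id, "amount": amount})

def join_orders_to_customers(orders):
    # A hand-rolled hash join: one dictionary lookup per order row.
    # Because each row is independent, this is trivially partitionable
    # across CPUs by splitting `orders` into chunks.
    return [(o["id"], customers[o["customer_id"]], o["amount"]) for o in orders]

orders = []
insert_order(orders, 1, 101, 500.0)
insert_order(orders, 2, 102, 750.0)
print(join_orders_to_customers(orders))
# [(1, 'Allstate', 500.0), (2, 'Sprint Nextel', 750.0)]
```

Note how coarse the parallelism is: splitting a list of rows across workers is embarrassingly parallel, nothing like the fine-grained locking you would face parallelizing general-purpose Java.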

Besides which, where else can I get such a great indexed file system?
