SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

Balancing Process and Agility, Google’s Cautionary Tale

Posted by Bob Warfield on January 31, 2009

For about an hour this morning, Google was reporting every search result as leading to a site with malware.  Some were chuckling that Google even reported its own site as a bad risk.

Marissa Mayer reports that the error was due to a 3rd party, StopBadware, which periodically sends them a list of bad site URLs.  In this case, the list contained the “universal URL” consisting of just the slash:  “/”.  The upshot is that for a while, you had to manually cut and paste results into your browser, and you had no idea if you were going to a site that really did have malware or a site that was simply afflicted by this obvious bug.   To make matters worse, Mayer’s original explanation, that they got a bad file from StopBadware, turns out to be wrong.  StopBadware says Google generates its own files, and Google amended its statement, but still claims it was a simple human error.  It was, but the error that allowed this to happen was in creating the bug and not testing for it.  It wasn’t because someone put a “/” in a data file, although that is also a human error.

Of course this raises lots of questions, ranging from the wisdom of letting Google be a single point of failure in our Internet lives to whether it is no big deal: an obvious problem, unlikely to confuse anyone, that would likely be quickly fixed.  I guess I am somewhere in the middle, but mostly towards the latter.  The incident is an embarrassment for Google, but didn’t really cause lasting harm.

The question is how much of this sort of thing a company like Google can tolerate before it does harm the brand, and what should the company be doing to protect itself.  A couple of thoughts present themselves:

–  “/” is a pretty obvious case to test for, but clearly it hadn’t been tested for.  Google originally blamed a third party for giving them bad data, and that turned out not to be the case, but even if it were true, most QA organizations I know of would blame the software as well for not having been thoroughly tested.  It raises the question of what other mischief is lurking about that hasn’t been tested for.  What level of process and quality should Google be aiming for?

–  This is a time when Google is trying to be more financially efficient, and when the old culture of hire the smart ones as fast as you can and we’ll figure out what to do with them later is rapidly being cast off as unworkable.  Is the culture capable of increasing the level of process, testing, and other “overheads” which will mean further cuts on the innovation side to pay for it? 

Process and Quality are not quite the same thing, but they are certainly related.  And, they are often viewed as costs or overhead, rather than as benefits, although the Japanese certainly showed that quality can be a powerful factor in success, and Microsoft is certainly showing that a lack of quality (perceived if not outright) can make selling difficult even for a monopoly.

But there are tradeoffs to be balanced.  Originally I wanted the title of this post to be, “Balancing Process and Innovation,” to underscore the investment in more developers to make new things versus other investments in process to increase quality.  Google could’ve spent more time testing its malware detection, it could have instituted some simple testing (it wouldn’t have taken much to catch the latest) every time a new set of URLs came in, or it could have done a variety of things that may have contributed to quality, but would have reduced available investment in innovation.  But as I was writing the post, I realized the tradeoff was more between Process and Agility than Process and Innovation.
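To make the “simple testing” idea concrete, here is a minimal sketch in Python of a sanity check that could run every time a new list of bad URLs arrives. Everything in it is an assumption for illustration: the function names, the idea that entries are matched as substrings of a result URL, and the list of known-good sites are hypothetical, and nothing here reflects how Google’s actual system works.

```python
from urllib.parse import urlsplit


def is_overbroad(pattern: str) -> bool:
    """Return True if a blocklist entry has no real host name in it.

    Entries such as "/", "" or "/some/path" carry no host, so under a
    substring-style match (an assumption of this sketch) they could flag
    a huge number of unrelated sites.
    """
    stripped = pattern.strip()
    if stripped in ("", "/", "*"):
        return True
    # Add a scheme if it is missing so urlsplit can find the host part.
    candidate = stripped if "://" in stripped else "http://" + stripped
    host = urlsplit(candidate).hostname
    return not host or "." not in host


def rejected_entries(patterns, known_good_urls):
    """Return the entries that should block the update from going live.

    An entry is rejected if it is overbroad, or if it would flag any URL
    from a (hypothetical) list of sites known to be clean.
    """
    bad = []
    for pattern in patterns:
        entry = pattern.strip()
        if is_overbroad(entry) or any(entry in url for url in known_good_urls):
            bad.append(pattern)
    return bad


if __name__ == "__main__":
    incoming = ["/", "malware.example.com/dl", "google.com"]
    print(rejected_entries(incoming, ["http://www.google.com/"]))
```

The design choice is to refuse the whole update if anything is rejected, rather than silently dropping the offending entries, so that a person looks at the file before it reaches any customers.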


Because if you can respond fast enough, you can successfully respond after the fact.   Such a response might still be ineffective at preserving customer satisfaction if the problem was bad enough, but in general, the fewer customers that experience a problem, and the less it impacts them (whether due to lower severity, shorter time, etc.), the better things are.  Having no problems at all may be best (though not always, more on that in a minute), but failing that ideal, having many fewer perceived problems is not bad.

Some time ago, while I was with Callidus, I did a benchmarking survey of SaaS companies and On-premises companies that were in transition or thinking of moving to SaaS.  We did this to understand what was involved and to help us decide whether the move to SaaS was right for Callidus.  We ultimately decided it was, and the transition has been going very well for Callidus, but I learned some interesting things.  At one point, I started asking Customer Service organizations at On-prem software companies what percentage of reported product problems had already been fixed in the latest release.  The answer that came back surprised me.  It ranged from 40% to as much as 70%.  What was happening is customers were reluctant to move to the latest On-prem release for whatever reasons, and so they were encountering bugs in old versions that had already been fixed.   A SaaS company has the luxury of control over what version of its software customers run, so it can fix bugs as soon as they are discovered by just a few customers, and the majority of customers may never see those bugs.  It’s an object lesson in the value of Agility versus Process.  The On-prem companies can invest the same in QA, but deliver a worse experience because they can’t be as Agile about fixing the problems.

This applies to the Google case in two ways.  First, if Google had to patch software on everyone’s machine that accessed Google, that would be a nightmare.  Much slower and more painful.  Instead, they were able to fix it on their own servers so the total incident lasted a relatively short time.  Second, Google had the opportunity to mitigate their risk further by expanding on this theme, but they didn’t take it.  It should be straightforward for Google to roll out changes like this to subsets of their audience.  Perhaps they would do so by data center, region, country, or time zone.   Doing such a staged rollout would ensure that they got early warning of catastrophic and obvious problems before their entire infrastructure was affected.  They could roll back whatever change happened and stop further rollouts until the problem was resolved.  That’s a level of agility that would greatly benefit any organization that has as many customers as Google does.
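The staged rollout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stage names and the `apply`, `healthy`, and `rollback` callbacks are hypothetical stand-ins for whatever deployment and monitoring tools an organization really has.

```python
def staged_rollout(stages, apply, healthy, rollback):
    """Apply a change one stage at a time, stopping at the first bad sign.

    stages   -- ordered list of audiences, smallest first (e.g. a canary)
    apply    -- callback that pushes the change to one stage
    healthy  -- callback that returns True if the stage looks fine afterwards
    rollback -- callback that reverts the change on one stage

    Returns the list of stages left running the new change. If any stage
    turns out unhealthy, every stage touched so far is rolled back and the
    result is an empty list.
    """
    done = []
    for stage in stages:
        apply(stage)
        done.append(stage)
        if not healthy(stage):
            # Undo in reverse order and halt further rollout.
            for touched in reversed(done):
                rollback(touched)
            return []
    return done


if __name__ == "__main__":
    versions = {}
    result = staged_rollout(
        ["canary", "europe", "north-america"],
        apply=lambda s: versions.__setitem__(s, "new"),
        healthy=lambda s: s != "europe",  # pretend the problem shows up here
        rollback=lambda s: versions.__setitem__(s, "old"),
    )
    print(result, versions)
```

In the example run, the problem surfaces in the second stage, so the third audience never sees the change at all. In practice the health check would also include a settling period before moving on to the next stage.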

Consider SaaS companies.  Multi-tenancy is great for cost savings, but it carries the risk of Google’s problem.  Namely, every customer gets fed the “bad” change at once.  No large SaaS organization really tries to put every customer onto the same instance.  Salesforce has NA1, NA2, NA3, EMEA, and so on.  The question is whether their process for rolling out patches and new releases is always uniformly applied to every instance, or whether they apply to one instance, wait and see whether there is an adverse impact, and then move on to the rest of the instances after a suitable settling period.  Such a policy seems very prudent to me, and capable of helping reduce the risk that your customers get upgraded to a bad release.  To be sure, I am absolutely not advocating everyone running a separate code release.  I am advocating finer granularity in rollouts that lasts for a very temporary period as a way to mitigate risk.

Does such an approach risk the cost savings of multi-tenancy?  Not at all.  That savings is based on dynamically allocating unused capacity between the tenants, and on reducing the management overhead so that management of a single instance yields management of many tenants.  Having a few more instances (or even a lot more for a big outfit like Google) doesn’t really impact either benefit.  The virtualization benefit (sharing unused capacity) probably kicks in at a smaller instance size than you may think.  Take the smallest (in terms of machine resources) instance that will run your largest customers and start putting small customers on it.  It’s pretty efficient.  Now take an instance 2x larger than needed for your largest customer.  Put the largest customer on it with a bunch of very small customers to fill in the “gaps”.  It runs pretty nicely.
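As a rough illustration of filling in the “gaps”, here is a minimal sketch that packs tenants onto instances using the textbook first-fit-decreasing heuristic. The load numbers are made up, a single number per tenant is a big simplification, and this is not a description of how any particular SaaS vendor places customers.

```python
def pack_tenants(loads, capacity):
    """Place tenants on instances using first-fit decreasing.

    loads    -- resource needs of each tenant, in arbitrary integer units
    capacity -- size of one instance, in the same units

    Tenants are taken largest first and put on the first instance that
    still has room; a new instance is opened only when none fits.
    Returns a list of instances, each a list of the loads placed on it.
    """
    instances = []
    for load in sorted(loads, reverse=True):
        if load > capacity:
            raise ValueError("tenant does not fit on a single instance")
        for instance in instances:
            if sum(instance) + load <= capacity:
                instance.append(load)
                break
        else:
            # No existing instance had room, so start a new one.
            instances.append([load])
    return instances


if __name__ == "__main__":
    # Largest customer needs 10 units; instances are 2x that size.
    print(pack_tenants([10, 1, 2, 3, 4, 5], capacity=20))
```

With these made-up numbers the largest customer shares its instance with several small ones, and only two instances are needed in total, which is the point: a handful of modestly sized instances still shares unused capacity well.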

My own company, Helpstream, uses the Cloud to facilitate this sort of thing.  We use Amazon for our Cloud, and it has really helped us.  For example, we roll out a Beta instance to our biggest customers (eventually to everyone) 2 weeks in advance of a new release.  It has all of their data on it, so they can start playing with it and tell us any problems they see.  We also recently rolled out our first private instance for a customer that wanted to see what their performance would look like on an isolated instance.  Over time, we’ll use relatively more instances together with the ability to seamlessly move tenants (our customers) between instances to provide a great deal of operational flexibility.  And I fully expect we’ll get to staged rollouts of new releases as well.

Google and other large Software as a Service or web companies should already be doing so today.  If you’re in a position to limit the impact of a change to a small audience, and to fix it quickly if there is a problem, Agility can take the place of the expensive process needed to prevent any error from ever happening.   But, you have to plan ahead to be in a position to operate this way.

2 Responses to “Balancing Process and Agility, Google’s Cautionary Tale”

  1. ffortini said

    Google does of course have the technical ability for staged rollouts — why, just in the last few days they’ve rolled out the offline-email feature for gmail labs in a nicely staged way and without any glitches that I’ve heard of, so their strong capabilities in this key area of “cloud computing” release-engineering can’t be doubted.

    The mystery is why they CHOSE not to use “normal” operational prudence (in the huge helpings obviously needed given their huge size) for releasing these updates to their “badware” info — not just staging, but deliberate “canary in the coalmine” tactics starting with deployment to a tiny sample, followed by careful scrutiny of the results (and this particular error had effects so obvious that it would inevitably have been spotted), followed by a “make haste slowly” rollout.

    Further: if they run on skeleton operational crews in “quiet” times (like, early on Saturday mornings), which is a not unreasonable compromise between opex and service quality, then such times should best be avoided for any service updates with any potential for disruption — do them when there IS plenty of operational personnel available to spot and remedy a disruption (typically by rollback).

    Ah well, I’m sure they’ll tighten their operating procedures now — as I said, it’s obvious that they DO have all the technical capabilities in place, so it’s only a matter of establishing and enforcing proper releng procedural constraints.
