SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

WordPress and the Dark Side of Multitenancy

Posted by Bob Warfield on June 11, 2010

Quite a bit of hubbub over WordPress’s recent outage.  A number of high profile blogs including Techcrunch, GigaOm, CNN, and your very own SmoothSpan use WordPress.  Matt Mullenweg told Read/WriteWeb:

“The cause of the outage was a very unfortunate code change that overwrote some key options in the options table for a number of blogs. We brought the site down to prevent damage and have been bringing blogs back after we’ve verified that they’re 100% okay.”

Apparently, WordPress has three data centers, 1300 servers, and is home to on the order of 10 million blogs.   Techcrunch is back and talking about it, but as I write this, GigaOm is still out.  Given the nature of the outage, WordPress presumably has to hand tweak that option information back in for all the blogs that got zapped.  If it is restoring from backup, that can be painful too.

While one can lay blame at the doorstep of whatever programmer made the mistake, the reality is that programmers make mistakes.  It is unavoidable.  The important question is what has been done from an Operations and Architecture standpoint that either mitigates or compounds the likelihood such mistakes cause a problem.  In this case, I blame multitenancy.  When you can make a single code change that zaps all your customers very quickly like this, you had to have help from your architecture to pull it off.

Don’t get me wrong, I’m all for multitenancy.  In fact, it’s essential for many SaaS operations.  But, companies need to have a plan to manage the risks inherent in multitenancy.  The primary risk is the rapidity with which rolling out a change can affect your customer base.   When operations are set up so that every tenant is in the same “hotel”, this problem is compounded, because it means everyone gets hit.

What to do?

First, your architecture needs to support multiple hotels, and it needs to include tools that make it easy for your operations personnel to manage which tenants are in which hotels, which codelines run on which hotels (more on that one in a minute), and to rapidly rehost tenants to a different hotel, if desired.  These capabilities pave the way for a tremendous increase in operational flexibility that makes it far easier to do all sorts of things and possible to do some things that are completely impossible with a single hotel. 
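To make the multiple-hotel idea concrete, here is a minimal sketch of the bookkeeping such tooling might keep: which tenants are in which hotel, which codeline each hotel runs, and a way to rehost a tenant. The names `Hotel`, `HotelRegistry`, and the method names are all hypothetical, invented for illustration; a real system would also have to move the tenant's data, which this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Hotel:
    """One isolated deployment that hosts many tenants (hypothetical model)."""
    name: str
    codeline: str  # the release currently running on this hotel
    tenants: Set[str] = field(default_factory=set)


class HotelRegistry:
    """Tracks tenant placement and the codeline each hotel runs."""

    def __init__(self) -> None:
        self._hotels: Dict[str, Hotel] = {}
        self._tenant_to_hotel: Dict[str, str] = {}

    def add_hotel(self, name: str, codeline: str) -> None:
        self._hotels[name] = Hotel(name, codeline)

    def set_codeline(self, hotel: str, codeline: str) -> None:
        """Record that a hotel now runs a different codeline."""
        self._hotels[hotel].codeline = codeline

    def place_tenant(self, tenant: str, hotel: str) -> None:
        """Put a tenant in a hotel, rehosting it if it already lives elsewhere."""
        if hotel not in self._hotels:
            raise KeyError(f"unknown hotel: {hotel}")
        old = self._tenant_to_hotel.get(tenant)
        if old is not None:
            self._hotels[old].tenants.discard(tenant)
        self._hotels[hotel].tenants.add(tenant)
        self._tenant_to_hotel[tenant] = hotel

    def tenants_in(self, hotel: str) -> Set[str]:
        """Return a copy of the set of tenants hosted in the given hotel."""
        return set(self._hotels[hotel].tenants)

    def codeline_for(self, tenant: str) -> str:
        """Return the codeline the tenant is currently served by."""
        return self._hotels[self._tenant_to_hotel[tenant]].codeline
```

With a registry like this, moving a tenant to a hotel that runs a different codeline is one call to `place_tenant`, and answering "who is exposed to release X?" is a lookup rather than an investigation.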

Second, I highly encourage the use of a Cloud data center, such as Amazon Web Services.  Here again, the reason is operational flexibility.  Spinning up more servers rapidly for any number of reasons is easy to do, and because temporary capacity is so cheap, the cost of briefly running a lot more servers (for example, to give your customers a beta test of a new release) is off the table.

Last step: use a feathered release cycle.  When you roll out a code change, no matter how well-tested it is, don’t deploy to all the hotels at once.  A feathered release cycle delivers the code change to one hotel at a time, and waits an appropriate length of time to see that nothing catastrophic has occurred.  It’s amazing what a difference a day makes in understanding the potential pitfalls of a new release.  Given the operational flexibility of being able to manage multiple hotels, you can adopt all sorts of release feathering strategies.  Starting with smaller customers, with brand new customers, with your freemium customers, or with beta-testing customers are all possibilities that can result in considerable risk mitigation for the majority of your customer base.
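A feathered release can be sketched as a simple loop: deploy to one hotel, wait, check for trouble, and only then move on. The sketch below is illustrative only. The function name `feathered_rollout` and the `deploy`, `is_healthy`, and `wait` callbacks are assumptions made up for this example; they stand in for whatever deployment and monitoring tooling an operation actually has.

```python
from typing import Callable, List, Sequence


def feathered_rollout(
    hotels: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    wait: Callable[[], None] = lambda: None,
) -> List[str]:
    """Deploy to one hotel at a time and halt at the first unhealthy one.

    Returns the hotels that received the release, in order, including
    the hotel that failed its health check (if any).
    """
    deployed: List[str] = []
    for hotel in hotels:
        deploy(hotel)
        deployed.append(hotel)
        wait()  # soak period before judging the release on this hotel
        if not is_healthy(hotel):
            break  # stop here; later hotels never see the bad release
    return deployed
```

The order of `hotels` is where the feathering strategy lives: put the beta, freemium, or small-customer hotels first so that a bad release is caught before it reaches the bulk of the customer base. In practice `wait` would pause for hours or a day rather than return immediately.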

If you’re a customer looking at SaaS solutions, ask about their capacity for multiple hotels and release feathering.  It just may save you considerable pain.

10 Responses to “WordPress and the Dark Side of Multitenancy”

  1. den said

    …or you could self host as I do and wait for releases to become stable? Even then I would say WP has its own problems. Its plugin architecture is great but some of the many free plugins I see are…err…horrible for compatibility. One wonders the amount of testing some of these people undertake.

  2. smoothspan said

    Dennis, too many better things to do with my time than self-host, but you raise the germ of an idea.

    Multi-tenant architectures need to know how to migrate users to new releases. Not all releases are so over-arching that everyone needs to instantaneously go there. Perhaps some could even be voluntary so that people could do the equivalent of what you suggest but without the hosting.

    That flies in the face of the conventional SaaS wisdom of one code line, and there are serious problems with it if customers don’t move forward fairly rapidly. For that reason, I would suggest there be a window during which the customer has leeway to choose their migration point. If you release quarterly, perhaps the window is the first 2-4 weeks after the release is available.

    Also, some releases will just have to be mandatory because they involve huge bugs or will hold up things that happen beyond just a single customer’s instance.



  3. den said

    @Bob – I know what you mean but in truth I self host because WP puts lame (IMHO) restrictions on what you can do on their servers. Plus I am used to its vagaries so I guess maintenance might take 1/2 a day a year tops? Not bad really.

    But to your feathered release cycle, I think you are absolutely spot on. I’ll riff that later today.

  4. […] WordPress outage is getting lots of attention. Understandably. Bob Warfield weighs into the mix with his Dark Side post, arguing that a feathered approach to rollout might be a good idea: When you roll out a code […]

  5. mingfwu said


    I think you’re dead on with feathered release cycles, but would add that there is an inherent level of complexity that comes with running and maintaining multiple releases simultaneously. A structured approach can help solve that such as a planned migration schedule with target conversion numbers and dates along with KPIs that indicate that the new release is operating smoothly.

    Another key item with feathered releases is decoupling changes to the tiers of your application. If you can decouple database changes from application changes, this will greatly improve your options should any issues arise. The most common problem with releases is the “all-or-nothing” approach. Web and application servers can be rolled back pretty easily, but backing out a database change can take hours or might even require restoring from a backup. By decoupling the changes and pushing the most sensitive updates out incrementally, you can manage your risk far more effectively.

  6. smoothspan said

    Ming, agree totally on both points. Was discussing offline with friends how the decoupling might come about and how it can be used for other interesting purposes.

    Running the multiple instances simultaneously is problematic, and hence my call for a limited window for it.

    The reality is it doesn’t take very long to discover a catastrophic problem like the one that hit WordPress. If you can rollout to 10-20% of users at a time, you can find it out before everyone is exposed. From that standpoint, I would think you would need to keep separate versions available no more than a week and perhaps as little as 24-48 hours.



  7. […] was analysis quickly available regarding the bugs in the rolled-out code: SaaS provider like Automattic should consider […]

  8. hakre said

    Thanks for the insight. As a developer, I would naturally focus more on the structural problems I see in developing WordPress. I’ve put some examples together in WordPress Outage Feedback.

  9. […] to everyone in one fell swoop.  I’m trying to calm down, and sure, I’ve written about release feathering myself.  That’s kewl and all, but how long is this going to take and why isn’t there […]

  10. […] it out to everyone in one fell swoop.  I’m trying to calm down, and sure, I’ve written about release feathering myself.  That’s kewl and all, but how long is this going to take and why isn’t there more […]
