No software company in their right mind would change code and move it to production without extensive testing to make sure the new code wasn’t broken. It’s a tried and true business process supported by tools like automated build (gather up all the files, compile as necessary, package, and produce a running version), source code control systems (check in and check out with auditing so all the files that go into a particular version are known and verified), and so on. That’s all fine and well, and though I do sometimes find a company that hasn’t made it to even that basic stage of operational process, there is a level of complacency associated with having such a process working. The AmazonFail incident shows us that there is a lot more than just program code at stake here.
First, what was the AmazonFail incident? It seems that Amazon suddenly started delisting Gay and Lesbian publications from their sales rankings. According to the WSJ, this had the effect of removing the books from search. Needless to say, the incident resulted in a great hue and cry, since many assumed Amazon had done this intentionally and took it as evil behavior (along the lines of Google’s “don’t be evil” motto). Before long there were charges of online censorship, and Twitter and the blogosphere were lit up bright talking about it. Rumors even emerged that the incident was a result of hackers. Eventually the whole thing became known as “AmazonFail”, and there is a hashtag for that on Twitter if you want to read all about it. As I write this, “#amazonfail” is the third most popular search term on all of Twitter.
Clearly it’s been a major PR black eye for Amazon, but what caused it?
Amazon calls it “an embarrassing and ham-fisted cataloging error.” Ultimately it affected over 57,000 titles, including books well beyond the Gay and Lesbian themes first reported. A more detailed internal story comes to us from Andrea James of SeattlePi:
Amazon managers found that an employee who happened to work in France had filled out a field incorrectly and more than 50,000 items got flipped over to be flagged as “adult,” the source said. (Technically, the flag for adult content was flipped from ‘false’ to ‘true.’)
“It’s no big policy change, just some field that’s been around forever filled out incorrectly,” the source said.
Amazon employees worked on the problem well past midnight, and then handed it over to an international team, he said.
Doesn’t this sound just like the sort of problem that can be caused by making a minor code change and then rolling out the changed code to production without adequate testing? First, it was a minor change. One field was filled out incorrectly by one person in France. Second, it created a huge problem as minor changes in code often can. Third, the problem was not immediately obvious, nor was the cause, so it got pretty far along before it could be fixed. Yet it wasn’t code, it was just data.
In this day and age of Cloud Computing, SaaS, and web applications, data is becoming just as critical as code. Metadata, for example, is the stuff of which customizations to multi-tenant architectures are made. In that sense, it is code of a sort. “Soft” heuristics are common in search applications that have to decide which words to ignore completely, how to weight different words, which words might be profanity (or adult content in this case), which words are positive, negative, or possibly spam-related, and all the rest. That’s all critical metadata to a search engine such as Amazon’s. There’s a lot of other critical data to consider, including credit card and other privacy-related information, financial information (what if someone gets the decimal point wrong when entering sales tax for a particular area?), and so on.
Data drives modern applications so completely that we have to think about what this means for the security and robustness of our applications. We’re still in our infancy on that front. Modern software will test all the data entered to be sure it doesn’t contain a malicious payload (the SQL injection attack is one way to hack by entering special data in a field exposed to users), and there are many similar low-level tests that are made. But what about the business process that ensures the integrity of that data? How can a single individual in France create such a big problem for Amazon by changing a little bit of data?
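To make the contrast concrete, here is a minimal sketch of the kind of low-level check I mean, using Python's standard sqlite3 module. The table, the titles, and the find_titles helper are all made up for illustration. A parameterized query treats whatever the user types as a plain value, so an injection attempt simply matches nothing.

```python
import sqlite3


def find_titles(conn: sqlite3.Connection, user_input: str) -> list[str]:
    """Look up titles with a parameterized query.

    The "?" placeholder makes sqlite3 pass user_input as a value, never as
    SQL text, so special characters in the input cannot change the query.
    """
    cursor = conn.execute(
        "SELECT title FROM books WHERE title = ?", (user_input,)
    )
    return [row[0] for row in cursor.fetchall()]


# Hypothetical in-memory catalog, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, adult INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Book A", 0), ("Book B", 0)],
)

normal_result = find_titles(conn, "Book A")
attack_result = find_titles(conn, "' OR '1'='1")
```

Notice what this does not do. A perfectly well-formed value of "true" in an "adult" field would sail straight through a check like this, because nothing about it is malicious. It is just wrong.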
Let’s assume for the moment that we choose to treat that data as code. That would mean we do some or all of the following:
– Archive all the data changes in the equivalent of a source control (also called a “version control”) system. We’d know every version of the data that was entered, who entered it, when they entered it, and there would be comments about why they entered it.
– The incorporation of that data into a production environment would happen only at carefully orchestrated times (called “releases”) and only after testing had been done to ensure no problems were created.
– If a problem manifested itself, an automated system would be available to roll back the changes one or more versions until the problem went away. This is an important step with extremely serious problems because earlier releases will have been functioning long enough that there are no known serious problems. Rolling back gives time to find the error without exposing all the customers to it in the meanwhile. The error is eliminated and then the updated change is tested, and rolled out again.
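The three steps above can be sketched in a few lines of Python. This is only a toy under my own assumptions, not a description of how Amazon or anyone else does it: the VersionedData class, the field name, and the author name are all hypothetical, and a real system would persist its history rather than keep it in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class DataVersion:
    """One archived version of the data: what, who, when, and why."""
    number: int
    author: str
    comment: str
    entered_at: datetime
    values: dict


class VersionedData:
    """Toy store that treats data like code: commit, release, roll back."""

    def __init__(self, initial: dict):
        first = DataVersion(0, "system", "initial load",
                            datetime.now(timezone.utc), dict(initial))
        self._history = [first]   # every version ever entered
        self._releases = [0]      # version numbers that went live, in order

    def commit(self, author: str, comment: str, changes: dict) -> int:
        """Archive a change without making it live."""
        latest = self._history[-1]
        version = DataVersion(len(self._history), author, comment,
                              datetime.now(timezone.utc),
                              {**latest.values, **changes})
        self._history.append(version)
        return version.number

    def release(self, number: int,
                passes_tests: Callable[[dict], bool]) -> None:
        """Make a version live only if the supplied check accepts it."""
        candidate = self._history[number]
        if not passes_tests(candidate.values):
            raise ValueError(f"version {number} failed pre-release checks")
        self._releases.append(number)

    def rollback(self, steps: int = 1) -> int:
        """Step back through earlier releases; returns the live version."""
        for _ in range(steps):
            if len(self._releases) > 1:
                self._releases.pop()
        return self._releases[-1]

    @property
    def live(self) -> dict:
        """A copy of the data production currently sees."""
        return dict(self._history[self._releases[-1]].values)


# Hypothetical usage: a flag flipped by mistake never reaches production.
store = VersionedData({"category_adult_flag": False})
bad = store.commit("some_editor", "routine catalog update",
                   {"category_adult_flag": True})
try:
    store.release(bad, lambda values: values["category_adult_flag"] is False)
except ValueError:
    pass  # the check rejected the change, so the live data is untouched
```

The check passed to release here is deliberately trivial. The point is only that the change is archived with an author and a comment, and that it cannot go live without passing through some gate.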
Does that sound hopelessly painful as a process? It’s exactly what most software developers go through for even the slightest change to code. I’ll admit it would be too painful for every data change. Amazon must add, delete, or change thousands of new listings every day. Each one can’t be a full development release cycle. But it does seem that applications should have some safety valves built into their fabric, and into the all-important data that they rely on. Changing a listing is different than changing some data that almost 60,000 books rely on in search. That data should be marked as sensitive and handled differently.
There’s lots of architecture and process work to think about in order to avoid or minimize similar problems in the future for a whole host of applications.
It’s the Data Stupid. Vinnie Mirchandani offers more examples of how we get into trouble with data.
TechCrunch gets it all wrong in a guest post by Mary Hodder. Hodder argues unconvincingly that this was all due to an algorithm gone wrong and kept secret. What happened is precisely NOT an algorithm.
If it had been an algorithm, it would have been an unambiguous set of rules by Hodder’s own Wikipedia-quoted definition. Algorithms are explicit for their creators, and they are not accidental. That they may be hidden from users does not detract from their explicitness, purposefulness, or unambiguity.
To assume an algorithm is at fault is to misunderstand what an algorithm is, or to assume that Amazon purposefully set about creating an explicit and unambiguous set of steps to discriminate against Gay and Lesbian titles. While the conspiracy theorists may like to natter on about this theme, it ain’t so.
The problem here, and my point in this blog post, is that we assumed we could focus on the algorithm and ignore the data that fed it. Clearly a bad assumption. In so doing, we allowed a minor change by an individual to create a major PR problem. If we really think it was all a conspiracy against all things Gay, why not look for more explicit evidence? Did Patricia Cornwell’s books about Kay Scarpetta, which include some Gay themes (Scarpetta is Gay), disappear as well? I don’t think so.
It’s been true since we’ve had computers: garbage in leads to garbage out. The data is just as important as the algorithms.