SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.


People Can Build the Semantic Web

Posted by Bob Warfield on October 19, 2007

Semantic Web is one of those terms, like Web 2.0, where you need to define what you mean before plunging ahead.  I'm okay with Wikipedia's definition:

The semantic web is an evolving extension of the World Wide Web in which web content can be expressed not only in natural language, but also in a format that can be read and used by software agents, thus permitting them to find, share and integrate information more easily.

The problem with regular old web pages is that you have to be an intelligent life form to capture all the meaning, and so far, computers don't qualify.  One way to think of the semantic web is that we want to make it possible to add hints of various kinds to a web page so that the meaning becomes explicit and structure can be derived from unstructured data.  The semantic web is a slippery beast, though, as you can imagine.  Easy to talk about, hard to build.  Therein lies the rub between being a person and being a computer: we people don't really know how we do a lot of things, but we think we do.  Computers need to know exactly how to do everything, and the recipe had better be right.
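To make that idea concrete, here's a toy sketch in Python (not real RDF or any actual standard) of the difference between prose a person reads and explicit hints a software agent can act on.  The product and price here are just for illustration:

```python
# A toy illustration (not real RDF) of machine-readable "hints".

# What a human reads on the page:
page_text = "Pioneer makes the KH-100, an in-dash CD player that sells for about $299."

# The same facts as explicit (subject, predicate, object) hints a program can use:
hints = [
    ("Pioneer KH-100", "is_a", "in-dash CD player"),
    ("Pioneer KH-100", "made_by", "Pioneer"),
    ("Pioneer KH-100", "price_usd", 299),
]

def find_subjects(predicate, obj, triples):
    """Return every subject related to obj by the given predicate."""
    return [s for (s, p, o) in triples if p == predicate and o == obj]

# A software agent can now answer a question it could never get from the prose alone:
print(find_subjects("made_by", "Pioneer", hints))   # ['Pioneer KH-100']
```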

Now there is an application out called Twine that lots of people are talking about.  It purports to be the first Semantic Web application.  They say they're going to create a "Semantic Graph" that links people to topics.  Twine is in private beta, but there are screenshots out in the blogosphere to be seen.  It's easy to see why it comes across as social-networky: lots of tags associated with people and organizations that seem to bring similar information together with the people interested in that information.  There is talk of it being like a Wiki married to Facebook, which boils down to being able to organize your Twine-tagged content onto pages of your own choosing; in fact, such a collection is called a "Twine".  Twines can be owned by individuals or groups, and they can be themed.

The really cool part of Twine is that it tries to automatically tag content using proprietary semantic algorithms.  One of the huge challenges of the Semantic Web is figuring out how all that content is going to get the "hints" that say what it is.  They call this process "entity extraction".  They're not promising they can tag everything, but they've established a beachhead around a set of things that are important to people.  Type in a note, and Twine will do entity extraction for people, places, companies/organizations, books, and a few other categories.  That's pretty cool, and I agree, it's a good start.  Supposedly, Twine can also reach conclusions about tags and concepts even when a particular document doesn't carry a tag.  Apparently they've mined some sort of semantic ontology out of Wikipedia.
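Twine's algorithms are proprietary, so this is only a guess at the flavor of entity extraction, not how Twine actually does it: a toy Python sketch that matches a note against lists of known people and companies.  A real system would draw on something far richer, like that ontology mined from Wikipedia; the names here are just pulled from this post for illustration:

```python
import re

# Toy lists of known entities; a real system would use far richer sources.
KNOWN_ENTITIES = {
    "person":  ["Bob Warfield"],
    "company": ["eBay", "Sharper Image", "Microsoft", "Intuit", "Pioneer"],
}

def extract_entities(note):
    """Return the known entities of each type that appear in the note."""
    found = {}
    for entity_type, names in KNOWN_ENTITIES.items():
        hits = [n for n in names if re.search(r"\b" + re.escape(n) + r"\b", note)]
        if hits:
            found[entity_type] = hits
    return found

note = "Bob Warfield once built a system that matched Sharper Image products to eBay listings."
print(extract_entities(note))
# {'person': ['Bob Warfield'], 'company': ['eBay', 'Sharper Image']}
```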

The proof is in the pudding for Twine.  If it is as automatic as they say, it will be very cool.  If it isn't, it'll be just another web thingy.  It's all about whether it can really extract semantics.  I've worked on a couple of projects that lead me to believe it is possible to extract semantics if you use the right combination of humans and computers.  We didn't call what we were doing the Semantic Web, but there were shades of what Twine talks about in both of them.

The first was a little startup idea I had: a search hub for private investors.  It was built around a Wikipedia-like centerpiece.  People would search that wiki to find investment-related topics they were interested in, and users could add new topics.  So far, just like a wiki, although this was pre-wikis, back around 1994.  The other cool thing the system did was use the content of the wiki "article" for each term to create a query for the search engines of the day.  You saw the results as a sort of newsfeed-like section at the bottom of the article.  The reason we never finished more than a prototype was that, at the same time, Microsoft was talking about buying Intuit, and every investor I talked to wanted no part of being a potential competitor to that.  Doh!
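We never got further than a prototype, so take this as a rough sketch of the flavor of the idea rather than a description of what we built: pull the most distinctive terms out of a wiki article and string them together into a query.  The stopword list and the sample article are made up for illustration:

```python
from collections import Counter
import re

# A toy stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "are", "from", "by", "that"}

def article_to_query(article_text, max_terms=5):
    """Build a search query from the most frequent meaningful words in a wiki article."""
    words = re.findall(r"[a-z]+", article_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return " ".join(term for term, _ in counts.most_common(max_terms))

article = ("A municipal bond is a bond issued by a city or other local government. "
           "Municipal bonds are often exempt from federal income tax.")
print(article_to_query(article))   # e.g. 'municipal bond issued city other'
```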

The second similar thing was PriceRadar.  This system was aimed at figuring out what product eBay ads were really selling, and how to optimize listings for others who wanted to sell the same product.  To do that, we created a taxonomy that had over 1 million SKUs ("SKU" is a retail term for a particular item).  The challenge with eBay, as any user knows, is that the data is really dirty.  Some ads are even intentionally misleading.  And there is a lot of data flowing: at the time we would process on the order of 1 million new ads per nightly run, or some such.

We approached the problem in a novel way.  We hired Library Science and English majors to create our taxonomy.  Each node of the taxonomy basically contained a little specialized search written for a special search engine we had created.  It had a lot of features that made it much easier to get really clean search results than you could with something like Google.  For one thing, the queries built on one another.  You couldn't get into the "Pioneer KH-100 in-dash CD player" category until you had gotten past the "Stereo", "Car Stereo", "CD Player", "In-dash", and other categories that led to that final category.  We had statistical sampling tools that let us measure how well our algorithms were working, and that also let us take a category and figure out likely ways to break it down further into sub-categories.  Just 4 of our "taxonomy experts" were able to crank out this giant taxonomy in a little less than a year.  That's actually not very expensive given the scope of the project, and the pace accelerates greatly once the major bones of the skeleton are in place.
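To give a feel for how the queries built on one another, here's a toy sketch.  The matching rules and category terms are made up for illustration and are far simpler than what our real search engine could do:

```python
class Category:
    """A taxonomy node whose query terms build on its parent's."""
    def __init__(self, name, required_terms, parent=None):
        self.name = name
        self.required_terms = set(required_terms)
        self.parent = parent

    def all_terms(self):
        # A listing only reaches this node if it also satisfied every ancestor.
        terms = set(self.required_terms)
        if self.parent:
            terms |= self.parent.all_terms()
        return terms

    def matches(self, listing_title):
        title = listing_title.lower()
        return all(t in title for t in self.all_terms())

stereo     = Category("Stereo", {"stereo"})
car_stereo = Category("Car Stereo", {"car"}, parent=stereo)
cd_player  = Category("CD Player", {"cd"}, parent=car_stereo)
in_dash    = Category("In-dash", {"in-dash"}, parent=cd_player)
kh100      = Category("Pioneer KH-100 in-dash CD player", {"pioneer", "kh-100"}, parent=in_dash)

print(kh100.matches("Pioneer KH-100 in-dash car stereo CD player, mint!"))  # True
print(kh100.matches("Sony 10-disc CD changer"))                             # False
```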

We then used the taxonomy on other sites, such as catalog retailers.  We could go into the Sharper Image web site and automatically match up products there with items on eBay with very high accuracy.  From that, we could tell Sharper Image exactly what they could expect to get for an item on eBay, how many they could sell without crashing the price, how to create an optimal listing, and so on.  Very cool stuff, and very much along the lines of the Semantic Web.
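Here's a rough sketch of that matching step, reusing the toy Category class and the kh100 category from the sketch above.  The listings and prices are invented, and the real system did much more (listing optimization, volume estimates, and so on):

```python
from statistics import median

def expected_ebay_price(matches, completed_listings):
    """Summarize completed eBay sales whose titles a category's matches() accepts.

    `matches` is a predicate like the toy Category.matches from the sketch above;
    `completed_listings` is a list of (title, sale_price) pairs.
    """
    prices = [price for title, price in completed_listings if matches(title)]
    return {"sold": len(prices),
            "expected_price": median(prices) if prices else None}

# Example, using the kh100 category defined in the previous sketch:
completed = [
    ("Pioneer KH-100 in-dash car stereo CD player", 85.00),
    ("PIONEER KH-100 car in-dash stereo cd player NEW", 110.00),
    ("Sony Walkman", 20.00),
]
print(expected_ebay_price(kh100.matches, completed))
# {'sold': 2, 'expected_price': 97.5}
```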

Alas, PriceRadar fell victim to the dot-com bust.  We wound up selling the company to eBay for a song, and I never heard from them what became of it.

The moral is that you can marry humans and algorithms to produce very high quality Semantic Web-like entity extraction.  It’s not as impossible as most people think.  When you can’t make a computer do everything automatically, figure out what it can do automatically, and then make humans awesomely efficient at doing the rest by giving them the right tools.  It worked for our Taxonomers, and it could work for a lot of similar problems as well.

Posted in strategy, Web 2.0