In a previous blog post, we introduced the Four Demons of Search that search engines must constantly battle: query speed, relevance (ranking results), frequency of index updates and recall (indexing historical data). As we noted, traditional web search engines compromise on update frequency; real-time search engines typically compromise on recall and relevance. In this post we look at each of these four problems and how they interact. We show with examples what choices search engines typically make, and illustrate how we’ve tried to architect the Topsy search platform to scale up in all four dimensions.
Search is advanced mind-reading
Today’s search experience is one of advanced mind-reading. Users have a question in mind that they want answered – “What is the weather in San Francisco over the next few days?”. This question is articulated, poorly, in the form of a query – “weather”. The search engine takes the query along with all sorts of other data, such as the user’s location, what type of search results other users select when they search for “weather”, and tries to guess what the user really has in mind.
Google, for instance, guesses right when I type “weather” and gives me the weather in San Francisco, rather than, say, the Wikipedia page on weather. Topsy’s guesswork is better for questions related to what people are talking about right now – when I search for “sheriff”, Topsy knows I want to know about the Justice Dept suing Arizona’s Sheriff Arpaio (and not the home page of the San Francisco Sheriff or the Wikipedia page, which is Google’s guess). Either way, when search engines have to provide results by guessing user intent, they will often be wrong. As search consumers, we’ve all been trained to refine the query when search results appear wrong, so that it is clearer what we mean.
People expect websites to be fast, and search engines to be faster. The user experience for search is one of continuous refinement, which works only when results for a query are returned very fast. Google’s Jake Brutlag and Bing’s Eric Schurman quantified the effect of speed on search in a talk at last year’s Velocity Conference, reported by O’Reilly Radar. A chart from Bing in that talk makes the point particularly clearly: when search results take even half a second, user interest drops off rapidly.
Search engines are always trying to speed up the process of delivering search results – Google Instant being the latest example.
One way of making a search engine faster is to throw more hardware at the problem, although more servers can also add latency. Search engines typically compromise with the other three demons in order to maintain speed. Speed is achieved by:
- simplifying the process of finding results (simplifying ranking, reducing relevance);
- searching a smaller data set (reducing recall);
- or limiting updates to the index – i.e. avoiding reading from the index while also writing to it.
In practice, search engines make some combination of all these compromises for the sake of speed.
Not all search engines are designed to rank results. Real-time results are often displayed without any ranking whatsoever, in a reverse chronology stream (newest first) of items matching the query term. This is what a search for ‘peace talks’ on search.twitter.com and Collecta shows. It’s also what Google shows in their “Latest” view, and Bing shows in their Social / Public Updates view.
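As a rough illustration of an unranked real-time stream, here is a minimal Python sketch that filters items by query term and sorts them newest first. It is not how any of the sites above implement their streams; the `Post` type, the `latest_matching` function and the sample data are all made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Post:
    # Hypothetical item type: a piece of text and the time it was posted.
    text: str
    posted_at: datetime


def latest_matching(posts, query, limit=10):
    """Return posts containing every query term, newest first.

    No relevance score is computed: the only ordering signal is time.
    """
    terms = query.lower().split()
    matches = [p for p in posts if all(t in p.text.lower() for t in terms)]
    matches.sort(key=lambda p: p.posted_at, reverse=True)
    return matches[:limit]


# Made-up sample data for illustration only.
posts = [
    Post("Peace talks resume on Monday", datetime(2010, 9, 2, 9, 0)),
    Post("Weather looks fine today", datetime(2010, 9, 2, 9, 5)),
    Post("Analysts doubt the peace talks will last", datetime(2010, 9, 2, 9, 10)),
]
stream = latest_matching(posts, "peace talks")
```

Because nothing beyond a term match and a timestamp is needed, a new item can appear in the stream as soon as it is indexed. The cost is that a throwaway remark and a widely shared report are treated identically.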
Avoiding ranking is clearly the simplest way to beat the relevance demon. More typically, search engines use complex methods of ranking results based on, among other things, the “importance” of the result. This importance is often a function of websites that link to the result, which means that the link structure of the entire web has to be available and computed before results can be ranked. This is one reason why real-time results aren’t ranked by these sites, and why new documents take a relatively long time to rank high in search results on Google or Bing.
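To make the dependence on link structure concrete, here is a small Python sketch of a PageRank-style importance score, computed by repeatedly passing each page's score along its outgoing links. This is a textbook simplification, not the actual ranking function of Google, Bing or Topsy; the `link_importance` function, the damping value and the example domains are assumptions chosen for illustration.

```python
def link_importance(links, damping=0.85, iterations=50):
    """PageRank-style score: a page is important if important pages link to it.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * score[page] / n
                for p in pages:
                    new[p] += share
        score = new
    return score


# Hypothetical four-page web; "new.example" has no inbound links yet.
web = {
    "a.example": ["c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "new.example": ["c.example"],
}
scores = link_importance(web)
```

Note that the whole graph has to be in hand before any score is meaningful, and a freshly published page such as `new.example` only receives the baseline score until other pages start linking to it.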
At Topsy, we wanted to rank social content in real-time, as we believe that, for our users, freshness and relevance go hand in hand. So we developed our patent-pending citation search model, where the multi-level link structure between authors and search results is retained within the index in a way that allows ranking to be performed coherently and continuously. You can see this in Topsy’s “past hour” results for ‘peace talks’.
Continued in Four Demons of Search: Part 2. Check back here next week to read our thoughts on Update Frequency and Recall!