links for 2008-07-29

Upcoming: Business of Software 2008 in Boston

I was about to delete an email from Bob Cramblitt on my old blog, until I actually read it and realized it was relevant to at least some of my readers:

Hi Tim:

Thought you’d like to know that Seth Godin, Joel Spolsky, Jason Fried and others are coming to Boston for the Business of Software 2008 conference.  This is the only conference run by people who actually manage successful software companies.  All substance, no BS and not a Web 2.0 to be found.

Your blog readers can get $100 off registration by entering “MASS” when registering at www.businessofsoftware.org.

So there you go. Never let it be said that reading my blog got you nowhere. (Disclaimer: this was my only contact with Bob Cramblitt and I’m not getting anything for posting this.)

BrowseRank and the challenge of improving search

I posted a quick link to an article about Microsoft’s new BrowseRank search technology a few days ago. Here’s why the paper is informative, why I think BrowseRank is an interesting technology for improving search, and why I think it’s doomed as a general-purpose basis for building relevance data for the web.

Informative: This paper should be required reading for anyone who wants to know the fundamentals of how web search ranking currently works, what PageRank actually does for Google, and how to objectively test the quality of a search engine. It also offers an interesting two-pronged critique of PageRank:

  • PageRank can be manipulated. PageRank assumes that a link from a page with authority to another page confers some higher rank on the second page. The paper points out the well-known issue that, since the “authority” of the first page is also derived from inbound links, it’s possible to use Google bombing, link farms and other mechanisms to artificially inflate the importance of individual pages for fun and profit. It’s pretty well known that Google periodically adjusts its implementation of PageRank to correct for this problem.
  • PageRank neglects user behavior. The paper argues this somewhat tendentiously, saying that PageRank doesn’t incorporate information about the amount of time the user spends on the page–of course, the paper’s whole hypothesis is that time on page matters, so this doesn’t reveal any deep insight into PageRank. But it’s an interesting point that PageRank does assume that only web authors contribute to the ranking algorithm. Or does it? I’ll come back to this in a bit.
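To make the manipulation point concrete, here is a minimal sketch of textbook power-iteration PageRank on a toy web, run once without and once with a link farm. It is not Google's production algorithm, and the page names, the damping value of 0.85 and the ten-page farm are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over {page: [outbound pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: nobody links to "spam".
honest_web = {
    "news": ["blog", "shop"],
    "blog": ["news"],
    "shop": ["news"],
    "spam": ["news"],
}

# The same web plus a made-up link farm: ten pages that only link to "spam".
farmed_web = dict(honest_web)
for i in range(10):
    farmed_web[f"farm{i}"] = ["spam"]

before = pagerank(honest_web)
after = pagerank(farmed_web)
print(f"spam share of rank without farm: {before['spam']:.3f}")
print(f"spam share of rank with farm:    {after['spam']:.3f}")
```

The farm pages have no authority of their own, but each one still passes along its baseline share, which is enough to lift the spam page's score. That is the loophole the paper is describing.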

Interesting: The proposed BrowseRank algorithm uses user data–pages visited, browse activity, and time on page–to create a user browsing graph that relies on the user’s activity in the browser to confer value on pages. The authors suggest that the user data could be provided by web server administrators, in the form of logs, or directly by users via browser add-ins. A footnote helpfully suggests that “search engines such as Google, Yahoo, and Live Search provide client software called toolbars, which can serve the purpose.”

The claim of the paper is that user behavior such as time on page confers an “implicit vote” on the content in a way that’s harder to spam than PageRank. I’ll come back to this point too.
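A rough sketch of the "implicit vote" idea, under heavy assumptions: this is not the paper's algorithm, just a simplified stand-in that builds a browsing graph from observed sessions, estimates how often a random browser lands on each page, and weights that by the mean time on page. The function name, the damping value and the sample sessions are all hypothetical.

```python
from collections import defaultdict


def browse_scores(sessions, damping=0.85, iterations=50):
    """Simplified browsing-graph score (an illustration, not BrowseRank itself).

    sessions: list of sessions, each a list of (url, seconds_on_page).
    """
    transitions = defaultdict(lambda: defaultdict(int))
    dwell = defaultdict(list)
    pages = set()
    for session in sessions:
        for i, (url, seconds) in enumerate(session):
            pages.add(url)
            dwell[url].append(seconds)
            if i + 1 < len(session):
                # Count an observed click from this page to the next one.
                transitions[url][session[i + 1][0]] += 1

    n = len(pages)
    visit = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = transitions.get(p)
            total = sum(out.values()) if out else 0
            if total:
                # Follow outbound clicks in proportion to how often users did.
                for q, count in out.items():
                    nxt[q] += damping * visit[p] * count / total
            else:
                # Sessions ended here: restart anywhere with equal odds.
                for q in pages:
                    nxt[q] += damping * visit[p] / n
        visit = nxt

    # Weight the visit probability by the mean time users stayed on the page.
    raw = {p: visit[p] * sum(dwell[p]) / len(dwell[p]) for p in pages}
    norm = sum(raw.values()) or 1.0
    return {p: score / norm for p, score in raw.items()}


# Made-up sessions: a portal people click through, an article they read,
# and a splash page they bounce off.
sessions = [
    [("portal", 5), ("article", 240)],
    [("portal", 4), ("article", 180)],
    [("portal", 6), ("splash", 3)],
]
scores = browse_scores(sessions)
for url in sorted(scores, key=scores.get, reverse=True):
    print(f"{url}: {scores[url]:.3f}")
```

In this toy data the article wins by a wide margin even though no page "links" to it in the PageRank sense; the only votes are clicks and seconds.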

Doomed: BrowseRank relies on the following:

  1. A way to obtain a statistically valid sample of user browsing information
  2. A reliable way to determine intent from user browsing information, such as session construction
  3. Evidence that time on page is a statistically valid indicator of page quality

Each of these requirements poses non-trivial problems.

User browsing information. The paper proposes that user browsing data can be obtained through a client-side browser add-in or by parsing server logs, and says that this practice would eliminate linkspam. Well, yeah, but it opens up two concerns: first, how are you going to recruit those users and site administrators so that you get a representative sample? And second, how do you ensure that the users are not themselves spamming the quality information? In the first case, we have plenty of evidence (Alexa, Comscore) that user-driven panel results can yield misleading information about things like site traffic. In the second case, we know that it’s trivial to trick the browser into doing things even without having a toolbar installed (botnet, anyone?), and it’s been demonstrated that Alexa rankings can be manipulated.

So the user browse data model has two main problems: it’s difficult enough to recruit a representative panel of honest users to install a browser plugin that will monitor their online activities, and screening out spam activity is harder still.

Session construction: Knowledge about the user’s session is one of those interesting things that turn out to be quite difficult to construct in practice, especially when you care about meaningful time on page data. The method described in the Microsoft paper is pretty typical, and neglects usage patterns like the following:

  1. Spending large amounts of time in a web app UI opening tabs to read later (web based blog aggregator)
  2. Going quickly back and forth between multiple windows or multiple tabs (continuous partial attention)
  3. The last page in a session gets assigned too much time on page because of the arbitrary 30 minute session limit (the “bathroom break” problem)
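To show where these patterns bite, here is a minimal sketch of generic timeout-based session construction. It is an assumption-laden illustration rather than the paper's exact procedure: the event format, the function name and the choice to leave the last page's time as None are mine. Only the 30 minute cutoff comes from the discussion above.

```python
SESSION_TIMEOUT = 30 * 60  # the 30 minute session limit, in seconds


def build_sessions(events, timeout=SESSION_TIMEOUT):
    """Split one user's page views into sessions.

    events: list of (timestamp_seconds, url), sorted by timestamp.
    Returns a list of sessions, each a list of (url, seconds_on_page).
    Time on page is simply the gap until the next page view.
    """
    sessions = []
    current = []
    for i, (timestamp, url) in enumerate(events):
        gap = events[i + 1][0] - timestamp if i + 1 < len(events) else None
        if gap is not None and gap <= timeout:
            # Any gap under the cutoff is counted as time spent on this page,
            # whether the user was reading or away from the keyboard.
            current.append((url, gap))
        else:
            # Last page of a session: nothing in the data says when the
            # user actually left, so the time on page is unknown.
            current.append((url, None))
            sessions.append(current)
            current = []
    return sessions


# Made-up click stream: the user opens an article, wanders off for
# 29 minutes, comes back, clicks twice, then disappears for an hour.
events = [
    (0, "reader"),
    (20, "article"),
    (1760, "reader"),
    (1770, "post"),
    (5400, "reader"),
]
for session in build_sessions(events):
    print(session)
```

The article is credited with 1,740 seconds of attention because the break happened to fall just under the cutoff, and the last page of each session has no measurable time at all. Whatever value an implementation assigns there is a guess.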

Time on page as an indicator of search quality: This is where my main gripe with the paper comes from. The authors conclude that their user browsing graph yields better results than PageRank and TrustRank. The problem is, better results at what? The tests posed were to construct a top 20 list of web sites; differentiate between spam and non-spam sites; and identify relevant results for a sample query. The authors claim BrowseRank’s superiority in all three areas. I would argue that the first test is irrelevant; the second was not done on an even playing field; and the third is incomplete. To wit: First, if you aren’t using the relationship between web pages in your algorithm, you shouldn’t need to know what the absolute top 20 sites are because the information is completely irrelevant to the results for a specific query. Second, testing spam sorting with user input on a spammy corpus, but without any spammy users, is not a real-world test.

Third, the paper’s authors themselves note that “The use of user behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages, which have low frequency or even zero frequency in the user behavior data.” In other words, BrowseRank is great, if you only care about what everyone else cares about. The reality is that most user queries are in the long tail, so optimizing how you’re doing on the head web pages is a little like rearranging deck chairs on the Titanic. And because we don’t know what the sample queries were for this part of the study, it’s impossible to tell for which type of searches BrowseRank performs better.

Finally, there’s a real philosophical difference between BrowseRank and PageRank. BrowseRank assumes that the only interaction a user can have with a web page is to read it. (This is the model of user as consumer.) PageRank makes a more powerful assumption: that a user is free to contribute to the web by adding to it, specifically by writing new content. The paper talks a lot about Web 2.0 in the context of sites like MySpace and Facebook, but arguably PageRank, which implicitly empowers users by assuming their equal participation in authoring the Web, is the more Web 2.0-like metric.