Merry Christmas from Google: Cavalier Daily in Google News

A nice Christmas present from the Googlemind: if not a complete run, then a pretty good sampling of the full archives of the Cavalier Daily and its predecessor College Topics, the long-standing student newspaper of the University of Virginia.

The boon to a researcher of the University (or the Virginia Glee Club) cannot be overestimated. In just a few minutes I found:

If Google News’s presentation of archival newspapers leaves something to be desired (I find it much more difficult to manage searching through a single issue than with the UVA library’s search interface), there is still a real treasure trove here, and not just on the Glee Club but on just about every other topic.

Google Chrome 1.0 (.154.36)

Well, that was fast. Google Chrome went from new to 1.0 in about 100 days:

[screenshot: Chrome’s About box showing version 1.0.154.36]

But is it ready? And why so soon?

[screenshot: the WordPress 2.7 login box rendered in Chrome, with black corners around the box]

I expected Google to add more features over time, since the browser’s architectural improvements alone didn’t seem to meet the critical-differentiator threshold to justify launching a new browser. But that didn’t really happen. In fact, Google seems to be launching Chrome with some rough edges intact. Check out the snippet of the WordPress 2.7 login screen above. See those black edges around the box? That’s a rendering bug in Chrome’s version of WebKit. (The black corners aren’t there in Safari.)

So: Google is rushing a new browser that they “accidentally” leaked just 100 days ago, a browser with significant speed but demonstrable rendering flaws, into an already crowded market. Why? And why launch two days after previewing the Google Native Client plug-in, a web technology that seems a far more significant leap forward?

My guess: they’re scared of having their thunder stolen, maybe by Firefox. The new Mozilla JavaScript engine, TraceMonkey, appears to be running neck-and-neck with Google’s V8. And when the major feature in your browser is speed, you don’t want to risk being merely as good as your better established competitor. So maybe releasing Chrome ahead of Firefox 3.1 (which still has no release date, and at least one more beta to go) was simply a defensive move to make sure they aren’t competitively dead before they launch.

Remix culture: NASA’s bootleg Snoopy from 1969

I had read about NASA’s use of Snoopy and the Peanuts characters as unofficial mascots for Apollo 10 (it was well documented in Charlie Brown and Charlie Schulz, which sat on my Pop-Pop’s bookshelf alongside the Peanuts Treasury), but don’t remember seeing this. Courtesy Google Image Search and the LIFE archives:

As good an argument for the Commons as I’ve ever seen. The irony, of course, is that it sits in Google Images with no reasonable licensing in place. Even this bootleg image is claimed as copyright of LIFE magazine.

Google LIFE archive: where’s the usage rights?

I’m impressed by the new LIFE photo archive at Google Images–it’s a truly significant work of digital content. But it’s missing one important thing: a usage policy. The images are marked (c) Time Inc., so it’s clear they aren’t public domain. But is there any way to purchase usage rights? The only reuse provision seems to be a framed print purchase.

Compare it to what Flickr does with the images in its commons, or anywhere else for that matter–a clear licensing agreement, selectable by the poster, that explains how images can be used. The LIFE archive may be visually striking, but it would be much more valuable if the images could have a life beyond Google’s servers.

Google and publishers agree to sit down and make some money

New York Times: Google Settles Suit Over Book-Scanning. It’s good to see the book publishing industry come to its senses.

Now that the parties have agreed to revenue sharing from book sales and library use, it becomes even clearer that Google Books is yet another Internet-mediated disintermediation. Google Books is probably the best targeted marketing vehicle for the book industry since the original Amazon, thanks to its reach, its ease of use, and its ability to see through the previously opaque covers of books to help us find useful content. I’ve personally found it more useful than the usual suspects (book reviews, bestseller lists) when it comes to finding useful research works; sometimes you need to read the original book to decide whether it’s useful to you, rather than relying on third-hand opinion.

Here’s to a win for all involved–Google, book publishers, and above all, for you and me.

BrowseRank and the challenge of improving search

I posted a quick link to an article about Microsoft’s new BrowseRank search technology a few days ago. Here’s why the paper is informative, why I think BrowseRank is an interesting technology for improving search, and why I think it’s doomed as a general-purpose basis for building relevance data for the web.

Informative: This paper should be required reading for anyone who wants to know the fundamentals of how web search ranking currently works, what PageRank actually does for Google, and how to objectively test the quality of a search engine. It also offers an interesting two-pronged critique of PageRank:

  • PageRank can be manipulated. PageRank assumes that a link from a page with authority to another page confers some higher rank on the second page. The paper points out the well-known issue that, since the “authority” of the first page is itself derived from inbound links, it’s possible to use Google bombing, link farms, and other mechanisms to artificially inflate the importance of individual pages for fun and profit. It’s pretty well known that Google periodically adjusts its implementation of PageRank to correct for this problem. (A toy sketch of the algorithm and of this manipulation follows this list.)
  • PageRank neglects user behavior. The paper argues this somewhat tendentiously, saying that PageRank doesn’t incorporate information about the amount of time the user spends on the page–of course, the paper’s whole hypothesis is that time on page matters, so this doesn’t reveal any deep insight into PageRank. But it’s an interesting point that PageRank does assume that only web authors contribute to the ranking algorithm. Or does it? I’ll come back to this in a bit.
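
To make the manipulation concrete, here’s a minimal sketch of the power-iteration idea behind PageRank, run against a toy link graph with a link farm attached. The graph, damping factor, and page names are illustrative assumptions of mine, not anything from the paper or from Google’s actual implementation.

```python
# Minimal PageRank-style power iteration over a toy link graph.
# Damping factor and iteration count are conventional illustrative values.
DAMPING = 0.85

def pagerank(links, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - DAMPING) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += DAMPING * rank[page] / n
            else:
                share = DAMPING * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# A link farm: twenty throwaway pages pointing at "spam" inflate its
# rank without a single human endorsement.
graph = {"home": ["about", "blog"], "about": ["home"],
         "blog": ["home", "spam"], "spam": []}
graph.update({f"farm{i}": ["spam"] for i in range(20)})
print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1])[:3])
```

Run it and “spam” floats to the top of the ranking, which is exactly the manipulation the paper (and Google’s periodic adjustments) are responding to.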

Interesting: The proposed BrowseRank algorithm uses user data–pages visited, browse activity, and time on page–to create a user browsing graph that relies on the user’s activity in the browser to confer value on pages. The authors suggest that the user data could be provided by web server administrators, in the form of logs, or directly by users via browser add-ins. A footnote helpfully suggests that “search engines such as Google, Yahoo, and Live Search provide client software called toolbars, which can serve the purpose.”
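
For a sense of how this might work mechanically, here’s a rough sketch that builds a browsing graph from session logs and weights visit probability by mean dwell time. The paper actually models browsing as a continuous-time Markov process; this discrete approximation, and the toy session data, are my own illustrative assumptions rather than the authors’ algorithm.

```python
from collections import defaultdict

def browse_rank(sessions, iterations=50):
    """sessions: lists of (url, seconds_on_page) pairs in visit order."""
    transitions = defaultdict(lambda: defaultdict(int))
    dwell = defaultdict(list)
    for session in sessions:
        for (page, _), (nxt, _) in zip(session, session[1:]):
            transitions[page][nxt] += 1
        for page, seconds in session:
            dwell[page].append(seconds)

    pages = list(dwell)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_score = {p: 0.0 for p in pages}
        for page in pages:
            outs = transitions[page]
            total = sum(outs.values())
            if total == 0:  # session ended here: restart anywhere
                for p in pages:
                    new_score[p] += score[page] / len(pages)
            else:
                for target, count in outs.items():
                    new_score[target] += score[page] * count / total
        score = new_score

    # Weight stationary visit probability by mean time on page:
    # the "implicit vote" the authors describe.
    return {p: score[p] * (sum(dwell[p]) / len(dwell[p])) for p in pages}

sessions = [[("news", 40), ("article", 300), ("spam", 2)],
            [("news", 25), ("article", 240)]]
print(browse_rank(sessions))
```

On this toy data “article” dominates: it is both visited often and dwelled on, while the two-second “spam” visit barely registers.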

The claim of the paper is that user behavior such as time on page confers an “implicit vote” on the content in a way that’s harder to spam than PageRank. I’ll come back to this point too.

Doomed: BrowseRank relies on the following:

  1. A way to obtain a statistically valid sample of user browsing information
  2. A reliable way to determine intent from user browsing information, such as session construction
  3. A statistically valid link between time on page and page quality

There are non-trivial problems with each of these requirements.

User browsing information. The paper proposes that user browsing data can be obtained by the use of a client-side browser add-in or by parsing server logs, and says that this practice would eliminate linkspam. Well, yeah, but it opens up two concerns: first, how are you going to recruit those users and site administrators so that you get a representative sample? And second, how do you ensure that the users are not themselves spamming the quality information? In the first case, we have plenty of evidence (Alexa, Comscore) that user-driven panel results can yield misleading information about things like site traffic. In the second case, we know that it’s trivial to trick the browser into doing things even without having a toolbar installed (botnet, anyone?), and it’s been proven that Alexa rankings can be manipulated.

In short, the user browse data model faces two compounding problems: it’s hard enough to recruit a representative panel of honest users willing to install a browser plugin that monitors their online activity, and screening spam activity out of the resulting data is harder still.

Session construction: Knowledge about the user’s session is one of those interesting things that turn out to be quite difficult to construct in practice, especially when you care about meaningful time-on-page data. The method described in the Microsoft paper is pretty typical (a sketch of it follows the list below), and neglects usage patterns like the following:

  1. Spending large amounts of time in a web app UI opening tabs to read later (a web-based blog aggregator)
  2. Going quickly back and forth between multiple windows or multiple tabs (continuous partial attention)
  3. The last page in a session gets assigned too much time on page because of the arbitrary 30 minute session limit (the “bathroom break” problem)
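
Here’s a sketch of the typical sessionization approach the paper describes: split a user’s page views on a 30-minute inactivity gap and derive time on page from the delta to the next view. The log format and the handling of the session’s final page are illustrative assumptions on my part.

```python
SESSION_GAP = 30 * 60  # seconds of inactivity that ends a session

def sessionize(views):
    """views: (timestamp_seconds, url) pairs, sorted by time."""
    sessions, current = [], []
    for ts, url in views:
        if current and ts - current[-1][0] > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)

    timed = []
    for session in sessions:
        pages = [(url, next_ts - ts)
                 for (ts, url), (next_ts, _) in zip(session, session[1:])]
        # The last view has no successor; whatever value gets assigned
        # here is where the "bathroom break" inflation creeps in.
        pages.append((session[-1][1], None))
        timed.append(pages)
    return timed

# A 31-minute gap splits the log into two sessions; page "c" ends a
# session and so has no honest dwell measurement at all.
views = [(0, "a"), (60, "b"), (90, "c"), (90 + 31 * 60, "d")]
print(sessionize(views))
```

Note too that tab-heavy and multi-window browsing (patterns 1 and 2 above) break the linear, one-page-after-another assumption baked into pairing each view with the next.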

Time on page as an indicator of search quality: This is where my main gripe with the article comes from. The authors conclude that their user browsing graph yields better results than PageRank and TrustRank. The problem is, better results at what? The tests posed were to construct a top 20 list of web sites; to differentiate between spam and non-spam sites; and to identify relevant results for a sample query. The authors claim BrowseRank’s superiority in all three areas. I would argue that the first test is irrelevant, the second was not done on an even playing field, and the third is incomplete. To wit: First, if you aren’t using the relationship between web pages in your algorithm, you don’t need to know what the absolute top 20 sites are, because that information is completely irrelevant to the results for a specific query. Second, conducting a spam-sorting test with user input on a spammy corpus but without spammy users is not a real-world test.

Third, the paper’s authors themselves note that “The use of user behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages, which have low frequency or even zero frequency in the user behavior data.” In other words, BrowseRank is great, if you only care about what everyone else cares about. The reality is that most user queries are in the long tail, so optimizing how you’re doing on the head web pages is a little like rearranging deck chairs on the Titanic. And because we don’t know what the sample queries were for this part of the study, it’s impossible to tell for which type of searches BrowseRank performs better.

Finally, there’s a real philosophical difference between BrowseRank and PageRank. BrowseRank assumes that the only interaction a user can have with a web page is to read it. (This is the model of user as consumer.) PageRank makes a more powerful assumption: that the user is free to contribute to the web by adding to it, specifically by writing new content and linking to what they find valuable. The paper talks a lot about Web 2.0 in the context of sites like MySpace and Facebook, but arguably PageRank, which implicitly empowers the user by assuming their equal participation in authoring the Web, is the more Web 2.0-like metric.