Personal wikis, and other diversions

The gloaming

So here I am back in Lenox. It’s beautiful, but there are ominous skies and a day of Russian ahead; our residency for Tchaikovsky’s Eugene Onegin has begun.

I’m currently flashing back to my one encounter with the language, a class in 1986, and am very grateful that I was exposed to the soft consonants ahead of time. Some of our Boston-bred palates are having real difficulty with the vowel sounds, though you can’t tell en masse, thank goodness.

It’s always a crapshoot, the lodging that our fair parent organization provides. Usually it’s just fine, but tonight my roommate isn’t here, they almost mixed up my room with a bunch of sopranos next door, and I had to manually configure my IP address so that I could get on the motel wireless. But I’m on now. (And it’s a good thing I’m not doing demos anymore; it’s slow, slow, slow.)

Around Boston: new light, old park

Two stories caught my eye in the Globe, one with proximity to my vocation and one to my avocation.

The first was regarding the undeveloped land to the south of our offices in Burlington. Pointedly subtitled “city can’t develop land in Burlington, Woburn,” the story details the ongoing dance between citizens of the suburbs who want to see Mary Cummings Park maintained as parkland, and the City of Boston, which was deeded the land by Cummings on the condition that it stay a “public pleasure ground,” and which apparently would prefer that nothing ever be done with it. If the city can’t develop it, that is. A word to the Friends of the Park: better keep a close eye on the docket. Boston’s actions here smell like a delaying tactic until the city can get a judge to break the conditions of the deed and allow it to sell the property to developers.

Speaking of delays, the second article regards the removal of the blackout panels over the top windows in Symphony Hall. I remember looking up at the interior panels from the stage during a rehearsal this spring and wondering about them to my fellow tenors, none of whom agreed that they were really windows. And no wonder; there’s no living memory of them ever having been windows. The panels were put into place in the early 1940s, and their removal, I imagine, leaves the old hall emerging blinking into the sunlight like Hiroo Onoda. But the long-delayed removal, as the article highlights, indicates the profoundly conservative attitude of the BSO regarding the hall’s acoustics. I wonder what the impact on the aesthetics will be.

links for 2008-07-29

Upcoming: Business of Software 2008 in Boston

I was about to delete an email from Bob Cramblitt on my old blog, until I actually read it and realized it was relevant to at least some of my readers:

Hi Tim:

Thought you’d like to know that Seth Godin, Joel Spolsky, Jason Fried and others are coming to Boston for the Business of Software 2008 conference.  This is the only conference run by people who actually manage successful software companies.  All substance, no BS and not a Web 2.0 to be found.

Your blog readers can get $100 off registration by entering “MASS” when registering at www.businessofsoftware.org.

So there you go. Never let it be said that reading my blog got you nowhere. (Disclaimer: this was my only contact with Bob Cramblitt and I’m not getting anything for posting this.)

BrowseRank and the challenge of improving search

I posted a quick link to an article about Microsoft’s new BrowseRank search technology a few days ago. Here’s why the paper is informative, why I think BrowseRank is an interesting technology for improving search, and why I think it’s doomed as a general-purpose basis for building relevance data for the web.

Informative: This paper should be required reading for anyone who wants to know the fundamentals of how web search ranking currently works, what PageRank actually does for Google, and how to objectively test the quality of a search engine. It also offers an interesting two-pronged critique of PageRank:

  • PageRank can be manipulated. PageRank assumes that a link from a page with authority to another page confers some higher rank on the second page. The paper points out the well-known issue that, since the “authority” of the first page is also derived from inbound links, it’s possible to use Google bombing, link farms and other mechanisms to artificially inflate the importance of individual pages for fun and profit (a toy demonstration follows this list). It’s pretty well known that Google periodically adjusts its implementation of PageRank to correct for this problem.
  • PageRank neglects user behavior. The paper argues this somewhat tendentiously, saying that PageRank doesn’t incorporate information about the amount of time the user spends on the page–of course, the paper’s whole hypothesis is that time on page matters, so this doesn’t reveal any deep insight into PageRank. But it’s an interesting point that PageRank does assume that only web authors contribute to the ranking algorithm. Or does it? I’ll come back to this in a bit.
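
To make the manipulation point concrete, here’s a toy power-iteration version of PageRank in Python. This is a sketch of the textbook algorithm, not Google’s production implementation; the damping factor and the example graph are my own illustration. Note how twenty throwaway “farm” pages pointing at a single target push its rank past pages with genuine links:

```python
# Toy PageRank power iteration (a sketch of the textbook algorithm,
# not Google's production implementation).
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a base share from the "random surfer" jump...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if not outs:
                # ...dangling pages spread their rank evenly everywhere...
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                # ...and each link passes a slice of authority downstream.
                for target in outs:
                    new_rank[target] += damping * rank[page] / len(outs)
        rank = new_rank
    return rank

# A link farm at work: 20 throwaway pages exist only to point at "spam".
graph = {"a": ["b"], "b": ["a", "spam"], "spam": []}
graph.update({f"farm{i}": ["spam"] for i in range(20)})
print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1])[:3])
```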

Interesting: The proposed BrowseRank algorithm uses user data–pages visited, browse activity, and time on page–to create a user browsing graph that relies on the user’s activity in the browser to confer value on pages. The authors suggest that the user data could be provided by web server administrators, in the form of logs, or directly by users via browser add-ins. A footnote helpfully suggests that “search engines such as Google, Yahoo, and Live Search provide client software called toolbars, which can serve the purpose.”
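
Reduced to a sketch, the idea looks something like the following. This is my Python simplification, not the paper’s exact continuous-time Markov formulation; the session format and the staying-time-weighted reset are illustrative assumptions on my part:

```python
from collections import defaultdict

# Sketch of a user browsing graph (my simplification of the idea, not
# the paper's continuous-time Markov model). Each session is a list of
# (url, seconds_on_page) pairs taken from logs or a browser add-in.
def browse_rank(sessions, damping=0.85, iterations=50):
    transitions = defaultdict(lambda: defaultdict(int))
    stay_time = defaultdict(float)
    for session in sessions:
        for (page, _), (next_page, _) in zip(session, session[1:]):
            transitions[page][next_page] += 1  # observed click-through
        for page, seconds in session:
            stay_time[page] += seconds         # observed time on page
    pages = set(stay_time)
    total_time = sum(stay_time.values())
    # Random jumps land on pages in proportion to observed staying time,
    # so pages users linger on collect more of the "implicit vote".
    reset = {p: stay_time[p] / total_time for p in pages}
    rank = dict(reset)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * reset[p] for p in pages}
        for page, outs in transitions.items():
            out_total = sum(outs.values())
            for target, count in outs.items():
                new_rank[target] += damping * rank[page] * count / out_total
        rank = new_rank
    # (Pages with no observed out-clicks simply shed their damping mass
    # here; the paper handles that case more carefully.)
    return rank

sessions = [
    [("news.example", 120), ("story.example", 300)],
    [("portal.example", 5), ("story.example", 240)],
]
print(browse_rank(sessions))
```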

The claim of the paper is that user behavior such as time on page confers an “implicit vote” on the content in a way that’s harder to spam than PageRank. I’ll come back to this point too.

Doomed: BrowseRank relies on the following:

  1. A way to obtain a statistically valid sample of user browsing information
  2. A reliable way to determine intent from user browsing information, such as session construction
  3. Evidence that time on page is a statistically valid indicator of page quality

Each of these requirements poses non-trivial problems.

User browsing information. The paper proposes that user browsing data can be obtained through a client-side browser add-in or by parsing server logs, and says that this practice would eliminate linkspam. Well, yeah, but it opens up two concerns: first, how are you going to recruit those users and site administrators so that you get a representative sample? And second, how do you ensure that the users are not themselves spamming the quality information? In the first case, we have plenty of evidence (Alexa, Comscore) that user-driven panel results can yield misleading information about things like site traffic. In the second case, we know that it’s trivial to trick the browser into doing things even without having a toolbar installed (botnet, anyone?), and it’s been proven that Alexa rankings can be manipulated.

In short, there are two main problems with the user browse data model: it’s difficult enough to recruit a representative panel of honest users willing to install a browser plugin that monitors their online activities, and screening out spam activity is far more difficult still.

Session construction: Knowledge about the user’s session is one of those interesting things that turn out to be quite difficult to construct in practice, especially when you care about meaningful time on page data. The method described in the Microsoft paper is pretty typical (I sketch it after the list below), and neglects usage patterns like the following:

  1. Spending large amounts of time in a web app UI opening tabs to read later (web based blog aggregator)
  2. Going quickly back and forth between multiple windows or multiple tabs (continuous partial attention)
  3. Walking away from the last page in a session, which then gets assigned too much time on page because of the arbitrary 30-minute session limit (the “bathroom break” problem)
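
For reference, the kind of naive segmentation I’m describing looks roughly like this in Python. The 30-minute cutoff is the one the paper uses; the log format and the None for the unknowable last-page time are my illustration:

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # the paper's cutoff

# Naive session segmentation over a click log: time on page is just the
# gap until the next click, and a gap longer than 30 minutes closes the
# session. A sketch of the typical approach, not the paper's exact rules.
def split_sessions(events):
    """events: list of (timestamp, url) pairs, sorted by timestamp."""
    sessions, current = [], []
    for (ts, url), nxt in zip(events, events[1:] + [None]):
        if nxt is None or nxt[0] - ts > SESSION_TIMEOUT:
            # Last page of a session: its true reading time is unknown,
            # so whatever you assign here is a guess -- hence the
            # "bathroom break" problem in point 3 above.
            current.append((url, None))
            sessions.append(current)
            current = []
        else:
            current.append((url, (nxt[0] - ts).total_seconds()))
    return sessions

log = [
    (datetime(2008, 7, 29, 9, 0), "a.example"),
    (datetime(2008, 7, 29, 9, 2), "b.example"),
    (datetime(2008, 7, 29, 10, 0), "c.example"),  # 58-minute gap
]
print(split_sessions(log))  # two sessions: [a, b] and [c]
```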

Time on page as an indicator of search quality: This is where my main gripe with the article comes from. The authors conclude that their user browsing graph yields better results than PageRank and TrustRank. The problem is, better results at what? The tests posed were to construct a top 20 list of web sites; differentiate between spam and non-spam sites; and identify relevant results for a sample query. The authors claim BrowseRank’s superiority in all three areas. I would argue that the first test is irrelevant; the second was not done on a level playing field; and the third is incomplete. To wit: First, if you aren’t using the relationship between web pages in your algorithm, you shouldn’t need to know what the absolute top 20 sites are, because that information is completely irrelevant to the results for a specific query. Second, conducting a test on spam sorting with user input that operates on a spammy corpus without spammy users is not a real-world test.

Third, the paper’s authors themselves note that “The use of user behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages, which have low frequency or even zero frequency in the user behavior data.” In other words, BrowseRank is great, if you only care about what everyone else cares about. The reality is that most user queries are in the long tail, so optimizing how you’re doing on the head web pages is a little like rearranging deck chairs on the Titanic. And because we don’t know what the sample queries were for this part of the study, it’s impossible to tell for which type of searches BrowseRank performs better.

Finally, there’s a real philosophical difference between BrowseRank and PageRank. BrowseRank assumes that the only interaction a user can have with a web page is to read it. (This is the model of user as consumer.) PageRank makes a more powerful assumption: that a user is free to make contributions to the web by adding to it, specifically by writing new content. The paper talks a lot about Web 2.0 in the context of sites like MySpace and Facebook, but arguably PageRank, which implicitly empowers the user by assuming their equal participation in authoring the Web, is the more Web 2.0-like metric.

links for 2008-07-28

links for 2008-07-26

Veracode is hiring

If you’ve ever wondered what it would be like to work at an amazing company in the security space, wonder no more. Veracode is growing, and we’ve got quite a few openings in sales, engineering, QA, research, and even (particularly) in product management.

If you’ve read my posts about security and product management, if you’ve read about us in the press, and if you think you’ve got what it takes, drop us a line. Of course you’re welcome to contact me and ask questions about the company too.

links for 2008-07-25

Update: Images and WordPress 2.6

I may have been too hasty to condemn the WordPress for iPhone app. One of my criticisms was that it couldn’t upload a photo to my site. Well, I just discovered that I couldn’t either, even using the browser. This appears to be another issue with WordPress 2.6.

Fortunately the fix is simple: fill in the Full URL path to files (optional) field on the Settings » Miscellaneous page of your control panel with the actual path to your images–usually http://yourdomain.com/yourwordpressdirectory/wp-content/–and save the settings. The forum doesn’t have a consensus on what caused this nominally optional field to become mandatory, but filling it in appears to fix the problem for most users.

I’m going long in arks.

Seriously, people, what is going on with the rain out here? We have had deluging thunderstorms every day this week. There was a stranded van on my commute this morning. On Route 2A in Burlington, for heaven’s sake.

On Monday this week, I was picking up some things at the Walgreens in Arlington Heights, which has the World’s Smallest Parking Lot™ — and shares it with a Trader Joe’s and a Starbucks. The parking lot abuts the Minuteman Trail, which runs past some six feet below street level, between the parking lot and a field behind it. On this particular day, there was a lake about fifteen feet across, of unknown depth, in the middle of the parking lot. I skirted it carefully as I parked my car, but when I got out I heard a noise like a waterfall. And I realized that there was a storm drain in the middle of the lake, which connected to an overflow pipe that emptied out beside the trail. Well, there must have been a few hundred gallons a minute going through that pipe:

(That’s the overflow pipe on the left. The lake in the background is the bicycle trail.)

It was raining so hard on Monday that Mass Ave flooded in Arlington Heights in front of the Panera. There were still sandbags there later in the week. And it was raining harder than that this morning.

All I’m saying is, when I start to see animals coming up the hill to get to higher ground at my office, I’m cornering the market on gopher wood.

links for 2008-07-24