On the difficulty of measuring online traffic

Boing Boing: BoingBoing traffic stats are back. John Battelle talks about the difficulties in interpreting web statistics. A few comments based on my own experience at Microsoft.com:

…of the columns you see, only the first one – “Unique Visitors,” and the last two “Hits” and “Bandwidth” can be taken at face value. “Unique Visitors” counts unique IP addresses that are hitting the site, so it’s a fairly accurate count of actual humans reading Boing Boing. (If anything, its count is a bit low, as it does not account for sites like AOL which may have one IP address for thousands of unique users.)

There are more problems with the Unique Visitors stat than Battelle lets on, of course. AOL will always be the big problem in any attempt to measure Internet users for the reason that John mentions, namely AOL’s proxy looking like one big, extraordinarily active user. However, AOL is certainly not the only place where you see a proxy server that presents only one IP address to the outside world; this is pretty common at large corporations, as well as at Wi-Fi hotspots. Also, IP addresses can change from session to session if you are on dial-up, if you reboot a lot, or even if your broadband modem goes down a lot. End result: IP addresses are a good approximation of unique visitors, but I wouldn’t take them at face value.

Another way to count UVs (or Unique Users) is to issue a cookie and count the number of unique cookies hitting the site. There are problems here too (users clear their cookies or refuse to accept them in the first place), but this gets around the proxy server problem.
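To make the difference concrete, here is a minimal sketch of both counting methods run over the same made-up log. The record layout, the addresses, and the function names are all my own assumptions for illustration, not how AWStats or any other stats package actually stores or counts hits.

```python
# Hypothetical log records: (ip_address, cookie_id).
# cookie_id is None when the visitor refused the cookie.
hits = [
    ("10.0.0.1", "a"),        # four different people behind one proxy
    ("10.0.0.1", "b"),
    ("10.0.0.1", "c"),
    ("10.0.0.1", "d"),
    ("192.0.2.7", "e"),       # one person whose IP changed between sessions
    ("192.0.2.8", "e"),
    ("198.51.100.4", None),   # one person who refuses cookies
]

def unique_by_ip(hits):
    """Count distinct IP addresses seen in the log."""
    return len({ip for ip, _ in hits})

def unique_by_cookie(hits):
    """Count distinct cookie ids, skipping hits that carry no cookie."""
    return len({cookie for _, cookie in hits if cookie is not None})

print(unique_by_ip(hits))      # 4
print(unique_by_cookie(hits))  # 5
```

Six people generated this log. The IP count collapses the four proxy users into one and splits the dial-up user in two, landing on 4; the cookie count gets the proxy users right but never sees the visitor who refuses cookies, landing on 5.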

Neither of these solutions deals with the possibility that you have users who visit from multiple machines, which will have both different IPs (unless they are behind the same proxy server) and different cookies (unless you explicitly require authentication each time you set the cookie).

Nonetheless, one or the other of these methods is in use in most major web stats programs.

…the other two columns – “Pages” and “Number of Visits” – are more difficult to understand. They are AWStats’ best guess as to how many total visits a site gets, as well as how many pages are actually viewed by those visitors. These columns have always disregarded image and video files, but because a lot of our traffic comes from RSS readers, they are certainly inflated by some amount.

Ah yes. Tracking visits means you divide up all the hits from a given user into periods of time when the user was on your site without interruption. As you can imagine there are a lot of assumptions there, starting with how you identify users (your count of visits will be thrown off by the proxy server problems discussed above), the time frames you pick (if you allow at most five minutes between hits, a user who takes six minutes to read a page before requesting the next one is counted as two visits), and so on. And pages… What is a page? Does it include server-side included pages? Images? What if the images are part of the reason people come to your site? And what about those RSS feeds? As I wrote a long time ago, tracking RSS upsets a lot of the assumptions you make when tracking plain old web traffic.
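The visit-splitting logic can be sketched in a few lines, which also shows how much the answer depends on the timeout you pick. This is only an illustration under my own assumptions: the function name, the (visitor id, timestamp in seconds) record layout, and the timeout values are hypothetical, and real stats packages may define a visit differently.

```python
from collections import defaultdict

def count_visits(hits, timeout_seconds=300):
    """Count visits, starting a new one whenever the gap between two
    consecutive hits from the same visitor exceeds timeout_seconds.

    hits: iterable of (visitor_id, timestamp_in_seconds) pairs.
    """
    by_visitor = defaultdict(list)
    for visitor_id, timestamp in hits:
        by_visitor[visitor_id].append(timestamp)

    visits = 0
    for timestamps in by_visitor.values():
        timestamps.sort()
        visits += 1  # the visitor's first hit always opens a visit
        for previous, current in zip(timestamps, timestamps[1:]):
            if current - previous > timeout_seconds:
                visits += 1  # gap too long: treat as a new visit
    return visits

# Hypothetical data: u1 spends six minutes reading before the next request.
hits = [("u1", 0), ("u1", 360), ("u2", 10), ("u2", 100)]

print(count_visits(hits, timeout_seconds=300))   # 3
print(count_visits(hits, timeout_seconds=1800))  # 2
```

Same hits, same two readers, and the visit count changes by half just by moving the timeout from five minutes to thirty. The visitor id here carries all the identification problems described earlier, so a shared proxy IP would merge several people's hits into one long "visit."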

I did a lot of work in this area when I worked at Microsoft; hopefully the part of my experience that I can actually share will be relevant to the ongoing discussion.