On knowing the limitations of your stats

I have three sources of statistics about the readership of my weblog. One is the built-in statistics in Manila, which tell me about referrers, page views in the last 24 hours, and the most-read messages since I started the weblog. But they don’t tell me anything about the readers doing the page views: browser, IP address, source country, anything. They also don’t report on pages served by my static site, which is where I try to point most of my readers these days.

Another is Site Meter, the seemingly ubiquitous tracking device that shows up on everyone’s blogs. My Site Meter reports give me a few more things, such as what the most popular entry and exit pages are, which browsers are most popular, and so forth. But no ranking of messages by popularity, no reports of 404s, etc. Plus, um, it is a potential privacy violation. (What do they do with that data?)

The third, which I finally was able to access last week, is the actual Apache logs for the static site. And I discovered something about Site Meter: it filters quite a few things out of its Browser report. Like robots. Site Meter claims that most of my visitors run Internet Explorer. Actually, GoogleBot alone accounts for twice as many visits as IE does: over 10,000, or almost 25% of my static site’s traffic. Which is fine, but another 9% are apparently done by Slurp. And plenty of other visits come from outright evil bots, like the cunningly named EvilBot.
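For the curious, tallying user agents out of the raw logs is a few lines of Python. A minimal sketch, assuming Apache’s standard “combined” log format (which puts the referrer and user agent in quotes at the end of each line); the log path and the family names in `classify` are just placeholders:

```python
import re
from collections import Counter

# The "combined" log format ends each line with "referer" "user-agent".
LINE_RE = re.compile(r'"([^"]*)" "([^"]*)"\s*$')

def count_user_agents(path):
    """Count visits per raw User-Agent string in an Apache combined log."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = LINE_RE.search(line)
            if match:
                counts[match.group(2)] += 1  # group 2 is the user agent
    return counts

def classify(agent):
    """Lump raw agent strings into rough families by substring match."""
    for name in ("Googlebot", "Slurp", "MSIE", "Mozilla"):
        if name.lower() in agent.lower():
            return name
    return "other"
```

Summing the classified counts and dividing by the total gives the percentages above.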

Which makes me grateful for the new Spam-Free Email feature in Manila. Used to be that if you left a comment, visitors saw your email address. Now they see the name you registered under, and clicking on it takes them to a form that lets them send you email without ever seeing your address. Which means that the spambots crawling my site looking for email addresses won’t find any.
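The idea behind the feature is simple: the page only ever renders the registered name, and the real address is resolved server-side at send time. A hypothetical sketch of that relay pattern (not Manila’s actual code; the names and user table here are made up):

```python
# Hypothetical user table; real addresses live server-side, never in HTML.
REGISTERED_USERS = {"medley": "hidden@example.com"}

def comment_byline(user_id):
    """What visitors (and spambots) see: a contact link, no address."""
    return '<a href="/contact?user=%s">%s</a>' % (user_id, user_id)

def relay_message(user_id, body, send):
    """Look up the real address only when the form is submitted."""
    addr = REGISTERED_USERS[user_id]  # resolved here, never exposed
    send(addr, body)
    return True
```

The address appears in exactly one place: the server-side send call.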

A great start. But there are more limitations:

  • My RSS file is served by Manila, not Apache, so I can’t see which RSS readers are subscribing to me—or how many distinct people there are, or how often they’re hitting my server.
  • I can put a robots.txt file on my static server to control how the robots crawl my site (and I will), but I can’t do the same on Manila.
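For the static site, the robots.txt I have in mind would be short. A hypothetical sketch (the bot names and paths are illustrative; Crawl-delay is a nonstandard extension that some crawlers, like Slurp, honor):

```
# Block the evil bots outright.
User-agent: EvilBot
Disallow: /

# Everyone else: crawl gently, skip scripts.
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
```

Records are matched top to bottom by User-agent, so the specific EvilBot rule has to come before the catch-all.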

Ah, for a unified stats solution. In the meantime, two new projects start today on the weblog:

  • A robots.txt file for the static site.
  • A P3P-compliant (and human-readable!) privacy policy.
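On the P3P side, the machinery is a policy reference file that user agents fetch from the well-known location /w3c/p3p.xml, pointing at the full policy. A hypothetical skeleton (the policy path is made up, and the namespace is from the P3P 1.0 Recommendation):

```xml
<META xmlns="http://www.w3.org/2002/01/P3Pv1">
  <POLICY-REFERENCES>
    <POLICY-REF about="/w3c/policy.xml#main">
      <INCLUDE>/*</INCLUDE>
    </POLICY-REF>
  </POLICY-REFERENCES>
</META>
```

The human-readable version would live alongside it as a plain page, which is the part readers will actually see.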