Better usage data?

I noted that I had let a new high water mark go by last week; the data watch has been updated and you can download the new data set here. The new high water mark is amazing, too: out of nowhere, it jumped from under 3000 to above 4000 weblogs in a three-hour period. Anyone know what was happening on Friday to drive that much traffic?

Anyway, this was the kick in the butt I needed to look at my cron script that I set up to download the changes file. I had set cron to run a custom AppleScript (source to be shared shortly) to download changes.xml every two hours and gzip it, or so I thought. Looking at it today, the first day I left the machine on overnight since adding the cron setting, I realized I had asked it to download the file once a minute during the 2 am hour instead. Oops. Sorry about the bandwidth, Userland.
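For the record, the two schedules differ only in the first two crontab fields. Here's a sketch of the entry I wrote versus the one I meant (the osascript invocation and script path below are placeholders, not my actual setup):

```crontab
# What I wrote: a minute field of "*" means "every minute," so this
# runs the script sixty times during the 2 am hour:
* 2 * * * /usr/bin/osascript /path/to/fetch-changes.scpt

# What I meant: minute 0 of every second hour, twelve runs a day:
0 */2 * * * /usr/bin/osascript /path/to/fetch-changes.scpt
```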

So why is this important? As I’ve been saying for a bit, I want to understand the dynamics of a day and a week in terms of blog posting frequency. Which are the high traffic days? What percentage of blog users post more than once a day? More than once every few days? Just how many unique blogs ping in a two week period?

Starting today, I’ll be working on finding out. My cron script is now working (it’s amazing the difference between * 2 * * * and 0 */2 * * *). My machine won’t be sleeping or shut down for the next two weeks. I’ll make my summary data available at the end of the experiment and see if I can draw some conclusions about the meaning of the high water marks we’ve been seeing. Hopefully, if I’m successful with the project, this can become a longer-term study. But for that to be true, I’ll have to automate the process of importing the data file and aggregating the statistics, and that may be too much to get done right now.
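As a starting point for that automation, here’s a minimal sketch of the aggregation step in Python. It assumes changes.xml has the shape I’ve seen from weblogs.com — a weblogUpdates root element with one weblog element per ping — and the sample snapshot and URLs below are made up for illustration:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A hypothetical two-hour snapshot in the assumed changes.xml shape:
# one <weblog/> per ping, with "when" as seconds before the snapshot time.
SAMPLE = """<?xml version="1.0"?>
<weblogUpdates version="1" updated="Fri, 30 Aug 2002 14:00:00 GMT" count="4">
  <weblog name="Blog A" url="http://a.example.com/" when="120"/>
  <weblog name="Blog B" url="http://b.example.com/" when="340"/>
  <weblog name="Blog A" url="http://a.example.com/" when="5100"/>
  <weblog name="Blog C" url="http://c.example.com/" when="6000"/>
</weblogUpdates>"""

def ping_counts(xml_text):
    """Map each weblog URL to its number of pings in one snapshot."""
    root = ET.fromstring(xml_text)
    return Counter(w.get("url") for w in root.findall("weblog"))

counts = ping_counts(SAMPLE)
print(len(counts))                      # unique weblogs pinging -> 3
print(counts["http://a.example.com/"])  # pings from one blog -> 2
```

Merging the Counters from a two-week run of snapshots would answer the unique-blogs and posting-frequency questions above, though the real file is large enough that streaming parsing may be needed.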

Is anyone else engaging with the changes data in this way? Are there any questions about the weblog population that two weeks of granular update data could answer?