Monday, March 28, 2011

Statistician Can't Distinguish Library Patrons from Monkeys

If you're a librarian nodding at the title, no, that's not what I mean.

The statistician in question is Carnegie Mellon statistics professor Cosma Shalizi. He's made a habit of debunking claims by physicists, economists, and computer scientists that their data shows power-law behavior in this-or-that system. When I say he can't distinguish library patrons from monkeys, I don't mean that Prof. Shalizi is near-sighted or that he's unfamiliar with the grooming habits of library patrons. I mean that Shalizi is arguing that the distribution of book circulation that I wrote about two weeks ago can be explained by completely random processes.

In his comment on my blog post, Shalizi reanalyzed the circulation data from the University of Huddersfield and showed that it can be fit well by a "log-normal" distribution, and that the very high-usage tail of the Huddersfield data is not consistent with a power law (such as the one I gave in my post). I've confirmed his analysis, which went much further into the high-usage tail than my first pass. The comparison is done by looking at the cumulative distribution, i.e. plotting the number of books that have circulated fewer than a certain number of times.
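As a rough illustration of what "looking at the cumulative distribution" means, here is a minimal Python sketch. The function name and the sample counts are made up for the example; this is not the code used for the analysis.

```python
from bisect import bisect_left

def cumulative_counts(circulations, thresholds):
    """For each threshold t, count the items that circulated fewer than t times."""
    ordered = sorted(circulations)
    # bisect_left returns how many sorted values are strictly less than t.
    return [bisect_left(ordered, t) for t in thresholds]

# Made-up circulation counts for eight items (not real library data).
sample = [0, 1, 1, 2, 3, 5, 8, 40]
print(cumulative_counts(sample, [1, 2, 5, 10, 100]))  # [1, 3, 5, 7, 8]
```

Plotting these counts against the thresholds on logarithmic axes is what makes the behavior of the high-usage tail visible.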

To make the connection to the monkeys in the library, it's important to understand the generating mechanisms that lead to log-normal distributions. These distributions often arise from random growth processes, and they look just like the standard "bell curve" when plotted on a log scale: the logarithm of the quantity is normally distributed.

Here's how a random growth process could apply to book use. Let's suppose that every day, everyone who has read a book flips a coin. If heads, they do nothing. If tails, they try to get someone else to also read the book. The group of people that has read the book thus grows by some percentage. Repeating this process over and over causes the book's usage to grow randomly. If we then measure the size of these groups, the readership sizes will follow a log-normal distribution.
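A toy simulation makes this concrete. The sketch below simplifies the story to a single coin flip per book per day, with the readership growing by a fixed percentage on tails. The 10% growth step, the 365 days, and the 5,000 books are all assumptions chosen for illustration, not values fitted to any library.

```python
import math
import random
import statistics

def simulate_readership(days, growth=0.1, rng=random):
    """Grow one book's readership: each day a coin flip decides whether it grows by `growth`."""
    readers = 1.0
    for _ in range(days):
        if rng.random() < 0.5:  # "tails": readers recruit new readers today
            readers *= 1.0 + growth
    return readers

rng = random.Random(42)  # fixed seed so the run is repeatable
sizes = [simulate_readership(365, rng=rng) for _ in range(5000)]

# If the sizes are log-normal, their logarithms should look like a bell curve.
logs = [math.log(s) for s in sizes]
print("mean of log(size):   ", round(statistics.mean(logs), 2))
print("std dev of log(size):", round(statistics.pstdev(logs), 2))
```

Because each day multiplies the readership by a random factor, the logarithm of the final size is a sum of many small random steps, and sums like that tend toward a normal distribution. That is where the log-normal shape comes from.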

There's a saying among experimental physicists: "Keep taking data until you have enough to write an article for Physical Review Letters. Then stop taking data." In my previous post on book use, I violated this rule by asking other libraries to share their circulation data for analysis. Ross Riker at Goshen Public Library in Indiana stepped up to the challenge.

Goshen has accumulated circulation data since their automation system was installed in 1996. Riker sent me the number of times each of the 144,269 items currently held had circulated, for a total of 3.04 million circulation events. I've plotted the data on the graph below, alongside the Huddersfield data. It looks somewhat different, doesn't it? I sent the Goshen data to Shalizi, and his analysis was that neither log-normal nor power-law distributions could fit the data.

Is book use in an American public library governed by different principles from that in a British academic library? Probably not. I noticed that the maximum number of circulations at Goshen was 251. The standard circulation period at Goshen is 3 weeks, so there's one book that's been checked out for 14.44 years solid, or since late 1996, which is about when Goshen began collecting data.

If we want to look at book use, what we should be plotting is the rate at which the book is being circulated. That's equal to the number of circulations divided by the time the book is actually on the shelf.
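Here is one way the rate calculation could be sketched in Python. It assumes that time on the shelf is the time the item has been held minus the time it spent checked out, and it borrows the 18-day average time off the shelf from note 3. The function name, the per-year units, and the handling of items that were never on the shelf are my own choices for the example.

```python
def circulation_rate(circulations, days_held, days_off_shelf_per_loan=18.0):
    """Circulations per year of shelf time, or None if the item was never on the shelf."""
    # Assumed correction: every loan removes the item from the shelf for a fixed number of days.
    days_on_shelf = days_held - circulations * days_off_shelf_per_loan
    if days_on_shelf <= 0:
        return None  # checked out essentially the whole time; the rate is undefined
    return circulations / (days_on_shelf / 365.25)

# An item held for ten years and loaned 50 times.
print(circulation_rate(50, 10 * 365.25))
```

An item like the 251-circulation book at Goshen has almost no shelf time left after the correction, so its rate is very sensitive to the assumed time off the shelf.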

After applying time-on-shelf corrections, the data from both Goshen and Huddersfield are well fit by log-normal distributions. To compare the Huddersfield data to the Goshen data, we need to take into consideration another difference. The Goshen data is listed item by item, so if there were two copies of a book, they count as two items. The Huddersfield data groups the circulation counts for all copies of the same book. To properly compute the time-on-shelf factor, I adjusted the circulation rate based on the number of copies held for each book.

After applying the appropriate corrections, the resulting distributions (below) are amazingly similar for the two libraries, and fit beautifully to log-normal distributions. Both distributions even have a bulge at the very highest circulation rates. At Huddersfield, inspection of the relevant bulge items suggests that they're texts used in particular courses, and have circulation times shorter than the main collection.

You may be disappointed to learn that the distribution of book use can be explained by random processes without reference to metadata quality, selection efficiency, or discovery system details. Nor does it derive from a power law characterizing the structure of user networks or citation graphs. All of the circulation distribution data I've looked at is consistent with there being one driving force in the distribution of book use. The non-technical term for this driving force: word of mouth.

Maybe I should stop taking data.


Notes:
  1. The formula for a log-normal distribution is:
    $$p(f) = \frac{1}{f\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln f - \mu)^2}{2\sigma^2}\right)$$
    where μ and σ are the mean and standard deviation of the logarithm of f. If you use Excel 2010 or later, the log-normal distribution is built in: LOGNORM.DIST(f,μ,σ,FALSE). (TRUE gives the cumulative distribution function.)
  2. It's not surprising that you get a better fit with a log-normal distribution than with a power law. The log-normal distribution gives you an extra fitting parameter, after all. But when you include the full high-usage tail, the power law predicts many more extremely high-usage books than are observed.
  3. The time-on-shelf correction has a bit of fudge factor in it. If the standard circulation period is 3 weeks, that doesn't mean that every user keeps a book for three weeks, or that the book gets reshelved immediately after 21 days. My fit uses an average time-off-shelf period of 18 days.
  4. My log-normal fit for Goshen has a mean of 2.95 and a sigma of 0.94. For Huddersfield, I get a mean of 2.22 and a sigma of 0.77 after conversion to item data. The larger mean gives the higher circulation per item at Goshen. Feel free to speculate about the sigmas.
  5. The titles with the highest per-copy circulation rates at Huddersfield are:
    • Music in medieval Europe
    • An introduction to business ethics
    • A guide to the harpsichord
    • On humour : its nature and its place in modern society
    • Japan
    • The BBC and public service broadcasting
    • Authenticity in performance : eighteenth-century case studies
    • Mozart's Requiem : on preparing a new edition.
    • Asia's next giant : South Korea and late industrialization
    • Handel's operas : 1704-1726
    • Cognitive psychology : a student's handbook
  6. Raw data sets are available for Huddersfield and Goshen.
  7. For a readable discussion of generating mechanisms for power-law and log-normal distributions, I recommend "A Brief History of Generative Models for Power Law and Lognormal Distributions" by Michael Mitzenmacher, Internet Mathematics Vol. 1, No. 2: 226-251. [PDF 382KB].
  8. My comments re Harper-Collins are unaffected by this re-analysis, but my quantitative modeling of the budget impact of the new ebook policy will change a bit.
  9. Monkeys aren't really random, but I bet if one started reading a book, there would soon be a crowd of monkeys wanting to read the same book!
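
For anyone without a spreadsheet handy, the log-normal density from note 1 can be sketched in a few lines of Python. This assumes the fit parameters in note 4 refer to natural logarithms, which the notes don't state explicitly; the function name is mine.

```python
import math

def lognormal_pdf(f, mu, sigma):
    """Log-normal density at f, where mu and sigma are the mean and std dev of ln(f)."""
    if f <= 0:
        return 0.0  # the distribution has no weight at or below zero
    z = (math.log(f) - mu) / sigma
    return math.exp(-0.5 * z * z) / (f * sigma * math.sqrt(2.0 * math.pi))

# Fit parameters quoted in note 4 (assumed to be on a natural-log scale).
fits = {"Goshen": (2.95, 0.94), "Huddersfield": (2.22, 0.77)}
for name, (mu, sigma) in fits.items():
    median = math.exp(mu)  # the median of a log-normal is exp(mu)
    print(name, "median:", round(median, 1),
          "density at median:", round(lognormal_pdf(median, mu, sigma), 4))
```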

1 comment:

  1. Having struggled with statistics as a student, I long ago came to the conclusion that statistics are a form of religious belief, of the mysterious sort. If you invoke them, they will give you proof. But you have to believe that they actually do mean something. Sort of like the king's new clothes. I remain in the dark, outside the company of believers. Now, word of mouth I understand.
