Tuesday, March 15, 2011

Help Me Study the Physics of Book Use

I am not a librarian. I'm not a bookseller. I'll admit to some librarian tendencies: when I was little, I liked to line up my trucks and sort them from biggest to smallest. But my education and training were in engineering and physics. My approach to the analysis of data is that of a scientist. So when I analyzed the distribution of circulation across the collection of the University of Huddersfield, I treated the data as a window into the physics of book use.

"Physics???" you may be thinking to yourself. Yes, physics. Well, maybe it would be economics if I had gotten past Econ 101 in college. But I feel comfortable with physics- I have 76 published articles to fall back on. Physics tries to describe things that happen in terms of simpler phenomena. It aims to connect observables (thing you can measure) to their root causes, and then uses that understanding to predict other observables. It doesn't matter so much whether the basic event is one particle hitting another, or one patron checking out a book, if broad patterns can be observed in these events, then a physicist can measure the patterns and try to deduce the causes.

That's why I was so excited to observe a power-law dependence in book-circulation frequency when I analyzed the data made available by the University of Huddersfield. In 15 years of research into crystal growth and electronic properties of semiconductors and superconductors, I never worked with such a well-behaved set of measurements. And as a physicist, I'm trained to believe that when a measured quantity obeys a mathematical relationship, then there must be a reason for it, even if I don't understand that reason yet.

Right now, I don't know why the book circulation in the Huddersfield library obeys a power law. A physicist would call this power law "phenomenology". Without an understanding of how it arises, I can't say whether it should apply to other libraries. I can't say if it would apply to ebook sales at Amazon, or holdings in WorldCat. It might be an accident. But it would be really cool if it were real, because at the core, it must be connected to how people choose things to read.

What causes people to buy a particular book, or borrow a particular book from a library? You would think that many people might want to know. Publishers and librarians might answer that books are read because they're good. But is there any concrete evidence that book quality has anything to do with sales or circulation? Ask any author if sales are correlated with quality, and they'll tell you about a wonderful book that nobody has bought or read. So maybe other factors are more important.

A lot of recent discussion has revolved around the unproven hypothesis that library circulation leads to increased sales. The evidence cited, though compelling, is anecdotal and non-quantitative:
Penguin’s runaway hit, Eat, Pray, Love (Viking), was published in February 2006 with an initial run of 30,000 hardcover copies. The title didn’t become a bestseller until March 2007. In the meantime, copies of Eat, Pray, Love changed hands thousands of times through book clubs and libraries, scoring rave reviews from Library Journal and stirring up chatter among leading library blogs such as Memphis Public Library and San Mateo Public Library. Thanks to word-of-mouth marketing and library lending, when the paperback hit newsstands, Eat, Pray, Love sales skyrocketed.

It would be useful to really know how important this factor is.

I'm guessing that the power law I observed has very little to do with distributions of book quality and much more to do with how people are distributed and connected to each other. City sizes, for example, are well described by a power law. I think that people pick books to read based mostly on what other people have read. That's what creates a best-seller. By studying the distribution of book usage, we may be able to prove that this is so.

So here's where I need help. We need to have more data sets to look at. If the power-law behavior is universal, it should show up in a wide variety of circulation statistics.

There are also situations where the power-law won't apply. It may seem odd to say this, since we don't understand where the power law comes from in the first place, but there are things it CAN'T do. For example, in the comments on the last post, "miker" reported some circulation numbers from a consortium. He blindly plugged his numbers into my formulae and got predictions within a factor of two of the observed numbers, which seemed pretty miraculous to me. He was disappointed.

Miker's data covers 4 years compared to Huddersfield's 13, and so a book that has circulated 100 times probably has spent little time on a library's shelves. A power law predicts significant numbers of books even at impossibly high usage. For example, the power-law fit to miker's data predicts that over a thousand books would be circulated more than once a day, which isn't possible given normal lending periods. See the notes if you're not scared of math and want to know how to adjust a fit.

The best way to advance the study of this phenomenon is to look at more data. If you have access to library circulation data, you can extract some numbers and publish them. A comment here would be appreciated. It's most helpful to report the number of items that have circulated f times as a function of f. Tab-delimited text works great. In addition, analysts need to know the total number of items, total number of circulations, and the number of years covered by the data. An indication of the typical lending period would also be nice.
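If your system can export one circulation count per item, something like the following Python would produce the table and totals described above. This is only a sketch: the function names and the input format (a plain list of per-item counts) are my assumptions for illustration, not something any particular library system provides.

```python
from collections import Counter


def circulation_histogram(circ_counts):
    """Map f -> number of items that circulated exactly f times."""
    return dict(sorted(Counter(circ_counts).items()))


def to_tab_delimited(hist):
    """Render the histogram as tab-delimited text, one row per value of f."""
    lines = ["f\titems"]
    lines += [f"{f}\t{n}" for f, n in hist.items()]
    return "\n".join(lines)


def summary(circ_counts, years):
    """Totals that should accompany the table (years is supplied by hand)."""
    return {
        "total_items": len(circ_counts),
        "total_circulations": sum(circ_counts),
        "years": years,
    }


if __name__ == "__main__":
    # Hypothetical per-item circulation counts, for illustration only.
    counts = [0, 1, 1, 2, 5, 1]
    print(to_tab_delimited(circulation_histogram(counts)))
    print(summary(counts, years=4))
```

Whether items that never circulated (f = 0) belong in the table is a judgment call; this sketch keeps them, since they are easy to drop later.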

Along with a better understanding of how book collections get used, a better science of book use will help libraries and publishers formulate ebook circulation models that make sense for everybody who benefits from the reading of books. That's all of us.

  1. If you want to fit a power law to circulation data truncated at some lending frequency fmax, you have to adjust the fitting parameters. We still have the same expression for the number of circulations for a given frequency, N(f).
    But the computation of the parameters from collection size and total circulations is more complicated:
    It's easiest to solve these equations numerically for N0 and A from the known C, N and fmax.
  2. Please read the follow-up.
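For note 1, here is one way the numerical solution might look in Python. It is a sketch under stated assumptions, not the exact equations from the post: it assumes the fit has the form N(f) = N0 * f^(-A) for f = 1 to fmax, that N is the sum of N(f) over that range (so items that never circulated are not counted), and that C is the sum of f * N(f). Under those assumptions the ratio C/N depends only on A and falls steadily as A grows, so a simple bisection finds A, and N0 follows from N. The function name and the search bracket are my own choices.

```python
def fit_truncated_power_law(total_items, total_circs, fmax,
                            lo=-5.0, hi=20.0, tol=1e-12):
    """Solve for (N0, A) assuming N(f) = N0 * f**(-A), f = 1..fmax.

    Assumed constraints (not taken from the original post):
      total_items = sum of N(f)       over f = 1..fmax
      total_circs = sum of f * N(f)   over f = 1..fmax
    """
    freqs = range(1, fmax + 1)

    def mean_f(a):
        # Circulations per item implied by exponent a; N0 cancels out.
        num = sum(f ** (1.0 - a) for f in freqs)
        den = sum(f ** (-a) for f in freqs)
        return num / den

    target = total_circs / total_items
    if not (mean_f(hi) <= target <= mean_f(lo)):
        raise ValueError("C/N is outside the range reachable in the bracket")

    # mean_f decreases as a increases, so bisect on a.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_f(mid) > target:
            lo = mid  # implied mean too high: exponent must be steeper
        else:
            hi = mid

    a = 0.5 * (lo + hi)
    n0 = total_items / sum(f ** (-a) for f in freqs)
    return n0, a


if __name__ == "__main__":
    # Round trip on synthetic numbers: build N and C from known
    # parameters, then check that the solver recovers them.
    n0_true, a_true, fmax = 1000.0, 1.5, 50
    n = sum(n0_true * f ** (-a_true) for f in range(1, fmax + 1))
    c = sum(n0_true * f ** (1.0 - a_true) for f in range(1, fmax + 1))
    print(fit_truncated_power_law(n, c, fmax))
```

Bisection is slow but hard to break, which seemed like the right trade for a two-parameter problem. Note that this only matches totals; it does not test whether a power law is a good description of the data.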


  1. First: as always, love your methodology.

    Second: I'd be really curious to see what happens across different kinds of libraries, if you get the data. Because, off the top of my head, academic libraries are going to be strongly influenced by some constraints not important elsewhere -- e.g. some books getting upticks as course reserves, or the known power-law pattern in citation frequency of academic papers. So I hope you get data that let you ask questions like "is this a pattern about people and their networks, or a pattern about academe and its practices?"

  2. Yes, what Andromeda said was also my first reaction. This applies to the Huddersfield data obviously.
    Another thing: catalogue search results may be influenced by the way search and ranking algorithms are applied by the system. I know for a fact that this applies to federated search tools (where the presentation of default databases to search is also extremely important) and probably also to discovery tools using central metadata indexes.
    Finally: in academic environments (e)journal articles make up a large portion of information used by students and staff. These data are not covered by book circulation.

  3. "There are also situations where the power-law won't apply."

    Notably this one. You shouldn't use least squares regression to fit power laws (even if you do have 76 articles to fall back on =)), and in this case, it seems to have made a difference: Circ Data Not Power Law.

  4. apology: scratch the "notably this one." I didn't mean it sardonically, but upon re-reading, it certainly sounds that way. It was late and I was studying too long for a physics exam.