How to Dig for Book Data Treasure

To me, the surest indicator of impending doom for book publishing is hearing a publisher cite the advertising of Attributor, an anti-piracy solutions company, as if it were science. It's not the attitude towards piracy that bothers me; that's entirely sensible. It's the implied devaluation of honest data that depresses me.

There's hope though. I've gotten to know quite a number of people throughout the reading ecosystem with whom I can use the word "data" as high praise, roughly equivalent to the word "gold". If you're reading this, chances are you're a member of this secret society, and what follows is a sketch of a treasure map.

In a recent post, I promised to suggest ways that we might measure the effects of library ebook lending on book sales. If you think about it, there are many parallels between attempting such a measurement and previous studies that have tried to measure the effect of ebook piracy on book sales. Unfortunately, the only objective study I know of was a small study done by Brian O'Leary, and the effects observed in that study were small and in a direction counter to popular narratives (and thus rarely noted in the sort of presentations that cite Attributor advertising).

In that study, O'Leary looked for time-domain correlations between sales figures for books from two publishers and the appearance of the same books on BitTorrent. A similar study focused on library lending could be much more compelling, because library circulation data is a much more direct measure of distribution than any sort of torrent tracking, and librarians are much better than pirates at sharing data.
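To make "time-domain correlation" concrete, here is a minimal sketch of one way to look for it: correlate weekly circulation with sales shifted by zero, one, two or more weeks, and see which lag lines up best. This is not a reconstruction of O'Leary's analysis. The function names and the weekly numbers are invented for illustration, and real data would need far more care (seasonality, promotions, stock-outs).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlations(circulation, sales, max_lag):
    """Correlate circulation in week t with sales in week t + lag, for each lag."""
    if len(circulation) != len(sales):
        raise ValueError("series must cover the same weeks")
    n = len(circulation)
    return {
        lag: pearson(circulation[: n - lag], sales[lag:])
        for lag in range(max_lag + 1)
    }


# Made-up weekly counts for one title (hypothetical data, not real figures).
weekly_circulation = [1, 3, 2, 5, 4, 6, 2, 1]
weekly_sales = [0, 0, 1, 3, 2, 5, 4, 6]

by_lag = lagged_correlations(weekly_circulation, weekly_sales, max_lag=3)
best_lag = max(by_lag, key=by_lag.get)
print(best_lag, round(by_lag[best_lag], 3))
```

A strong correlation at some lag would be a hint worth chasing, not proof that lending caused the sales.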

With the cooperation of booksellers, library circulation and holdings could be compared and correlated to store-by-store sales. For example, you could look at a book that's held in a significant fraction of libraries and look for correlations (positive AND negative) between areas where a library is circulating the book and stores where the book is selling. You'd have to remove regional and demographic variance, of course, but with enough data, almost anything is possible.
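Here is a rough sketch of why removing regional variance matters. Big regions have more loans and more sales, so a raw correlation can look strongly positive even when, inside each region, more lending goes with fewer sales. One crude adjustment, assumed here purely for illustration, is to subtract each region's average before correlating. The records, region labels and function names below are all made up.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def within_region_correlation(records):
    """Correlate loans and sales after subtracting each region's own averages.

    records: iterable of (region, loans, sales) tuples, one per library/store area.
    """
    by_region = defaultdict(list)
    for region, loans, sales in records:
        by_region[region].append((loans, sales))

    loan_residuals, sales_residuals = [], []
    for rows in by_region.values():
        mean_loans = sum(l for l, _ in rows) / len(rows)
        mean_sales = sum(s for _, s in rows) / len(rows)
        for loans, sales in rows:
            loan_residuals.append(loans - mean_loans)
            sales_residuals.append(sales - mean_sales)
    return pearson(loan_residuals, sales_residuals)


# Hypothetical data: a small region and a large one.
records = [
    ("small", 10, 100), ("small", 12, 98), ("small", 14, 96),
    ("large", 100, 1000), ("large", 102, 998), ("large", 104, 996),
]

raw = pearson([r[1] for r in records], [r[2] for r in records])
adjusted = within_region_correlation(records)
print(round(raw, 3), round(adjusted, 3))
```

In this toy data the raw and adjusted correlations have opposite signs, which is exactly the kind of thing a careless analysis would miss. Demographic adjustment would need a real statistical model rather than this shortcut.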

With the cooperation of a large publisher, rigorous experiments could be done. Scientific experiments derive rigor from the use of controls. To prove that lending influences sales, it's not enough to do lending and look for sales. A rigorous experiment would have both a trial where books are lent and an identical trial where the same books are not lent.

One way to control a lending experiment would be to make a random selection of a publisher's catalog available for lending. Imagine if Penguin had worked with the library community on an experimental withholding of a random part of its catalog from OverDrive. Sales of the withheld titles could then be compared with sales of the titles that stayed available, and analyzed for patterns and trends.
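A controlled experiment of this kind can be sketched in a few lines: split the catalog at random, then ask whether the difference in average sales between the two groups is bigger than chance alone would produce. The sketch below uses a simple permutation test, which is my choice for illustration and not anything a publisher has actually run. The function names, titles and sales figures are hypothetical.

```python
import random


def split_catalog(titles, fraction_withheld, seed):
    """Randomly split titles into (withheld, lent) groups, reproducibly."""
    rng = random.Random(seed)
    shuffled = list(titles)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction_withheld)
    return shuffled[:cut], shuffled[cut:]


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(lent_sales, withheld_sales, n_shuffles=2000, seed=0):
    """Estimate how often a random relabeling gives a gap at least as large
    as the observed gap in mean sales (two-sided)."""
    rng = random.Random(seed)
    observed = abs(mean(lent_sales) - mean(withheld_sales))
    pooled = list(lent_sales) + list(withheld_sales)
    k = len(lent_sales)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_shuffles + 1)


# Hypothetical catalog and made-up sales figures.
catalog = [f"title-{i}" for i in range(10)]
withheld, lent = split_catalog(catalog, fraction_withheld=0.5, seed=1)

p_value = permutation_p_value(
    lent_sales=[100, 101, 102, 103, 104],
    withheld_sales=[1, 2, 3, 4, 5],
)
print(withheld, lent, p_value)
```

The random split is what makes the comparison fair: because chance, not an editor, decides which titles are withheld, the two groups should differ only in whether they were lent.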

It's important that data analysis of this sort be done objectively by researchers with integrity. In any large collection of data, it's possible to focus on data which supports one narrative over another. If lending-sales studies were done, my guess is that some types of books would show correlations very different from others.

I've used the word "cooperation" several times already. I'm not so naïve as to think that data sharing will materialize out of thin air. Perhaps the sort of ecosystem-wide organization envisaged by the same Brian O'Leary could be the vehicle to make data treasure digging possible. Opportunity in Abundance for the win!

  1. Building cooperation between libraries, publishers and distributors... Sounds like a goal for the ALA OITP E-book Task Force.