Friday, March 11, 2011

The Pareto Principle and the True Cunning of HarperCollins

I take it back. I see now that HarperCollins' new strategy for ebooks in libraries is not nearly as senseless as it first seemed to me. In fact, it's a cunning plan worthy of Blackadder.
In case you're new to this library and publishing controversy, HarperCollins, one of the "Big 6" US publishers, has decided to require the expiration of the ebooks it offers to libraries after 26 checkouts. A library would have to relicense the ebook after the 26 checkouts if it wants to keep the ebook in its circulating collection. Needless to say, librarians and many others were not happy about this.

HarperCollins' strategy puzzled me, because I couldn't figure out how it would make any money for them. I thought any extra sales caused by ebook expirations would likely be offset by poor sales of the limited-durability ebooks.

Libraries struggled to figure out how the new policy would affect them, and started looking at their circulation statistics. For example, Laura Crossett reported that 23,083 of the 88,680 circulating books in her library's collection had been checked out more than 26 times over the course of 15 years; 220 books had been checked out more than 100 times. Matt Hamilton reported his numbers: 7,566 books from a collection of 288,793 had circulated more than 26 times; 942 items had circulated more than 52 times. Most of the materials in his library are 3-4 years old. On Twitter, West Chester Public Library reported that over 10,000 books from its collection of 58,000 had been borrowed more than 26 times over 17 years. Jason Griffey reported stats from his (academic) library: in 10 years, only 126 items from a collection of 409,213 had circulated more than 26 times.

These numbers are a bit all over the map, and I wanted to make some sense of them. According to IMLS data for 2007, US public libraries had collections totaling a bit more than 812 million print volumes. They circulated these items 2.17 billion times in 2007. That works out to an average of roughly 2.7 circs/volume per year. Of course circulations will be unevenly distributed, but if HarperCollins' terms were applied to print, the "average" volume would be expected to last about 10 years.
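The arithmetic behind that estimate is easy to check. This is a minimal sketch using only the rounded IMLS figures quoted above; the variable names are illustrative, and it treats every volume as circulating at the average rate, which real collections certainly don't do.

```python
# Rounded figures quoted from the IMLS 2007 data.
print_volumes = 812e6    # print volumes held by US public libraries
circulations = 2.17e9    # total circulations in 2007
checkout_limit = 26      # HarperCollins' per-license checkout cap

# Average number of checkouts per volume in one year.
circs_per_volume = circulations / print_volumes

# Years for an "average" volume to use up 26 checkouts at that rate.
years_to_limit = checkout_limit / circs_per_volume

print(f"{circs_per_volume:.2f} circs per volume per year")
print(f"{years_to_limit:.1f} years to reach {checkout_limit} checkouts")
```

Since the collection figure is "a bit more than" 812 million, the true average is slightly lower and the lifetime slightly longer than this prints.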

A true understanding of these numbers would come from a better characterization of how circulation is distributed over the collection of a real library. You've probably heard of the "80/20 Rule" which in this case would say that 80% of the borrowing is concentrated on 20% of the collection. This is also known as the "Pareto Principle" which is a consequence of power-law distributions. I wondered if this was a good description of book circulation in libraries. I wanted to see some data.

OCLC's Lorcan Dempsey pointed me to the motherlode. The University of Huddersfield, in England, has released a huge file containing circulation and recommendation data extracted from almost 3 million transactions spanning over 13 years. I set to work analyzing the data.

The result is quite remarkable. The data shows a distribution of circulation frequency following a power law over 3 orders of magnitude, with an R² of 0.9969! (update: see note 10 below.) Here's the plot of the number of books that have been circulated N times at Huddersfield:

The equation for the circulation is pretty simple:

$$N(f) = N_0\,(A + f)^{-5/2}$$

Here, N(f) is the number of books that have been checked out f times. N0 and A are fitting parameters; I used A=9 in my plot of the Huddersfield data. If I use the total number of circulations and the total size of the collection to fix these two parameters, I get a zero-parameter fit of the data that's still amazingly good, with an R² of 0.9760.
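For anyone who wants to play with the curve, here is a minimal sketch of the model in Python. It assumes the form N(f) = N0 (A + f)^(-5/2) and the zero-parameter formulas given in note 4 below. The function names and the example library (100,000 items and 1.8 million total circulations, chosen so that A comes out to 9) are hypothetical, not the Huddersfield figures.

```python
import math


def fit_parameters(total_circ, collection_size):
    """Return (A, N0) from total circulation C and collection size N.

    Uses the formulas in note 4: A = C/(2N), N0 = (3/4) * sqrt(C^3 / (2N)).
    """
    a = total_circ / (2 * collection_size)
    n0 = 0.75 * math.sqrt(total_circ ** 3 / (2 * collection_size))
    return a, n0


def books_with_checkouts(f, a, n0):
    """Model estimate of the number of books checked out exactly f times."""
    return n0 * (a + f) ** -2.5


# Hypothetical library, not real data.
collection_size = 100_000
total_circ = 1_800_000

a, n0 = fit_parameters(total_circ, collection_size)
print(f"A = {a:g}, N0 = {n0:.3g}")
for f in (5, 26, 100):
    estimate = books_with_checkouts(f, a, n0)
    print(f"about {estimate:.0f} books with {f} checkouts")
```

This describes the idealized curve only; as note 4 warns, weeding and availability limits will pull a real collection away from it at both ends.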

Using this equation, I can calculate what a limited check-out ebook "should" be worth, but I'll leave that to another post, seeing as even one equation may be too much for this blog post.

What I'll focus on here is what's been referred to in the library literature as the "vital few" principle that results from this distribution. A large majority of the circulations are taken up by a relatively small fraction of the collection. In the Huddersfield data, roughly 20% of the collection is in fact responsible for roughly 80% of the circulation.
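The idealized curve gives much the same answer as the data. The sketch below is an assumption-laden check rather than an analysis of the Huddersfield file: it treats the checkout count f as continuous and integrates N(f) = N0 (A + f)^(-5/2) from 0 upward. Under that approximation, the most-borrowed fraction p of the collection accounts for a share (3/2) p^(1/3) - (1/2) p of all circulation, whatever the value of A, and the fraction of items with more than F checkouts is (A / (A + F))^(3/2). The function names are illustrative.

```python
def circulation_share(top_fraction):
    """Share of total circulation from the most-borrowed items.

    Continuum approximation of N(f) = N0 * (A + f)^(-5/2); the result
    does not depend on A or N0.
    """
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    return 1.5 * top_fraction ** (1.0 / 3.0) - 0.5 * top_fraction


def fraction_over(limit, a):
    """Fraction of the collection with more than `limit` checkouts."""
    return (a / (a + limit)) ** 1.5


share = circulation_share(0.20)
over_26 = fraction_over(26, a=9.0)  # A = 9 is the value used for the plot

print(f"top 20% of items: {share:.1%} of circulation")
print(f"items past 26 checkouts when A = 9: {over_26:.1%} of the collection")
```

The first figure comes out near 78%, close to the classic 80/20 split. The real data starts at 5 circulations and falls off the curve above 400, so this is a statement about the model, not a measurement.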

If we think about this in the context of ebook lending models, we see that HarperCollins has played a neat trick. By focusing our attention on the books that are lent many times, supposedly shortchanging the publisher and the author, HarperCollins has gotten us to overlook the 80% of books that don't circulate much at all. Libraries pay full price for those, too, and it's pretty clear that publishers make infinitely more money on books that don't circulate in libraries than on books that don't sell in bookstores!

On balance, the economic effect of libraries, in addition to those I've discussed before, is to shift money from very popular books to those that are less popular. It can be argued that libraries support a breadth of culture that would go away without their support. Guess who publishes those very popular books? The Big 6 publishers, of course. They pay the big advances to authors, the big coop advertising fees to bookstores, they get their authors on talk shows and their books reviewed in the Times. That takes a lot of money, but the expenditure is richly rewarded by a "vital few" or "smash hit" economy.

So here's the cunning. By focusing on popularity-driven revenue mechanisms, HarperCollins is pushing money towards the smash hits and away from the long tail. Libraries may be adversely affected, but they're collateral damage. It's the long tail publishers that HarperCollins is trying to destroy.

All of HarperCollins' strategy is directed at making hits bigger. The loss of big-box bookstores like Borders has disproportionately hurt smash-hit publishing houses. They're poorly positioned to take advantage of the internet-induced fattening of the long tail that has been documented by Brynjolfsson, Hu and Smith in their paper on Amazon sales rankings. Rather, Big 6 profitability is improved by selling more copies of fewer books.

I didn't think so, but the HarperCollins strategy really does make sense. It's part of the big push.

  1. For a review of what people have written about HarperCollins, Librarian by Day is all over it.
  2. Thanks to Dave Pattern at Huddersfield and the JISC TILE Project for making the release of the circulation data possible.
  3. The Huddersfield data starts at books with 5 circulations. For counts greater than 100, I binned the data in groups of 10 to reduce noise. The data falls off the power law at over 400 circulations/book. This must be close to the limit of always being in circulation.
  4. Yes, all you need is the total circulation and the collection size to predict the distribution of the circulation. If you want to model your own circ stats, the formulae for A and N0 are as follows:
    • $A = C/(2N)$, where C is the total circ and N is the number of items in the collection.
    • $N_0 = \tfrac{3}{4}\left(\dfrac{C^3}{2N}\right)^{1/2}$
    Amazing, isn't it? Remember this is an idealized system, so your mileage may vary. Weeding will pull down the small N part of the curve; availability limits will truncate the large N part of the curve.
  5. The "vital few" principle was articulated by JM Juran in 1954. "Universals in management planning and controlling" Manage. Rev. 43(11), 748–61 (1954).
  6. JD Eldridge has a nice discussion of Juran, Pareto, and Trueswell (another scholar of book circulation) in "The vital few meet the trivial many: unexpected use patterns in a monographs collection", Bull. Med. Libr. Assoc. 86(4), 496–503 (1998).
  7. Brynjolfsson, Erik, Hu, Yu Jeffrey and Smith, Michael D., "The Longer Tail: The Changing Shape of Amazon’s Sales Distribution Curve" (September 20, 2010). Available at SSRN. I plotted the Huddersfield data as done in this paper, and the library curve has the same slope they report for the 2008 Amazon data. Not very straight, though.
  8. Brynjolfsson, Erik, Hu, Yu Jeffrey and Simester, Duncan, "Goodbye Pareto Principle, Hello Long Tail: The Effect of Search Costs on the Concentration of Product Sales" (November 2007). Available at SSRN. This is a study very relevant to libraries. I wish these guys would show more data, though.
  9. There's a lot of old work (60s and 70s) on library circulation distributions with a whole bunch of theory. It's impressive, because they seem to have collected data by hand, but I fear the theory is too old to be useful. The 80s and 90s were marked by huge advances in the scientific study of self-organizing systems resulting in power laws.
  10. (added March 17) Cosma Shalizi (first commenter on this post) has done a fit of the Huddersfield data to a Log-Normal distribution; I'll try to explain what this means in a subsequent post.
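The formulas in note 4 are stated without derivation. One way to recover them, offered as a sketch and not necessarily how they were originally obtained, is to treat the checkout count f as continuous, assume the curve N(f) = N0 (A + f)^(-5/2) holds all the way from f = 0, and require that it reproduce both the collection size N and the total circulation C:

```latex
\begin{align*}
N &= \int_0^\infty N_0\,(A+f)^{-5/2}\,df = \tfrac{2}{3}\,N_0\,A^{-3/2} \\
C &= \int_0^\infty f\,N_0\,(A+f)^{-5/2}\,df = \tfrac{4}{3}\,N_0\,A^{-1/2} \\
\frac{C}{N} &= 2A \quad\Longrightarrow\quad A = \frac{C}{2N} \\
N_0 &= \tfrac{3}{2}\,N\,A^{3/2}
     = \tfrac{3}{2}\,N\left(\frac{C}{2N}\right)^{3/2}
     = \tfrac{3}{4}\left(\frac{C^3}{2N}\right)^{1/2}
\end{align*}
```

Both integrals converge because the exponent is steeper than 2, which is why two totals are enough to pin down both parameters.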


  1. I have a professional interest in this sort of question. If you are interested, and have the unbinned circulation counts, I'd be happy to try fitting a power law with modern tools, and testing the fit.

  2. The raw data is available from Huddersfield; my extract is available at Sorry, it's binned above 100.

  3. So what's the most logical next chess move for libraries? Should they still buy HC but never re-up after 26 uses to try to even out the field? Boycott HC ebooks altogether and seek to reward pubs that aren't doing this - if we can even tell who those are?

  4. bmljenny- Switch to football; it's a team sport.

  5. Eric,
    On gut instinct this seemed the most likely reasoning behind the move from the beginning, nice to have a fact foundation for the hunch though!

  6. I'm afraid your equation did lose me, but one explanation of the rather arbitrary 26 lends and you're out model made a lot of sense to me: most eBooks are lent out for two weeks at a time, therefore you're now licensing a Harper Collins eBook for a theoretical year of lending if your library's bought it.

    But even though I don't understand the equation, your theory's very interesting. ;)

  7. A component of the HC strategy not taken into account here is that they are, apparently, discouraging consortial purchasing. This is important because, in a resource sharing consortium, physical items circulate more frequently.

    One summary data point, in rough numbers:
    * data set age: 4 years
    * collection size: 12M
    * total circulations: 78M
    * items circulating exactly 26 times: 47k
    * items circulating more than 26 times: 640k
    * items circulating more than 52 times: 130k
    * items circulating more than 104 times: 13k

    However, your equation above [N(f)=N'(A+f)^(-5/2)] does not fit the data:
    A = 78000000/(2*12000000) = 3.25
    N0 = 0.75*((78000000^3)/(12000000*2))^0.5
    N(26) = N0(A+26)^(-5/2) = 22,792

    That's 23k predicted, 47k recorded.

    Summing over f from 27 to 2133 (first predicting 0) we get 455k, which is quite a bit shy of the 640k we actually see.

    What this means is that the long tail is more important in this (and, anecdotally, most) consortium. I believe that HC knows this (and has done much more rigorous analyses than I have here) and is discouraging consortial purchasing not because it can't make money there, but to discourage the creation of consortia at all. The trend towards consortia has been accelerating in US public libraries, and the purchasing power that could provide scares publishers like HC who don't seem to "get" the digital world.

    In other words, I think your theory that they're trying to kill off the long tail is sound on its face and worth exploration, but I believe the reasoning behind the attack on the long tail may be more malicious, with respect to libraries, than simply "collateral damage." With the trend, at least in US publics, toward consortial consolidation and resource sharing, and the data (of which the above is only one point) showing the long tail matters when it is available (read: in resource sharing consortia), these resource sharing consortia may very well be an intended target.

  8. miker- your data shows a very steep power law (with 3 data points, most anything is a power law!). If an item circulates 104 times in 4 years, I'd expect to be seeing suppression due to availability effects; that's where the Huddersfield data starts dropping off the curve, scaled by age. Try using a circulation of 148 million. Think of that as the circulation you would have had if every book were always available.

    Really, getting within a factor of 2 with a magical formula is a miracle! You need to look more closely at your data, considering only N~<100 to see if it fits or not.

  9. Ruth- It's not a theory, it's more of a conjecture, that the distribution of "idealized" use across a collection can be described by a particular power law. That's astonishing. By idealized, I mean not accounting for books not being available due to high demand, or somehow being hidden.

  10. Hang on, hang on.
    I'm a long time user of academic and public libraries, and there's something missing here.

    Circulation stats don't consider the use of books in libraries. People do read books without checking them out. I have read, or at least looked at, many Harper Collins books inside the library.

    If my in-library use of an ebook counts against the 26 uses as a checkout, this is not as benign a deal as it appears.

  11. This comment has been removed by the author.

  12. It seems to me that there should be symmetry here.

    If, after x years, a book has only been read 10 times, then the other 16 times should be returnable for partial credit from the same publisher.

    This would encourage very wide library catalogs (good for libraries and patrons and smaller publishers). Maybe not so good for Harper, as you note.

  13. There is no shelf-space limit for an ebook catalogue. Libraries could theoretically offer every book and pay only when that book is "e-lent".