Monday, March 28, 2011

Statistician Can't Distinguish Library Patrons from Monkeys

If you're a librarian nodding at the title, no, that's not what I mean.

The statistician in question is Carnegie Mellon Statistics Professor Cosma Shalizi. He's made a habit of debunking claims by physicists, economists, and computer scientists that their data shows power-law behavior in this-or-that system. When I say he can't distinguish library patrons from monkeys, I don't mean that Prof. Shalizi is near-sighted or that he's unfamiliar with the grooming habits of library patrons. I mean that Shalizi is arguing that the distribution of book circulation that I wrote about two weeks ago can be explained by completely random processes.

In his comment on my blog post, Shalizi reanalyzed the circulation data from the University of Huddersfield and showed that it can be fit well by a "log-normal" distribution, and that the very high-usage tail of the Huddersfield data is not consistent with a power law (such as the one I gave in my post). I've confirmed his analysis, which went much farther into the high-usage tail than my first pass did. He did this by looking at the cumulative distribution, i.e. plotting the number of books that have circulated less than a certain number of times.
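If you want to try this on your own data, the cumulative count is a one-pass computation over per-item circulation totals. Here's a minimal sketch in Python (the counts are made up, and `cumulative_distribution` is just an illustrative name; here "cumulative" means items circulated at most f times):

```python
from collections import Counter

def cumulative_distribution(circ_counts):
    """For each observed f, the number of items circulated at most f times."""
    freq = Counter(circ_counts)      # f -> items with exactly f circulations
    cdf = {}
    running = 0
    for f in sorted(freq):
        running += freq[f]
        cdf[f] = running
    return cdf

# toy per-item circulation counts
counts = [1, 1, 2, 3, 3, 3, 7]
print(cumulative_distribution(counts))  # {1: 2, 2: 3, 3: 6, 7: 7}
```

Plotting the cumulative counts on log-log axes avoids the binning noise of a raw histogram, which is what makes the tail behavior visible.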

If you want to make the connection to the monkeys in the library, it's important to understand the generating mechanisms that lead to log-normal distributions. These often arise from random growth processes, and are just like the standard "bell-curve", but on a log scale.

Here's how a random growth process could apply to book use. Let's suppose that every day, everyone who has read a book flips a coin. If heads, they do nothing. If tails, they try to get someone else to also read the book. The group of people that has read the book thus grows by some percentage. Repeating this process over and over causes the book's usage to grow randomly. If we then measure the sizes of these groups, the readership sizes will follow a log-normal distribution.
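That story is easy to simulate. Here's a minimal sketch (the 5% growth step, the 200-day run, and the 50/50 coin are arbitrary choices for illustration, not fitted to any library's data):

```python
import math
import random

def simulate_readerships(n_books=10000, days=200, growth=0.05, seed=42):
    """Coin-flip growth: each day, with probability 1/2, a book's
    readership grows by `growth`. Returns the final readership sizes."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_books):
        size = 1.0
        for _ in range(days):
            if rng.random() < 0.5:
                size *= 1.0 + growth
        sizes.append(size)
    return sizes

logs = [math.log(s) for s in simulate_readerships()]
mean = sum(logs) / len(logs)
var = sum((x - mean) ** 2 for x in logs) / len(logs)
# Expect mean ~ 100*ln(1.05) ~ 4.88 and variance ~ 50*ln(1.05)**2 ~ 0.12
print(round(mean, 2), round(var, 2))
```

After enough flips, the log of each readership is a sum of many independent increments, so the central limit theorem makes log(size) approximately normal, which is exactly the log-normal shape.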

There's a saying among experimental physicists. "Keep taking data until you have enough to write an article for Physical Review Letters. Then stop taking data." In my previous post on book use, I violated this rule by asking other libraries to share their circulation data for analysis. Ross Riker at Goshen Public Library in Indiana stepped up to the challenge.

Goshen has accumulated circulation data since their automation system was installed in 1996. Riker sent me the number of times each of 144,269 items currently held had been circulated, for a total of 3.04 million circulation events. I've plotted the data on the graph below, alongside the Huddersfield data. It looks somewhat different, doesn't it? I sent the Goshen data to Shalizi, and his analysis was that neither log-normal nor power-law distributions could fit the data.

Is book use in an American public library governed by different principles from that in a British academic library? Probably not. I noticed that the maximum number of circulations at Goshen was 251. The standard circulation period at Goshen is 3 weeks, so there's one book that's been checked out for 14.44 years solid, or since late 1996, which is about when Goshen began collecting data.

If we want to look at book use, what we should be plotting is the rate at which the book is being circulated. That's equal to the number of circulations divided by the time the book is actually on the shelf.
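As a sketch, here's the correction I have in mind, assuming each circulation keeps the book off the shelf for about 18 days (the fudge-factor figure discussed in the notes below):

```python
def circulation_rate(circs, years_of_data, days_off_shelf_per_circ=18):
    """Circulations per year of actual shelf time. Each circulation keeps
    the book off the shelf for roughly `days_off_shelf_per_circ` days
    (loan period plus reshelving)."""
    total_days = years_of_data * 365.0
    shelf_days = total_days - circs * days_off_shelf_per_circ
    if shelf_days <= 0:
        raise ValueError("book was effectively never on the shelf")
    return circs / (shelf_days / 365.0)

# a book circulated 50 times over 13 years of data:
print(round(circulation_rate(50, 13), 2))  # 4.75 circs per shelf-year
```

A book that circulated 50 times in 13 years was off the shelf for about 2.5 of those years, so its true on-shelf rate is about 4.7 circs per year, not the naive 3.8.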

After applying time-on-shelf corrections, the data from both Goshen and Huddersfield are well fit by log-normal distributions. To compare the Huddersfield data to the Goshen data, we need to take into consideration another difference. The Goshen data is listed item by item, so if there were two copies of a book, they count as two items. The Huddersfield data groups the circulation counts for all copies of the same book. To properly compute the time-on-shelf factor, I adjusted the circulation rate based on the number of copies held for each book.

After applying the appropriate corrections, the resulting distributions (below) are amazingly similar for the two libraries, and fit beautifully to log-normal distributions. Both distributions even have a bulge at the very highest circulation rates. At Huddersfield, inspection of the relevant bulge items suggests that they're texts used in particular courses, with loan periods shorter than the main collection's.

You may be disappointed to learn that the distribution of book use can be explained by random processes without reference to metadata quality, selection efficiency, or discovery system details. Nor does it derive from a power law characterizing the structure of user networks or citation graphs. All of the circulation distribution data I've looked at is consistent with there being one driving force in the distribution of book use. The non-technical term for this driving force: word of mouth.

Maybe I should stop taking data.

  1. The formula for a log-normal distribution is:
    N(f) = exp(-(ln f - μ)^2 / (2σ^2)) / (f·σ·sqrt(2π))
    where μ and σ are the mean and standard deviation of the logarithm of the distribution. If you use Excel, the log-normal distribution is built in as LOGNORM.DIST(f,μ,σ,FALSE). (TRUE gives the cumulative distribution function.)
  2. It's not surprising that you get a better fit with a log-normal distribution than with a power-law. The log-normal distribution gives you an extra fitting parameter, after all. But when you include the full high-usage tail, the power law predicts a lot more extremely high-usage books than are observed.
  3. The time-on-shelf correction has a bit of fudge-factor in it. If the standard circulation period is 3 weeks, that doesn't mean that every user keeps it for three weeks, or that the book gets reshelved immediately after 21 days. My fit uses an average time-off-shelf period of 18 days.
  4. My log-normal fit for Goshen has a mean of 2.95 and a sigma of 0.94. For Huddersfield, I get a mean of 2.22 and a sigma of 0.77 after conversion to item data. The larger mean gives the higher circulation per item at Goshen. Feel free to speculate about the sigmas.
  5. The titles with the highest per-copy circulation rates at Huddersfield are:
    • Music in medieval Europe
    • An introduction to business ethics
    • A guide to the harpsichord
    • On humour : its nature and its place in modern society
    • Japan
    • The BBC and public service broadcasting
    • Authenticity in performance : eighteenth-century case studies
    • Mozart's Requiem : on preparing a new edition.
    • Asia's next giant : South Korea and late industrialization
    • Handel's operas : 1704-1726
    • Cognitive psychology : a student's handbook
  6. Raw data sets are available for Huddersfield and Goshen.
  7. For a readable discussion of generating mechanisms for power laws and log-normal distributions, I recommend "A Brief History of Generative Models for Power Law and Lognormal Distributions" by Michael Mitzenmacher, Internet Mathematics Vol. 1, No. 2: 226-251. [PDF 382KB].
  8. My comments re HarperCollins are unaffected by this re-analysis, but my quantitative modeling of the budget impact of the new ebook policy will change a bit.
  9. Monkeys aren't really random, but I bet if one started reading a book, there would soon be a crowd of monkeys wanting to read the same book!
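If you'd rather not use Excel, the log-normal density in note 1 is a one-liner in plain Python (`lognormal_pdf` is just an illustrative name):

```python
import math

def lognormal_pdf(f, mu, sigma):
    """Log-normal density: log(f) is normally distributed with
    mean mu and standard deviation sigma."""
    return (math.exp(-(math.log(f) - mu) ** 2 / (2 * sigma ** 2))
            / (f * sigma * math.sqrt(2 * math.pi)))

# standard log-normal (mu=0, sigma=1) at f=1 is 1/sqrt(2*pi)
print(round(lognormal_pdf(1.0, 0.0, 1.0), 4))  # 0.3989
```

With the Goshen parameters from note 4 (μ=2.95, σ=0.94), this gives the model curve to lay over the measured distribution.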

Wednesday, March 23, 2011

Simon and Schuster is Looking at Limited Lending Library eBook Models

Michael Healy and Carolyn Reidy
Today's installment of the Publishing Point series of CEO interviews featured Simon and Schuster CEO Carolyn Reidy. Reidy was appointed to her current position at the very bottom of the economic cycle; she has had a lot on her plate to say the least. Although she knew when she took the job that she'd need to steer the company through a transition to a purely digital product, she had no idea it would happen as quickly as we're seeing.

Michael Healy, who didn't appear to be unemployed yet, again played the role of master interlocutor. He started out by asking the obligatory question about S&S's reaction to yesterday's decision on the Google Books Settlement.
We of course are disappointed that the judge didn't approve it, although I would also say that it wasn't so surprising that he didn't just approve it in total. Along with the other publishers who were party to the suit, we do hope that it's just another way-station on what will be a final settlement. I think he did give some indication in his ruling on ways that we can get to a final settlement and I think that all parties involved have a hope that we'll get there and there will be further conversations about it; it'll just take us longer to get there.
Healy, the Executive Director Designate of the Book Rights Registry that would be created by the settlement, joked that "this particular party certainly hopes that we get there!" Reidy continued:
I'm sure you do. And we all do, because there are real principles involved in the whole thing that we'd like to see preserved and definitely settled for the good of all of us.
Healy then asked if Reidy shared the optimism about growth in the book industry recently expressed by Len Riggio, Chairman of Barnes&Noble. Reidy does:
I share the optimism because I think that even though there are negative fallouts occurring, as you see with bookstores closing and things like that, the ease by which consumers can acquire books, the ease by which people can publish books, also of selling and being able to put books in front of consumers, the variety of ways you can market, all of these things are just exploding. And we don't yet have the same grasp of them that we did of the old system, but there are so many opportunities facing us that I definitely share the optimism about it, there's no doubt about it. The biggest question is whether or not you're going to take the old market and transform it into a new way of consuming or whether it'll get bigger; of course we all hope it's going to get bigger. That's the biggest question in front of us: can we in fact enlarge the market for reading by using all the new tools and opportunities in front of us?
Reidy discussed at length the many challenges facing a publishing CEO in times of technological change. 20 years ago, even 10 years ago, a publishing CEO would never have expected to be wrestling with questions of production systems and bandwidth pipes, or why a best-selling crossword puzzle iPad App can't just be moved to the iBookstore. She believes that the biggest problem facing publishers is maintaining their ability to create value compared to the many entities ready and willing to disintermediate them.

Healy's last question concerned HarperCollins and the "eye of the hurricane" that they've found themselves in regarding their change in ebook lending policies. Reidy's answer was succinct:
Simon and Schuster does not yet sell ebooks to libraries. We have not yet found a business model that makes us happy. That's why we're not in it.
Later, in the Q&A period, I pressed Reidy about finding a business model for providing ebooks to libraries: "libraries are worried about whether they'll survive the transition to digital books and funding difficulties at the same time. Are you at all worried about the survival of libraries across the transition to ebooks?" I asked.
There's a part of me that worries about it, but I'm first worried about my company... and my authors, and their survival. So we have met with several people who are trying to come up with a solution to sell into libraries and there are people who are working on various and sundry different models that are not just sell one ebook and let it be loaned forever, and in fact we met with one last week. So we've actually been meeting with people and think there will come a solution that we can live with. We just haven't seen one yet.
I hope that Reidy also finds a model that will allow libraries to thrive in time to help Simon and Schuster grow the market for reading.

Tuesday, March 22, 2011

What's next for "Google's" Book Rights Registry?

Judge Denny Chin's order came down today; he decided NOT to approve the "Amended Settlement Agreement" (ASA) that would have ended the lawsuit between Google, the Authors Guild, and the AAP.  (The decision is only 48 pages, shorter than Chin's senior thesis.) What happens next?

There is speculation that the rejection will be appealed; apparently, Chin must give his permission for an appeal to proceed. (update: or maybe not). Another possibility is that the parties will go back and try to renegotiate a new settlement along the lines suggested by Judge Chin.
In the end, I conclude that the ASA is not adequate, fair, and reasonable. As the United States and other objectors have noted, many of the concerns raised in the objections would be ameliorated if the ASA were converted from an "opt-out" settlement to an "opt-in" settlement. (See, e.g., DOJ SOI 23, ECF No. 922; Internet Archive Mem. 10, ECF No. 811). I urge the parties to consider revising the ASA accordingly.
I'm guessing, based on Judge Chin's fairly broad and terse rejection of the settlement's key terms, that a new settlement will have to start at square one, even given the AAP statement:
While the March 22 decision of U.S. District Court Judge Denny Chin on the Google Book Settlement Agreement that was filed on November 13, 2009 is not the final approval we were hoping for, it provides clear guidance to all parties as to what modifications are necessary for its approval. The publisher plaintiffs are prepared to enter into a narrower Settlement along those lines to take advantage of its groundbreaking opportunities. We hope the other parties will do so as well.
One provision of the settlement that has had broad support, even from opponents of the agreement, is the Book Rights Registry. This is essentially a huge database of rights information that has been accumulated by the plaintiffs' attorneys. While it's often referred to as "Google's Book Rights Registry", and it was set up with assistance from Google, it doesn't belong to Google at all.

Michael Healy, the "Executive Director Designate" of the Book Rights Registry, has been working on the registry for almost two years, writing specifications, answering inquiries from confused rights-holders, and waiting for a decision from the court before he can proceed to make it a functioning entity. What will become of this almost-there resource, now that the settlement has been rejected?

Here's one possibility: the plaintiffs could establish a book rights registry with or without Google.

Clearly, the plaintiffs and their attorneys have run up a huge legal bill for this lawsuit. They've been expecting an infusion of $34.5 million to set up the registry and another $30 million for attorneys' fees. It's not clear how much has been spent on the registry so far, but it has registered claims for about 1.1 million books. This is potentially a very valuable resource which, if put into play, could greatly increase the possibilities for transactions of book rights.

One possibility is that as part of a side deal, the plaintiffs together with Google arrange to put the registry into operation, presumably with some cash from Google. I do not know if such an arrangement would be legal.  Another possibility is that the database is offered to investors who would set up a rights clearance business. The cash from the sale could help the plaintiffs continue the suit.

The opposite possibility is depressing, but not unlikely. The entanglement of the rights information with Google's proprietary information could be so deep that a functional Book Rights Registry would have to be started from scratch, leaving the ebook industry locked into a continuing effort to clean up the rights mess.

  1. Follow all the Settlement news at The Public Index.
  2. Although Gluejar's effort to unglue ebooks will be hampered by the absence of a Book Rights Registry, there will be an even greater need for efforts such as ours that offer a path to increased access to books.
  3. James Grimmelmann's summary of the decision is required reading.

Tuesday, March 15, 2011

Help Me Study the Physics of Book Use

I am not a librarian. I'm not a bookseller. I'll admit to some librarian tendencies- when I was little, I liked to line up my trucks and sort them from biggest to smallest. But my education and training were in engineering and physics. My approach to the analysis of data is that of a scientist. So when I analyzed the distribution of circulation across the collection of the University of Huddersfield, I treated the data as a window into the physics of book use.

"Physics???" you may be thinking to yourself. Yes, physics. Well, maybe it would be economics if I had gotten past Econ 101 in college. But I feel comfortable with physics- I have 76 published articles to fall back on. Physics tries to describe things that happen in terms of simpler phenomena. It aims to connect observables (things you can measure) to their root causes, and then uses that understanding to predict other observables. It doesn't matter so much whether the basic event is one particle hitting another or one patron checking out a book: if broad patterns can be observed in these events, then a physicist can measure the patterns and try to deduce the causes.

That's why I was so excited to observe a power-law dependence in book-circulation frequency when I analyzed the data made available by the University of Huddersfield. In 15 years of research into crystal growth and electronic properties of semiconductors and superconductors, I never worked with such a well-behaved set of measurements. And as a physicist, I'm trained to believe that when a measured quantity obeys a mathematical relationship, then there must be a reason for it, even if I don't understand that reason yet.

Right now, I don't know why the book circulation in the Huddersfield library obeys a power law. A physicist would call this power law "phenomenology". Without an understanding of how it arises, I can't say whether it should apply to other libraries. I can't say if it would apply to ebook sales at Amazon, or holdings in Worldcat. It might be an accident. But it would be really cool if it was real, because at the core, it must be connected to how people choose things to read.

What causes people to buy a particular book, or borrow a particular book from a library? You would think that many people might want to know. Publishers and librarians might answer that books are read because they're good. But is there any concrete evidence that book quality has anything to do with sales or circulation? Ask any author if sales are correlated to quality, and they'll tell you about a wonderful book that nobody has bought or read. So maybe other factors  are more important.

A lot of recent discussion has revolved around the unproven hypothesis that library circulation leads to increased sales. The evidence cited, though compelling, is anecdotal and non-quantitative:
Penguin’s runaway hit, Eat, Pray, Love (Viking), was published in February 2006 with an initial run of 30,000 hardcover copies. The title didn’t become a bestseller until March 2007. In the meantime, copies of Eat, Pray, Love changed hands thousands of times through book clubs and libraries, scoring rave reviews from Library Journal and stirring up chatter among leading library blogs such as Memphis Public Library and San Mateo Public Library. Thanks to word-of-mouth marketing and library lending, when the paperback hit newsstands, Eat, Pray, Love sales skyrocketed.

It would be useful to really know how important this factor is.

I'm guessing that the power law I observed has very little to do with distributions of book quality and much more to do with how people are distributed and connected to each other- for example, city sizes are well described by a power law. I think that people pick books to read based mostly on what other people have read. That's what creates a best-seller. By studying the distribution of book usage, we may be able to prove that this is so.

So here's where I need help. We need to have more data sets to look at. If the power-law behavior is universal, it should show up in a wide variety of circulation statistics.

There are also situations where the power-law won't apply. It may seem odd to say this, since we don't understand where the power law comes from in the first place, but there are things it CAN'T do. For example, in the comments on the last post, "miker" reported some circulation numbers from a consortium. He blindly plugged in his numbers to my formulae, and got predicted numbers within a factor of two of the observed numbers, which seemed pretty miraculous to me. He was disappointed. 

Miker's data covers 4 years compared to Huddersfield's 13, and so a book that has circulated 100 times probably has spent little time on a library's shelves. A power law predicts significant numbers of books even at impossibly high usage. For example, the power-law fit to miker's data predicts that over a thousand books would be circulated more than once a day, which isn't possible given normal lending periods. See the notes if you're not scared of math and want to know how to adjust a fit.

The best way to advance the study of this phenomenon is to look at more data. If you have access to library circulation data, you can extract some numbers and publish them. A comment here would be appreciated. It's most helpful to report the number of items that have circulated f times as a function of f. Tab-delimited text works great. In addition, analysts need to know the total number of items, total number of circulations, and the number of years covered by the data. An indication of the typical lending period would also be nice.
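If it helps, here's a sketch of the table format that's easiest to analyze: one "f<TAB>count" pair per line. Parsing it and recovering the summary totals takes only a few lines of Python (the sample numbers are made up):

```python
def summarize_circ_table(text):
    """Parse tab-delimited 'f<TAB>number of items circulated f times' lines;
    return (total items, total circulations)."""
    total_items, total_circs = 0, 0
    for line in text.strip().splitlines():
        f_str, n_str = line.split("\t")
        f, n = int(f_str), int(n_str)
        total_items += n
        total_circs += f * n
    return total_items, total_circs

sample = "0\t100\n1\t50\n2\t25\n5\t4"
print(summarize_circ_table(sample))  # (179, 120)
```

Reporting the table this way lets anyone recompute the totals, so the fits can be checked independently.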

Along with a better understanding of how book collections get used, a better science of book use will help libraries and publishers formulate ebook circulation models that make sense for everybody who benefits from the reading of books. That's all of us.

  1. If you want to fit a power law to circulation data truncated at some lending frequency fmax, you have to adjust the fitting parameters. We still have the same expression for the number of items at a given circulation frequency, N(f) = N0/(f + A)^(5/2).
    But the computation of the parameters from collection size and total circulations is more complicated:
    N = (2N0/3)·[A^(-3/2) - (fmax + A)^(-3/2)]
    C = N0·[(4/3)·A^(-1/2) - 2·(fmax + A)^(-1/2) + (2A/3)·(fmax + A)^(-3/2)]
    It's easiest to solve these equations numerically for N0 and A from the known C, N and fmax.
  2. Please read the follow-up.
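To make note 1 concrete, here's a sketch of the numerical solution. It assumes the power-law form N(f) = N0/(f + A)^(5/2) that is consistent with the fitting formulas in my earlier post, treats the sums as integrals (a continuous approximation), and bisects on A, since the ratio C/N turns out to depend on A alone:

```python
def pl_totals(N0, A, fmax):
    """Collection size and total circulations for N(f) = N0/(f + A)**2.5
    on [0, fmax], treating the sums as integrals."""
    B = fmax + A
    items = (2.0 * N0 / 3.0) * (A ** -1.5 - B ** -1.5)
    circs = N0 * ((4.0 / 3.0) * A ** -0.5
                  - 2.0 * B ** -0.5
                  + (2.0 * A / 3.0) * B ** -1.5)
    return items, circs

def solve_power_law(C, N, fmax):
    """Recover (N0, A) from total circs C, collection size N, and cutoff fmax.
    The ratio C/N depends on A alone and grows with A, so bisect on A."""
    target = C / N
    lo, hi = 1e-6, 1e6
    for _ in range(200):
        mid = (lo + hi) / 2.0
        items, circs = pl_totals(1.0, mid, fmax)
        if circs / items < target:
            lo = mid
        else:
            hi = mid
    A = (lo + hi) / 2.0
    N0 = N / pl_totals(1.0, A, fmax)[0]
    return N0, A

# round-trip check with made-up parameters:
true_items, true_circs = pl_totals(5000.0, 9.0, 400.0)
N0, A = solve_power_law(true_circs, true_items, 400.0)
print(round(N0), round(A, 3))  # 5000 9.0
```

The same pl_totals function gives the tail predictions: in the untruncated model, the number of items above frequency f is (2N0/3)·(f + A)^(-3/2), which is how you check whether a fit predicts impossibly heavy usage.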

Friday, March 11, 2011

The Pareto Principle and the True Cunning of HarperCollins

I take it back. I see now that HarperCollins' new strategy for ebooks in libraries is not nearly as senseless as it first seemed to me. In fact, it's a cunning plan worthy of Blackadder. In case you're new to this library and publishing controversy, HarperCollins, one of the "Big 6" US publishers, has decided to require the expiration of the ebooks it offers to libraries after 26 checkouts. A library would have to relicense the ebook after the 26 checkouts if it wants to keep the ebook in its circulating collection. Needless to say, librarians and many others were not happy about this.

HarperCollins' strategy puzzled me, because I couldn't figure out how it would make any money for them. I thought any extra sales caused by ebook expirations would likely be offset by poor sales of the limited-durability ebooks.

Libraries struggled to figure out how the new policy would affect them, and started looking at their circulation statistics. For example, Laura Crossett reported that at her library, 23,083 out of the 88,680 circulating books in her library's collection had been checked out more than 26 times over the course of 15 years. 220 books had been checked out more than 100 times. Matt Hamilton reported his numbers: 7566 books from a collection of 288,793 had circulated more than 26 times; 942 items had circulated more than 52 times. Most of the materials in his library are 3-4 years old. On Twitter, West Chester Public Library reported over 10,000 books from its collection of 58,000 had been borrowed more than 26 times over 17 years. Jason Griffey reported stats from his (academic) library: in 10 years, only 126 items from a collection of 409,213 had circulated more than 26 times.

These numbers are a bit all over the map, and I wanted to make some sense of them. According to IMLS data for 2007, US public libraries had collections totaling a bit more than 812 million print volumes. They circulated these items 2.17 billion times in 2007. That works out to an average of about 2.7 circs/volume. Of course circulations will be unevenly distributed, but if HarperCollins' terms were applied to print, the "average" volume would be expected to last about 10 years.
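The arithmetic, for anyone who wants to check it:

```python
volumes = 812e6    # US public library print volumes, IMLS 2007
circs = 2.17e9     # total circulation transactions, 2007
per_volume = circs / volumes
print(round(per_volume, 2))       # 2.67 circs per volume per year
print(round(26 / per_volume, 1))  # 9.7 years for the average volume to hit 26
```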

A true understanding of these numbers would come from a better characterization of how circulation is distributed over the collection of a real library. You've probably heard of the "80/20 Rule" which in this case would say that 80% of the borrowing is concentrated on 20% of the collection. This is also known as the "Pareto Principle" which is a consequence of power-law distributions. I wondered if this was a good description of book circulation in libraries. I wanted to see some data.

OCLC's Lorcan Dempsey pointed me to the motherlode. The University of Huddersfield, in England, has released a huge file containing circulation and recommendation data extracted from almost 3 million transactions spanning over 13 years. I set to work analyzing the data.

The result is quite remarkable. The data shows a distribution of circulation frequency following a power law over 3 orders of magnitude, with an R² of 0.9969! (update: see note 10 below.) Here's the plot of the number of books that have been circulated N times at Huddersfield:

The equation for the circulation is pretty simple:

N(f) = N0/(f + A)^(5/2)

Here, N(f) is the number of books that have been checked out f times. N0 and A are fitting parameters; I used A=9 in my plot of the Huddersfield data. If I use the total number of circulations and the total size of the collection to fix these two parameters, I get a zero-parameter fit of the data that's still amazingly good, with an R² of 0.9760.

Using this equation, I can calculate what a limited check-out ebook "should" be worth, but I'll leave that to another post, seeing as even one equation may be too much for this blog post.

What I'll focus on here is what's been referred to in the library literature as the "vital few" principle that results from this distribution. A large majority of the circulations are taken up by a relatively small fraction of the collection. In the Huddersfield data, roughly 20% of the collection is in fact responsible for roughly 80% of the circulation.
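In fact, the fitted power law makes a sharp prediction here. Working the integrals for the N(f) = N0/(f + A)^(5/2) form described in the notes (a continuous approximation), the share of circulation captured by the top fraction p of the collection comes out to (3/2)·p^(1/3) - p/2, independent of A. A sketch (`top_share` is just an illustrative name):

```python
def top_share(p, A):
    """Share of all circulation captured by the top fraction p of items,
    for N(f) = N0/(f + A)**2.5 (continuous, untruncated)."""
    B = A * p ** (-2.0 / 3.0)            # f* + A at the cutoff where N(>f*) = p*N
    circs_above = 2.0 * B ** -0.5 - (2.0 * A / 3.0) * B ** -1.5
    circs_total = (4.0 / 3.0) * A ** -0.5
    return circs_above / circs_total

print(round(top_share(0.20, 9.0), 3))  # 0.777 -- the top 20% carry ~78%
```

For p = 0.20 the model gives about 78% of circulation, pleasantly close to the canonical 80/20, and the answer doesn't depend on A at all.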

If we think about this in the context of ebook lending models, we see that HarperCollins has played a neat trick. By focusing our attention on the books that are lent many times, supposedly shortchanging the publisher and the author, HarperCollins has gotten us to overlook the 80% of books that don't circulate much at all. Libraries pay full price for those, too, and it's pretty clear that publishers make infinitely more money on books that don't circulate in libraries than on books that don't sell in bookstores!

On balance, one economic effect of libraries, in addition to those I've discussed before, is to shift money from very popular books to those that are less popular. It can be argued that libraries support a breadth of culture that would go away without their support. Guess who publishes those very popular books? The Big 6 publishers, of course. They pay the big advances to authors and the big co-op advertising fees to bookstores; they get their authors on talk shows and their books reviewed in the Times. That takes a lot of money, but the expenditure is richly rewarded by a "vital few" or "smash hit" economy.

So here's the cunning. By focusing on popularity-driven revenue mechanisms, HarperCollins is pushing money towards the smash hits and away from the long tail. Libraries may be adversely affected, but they're collateral damage. It's the long tail publishers that HarperCollins is trying to destroy.

All of HarperCollins' strategy is directed  at making hits bigger. The loss of big-box bookstores like Borders has disproportionately hurt  smash-hit publishing houses. They're poorly positioned to take advantage of the internet-induced fattening of the long tail that has been documented by Brynjolfsson, Hu and Smith in their paper on Amazon sales rankings. Rather, Big 6 profitability is improved by selling more copies of fewer books.

I didn't think so, but the HarperCollins strategy really does make sense. It's part of the big push.

  1. For a review of what people have written about HarperCollins, Librarian by Day is all over it.
  2. Thanks to Dave Pattern at Huddersfield and the JISC TILE Project for making the release of the circulation data possible.
  3. The Huddersfield data starts at books with 5 circulations. For counts greater than 100, I binned the data in groups of 10 to reduce noise. The data falls off the power law at over 400 circulations/book. This must be close to the limit of a book always being in circulation.
  4. Yes, all you need is the total circulation and the collection size to predict the distribution of the circulation. If you want to model your own circ stats, the formulae for A and N0 are as follows:
    • A = C/(2N), where C is the total circ and N is the number of items in the collection.
    • N0 = (3/4)·(C^3/(2N))^(1/2)
    Amazing, isn't it? Remember this is an idealized system, so your mileage may vary. Weeding will pull down the small-f part of the curve; availability limits will truncate the large-f part of the curve.
  5. The "vital few" principle was articulated by JM Juran in 1954. "Universals in management planning and controlling" Manage. Rev. 43(11), 748–61 (1954).
  6. JD Eldridge has a nice discussion of Juran, Pareto, and Trueswell (another scholar of book circulation) in "The vital few meet the trivial many: unexpected use patterns in a monographs collection", Bull. Med. Libr. Assoc. 86(4), 496–503 (1998).
  7. Brynjolfsson, Erik, Hu, Yu Jeffrey and Smith, Michael D., "The Longer Tail: The Changing Shape of Amazon’s Sales Distribution Curve" (September 20, 2010). Available at SSRN: I plotted the Huddersfield data as done in this paper, and the library curve has the same slope they report for the 2008 Amazon data. Not very straight, though.
  8. Brynjolfsson, Erik, Hu, Yu Jeffrey and Simester, Duncan, "Goodbye Pareto Principle, Hello Long Tail: The Effect of Search Costs on the Concentration of Product Sales" (November 2007). Available at SSRN: This is a study very relevant to libraries. I wish these guys would show more data, though.
  9. There's a lot of old work (60s and 70s) on library circulation distributions with a whole bunch of theory. It's impressive, because they seem to have collected data by hand, but I fear the theory is too old to be useful. The 80s and 90s were marked by huge advances in the scientific study of self-organizing systems resulting in power laws.
  10. (added March 17) Cosma Shalizi (first commenter on this post) has done a fit of the Huddersfield data to a Log-Normal distribution; I'll try to explain what this means in a subsequent post.
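Note 4's zero-parameter recipe can be sanity-checked in a few lines of Python. The check below confirms that A = C/(2N) and N0 = (3/4)·(C^3/(2N))^(1/2) reproduce the collection size and total circulation when N(f) = N0/(f + A)^(5/2) is integrated over all f (a continuous approximation; the input numbers are made up):

```python
def fit_parameters(C, N):
    """A and N0 from total circulations C and collection size N,
    per note 4: A = C/(2N), N0 = (3/4)*(C**3/(2*N))**0.5."""
    A = C / (2.0 * N)
    N0 = 0.75 * (C ** 3 / (2.0 * N)) ** 0.5
    return N0, A

# made-up example: 3 million circs over a 500,000-item collection
N0, A = fit_parameters(3e6, 5e5)
# integrating N0/(f + A)**2.5 over all f should give back N and C:
items = (2.0 / 3.0) * N0 * A ** -1.5
circs = (4.0 / 3.0) * N0 * A ** -0.5
print(A, round(items), round(circs))  # 3.0 500000 3000000
```

The round trip closing exactly is what "zero-parameter fit" means: nothing is left free once C and N are known.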

Saturday, March 5, 2011

eBook Carrots for Libraries

"Provide a great service and charge a lot for it" was the advice of an old friend who became a successful businesswoman. I frequently think of this advice; I have sometimes failed to follow the second part and have mostly regretted it. If you provide what your customers value, you should have no qualms about asking them to pay a premium. If you don't give the customers what they value, they won't be happy even if you give them a big discount.

The results of the dual survey I posted on Monday are confirming my guesses about HarperCollins' new strategy for limiting checkouts of ebooks they license to libraries through Overdrive, which sparked the so-tagged #HCOD furor. (The limitations are in addition to a one-user-at-a-time restriction already imposed on these ebooks.) The results indicate that HarperCollins' new service terms don't give customers what they value. They'll be unhappy, even if they're offered big discounts.
At what price discount would your library opt for a 26-check-out ebook?
At what premium would your publishing company offer an unlimited-check-out ebook?
The survey for publishers has attracted only 28 responses so far, not enough to support anything other than very broad statements. The survey for libraries has attracted 155 responses, and thus has much better statistics. The poll is in no way scientific; there is sure to be significant sampling bias. In other words, the survey only measures the opinions of librarians and publishers who were motivated to answer.

Significantly, 37% (±5%) of librarians indicated they would not purchase limited-check-out ebooks at any price. I would characterize this response as arising from non-quantitative considerations, which might be practical, ideological, or philosophical. A similar percentage of publishers, 28% (±12%), indicated that no amount of money would convince them to offer an unlimited-check-out ebook (the most common type today). So it seems that publishers also have considerations that transcend math, which I find a bit surprising.
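As a rough sanity check on error bars like these, the sampling uncertainty of a poll percentage can be estimated with the binomial standard error. (The exact convention behind the ± figures above may differ; this sketch is purely illustrative, and says nothing about the sampling bias noted earlier.)

```python
import math

def proportion_stderr(p, n):
    """Standard error of a sample proportion p from n responses
    (binomial approximation)."""
    return math.sqrt(p * (1 - p) / n)

# Librarian survey: 155 responses, 37% would buy "at no price"
print(round(proportion_stderr(0.37, 155), 3))  # → 0.039

# Publisher survey: 28 responses, 28% would sell "at no premium"
print(round(proportion_stderr(0.28, 28), 3))   # → 0.085
```

The small publisher sample is why its error bar is more than twice as wide as the librarians'.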

If we compare the rest of the responses, omitting the non-quants, we see that the librarians perceive a much lower value for limited-check-out ebooks than publishers do. 52 of these 97 librarians would purchase limited-check-out ebooks only if they were priced at a quarter or a tenth of the price of ebooks offered without checkout limitations. In contrast, only 1 of 18 quantitative publishers thought the relative value of limited-check-out ebooks was so small.

What's clear is that even omitting the non-quant responses, librarians are perceiving the new HarperCollins licenses as being worth a small fraction of the previous licenses, offered at the same price. It's not surprising that they think it's an awful deal. It's a stick, not a carrot.

Publishers SHOULD be valuing the two licenses based on revenue lift, and they don't seem to expect a huge revenue lift from limiting check-outs. 10 of 18 quantitative publisher respondents seem to expect a revenue difference of 50% or less. My guess is that they're roughly right; I will do some modeling based on library check-out statistics and report on that in the next week or so.
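A back-of-the-envelope version of that kind of modeling might look like the following sketch. Everything in it is a hypothetical assumption: lifetime checkouts per title are drawn from a log-normal distribution (loosely motivated by the Huddersfield circulation data discussed in earlier posts, with made-up parameters), and a library is assumed to re-buy a 26-checkout license every time a title hits the cap.

```python
import math
import random

random.seed(42)

CAP = 26  # HarperCollins' checkout limit

def licenses_needed(checkouts, cap=CAP):
    """Capped licenses a library must buy to cover a title's
    lifetime checkouts (at least one)."""
    return max(1, math.ceil(checkouts / cap))

# Hypothetical lifetime checkouts per title, drawn from a log-normal
# distribution with a median of about 4.5 checkouts; illustrative only.
titles = [max(1, int(random.lognormvariate(1.5, 1.2))) for _ in range(10_000)]

# Revenue lift = average capped licenses sold per title, relative to the
# single unlimited license that would otherwise have been sold.
avg_licenses = sum(licenses_needed(c) for c in titles) / len(titles)
print(f"average capped licenses per title: {avg_licenses:.2f}")
```

With these made-up parameters most titles never reach the cap, so the lift comes out well under 50%, in line with what the quantitative publishers seem to expect; real circulation data could of course behave differently.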

Looking at the survey results from the other side, librarians are reporting that they put a huge value on the "durability" of the ebooks they license. They don't want books of any kind that wear out! Publishers that want to deliver the highest perceived value (and thus justify the highest prices) should consider finding ways to add to this quality.

One way to increase an ebook's durability is to use standard formats, such as ePub or PDF. This increases a library's confidence that the ebooks will survive into the future; ePub and PDF are the formats used by Overdrive. Unfortunately the DRM ("Digital Rights Management") systems that wrap these files are proprietary, and there is a risk that a library's "purchases" will disappear if its ebook platform vendor (Overdrive) or DRM provider (Adobe) disappears in the future. Libraries are used to thinking with long time horizons, and it's a rare library that doesn't have books over 50 years old, much older than either Overdrive or Adobe.

The simplest way to add to the long-term durability of ebooks is to provide libraries with DRM-free, not-for-circulation files in addition to the DRM-wrapped files for circulation. Libraries are used to dealing with license restrictions and have a good record of compliance in this sort of matter; it's likely they would opt to delegate the safekeeping of such files to third parties. They'd also want to be able to use the files in place of the statutory copying of print books allowed to libraries under US copyright law, and to aid discovery in their catalog systems.

Another way to increase the value of an ebook license to libraries without reducing publisher revenue is to selectively allow those uses that are most likely to create publicity and lead to sales. Imagine what would happen if most library ebooks allowed simultaneous use in the first month after a book's publication. This would help libraries attract patrons with "hot" items, and would likely increase total sales by building buzz. Many library readers would want to purchase the book once their loan period expired. More patrons for libraries translates into stronger funding (or at least fewer cuts!), which in turn allows for better acquisition budgets.

Andy Woodworth has some more ideas on making ebook rights packages that would be attractive to libraries, and I'm sure there are many more ways for publishers to offer ebook carrots to libraries. Or at least a parsnip.

Updates: The polls remain open. Gluejar is still hiring, but it's looking like the team will be awesome!

Tuesday, March 1, 2011

Gluejar is Hiring

March comes in like a lion.

I've spent the last two months talking to as many people as possible about my ideas for a new business surrounding "ungluing e-books". I've been telling people I would start hiring in March. And here it is, March, and I really need to avoid getting too sucked into #hcod bloggery. I hope I haven't pissed off anyone I'll need to be friends with.

I'll get around to writing some real job descriptions, but that's hard to do when you're creating something new. In a start-up environment, people need to do multiple jobs. They need to be expert at something, but they also have to be willing to fill in the cracks between the team's areas of expertise. A good startup always has a breadth of ambition that exceeds its funding, so people have to stretch. If you hire to match your ambition, you not only exceed your funding, you also tend to lose focus and struggle to keep the team on the same page.

Gluejar's funding allows for hiring about 4 people (in addition to me). They'll have to build enough of a product, and enough business lift, within a year to attract additional funding that can provide scale and gravity. Unfortunately, I need at least 8 different job descriptions.

The one-sentence description of what Gluejar will build is "a website like, but for 20 million ebooks".

So here's a first pass at describing the talents that I'll be trying to hire.

I'll start by focusing on how to build the product (a marketplace where individuals and institutions can join their purchasing power together to acquire ebooks for the public commons).

Web application engineering talent will be needed. We'll need to build a scalable-ish, transactable database of contributors and content items, collect money from contributors, convert currencies, transport book lists and bibliographic data in and out via APIs, and do lots of data analysis. It would be nice to have someone who knows what OPDS is. Or who can tell the difference between an ISBN and a book. The large team of engineers will probably want to do a lot of sprints. That's a code word.

eBook logistics engineering expertise will be needed. We'll need to verify and manipulate ePub, PDF, and Mobi files, script the sideloading of them onto consumer devices and platforms, and make distribution happen automagically.

Web and UI design talent will be needed. Our website will need to be oriented towards users who love books and want to support their journeys into the public commons.

Product management expertise will be needed. I'm rather fond of the Pragmatic Marketing approach to this, myself, which takes a lot of domain expertise. In this case, the domain is ebooks, ebook licensing, and the book industry, as well as billing, payments, charitable fund-raising, and all the stuff I stupidly omitted to write a job description for. Oh, and customer support and mailing checks.

Legal expertise will be needed. Gluejar and its product will raise novel licensing, copyright and tax issues, as well as lots of mundane licensing, copyright, tax, and accounting issues. We have to know how all this works internationally, as well. It would be nice if the technical systems we build don't make it impossible to address the legal issues.

Practical IP licensing expertise will be needed. By this, I mean not the writing of licenses, but the tracking down of rights owners around the world and mitigating the complete mess that the international publishing industry and its rights rigamarole have bequeathed to us.

Business development expertise will be needed. We'll need to develop partnerships with charitable organizations to identify and promote the un-gluing of specific content categories for which creative-commons licensing will best serve the public's purpose. We'll also need to work with libraries and non-profit publishers that see the public commons as their best hope for staying relevant in the digital content future.

Finally, and most importantly, Gluejar will need marketing genius. Explaining the value proposition of public content to individuals will require imagination, skillful articulation, and passion. Plus, we may need a better name.

I expect that Gluejar will operate office-lessly in its early phases but will occasionally have good parties. I haven't screwed up the courage to look at what health benefits cost these days. If you're interested in helping me make this happen, please send me e-mail. You can figure out how to do that!

Update 3/3: It's been pointed out to me, clueless as I am, that I need a much better description of what Gluejar will do. Here it is (and on the Gluejar Home Page, as well!):
Gluejar is building a place for individuals and institutions to join together to liberate specific ebooks and other types of digital content by paying rightsholders to relicense their works under Creative Commons licenses.