Friday, December 16, 2011

How to Dig for Book Data Treasure

To me, the surest indicator of impending doom for book publishing is hearing a publisher cite the advertising of Attributor, an anti-piracy solutions company, as if it were science. It's not the attitude towards piracy that bothers me; that's entirely sensible. It's the implied devaluation of honest data that depresses me.

There's hope though. I've gotten to know quite a number of people throughout the reading ecosystem with whom I can use the word "data" as high praise, roughly equivalent to the word "gold". If you're reading this, chances are you're a member of this secret society, and what follows is a sketch of a treasure map.

In a recent post, I promised to suggest ways that we might measure the effects of library ebook lending on book sales. If you think about it, there are many parallels between attempting such a measurement and previous studies that have tried to measure the effect of ebook piracy on book sales. Unfortunately, the only objective study I know of is a small one done by Brian O'Leary, and the effects it observed were small and ran counter to popular narratives (which is why they're rarely noted in the sort of presentations that cite Attributor advertising).

In that study, O'Leary looked for time-domain correlations between sales figures for books from two publishers and the appearance of the same books on BitTorrent. A similar study focused on library lending could be much more compelling, because library circulation data is a much more direct measure of distribution than any sort of torrent tracking, and librarians are much better than pirates at sharing data.
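
To make the shape of such a comparison concrete, here's a toy sketch in Python of the kind of time-domain comparison O'Leary's study made: average weekly sales before versus after a title turns up on BitTorrent. Every number here is invented, including the week of appearance; a real study would use actual sales feeds and torrent-tracker timestamps.

```python
# A toy before/after comparison of weekly sales around the (invented)
# week a book first appeared on BitTorrent. All figures are synthetic.
from statistics import mean

weekly_sales = [120, 115, 130, 125, 118,   # weeks before appearance
                122, 128, 119, 131, 127]   # weeks after appearance
torrent_week = 5  # hypothetical week the book first appeared on BitTorrent

before = mean(weekly_sales[:torrent_week])
after = mean(weekly_sales[torrent_week:])
change = (after - before) / before

print(f"before: {before:.1f}/wk  after: {after:.1f}/wk  change: {change:+.1%}")
```

Even this trivial version shows why the method is attractive: the book serves as its own control, before and after.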

With the cooperation of booksellers, library circulation and holdings could be compared and correlated to store-by-store sales. For example, you could take a book that's held in a significant fraction of libraries and look for correlations (positive AND negative) between areas where a library is circulating the book and stores where the book is selling. You'd have to remove regional and demographic variance, of course, but with enough data, almost anything is possible.
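
As a sketch of why removing that variance matters, consider the following Python toy, in which circulation and sales correlate strongly simply because both scale with regional population. All the data is synthetic, and the population model and coefficients are pure assumptions; correlating per-capita rates is only the crudest possible control.

```python
# Synthetic illustration of a regional confound: circulation and sales
# both scale with population, so the raw correlation is inflated.
import numpy as np

rng = np.random.default_rng(42)
n_regions = 40

population = rng.uniform(50_000, 500_000, size=n_regions)   # invented
circulation = 0.002 * population + rng.normal(0, 40, size=n_regions)
sales = 0.004 * population + rng.normal(0, 80, size=n_regions)

# The raw correlation mostly reflects the shared population confound...
raw_r = np.corrcoef(circulation, sales)[0, 1]
# ...so a crude control is to correlate per-capita rates instead.
per_capita_r = np.corrcoef(circulation / population, sales / population)[0, 1]

print(f"raw r = {raw_r:.2f}, per-capita r = {per_capita_r:.2f}")
```

The raw coefficient comes out strongly positive while the per-capita one hovers near zero; a naïve analysis would "prove" that lending drives sales when all it found was that big places do more of everything.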

With the cooperation of a large publisher, rigorous experiments could be done. Scientific experiments derive rigor from the use of controls. To prove that lending influences sales, it's not enough to do lending and look for sales. A rigorous experiment would have both a trial where books are lent and an identical trial where the same books are not lent.

One way to control a lending experiment would be to make a random selection of a publisher's catalog available for lending. Imagine if Penguin had worked with the library community on an experimental withholding of a random part of its catalog from OverDrive. The resulting sales could then be analyzed for patterns and trends.
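
A toy simulation of that withholding experiment, in Python. The catalog, the baseline sales figures, and the built-in 10% lending lift are all invented; the point is only the shape of the analysis, which is randomize, withhold, compare.

```python
# Toy randomized withholding experiment: split a synthetic catalog at
# random into lendable and withheld halves, then compare mean sales.
import random
from statistics import mean

random.seed(7)

catalog = [f"title-{i:03d}" for i in range(200)]  # hypothetical catalog
random.shuffle(catalog)
lendable, withheld = catalog[:100], catalog[100:]

def simulated_annual_sales(title, lent):
    base = random.gauss(1000, 150)         # invented baseline sales
    return base * (1.10 if lent else 1.0)  # assumed lift from lending

sales_lent = [simulated_annual_sales(t, True) for t in lendable]
sales_held = [simulated_annual_sales(t, False) for t in withheld]

lift = mean(sales_lent) / mean(sales_held) - 1
print(f"observed lift from lending: {lift:+.1%}")
```

Because the split is random, any systematic difference between the two groups can be attributed to lending rather than to the publisher's choice of which titles to lend.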

It's important that data analysis of this sort be done objectively by researchers with integrity. In any large collection of data, it's possible to focus on data which supports one narrative over another. If lending-sales studies were done, my guess is that some types of books would show correlations very different from others.
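
A deterministic toy example of how that can happen: two invented "genres" with exactly opposite lending-sales trends can still pool into a strongly positive overall correlation, a Simpson's-paradox effect. All the numbers below are made up.

```python
# Invented (lending, sales) figures for two genres, showing how a
# pooled correlation can hide opposite trends within each group.
import numpy as np

lit_lend, lit_sales = np.array([1, 2, 3, 4]), np.array([2, 3, 4, 5])
ref_lend, ref_sales = np.array([11, 12, 13, 14]), np.array([20, 19, 18, 17])

r_lit = np.corrcoef(lit_lend, lit_sales)[0, 1]   # perfectly positive
r_ref = np.corrcoef(ref_lend, ref_sales)[0, 1]   # perfectly negative
r_pooled = np.corrcoef(np.concatenate([lit_lend, ref_lend]),
                       np.concatenate([lit_sales, ref_sales]))[0, 1]

# The pooled figure is strongly positive even though one genre's
# trend is the exact opposite.
print(f"lit: {r_lit:+.2f}  ref: {r_ref:+.2f}  pooled: {r_pooled:+.2f}")
```

Which of the three numbers you report is a choice, which is exactly why the reporting needs to be done by researchers with integrity.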

I've used the word "cooperation" several times already. I'm not so naïve as to think that data sharing will materialize out of thin air. Perhaps the sort of ecosystem-wide organization envisaged by the same Brian O'Leary could be the vehicle to make data treasure digging possible. Opportunity in Abundance for the win!

Monday, December 12, 2011

SOPA Could Put Common Library Software in the Soup

The "Stop Online Piracy Act", or SOPA, is promoted as something that will... stop online piracy. So I was a bit surprised when I learned how it's supposed to work. A key provision of SOPA will shut down "notorious" websites by setting up a national web filter based on domain names. I'm sure the pirates had a great laugh about that one. They'll be the ones benefiting while the rest of us figure out how to avoid collateral damage. Members of Congress should consult the nearest available 14-year-old on the ease of web filter evasion: school teachers in my town routinely access their filter-blocked Facebook accounts by asking students to show them how it's done.

Rerouting domain names to alternate IP addresses is pretty easy to do, and can be very useful as well. One type of software used to accomplish this is called a "proxy server". It's called that because it acts as your web browser's proxy in requesting files from a web site. For example, after connecting to a proxy server in Stockholm, my requests for web pages would appear to issue from a computer in Sweden instead of from my computer in New Jersey.

Libraries often use proxy servers to simplify IP-based authentication of their networks to digital information providers. When an academic library buys access to a database, for example, it gives the IP address of its proxy server to the database provider, which then puts that address on an "allow" list. Everyone at the school then accesses the database through the address of the proxy server. In effect, those proxy-authenticated users circumvent the IP-address-based filter that blocks unauthorized users.
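
For the flavor of it, here's a minimal sketch of such an "allow list" check in Python, using the standard ipaddress module. The network ranges are hypothetical (they're the documentation-reserved blocks), but the logic is the essence of IP authentication: whoever the request appears to come from is who gets checked.

```python
# Minimal sketch of an IP "allow list" check. Addresses are from the
# documentation-reserved ranges and are purely illustrative.
from ipaddress import ip_address, ip_network

ALLOWED = [ip_network("192.0.2.0/24"),     # hypothetical campus network
           ip_network("203.0.113.7/32")]   # hypothetical proxy server

def is_authorized(addr: str) -> bool:
    ip = ip_address(addr)
    return any(ip in net for net in ALLOWED)

# An off-campus user is blocked when connecting directly...
print(is_authorized("198.51.100.23"))   # False
# ...but a request relayed through the proxy appears to come from the
# proxy's own address, and is allowed.
print(is_authorized("203.0.113.7"))     # True
```

This is why the proxy works as a circumvention device: the provider never sees the user's real address, only the proxy's.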

Passage of SOPA would inevitably spawn the creation of a network of proxy servers hosted in countries that reject filtering of the internet. Users in the US could then connect transparently to blocked sites by connecting through a constantly shifting network of proxy servers. The key to that connection would be a proxy auto-config, or PAC, file: essentially a mini DNS file installed in the user's web browser software.
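
For the curious, PAC files are written in a small JavaScript dialect; a minimal one might look like the sketch below. FindProxyForURL and shExpMatch are standard PAC functions, but the proxy host and the blocked domain are invented for illustration.

```javascript
// A minimal PAC (proxy auto-config) file. The hosts are hypothetical.
function FindProxyForURL(url, host) {
  // Send requests for a filtered site through an offshore proxy...
  if (shExpMatch(host, "*.blocked-example.com")) {
    return "PROXY proxy.example.se:8080";
  }
  // ...and let everything else connect directly.
  return "DIRECT";
}
```

Swapping in a new proxy address is a one-line change, which is what makes a "constantly shifting network of proxy servers" so easy to maintain.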

SOPA contains provisions that allow the US Attorney General to
bring an action for injunctive relief against any entity that knowingly and willfully provides or offers to provide a product or service designed or marketed for the circumvention or bypassing of [domain name blocking] and taken in response to a court order issued pursuant to this subsection, to enjoin such entity from interfering with the order by continuing to provide or offer to provide such product or service.
Proxy servers meet the condition of being designed to route around filters and therefore fall into the category of services that could be subject to injunctive action under SOPA. The proxy servers most frequently used in libraries are OCLC's EZproxy and the open-source software known as Squid, but there are many others in use.

In particular, Squid deployments commonly rely on PAC files to direct browsers to the proxy, and thus could be vulnerable if the Justice Department decides that PAC files make it too easy to evade SOPA blockages. Conceivably, the Justice Department could force browser developers to omit support for PAC files, or perhaps to restrict their transmission.

Similar concerns about important software have been raised by Jim Fruchterman on behalf of Benetech, a non-profit that, among other things, provides ebooks to the reading disabled. Benetech is also one of the largest developers of software for human rights activists around the world. They operate Tor servers designed to foster anonymous communications. On Beneblog, Fruchterman worries that Benetech services could be impacted by SOPA. In response, a commenter signing in as "Copyright Alliance" argues that such action would be unlikely because "The State Department is strongly committed to advancing both Internet freedom and the protection and enforcement of intellectual property rights on the Internet." Too bad it's the Justice Department that gets to decide which services constitute circumvention.

I don't think that libraries will have their proxy servers taken away anytime soon, even if SOPA is enacted. But it's likely that the widespread development of SOPA-circumventing infrastructure would degrade the ability of rights holders to find and prosecute copyright violators. Knowledge of the actual locations of unauthorized files would be hidden offshore in distributed proxy servers, completely out of the reach of US law enforcement. The "file lockers" of today would dissolve into ungraspable bit vapors, and the online piracy problem would just get worse and worse.

There are many ways to address the online piracy problem, too many to list in this post. My own company is working on a piracy-neutering business model for ebooks. I don't know enough to evaluate the possible effectiveness of the payment and advertising network components of SOPA. But it appears to me that from the technical point of view, the internet filter component of SOPA will be a charm of powerful trouble, like a hell-broth, boil and bubble.

  1. @amac has a good post on SOPA's scope issues, as well as links to other articles.
  2. I focus here on SOPA, but there are similar issues with PROTECT IP, as described by Steve Crocker and 4 other prominent internet engineers.
  3. The Crocker paper describes a number of other ways that domain name filtering might be circumvented. These include replacing hosts files on the user's computer (similar to PAC file installation) and switching the user to a non-filtered DNS server. Apparently this is done transparently by some types of computer malware. This can only end badly.

Friday, December 9, 2011

Book Lending Ignorance

To what degree does library book lending complement book sales, and to what degree does it substitute for them? I don't think anyone knows for sure. (Well, maybe Amazon, but they're not telling.)

With over 40 billion dollars per year of sales at stake, you would think that the US book publishing industry would want to know as much as possible about how those sales are generated. Since US public libraries circulate more items than US bookstores sell, the industry needs to understand the role of libraries in getting people to read and purchase books. Is it small or big? Does the existence of libraries promote sales or hurt sales? How do the equations change when books become digital?

Publishers do a pretty good job of compiling sales data, and they spend a lot of money to figure out what books are selling and who's buying them. According to BookStats, a cooperative study by the AAP and BISG, Americans bought an average of 7.32 books in 2010.

On the library side, there's a bunch of interesting data. IMLS has been compiling a wealth of data about the footprint of public libraries, which is why I can tell you that the average American borrowed 8.1 items from public libraries in 2009. Library Journal has recently published the first installment of results from a fascinating survey of library patrons. (Aside: this study should be made available in every library!) They find that 46% of respondents use the public library less than 2 times per year.

The LJ Patron Profiles survey shows a strong relationship between library use and book purchasing. For example, over half of survey respondents report buying a book by an author whose works they'd previously borrowed from the library. That's a huge number, considering that 20% of respondents never go to the library, period. At the same time, the survey indicates a competition between buying and borrowing. Respondents who report that they've decreased their use of libraries buy 12.18 books per year, while those who've increased their library usage buy only 10.9 books per year. What we can't tell from the data is cause and effect. With the recession having a wide impact, who's to say whether the folks showing up more at libraries might buy even fewer books if the libraries weren't around?

It costs about 11 billion dollars a year to run public libraries in the US, and libraries work hard to demonstrate their value to the communities that support them. They compile data to measure their activity and the community's return on its investment in libraries. These studies assign much of the benefit of library spending to substitutional activity. For example, a 2009 survey by the Denver Public Library determined that it saved its community $105 million based on the cost to use alternative sources of information, and delivered an additional $5 million by avoiding "lost use", activity that wouldn't have occurred if the library did not exist. (See Public Libraries: A Wise Investment (PDF, 1.4 MB) from the Library Research Service.)

Do libraries really believe that 91% of their circulations would have resulted in purchases if they didn't exist? There's no hard evidence anywhere that that's true. Every librarian can tell you about patrons who loved a book so much they went and bought the whole series, but there are also users who never buy a book they can get in the library. And what about those readers who never go to the library? Surveys are a cheap way to collect data, but they often don't reflect the real behavior of the people surveyed.

So much is unknown, and so much is to be gained by knowing more. What hasn't been done, as far as I know, is to try to compare and correlate hard data on book sales and library lending in any meaningful way. In my next post, I'll describe how a cross-industry cooperative approach to book data collection and analysis might provide some light amid the gloom of the reading industry's winter solstice of understanding.