Friday, March 5, 2010

Business Idea Number 3: Gluejar Book Search

A few years ago, I was invited to give a talk about the future of libraries at a library staff retreat. After the talk, the speakers were given a special tour of the library, which had recently undergone renovation. I was struck by the loneliness of the stacks. So many books, so much knowledge, so little usage.

As OCLC's Lorcan Dempsey has recently observed, the lawsuit over Google Book Search and its proposed settlement has highlighted the limitations on libraries' ownership of their book collections. There are many things that libraries would like to do with their books that they are prevented from doing by copyright law. The possibility that the Google Books service will enable libraries to reanimate their lonely book collections is the reason that libraries have, for the most part, been sympathetic to Google's digitization program.

One session at last week's Code4Lib conference sharpened my awareness of how libraries are struggling to achieve this reanimation on their own. There were three presentations, from Stanford, NC State (3.65 MB ppt), and the University of Wisconsin, Oshkosh, on "virtual bookshelves". The virtual bookshelf tries to enliven the presentation of an electronic library catalog by reproducing part of the experience of browsing a physical library: sometimes the book you really need is sitting right next to the book you're looking for. It's an idea based on a sound user-interface design principle: present information in ways that look familiar to the user.

The virtual bookshelf is not a new idea. Google has even been awarded a patent on virtual bookshelves; see the commentary here and here. Given that Naomi Dushay (who presented the Stanford work) wrote about Virtual Bookshelves in 2004, it appears unlikely that the Google patent (filed in 2006) will apply broadly at all.

While the virtual bookshelf is a sensible and practical incremental improvement on the library catalog interface, it's also backward-looking. People looking for information today want to search inside the books, not just "browse the stacks". But libraries don't have the ability (today) to search inside the books that they think they own.

Google Books could enable libraries to do just that. Google is spending huge sums of money to digitize books in libraries and make them searchable. When it got sued for doing this, the library community looked forward to having questions surrounding the fair use of digitized books settled in court. For example, while it's pretty clear that using digitization to create a full-text index of a book would be allowed as fair use, the display of "snippets" (as done by Google) may or may not be held to be a fair use of the page scans. When a settlement of the lawsuit was announced, much of the library community was disappointed that these fair-use questions would not be settled.

Google Books already allows users to set up book collections of their own and search them. The results come with snippets (see pictures), but if the settlement is approved, Google's ability to show snippets with vastly reduced infringement liability would leave it with a dominant position in libraries because of its ability to search inside huge numbers of books. If the settlement is not approved, Google's dominance would be similar, except that a copyright decision could shut down Google Books at some time in the distant and irrelevant future.

Some aspects of the settlement create holes in Google's index. As part of the settlement, rights holders can exclude their works from Google's index, and Google's publisher partner program allows publishers to create these holes today. For example, even if you add Tolkien's "The Two Towers" to your Google library, Google won't let you search inside it. Only limited research uses can be made of the digitized works. As the Open Book Alliance's Peter Brantley has argued, it's very hard to tell what sorts of innovation might arise from the availability of large numbers of digitized texts as data; the same goes for indices of these works.

Many other works have been excluded from the settlement. Works published only outside the US, Canada, UK and Australia, as well as works published in the US, but not registered with the copyright office, are not covered by the settlement. Works other than books, such as newspapers, magazines, and other periodicals are also excluded.

For these reasons and others, I've begun talking to people about "Gluejar Book Search". Gluejar Book Search would be a business focused on collecting, aggregating, and redistributing full-text indices of copyrighted material. To comply with copyright law, it would focus on indices that can be distributed without infringing copyright, and would help provide libraries and publishers with tools to produce copyright-safe index documents.
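The post doesn't specify what a copyright-safe index document would look like, but the basic idea can be sketched. The code below is a minimal, hypothetical illustration (the function names and the choice of what to store are my assumptions, not a Gluejar design): an inverted index that records only which books contain each term, and how often. Because word order and positions are discarded, the original text cannot be reconstructed from the index, yet it still supports search.

```python
from collections import defaultdict
import re

def build_index(books):
    """Build a copyright-safe inverted index.

    Maps each term to {book_id: occurrence_count}. Word order and
    positions are deliberately discarded, so the source text cannot
    be reconstructed from the index document."""
    index = defaultdict(dict)
    for book_id, text in books.items():
        for term in re.findall(r"[a-z']+", text.lower()):
            index[term][book_id] = index[term].get(book_id, 0) + 1
    return index

def search(index, query):
    """Return the set of book ids containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], {}))
    for term in terms[1:]:
        result &= set(index.get(term, {}))
    return result
```

An index like this could be serialized and redistributed freely: it carries statistical facts about a work, not the expressive text itself, which is the sense in which it would be "copyright-safe".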

I've frequently encountered the assertion that digitizing all the books in libraries is prohibitively expensive, and that only Google (or possibly the government) could have the financial resources to do it. For example, Ivy Anderson reports an estimate by the California Digital Library that digitizing the 15 million books in the University of California libraries would cost half a billion dollars and take a century and a half. There are two countervailing arguments. First, the cost of book digitization software and equipment has fallen rapidly, and will continue to fall. Last year, I wrote about Dan Reetz's DIY book scanner, but even commercial devices capable of both image acquisition and OCR are currently available for as little as $1,400. I described how it could cost as little as $10,000,000 to put scanners in 10,000 libraries to enable scanning of 5,000,000 books per year.

The other factor that could drastically lower the cost of producing digital full-text indices of all types of copyrighted materials is that an indexing system makes much lower technical demands than an archival imager. Archival imagers produce huge scanned image files because of the need for high resolution in an archival image; the resulting demands on storage hardware are significant and expensive. In contrast, an index file can be quite small. The laptop I'm typing on could store indices for 3,000,000 books; I estimate that full-text indices of all the world's books would today require at most ten commercially available hard drives.
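A back-of-envelope check of these numbers: the post states the conclusions (3,000,000 books on a laptop, ten drives for the world) but not the inputs, so the per-book index size, the laptop disk size, and the world book count below are my own assumptions, chosen to be plausible for 2010.

```python
# Back-of-envelope estimate of full-text index storage needs.
# Assumptions (mine, not from the post): roughly 80 KB of index
# per book, a 2010-era laptop disk of 250 GB, and ~130 million
# distinct books worldwide.
KB, GB, TB = 1024, 1024**3, 1024**4

index_per_book = 80 * KB
laptop_disk = 250 * GB
books_on_laptop = laptop_disk // index_per_book
print(f"books indexable on one laptop: {books_on_laptop:,}")  # 3,276,800

world_books = 130_000_000
total = world_books * index_per_book
drives = -(-total // TB)  # ceiling division: number of 1 TB drives
print(f"world index: {total / TB:.1f} TB, or {drives} drives of 1 TB")
```

With these assumptions the arithmetic lands close to the post's claims: about 3.3 million book indices fit on one laptop disk, and the whole world's books need just under 10 TB.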

Gluejar Book Search would be fueled by two main revenue streams. The first would come from customized search services that enable library patrons to search inside the library's books. The second would come from providing aggregated feeds of index files to mass-market and specialized search providers (Google's competitors) and to book retailers such as Amazon and its competitors. Google may even want to acquire index files for works it has been asked to remove from its own index, such as the Tolkien book mentioned above.

A possible third revenue stream would come from partnerships with rightsholders willing to permit page or snippet display in exchange for link traffic. If a Book Rights Registry comes into existence, it's possible that many business models could be arranged without prohibitive transaction costs.

Part of the revenue from Gluejar Book Search could be returned to libraries, publishers and other institutions that have contributed index files to the aggregation. Libraries could choose to use these funds to fund further digitization; alternatively, they may prefer to contribute to an Open-Access index.

The success of Gluejar Book Search would depend to a significant extent on its ability to reach critical mass. If it could index 80% of a library's book collection, it would deliver significant value to the library. (That statement is based purely on conjecture; email me or leave a comment if you agree or disagree!) Critical mass might be rapidly attained by working closely with publishers and by partnering with low-cost digitization providers and existing content aggregators to obtain indices for the most widely held books. Once critical mass is reached, the "long tail" could be addressed by encouraging the participation of large numbers of libraries around the world.

A Gluejar Book Search business would require a significant but not huge amount of capital, if for no other reason than to address litigation risk. Although I believe the legal position of building copyright-safe book indices is secure, there are bound to be litigious rightsholders with a poor grasp of fair use under copyright. The other big risks involve Google. Google might well develop services that greatly undercut Gluejar Book Search's revenue streams. Finally, the "copyright-safe" approach might be completely undermined if courts in many countries were to rule decisively for an expansive view of fair use.

If you want to know more about Gluejar, read this post. I have been exploring many possibilities about "what to do next", and I've written about other ideas, as well. As always, I'm interested in feedback of all kinds. Over the next few months, I hope to develop this and other ideas in more depth, so stay tuned.


  1. What is the major difference between your Gluejar full-text indices of copyrighted material and OCLC's Worldcat?

    Also, check out the U.S. Department of Education's ERIC site. There are 1.3 million searchable records for educational materials. Some point to copyrighted full-text locations with links. The full text is provided if copyright release is obtained.

    Is this what you had in mind for Gluejar only on a larger scale?

  2. WorldCat and ERIC search on metadata, not full text (for the most part). Google Books searches the full text when it has a digitized version of the book.

  3. The goal of Gluejar Book Search, narrowed to the library perspective of getting people into the stacks, calls to mind Paratext's Reference Universe product. Paratext has created an index of indices of print reference works, thus exposing the library's print reference collection to users. Not full text though, as far as I know.