Overlooked in the controversy is the fact that Google could have chosen a safer course in its effort to make full-text indices of books. In this article, I'll argue that it's possible to make full-text indices of books in a way that steers well clear of copyright infringement. But first, I should note that playing it safe would not have been a good plan for Google. By pushing fair use to its limits, Google assured itself a favorable competitive position. In a lawsuit, Google could have lost on 90% of the fair use they were claiming and would still have ended up 10% ahead of where a safe course would have taken them. Google is large enough that even a 10% victory in court would have paid off in the long run. As it is, Google chose to settle the lawsuit under terms that put them in a better position than they would have occupied by playing it safe, and potential competitors don't gain the benefits of a fair-use precedent.
I make two assumptions about copyright in devising a copyright-safe indexing method:
- You can't infringe the copyright to a work if you don't copy the work.
- If you can't reconstruct a work from its index, then distributing copies of the index doesn't infringe on the work's copyright.
First, the fall-back assumption: full-text indexing is allowed as fair use under US copyright law. Indices are allowed as "transformative uses". Judge Robert Patterson's decision (pdf, 195K) in the "Harry Potter Lexicon" case gives excellent background on this jurisprudence and concludes:
The purpose of the Lexicon’s use of the Harry Potter series is transformative. Presumably, Rowling created the Harry Potter series for the expressive purpose of telling an entertaining and thought provoking story centered on the character Harry Potter and set in a magical world. The Lexicon, on the other hand, uses material from the series for the practical purpose of making information about the intricate world of Harry Potter readily accessible to readers in a reference guide. To fulfill this function, the Lexicon identifies more than 2,400 elements from the Harry Potter world, extracts and synthesizes fictional facts related to each element from all seven novels, and presents that information in a format that allows readers to access it quickly as they make their way through the series. Because it serves these reference purposes, rather than the entertainment or aesthetic purposes of the original works, the Lexicon’s use is transformative and does not supplant the objects of the Harry Potter works.
The author of the Lexicon lost his case not because his indexing was not allowed, but rather because he copied too much of J. K. Rowling's creative expression in doing so.
Second, you have to copy to infringe copyright. A more accurate statement is this: you have to make either a copy or a derivative work to infringe copyright. The second piece of this can be a bit more confusing, because "derivative work" has a specific meaning in copyright law. A translation into another language is an example of a derivative work. Indices are not derivative works. The law considers indices to be more akin to metadata. I might need access to a book to count the number of figures it contains, but a report of the number of figures in a book and which pages they're on is in no way a derivative work. The Copyright Act defines a derivative work as
a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.
If you make copies by scanning, however, as Google is doing, you must also establish that your use is allowed as fair use. If you don't make copies, then you don't even need to reach the fair use provision.
The last assumption gets more technical. The simplest form of a word index is a sorted list of words with pointers to the occurrence of the word within the text. So an index of that last sentence might look like this:
a 5,9
form 3
index 7
is 8
list 11
occurrence 18
of 4,12,19
pointers 15
simplest 2
sorted 10
text 24
the 1,17,20,23
to 16
with 14
within 22
word 6,21
words 13
It doesn't take a computer science degree to see that it's easy to reconstruct the sentence from this index. For that reason this form of index is equivalent to a copy. If you remove the position pointers, however, the index loses enough information that the sentence cannot be reconstructed. So if we take the words on a page of text and sort the words in each sentence, then sort the word-sorted sentences, we get an index of a page that can't be used to reconstruct text, but can be used to build a useful full-text index of a book.
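The contrast between the two kinds of index can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not a production indexer: the function names (`tokenize`, `positional_index`, `reconstruct`, `page_index`) are made up for the example, and the tokenization is deliberately naive (lowercase everything, keep letters and apostrophes, end sentences at `.`, `!` or `?`).

```python
import re
from collections import defaultdict


def tokenize(sentence):
    """Lowercase a sentence and return its words (naive: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", sentence.lower())


def positional_index(sentence):
    """Map each word to the 1-based positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(tokenize(sentence), start=1):
        index[word].append(position)
    return dict(sorted(index.items()))


def reconstruct(index):
    """Rebuild the word sequence from a positional index."""
    by_position = {pos: word for word, positions in index.items() for pos in positions}
    return " ".join(by_position[pos] for pos in sorted(by_position))


def page_index(page_text):
    """Sort the words within each sentence, then sort the word-sorted sentences."""
    sentences = [s for s in re.split(r"[.!?]+", page_text) if s.strip()]
    return sorted(sorted(tokenize(s)) for s in sentences)


sentence = ("The simplest form of a word index is a sorted list of words "
            "with pointers to the occurrence of the word within the text.")
index = positional_index(sentence)
print(index["of"])           # [4, 12, 19]
print(reconstruct(index))    # the sentence again, lowercased and without punctuation
print(page_index(sentence))  # one alphabetized word list; the reading order is not stored
```

The positional index round-trips back to the sentence, which is why it behaves like a copy. The output of `page_index` still answers the question "does this page contain this word?", but it keeps no record of the order in which the words appeared.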
The trickiest step of completely copyright-safe indexing is producing the page index from a book without producing intermediate copies of the pages. In a conventional scanning process, a digital image of a page is stored to disk and the copy is passed to OCR software. Indexing software then works on the OCR text. A scanning process that was fastidious about copyright, however, could scan lines of text word by word and never acquire an image large enough to be subject to copyright.
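The software side of such a word-by-word process might look like the following sketch. It assumes a hypothetical scanner that hands over one recognized word at a time (represented here by any Python iterable), and it treats a word ending in `.`, `!` or `?` as the end of a sentence. The function name `stream_page_index` is invented for the example. Each word is dropped straight into its alphabetical slot, so the program never holds a sentence in reading order. Whether that is enough to avoid "copying" in the legal sense is a question for the courts, not something the code can settle.

```python
import bisect


def stream_page_index(words):
    """Build a sorted-sentence page index from an iterable of single words.

    `words` stands in for a hypothetical scanner that yields one recognized
    word at a time; a word ending in '.', '!' or '?' closes a sentence.
    """
    page = []      # finished sentences (each a sorted word list), kept sorted
    current = []   # words of the sentence in progress, kept in alphabetical order
    for raw in words:
        word = raw.strip(".,;:!?\"'()").lower()
        if word:
            bisect.insort(current, word)   # insert at its alphabetical position
        if raw.endswith((".", "!", "?")) and current:
            bisect.insort(page, current)   # file the finished sentence
            current = []
    if current:  # trailing words with no closing punctuation
        bisect.insort(page, current)
    return page


# Usage: any iterator of words will do in place of the scanner.
print(stream_page_index(iter("The cat sat. A dog ran.".split())))
# [['a', 'dog', 'ran'], ['cat', 'sat', 'the']]
```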
US courts have considered the loading of a copyrightable work into a computer's RAM storage to constitute copying, but scanning sufficient to produce an index can in principle be done without requiring that to occur. (For an excellent law review article on the RAM-copying situation, read Jonathan Band and Jeny Marcinko's article in Stanford Technology Law Review.) Also, even sentences of more than a few words can be considered copyrightable works, as I discussed in an article from November.
Another possible way to avoid copying is to build a black-box indexer. A closer look at the RAM-copying precedent, MAI SYSTEMS v. PEAK COMPUTER, suggests that a non-copying scanning indexer can be built even if page images exist somewhere in RAM. In that case, the court reasoned that the software copy could be viewed via terminal readouts, system logs, and the like. If a closed-box indexing system were built so that page images resident in RAM could never be "perceived, reproduced, or otherwise communicated", then there is a fair chance that a court would find that copying was not occurring.
I'm a technologist, not a lawyer. I would welcome comment and criticism from experts of all stripes on this analysis. For example, I've not considered international aspects at all. There are many technical aspects of copyright-safe indexing that would need to be sorted out, but doing so could open the way to countless transformative uses of all the books in the world.