Friday, October 23, 2009

Copyless Crowdscanning: How to Legally Index the World's Books

Here's how I know that I have engineering in my DNA. Whenever I hear something labeled as impossible, impractical or unlawful, I can't restrain myself from trying to think of ways around the physical, logistical and legal constraints that supposedly imply impossibility. "That", "is" and "impossible" are fighting words to an engineer. And that's why I've admired the proposed Google Books Settlement. By way of a spectacular feat of legal engineering, it has suggested a way to do the seemingly impossible (to build a database of all the world's books) in the face of the tremendous obstacle posed by an extremely messy legal situation.

But despite my admiration for the "engineering" involved in the settlement, there have always been some things I didn't like about it. And despite all that's been written about it, and the many aspects that people have objected to, I've never seen anyone voice my particular misgivings, perhaps because of their peculiar engineer's orientation.
  1. The settlement uses a legal innovation to accomplish its goals. I don't like that (the "legal" part, not the "innovation" part). Many people have objected to the particular innovation that is used, arguing that this precedent could lead to a reign of tyranny and/or other cataclysm, but I've not seen any objection to the use of legal apparatus in the first place. I've often made the disclaimer here that I Am Not A Lawyer, but I've generally downplayed my ingrained bias for using technology rather than law to solve the world's problems.
  2. The settlement seems to be based on a presumption that Google's database of all the world's books cannot be built without making copies. I don't like to assume things are impossible. I should also note that several of the arguments opposing the Google Books Settlement rely on exactly the same presumption!
As the months have dragged on and the postponements have piled up, I'm thinking that my first objection is starting to make more and more sense. After mulling it over for more than 6 months, I'm starting to think that my second objection is also valid. The rest of this post describes how it might be possible to build a full-text database of all the world's books without doing any copyright-infringing copying. I'll call this scheme "Copyless Crowdscanning".

What got me started on this line of thought were some simple cost calculations I presented in my article on Dan Reetz' DIY book scanner. It made me realize that the idea of having hundreds of thousands of people scanning their books with cheap scanners was not out of the realm of possibility. The main barrier to assembling a database of all the world's books will no longer be the scanning, but rather the laws governing copyright. So my focus is on how to do crowdscanning so that copyrights are not infringed; the easiest way to do that is to not make any copies.

Here are the assumptions I start with. As I've been learning about copyright, I've learned that there will always be a copyright lawyer somewhere willing to contest any common-sense assumption about copyright, so it's important to start somewhere. First, I'm assuming that scanning a small number of pages of a book (suppose that number is 1% of the book) for the purpose of indexing those pages is not a violation of copyright, as long as I don't redistribute the scans and I destroy them once I finish my indexing. The indices are things I should be able to keep and redistribute.

Second, I'm assuming that it is not a violation of copyright to redistribute single sentences from a book. So, for example, publishing the following sentence:
The punishment lay in knowing that you were putting all of that effort into letting a kind of intellectual poison infiltrate your brain down to its very roots.
is not a violation of Neal Stephenson's copyright to the book Anathem. A corollary of that is that if I shuffle the order of all the sentences in a book, I can redistribute that jumble without violating copyright.
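To make the "jumble" concrete, here is a minimal sketch of sentence shuffling in Python. The function names and the naive punctuation-based splitter are my own illustrative assumptions; real text (abbreviations, quotations, dialogue) would need a much more careful sentence splitter.

```python
import random
import re

def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def shuffle_sentences(text, seed=None):
    """Return the sentences of `text` in random order; the original order is discarded."""
    sentences = split_sentences(text)
    random.Random(seed).shuffle(sentences)
    return sentences

sample = "It was late. Who was there? Nobody answered! The door stayed shut."
jumbled = shuffle_sentences(sample, seed=42)
```

The jumble contains exactly the same sentences as the input; only the ordering information is thrown away.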

Finally, I'm assuming that scanning and distributing the title page of a book and its verso cannot be a violation of copyright; in many cases such distribution would be necessary just to convey statements of fact, which are not subject to copyright. I recognize that artwork on these pages may need excision.

Let's suppose that we had a large number of people participating in our database building project. Suppose, for example, that 100,000 people participated. Each person would scan a small fraction of each book they owned, along with its title pages. The title pages would be submitted to a book identity server, which would return a book identifier. The rest of the page scans would be processed by software, and the scans would then be destroyed. The software would digitize the scans, then chop the pages into individual sentences. An index of the pages would be generated and submitted to an "index aggregation" service. The sentences would be shuffled and submitted to a "sentence serving" service.
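The participant-side processing might look something like the sketch below. Everything here is hypothetical: the function name, the `book_id` string standing in for whatever the book identity server returns, and the assumption that OCR has already turned each scanned page into plain text. The index deliberately records only which words appear on which page, not where on the page they appear, so word order cannot be read back out of it.

```python
import random
import re
from collections import defaultdict

def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def process_scanned_pages(book_id, pages, seed=None):
    """pages: dict mapping page number -> OCR text of that page.

    Returns (partial_index, shuffled_sentences):
      partial_index maps word -> set of (book_id, page_number), with no positions;
      shuffled_sentences holds every sentence from the pages in random order.
    """
    partial_index = defaultdict(set)
    sentences = []
    for page_number, text in pages.items():
        # Record word presence only; duplicates and ordering are dropped.
        for word in set(re.findall(r"[a-z0-9']+", text.lower())):
            partial_index[word].add((book_id, page_number))
        sentences.extend(split_sentences(text))
    random.Random(seed).shuffle(sentences)
    return dict(partial_index), sentences

# Hypothetical usage: two scanned pages of one book.
pages = {12: "The cat sat. The dog ran.", 87: "A cat slept."}
index, sentences = process_scanned_pages("book-001", pages, seed=1)
```

In this sketch `index` would go to the index aggregation service and `sentences` to the sentence serving service; the page text itself is not retained by either.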

After many people have made partial scans and submitted partial indices to the index aggregator, a complete index would emerge that can be used just as Google Book Search is used. The complete sentences would be provided by the sentence server to provide the context of the result sets.
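As a rough illustration of how a complete index could emerge, the sketch below merges partial indices into one set of postings and answers simple word queries. The class and method names are hypothetical. The `expected_coverage` helper rests on an extra assumption of mine: that each participant picks pages independently and uniformly at random, so a given page is missed by one scan with probability (1 - fraction).

```python
from collections import defaultdict

class IndexAggregator:
    """Hypothetical aggregator: merges partial indices of the form
    word -> set of (book_id, page_number)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def submit(self, partial_index):
        """Merge one participant's partial index into the combined index."""
        for word, locations in partial_index.items():
            self.postings[word] |= set(locations)

    def search(self, *words):
        """Return the (book_id, page_number) pairs that contain every query word."""
        sets = [self.postings.get(w.lower(), set()) for w in words]
        return set.intersection(*sets) if sets else set()

def expected_coverage(fraction_per_scan, scans):
    """Expected share of a book's pages indexed after `scans` independent
    random partial scans, each covering `fraction_per_scan` of the pages."""
    return 1 - (1 - fraction_per_scan) ** scans

agg = IndexAggregator()
agg.submit({"cat": {("book-001", 12)}, "sat": {("book-001", 12)}})
agg.submit({"cat": {("book-001", 87)}})
hits = agg.search("cat", "sat")
coverage = expected_coverage(0.01, 100)
```

Under that independence assumption, 100 owners each scanning a random 1% of the same book would be expected to cover roughly 63% of its pages, so widely owned books fill in quickly while rare books would need coordinated page assignment.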

Note that neither the index aggregator nor the sentence server would be able to reconstitute a book or even the pages from a book. It seems to me that it should be possible to add some encrypted information and send the keys to yet another party so as to allow reconstitution of the pages in authorized circumstances, such as for use by people with disabilities. If you can't use the information to reconstitute the book, then it seems to me that no copy exists and no copyrights have been infringed.
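One way to picture the "keys held by another party" idea is sketched below. This is not encryption and not a worked-out design: as a stand-in, the secret is simply the shuffle order, which the hypothetical escrow party would hold while the sentence server holds only the jumbled sentences. The function names are my own.

```python
import random

def split_for_escrow(sentences, rng):
    """Shuffle sentences and return (shuffled, order).

    order[k] is the original position of shuffled[k]. In this sketch the
    sentence server would receive `shuffled` and a separate escrow party
    would receive `order`; neither alone can restore the page.
    """
    order = list(range(len(sentences)))
    rng.shuffle(order)
    shuffled = [sentences[i] for i in order]
    return shuffled, order

def reconstitute(shuffled, order):
    """Restore the original sentence order; requires both halves."""
    restored = [None] * len(shuffled)
    for k, original_position in enumerate(order):
        restored[original_position] = shuffled[k]
    return restored

page = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
shuffled, order = split_for_escrow(page, random.Random(7))
restored = reconstitute(shuffled, order)
```

A real system would want cryptographically strong randomness and proper access control around the escrowed order; this only shows that the ordering information can be separated from the text and recombined on demand.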

If my assumptions are incorrect, then I should expect that HarperCollins will soon be suing me for copyright infringement. I'll be sure to let you know. If they are correct, but there's some theory that would expose any of the crowdscanning participants to liability, then perhaps someone who Really-Is-A-Lawyer could elaborate in the comments. I recognize that copyless crowdscanning wouldn't be applicable without modification to things like art books, artwork in books, poetry collections, sheet music, periodicals, and reference works, but it would be a start. And it would make some engineers happy.

Update: Several people (including real lawyers) have commented to me that crowdscanning would not help much as an infringement defense if the result of the entire system had the effect of making the entire text available. I just want to emphasize that I think a system can be engineered so as to enable indexing while preventing text reconstruction and avoiding the use of copies.


  1. I like the idea, but let's get even more ridiculous. Suppose I had an index server where you could type in the name of a book, and it would return a webpage for each word in the book, in order, along with the information enabling locating that word in the web page.

    Then you wouldn't even have a sentence server -- in effect, the sentence server would be the Internet -- but the key here is the index server. It tells you the information you need to know to assemble a copy. This is somewhat different from, say, The Pirate Bay, which merely links to full copies. It is also different from GBS, which does not provide the full book, in any encoding, for in-copyright works.

    I think you'd have a hard time convincing any judge that the index server isn't providing an encoding of a copy of the data, and is merely pointing to tiny bits of the data. So... :(

  2. Robert Baruch- What you describe is not an index, but an encoding. If it's possible to reconstruct the full text losslessly from the index, then it can serve as a copy. What I'm suggesting is that an index of the text can be built to preclude reconstruction, and thus cannot be considered a copy.

  3. I encourage you to read this article about copyright and data gymnastics:

  4. now figure out a way to do it with
    both hands tied behind your back.

    that would make as much sense...

    if the law makes you act stupid,
    change the stupid freaking law,
    don't figure out how to be stupid.


  5. If your service got popular enough to matter, publishers probably WOULD take you to court. Whether they'd ultimately win or not, assuming you had the money to keep fighting it -- I don't think even an IP lawyer would be comfortable betting much either way. Your argument that this is obviously perfectly legal is unlikely to seem so obvious to a judge.

    On the other hand, there is some settled case law that says showing thumbnail images in search results is generally fair use. Only using excerpts for context is probably more important than whether you copy the whole thing or crowd-source sentences -- if you crowd-sourced sentences, but then DID reassemble them into full texts, I seriously doubt courts would find it much of a defense that any individual person only copied a sentence. The law doesn't always work out 'mathematically'; what actually ends up happening matters too.

    And yeah, I think they DID sue Google even for just providing a search service with excerpts. But that case didn't come to trial, right, it settled. If it had... and then been appealed until it was done with that... it's hard to predict WHO would have won.

    Whether you like legal solutions or not, the law is pretty unsettled with this kind of stuff, if you want to start being able to make confident predictions about legality, there's gonna need to be some new law, by the courts or legislature.

  6. bibwild- To some extent, I think it's an ends vs. means thing. Although IANAL, I think Google's weakness in the book search lawsuit was less the legality of its ends (snippet display) than the legality of its means (making copies of lots of copyrighted books). The point of crowdscanning here is to avoid the copying. If there's no copying, how can copyright be relevant? I agree that crowd-sourcing sentences in a way that would allow reconstruction of full-text would still constitute copying.

  7. Copyright lawyers who work with technical people have heard variations on this theme a million times. It is premised on a fundamental mistake: that judges are like computers, that they follow clear rules literally. They do not -- they are persuaded to interpret ambiguous rules to reach particular results. The boundaries within which the persuasion and interpretation occur are called "law."

    Here's another way to put it: this is no different from a lawyer thinking he can fix a bug on his computer by talking to it gently, flattering its pedigree, and deploying analogies to make it understand that the bug is unwise and illogical.

  8. This is a brilliant technical thought experiment, Eric. One hassle with the premise, though, is that copying a sentence *is* prima facie infringement. However, having performed that illegal act, you can, in defense, say that it is fair use. So legally there isn't a distinction between your approach and Google's approach, as both involve copying, and both have to invoke a fair use defense. You see, fair use is not codified in law; it is what lawyers call an affirmative defense, like, for example, self-defense in murder. So it isn't actually an ends and means thing, because the means in both instances, yours and Google's, are in fact considered illegal.

  9. "If there's no copying, how can copyright be relevant?"

    Well, there definitely IS copying. Whether you copy a page, or a sentence, or a word, or a letter, you're copying. It's just highly likely to be fair use if you only copy a sentence. But that doesn't mean that when a million people each copy a sentence and put them together, it's still just as likely to be fair use.

    I haven't heard of anything in the law that distributing your copying amongst many people acting in a distributed fashion can make your copying non-infringing fair use. Can you find any legislation or precedent to that effect? You could try making the argument to a judge... but I wouldn't bet on the judge agreeing.

  10. [And actually, Richard, fair use IS codified in law -- in legislation even, which I think is what you meant. (lawyers consider judicial precedent to be 'law' too). It started out simply as case law, not as legislation, but Congress took what had become established as case law and put it in legislation in 1976. Copyright Act of 1976, 17 U.S.C. § 107.

    Of course, many details and fine-tunings on top of that still exist in precedential case law, not in legislation]