Some people who read the article, such as Robert Baruch, assumed that I was simply trying to evade copyright by distributing the copying among many people and then enabling users to reassemble books in their browsers. Many critics doubt that a salami strategy is much of a defense against copyright infringement, because a judge will see a copy appearing in a user's browser and won't be impressed by the details of how the copy was assembled. James Grimmelmann wrote me:
The problem is in your assumptions; two of them depend on a belief that copying small portions (or making small portions available) is categorically not infringement. It probably isn't, taken alone. But that doesn't mean we can aggregate those small amounts and have them stay non-infringing, any more than we can integrate infinitesimals and have them stay infinitesimal. I think a court would wind up saying that the activities of the various people in the system could be treated as part of a common plan of action, at which point there's full-text copying going on, and thus potential infringement.
I agree with this assessment, and thus a "copyless crowdscanning" organization would have to figure out how to make the reconstitution of the full text impossible. And as I explained, engineers hate to hear that something is impossible!
My main interest, however, was not to make salami by reassembling slices, but to make hash. I was assuming that an index of a book is not a copy of a book or a derivative work, but rather a collection of facts about the book. I am willing to grant that if a copy of a significant fraction of a book can be generated from a collection of facts, then that collection is equivalent to a copy, but if not, then it is neither a copy nor a derivative work.
I recognize, however, that there are lawyers willing to argue that any index is inherently a derivative work and that rights holders should even be able to control the indexing of their work. It's clear from looking at cases such as the Harry Potter Lexicon case and the Seinfeld Aptitude Test case that judges use a variety of tests to determine whether a work is indeed a collection of facts. In the latter case, infringement was established because the "facts" were fictional. Because judges look at a variety of factors, we can't remove copyright law from the picture just because there's no copying. Still, the judge in the Harry Potter Lexicon case was pretty clear that an index per se is a transformative and thus allowed use of the work:
... the Lexicon identifies more than 2,400 elements from the Harry Potter world, extracts and synthesizes fictional facts related to each element from all seven novels, and presents that information in a format that allows readers to access it quickly as they make their way through the series. Because it serves these reference purposes, rather than the entertainment or aesthetic purposes of the original works, the Lexicon’s use is transformative and does not supplant the objects of the Harry Potter works.
Another set of criticisms said that the slices in my salami were too big. Grimmelmann says "A scan of a single page is probably enough to infringe." Richard Nash asserts that "copying a sentence *is* prima facie infringement." And Wes Felter, in a comment, pointed at this article which goes into the rather deep philosophical and abstract problems created by information reducible to a number that can be "coloured" with infringing intent.
We could modify the crowdscanning software to use an even smaller slice, of course, but here we get to an interesting question: what is the smallest bit of text that can be protected by copyright? I think it would be silly to argue that single words could be copyrighted. As poetry, I'm sure a thousand-word sentence could be. But as a practical matter, I'm guessing almost any single sentence could be reused fairly. In the US, at least. And it would be really difficult for anyone to prove that a particular sentence had not been previously used in a copyrighted work. But unfortunately there's no obvious rule.
Is there any objective principle that could be used to determine the copyrightability of text fragments? A computer scientist might argue that a relevant measure should be the probability that a given text fragment could be generated at random, and if there were a lot of math-major judges, that might work. It seems to me that copyright law should at the very least be built using solid blocks. A legal apparatus built on a foundation of copyrightable 3-word sequences, for example, would quickly melt into uselessness.
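To make the computer scientist's measure a bit more concrete, here is a minimal sketch of one way it might be computed. Everything in it is my own assumption for illustration: the function name `fragment_log2_probability`, the choice of a simple word-frequency (unigram) model with add-one smoothing, and the toy corpus. A serious attempt would need a far better language model and a far larger corpus, and nothing here reflects any legal test.

```python
import math
from collections import Counter


def fragment_log2_probability(fragment, corpus_words):
    """Return log2 of the chance of drawing `fragment` word by word.

    Hypothetical helper: each word is drawn independently, with
    probability proportional to how often it appears in `corpus_words`.
    """
    counts = Counter(corpus_words)
    total = len(corpus_words)
    vocab = len(counts)
    log_p = 0.0
    for word in fragment.lower().split():
        # Add-one smoothing, with one extra slot for unseen words,
        # so that no word ever gets probability zero.
        p = (counts[word] + 1) / (total + vocab + 1)
        log_p += math.log2(p)
    return log_p


if __name__ == "__main__":
    # Toy corpus, assumed purely for illustration.
    corpus = "the cat sat on the mat".split()
    for text in ["the", "the cat", "the cat sat on the mat"]:
        print(text, fragment_log2_probability(text, corpus))
```

Under this model every added word makes a fragment less probable, so a three-word sequence scores far closer to "could have been typed by anyone" than a full page does. A threshold on a score like this is one way an objective rule could be phrased, though where to put the threshold remains a policy question, not a mathematical one.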
The squishiness of today's copyright system imposes a huge cost on both users and owners of copyright. If the rules were clearer, rightsholders would find it easier to monetize their work, and society would benefit from the increased non-infringing use. For example, there could be a decision that said that reuse of fewer than 10 words cannot by itself be infringing and that reuse of 100 words is by default infringing. You don't need any calculus to enjoy a good bite of salami!