Tuesday, October 27, 2009

Rehashing the Copyright Salami

I got a lot of feedback on my post on "Copyless Crowdscanning" from a variety of people. The comments, taken as a whole, do an excellent job of illuminating the ambiguities and difficulties of copyright law as applied to digitization.

Some people who read the article, such as Robert Baruch, assumed that I was simply trying to evade copyright by distributing the copying among many people and then enabling users to reassemble books in their browsers. Many critics doubt that a salami strategy is much of a defense against copyright infringement, because a judge will see a copy appearing in a user's browser and won't be impressed by the details of how the copy was assembled. James Grimmelmann wrote me:
The problem is in your assumptions; two of them depend on a belief that copying small portions (or making small portions available) is categorically not infringement. It probably isn't, taken alone. But that doesn't mean we can aggregate those small amounts and have them stay non-infringing, any more than we can integrate infinitesimals and have them stay infinitesimal. I think a court would wind up saying that the activities of the various people in the system could be treated as part of a common plan of action, at which point there's full-text copying going on, and thus potential infringement.
I agree with this assessment, and thus a "copyless crowdscanning" organization would have to figure out how to make the reconstitution of the full text impossible. And as I explained, engineers hate to hear that something is impossible!

My main interest, however, was not to make salami by reassembling slices, but to make hash. I was assuming that an index of a book is not a copy of a book or a derivative work, but rather a collection of facts about the book. I am willing to grant that if a copy of a significant fraction of a book can be generated from a collection of facts, then that collection is equivalent to a copy, but if not, then it is neither a copy nor a derivative work.
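To make the "hash" idea concrete, here is a minimal sketch, assuming that a page-level word index is the kind of "collection of facts" at issue. The function name `build_page_index` and the sample pages are made up for illustration; this is not the crowdscanning software. The index records which words occur on which pages, and discards word order and repetition within a page.

```python
import re
from collections import defaultdict


def build_page_index(pages):
    """Map each lowercased word to the sorted page numbers it appears on.

    Hypothetical helper for illustration: word order and per-page counts
    are deliberately not stored.
    """
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Sample text standing in for two scanned pages (illustrative only).
pages = [
    "Call me Ishmael. Some years ago, never mind how long precisely,",
    "having little or no money in my purse, and nothing particular to interest me on shore,",
]
index = build_page_index(pages)
print(index["me"])     # page numbers on which "me" occurs
print(index["purse"])  # page numbers on which "purse" occurs
```

Whether an index like this is too fine-grained to count as mere facts is exactly the open question; the sketch only shows what is and isn't stored.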

I recognize, however, that there are lawyers willing to argue that any index is inherently a derivative work and that rights holders should be able even to control the indexing of their work. It's clear from looking at cases such as the Harry Potter Lexicon case and the Seinfeld Aptitude Test case that judges use a variety of tests to determine if a work is indeed a collection of facts. In the latter case, infringement was established because the "facts" were fictional. Because judges look at a variety of factors, we can't remove copyright law from the picture just because there's no copying. Still, the judge in the Harry Potter Lexicon case was pretty clear that an index per se is a transformative and thus allowed use of the work:
... the Lexicon identifies more than 2,400 elements from the Harry Potter world, extracts and synthesizes fictional facts related to each element from all seven novels, and presents that information in a format that allows readers to access it quickly as they make their way through the series. Because it serves these reference purposes, rather than the entertainment or aesthetic purposes of the original works, the Lexicon’s use is transformative and does not supplant the objects of the Harry Potter works.
Another set of criticisms said that the slices in my salami were too big. Grimmelmann says "A scan of a single page is probably enough to infringe." Richard Nash asserts that "copying a sentence *is* prima facie infringement." And Wes Felter, in a comment, pointed to this article, which goes into the rather deep philosophical and abstract problems created by information reducible to a number that can be "coloured" with infringing intent.

We could modify the crowdscanning software to use an even smaller slice, of course, but here we get to an interesting question: what is the smallest bit of text that can be protected by copyright? I think it would be silly to argue that single words could be copyrighted. As poetry, I'm sure a thousand-word sentence could be. But as a practical matter, I'm guessing most any single sentence could be reused fairly. In the US, at least. And it would be really difficult for anyone to prove that a particular sentence had not been previously used in a copyrighted work. But unfortunately there's no obvious rule.

Is there any objective principle that could be used to determine the copyrightability of text fragments? A computer scientist might argue that a relevant measure should be the probability that a given text fragment could be generated at random, and if there were a lot of math-major judges, that might work. It seems to me that copyright law should at the very least be built using solid blocks. A legal apparatus built on a foundation of copyrightable 3-word sequences, for example, would quickly melt into uselessness.
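As a rough illustration of the computer scientist's measure, here is a sketch that scores a fragment by the log-probability of producing it by drawing words independently at random. Everything in it is an assumption: the word probabilities are invented placeholders rather than corpus statistics, and treating words as independent ignores grammar entirely. It only shows the shape of the calculation, in which each extra word makes chance generation less likely.

```python
import math

# Invented placeholder probabilities; a real estimate would need a large corpus.
WORD_PROBABILITY = {
    "the": 0.05,
    "of": 0.03,
    "a": 0.02,
    "copyright": 0.00001,
    "salami": 0.000001,
}
# Assumed probability for any word missing from the table above.
UNKNOWN_WORD_PROBABILITY = 1e-7


def log10_chance_probability(fragment):
    """Return log10 of the probability of drawing the fragment's words at random.

    Assumes each word is drawn independently (a unigram model).
    """
    words = fragment.lower().split()
    return sum(
        math.log10(WORD_PROBABILITY.get(word, UNKNOWN_WORD_PROBABILITY))
        for word in words
    )


for fragment in ["the", "the salami", "the copyright of a salami"]:
    print(fragment, "->", round(log10_chance_probability(fragment), 2))
```

Under these assumptions the score drops quickly with length, which is the intuition behind saying that 3-word sequences are too common to protect while long passages are not.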

The squishiness of today's copyright system imposes a huge cost on both users and owners of copyright. If the rules were clearer, rightsholders would find it easier to monetize their work, and society would benefit from the increased non-infringing use. For example, there could be a decision that said that reuse of fewer than 10 words cannot by itself be infringing and that 100 words is by default infringing. You don't need any calculus to enjoy a good bite of salami!
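A bright-line rule of that kind is simple enough to write down. The sketch below assumes the example thresholds from the paragraph above (10 and 100 words) and counts words by splitting on whitespace; the function name and return strings are invented for illustration, and no court has adopted such a rule.

```python
def default_presumption(fragment, safe_below=10, presumed_at=100):
    """Classify a reused fragment under the hypothetical 10/100-word rule."""
    word_count = len(fragment.split())
    if word_count < safe_below:
        return "not infringing by itself"
    if word_count >= presumed_at:
        return "infringing by default"
    # Between the two thresholds the rule says nothing either way.
    return "no bright-line answer"


print(default_presumption("a good bite of salami"))
```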


  1. "what is the smallest slice of text that could still be protected by copyright."

    I don't think this is the right question for your use case. Again, this stuff doesn't work mathematically; Grimmelmann has it right in my understanding (no surprise since Grimmelmann knows this stuff better than me).

    I think you almost certainly can copy a sentence from a book, and it's "fair use". You can quote that sentence almost anywhere you want. Sure, in the US there's a four-factor test for fair use, and even a sentence could hypothetically be NOT fair use depending on the factors. But realistically, you can do just about whatever you want with a sentence and it's fair use.

    But what you CAN'T do is copy every single sentence, and then put them back in order, and distribute it. And it _probably_ doesn't matter if one person or entity does it, or if it's crowdsourced. The crowd can collectively violate copyright too.

    So, okay, you don't want to put it back together. Can you copy every single sentence, and put them in an index of that book?

    Well, see, that's exactly what the GBS lawsuit was about -- the lawsuit that is in the process of being settled out of court (or not), but at any rate hasn't been tried yet. And it's not really settled law; we (not even Grimmelmann) don't really _know_ what a court will be likely to decide -- and what the next court up when it gets appealed by either party will be likely to decide, and what will eventually end up being more or less predictable law (we who like predictability hope).

    Doesn't matter if you change it to a word. Doesn't matter if you change it to a letter. Your assumption that you can take a bunch of non-infringing fair uses and AGGREGATE them into a useful product, and your useful product will itself be non-infringing because each of the actions in isolation would be... is fundamentally flawed, legally.

  2. PS: But I agree with you that "If the rules were clearer, rightsholders would find it easier to monetize their work, and society would benefit from the increased non-infringing use." This contradicts your preference for technical solutions over legal ones a bit -- it's hard to come up with technical solutions if you can't predict if they are legal or not.

    There are only two ways that the rules can become clearer (in the US). Congress can pass legislation; or courts can create law through precedent. The first is quicker and more decisive than the second (assuming the law successfully clarifies things!).

    As to how to make rules clearer internationally... ooh boy, I think we're going to be waiting a while for that. The WTO, I guess, is what would do it. But it took quite a buncha decades to harmonize international copyright laws even for the traditional cases, forget the contested digital era.

  3. bibwild- the reason I went through this exercise was to demonstrate that what I believe is a fair use- indexing- could be accomplished by means confined to fair uses. This is partly to address arguments by some that even if Google's end use- search- is a fair use, it was accomplished using full-book copying, which might not be fair use, even if done in pursuit of a fair use. One strain of criticism of the exercise is that a judge might well overlook the means and address only the ends, in which case the exercise is irrelevant.

    Part of the US copyright system's squishiness comes from the doctrine of fair use, which could allow a judge to overlook the means when considering the end, as I've learned from reading some of the precedents. I wonder whether legal systems that do not recognize fair use- in other words most outside the US- would pay more attention to the means. If so, the copyless crowdscan could be more relevant in non-US jurisdictions.

    Thanks for the comments!

  4. Yeah, that makes some sense, avoiding full book copying as a thought experiment just in case it's the full book copying that dooms the legal success, not the end product (and partial copying WOULD be okay). I get it now, cool. Not sure how likely those potentials are, but I'm not a lawyer, and we'll never know until someone takes it to trial.

    If they didn't recognize anything equivalent to fair use, wouldn't ANY copying typically be illegal? Fair use is what allows you to do any copying at all, doesn't it?

    Wikipedia says that most countries have some kind of exemption from copyright for certain uses such as educational, but that the particular rules are indeed all over the place.