Wednesday, November 11, 2009

The Uniqueness of Sentences and J. K. Rowling's (Non)Infringement of Tanya Tucker

Have you ever heard someone say something unusual and wondered to yourself if anyone in the history of humanity had ever said that before, ever? It happens a lot more than you might think.

In the discussion of my article on copyright salami, I suggested that copyright based on content as short as a sentence would not be very robust. I had reasoned that if the sentences were short enough, there would be a high probability that the same sentence had already appeared in a copyrighted work, or even in a work that was in the public domain. I imagined building huge databases of sentences that had already been used so as to clear them for reuse.

I decided to do some testing first. I chose a page at random (p. 447) from my (print) copy of J. K. Rowling's Harry Potter and the Deathly Hallows. I extracted the sentences, and put each sentence into Google and into Google Book Search. The results surprised me.

My first test sentence was
"Get - off - her!" Ron shouted.
With only 5 words, none of them uncommon, I expected to get a few close matches. The book search produced zero hits, and nothing even close. The general Google search was more interesting. It returned 7 hits, all of them exact matches. The top two appear to be properly attributed fair use quotations from the book. Two other hits were to complete, unauthorized copies of the book. One of these, on SlideShare, offers this disclaimer:
"hey here i got this book in pdf format .. am i violating anything after .. uploading this stuff over here ... just let me know .. if any issue come in existence, will remove it
Although the item has had 34,000 views, the pdf itself appears to have been removed from SlideShare. The pdf posted by a Filipino web designer on his web site, though, is still available (and has been since August) and is of quite good quality.

The oddest hits are to a site which masquerades as a "game ranking" portal site.
RPGRank is a real-time online game ranking system which provide a best MMORPG ranking portal for both players and games of all genre with the exclusive news, press release, review, preview, interview, trailer and vedio. RPGRank strive to provide all gamers things that they never experienced before by newest game beta keys, live-event, and online tournamentsa with attractive giveaways from games.
It appears that this site generates pages of random text for the benefit of search engines by extracting sentences from books and feeding the sentences to Google in a random order. This site has convinced Google to index "about 318,000" pages of its meaningless "content", and offers to sell "background" advertising space on the site at $1200 per month.

The last hit appears to be to a site which is presenting a Vietnamese translation of the book alongside the complete English text. Although I can't read Vietnamese, I doubt very much that it is authorized use. Vietnam joined the Berne convention only 5 years ago, so this is certainly an illegal infringement.

Of the 26 sentences on page 447, I could find only three that had been used in places that Google knows about. The first, "Leave him alone, leave him alone!" is a line from a Tanya Tucker song. The second, "Harry's stomach turned over.", has been used in James Edward Amesbury's "bloody but weakly conceived thriller", A Sporting Chance and in D. Edwards Bradley's Harry's War.

The third, "Harry did not answer immediately.", is firmly in the public domain, having done duty as a complete sentence in Smith Hempstone's A Tract of Time, as a fragment in Frances Elizabeth G. Carey-Brock's 1867 My father's Hand: and Other Stories, and in Adam Williams' 2007 gripping adventure of modern China, The Dragon's Tail.


Three sentences comprising bits of dialog, "Been Stung", "And your first name?", and "Vernon Dudley", turned up numerous matches to fragments of sentences in Google. It was also amusing to see matches for the sentence "What happened to you, ugly?" This phrase matched two people-search sites which specialize in feeding Google pages with text like "What happened to Joe Smith?" Apparently there is someone who uses the screen name "you_ugly", and the people search engines just leapt to the wrong conclusions!

Most of the sentences on page 447 appear to be purely original to J. K. Rowling. Was she lucky, or were the odds stacked in her favor? Word frequencies for English have been measured, so we can easily generate a simplistic estimate of sentence occurrence rate. Ignoring the proper name "Ron", the words "Get", "off", "her" and "shout" have occurrence frequencies of 0.22%, 0.046%, 0.22%, and 0.0055%, respectively. Multiplying these occurrence rates gives us a weighted occurrence probability of this combination of 1 per 8 trillion. If you had the entire population of earth speaking random four-word English sentences they might come up with this combination in a day or two. Add "Ron" into the mix, and they might take the greater part of a year to generate the sentence J. K. Rowling wrote.
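
The multiplication above is easy to reproduce. Here is a minimal sketch in Python that uses only the four word frequencies quoted in this post; the world population figure (6.8 billion) is my own rough assumption, and the whole thing inherits the simplistic premise that words are chosen independently.

```python
import math

# Word frequencies quoted in the post, converted from percent to fractions.
WORD_FREQUENCIES = {
    "get": 0.22 / 100,
    "off": 0.046 / 100,
    "her": 0.22 / 100,
    "shout": 0.0055 / 100,
}

# Assumption (not from the post): rough world population in 2009.
WORLD_POPULATION = 6.8e9

# Treating each word as an independent draw, the chance of this exact
# four-word combination is the product of the individual frequencies.
probability = math.prod(WORD_FREQUENCIES.values())
one_in = 1 / probability

# Random four-word sentences each living person would need to utter before
# the combination is expected to show up once across the whole population.
sentences_per_person = one_in / WORLD_POPULATION

print(f"probability: {probability:.3e}")
print(f"about 1 in {one_in / 1e12:.1f} trillion")
print(f"about {sentences_per_person:.0f} sentences per person")
```

How many days that takes depends entirely on how fast you assume people talk, so the sketch stops at the per-person sentence count.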

For context, it's interesting to guess at the total number of sentences that humanity has written or spoken. It's estimated that 100 billion humans have lived so far. If those humans spent 16 hours a day for an average of 65 years generating 3 sentences per minute, we'd be up to about 7 million trillion sentences. The real number is probably a factor of 100 to a thousand less (half of us are men, after all!). This estimate roughly agrees with estimates of others that all the words ever spoken could be archived using 10 exabytes of storage.
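
For anyone who wants to check the arithmetic, here is a rough sketch of the same back-of-the-envelope estimate. The inputs are the ones stated above; the 365-day year is my own simplification. With these inputs the product works out to roughly 6.8 × 10^18, that is, several million trillion sentences.

```python
# Inputs taken from the paragraph above.
HUMANS_EVER = 100e9          # estimated number of humans who have ever lived
HOURS_PER_DAY = 16           # waking hours spent producing sentences
YEARS_PER_LIFE = 65          # average lifespan used in the estimate
SENTENCES_PER_MINUTE = 3

# Assumption (not from the post): ignore leap years.
DAYS_PER_YEAR = 365

sentences_per_person = (
    SENTENCES_PER_MINUTE * 60 * HOURS_PER_DAY * DAYS_PER_YEAR * YEARS_PER_LIFE
)
upper_bound = HUMANS_EVER * sentences_per_person

# The post guesses the real number is 100 to 1000 times smaller.
realistic_high = upper_bound / 100
realistic_low = upper_bound / 1000

print(f"per person:  {sentences_per_person:,}")
print(f"upper bound: {upper_bound:.2e}")
print(f"realistic:   {realistic_low:.1e} to {realistic_high:.1e}")
```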

Ten exabytes is not as much storage as it used to be. The Internet Archive currently has 0.003 exabytes; although Google is quite secretive about its hardware deployment, it seems likely that their current storage capacity is in excess of 10 exabytes. Yesterday, Google announced a pricing plan where they'll rent you 0.000016 exabytes for $4096 per year. I'll do the math for you. If you want to store everything anyone has ever said, Google will rent you the space for only about $2.56 billion per year!
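
The math in question, as a short sketch. It assumes decimal units (1 exabyte = 1,000,000 terabytes), so the 0.000016-exabyte plan is 16 terabytes, and it assumes you could simply stack as many of those plans as you need at the same price.

```python
# Assumption: decimal units, so 1 EB = 1,000,000 TB.
TB_PER_EB = 1_000_000

PLAN_SIZE_TB = 16            # 0.000016 exabytes, as quoted in the post
PLAN_PRICE_USD = 4096        # price per plan per year, as quoted in the post
TARGET_EB = 10               # storage for every word ever spoken

# Number of 16 TB plans needed to reach 10 EB, and the yearly bill.
plans_needed = TARGET_EB * TB_PER_EB // PLAN_SIZE_TB
yearly_cost_usd = plans_needed * PLAN_PRICE_USD

print(f"plans needed: {plans_needed:,}")
print(f"yearly cost:  ${yearly_cost_usd:,}")
```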

Given that Google will soon have digitized a large fraction of the world's books, there are a few things we can learn from this exercise.
  • It will soon be very easy for Google to detect unauthorized copies of books in its index, and presumably to remove them. The benefit to publishers of doing this would hugely outweigh any damages they're suffering from the Google Books digitization program. Why have publishers overlooked getting this to happen as part of the agreement settling their lawsuit?
  • It will not be difficult for Google to accurately de-duplicate the Google Books index.
  • J.K. Rowling's hesitancy to release her books in ebook format is really, really stupid.
Before you get distracted with something useful, do this: pick about 5 random words, make a sentence from them, and become the first human ever to say that sentence. Depending on what you do next, you may also be the last!

2 comments:

  1. > It will not be difficult for Google to accurately
    > de-duplicate the Google Books index.

    you missed an important aspect of this, namely
    that google will be able to use the multiple copies
    to eliminate many of the errors in their o.c.r. data.

    -bowerbird
