Wednesday, October 20, 2010

Attributor eBook Piracy Numbers Don't Add Up

In my article on "Consumer Demand for Pirated eBooks", I showed that Google Trends data tells a very different story from the one that anti-piracy services vendor Attributor derived from the very same data. I did not comment, however, on the headline that Attributor gave for its press release. The key finding of the report heralded by that release was that "Daily demand for pirated e-books can be estimated at 1.5-3 million people worldwide." This result has garnered some significant attention, because the number is quite large.

Extracting numbers using the tools used by Attributor is rather involved, and it's taken a while for me to carefully examine the available data. After doing this work, I've decided that when Attributor wrote "can be estimated at 1.5-3 million", they left out the word "blindly". As far as I can tell, Attributor is recklessly inflating the magnitude of ebook piracy; using the very same traffic measurement tools, I estimate the truth to be about 10% of the number they claim.

The Attributor numbers come from data generated by Google's AdWords service. AdWords is designed to help advertisers select advertising keywords and to manage budgets. For example, AdWords will tell you that the keyword "PDF" is used in approximately 101 million searches per month, worldwide, or 3.32 million searches per day. "PDF" is a keyword that a searcher might use in the course of a search for a pirated ebook, so you could reasonably assume that some percentage of these searches involve a consumer looking for a book they can avoid paying for. The trouble with this assumption is that most searches that include "PDF" have nothing to do with ebooks.

Another AdWords tool designed to assist Google advertisers is the keyword suggestion tool. In practice, you use this tool to refine keywords. Here is a table of the top ten refined searches for "PDF":
Keyword (percent of "pdf" searches)
filetype pdf 36.69%
doc to pdf 6.03%
pdf download 3.30%
pdf to swf 3.30%
pdf to xls 2.70%
free pdf 2.70%
pdf free 2.70%
pdf to word 2.21%
pdf to rtf 2.21%
php pdf 1.81%
Of these, it's reasonable to assume that some percentage of the "pdf free" and a smaller fraction of the "pdf download" searches are related to consumers trying to avoid paying for books. The other searches are clearly unrelated to books. We can further use the keyword suggestion tool to refine these estimates. My review of over 700 refined keywords indicates that at most 4% of PDF searches, or 132,000 per day, are looking for ebooks of any kind.
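The arithmetic behind the 132,000 figure can be reproduced in a few lines. This is only a sketch using the numbers quoted above; the average month length of 365/12 days and the function names are assumptions made for illustration, not anything AdWords reports.

```python
# Figures quoted in the text above.
MONTHLY_PDF_SEARCHES = 101_000_000   # worldwide "PDF" searches per month
MAX_EBOOK_SHARE = 0.04               # "at most 4%" of PDF searches

# Assumption: an average month of 365/12 days.
DAYS_PER_MONTH = 365 / 12


def daily_searches(monthly: float) -> float:
    """Convert a monthly search volume into an average daily volume."""
    return monthly / DAYS_PER_MONTH


def upper_bound(daily: float, max_share: float) -> float:
    """Ceiling on searches of interest, given a maximum share of the total."""
    return daily * max_share


daily_pdf = daily_searches(MONTHLY_PDF_SEARCHES)
ebook_ceiling = upper_bound(daily_pdf, MAX_EBOOK_SHARE)

print(f"{daily_pdf:,.0f} 'pdf' searches per day")
print(f"{ebook_ceiling:,.0f} ebook-related searches per day, at most")
```

Under that month-length assumption the daily volume rounds to 3.32 million, and 4% of it lands between 132,000 and 133,000, in line with the figures in the text.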

Rapidshare is a "file-locker" site, and might be expected to appear in search terms for illegally distributed files. A review of AdWords' suggested refinements for the term "rapidshare" reveals that searcher interest in ebooks is negligible compared to that for movies, TV, music and games. Of 743 suggested keywords, only one, accounting for 0.24% of "rapidshare" queries, or about 4,000 per day, is clearly related to ebooks:
Keyword (percent of "rapidshare" searches)
files rapidshare 13.45%
rapidshare download 6.03%
download rapidshare 6.03%
download from rapidshare 6.03%
rapidshare megaupload 4.93%
free rapidshare 3.29%
rapidshare free download 2.70%
free rapidshare downloader 2.70%
free rapidshare download 2.70%
rapidshare download free 2.70%
free download rapidshare 2.70%
rapidshare free downloader 2.70%
download rapidshare free 2.70%
free rapidshare downloads 2.70%
download free rapidshare 2.70%
rapidshare searcher 2.19%
rapidshare search 1.80%
search on rapidshare 1.80%
dvdrip rapidshare 1.21%
rapidshare file 1.21%
rapidshare windows 7 1.21%
rapidshare mp3 1.21%
rapidshare dvd 0.99%
windows 7 rapidshare 0.81%
movie rapidshare 0.54%
rapidshare movie 0.54%
rapidshare upload 0.54%
upload rapidshare 0.54%
rapidshare downloader 0.44%
rapidshare file download 0.44%
rapidshare music 0.44%
music rapidshare 0.44%
download rapidshare files 0.36%
movies rapidshare 0.36%
rapidshare files download 0.36%
rapidshare windows xp 0.36%
720p rapidshare 0.36%
rapidshare premium accounts 0.30%
rapidshare password 0.30%
xbox 360 rapidshare 0.30%
game rapidshare 0.30%
password rapidshare 0.30%
rapidshare game 0.30%
rapidshare premium account 0.24%
premium account rapidshare 0.24%
rapidshare account premium 0.24%
premium rapidshare account 0.24%
rapidshare generator 0.24%
rapidshare engine 0.24%
rapidshare engine search 0.24%
up rapidshare 0.24%
rapidshare software 0.24%
software rapidshare 0.24%
rapidshare ebook 0.24%
Harry Potter and the Twilight Saga make appearances farther down the list, but only the titles that exist as movies.
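One way to make the comparison between media types concrete is to group the rows of the table that name a kind of content and add up their shares. The sketch below does that for the rows shown above; the category names and the marker words used to assign keywords to categories are illustrative assumptions, and a different grouping would shift the totals somewhat.

```python
from collections import defaultdict

# Rows copied from the table above that name a type of content
# (keyword, percent of "rapidshare" queries).
ROWS = [
    ("dvdrip rapidshare", 1.21),
    ("rapidshare windows 7", 1.21),
    ("rapidshare mp3", 1.21),
    ("rapidshare dvd", 0.99),
    ("windows 7 rapidshare", 0.81),
    ("movie rapidshare", 0.54),
    ("rapidshare movie", 0.54),
    ("rapidshare music", 0.44),
    ("music rapidshare", 0.44),
    ("movies rapidshare", 0.36),
    ("rapidshare windows xp", 0.36),
    ("720p rapidshare", 0.36),
    ("xbox 360 rapidshare", 0.30),
    ("game rapidshare", 0.30),
    ("rapidshare game", 0.30),
    ("rapidshare software", 0.24),
    ("software rapidshare", 0.24),
    ("rapidshare ebook", 0.24),
]

# Assumption: these marker words are a hypothetical grouping for illustration.
CATEGORIES = {
    "video": ("dvd", "movie", "720p"),
    "music": ("mp3", "music"),
    "games": ("xbox", "game"),
    "software": ("windows", "software"),
    "ebooks": ("ebook",),
}


def categorize(keyword: str) -> str:
    """Return the first category whose marker word appears in the keyword."""
    for category, markers in CATEGORIES.items():
        if any(marker in keyword for marker in markers):
            return category
    return "other"


def tally(rows):
    """Sum the percentage shares per category, rounded to two decimals."""
    totals = defaultdict(float)
    for keyword, pct in rows:
        totals[categorize(keyword)] += pct
    return {category: round(total, 2) for category, total in totals.items()}


totals = tally(ROWS)
for category, pct in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category:<9}{pct:5.2f}%")
```

With this grouping, video-related keywords add up to about 4% of "rapidshare" queries, software to just under 3%, music to about 2%, and ebooks to the single 0.24% row.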

Although direct interest in ebook torrents is so small that AdWords can barely measure it (~1500 searches per day), torrent search sites can give us another way to estimate the magnitude of interest in pirated ebooks. According to "KickassTorrents", the torrents active recently had this composition:
movies 30.04%
music 27.62%
tv 16.22%
apps 13.76%
games 5.52%
anime 5.43%
ebooks 1.42%
About 1.4 million searches using the keyword "torrent" are made on Google daily, according to AdWords. If the distribution of searches mirrors the distribution of files, this would put searches for ebook torrents at about 46,200 per day.
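The scaling step here is simple enough to check directly. The sketch below applies the ebook share of active torrents to a daily search total, and also runs the calculation in reverse; the function names are hypothetical. Taken at face value, 1.42% of 1.4 million is a little under 20,000 searches per day, whereas 46,200 per day corresponds to a base of roughly 3.25 million "torrent" searches, so one of the quoted inputs may differ from what was actually used and is worth re-checking against AdWords.

```python
# Category mix reported by KickassTorrents, as quoted above
# (percent of recently active torrents).
TORRENT_MIX = {
    "movies": 30.04,
    "music": 27.62,
    "tv": 16.22,
    "apps": 13.76,
    "games": 5.52,
    "anime": 5.43,
    "ebooks": 1.42,
}


def estimated_searches(daily_total: float, category: str) -> float:
    """Scale a daily search total by a category's share of active torrents.

    Assumes, as the text does, that searches are distributed like files.
    """
    return daily_total * TORRENT_MIX[category] / 100


def implied_total(category_searches: float, category: str) -> float:
    """Daily search total that would yield the given category volume."""
    return category_searches * 100 / TORRENT_MIX[category]


from_quoted_total = estimated_searches(1_400_000, "ebooks")
base_for_quoted_result = implied_total(46_200, "ebooks")

print(f"{from_quoted_total:,.0f} ebook torrent searches per day from a 1.4M base")
print(f"{base_for_quoted_result:,.0f} 'torrent' searches per day implied by 46,200")
```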

All in all, I estimate that about 210,000 searches made on Google per day represent possible interest in pirated ebooks. About 30,000 of these come from the US. The "real" number for all countries could be as high as 300,000 or as low as 100,000. The 1.5-3 million figure reported by Attributor is not within the range of plausibility.
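To see how far apart the two sets of numbers are, it helps to express one as a fraction of the other. The following sketch uses only the figures quoted in this post; the variable and function names are chosen for illustration.

```python
# Figures quoted in the text (Google searches per day).
ESTIMATE = 210_000
ESTIMATE_LOW, ESTIMATE_HIGH = 100_000, 300_000
US_SEARCHES = 30_000
ATTRIBUTOR_LOW, ATTRIBUTOR_HIGH = 1_500_000, 3_000_000


def ratio_range(estimate: float, claim_low: float, claim_high: float):
    """Return the estimate as a fraction of the high and low ends of a claim."""
    return estimate / claim_high, estimate / claim_low


low_ratio, high_ratio = ratio_range(ESTIMATE, ATTRIBUTOR_LOW, ATTRIBUTOR_HIGH)
best_case = ESTIMATE_HIGH / ATTRIBUTOR_LOW   # most generous comparison
us_share = US_SEARCHES / ESTIMATE

print(f"Estimate is {low_ratio:.0%} to {high_ratio:.0%} of Attributor's range")
print(f"Highest estimate vs. Attributor's lowest figure: {best_case:.0%}")
print(f"US share of the estimate: {us_share:.1%}")
```

The central estimate works out to between 7% and 14% of Attributor's range, which is where "about 10%" comes from, and even the highest estimate set against Attributor's lowest figure reaches only 20%.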

One difficulty with using Google AdWords to gain insight into piracy is that it measures only a "shadow cast by piracy", as expressed by a commenter on my previous post. Nonetheless, AdWords sheds considerable light on patterns of demand. For example, the tools show clearly that it's common for people to search for movies and TV shows and acquire them extralegally. They also indicate that most of the demand for pirated ebooks, about 82%, comes from outside of the US, UK and Canada. Publishers should plan antipiracy strategies accordingly, based on data that can be confirmed independently.

Update: I have a followup post.
  1. It's probably worth commenting on how antipiracy strategy would be affected by the numbers discussed here.

    If Attributor is correct, then ebook customers have already been lost, and publishers need to figure out how to win them back. If I am correct, then for the most part, ebook customers have NOT been lost (in the US), and publishers need strategies to avoid losing them. These are two rather different objectives.

  2. All of this is interesting but I'm skeptical that Google will accurately reflect piracy numbers. One, most pirates start their search on their favorite piracy site rather than using Google. Two, most of the truly big sites use methods to hide themselves from Google.
    These warez sites are aggregators dealing mainly in software and movies, though most will have an ebook section as well. Visitor demographics skew towards young men and boys, so anything dealing with, e.g., computers can be easily found. Dedicated fiction and even romance forums are the norm. You will not find most of these via Google.

  3. eug- Your point is well taken, but what is of greatest interest in the Google numbers for the book publishing industry is not so much the activities of the experienced downloader of pirated books, but rather the tendency of consumers to seek out free unlicensed ebooks to avoid licensed ebook channels.