Showing posts with label AdWords. Show all posts

Thursday, November 18, 2010

Real Research Gets Reproduced

It's not often that I'm identified as a physicist, as Richard Curtis did in his commentary on my follow-ups to Attributor's piracy "demand" report. But it's true: I worked in materials physics research at Bell Labs from 1988 to 1998.

Crystal structure of YBCO
Those were great years to be in materials physics. In 1986, two guys at IBM Zurich discovered some amazing new superconducting materials. By the end of that year, a team in Japan had reproduced their results; a group I was a part of at Stanford did the same after talking with the Japan group in December. By March, so many groups around the world had made exciting discoveries that the American Physical Society meeting in New York became known as "the Woodstock of Physics".

A blue semiconductor laser.
A few years later, a guy in Japan reported that he had made a semiconductor light emitting diode (LED) glow blue. His work was a lot harder to reproduce; although he published many details, it took years of hard work for anyone to come close to what his team reported. I even sawed one of his LEDs in half to try to understand how it worked. Today, my kitchen (and the screen of my MacBook) is lit by white LEDs made from that same semiconductor.

Around that time, some chemists in Utah announced a truly amazing discovery: they saw fusion reactions occurring in palladium electrochemical cells. Since they were respected electrochemists, their results were taken seriously, and lots of people tried to reproduce the incredible results. The promise of a seemingly magical, unlimited power source seemed almost too good to be true. This time, however, nobody could reproduce the results. Some scientists saw odd things happen, but they were different in every lab. At Bell Labs, the scientists trying to reproduce so-called "cold fusion" became convinced that the guys in Utah were being led astray by their excitement.

In science, it's usual that a surprising result will only be accepted once it has been reproduced by someone else. My scientific training has sometimes gotten me in trouble in the world of libraries and publishing. When presented with something that seems surprising to me, I ask for the evidence. In cultures that are more comfortable assigning and recognizing authority, my questions have sometimes been seen as irritants.

It's been that way with my questions about the Attributor report. I was surprised at some of the findings, and I tried to reproduce them. I couldn't reproduce some of the key findings reported by Attributor. It would be nice to better understand the factor of a hundred difference between my results and those of Attributor; much might be learned from such an analysis. Attributor is a company that sells anti-piracy services; one would hope that the data they report is somehow rooted in fact, even though they benefit from overestimates of piracy.

In Richard Curtis' article, Jim Pitkow, Attributor's CEO, is quoted:
Our study’s rigorous methodology ensured highly accurate results that align with actual consumer behavior. We analyzed 89 titles, using multiple keyword permutations per title, across different days of the week, with very high bids to ensure placement – each of which is fundamental in guaranteeing accuracy and legitimacy. Each of these variables impact the findings, and analyzing all variables together produce highly accurate results. We stand by our research, and we’re confident that the study addresses an accurate portrayal of the consumer demand for pirated e-books.
If Attributor really stands by its research, it will make it easier for people like me to reproduce their results. In particular, they should publish the complete list of the "869 effective keyword terms" used as keywords for their Google AdWords experiment. There are mistakes they might have made in permuting and combining search terms; they might also have thought of a class of effective search terms that my study totally overlooked. As it stands, it's impossible to know.

I can understand why Attributor might not want to release their search term list. First of all, they should expect people to try to tear it to shreds. The marketing department isn't going to like that. That's what happened to the superconductor guys, the blue LED guy, and cold fusion guys. They stood behind their work, and let the scientific community look for weaknesses and make their own judgments.

Cold fusion didn't pan out, and Pons and Fleischmann, the Utah guys, tried for years to figure out what it was they measured. Bednorz and Müller, the guys in Zurich, won the Nobel Prize. Shuji Nakamura, the LED guy, won a Millennium Technology Prize and a lawsuit.

It may be easier to do a followup study without the worry of spurious searches for widely known terms. But at this point, Attributor customers and the book industry as a whole stand to learn a lot from understanding where the irreproducibility of Attributor's study is coming from. Publishers need that information to plan out a response to the threat of ebook piracy, and their needs should come first, no matter what the marketing department says.

Saturday, November 6, 2010

Consumer Interest in Pirated eBooks is Even Lower Than I Thought

My recent posts following up on Attributor's most recent study on demand for pirated ebooks have been republished on TeleRead, probably the longest running blog covering ebooks and related topics. Paul Biba, the current editor, has been doing a great job bringing together interesting articles from many different perspectives.

The TeleRead discussion on my article "Attributor ebook piracy numbers don’t add up" had some interesting contributions. Jim Pitkow, CEO of Attributor, suggested that the difference between the data I got from Google AdWords and Attributor's data might be a result of a difference in methodology. While I used Google's predicted traffic numbers, Attributor used numbers from an actual AdWords campaign. For example, Attributor bought the keyword “lost symbol free ebook” along with 868 others and counted how many impressions were generated by Google.

AdWords predicts a total query volume of 62 per day for the keyword “lost symbol free ebook”. If each of the other keywords did the same volume of queries, Attributor should have seen about 53,900 impressions per day for its ad campaign. It's not at all clear how 53,900 impressions turn into 1.5-3 million searches for pirated content; that would be a factor of 30-60 larger than the traffic predicted by the AdWords estimator tool. Pitkow's comment seemed to suggest that the result of an actual advertising campaign would be different from the estimate, possibly accounting for the discrepancy.
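The back-of-the-envelope arithmetic here can be sketched in a few lines of Python. This is only an illustration: the function and variable names are made up for the example, and the assumption that every one of the 869 keywords draws the same 62 queries per day is the simplification described above, not something that was measured.

```python
# Figures quoted in the post; all names are illustrative, not from any tool.
KEYWORD_COUNT = 869                         # "lost symbol free ebook" plus 868 others
PREDICTED_PER_KEYWORD = 62                  # AdWords' predicted daily queries for one keyword
ATTRIBUTOR_RANGE = (1_500_000, 3_000_000)   # Attributor's claimed daily demand


def expected_impressions(keywords: int, per_keyword: int) -> int:
    """Daily impressions if every keyword matched the single known prediction."""
    return keywords * per_keyword


def discrepancy_factors(claimed: tuple, expected: int) -> tuple:
    """How many times larger each end of the claimed range is than the expectation."""
    return tuple(value / expected for value in claimed)


expected = expected_impressions(KEYWORD_COUNT, PREDICTED_PER_KEYWORD)
low_factor, high_factor = discrepancy_factors(ATTRIBUTOR_RANGE, expected)

print(f"expected impressions per day: {expected:,}")
print(f"claimed demand is {low_factor:.1f}x to {high_factor:.1f}x larger")
```

The unrounded product comes out slightly under 54,000, and the ratios land a little below the rounded "30-60" quoted above; either way the gap is more than an order of magnitude.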

So I did an actual advertising campaign of my own to see if this was true.

I've used AdWords in the past with amazingly cost-effective results. I would spend about five dollars a month at five cents per click to advertise a service that sold for thousands of dollars per year. My experience, albeit somewhat dated, was that the estimator tool overestimated, not underestimated, the actual traffic. I was interested to see if this was still true.

I constructed a "free ebook survey" landing page for my ad campaign, and added StatCounter analytics so I could see who clicked on my ad. I bought the keyword suggested by Pitkow: “lost symbol free ebook”.

AdWords gives you a number of settings to fine-tune your ad campaign. For example, I checked all the boxes for geographical coverage so my ads would be seen in as many countries and languages as possible.

AdWords offers three important options that determine the distribution of the ads. You can choose to advertise on Google only; on Google and its "search partners"; or on Google, its search partners, and Google's "display network". I spoke with Pitkow last week, and he indicated that Attributor's study used both Google and its search partners.

The initial results of my ad campaign were alarming. In just four hours, I accumulated 7,000 impressions and 19 clicks; my campaign halted because my bids were too low. This traffic level would easily support Attributor's estimates. But when I looked into it, I found that I had included the "display network" without meaning to. What's more, the referring sites were really junky. I couldn't imagine who would use sites like "Lost World TV", NDParking and Sebaidu. The 61 cents I spent in those four hours may well have gone to a bunch of sites engaging in click fraud.

A full week of advertising on just Google, by contrast, resulted in a grand total of 15 impressions, far fewer than Google's estimate of 62 per day. I next added Google's search partners to my campaign, and got an impression rate about three times higher.

Even with the search partners, the reported search volume is about a tenth of the predicted volume. I must therefore revise the estimate I made about "consumer demand for pirated ebooks". Instead of 100,000-300,000 searches per day, 10,000-30,000 per day throughout the world seems to be a better estimate.
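For anyone who wants to check the scaling, here is a rough sketch of the arithmetic in Python. The names are invented for the example, "about three times higher" is treated as exactly 3, and the measured-to-predicted ratio is rounded to one decimal place before it is applied, which is how "about a tenth" is being read here.

```python
# Observed and predicted figures from the campaign; names are illustrative.
GOOGLE_ONLY_IMPRESSIONS = 15        # total over one week, Google search only
CAMPAIGN_DAYS = 7
PARTNER_MULTIPLIER = 3              # "about three times higher" taken as exactly 3
PREDICTED_PER_DAY = 62              # AdWords estimate for the same keyword
EARLIER_ESTIMATE = (100_000, 300_000)  # previous worldwide searches-per-day range


def measured_fraction() -> float:
    """Observed daily impressions (with search partners) over the prediction."""
    google_only_per_day = GOOGLE_ONLY_IMPRESSIONS / CAMPAIGN_DAYS
    with_partners_per_day = google_only_per_day * PARTNER_MULTIPLIER
    return with_partners_per_day / PREDICTED_PER_DAY


def revised_estimate(earlier: tuple, fraction: float) -> tuple:
    """Scale the earlier range by the measured fraction, rounded to one decimal."""
    scale = round(fraction, 1)
    return tuple(round(value * scale) for value in earlier)


fraction = measured_fraction()
print(f"measured / predicted = {fraction:.3f}")
print("revised range:", revised_estimate(EARLIER_ESTIMATE, fraction))
```

A single keyword over a single week is a very small sample, so the rounding to "a tenth" is about as much precision as the data can bear.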

As a result of this experiment, the Attributor numbers are even more inexplicable than before. It's worth noting, however, the one area where there's no disagreement: both my investigations and Attributor's show that consumer interest in piracy is mostly located outside the US, UK, and Canada. Jane Litte's recent post on the geographical restrictions conundrum for ebooks (and comments thereto) does an excellent job of describing why that may be so.

Wednesday, October 20, 2010

Attributor eBook Piracy Numbers Don't Add Up

In my article on "Consumer Demand for Pirated eBooks", I showed that Google Trends data tells a very different story from the one that anti-piracy services vendor Attributor derived from the very same data. I did not comment, however, on the headline that Attributor gave for its press release. The key finding of the report heralded by that release was that "Daily demand for pirated e-books can be estimated at 1.5-3 million people worldwide." This result has garnered some significant attention, because the number is quite large.

Extracting numbers using the tools used by Attributor is rather involved, and it's taken a while for me to carefully examine the available data. After doing this work, I've decided that when Attributor wrote "can be estimated at 1.5-3 million", they left out the word "blindly". As far as I can tell, Attributor is recklessly inflating the magnitude of ebook piracy; using the very same traffic measurement tools, I estimate the truth to be about 10% of the number they claim.

The Attributor numbers come from data generated by Google's AdWords service. AdWords is designed to help advertisers select advertising keywords and to manage budgets. For example, AdWords will tell you that the keyword "PDF" is used in approximately 101 million searches per month, worldwide, or 3.32 million searches per day. "PDF" is a keyword that a searcher might use in the course of a search for a pirated ebook, so you could reasonably assume that some percentage of these searches involve a consumer looking for a book they can avoid paying for. The trouble with this assumption is that most searches that include "PDF" have nothing to do with ebooks.

Another AdWords tool designed to assist Google advertisers is the keyword suggestion tool. In practice, you use this tool to refine keywords. Here is a table of the top ten refined searches for "PDF":
Keyword (percent of "pdf" searches)
filetype pdf 36.69%
doc to pdf 6.03%
pdf download 3.30%
pdf to swf 3.30%
pdf to xls 2.70%
free pdf 2.70%
pdf free 2.70%
pdf to word 2.21%
pdf to rtf 2.21%
php pdf 1.81%
Of these, it's reasonable to assume that some percentage of the "pdf free" and a smaller fraction of the "pdf download" searches are related to consumers trying to avoid paying for books. The other searches are clearly unrelated to books. We can further use the keyword suggestion tool to refine these estimates. My review of over 700 refined keywords indicates that at most 4% of PDF searches, or 132,000 per day, are looking for ebooks of any kind.
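The conversion from AdWords' monthly figure to the daily numbers used here can be sketched as follows. The names are illustrative, and the average month length of 365/12 days is an assumption made for the sketch; AdWords does not say how its monthly totals map to days.

```python
# Monthly figure and upper-bound share quoted in the post; names are illustrative.
PDF_SEARCHES_PER_MONTH = 101_000_000
EBOOK_SHARE_UPPER_BOUND = 0.04      # "at most 4%" from the keyword review
DAYS_PER_MONTH = 365 / 12           # assumption: average month length


def per_day(monthly_searches: float) -> float:
    """Convert a monthly search count to an average daily count."""
    return monthly_searches / DAYS_PER_MONTH


def ebook_related(daily_searches: float, share: float) -> float:
    """Upper bound on daily searches that could be looking for ebooks."""
    return daily_searches * share


pdf_per_day = per_day(PDF_SEARCHES_PER_MONTH)
ebook_upper = ebook_related(pdf_per_day, EBOOK_SHARE_UPPER_BOUND)

print(f"'PDF' searches per day: {pdf_per_day:,.0f}")
print(f"at most ebook-related:  {ebook_upper:,.0f}")
```

Depending on where the daily figure is rounded, the upper bound comes out between roughly 132,000 and 133,000 searches per day, which is the same ballpark as the number quoted above.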

Rapidshare is a "file-locker" site, so its name might be expected to appear in searches for illegally distributed files. A review of AdWords' suggested refinements for the term "rapidshare", however, reveals that searcher interest in ebooks is negligible compared to that for movies, TV, music and games. Of 743 suggested keywords, only one, accounting for 0.24% of "rapidshare" queries, or about 4,000 per day, is clearly related to ebooks:
Keyword (percent of "rapidshare" searches)
files rapidshare 13.45%
rapidshare download 6.03%
download rapidshare 6.03%
download from rapidshare 6.03%
rapidshare megaupload 4.93%
free rapidshare 3.29%
rapidshare free download 2.70%
free rapidshare downloader 2.70%
free rapidshare download 2.70%
rapidshare download free 2.70%
free download rapidshare 2.70%
rapidshare free downloader 2.70%
download rapidshare free 2.70%
free rapidshare downloads 2.70%
download free rapidshare 2.70%
rapidshare searcher 2.19%
rapidshare search 1.80%
search on rapidshare 1.80%
dvdrip rapidshare 1.21%
rapidshare file 1.21%
rapidshare windows 7 1.21%
rapidshare mp3 1.21%
rapidshare dvd 0.99%
windows 7 rapidshare 0.81%
movie rapidshare 0.54%
rapidshare movie 0.54%
rapidshare upload 0.54%
upload rapidshare 0.54%
rapidshare downloader 0.44%
rapidshare file download 0.44%
rapidshare music 0.44%
music rapidshare 0.44%
download rapidshare files 0.36%
movies rapidshare 0.36%
rapidshare files download 0.36%
rapidshare windows xp 0.36%
720p rapidshare 0.36%
rapidshare premium accounts 0.30%
rapidshare password 0.30%
xbox 360 rapidshare 0.30%
game rapidshare 0.30%
password rapidshare 0.30%
rapidshare game 0.30%
rapidshare premium account 0.24%
premium account rapidshare 0.24%
rapidshare account premium 0.24%
premium rapidshare account 0.24%
rapidshare generator 0.24%
rapidshare engine 0.24%
rapidshare engine search 0.24%
up rapidshare 0.24%
rapidshare software 0.24%
software rapidshare 0.24%
rapidshare ebook 0.24%
Harry Potter and the Twilight Saga make appearances farther down the list, but only the titles that exist as movies.

Although direct interest in ebook torrents is so small that AdWords can barely measure it (~1500 searches per day), torrent search sites can give us another way to estimate the magnitude of interest in pirated ebooks. According to "KickassTorrents", the torrents active recently had this composition:
movies 30.04%
music 27.62%
tv 16.22%
apps 13.76%
games 5.52%
anime 5.43%
ebooks 1.42%
About 1.4 million searches using the keyword "torrent" are made on Google daily, according to AdWords. If the distribution of searches mirrors the distribution of files, this would indicate that searches for ebook torrents comprise about 46,200 per day.

All in all, I estimate that about 210,000 searches made on Google per day represent possible interest in pirated ebooks. About 30,000 of these come from the US. The "real" number for all countries could be as high as 300,000 or as low as 100,000. The 1.5-3 million numbers reported by Attributor are not within the range of plausibility.

One difficulty with using Google AdWords to gain insight into piracy is that it measures only a "shadow cast by piracy", as expressed by a commenter on my previous post. Nonetheless, AdWords sheds considerable light on patterns of demand. For example, the tools show clearly that it's common for people to search for movies and TV shows and acquire them extralegally. Also, they indicate that most of the demand, about 82%, for pirated ebooks comes from outside of the US, UK and Canada. Publishers should plan antipiracy strategies accordingly, based on data that can be confirmed independently.

Update: I have a followup post.