Tuesday, November 23, 2010

On The Value of Things - at a Garage Sale

The proceeds
I spent this past Friday and Saturday pondering the value of all sorts of things- toys, games, books, furniture, household items, clothing. I had four colors of stickers. Green was one dollar, blue was two, and red was just fifty cents. Yellow dots we priced "as marked", which meant more than two dollars. Three dollars seems to be a lot to ask - at a garage sale.

We had a beautiful day on Saturday, and people came non-stop. The timing was great- people are starting to think about the holidays, and we had lots and lots of toys. The sale was a great success for us- about 80% of the stuff we put out disappeared, which is a good thing, because otherwise we'd have to figure out another way to get rid of it all. My colored dot assignments weren't about intrinsic value; they were more about how much we wanted to get rid of a thing.

Putting a price on things also meant that people had to value them. If 50¢ was too much for someone to pay for a kitchen knife, well, that person was unlikely to provide the knife a worthy home. Of course some people felt compelled to bargain, despite the dime-on-the-dollar pricing. So I bargained a bit, and they left happy. Others apologized for the low prices; they left happy, too.

The free items were my favorites. I had a bag of shoes in the garage, originally meant for discard. A gentleman asked if they were for sale, and I said they were free to our good customers. So he tried them on, and he was so happy that they fit. They were old shoes, but Rockports do last a long time.

Another fellow had selected a bunch of books, including several of my father's old particle physics books. Why my dad, an electronics engineer, had particle physics books is one story; why I shipped them from Indianapolis to store them in my basement is entirely another. But anyone that interested in particle physics deserved to get those books for free!

Dad said: "Just go to bed!"
Most of the books we were selling were ones that our kids had grown out of. They were a dollar each, half off if you bought more than 10. Chapter books were 25 cents each, though I couldn't bear to part with Mercer Mayer's "Just Go to Bed" at any price and took it off the sale shelves.

One eleven-year-old gleamed when she found out there were books for sale. Her mom had bought some furniture and was arranging to pick it up later. "You can look at the books when we come back," she said, herding the girl and her 8-year-old cousin to the car.

The Librarian from the Black Lagoon
It was dark and the sale was long over when they returned for the furniture. I had already packed the leftover books into my basement. After helping to load the furniture into the car, I told the mom that if the kids wanted to see the leftover books, I'd be happy to give them any books they wanted. So the four of us went down to my basement and looked at the books. "Oooh, I want!" said the girl at some age-appropriate books. "Ooh look, Shakespeare, Mommy! I want!" The mom and I looked at each other and smiled. The mom's smiles were understandable to any parent; mine were because it was the copy of "A Midsummer Night's Dream" that I had read in high school. The cousin did not go away empty-handed. I pushed some "Black Lagoon" books on him.

Reflecting on the joy I experienced in seeing kids excited to get some books, I got a better understanding of why so many librarians love what they do. Imagine if you could do the same thing for lots and lots of kids. It would be like taking that joy and multiplying it by thousands.

Maybe that's why I've been obsessed with "ungluing ebooks".

Personal Note: I believe that in life, when you discern a calling, you need to remove whatever obstacles there may be to answering that call. I hear this call to "unglue" ebooks quite personally and clearly. I also hear a call to have a "bounty market" ready to make it happen before Thanksgiving in 2011. You can consider that an announcement. For now, why not celebrate Thanksgiving by taking a book off your shelves and finding the person who is meant to read it next?

Thursday, November 18, 2010

Real Research Gets Reproduced

It's not often that I'm identified as a physicist, as I was by Richard Curtis in his commentary on my followups of Attributor's piracy "demand" report. But it's true, I worked in materials physics research at Bell Labs from 1988 to 1998.

Crystal structure of YBCO
Those were great years to be in materials physics. In 1986, two guys at IBM Zurich discovered some amazing new superconducting materials. By the end of that year, a team in Japan had reproduced their results; a group I was a part of at Stanford did the same after talking with the Japan group in December. By March, so many groups around the world had made exciting discoveries that the American Physical Society meeting in New York became known as "the Woodstock of Physics".

A blue semiconductor laser.
A few years later, a guy in Japan reported that he had made a semiconductor light emitting diode (LED) glow blue. His work was a lot harder to reproduce; it took years for anyone to come close to what his team reported; although he published many details, it was hard work. I even sawed one of his LEDs in half to try to understand how it worked. Today, my kitchen and the screen of my MacBook are lit by white LEDs made from that same semiconductor.

Around that time, some chemists in Utah announced a truly amazing discovery: they saw fusion reactions occurring in palladium electrochemical cells. Since they were respected electrochemists, their results were taken seriously, and lots of people tried to reproduce the incredible results. The promise of a seemingly magical, unlimited power source seemed almost too good to be true. This time, however, nobody could reproduce the results. Some scientists saw odd things happen, but they were different in every lab. At Bell Labs, the scientists trying to reproduce so-called "cold fusion" became convinced that the guys in Utah were being led astray by their excitement.

In science, it's usual that a surprising result will only be accepted once it has been reproduced by someone else. My scientific training has sometimes gotten me in trouble in the world of libraries and publishing. When presented with something that seems surprising to me, I ask for the evidence. In cultures that are more comfortable assigning and recognizing authority, my questions have sometimes been seen as irritants.

It's been that way with my questions about the Attributor report. I was surprised at some of the findings, and I tried to reproduce them. I couldn't reproduce some of the key findings reported by Attributor. It would be nice to better understand the factor of a hundred difference between my results and those of Attributor; much might be learned from such an analysis. Attributor is a company that sells anti-piracy services; one would hope that the data they report is somehow rooted in fact, even though they benefit from overestimates of piracy.

In Richard Curtis' article, Jim Pitkow, Attributor's CEO, is quoted:
Our study’s rigorous methodology ensured highly accurate results that align with actual consumer behavior. We analyzed 89 titles, using multiple keyword permutations per title, across different days of the week, with very high bids to ensure placement – each of which is fundamental in guaranteeing accuracy and legitimacy. Each of these variables impact the findings, and analyzing all variables together produce highly accurate results. We stand by our research, and we’re confident that the study addresses an accurate portrayal of the consumer demand for pirated e-books.
If Attributor really stands by its research, it will make it easier for people like me to reproduce their results. In particular, they should publish the complete list of the "869 effective keyword terms" used as keywords for their Google AdWords experiment. There are mistakes they might have made in permuting and combining search terms; they might also have thought of a class of effective search terms that my study totally overlooked. As it stands, it's impossible to know.

I can understand why Attributor might not want to release their search term list. First of all, they should expect people to try to tear it to shreds. The marketing department isn't going to like that. But that's what happened to the superconductor guys, the blue LED guy, and the cold fusion guys. They stood behind their work, and let the scientific community look for weaknesses and make their own judgments.

Cold fusion didn't pan out, and Pons and Fleischmann, the Utah guys, tried for years to figure out what it was they measured. Bednorz and Müller, the guys in Zurich, won the Nobel Prize. Shuji Nakamura, the LED guy, won a Millennium Prize and a lawsuit.

It may be easier to do a followup study without the worry of spurious searches for widely known terms. But at this point, Attributor customers and the book industry as a whole stand to learn a lot from understanding where the irreproducibility of Attributor's study is coming from. Publishers need that information to plan out a response to the threat of ebook piracy, and their needs should come first- no matter what the marketing department says.

Wednesday, November 10, 2010

Infochimps and the scaling of dataset value

Sure, a picture is worth a thousand words, but what is a thousand words worth? How about a million? If I had a dataset of the most recent trillion words spoken by humanity (anonymized and randomized of course!), would that be worth any more than the set of words in this blog post?

These are real questions. A Texas company called Infochimps has datasets quite similar to these, ready for you to use. Some of the datasets are free, others you have to pay for. More interesting is that if you have a dataset you think other people might be interested in, or even pay for, Infochimps will host it for you and help you find customers. (Infochimps just announced it had raised $1.2 million in its first round of institutional funding.)

One of the datasets you can get from Infochimps for free is the set of smileys used on Twitter in tweets sent between March 2006 and November 2009. It tells you that the smiley ":)" was used 13,458,831 times, while ";-}" was only used 1,822 times.

If you're willing to fork over $300, you can get a 160MB file containing a month-by-month summary of all the hashtags, URLs and smileys used on Twitter during the same period. That dataset will tell you that during September of 2009, the hashtag #kanyeisagayfish was used 11 times while #takekanyeinstead was used 141 times.

If you're a Scrabble player, you can spend $4 for a list of the 113,809 official words, with definitions. Or you can get them free, without the definitions.

courtesy of Infochimps, Inc. CC-BY-A
I had a great talk with Infochimps President and Co-Founder Flip Kromer a few weeks ago before his presentation to the New York Data Visualization Meetup. I fell in love with one of the visualizations he showed in his presentation, and he's given me permission to reproduce it here. (Creative Commons Attribution License) It's derived from the same Twitter data set you can get from Infochimps, and shows networks of characters that are found in the same tweet. So if ♠ and ♣ appear in the same tweet over and over again, the two characters will have a strong connection in the network of characters.

The character connection data was fed to a program called Cytoscape, which is an open source visualization program used in bioinformatics; Mike Bergman has a nice article about its use for large RDF graphs. The networks are laid out using a force-directed algorithm (which is pretty much the simplest thing you can do). Coloring is applied arbitrarily.
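To make the "character connection" idea concrete, here is a minimal sketch of how such a network could be tallied before it is handed to a layout tool. This is my own illustration, not Infochimps' actual pipeline: the function name `cooccurrence_edges`, the `min_weight` parameter and the three sample tweets are all invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(tweets, min_weight=1):
    """Count how often each pair of distinct characters shares a tweet.

    Returns a dict mapping (char_a, char_b) -> number of tweets that
    contain both characters. Whitespace is ignored.
    """
    weights = Counter()
    for tweet in tweets:
        # Each character counts once per tweet; sorting gives every pair
        # a single canonical ordering.
        chars = sorted({c for c in tweet if not c.isspace()})
        weights.update(combinations(chars, 2))
    return {pair: n for pair, n in weights.items() if n >= min_weight}

# Hypothetical sample data, not taken from the Infochimps dataset.
sample_tweets = ["♠ ♣", "♣ ♠ ♥", "♥ ツ"]
edges = cooccurrence_edges(sample_tweets)

# Print a tab-separated edge list: source, target, weight.
for (a, b), weight in sorted(edges.items(), key=lambda kv: -kv[1]):
    print(f"{a}\t{b}\t{weight}")
```

The strongest edge here is between ♠ and ♣, which share two tweets. A weighted edge list like this is the kind of table a graph tool can lay out; on real data you would probably raise `min_weight` to drop pairs that only co-occur by chance.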

As you might expect, the main character networks that show up are associated with languages, but there are some anomalies. For example, the katakana character ツ (tu) sticks out. Katakana is a set of phonetic characters used in Japanese for non-Japanese words. The reason "tu" is set apart from all the other katakana is that people use it on Twitter as a smiley.

The other anomalous character subnet is labeled "???" in the graph. A closer look reveals this to be the set of characters that look like upside down roman text.

Kromer has noticed that the price (or perhaps cost) of a partial data set follows a non-monotonic curve (see graphic). Small amounts of data are essentially free, but a peak value is reached when portions of the data set are extracted from the full data set. If we were discussing book metadata, for example, peak value might accrue for a set of the 100,000 top selling books.

There's much less value, according to Kromer, in having a large incomplete chunk of a data set. Data for 10,000,000 books, for example, would have less value than the 100,000 book data set, because it's not complete. Complete data sets become extremely expensive because of the logistics involved, and because of the value of having the complete set.

This pattern seems plausible to me, but I'd like to see some clearer examples. I've previously written about having too much data, but that article looked at the effect of error rates on data collection; Kromer's curve is about utility.

For me, the most interesting thing about Infochimps is the idea that the best way to make data flow in large volumes and create new types of knowledge is to provide the right incentives for data producers through the establishment of a market. This makes a lot of sense to me; however, I'm not sure that the Infochimps market has also established the incentives needed for data set maintenance; the world's most valuable and expensive data sets are ones that change rapidly.

Kromer contrasted the Infochimps approach to that of Wolfram, whose Alpha service is produced by "putting 100 PhDs and data in a lab". He also feels that much of the work being put into the semantic web is a "crock" because its technology stack solves problems that we don't have. Humans are pretty good at extracting meaning from data, given a good visualization.

We can even recognize upside-down text.

Saturday, November 6, 2010

Consumer Interest in Pirated eBooks is Even Lower Than I Thought

My recent posts following up on Attributor's most recent study on demand for pirated ebooks have been republished on TeleRead, probably the longest running blog covering ebooks and related topics. Paul Biba, the current editor, has been doing a great job bringing together interesting articles from many different perspectives.

The TeleRead discussion on my article "Attributor ebook piracy numbers don’t add up" had some interesting contributions. Jim Pitkow, CEO of Attributor, suggested that the difference between the data I got from Google AdWords and Attributor's data might be a result of a difference in methodology. While I used Google's predicted traffic numbers, Attributor used numbers from an actual AdWords campaign. For example, Attributor bought the keyword “lost symbol free ebook” along with 868 others and counted how many impressions were generated by Google.

AdWords predicts a total query volume of 62 per day for the keyword “lost symbol free ebook”. If each of the other keywords did the same volume of queries, Attributor should have seen 53,900 impressions per day for its ad campaign. It's not at all clear how 53,900 impressions turn into 1.5-3 million searches for pirated content; that would be a factor of 30-60 larger than the traffic predicted by the AdWords estimator tool. Pitkow's comment seemed to suggest that the result of an actual advertising campaign would be different from the estimate, possibly accounting for the discrepancy.
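For anyone who wants to check the arithmetic, here is a small sketch of the calculation. The numbers are the ones quoted above; the variable and function names are mine, and I'm assuming the 1.5-3 million figure is a daily count so that it can be compared with daily impressions.

```python
# Figures quoted in the post. The "every keyword gets the same volume"
# step is a deliberate simplification, not a measured fact.
KEYWORDS = 869                  # "lost symbol free ebook" plus 868 others
PREDICTED_PER_KEYWORD = 62      # AdWords predicted queries per day
ATTRIBUTOR_RANGE = (1_500_000, 3_000_000)  # assumed to be searches per day

def expected_impressions(keywords, per_keyword):
    """Daily impressions if every keyword matched the predicted volume."""
    return keywords * per_keyword

def discrepancy_factors(claimed_range, expected):
    """How many times larger each claimed figure is than the expectation."""
    return tuple(claimed / expected for claimed in claimed_range)

expected = expected_impressions(KEYWORDS, PREDICTED_PER_KEYWORD)
low, high = discrepancy_factors(ATTRIBUTOR_RANGE, expected)

print(f"expected impressions per day: {expected:,}")
print(f"discrepancy factor: {low:.0f}x to {high:.0f}x")
```

The exact product is 53,878, which rounds to the 53,900 used above, and the ratios come out near 28 and 56, which I've rounded to a factor of 30-60.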

So I did an actual advertising campaign of my own to see if this was true.

I've used AdWords in the past with amazingly cost-effective results. I would spend about five dollars a month at five cents per click to advertise a service that sold for thousands of dollars per year. My experience, albeit somewhat dated, was that the estimator tool overestimated, not underestimated, the actual traffic. I was interested to see if this was still true.

I constructed a "free ebook survey" landing page for my ad campaign, and added StatCounter analytics so I could see who clicked on my ad. I bought the keyword suggested by Pitkow: “lost symbol free ebook”.

AdWords gives you a number of settings to fine-tune your ad campaign. For example, I checked all the boxes for geographical coverage so my ads would be seen in as many countries and languages as possible.

AdWords offers three important options that determine the distribution of the ads. You can choose to advertise on Google only, on Google and its "search partners", or on Google, its search partners, and Google's "display network". I spoke with Pitkow last week, and he indicated that Attributor's study used both Google and its search partners.

The initial results of my ad campaign were alarming. In just four hours, I accumulated 7,000 impressions and 19 clicks; my campaign halted because my bids were too low. This traffic level would easily support Attributor's estimates. But when I looked into it, I found that I had included the "display network" without meaning to. What's more, the referring sites were really junky. I couldn't imagine who would use sites like "Lost World TV", NDParking and Sebaidu. The 61 cents I spent in those four hours may well have gone to a bunch of sites engaging in click fraud.

A full week of advertising on just Google, by contrast, resulted in a grand total of 15 impressions, far fewer than Google's estimate of 62 per day. I next added Google's search partners to my campaign, and got an impression rate about three times higher.

Even with the search partners, the reported search volume is about a tenth of the predicted volume. I must therefore revise the estimate I made about "consumer demand for pirated ebooks". Instead of 100,000-300,000 searches per day, 10,000-30,000 per day throughout the world seems to be a better estimate.
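Here is a rough sketch of how that revision works out. It uses only the figures reported above; the names are mine, and the factor of three for search partners is the approximate multiple I observed, so the result is an order-of-magnitude estimate at best.

```python
PREDICTED_PER_DAY = 62       # AdWords estimate for the keyword
GOOGLE_ONLY_WEEK = 15        # impressions in a full week on Google alone
PARTNER_MULTIPLIER = 3       # "about three times higher" with search partners
DAYS_IN_WEEK = 7
EARLIER_ESTIMATE = (100_000, 300_000)  # my previous searches-per-day range

# Observed daily impression rate once search partners are included.
observed_per_day = GOOGLE_ONLY_WEEK * PARTNER_MULTIPLIER / DAYS_IN_WEEK

# Fraction of the predicted volume that actually showed up.
ratio = observed_per_day / PREDICTED_PER_DAY

# Scale the earlier range by that fraction.
revised = tuple(estimate * ratio for estimate in EARLIER_ESTIMATE)

print(f"observed: {observed_per_day:.1f} per day ({ratio:.0%} of predicted)")
print(f"revised estimate: {revised[0]:,.0f} to {revised[1]:,.0f} per day")
```

The observed rate is a bit over 6 impressions per day, or roughly 10% of the prediction, which is where the rounded 10,000-30,000 range comes from.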

As a result of this experiment, the Attributor numbers are even more inexplicable than before. It's worth noting, however, the one area where there's no disagreement. Both my investigations and Attributor's show that consumer interest in piracy is mostly located outside the US, UK, and Canada. Jane Litte's recent post on the geographical restrictions conundrum for ebooks (and comments thereto) does an excellent job of describing why that may be so.