Go To Hellman: May 2010

Sunday, May 30, 2010

BookExpo, Digital Book 2010, and eBook Messes

When I was 5 years old, I wanted to be either a doctor or a garbage man. When I got my Ph. D., I thought that I had checked off the first option; that was stretching a 5-year-old's conception of a doctor pretty far. Looking back on my career so far, however, I see that using machines to get rid of dirt is a much more consistent theme. In graduate school, I used high vacuum systems to remove impurities from semiconductors; at Openly Informatics, we built software systems to clean up electronic resource metadata.

This week at IDPF's DigitalBook2010 Conference and BookExpo America, I was reminded over and over again how messy the so-called "supply chain" for books had become, and how the transition to ebooks from print is just making everything that much messier. I had known about the ebook ISBN mess, the book metadata mess, the territorial rights mess and of course the orphan works mess, but the presentations at IDPF on the "agency model" staggered me with the realization that big publishers were dumping a huge load of local sales-tax excrement on their channel partners. The last straw for me was a BISG presentation on rights standards. The speaker was trying to convince the audience that huge piles of money were to be made if only content rights could be efficiently chopped into smaller, more insidious pieces.

I hit the floor of the expo hoping to find solace in some shiny clean gizmos. I found all sorts of reader devices that I hadn't seen before, along with the alluring iPads and the competent Sony readers that I'd seen before. I didn't see a single Kindle on the entire show floor.

Well, maybe I wasn't looking too hard. But it was hard not to get the impression that IDPF and BookExpo were a gathering of the anti-Amazon forces of Openness.

It's easy to swallow the story line that Amazon is building a closed, sterile system with its Kindle and that B&N, Sony, and all the others are unleashing a torrent of innovation with their open ePUB standards and promises of interoperability. This story line usually makes an analogy with the early days of the PC, in which Apple's proprietary Mac system was swamped by a wave of innovation fostered by the PC's open design and Microsoft software that worked with all the hardware. The irony of Apple using ePUB for their iBookstore on the iPad is dutifully noted and left unexamined.

Somehow, BEA failed to sell me on the open vs. closed story line for ebooks. I don't see how open standards are going to clean up the scrapheaps on which the current book industry is built and in which the ebook industry is stuck.

I've mentioned that I've been reading about the early days of Intel. The Windows-Intel platform was never an open one; it was designed to sell Intel chips and Microsoft software. Apple's strategy, in contrast, was designed to sell computers, and avoid all the mess of keeping the hardware compatible.

After a day to reflect on Book Expo (and some time to sleep off a very nice party!) I came to a different story line that I find more useful in giving insight into the future course of the ebook industry. I think the key to understanding the different entrants in the ebook race is to understand which messes they're trying to tidy.

Amazon, with its Kindle, has focused on maintaining a clean shopping experience and a clean reading environment. By eliminating the computer tether with wireless Whispernet, they avoid a hardware compatibility mess. By choosing a proprietary file format, they avoid a document compatibility mess. By launching only in the US and extending to other territories slowly, they avoid all the territorial mess. Since their online bookstore had already addressed all the messy details of e-commerce for a huge catalog, the execution of Kindle was in large part an exercise in avoiding having to deal with any new messes.

Overdrive had a surprisingly large presence at BEA- they had two separate booths. Working with hardware makers such as Sony, Overdrive has attacked the problems of messy distribution channels- libraries and bricks and mortar retailers, in particular. The work of Overdrive has allowed publishers to pretty much forget that libraries use ebooks- the only place that libraries were mentioned on the show floor in connection with ebooks was in the Sony booth- they work directly with Overdrive to surface the library channel to consumers through their Library Finder feature.

The company that made the biggest impression on me this week was Kobo, the ebook seller that spun out of Canada's Indigo Books. More than any of the current ebook players, Kobo is emphasizing an any-screen strategy. Unlike Amazon, Kobo is not afraid to tackle the mess of making a consumer's ebook work on all the devices they own. Kobo's $150 ebook reader device, which launches in the US on June 17, looks and feels like the device that Apple would have designed if Steve Jobs bothered to read books anymore. Perhaps most significantly, Kobo is tackling the ebook territorial rights mess. At IDPF, Michael Tamblyn, Kobo's Executive Vice President for Content, Sales and Merchandising, described Kobo's global reach. On one day last month, Kobo sold books to customers in 174 countries. Kobo does business in 6 different currencies and has agreements to sell books from 1600 publishers.

Apple's Disneyfied approach, as expressed in the iPad, is to sweep all sorts of application messes into app sandboxes. Apple has done very little, though, to clean up ebook messes, and their complicity in letting the big 6 agencies dump on the supply chain suggests that they want to be like Target and Walmart and just skim the blockbuster cream off the incumbent publishing ecosystem. I agree with Mike Cane that Apple will open the digital book floodgates by targeting the ebook creation tools mess.

There are still a lot of ebook messes looking for entrepreneurial sorts to show up with a broom, or perhaps a tsunami to just wash away the whole rickety shantytown. It should be an interesting couple of years.

I hope this piece has cleaned up the picture for you a bit!

Wednesday, May 26, 2010

Monstrous eBook Sales...in Japan

This Friday, May 28th, 2010, is the day that Apple starts selling the iPad in Japan.

It's easy for Americans to forget that there is still a big world beyond our borders. No, Europe didn't get swallowed up by a volcano in Iceland. And while the International Digital Publishing Forum (IDPF) just released some astounding sales figures for January and February ebook sales, the Japanese market helps provide some perspective. US trade wholesale ebook sales for the first two months of 2010 totaled $60.8 million, roughly matching sales for the entire first half of 2009. Extrapolated to a full year, this would be over $360 million, which doesn't even begin to account for any bump caused by the iPad, which launched after the reporting period.

Over in Japan, they're also expecting big things from the iPad (the iPhone has been a huge hit in Japan, capturing 72% of the Japanese Smartphone market.) But the Japanese ebook market is already huge. In 2009 ebook sales in Japan totaled $600 million, more than triple the US sales, and without any Kindles! These numbers are from this morning's presentation at IDPF's DigitalBook2010 conference by Daihei Shiohama, Head of Corporate Strategy at Voyager Japan Inc.

According to Shiohama, 80% of that $600 million is sales on mobile platforms, i.e. phones. The most avid consumers are women aged 20-30, and the most popular ebook content is manga. In Japan, everyone has a mobile phone capable of delivering rich content, and people spend a lot of time commuting on crowded trains. eBooks on phones are very practical.

I had a chance to speak to Shiohama and Voyager President Masaaki Hagino, who demoed Voyager's iPad ebook reader applications for me. The T-Time App and DotBook store make use of a format (dotBook) which is specialized for Japanese books, which not only use four different scripts (Kanji, Hiragana, Katakana and Romaji) but are often rendered top to bottom and right to left. Voyager has been active in the ePub standards process because the current ePub standard doesn't quite support everything these books need- yet. For example, a manga will have an irregular layout and flow of frames within pages that are arranged opposite to a typical English language book. On the iPad, the navigation of such a book using the T-Time App is simple and natural.

It's interesting that Voyager is so eager to have an international standard to work with, given that it owns and controls a format that already works. Shiohama explained that Japanese ebooks have been like the iguanas that evolved separately on the Galapagos islands. The market and technology for the ebooks produced by Voyager will be much larger and more efficient if it can use the same technology and standards used by the rest of the world.

I should note that the iguanas I saw last year in the Yucatan were smaller and probably lazier than Galapagos iguanas, but that's a lizard of a completely different color.

Friday, May 21, 2010

Bit.ly Preview Add-on Leaks User Activity; Referer Header Considered Harmful

I've been reading a book called "Inside Intel" by Tim Jackson that reports the history of the chip giant up to 1997. At the end of the book, Intel is dealing with the famous flaw in the Pentium's division circuitry. Jackson observes that Intel's big mistake in dealing with the bug was to deal with it as a minor technical issue rather than as the major marketing issue it really was. If Intel's management had promptly addressed consumer concerns by offering to replace chips for any customer that wanted it rather than dismissing the problem as the inconsequential bug it actually was, it could have avoided 90% of the expense it actually incurred. The public doesn't want to deal with arcane technology bugs; they want to know who to trust.

This week Facebook and MySpace had to deal with the consequences of obscure bugs that leaked personal subscriber information to advertisers. The Wall Street Journal reported that because Facebook and MySpace put user handles in URLs on their sites, these user handles, which can very often be traced back to a user identity, leaked to advertisers via the referer headers sent by browser software.

Reaction on one technology blog reminded me of Intel's missteps. Marshall Kirkpatrick, on ReadWriteWeb, called the Journal's article "a jaw dropping move of bizarreness", going on to explain that passing referrer information was "just how the Internet works" and accusing the Journal of "anti-technology fear-mongering".

When a web browser requests a file from a website, it sends a bunch of extra information via http headers. One header gives the address of the file, which might be a web page, an image, or a script file. Other headers give the name of the software being used, the language and character sets supported by the browser. The Referer header (yes, that's how it's spelled, blame the RFC for getting the spelling wrong) reports the address of the page that requested or linked to the file. If the request is made to an advertiser's site, the Referer URL identifies the page that the user is looking at. When that page has an address that includes private information, the private stuff can leak.
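To make the leak concrete, here is a minimal Python sketch of the headers involved. It assumes the browser sends the full page address in the Referer header, as described above; the host names, the profile path and the browser name are made up for illustration, and the function is a toy model rather than any browser's actual code.

```python
from urllib.parse import urldefrag, urlsplit


def request_headers(page_url, resource_url):
    """Model the headers sent when page_url causes resource_url to be fetched.

    Toy model: assumes the full page address goes into Referer
    (minus any #fragment, which is never sent).
    """
    return {
        "Host": urlsplit(resource_url).netloc,
        "User-Agent": "ExampleBrowser/1.0",  # hypothetical browser name
        "Accept-Language": "en-us",
        "Referer": urldefrag(page_url).url,
    }


def leaked_path(headers):
    """Return the part of the Referer that an advertiser could inspect."""
    return urlsplit(headers["Referer"]).path


if __name__ == "__main__":
    # Hypothetical profile page that embeds an ad from a third-party host.
    headers = request_headers(
        "http://social.example/profile/jane.doe#wall",
        "http://ads.example/banner.js",
    )
    for name, value in headers.items():
        print(f"{name}: {value}")
    print("Visible to the ad server:", leaked_path(headers))
```

The ad server never asked for the user handle; it arrives anyway because the handle is part of the page address.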

The controversy spurred me to take a look at some library websites to see what sort of data they might leak using referer headers. I used the very handy Firefox add-on called "Live HTTP Headers". I was astounded to see that a well known book database website seemed to be reporting the books I was browsing to Bit.ly, the URL shortening service! In another header, Bit.ly was also getting an identifying cookie. I went to another website, and found the exact same thing. This set off some alarm bells.

I soon realized that a report of EVERY web page I visit is being sent to Bit.ly. The culprit turned out to be Bit.ly's Bit.ly Preview add-on for Firefox. It turns out that for every web page I visit, this line of javascript is executed:

this.loadCss("https://s.bit.ly/preview.s3.v2.css?v=4.2");

This request for a CSS stylesheet has the side effect of causing Firefox to transmit to Bit.ly the address for each and every web page I visit in a referer header.

It's ironic. My last post described how URL shortening services can be abused for evil, but my point was that these abuses were a burden for the services, not that the services were abusive themselves. In fact, Bit.ly has probably done more than any shortening service to combat abuse and the Preview add-on is part of that anti-abuse effort. With Preview installed, users can safely check what's behind any of the short URLs they encounter by hovering over the link in question.

The privacy leak in bit.ly Preview is almost certainly an unintentional product of sloppy coding and deficient testing rather than an effort to spy on the 100,000 users who have installed the add-on. Nonetheless, it's a horrific privacy leak. There are other add-ons that intentionally leak private information, but typically they disclose their activity as a natural part of the add-on's functionality. One example would be GetGlue, which I've written about, and even Bit.ly preview cannot help but leak some info when it's doing what it's supposed to do (expand and preview shortened URLs).

I'm sure that Bit.ly will fix this bug quickly; their support was amazingly fast when I reported another issue. But a larger question remains. How do we make sure that the services we use every day aren't leaking our info all over the place? The most widely deployed services- Google, Amazon, Facebook, etc. all deserve a higher level of scrutiny because of the quantity of data at their fingertips. All the privacy policies in the world aren't worth a dime if web sites can't be held accountable for the effects of sloppy coding. It's high time for popular sites to submit to strict third-party privacy auditing, and for web users to demand it. It doesn't matter whether any advertisers actually used the personal information that Facebook sent them; what matters is whether users can trust Facebook.

It's also time for the internet technology community to recognize that referer headers are as dangerous to privacy as they are to spelling. They should be abolished. Browser software should stop sending them. The referer header was originally devised to help dispersed server admins fix and control broken links. Today, the referer header is used for "analytics", which is a polite word for "spying". The collection of referer headers helps web sites to "improve their service", but you could say the same of informants and totalitarian governments.

The pipe is rusty- that's why it leaks. We need to fix it.

Thursday, May 20, 2010

Ten Evil Uses for URL Shortening Services

If you're like me, you don't work for Google. Now that Google has been given the worldwide legal monopoly on Not Being Evil, the rest of us must make our livings otherwise. And with Facebook starting to corner the market on monitoring our social interactions, it's getting harder and harder to make a splash on the Dark Side. Don't let that stop you. There are lots of nifty tools to help you run your start-up evilenture. Today, we cover URL shortening services: Bit.ly, TinyURL, Ow.ly and friends.

Here are ten link shortening menaces for you to nibble on.

1. No doubt you have your favorite website with a cross-site scripting vulnerability. But it can be a real pain to deliver a good attack script, and if you load it from a web site, there's a chance something might get traced back to you. No worries! A link shortener can help you load a bushel of attack code into one small friendly package. When your mark clicks on the link, he's delivered to that well-trusted but slightly buggy e-commerce website. Swipe his session cookies, forge an access token and personal info. He'll never even notice.
2. Phishing attacks are starting to look so lame. By now, people know to be suspicious when the 1etters in a hostname become numer1c. With a link shortener you can easily hide the hostname or IP address; when asking for credit card info, it's SO important to be user friendly.
3. You're into SQL injection? Link shorteners help you keep that DROP TABLES command from needlessly worrying your involuntary partners with privileges.
4. Spam blocking getting you down? URL Shorteners can help you neutralize unsolicited email identification systems which use hostnames to identify possible spam. "Girlz.xxx" is a great site, but "bit.ly" is a name you can show your fiancée's parents!
5. Don't forget that once you get past the spam blocker, you still need to avoid the porn filter used by the school system or Panera Bread. Also, your corporate and government customers will appreciate the deniability offered by a shortened link.
6. You've sent out the email blasts, but how do you know whether your eager audience receives your processed meat food or clicks on the links? The analytics provided by URL shortening services are a great solution! Shortened links are free, so you can use a new one for every recipient.
7. Is your botnet being detected and your links being broken? Most shorteners won't help you because they won't let you change your link after you've created it, but take a look at PURL. If one of your machines gets taken out, you can edit the PURL to keep your link working, and shorten it for good measure.
8. Ever wonder why there are so many URL shortening services? Chain a bunch of them together for fun, loopy amusement, and to confuse bit.ly! And add a Rickroll, while you're at it!
9. Want to slander Islam, you blasphemer? Or gossip about your boss, you slacker? Avoid those annoying fatwahs and performance improvement plans by using a shortener service that is blocked in Saudi Arabia or in your office.
10. Want to hog the credit for links to other people's content? Ow.ly can help you there.

BONUS! You know how the Evil guys torturing James Bond and Austin Powers are always based in a tiny island country or desert oasis? There's no better way to help those guys than to use the .LY (Libya), .CC (Cocos Islands), .GD (Grenada), .TO (Tonga) and .IM (Isle of Man) top level domains for as many links as possible.

But seriously...

Although Bit.ly and other URL shortening services tout their automated spam and malware detection and protection, they don't really explain why a URL shortening service needs spam and malware protection, or why this is a good reason for people to use their service. It's a bit like Domino's Pizza's big advertising campaign that explained how their pizza didn't taste awful anymore. You may have known that Domino's was not the tastiest of pizzas, but perhaps you didn't realize that shortened links might also be greasy and indigestible. Now you do.

In my post on shortDOI, I made a passing comment about Bit.ly's spam countermeasures that seemed to imply that the Digital Object Identifier (DOI) redirection service was somehow lacking in spam prevention. That was a mistake and a result of absent context.

As I've described here, there are lots of ways to abuse a link redirection service. If a service is frequently abused, its links may get blocked, its ISP may start to get complaints and threaten to shut it off, and its reputation will suffer. So link redirection services of all kinds need to have robust and scaleable ways to prevent abuse.

DOI uses a very different mechanism to avoid malware and spam. They are selective about who may create and administer DOI links. This is great if you're someone who clicks on DOI links, but bad if you haven't been approved by DOI's vetting procedures. You probably can't even figure out if DOI would approve you or not. PURL, which has a similar objective of improving link persistence, takes a similar strategy but has a lower entry barrier.

The contrast between Bit.ly and DOI makes clear that the biggest benefit of Bit.ly's spam and malware mechanisms is not that they make bit.ly links safer than DOI links, it's that they allow you to use their service, even when they don't trust you.

It's still pizza, even if the sauce is better.

Thursday, May 13, 2010

A Long Handle on Shortened Digital Object Identifiers

Google has launched a social networking site called Orkut. Already the site has over 100 million members worldwide! What's that? You haven't heard of it? No it's not new, it's actually more than 6 years old. The 100 million users- they're mostly in Brazil and India.

You might be asking yourself, "What is Google doing running a social networking site for Brazil and India?", but a better question might be "What do you do in life when you take a home run swing and manage a nub single?"

The technology behind the Digital Object Identifier has proven to be one of these infield hits. It's called the "Handle" system, and it was developed fifteen years ago by computer scientists who feared that the DNS system that maps host names to IP addresses on the internet would fail to scale for the very large numbers of digital objects needed in future digital library applications. Bob Kahn, who with Vint Cerf designed the Transmission Control Protocol (TCP) underlying today's internet, was the driving force behind the Handle system. It implements a highly scaleable distributed naming infrastructure for digital objects. Kahn's organization, CNRI, then developed the Digital Object Identifier to serve as a core application of the Handle System.

The Digital Object Identifier, or DOI, was designed and promoted as a solution to two problems, broken URLs and rights management. The fact that fifteen years later broken URLs are still a problem and digital rights are a quagmire suggests that the DOI has had limited impact in these areas. Did something go wrong?

Fifteen years ago, Google did not exist. Netscape was just a year old. The developers of the Handle system hoped that Handle resolution would get built into web browser software alongside DNS. That never really happened, perhaps because Netscape and Microsoft had development roadmaps for web servers and browsers that diverged from CNRI's vision. To allow Handles to work in unmodified browser software, CNRI was forced to implement a proxy service that connected the system of DNS resolution to the system of handle resolution via a web server. This proxy, at http://dx.doi.org/, serves almost all of the world's DOIs. A link pointed at the DOI proxy gets redirected to a web address designated by the owner of the digital object or perhaps to a library with DOI-enabled software. This redirection capability gives publishers the flexibility to move digital objects from one address to the other without breaking links, even if the object is traded to another publisher using a completely different website.

Things have changed in 15 years. The World Wide Web turned out to be not so interested in digital objects with complex management systems and rights regimes. Instead, the World Wide Web turned out to be an unthinkably large number of web pages with negligible rights management indexed by search engines. Persistence of links turned out to be less important than the findability of content in search engines.

Since search engines never bothered to learn about the Handle system, the DOI proxy turned out to be much more important than the Handle resolution system that hides behind it. Details that were inconsequential 15 years ago have become important. One of these details is the DOI proxy's HTTP status code. This code tells a requestor the meaning of the redirect URL. As I've previously written, the DOI proxy returns a 302 status code. The Google index interprets this to mean that it should assign indexing weight to the URL beginning with "http://dx.doi.org/", and not the publisher-specified URL. 302 is the correct code for the DOI proxy, because if the publisher-specified URL changes, the publisher doesn't want to lose all the "link juice" it has earned by being linked to from other sites.
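The effect of the status code can be sketched in a few lines of Python. This is a deliberately simplified model of what an indexer does, not a description of Google's actual behavior: it assumes that a 301 (permanent) redirect moves the credit to the target address, while a 302 (temporary) redirect leaves the credit with the address that was linked. The publisher address below is hypothetical.

```python
def credited_url(link_url, status, location):
    """Return the address that earns the indexing weight for a redirected link.

    Simplified rule (an assumption for illustration):
      301 Moved Permanently -> credit the redirect target
      302 Found (temporary) -> credit the address that was linked
    """
    if status == 301:
        return location
    if status == 302:
        return link_url
    raise ValueError(f"status {status} is not covered by this sketch")


if __name__ == "__main__":
    doi_link = "http://dx.doi.org/10.1103/PhysRevLett.48.1559"
    publisher_page = "http://publisher.example/abstract/1559"  # hypothetical

    # The DOI proxy answers with 302, so the dx.doi.org address keeps the credit
    # even if the publisher later moves the article to a different address.
    print(credited_url(doi_link, 302, publisher_page))
```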

The DOI has not been generally adopted by the web at large, with an important exception, CrossRef. CrossRef added some useful machinery onto the DOI and turned it into an absolutely essential service for publishers of ejournals and other types of content that need to embed persistent links. CrossRef accounts for 96% of all registered DOIs and DOI resolutions (about 60 million per month).

60 million resolutions per month might seem like a lot of traffic, but it's not so big on the scale of today's web. Google delivers that many searches in 3 hours. A more direct comparison would be bit.ly, the URL shortening service, which reported 3.4 BILLION resolutions in March, or 4.6 million resolutions per hour.

The staggering popularity of URL shortening services such as bit.ly prompted CrossRef's Geoffrey Bilder to propose last year a similar service for DOI links. Evidently, the folks at the International DOI Foundation agreed that this was a good idea, because last week, they launched the "shortDOI" service.

ShortDOI is meant to address a shortcoming of DOIs- their length and ugliness. When DOI started, no one could have imagined that URLs would appear prominently on boxes of children's cereal, as they do today. It was assumed that they would be hidden in links and be used exclusively by machines. The original spec for the DOI string even allowed DOIs to include spaces and non-printing Unicode characters! Digital object owners were free to choose ID strings that were long and full of punctuation, even punctuation that was incompatible with web pages. ShortDOI uses a small number of alphanumeric characters to do away with all the DOI ugliness. It also does away with the publisher prefix, which hasn't been much use anyway. So instead of 10.1103/PhysRevLett.48.1559 or 10.1002/(SICI)1097-0258(19980815/30)17:15/16<1661::AID-SIM968>3.0.CO;2-2 shortDOI lets us use URLs like http://doi.org/aa9 and http://doi.org/aabbe.
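To see why the long form is awkward, here is a small Python sketch that turns a DOI into a link through the dx.doi.org proxy. The helper name is made up, and the choice to percent-encode everything except letters, digits, a few safe marks and "/" is an assumption made for illustration, not a statement of what the DOI Foundation recommends.

```python
from urllib.parse import quote, unquote

DOI_PROXY = "http://dx.doi.org/"


def doi_to_url(doi):
    """Build a proxy link for a DOI, escaping punctuation that is unsafe in links.

    Hypothetical helper: keeps "/" readable and percent-encodes the rest.
    """
    return DOI_PROXY + quote(doi, safe="/")


if __name__ == "__main__":
    tidy = "10.1103/PhysRevLett.48.1559"
    ugly = "10.1002/(SICI)1097-0258(19980815/30)17:15/16<1661::AID-SIM968>3.0.CO;2-2"

    print(doi_to_url(tidy))  # nothing needs escaping in this one
    print(doi_to_url(ugly))  # angle brackets, colons and parentheses all get escaped

    # The escaping is reversible, so the original DOI can be recovered.
    recovered = unquote(doi_to_url(ugly)[len(DOI_PROXY):])
    print(recovered == ugly)
```

The second link comes out noticeably longer than the DOI it encodes, which is the ugliness that shortDOI is trying to hide.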

ShortDOI can't quite decide whether it's a DOI or a shortener. Like DOI, it's implemented using the Handle system. Instead of redirecting through the DOI proxy, shortDOI acts as an alternate proxy, and uses the same 302 redirects that the long DOI proxy uses. From the perspective of search engines, a shortDOI is a new object to be ranked separately from the DOI. The link juice earned by a short DOI won't accrue to the DOI it has shortened.

Like a shortener, shortDOI assigns codes sequentially, making it easy for robots to harvest content identified by shortDOIs. ShortDOI allows anyone to create the shortened URL, but provides none of the tracking, statistics, spam protection and malware rejection offered by other shortener services. Library OpenURL servers don't yet work with shortDOI, even though the shortDOI proxy will try to send shortDOI handles to them.

The implementation choices made for shortDOI aren't mistakes- they make perfect sense in the context of the Handle naming architecture. Nonetheless, the difficulties they present for libraries and search engine optimization highlight the Handle system's misfit with the dominant link resolution architecture of the web.

The Handle system has been very well designed and managed. I've written software that uses the Handle software libraries and I found them to be elegant and easy to work with. The principles and algorithms built into the Handle system are very similar to those used years later inside Google's internal file system or by any number of other large digital object management systems.

The Handle System is relatively inexpensive, but its costs are now higher than those of the large scale URL shorteners. According to public tax returns, the DOI Foundation pays CNRI about $500,000 per year to run the DOI resolution system. At roughly 750 million resolutions per year, that works out to about 67 cents per thousand resolutions. Compare this to Bit.ly, which has attracted $3.5 million of investment and has resolved about 20 billion shortened links- for a cost of about 18 cents per thousand. It remains to be seen whether bit.ly will find a sustainable business model; competing directly with DOI is not an impossibility.
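The arithmetic behind this comparison is easy to check. The sketch below uses only figures quoted in this post; scaling CrossRef's 60 million monthly resolutions by its 96% share to estimate the total for all DOIs is an assumption about how the estimate should be made. Note that the two numbers are not measured the same way: one is a yearly operating payment divided by yearly traffic, the other is total investment divided by total traffic to date.

```python
CROSSREF_RESOLUTIONS_PER_MONTH = 60e6  # from the CrossRef figure quoted earlier
CROSSREF_SHARE = 0.96                  # CrossRef's share of all DOI resolutions
DOI_ANNUAL_PAYMENT = 500_000           # dollars per year paid to CNRI
BITLY_INVESTMENT = 3_500_000           # dollars of investment to date
BITLY_RESOLUTIONS = 20e9               # shortened links resolved to date


def dollars_per_thousand(cost, resolutions):
    """Cost in dollars for every thousand resolutions."""
    return cost / resolutions * 1000


# Assumption: total DOI traffic is CrossRef traffic scaled up by its share.
doi_resolutions_per_year = CROSSREF_RESOLUTIONS_PER_MONTH / CROSSREF_SHARE * 12

doi_cost = dollars_per_thousand(DOI_ANNUAL_PAYMENT, doi_resolutions_per_year)
bitly_cost = dollars_per_thousand(BITLY_INVESTMENT, BITLY_RESOLUTIONS)

if __name__ == "__main__":
    print(f"DOI resolutions per year: {doi_resolutions_per_year:,.0f}")
    print(f"DOI:    ${doi_cost:.3f} per thousand resolutions")
    print(f"Bit.ly: ${bitly_cost:.3f} per thousand resolutions")
    print(f"Ratio:  {doi_cost / bitly_cost:.1f}x")
```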

What do you do with infrastructure that has been successful in Brazil or scholarly publishing but not elsewhere? Do you keep it alive in hopes that after twenty years, some unforeseen circumstance will result in its overnight universal adoption? Do you scale back, phase out or sell out in favor of more cost effective alternatives? Or do you just do your best to continue serving loyal users? I don't know the answer, but I do know that in baseball and cricket you've got to run the bases to score.

Update: Corrected CrossRef share of DOI resolutions.

Tuesday, May 4, 2010

Authors are Not People: ORCID and the Challenges of Name Disambiguation

In 1976, Robert E. Casey, the Recorder of Deeds of Cambria County, Pennsylvania, let his bartender talk him into running for State Treasurer. He didn't take the campaign very seriously; in fact, he went on vacation instead. Nonetheless, he easily defeated the party-endorsed candidate in the Democratic Primary and went on to win the general election. It seems that voters thought they were voting for Robert P. Casey, a popular former State Auditor General and future Governor.

Robert P. Casey almost won the Pennsylvania Lieutenant Governor's race in 1978. No, not that Robert P. Casey, this Robert P. Casey was a former teacher and ice cream salesman. Robert P. Casey, Jr., the son of the "real" Robert P. Casey, was elected to the United States Senate in 2006. Name disambiguation turns out to be optional in politics.

That's not to say ambiguous names don't cause real problems. My name is not very common, but still I occasionally get messages meant for another Eric Hellman. A web search on a more common name like "Jim Clark" will return results covering at least eight different Jim Clarks. You can often disambiguate the Jim Clarks based on their jobs or place of residence, but this doesn't always work. Co-authors of scholarly articles with very similar or even identical names are not so uncommon- think of father-son or husband-wife research teams.

The silliest mistake I made in developing an e-journal production system back when I didn't know it was hard was to incorrectly assume that authors were people. My system generated webpages from a database, and each author corresponded to a record in the database with the author's name, affiliations, and a unique key. Each article was linked to the author by unique key, and each article's title page was generated using the name from the author record. I also linked the author table to a database of cited references; authors could add their published papers to the database. Each author name was hyperlinked to a list of all the author's articles.

I was not the first to have this idea. In 1981, Kathryn M. Soukup and Silas E. Hammond of the Chemical Abstracts Service wrote:

If an author could be "registered" in some way, no matter how the author's name appeared in a paper, all papers by the author could automatically be collected in one place in the Author Indexes.

Here's what I did wrong: I supposed that each author should be able to specify how their name should be printed; I always wanted my name on scientific papers to be listed as "E. S. Hellman" so that I could easily look up my papers and citations in the Science Citation Index. I went a bit further, though. I reasoned that people (particularly women) sometimes changed their names, and if they did so, my ejournal publishing system would happily change all instances of their name to the new name. This was a big mistake. Once I realized that printed citations to old papers would break if I retroactively changed an author's name, I made author name immutable for each article, even when the person corresponding to the author changed her name.
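The fix can be sketched as a small data model in Python. This is not the original system's schema; the class and field names are invented for illustration. The point is only that the name printed on an article is stored with the article and frozen, while the person's record, found by its unique key, is free to change.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    """A person: the record that can change over time."""
    key: int
    current_name: str
    affiliations: list = field(default_factory=list)


@dataclass(frozen=True)
class Byline:
    """An author as printed on one article: fixed at publication."""
    author_key: int
    printed_name: str


@dataclass
class Article:
    title: str
    bylines: tuple


def articles_by(author_key, articles):
    """All articles whose bylines point at the given person."""
    return [a for a in articles
            if any(b.author_key == author_key for b in a.bylines)]


if __name__ == "__main__":
    person = Author(key=1, current_name="A. B. Smith")  # hypothetical author
    paper = Article("An Old Paper", (Byline(1, "A. B. Smith"),))

    person.current_name = "A. B. Jones"  # the person changes her name

    # The old paper is still found by key, and its printed name is untouched,
    # so printed citations to it keep working.
    for article in articles_by(person.key, [paper]):
        print(article.title, "by", article.bylines[0].printed_name)
```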

Fifteen years later, my dream of a cross-publication author identifier may be coming true. In December, a group of organizations led by Thomson Reuters (owners of the Web of Knowledge service that is the descendant of the Science Citation Index) and the Nature Publishing Group announced (pdf, 15kB) the creation of an effort to create unique identifiers for scientific authors. Named ORCID, for Open Researcher & Contributor ID, the organization will try to turn Thomson Reuters' Researcher ID system into an open, self-sustaining non-profit service for the scholarly publishing, research and education communities.

This may prove to be more challenging than it sounds, both technically and organizationally. First, the technical challenges. There are basically three ways to attack the author name disambiguation problem: algorithmically, manually, and socially.

The algorithmic attack, which has a long history, has been exploited on a large scale by Elsevier's SCOPUS service, so the participation of Elsevier in the ORCID project bodes well for its chances of success. Although this approach has gone a long way, algorithms have their limits. They tend to run out of gas when faced with sparse data; it's estimated that almost half of authors have their names appear only once on publications.
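To make the sparse-data problem concrete, here is a toy sketch in Python. It is not how SCOPUS or any production system works; the heuristic (counting shared coauthors and affiliations), the function names, and the sample records are all invented for illustration.

```python
def shared_evidence(rec_a: dict, rec_b: dict) -> int:
    """Count coauthors and affiliations that two same-name records share."""
    coauthors = set(rec_a["coauthors"]) & set(rec_b["coauthors"])
    affiliations = set(rec_a["affiliations"]) & set(rec_b["affiliations"])
    return len(coauthors) + len(affiliations)


def compare(rec_a: dict, rec_b: dict, threshold: int = 1) -> str:
    """Guess whether two records bearing the same name are the same person."""
    if shared_evidence(rec_a, rec_b) >= threshold:
        return "likely same"
    # Without overlapping features there is nothing to decide on.
    return "undecided"


# Invented records, all bearing the same printed name.
paper_1 = {"name": "J. Smith", "coauthors": ["A. Lee", "B. Kumar"],
           "affiliations": ["Example University"]}
paper_2 = {"name": "J. Smith", "coauthors": ["B. Kumar"], "affiliations": []}
paper_3 = {"name": "J. Smith", "coauthors": [], "affiliations": []}

result_12 = compare(paper_1, paper_2)
result_13 = compare(paper_1, paper_3)
```

The first pair shares a coauthor and can be matched; the single-author paper with no affiliation offers nothing to compare, which is the situation an author with only one publication tends to create.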

The manual approach to name disambiguation turns out not to be as simple as you might think. Thomson Reuters's ISI division has perhaps the longest experience with this problem, and the fact that they're leading the effort to open name disambiguation to their competitors suggests that they've not found any magic bullets. Neil R. Smalheiser and Vetle I. Torvik have published an excellent review of the entire field (Author Name Disambiguation, pdf 179K) which includes this assessment:

... manual disambiguation is a surprisingly hard and uncertain process, even on a small scale, and is entirely infeasible for common names. For example, in a recent study we chose 100 names of MEDLINE authors at random, and then a pair of articles was randomly chosen for each name; these pairs were disambiguated manually, using additional information as necessary and available (e.g., author or institutional homepages, the full-text of the articles, Community of Science profiles (http://www.cos.com), Google searches, etc.). Two different raters did the task separately. In over 1/3 of cases, it was not possible to be sure whether or not the two papers were written by the same individual. In a few cases, one rater said that the two papers were “definitely by different people” and the other said they were “definitely by the same person”!

(Can it be a coincidence that so much research in name disambiguation is authored by researchers with completely unambiguous names?)

The remaining approach to the author name problem is to involve the authoring community, which is the thrust of the ORCID project. Surely authors themselves know best how to disambiguate their names from others! There are difficulties with this approach, not the least of which is convincing a large majority of authors to participate in the system. That's why ORCID is being structured as a non-profit entity with participation from libraries, foundations and other organizations in addition to publishers.

In addition to the challenge of how to gain acceptance, there are innumerable niggling details that will have to be addressed. What privacy expectations will authors demand? How do you address publications by dead authors? How do you deal with fictitious names and pseudonyms? What effect will an author registry have on intellectual property rights? What control will authors have over their data? How do you prevent an author from claiming another's publications to improve their own publication record? How do you prevent phishing attacks? How should you deal with non-roman scripts and transliterations?

Perhaps the greatest unsolved problem for ORCID is its business model. If it is to be self-sustaining, it must have a source of revenue. The group charged with developing ORCID's business model is currently looking at memberships and grants as the most likely sources of funds, recognizing that the necessity for broad author participation precludes author fees as a revenue source. ORCID commercial participants hope to use ORCID data to pull costs out of their own processes, to fuel social networks for authors or to drive new or existing information services. Libraries and research foundations hope to use ORCID data to improve information access, faculty rankings and grant administration processes. All of these applications will require that restrictions on the use of ORCID data be minimal, limiting ORCID's ability to offer for-fee services. The business conundrum for ORCID is very similar to that faced by information producers who are considering publication of Linked Open Data.

ORCID will need to navigate between the conflicting interests of its participants. CrossRef, which I've written about frequently, has often been cited as a possible model for the ORCID organization. (CrossRef has folded its Contributor ID project into ORCID.) The initial tensions among CrossRef's founders, which resulted from the differing interests of large and small publishers, primary and secondary publishers, and commercial and nonprofit publishers, may seem comparatively trivial when libraries, publishers, foundations and government agencies all try to find common purpose in ORCID.

It's worth imagining what an ORCID and Linked Data enabled citation might look like in ten years. In my article on linking architecture, I used this citation as an example:

D. C. Tsui, H. L. Störmer and A. C. Gossard, Phys. Rev. Lett. 48, 1559 (1982).

Ten years from now, that citation should have three embedded ORCID identifiers (and will arrive in a tweet!). My Linked Data enabled web browser will immediately link the ORCID ids to Wikipedia identifiers for the three authors (as simulated by the links I've added). I'll be able to find all the articles they wrote together or separately, and I'll be able to search all the articles they've written. My browser will immediately see that I'm friends with two of them on Facebook, and will give me a list of articles they've "Liked" in the last month.

You may find that vision to be utopian or nightmarish, but it will happen, ORCID or not.

More ORCID and author ID, and name disambiguation links:

ORCID website
Nature Editorial on ORCID: Credit where credit is due (pdf, 553KB)
Thomson Reuters Researcher ID (technical basis for ORCID prototype)
Interview with CrossRef's Geoffrey Bilder on Contributor ID
SCOPUS Author ID
AuthorClaim Service. Also, my AuthorClaim Profile
VIAF (The Virtual International Authority File)
International Standard Name Identifier Draft ISO 27729
Improving the performance of personal name disambiguation using web directories Quang Minh Vu, Atsuhiro Takasu, Jun Adachi, Information Processing and Management 44 (2008) 1546–1561. doi:10.1016/j.ipm.2007.11.001 (available at pdf, 355KB)
Author Identifiers in Scholarly Repositories, Simeon Warner. (pdf, 680K)
Author name processing at Chemical Abstracts Service: name matching by using nonunique bibliographic identifiers, Kathryn M. Soukup, Silas E. Hammond, J. Chem. Inf. Comput. Sci., 1982, 22 (3), pp 172–176. DOI: 10.1021/ci00035a008

Photo of the "real" Robert P Casey taken by Michael Casey, 1986, licensed under the Creative Commons Attribution 2.5 Generic license.

Go To Hellman

Sunday, May 30, 2010

BookExpo, Digital Book 2010, and eBook Messes

Wednesday, May 26, 2010

Monstrous eBook Sales...in Japan

Friday, May 21, 2010

Bit.ly Preview Add-on Leaks User Activity; Referer Header Considered Harmful

Thursday, May 20, 2010

Ten Evil Uses for URL Shortening Services

Thursday, May 13, 2010

A Long Handle on Shortened Digital Object Identifiers

Tuesday, May 4, 2010

Authors are Not People: ORCID and the Challenges of Name Disambiguation

