Tuesday, May 4, 2010

Authors are Not People: ORCID and the Challenges of Name Disambiguation

In 1976, Robert E. Casey, the Recorder of Deeds of Cambria County, Pennsylvania, let his bartender talk him into running for State Treasurer. He didn't take the campaign very seriously; in fact, he went on vacation instead. Nonetheless, he easily defeated the party-endorsed candidate in the Democratic Primary and went on to win the general election. It seems that voters thought they were voting for Robert P. Casey, a popular former State Auditor General and future Governor.

Robert P. Casey almost won the Pennsylvania Lieutenant Governor's race in 1978. No, not that Robert P. Casey; this Robert P. Casey was a former teacher and ice cream salesman. Robert P. Casey, Jr., the son of the "real" Robert P. Casey, was elected to the United States Senate in 2006. Name disambiguation turns out to be optional in politics.

That's not to say ambiguous names don't cause real problems. My name is not very common, but still I occasionally get messages meant for another Eric Hellman. A web search on a more common name like "Jim Clark" will return results covering at least eight different Jim Clarks. You can often disambiguate the Jim Clarks based on their jobs or place of residence, but this doesn't always work. Co-authors of scholarly articles with very similar or even identical names are not so uncommon; think of father-son or husband-wife research teams.

The silliest mistake I made in developing an e-journal production system, back when I didn't know it was hard, was to assume that authors were people. My system generated webpages from a database, and each author corresponded to a record in the database with the author's name, affiliations, and a unique key. Each article was linked to the author by unique key, and each article's title page was generated using the name from the author record. I also linked the author table to a database of cited references; authors could add their published papers to the database. Each author name was hyperlinked to a list of all the author's articles.

I was not the first to have this idea. In 1981, Kathryn M. Soukup and Silas E. Hammond of the Chemical Abstracts Service wrote:
If an author could be "registered" in some way, no matter how the author's name appeared in a paper, all papers by the author could automatically be collected in one place in the Author Indexes.

Here's what I did wrong: I supposed that each author should be able to specify how their name should be printed; I always wanted my name on scientific papers to be listed as "E. S. Hellman" so that I could easily look up my papers and citations in the Science Citation Index. I went a bit further, though. I reasoned that people (particularly women) sometimes changed their names, and if they did so, my e-journal publishing system would happily change all instances of their name to the new name. This was a big mistake. Once I realized that printed citations to old papers would break if I retroactively changed an author's name, I made the author name immutable for each article, even when the person corresponding to the author changed her name.
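
The corrected design can be sketched roughly as follows. This is a minimal illustration in Python, not the code of the original system; the class and function names (Person, Byline, Article, publish, articles_by) and the sample names are hypothetical, chosen only to show the idea of keeping the person and the printed name as separate things.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Person:
    """The human being behind one or more printed author names."""
    key: int                      # unique key, stable across name changes
    current_name: str             # may change over time
    affiliations: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Byline:
    """An author name exactly as printed on one article; never edited."""
    person_key: int               # link back to the Person record
    printed_name: str             # frozen at publication time


@dataclass
class Article:
    title: str
    bylines: tuple[Byline, ...]


def publish(title: str, people: list[Person]) -> Article:
    """Snapshot each person's current name into an immutable byline."""
    return Article(title, tuple(Byline(p.key, p.current_name) for p in people))


def articles_by(person: Person, articles: list[Article]) -> list[Article]:
    """Collect a person's articles by key, whatever name was printed."""
    return [a for a in articles
            if any(b.person_key == person.key for b in a.bylines)]


# Hypothetical example data
author = Person(key=1, current_name="A. B. Smith")
first = publish("First paper", [author])
author.current_name = "A. B. Jones"        # the person changes her name
second = publish("Second paper", [author])

print(first.bylines[0].printed_name)       # A. B. Smith (old citations still match)
print(second.bylines[0].printed_name)      # A. B. Jones
print([a.title for a in articles_by(author, [first, second])])
```

The point of the split is that the byline belongs to the article while the key belongs to the person, so the list of "all the author's articles" survives a name change without rewriting anything already in print.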

Fifteen years later, my dream of a cross-publication author identifier may be coming true. In December, a group of organizations led by Thomson Reuters (owners of the Web of Knowledge service that is the descendant of the Science Citation Index) and the Nature Publishing Group announced (pdf, 15kB) an effort to create unique identifiers for scientific authors. Named ORCID, for Open Researcher & Contributor ID, the organization will try to turn Thomson Reuters' Researcher ID system into an open, self-sustaining non-profit service for the scholarly publishing, research and education communities.

This may prove to be more challenging than it sounds, both technically and organizationally. First, the technical challenges. There are basically three ways to attack the author name disambiguation problem: algorithmically, manually, and socially.

The algorithmic attack, which has a long history, has been exploited on a large scale by Elsevier's SCOPUS service, so the participation of Elsevier in the ORCID project bodes well for its chances of success. Although this approach has gone a long way, algorithms have their limits. They tend to run out of gas when faced with sparse data; it's estimated that almost half of authors have their names appear only once on publications.

The manual approach to name disambiguation turns out not to be as simple as you might think. Thomson Reuters's ISI division has perhaps the longest experience with this problem, and the fact that they're leading the effort to open name disambiguation to their competitors suggests that they've not found any magic bullets. Neil R. Smalheiser and Vetle I. Torvik have published an excellent review of the entire field (Author Name Disambiguation, pdf 179K) which includes this assessment:
... manual disambiguation is a surprisingly hard and uncertain process, even on a small scale, and is entirely infeasible for common names. For example, in a recent study we chose 100 names of MEDLINE authors at random, and then a pair of articles was randomly chosen for each name; these pairs were disambiguated manually, using additional information as necessary and available (e.g., author or institutional homepages, the full-text of the articles, Community of Science profiles (http://www.cos.com), Google searches, etc.). Two different raters did the task separately. In over 1/3 of cases, it was not possible to be sure whether or not the two papers were written by the same individual. In a few cases, one rater said that the two papers were “definitely by different people” and the other said they were “definitely by the same person”!
(Can it be a coincidence that so much research in name disambiguation is authored by researchers with completely unambiguous names?)

The remaining approach to the author name problem is to involve the authoring community, which is the thrust of the ORCID project. Surely authors themselves know best how to disambiguate their names from others! There are difficulties with this approach, not the least of which is to convince a large majority of authors to participate in the system. That's why ORCID is being structured as a non-profit entity with participation from libraries, foundations and other organizations in addition to publishers.

In addition to the challenge of how to gain acceptance, there are innumerable niggling details that will have to be addressed. What privacy expectations will authors demand? How do you address publications by dead authors? How do you deal with fictitious names and pseudonyms? What effect will an author registry have on intellectual property rights? What control will authors have over their data? How do you prevent an author from claiming another's publications to improve their own publication record? How do you prevent phishing attacks? How should you deal with non-roman scripts and transliterations?

Perhaps the greatest unsolved problem for ORCID is its business model. If it is to be self-sustaining, it must have a source of revenue. The group charged with developing ORCID's business model is currently looking at memberships and grants as the most likely sources of funds, recognizing that the necessity for broad author participation precludes author fees as a revenue source. ORCID commercial participants hope to use ORCID data to pull costs out of their own processes, to fuel social networks for authors or to drive new or existing information services. Libraries and research foundations hope to use ORCID data to improve information access, faculty rankings and grant administration processes. All of these applications will require that restrictions on the use of ORCID data be minimal, limiting ORCID's ability to offer for-fee services. The business conundrum for ORCID is very similar to that faced by information producers who are considering publication of Linked Open Data.

ORCID will need to navigate between the conflicting interests of its participants. CrossRef, which I've written about frequently, has often been cited as a possible model for the ORCID organization. (CrossRef has folded its Contributor ID project into ORCID.) The initial tensions among CrossRef's founders, which resulted from the differing interests of large and small publishers, primary and secondary publishers, and commercial and nonprofit publishers, may seem comparatively trivial when libraries, publishers, foundations and government agencies all try to find common purpose in ORCID.

It's worth imagining what an ORCID and Linked Data enabled citation might look like in ten years. In my article on linking architecture, I used this citation as an example:
D. C. Tsui, H. L. Störmer and A. C. Gossard, Phys. Rev. Lett. 48, 1559 (1982).
Ten years from now, that citation should have three embedded ORCID identifiers (and will arrive in a tweet!). My Linked Data enabled web browser will immediately link the ORCID ids to Wikipedia identifiers for the three authors (as simulated by the links I've added). I'll be able to find all the articles they wrote together or separately, and I'll be able to search all the articles they've written. My browser will immediately see that I'm friends with two of them on Facebook, and will give me a list of articles they've "Liked" in the last month.

You may find that vision to be utopian or nightmarish, but it will happen, ORCID or not.

More ORCID and author ID, and name disambiguation links:
Photo of the "real" Robert P Casey taken by Michael Casey, 1986, licensed under the Creative Commons Attribution 2.5 Generic license.

5 comments:

  1. Are you aware of LibraryThing's Combiner's Groups? (http://www.librarything.com/groups/combiners). They might serve as a useful model for the social approach to disambiguation.

  2. Do you know if VIAF is participating in ORCID? They really really ought to be.

  3. Thanks for this Eric - very useful and timely. One of my favourite examples of authors (or names) not being people is "Devil May Care", which is described on the cover as being authored by "Sebastian Faulks, writing as Ian Fleming" (http://www.amazon.co.uk/gp/reader/0718153766/ref=sib_dp_pt#reader-link). I love this because it suggests that Ian Fleming himself was Ian Fleming, writing as Ian Fleming.

    I'm also fond of this conversation between Alice and the White Knight http://homepages.tcp.co.uk/~nicholson/alice.html

  4. Hi Eric

    The UK's Names Project (http://names.mimas.ac.uk - funded by the JISC and a partnership between the British Library and Mimas at The University of Manchester) is looking at this problem space, too. We've started with an algorithmic approach (using journal article data) but plan to allow human intervention to improve the machine-identified entities.

    There are connections between Names, ORCID, VIAF and ISNI: it's a very fluid and rapidly-evolving area to be working in. Thanks for this overview - useful and interesting!

  5. Thom Hickey and colleagues from OCLC involved in VIAF have been participating in ORCID meetings.

    I'm sure ORCID could learn from LibraryThing Combiners, but the motivations of the people who might do the combining have to be considered. Same, sort of, with JISC Names Project.

    Owen, surely Ian Fleming has contributed to more than one case of cataloger madness.
