Robert P. Casey almost won the Pennsylvania Lieutenant Governor's race in 1978. No, not that Robert P. Casey; this Robert P. Casey was a former teacher and ice cream salesman. Robert P. Casey, Jr., the son of the "real" Robert P. Casey, was elected to the United States Senate in 2006. Name disambiguation turns out to be optional in politics.
That's not to say ambiguous names don't cause real problems. My name is not very common, but I still occasionally get messages meant for another Eric Hellman. A web search on a more common name like "Jim Clark" will return results covering at least eight different Jim Clarks. You can often disambiguate the Jim Clarks based on their jobs or places of residence, but this doesn't always work. Co-authors of scholarly articles with very similar or even identical names are not so uncommon; think of father-son or husband-wife research teams.
I was not the first to have this idea. In 1981, Kathryn M. Soukup and Silas E. Hammond of the Chemical Abstracts Service wrote:
If an author could be "registered" in some way, no matter how the author's name appeared in a paper, all papers by the author could automatically be collected in one place in the Author Indexes.
Science Citation Index. I went a bit further, though. I reasoned that people (particularly women) sometimes changed their names, and if they did so, my ejournal publishing system would happily change all instances of their name to the new name. This was a big mistake. Once I realized that printed citations to old papers would break if I retroactively changed an author's name, I made the author name immutable for each article, even when the person corresponding to the author changed her name.
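The lesson can be put as a data model: the name printed on an article is a fixed fact about that article, while the person behind it is a separate record that can change. Here is a minimal sketch of that split. The class names, the `author_id` values and the sample names are all hypothetical and are not taken from my old publishing system or from any real registry.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Byline:
    """Author name exactly as printed on one article; never edited afterwards."""
    author_id: str      # hypothetical registry identifier
    printed_name: str


@dataclass(frozen=True)
class Article:
    title: str
    bylines: tuple[Byline, ...]


@dataclass
class Person:
    author_id: str
    current_name: str
    former_names: list[str] = field(default_factory=list)

    def change_name(self, new_name: str) -> None:
        # Only the person's record changes; published bylines stay as printed.
        self.former_names.append(self.current_name)
        self.current_name = new_name


def articles_by(author_id: str, articles: list[Article]) -> list[Article]:
    """Collect a person's articles by identifier, whatever name was printed."""
    return [a for a in articles if any(b.author_id == author_id for b in a.bylines)]


# Made-up example: one person publishes under two names.
person = Person("id-001", "A. Smith")
first = Article("First paper", (Byline("id-001", "A. Smith"),))
person.change_name("A. Jones")
second = Article("Second paper", (Byline("id-001", "A. Jones"),))
```

Looking up `articles_by("id-001", [first, second])` finds both papers, and the first one still carries the name that printed citations point to.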
Fifteen years later, my dream of a cross-publication author identifier may be coming true. In December, a group of organizations led by Thomson Reuters (owners of the Web of Knowledge service that is the descendant of the Science Citation Index) and the Nature Publishing Group announced (pdf, 15kB) an effort to create unique identifiers for scientific authors. Named ORCID, for Open Researcher & Contributor ID, the organization will try to turn Thomson Reuters' Researcher ID system into an open, self-sustaining non-profit service for the scholarly publishing, research and education communities.
This may prove to be more challenging than it sounds, both technically and organizationally. First, the technical challenges. There are basically three ways to attack the author name disambiguation problem: algorithmically, manually, and socially.
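To make the algorithmic option concrete, here is a toy sketch of one possible heuristic: treat two records with the same name as the same person if they share an affiliation or a co-author. This is an assumption made for illustration, not the method of any real system, and the record fields and sample values are made up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    affiliation: str
    coauthors: frozenset[str]


def likely_same_person(a: Record, b: Record) -> bool:
    """Crude guess: same name plus a shared affiliation or co-author."""
    if a.name != b.name:
        return False
    return a.affiliation == b.affiliation or bool(a.coauthors & b.coauthors)


def group_records(records: list[Record]) -> list[list[Record]]:
    """Merge each record into every existing group it appears to match."""
    groups: list[list[Record]] = []
    for rec in records:
        merged = [rec]
        remaining = []
        for group in groups:
            if any(likely_same_person(rec, other) for other in group):
                merged.extend(group)
            else:
                remaining.append(group)
        groups = remaining + [merged]
    return groups


# Made-up records for a common name.
records = [
    Record("Jim Clark", "Org A", frozenset({"X. Coauthor"})),
    Record("Jim Clark", "Org B", frozenset({"X. Coauthor"})),
    Record("Jim Clark", "Org C", frozenset({"Y. Coauthor"})),
]
groups = group_records(records)
```

A rule like this fails in both directions: two different Jim Clarks at the same institution get merged, and one Jim Clark who changes jobs and collaborators gets split in two.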
The manual approach to name disambiguation turns out not to be as simple as you might think. Thomson Reuters's ISI division has perhaps the longest experience with this problem, and the fact that they're leading the effort to open name disambiguation to their competitors suggests that they've not found any magic bullets. Neil R. Smalheiser and Vetle I. Torvik have published an excellent review of the entire field (Author Name Disambiguation, pdf 179K) which includes this assessment:
... manual disambiguation is a surprisingly hard and uncertain process, even on a small scale, and is entirely infeasible for common names. For example, in a recent study we chose 100 names of MEDLINE authors at random, and then a pair of articles was randomly chosen for each name; these pairs were disambiguated manually, using additional information as necessary and available (e.g., author or institutional homepages, the full-text of the articles, Community of Science profiles (http://www.cos.com), Google searches, etc.). Two different raters did the task separately. In over 1/3 of cases, it was not possible to be sure whether or not the two papers were written by the same individual. In a few cases, one rater said that the two papers were “definitely by different people” and the other said they were “definitely by the same person”!
(Can it be a coincidence that so much research in name disambiguation is authored by researchers with completely unambiguous names?)
The remaining approach to the author name problem is to involve the authoring community, which is the thrust of the ORCID project. Surely authors themselves know best how to disambiguate their names from others! There are difficulties with this approach, not the least of which is to convince a large majority of authors to participate in the system. That's why ORCID is being structured as a non-profit entity with participation from libraries, foundations and other organizations in addition to publishers.
In addition to the challenge of how to gain acceptance, there are innumerable niggling details that will have to be addressed. What privacy expectations will authors demand? How do you address publications by dead authors? How do you deal with fictitious names and pseudonyms? What effect will an author registry have on intellectual property rights? What control will authors have over their data? How do you prevent an author from claiming another's publications to improve their own publication record? How do you prevent phishing attacks? How should you deal with non-roman scripts and transliterations?
Perhaps the greatest unsolved problem for ORCID is its business model. If it is to be self-sustaining, it must have a source of revenue. The group charged with developing ORCID's business model is currently looking at memberships and grants as the most likely sources of funds, recognizing that the necessity for broad author participation precludes author fees as a revenue source. ORCID's commercial participants hope to use ORCID data to pull costs out of their own processes, to fuel social networks for authors or to drive new or existing information services. Libraries and research foundations hope to use ORCID data to improve information access, faculty rankings and grant administration processes. All of these applications will require that restrictions on the use of ORCID data be minimal, limiting ORCID's ability to offer for-fee services. The business conundrum for ORCID is very similar to that faced by information producers who are considering publication of Linked Open Data.
ORCID will need to navigate between the conflicting interests of its participants. CrossRef, which I've written about frequently, has often been cited as a possible model for the ORCID organization. (CrossRef has folded its Contributor ID project into ORCID.) The initial tensions among CrossRef's founders, which resulted from the differing interests of large and small publishers, primary and secondary publishers, and commercial and nonprofit publishers, may seem comparatively trivial when libraries, publishers, foundations and government agencies all try to find common purpose in ORCID.
It's worth imagining what an ORCID and Linked Data enabled citation might look like in ten years. In my article on linking architecture, I used this citation as an example:
D. C. Tsui, H. L. Störmer and A. C. Gossard, Phys. Rev. Lett. 48, 1559 (1982).
Ten years from now, that citation should have three embedded ORCID identifiers (and will arrive in a tweet!). My Linked Data enabled web browser will immediately link the ORCID ids to Wikipedia identifiers for the three authors (as simulated by the links I've added). I'll be able to find all the articles they wrote together or separately, and I'll be able to search all the articles they've written. My browser would immediately see that I'm friends with two of them on Facebook, and will give me a list of articles they've "Liked" in the last month.
You may find that vision to be utopian or nightmarish, but it will happen, ORCID or not.
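For the curious, the "together or separately" lookups in that vision are easy to sketch once every citation carries author identifiers. The identifiers below are placeholders I made up, not real ORCID ids, and the two extra entries are dummies that exist only to give the lookups something to find.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    text: str
    author_ids: frozenset[str]  # placeholder identifiers, not real ORCID ids


def by_any(ids, citations):
    """Citations written by at least one of the given authors."""
    wanted = frozenset(ids)
    return [c for c in citations if c.author_ids & wanted]


def by_all(ids, citations):
    """Citations that the given authors wrote together."""
    wanted = frozenset(ids)
    return [c for c in citations if wanted <= c.author_ids]


citations = [
    Citation(
        "D. C. Tsui, H. L. Störmer and A. C. Gossard, Phys. Rev. Lett. 48, 1559 (1982)",
        frozenset({"id:tsui", "id:stormer", "id:gossard"}),
    ),
    # Dummy entries for illustration only; these are not real publications.
    Citation("placeholder entry A", frozenset({"id:tsui", "id:stormer"})),
    Citation("placeholder entry B", frozenset({"id:gossard"})),
]

together = by_all(["id:tsui", "id:stormer"], citations)
separately = by_any(["id:gossard"], citations)
```

The hard part is not the lookup. It is getting trustworthy identifiers attached to the citations in the first place.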
More ORCID, author ID, and name disambiguation links:
- ORCID website
- Nature Editorial on ORCID: Credit where credit is due (pdf, 553KB)
- Thomson Reuters Researcher ID (technical basis for ORCID prototype)
- Interview with CrossRef's Geoffrey Bilder on Contributor ID
- SCOPUS Author ID
- AuthorClaim Service. Also, my AuthorClaim Profile
- VIAF (The Virtual International Authority File)
- International Standard Name Identifier Draft ISO 27729
- Improving the performance of personal name disambiguation using web directories Quang Minh Vu, Atsuhiro Takasu, Jun Adachi, Information Processing and Management 44 (2008) 1546–1561. doi:10.1016/j.ipm.2007.11.001 (available at pdf, 355KB)
- Author Identifiers in Scholarly Repositories, Simeon Warner. (pdf, 680K)
- Author name processing at Chemical Abstracts Service: name matching by using nonunique bibliographic identifiers, Kathryn M. Soukup, Silas E. Hammond, J. Chem. Inf. Comput. Sci., 1982, 22 (3), pp 172–176. DOI: 10.1021/ci00035a008