Tuesday, November 24, 2009

Publish-Before-Print and the Flow of Citation Metadata

Managing print information resources is like managing a lake. You need to be careful about what flows into your lake and you have to keep it clean. Managing electronic information resources is more like managing a river: it flows through many channels, changing as it goes, and it dies if you try to dam it up.

I have frequently applied this analogy to libraries and the challenges they face as their services move online, but the same thing is true for journal publishing. A journal publisher's duties are no longer finished when the articles are bound into issues and put into the mail. Instead, publication initiates a complex set of information flows to intermediaries that help the information get to its ultimate consumer. Metadata is sent to indexing services, search engines, information aggregators, and identity services. Mistakes that occur in these channels will prevent customer access just as profoundly as the loss of a print issue, and are harder to detect, as well.

A large number of journals have made the transition from print distribution to dual (print+electronic) distribution; many of those journals are now considering the transition to online-only distribution. As they plan these transitions, publishers are making decisions that may impact the distribution chain. Will indexing services be able to handle the transition smoothly? Will impact factors be affected? Will customer libraries incur unforeseen management costs?

I was recently asked by the steering committee of one such journal to look into some of these issues, in particular to find out about the effects of the "publish-before-print" model on citations. I eagerly accepted the charge, as I've been involved with citation linking in one way or another for over 10 years and it gave me an opportunity to reconnect with a number of my colleagues in the academic publishing industry.

"Publish-before-print" is just one name given to the practice of publishing an article "version of record" online in advance of the compilation of an issue or a volume. This allows the journal to publish fewer, thicker issues, thus lowering print and postage costs, while at the same time improving speed-to-publication for individual articles. Publish-before-print articles don't acquire volume, issue and page metadata until the production of the print version.

Before I go on, I would like to recommend the NISO Recommended Practice document on Journal Article Versions (pdf, 221KB). It recommends "Version of Record" as the term to use instead of "published article", which is widely used in a number of circumstances:
  1. Version of Record (VoR) is also known as the definitive, authorized, formal, or published version, although these terms may not be synonymous.
  2. Many publishers today have adopted the practice of posting articles online prior to printing them and/or prior to compiling them in a particular issue. Some are evolving new ways to cite such articles. These “early release” articles are usually [Accepted Manuscripts], Proofs, or VoRs. The fact that an “early release” article may be used to establish precedence does not ipso facto make it a VoR. The assignment of a DOI does not ipso facto make it a VoR. It is a VoR if its content has been fixed by all formal publishing processes save those necessary to create a compiled issue and the publisher declares it to be formally published; it is a VoR even in the absence of traditional citation data added later when it is assembled within an issue and volume of a particular journal. As long as some permanent citation identifier(s) is provided, it is a publisher decision whether to declare the article formally published without issue assignment and pagination, but once so declared, the VoR label applies. Publishers should take extra care to correctly label their “early release” articles. The use of the term “posted” rather than “published” is recommended when the “early release” article is not yet a VoR.
"Version of Record before Print" is a bit of a mouthful, so I'll continue to use "publish-before-print" here to mean the same thing.

It's worth explaining "Assignment of a DOI" a bit further, since it's a bit complicated in the case of publish-before-print. Crossref-issued DOIs are the identifiers used for articles by a majority of scholarly journal publishers. To assign the DOI, a publisher has to submit a set of metadata for the article, along with the DOI that they want to register. The Crossref system validates the metadata and stores it in its database so that other publishers can discover the DOI for citation linking. In the case of publish-before-print, the submitted metadata will include the journal name, the names of the authors, the article's title, and the article's URL, but will be missing volume, issue and page numbers. After the article has been paginated and bound into an issue, the publisher must resubmit the metadata to Crossref, with the volume, issue and page numbers added and the same DOI.
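To make the two deposits concrete, here is a minimal sketch of the metadata at each stage. The `ArticleRecord` class, its field names, and the sample values are all hypothetical; this is not Crossref's deposit schema, only an illustration that the second deposit carries the same DOI with the print data filled in.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ArticleRecord:
    """Illustrative article metadata (hypothetical, not Crossref's schema)."""
    doi: str
    journal: str
    authors: Tuple[str, ...]
    title: str
    url: str
    volume: Optional[str] = None      # unknown at online-first publication
    issue: Optional[str] = None       # unknown at online-first publication
    first_page: Optional[str] = None  # unknown at online-first publication

    def is_paginated(self) -> bool:
        """True once volume, issue and first page have all been assigned."""
        return None not in (self.volume, self.issue, self.first_page)


def paginate(record: ArticleRecord, volume: str, issue: str,
             first_page: str) -> ArticleRecord:
    """Build the record for the second deposit: same DOI, print data added."""
    return replace(record, volume=volume, issue=issue, first_page=first_page)


# First deposit, made when the version of record goes online (made-up values).
online_first = ArticleRecord(
    doi="10.5555/example.2009.001",
    journal="Journal of Examples",
    authors=("A. Author", "B. Author"),
    title="An Example Article",
    url="http://journal.example.org/articles/001",
)

# Second deposit, made after the article is bound into an issue.
in_print = paginate(online_first, volume="12", issue="3", first_page="245")
```

The record is frozen so that the DOI cannot be changed by accident between the two deposits; `paginate` returns a new record rather than editing the old one.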

What happens if the online article is cited in an article in another journal during the time between the version of record going online and the full bibliographic data being assigned? This question is of particular importance to authors whose citation rates may factor into funding or tenure decisions. The answer depends on the processes being used to publish the citing article and produce the citation databases, so I had to make a few calls to get some answers.

As you might expect, journal production processes vary widely. Some journals, particularly in the field of clinical medicine, are very careful to check and double check the correctness of citations in their articles. For these journals, it's highly likely that the editorial process will capture updated metadata. Other publishers take a much more casual approach to citations, and publish whatever citation data the author provides. Most journals are somewhere in the middle.

Errors can creep into citations in many ways, including import of incorrect citations from another source, misspelling of author names, or simple miskeying. DOIs are particularly vulnerable to miskeying, due to their length and meaninglessness. One of my sources estimates that 20% of author-keyed DOIs in citations are incorrect! If you have the opportunity to decide on the form of a DOI, don't forget to consider the human factor.

It's hard to get estimates of the current error rate in citation metadata; when I was producing an electronic journal ten years ago, my experience was consonant with industry lore that said that 10% of author-supplied citations were incorrect in some way. My guess, based on a few conversations and a small number of experiments, is that a typical error rate in published citations is 1-3%. A number of processes are pushing this number down, most of them connected with citation linking in some way.

Reference management and sharing tools such as RefWorks, Zotero, and Mendeley now enable authors to acquire article metadata without keying it in, and to link citations even before they submit manuscripts for publication; this can't help but improve citation accuracy. Citation linking in the copy editing process also improves the accuracy of citation metadata. By matching citations to databases such as Crossref and PubMed, unlinked citations can be highlighted for special scrutiny by the author.
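The idea of highlighting unlinked citations can be sketched in a few lines. This example matches against a small in-memory list that stands in for a service such as Crossref or PubMed; the function names, the dictionary fields, and the choice of journal, volume and first page as the match key are assumptions made for illustration, and real services use more forgiving matching.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(text.lower().split())


def flag_unlinked(citations, known_records):
    """Split citations into (linked, unlinked) using an in-memory lookup.

    known_records stands in for an external database; each record needs
    'journal', 'volume', 'first_page' and 'doi' keys (hypothetical layout).
    """
    index = {
        (normalize(r["journal"]), r["volume"], r["first_page"]): r["doi"]
        for r in known_records
    }
    linked, unlinked = [], []
    for citation in citations:
        key = (normalize(citation["journal"]), citation["volume"],
               citation["first_page"])
        doi = index.get(key)
        if doi is not None:
            linked.append((citation, doi))
        else:
            unlinked.append(citation)  # highlight these for the author
    return linked, unlinked


# Made-up data: the second citation has two digits of the page transposed.
known = [{"journal": "Journal of Examples", "volume": "12",
          "first_page": "245", "doi": "10.5555/example.2009.001"}]
citations = [
    {"journal": "journal of  examples", "volume": "12", "first_page": "245"},
    {"journal": "Journal of Examples", "volume": "12", "first_page": "254"},
]
linked, unlinked = flag_unlinked(citations, known)
```

The first citation links despite its sloppy capitalization and spacing; the second, with a miskeyed page number, lands in `unlinked`, which is exactly the kind of citation a copy editor would send back to the author.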

Integration of citation linking into publishing workflow is becoming increasingly common. In publishing flows hosted by HighWire Press' Bench>Press manuscript submission and tracking system, Crossref and PubMed can be used at various stages to help copyeditors check and verify links. Similarly, ScholarOne Manuscripts, a manuscript management system owned by Thomson Reuters, integrates with Thomson Reuters' Web of Science and EndNote products. Inera's xStyles, software that focuses specifically on citation parsing and is integrated with Aries Systems' Editorial Manager, has recently added an automatic reference correction feature that not only checks linking, but also pulls metadata from Crossref and PubMed to update and correct citations. I also know of several publishers that have developed similar systems internally.

In most e-journal production flows, there is still a publication "event", at which time the content of the article, including citations, becomes fixed. The article can then flow to third parties that make the article discoverable. Of particular interest are citation databases such as Thomson Reuters' Web of Science (this used to be ISI Science Citation Index). The Web of Science folks concentrate on accurate indexing of citations; they've been doing this for almost 50 years.

Web of Science will index an article and its citations once it has acquired its permanent bibliographic data. The article's citations will then be matched to source items that have already been indexed. Typically there are cited items that don't get matched; these might be unpublished articles, in-press articles, and private communications. Increasingly, the dangling items include DOIs. In the case of a cited publish-before-print article, the citation will remain dangling in the database until the article has been included in an issue and indexed by Web of Science. At that point, if the DOI, journal name, and first author name all match, the dangling citation is joined to the indexed source item so that all citations of the article are grouped together.
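The joining step can be pictured with a short sketch. It only illustrates the three-way match described above (DOI, journal name, first author); the function names, the dictionary layout, and the normalization are my assumptions, not Web of Science's actual procedure.

```python
def match_key(item):
    """Build the comparison key: DOI, journal name, first author.

    Lowercasing and whitespace cleanup are assumed normalizations.
    """
    return (
        item["doi"].strip().lower(),
        " ".join(item["journal"].lower().split()),
        item["first_author"].strip().lower(),
    )


def join_dangling(dangling, source_item):
    """Return (joined, still_dangling) once source_item has been indexed."""
    target = match_key(source_item)
    joined = [c for c in dangling if match_key(c) == target]
    still_dangling = [c for c in dangling if match_key(c) != target]
    return joined, still_dangling


# Made-up example: two citations made while the article was online-only.
source = {"doi": "10.5555/example.2009.001",
          "journal": "Journal of Examples", "first_author": "Author"}
dangling = [
    {"doi": "10.5555/EXAMPLE.2009.001", "journal": "Journal of Examples",
     "first_author": "Author"},
    {"doi": "10.5555/example.2009.01", "journal": "Journal of Examples",
     "first_author": "Author"},  # DOI miskeyed by the citing author
]
joined, still_dangling = join_dangling(dangling, source)
```

The first citation joins the source item; the second stays dangling because of a single dropped digit in the DOI, which shows why DOI accuracy in the citing article matters so much for publish-before-print.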

Google's PageRank is becoming increasingly important for electronic journals, so it's important to help Google group together all the links to your content. The method supported by Google for grouping URLs is the rel="canonical" link element. By putting a DOI-based link into this element on the article web pages, publishers can help the electronic article rank as well as possible in Google and Google Scholar.
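As a rough illustration, a DOI-based canonical link for an article page could be generated like this. The function name is hypothetical and the DOI is made up; the resolver address is a parameter, with https://doi.org/ used as the default here.

```python
from html import escape
from urllib.parse import quote


def canonical_link(doi, resolver="https://doi.org/"):
    """Return a rel="canonical" link element pointing at the DOI's resolver URL."""
    # Percent-encode characters that are not safe in a URL path, keeping
    # the slash that separates the DOI prefix from its suffix.
    href = resolver + quote(doi, safe="/")
    return '<link rel="canonical" href="{}">'.format(escape(href, quote=True))


tag = canonical_link("10.5555/example.2009.001")
# tag == '<link rel="canonical" href="https://doi.org/10.5555/example.2009.001">'
```

The element belongs in the head of every page that presents the article (abstract, full text, and so on), so that all of them point to the same DOI-based URL.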

An increasingly popular alternative to publish-before-print is print-oblivious article numbering. Publishers following this practice do not assign issue numbers or page numbers, and instead assign article numbers when the version of record is first produced. Downstream bibliographic systems have not universally adjusted to this new practice; best practices for article numbers are described in an NFAIS Report on Publishing Journal Articles (pdf 221KB).

In summary, the flow of publish-before-print articles to end users can be facilitated by proper use of DOIs and Crossref.
  1. Prompt, accurate and complete metadata deposit at the initial online publication event and subsequent pagination is essential.
  2. DOIs should be constructed with the expectation that they will get transcribed by humans.
  3. Citation checking and correction should be built into the article copyediting and production process.
  4. Use of a DOI-based link in rel="canonical" link elements will help in search engine rankings.