Friday, July 8, 2011

Library Data: Why Bother?

When face recognition came out in iPhoto, I was amused when it found faces in shrubbery and asked me whether they were friends of mine. iPhoto, you have such a sense of humor!

But then iPhoto looked at this picture of a wall of stone faces in Baoding, China. It highlighted one of the faces and asked me "Is this Jane?" I was taken aback, because the stone depicted Jane's father. iPhoto was not as stupid as I thought it was; it could even see family resemblances.

Facial recognition software is getting better and better, which is one reason people are so worried about the privacy implications of Facebook's autotagging of pictures. Imagine what computers will be able to do with photos in 10 years! They'll be able to recognize pictures of bananas, boats, beetles and books. I'm thinking it's probably not worth it to fill in a lot of iPhoto metadata.

I wish I had thought about facial recognition when I was preparing my talk for the American Library Association Conference in New Orleans. I wanted my talk to motivate applications for Linked Open Data in libraries, and in thinking about why libraries should be charting a path towards Linked Data, I realized that I needed to examine first of all the motivation for libraries to be in the bibliographic data business in the first place.

Originally, libraries invested in bibliographic data to help people find things. Libraries are big and have a lot of books. It's impractical for library users to find books solely by walking the stacks, unless the object of the search has been anticipated by the ordering of books on the shelves. The paper cards in the card catalog could be easily duplicated to enable many types of search in one compact location. The cards served as surrogates for the physical books.

When library catalogs became digital, much more powerful searches could be done. The books acquired digital surrogates that could be searched with incredible speed. These surrogates could be used for a lot of things, including various library management tasks, but finding things was still the biggest motivation for the catalog data.

We're now in the midst of a transition where books are turning into digital things, but cataloging data hasn't changed a whole lot. Libraries still need their digital surrogates because most publishers don't trust them with the full text of books. But without full text, libraries are unable to provide the full-featured discovery that a search engine with access to both the full text and metadata (Google, Overdrive, etc.) can provide.

At the same time, digital content files are being packed with more and more metadata from the source. Photographs now contain metadata about where, when and how they were taken; for a dramatic example of how this data might be used, take a look at this study from the online dating site OKCupid. Book publishers are paying increased attention to title-level metadata, and metadata is being built into new standards such as EPUB3. To some extent, this metadata is competing for the world's attention with library-sourced metadata.
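To make the embedded-metadata point concrete: a photo's EXIF block typically stores GPS coordinates as degree/minute/second rational pairs, which software has to convert before it can place the photo on a map. A minimal sketch of that conversion follows; the tuple format mirrors what tools such as Pillow report from EXIF, and the sample coordinates are invented for illustration.

```python
# Sketch: converting EXIF-style GPS coordinates (degrees, minutes, seconds
# stored as (numerator, denominator) rational pairs) into decimal degrees.
# The coordinate values below are hypothetical, not from a real photo.

def dms_to_decimal(dms, ref):
    """Convert ((num, den), (num, den), (num, den)) to a signed float."""
    degrees, minutes, seconds = (num / den for num, den in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are recorded as negative values
    return -decimal if ref in ("S", "W") else decimal

# Hypothetical EXIF GPS value: 40° 44' 54" N
lat = dms_to_decimal(((40, 1), (44, 1), (54, 1)), "N")
print(round(lat, 4))  # 40.7483
```

Once coordinates like these are decimal, they can be matched against other geotagged photos, which is exactly the kind of aggregate analysis the OKCupid study illustrates.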

Libraries have two paths to deal with this situation. One alternative is to insist on getting the full text for everything they offer. (Unglued ebooks offer that; that's what we're working on at Gluejar.)

The other alternative for libraries is to feed their bibliographic data to search engines so that library users can discover books in libraries. Outside libraries, this process is known as "Search Engine Optimization". When I said during my talk that this should be the number one purpose of library data looking forward, one tweeter said it was "bumming her out". If the term "Search Engine Optimization" doesn't work for you, just think of it as "helping people find things".
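One concrete way to "help people find things" is to publish catalog records in a vocabulary that search engines consume. As a hedged sketch only: the schema.org Book type is one plausible vocabulary for this, and the title, author, and ISBN below are invented placeholders, not real catalog data.

```python
import json

# Sketch: serializing a minimal bibliographic record as schema.org JSON-LD,
# the kind of structured markup a search engine can crawl from a library
# catalog page. All field values here are hypothetical placeholders.
record = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "An Example Title",
    "author": {"@type": "Person", "name": "Jane Example"},
    "isbn": "0000000000",
    "inLanguage": "en",
}

jsonld = json.dumps(record, indent=2)
print(jsonld)
```

Embedded in a catalog page, a record like this is "Search Engine Optimization" in the plain sense the talk intended: it tells crawlers what the page describes so people can find it.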

Library produced data is still important, but it's not essential in the way that it used to be. The most incisive question during my talk pointed out that the sort of cataloging that libraries do is still absolutely essential for things like photographs and other digital archival material. That's very true, but only because automated analysis of photographs and other materials is computationally hard. In ten years, that might not be true. iPhoto might even be enough.

In the big picture, very little will change: libraries will need to be in the data business to help people find things. In the close-up view, everything is changing: the materials and players are different, the machines are different, and the technologies can do things that were hard to imagine even 20 years ago.

In a following post, I'll describe ways that libraries can start publishing linked data, feeding search engines, and helping people find stuff. The slides from my talk (minus some copyrighted photos) are available as PDF (4.8MB) and PPTX (3.5MB).


  1. Do you think iPhoto would recognize that the second photo shows the main level of Princeton's Firestone Library? Do you think photo recognition software could ever say when that photo was taken?

  2. I should have added an assumption: the absence of embedded photo metadata.

  3. Well, iPhoto has access to the GPS info embedded in the photo file, so I doubt it would have much trouble. Why would you deny it the embedded data? That would be silly.

  4. I am not saying I would deny it embedded metadata. That's not the thrust of my question. I can be silly but not that silly! I don't look at photo metadata too often, so I am not sure what gets recorded as a matter of course. It occurs to me, however, that sometimes data, for whatever reason, can go missing or get corrupted. Restated, do you think recognition software will get good enough, assuming the embedded metadata is just not available, to scan the photo and pick up on enough clues to report "A ha! Picture shows Lobby of Firestone Library, past the guards, near the circ desk (taking into account the angle of the sunlight entering the room in the back) at 4:50 p.m. on 10/16/2010." Mind you, I am assuming the guards and circ desk are still where they were when I worked there. I have not seen the place in a number of years.

  5. I think recognition software may already be good enough to sift through a hundred billion photos and find a match to other photos, tagged by other users, taken at the same spot. But more to the point, given the timestamp, it will know why I was there and who I talked with that day. You could discover the same things, but the software is already 1000 times better at that.
