Monday, September 15, 2014

Analysis of Privacy Leakage on a Library Catalog Webpage

My post last month about privacy on library websites, and the surrounding discussion on the Code4Lib list, prompted me to do a focused investigation, which I presented at last week's Code4Lib-NYC meeting.

I looked at a single web page from the NYPL online catalog. I used Chrome developer tools to trace all the requests my browser made in the process of building that page. The catalog page in question is for The Communist Manifesto. It's here: http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto .

You can imagine how reading this work might have been of interest to government investigators during the early fifties when Sen. Joe McCarthy was at the peak of his power. Note that, following good search-engine-optimization practice, the URL embeds the title of the resource being looked at.

I chose the NYPL catalog as my example, not because it's better or worse than any other library catalog with respect to privacy, but because it's exemplary. The people building it are awesome, and the results are top-notch. I happen to know the organization is working on making privacy improvements. Please don't take my investigation to be a criticism of NYPL. But it was Code4Lib-NYC, after all.

As an example of how far ahead of the curve the NYPL catalog is, note that the webpage offers links to free downloads at Project Gutenberg. The Communist Manifesto is in the public domain, so any library catalog that tells you that no ebook is available is lying. The majority of library catalogs today lie about this.

So here are the results.

In building the Communist Manifesto catalog page, my browser contacts 11 different hosts from 8 different companies.
  • nypl.secure.bibliocommons.com
  • cdn.bibliocommons.com
  • api.bookish.com
  • contentcafe2.btol.com
  • www.google-analytics.com
  • www.googletagmanager.com
  • cdn.foxycart.com
  • idreambooks.com
  • ws.sharethis.com
  • wd-edge.sharethis.com
  • b.scorecardresearch.com
Each of these hosts is informed of the address of the web page that generates the request. They are told, essentially, "this user is looking at our Communist Manifesto page". Some of the hosts need this information to deliver the services they contribute. Others get the same information via the "referer" header generated as part of the HTTP protocol. If the catalog were served with the more secure protocol "HTTPS", the referer header would not be sent to hosts contacted over plain HTTP.
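To make the host count and the referer behavior concrete, here is a minimal Python sketch. It is an illustration, not the tooling used for this analysis. The host list is copied from above; registrable_domain is a naive helper of my own that keeps the last two labels of a hostname (an assumption that happens to work for these .com hosts); and referer_for models the long-standing browser default of sending the full page URL unless an HTTPS page is requesting a plain-HTTP resource. Newer browsers default to a stricter policy, which this sketch ignores.

```python
from urllib.parse import urlsplit

PAGE_URL = "http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto"

HOSTS = [
    "nypl.secure.bibliocommons.com",
    "cdn.bibliocommons.com",
    "api.bookish.com",
    "contentcafe2.btol.com",
    "www.google-analytics.com",
    "www.googletagmanager.com",
    "cdn.foxycart.com",
    "idreambooks.com",
    "ws.sharethis.com",
    "wd-edge.sharethis.com",
    "b.scorecardresearch.com",
]


def registrable_domain(host):
    """Naive helper: keep the last two labels (ignores suffixes like .co.uk)."""
    return ".".join(host.split(".")[-2:])


def referer_for(page_url, resource_url):
    """Referer value under the older default policy, or None if withheld."""
    page = urlsplit(page_url)
    resource = urlsplit(resource_url)
    if page.scheme == "https" and resource.scheme == "http":
        return None  # secure page, insecure request: the header is omitted
    # The fragment part of a URL is never included in the Referer header.
    return page._replace(fragment="").geturl()


domains = sorted({registrable_domain(h) for h in HOSTS})
page_domain = registrable_domain(urlsplit(PAGE_URL).hostname)
third_party = [d for d in domains if d != page_domain]

print(len(HOSTS), "hosts,", len(domains), "registrable domains")
for host in HOSTS:
    # Assumes each host is contacted over plain HTTP, as the page itself is.
    print(host, "<-", referer_for(PAGE_URL, "http://" + host + "/"))
```

Grouping by domain yields nine names rather than eight, because google-analytics.com and googletagmanager.com both belong to Google.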

The first of these is Bibliocommons. I've written about Bibliocommons before. They host the NYPL catalog "in the cloud". I'm not particularly concerned about Bibliocommons with respect to privacy, because they contract directly with NYPL, and I'm pretty sure that contracts are in place that bind Bibliocommons to the privacy policies in place at NYPL. But since HTTP is used rather than HTTPS, every host between me and the Bibliocommons server can see and capture the URL of the web page I'm looking at. At the moment, I'm using the wifi in a Paris cafe, so the hosts that can see that are in the proxad.net, aas6453.net, level3.net, firehost.com and other domains. I don't know what they do with my browsing history.

I've previously written about the NYPL's use of the Bookish recommendation engine. The BTOL.com link is for Baker & Taylor's "Content Cafe" service, which provides book covers for library catalogs. I'm guessing (but don't know for sure) that these offerings have privacy policies that are aware of the privacy expectations of library users.

Yes, Google is one of the companies that NYPL tells about my web browsing. I'm pretty sure that Google knows who I am. A careful look at the Google Analytics privacy policy suggests that they can't share my browsing history outside Google. Unless required to by law.

Foxycart is not a company I was familiar with. They provide the shopping cart technology that lets me buy a book from the NYPL website and benefit them with part of the proceeds. I've been in favor of enabling such commerce on library sites because libraries need to do it to participate fully in the modern reading ecosystem. But it's still controversial in the library world.

Foxycart's privacy policy, like all privacy policies ever written, takes your privacy very seriously. Some excerpts:
When you visit this website, some information, such as the site that referred you to us, your IP and email address, and navigational and purchase information, may be collected automatically as part of the site’s operation. This information is used to generate user profiles and to personalize the web site to your particular interests. 
The information collected online is stored indefinitely and is used for various purposes. 
Cookies offer you many conveniences. They allow FoxyCart.com LLC, and certain third party content providers, to recognize information, and so can determine what content is best suited to your needs.  
We also reserve the right to disclose your personal information if required to do so by law, or in the good faith belief that such action is reasonably necessary to comply with legal process, respond to claims, or protect the rights, property or safety of our company, employees, customers or the public.

Here I need to explain about cookies. When a website gives you a cookie, the company behind it acquires the ability to track you across all the websites that company serves. This can be a great convenience for you. When you fill out a credit card form with your name and address, Foxycart can remember it for you so you don't have to type it in again when you come back to order something else. You might find that creepy if the last order you placed was on a porn site. But while NYPL hasn't told FoxyCart anything that could identify you personally, your interaction with FoxyCart is such that you may well choose to identify yourself. And all that information is stored forever. And FoxyCart can pass that information to all the Sen. Joe McCarthys of 2020. As well as certain 3rd party content providers. FoxyCart probably doesn't give away your information today, but will they even be around in 2020?
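The mechanics can be illustrated with a toy model. This is a sketch under stated assumptions, not FoxyCart's actual implementation: the Tracker class, the "uid" cookie name, and the second site's URL are all hypothetical. It shows how one cookie, set once and returned on every later request, lets a third party join page views from unrelated sites into a single profile.

```python
import itertools
from http.cookies import SimpleCookie


class Tracker:
    """Toy third-party service embedded on many sites (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.profiles = {}  # cookie id -> list of pages the browser was on

    def handle(self, cookie_header, referer):
        """Process one request; return a Set-Cookie value, or None."""
        cookie = SimpleCookie(cookie_header or "")
        set_cookie = None
        if "uid" in cookie:
            uid = cookie["uid"].value  # returning visitor
        else:
            uid = "u%d" % next(self._ids)  # first visit: mint a new id
            fresh = SimpleCookie()
            fresh["uid"] = uid
            set_cookie = fresh["uid"].OutputString()
        self.profiles.setdefault(uid, []).append(referer)
        return set_cookie


tracker = Tracker()
jar = None  # the browser's stored cookie for the tracker's domain

for page in [
    "http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto",
    "http://shop.example.com/checkout",  # hypothetical second site
]:
    set_cookie = tracker.handle(jar, referer=page)
    if set_cookie:
        jar = set_cookie  # the browser keeps it and sends it back next time

print(tracker.profiles)
```

Both page views end up under the same id, even though the two sites have nothing to do with each other.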

IdreamBooks syndicates book reviews. I don't know anything about them, and their homepage doesn't seem to have a privacy policy.

ScorecardResearch "conducts research by collecting Internet web browsing data and then uses that data to help show how people use the Internet, what they like about it, and what they don’t." They probably know whether I like ScorecardResearch. Their cookie is set by the ShareThis software.

ShareThis was one of the companies I mentioned in my last post. ShareThis provides social sharing buttons for the NYPL catalog. They also take your privacy very seriously. Some more excerpts:
In addition to the sharing service offered directly to users, the technology we use to assist with user sharing also allows us to gather information from publisher Web sites that include our ShareThis Sharing Icon or use our advertising technology, and enables ShareThis and our partner publishers and advertisers to use the value of the shared content and other information gathered through our technology to facilitate the delivery of relevant, targeted advertising (the ShareThis Services). 
we also receive certain non-personally identifiable information (e.g., demographic information such as zip code) from our advertisers, ad network and publisher partners, and we may combine this information with what we have collected. We also collect information from third-party Web sites with whom you have registered, like social networks, that those third parties make publicly available. 
While using the ShareThis Services, We may place third party advertisers’ and publishers’ cookies and pixels on their behalf regarding Usage Information. 
We are not responsible for the information practices of these third parties and the cookies placed by ShareThis on behalf of those third parties.
So ShareThis turns out to be in the business of advertising. They use your browsing behavior over thousands of websites to help advertisers target advertising and content to you. That scene in Minority Report where Tom Cruise gets personalized ads on the billboards he walks by? That's what ShareThis is helping to make happen today, and the NYPL website is helping them.
[Image: Ad Mall from Minority Report]
They do this by cookie-sharing. In addition to setting a sharethis.com cookie, they set cookies for other companies, so those companies also get to know what you're reading. And when they do this, they enable other companies to connect your browsing behavior at NYPL with information you've provided to social networks. The result is that it's possible for a company selling Karl Marx merch to target ads to you based on your browsing of the Communist Manifesto catalog page.
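One common way this kind of sharing works is a "cookie sync": the widget asks your browser to fetch a tiny pixel from each partner, with the widget's own id for you attached to the request. The sketch below is a toy model under my own assumptions. The Partner class, the "sync" parameter, the partner domains, and all the ids are made up, and nothing here is taken from ShareThis's code.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

PAGE = "http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto"


class Partner:
    """Hypothetical ad company whose pixel is loaded on the widget's behalf."""

    def __init__(self, domain):
        self.domain = domain
        self.seen = []  # (partner's own id, widget's id, page the user was on)

    def pixel(self, url, own_cookie, referer):
        # The widget's id arrives in the query string; the partner's own
        # cookie and the page address arrive with the request as usual.
        widget_id = parse_qs(urlsplit(url).query)["sync"][0]
        self.seen.append((own_cookie, widget_id, referer))


partners = [Partner("ads-a.example"), Partner("ads-b.example")]
# Cookies the browser already holds for each partner (made-up values).
browser_cookies = {"ads-a.example": "a-77", "ads-b.example": "b-12"}
widget_id = "w-42"  # the widget's own id for this browser (made up)

for partner in partners:
    url = "http://%s/pixel?%s" % (partner.domain, urlencode({"sync": widget_id}))
    partner.pixel(url, browser_cookies[partner.domain], referer=PAGE)

for partner in partners:
    print(partner.domain, partner.seen)
```

After one page load, each partner can link its own id for you to the widget's id and to the catalog page you were reading.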

But it's not like ShareThis is completely promiscuous. Their privacy agreement limits their cookie sharing to an exclusive group of advertising companies. Here's the beginning of the list:
  • 33across.png
  • accuen.png
  • Adap.png
  • adaramedia.com
  • adblade.com
  • addthis.com
  • adroll.com
  • aggregateknowledge.com
  • appnexus.com
  • atlassolutions.com
  • AudienceScience.com
That's just the A's.

In 1972, Zoia Horn, a librarian at Bucknell University, was jailed for almost three weeks for refusing to testify at the trial of the Harrisburg 7 concerning the library usage of one of the defendants. That was a long time ago. No longer is there a need to put librarians in jail.



2 comments:

  1. Eric, I am glad that you are shifting some of your attention to patron privacy. Not enough library technologists care about it. Thank you.

    FYI, here is a link to a presentation I did earlier this year on the topic. http://ecommons.library.cornell.edu/handle/1813/36316

  2. Thanks for this exploration!
