Thursday, December 22, 2016

How to check if your library is leaking catalog searches to Amazon

I've been writing about privacy in libraries for a while now, and I get a bit down sometimes because progress is so slow. I've come to realize that part of the problem is that the issues are sometimes really complex and technical; people just don't believe that the web works the way it does, violating user privacy at every opportunity.

Content embedded in websites is a huge source of privacy leakage in library services. Cover images can be particularly problematic. I've written before that, without meaning to, many libraries send data to Amazon about the books a user is searching for; cover images are almost always the culprit. I've been reporting this issue to the library automation companies that enable this, but a year and a half later, nothing has changed. (I understand that "discovery" services such as Primo/Summon even include config checkboxes that make this easy to do; the companies say this is what their customers want.)

Two indications that a third-party cover image is a privacy problem are:
  1. the provider sets tracking cookies on the hostname serving the content.
  2. the provider collects personal information, for example as part of commerce. 
For example, covers served by Amazon send a bonanza of actionable intelligence to Amazon.

Here's how to tell if your library is sending Amazon your library search data.

Setup

You'll need a web browser equipped with developer tools; I use Chrome. Firefox should work, too.

Log into Amazon.com. They will give you a tracking cookie that identifies you. If you buy something, they'll have your credit card number, your physical and electronic addresses, records about the stuff you buy, and a big chunk of your web browsing history on websites that offer affiliate linking. These cookies are used to optimize the advertisements you're shown around the web.

To see your Amazon cookies, go to Preferences > Settings. Click "Show advanced settings..." (It's hiding at the bottom.)

Click the "Content settings..." button.

Now click the "All cookies and site data" button.

In the "Search cookies" box, type "amazon". Chances are, you'll see something like this.

I've got 65 cookies for "amazon.com"!

If you remove all the cookies and then go back to Amazon, you'll get 15 fresh cookies, most of them set to last for 20 years. Amazon knows who I am even if I delete all the cookies except "x-main".

Test the Library

Now it's time to find a library search box. For demonstration purposes, I'll use Harvard's "Hollis" catalog. I would get similar results at 36 different ARL libraries, but Harvard has lots of books and returns plenty of results. In the past, I've used "What to expect" as my search string, but just to make a point, I'll use "Killing Trump", a book that Bill O'Reilly hasn't written yet.

Once you've executed your search, choose View > Developer > Developer Tools

Click on the "Sources" tab to see the requests made of "images.amazon.com". Amazon has returned 1x1 clear pixels for three requested covers. The covers are requested by ISBN. But that's not all the information contained in the cover request.

To see the cover request, click on the "Network" tab and hit reload. You can see that the cover images were requested by a JavaScript file called "primo_library_web". (Hollis is an instance of Ex Libris' Primo discovery service.)

Now click on the request you're interested in. Look at the request headers.


There are two of interest, the "Cookie" and the "Referer".
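
If clicking through requests one at a time gets tedious, the same check can be roughed out in a script. This is only a sketch under some assumptions: that your browser's Network panel lets you save the page load as a HAR file (a JSON log of requests), and that the export keeps the request headers. The function names, the file name, and the sample data below are made up for illustration; they are not taken from a real capture.

```python
import json
from urllib.parse import urlsplit


def load_har(path):
    """Read a HAR export (hypothetical file name) from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def third_party_requests(har, first_party_host):
    """List (host, cookie_sent, referer) for requests not made to first_party_host."""
    findings = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        host = urlsplit(request["url"]).hostname or ""
        if host == first_party_host:
            continue  # a request back to the catalog itself is not a leak
        # HAR stores headers as a list of {"name": ..., "value": ...} objects.
        # Header names are case-insensitive, so lowercase them before lookup.
        headers = {h["name"].lower(): h["value"] for h in request.get("headers", [])}
        findings.append((host, "cookie" in headers, headers.get("referer")))
    return findings


# Hand-made HAR fragment for illustration only; the URLs and values are invented.
sample_har = {
    "log": {
        "entries": [
            {"request": {
                "url": "http://hollis.harvard.edu/search?q=example",
                "headers": [],
            }},
            {"request": {
                "url": "http://images.amazon.com/example-cover.jpg",
                "headers": [
                    {"name": "Cookie", "value": "x-main=EXAMPLE"},
                    {"name": "Referer",
                     "value": "http://hollis.harvard.edu/search?q=example"},
                ],
            }},
        ]
    }
}

for host, cookie_sent, referer in third_party_requests(sample_har, "hollis.harvard.edu"):
    print(host, "cookie sent:", cookie_sent, "referer:", referer)

# For a real export, replace sample_har with load_har("catalog.har").
```

Any third-party host that shows up with both a cookie and a Referer pointing back at your search page deserves a closer look. Some newer browser versions may strip cookie headers from HAR exports by default, so a "cookie sent: False" from an export is weaker evidence than what the Network panel shows you directly.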

The "Cookie" sent to Amazon is this:
x-main="oO@WgrX2LoaTFJeRfVIWNu1Hx?a1Mt0s";
skin=noskin; session-token="bcgYhb7dksVolyQIRy4abz1kCvlXoYGNUM5gZe9z4pV75B53o/4Bs6cv1Plr4INdSFTkEPBV1pm74vGkGGd0HHLb9cMvu9bp3qekVLaboQtTr+gtC90lOFvJwXDM4Fpqi6bEbmv3lCqYC5FDhDKZQp1v8DlYr8ZdJJBP5lwEu2a+OSXbJhfVFnb3860I1i3DWntYyU1ip0s="; x-wl-uid=1OgIBsslBlOoArUsYcVdZ0IESKFUYR0iZ3fLcjTXQ1PyTMaFdjy6gB9uaILvMGaN9I+mRtJmbSFwNKfMRJWX7jg==; ubid-main=156-1472903-4100903;
session-id-time=2082787201l;
session-id=161-0692439-8899146
Note that Amazon can tell who I am from the x-main cookie alone. In the privacy biz, this is known as "PII" or personally identifiable information.

The "Referer" sent to Amazon is this:
http://hollis.harvard.edu/primo_library/libweb/action/search.do?fn=search&ct=search&initialSearch=true&mode=Basic&tab=everything&indx=1&dum=true&srt=rank&vid=HVD&frbg=&tb=t&vl%28freeText0%29=killing+trump&scp.scps=scope%3A%28HVD_FGDC%29%2Cscope%3A%28HVD%29%2Cscope%3A%28HVD_VIA%29%2Cprimo_central_multiple_fe&vl%28394521272UI1%29=all_items&vl%281UI0%29=contains&vl%2851615747UI0%29=any&vl%2851615747UI0%29=title&vl%2851615747UI0%29=any
To put this plainly, my entire search session, including my search string "killing trump", is sent to Amazon, alongside my personal information, whether I like it or not. I don't know what Amazon does with this information. I assume if a government actor wants my search history, they will get it from Amazon without much fuss.

I don't like it.
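
To see how little work it takes for whoever receives a Referer like the one above to recover the search string, here is a minimal sketch using Python's standard library. The URL is a shortened version of the one shown above, and the assumption that the search term lives in a parameter whose name contains "freeText" comes only from that URL; other catalogs will use other parameter names. The function name is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened version of the Referer shown above (most parameters omitted).
referer = (
    "http://hollis.harvard.edu/primo_library/libweb/action/search.do"
    "?fn=search&mode=Basic&vid=HVD&vl%28freeText0%29=killing+trump"
)


def search_terms(referer_url):
    """Return the values of query parameters whose name contains 'freeText'."""
    # parse_qs undoes the percent-encoding and turns '+' back into spaces.
    query = parse_qs(urlsplit(referer_url).query)
    return [
        value
        for name, values in query.items()
        if "freeText" in name
        for value in values
    ]


print(search_terms(referer))  # ['killing trump']
```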

Rant

[I wrote a rant; but I decided to save it for a future post if needed.] Anyone want a Cookie?

Notes 12/23/2016:


  1. As Keith Jenkins noted, users can configure Chrome and Safari to block 3rd party cookies. Firefox won't block Amazon cookies, however. And some libraries advise users not to block 3rd party cookies because doing so can cause problems with proxy authentication.
  2. If Chrome's network panel tells you "Provisional headers are shown" this means it doesn't know what request headers were really sent because another plugin is modifying headers. So if you have HTTPS Everywhere, Ghostery, Adblock, or Privacy Badger installed, you may not be able to use Chrome developer tools to see request headers. Thanks to Scott Carlson for the heads up.
  3. Cover images from Google leak similar data; as does use of Google Analytics. As do Facebook Like buttons. Et cetera.
  4. Thanks to Sarah Houghton for suggesting that I write this up.

Update 3/23/2017:

There's good news in the comments!

29 comments:

  1. That's great information! If my library is running Primo, what's the most helpful thing I can tell them about changing this behavior? Is there a 'don't do that' flag somewhere in the Primo configuration?

    1. I don't have first hand experience with Primo admin, but I know that library automation companies want to make their customers happy. If libraries want to use Amazon covers, the vendor tries to make that an option.

    2. There must be a way to anonymize the request, though. Curious to find out more.

    3. Oh, so the request to the Amazon service comes straight from your own browser, and that's how/why the cookie gets accessed. That is a very bad way to build a catalog system, in terms of privacy. Even if you use it in a privacy mode, Amazon has a record of your IP address and your search string, from the referrer header. Crappy design! I hope librarians can put some pressure on the developer to change that.

    4. If only it was just one developer.

    5. Yeah. If we want secure, private services on the WWW, there's a lot that'd have to be rebuilt from the ground up. Inertia seems to be in the opposite direction.

    6. What's exasperating to me is how easy most of this would be to fix! See here for example.

    7. In the meantime, creating a hosts file to redirect those requests to the loopback address might be a reasonable solution.

  2. Thanks Eric, a problem I wasn't even aware existed. On a somewhat related note, I get a Facebook certificate error when I view this page...

  3. Thanks for this! I went through the process but did not see any "images.amazon.com" on the Sources tab...does that mean that we're in the clear? I hope so!

    1. Justin, you probably see some other hostnames in the sources tab. Two other cover image providers are syndetics and baker and taylor. If you use one of these, your library probably has a contract/agreement with them. If so, you'll probably want to review the contract to make sure that the vendor has committed to privacy and data retention policies that you're happy with.

    2. Thank you so much Eric. I appreciate how detail oriented this post is!

  4. This is all good information but surely a huge problem is that this blog is hosted on Google - and therefore Google is both tracking all the users to this blog (including myself) and all their other usage activity.

    For all these issues it's culture change. We use Google Analytics on websites because it's quick and easy, we use blogger sites for the same reason. Libraries use Amazon cover images because it's free hosting for enhanced content on their sites and most users don't seem to care.

    But to highlight the issue on a blogging platform that is leaking user information all over the place does seem like the height of hypocrisy.

    1. Obviously you haven't read the Go To Hellman Blog License Agreement.

      But seriously, there's a huge difference in expectations about privacy when you visit a blog and when you visit a library site. There's also a difference between leaking a library website visit and a catalog search. As well as a difference in user expectations between a google-hosted blog capturing usage information and a library catalog spitting out identified user searches to Amazon.

      Now that blogger is HTTPS, it's actually not leaking anything to any 3rd parties except twitter and statcounter (at least on this page).

      As for "height of hypocrisy", perhaps you have slept through 2016. My condolences on waking.

    2. Quite an honorable goal for the Library to seal itself off from third-parties, even if it's not the most realistic. But I think the larger take-away here is that libraries have the opportunity to recognize the lay of the land and be transparent enough to inform the public's expectations of privacy. We need to come to terms with the fact that we are not the bastion of some of our loftier principles. We transgress. But, if we come out from under the hood of that delusion perhaps we can educate in the process?

    3. I'm not saying libraries should seal themselves off from third-parties. Rather, they should make careful, considered choices about third party resources and should avoid spewing patron data when it can be avoided. Building a safe space does not require building a "bastion". You're right though, a safe space requires transparency and education.

  5. Evergreen ILS can be counted in the not affected column. It caches cover art server side, so the client request is always to the Evergreen server. And I don't think Amazon cover art is even supported, since the last time I checked it was against Amazon's TOS (granted, that was several years ago; I remember something about a requirement that the images be used to drive traffic to Amazon.com).

    1. Caching 3rd party content is a good way to prevent privacy leakage. Nice to hear!

  6. It is good to be mindful that "free" services that libraries and users alike depend on typically are funded with data about users.

    It is also good to keep things in perspective. It's a safe bet that the Internet access that libraries routinely provide hemorrhages much more sensitive patron data than this.

    1. True, but if Amazon came to your library and said "we'll pay $50,000/yr for your complete search logs, identified by user", what would you tell them?

    2. We'd say that violates our policies and mission. There are a number of things we don't do that could generate significant revenue such as accept advertising. BTW, we don't use Amazon cover images.

      I do think some perspective is in order. Library staff have always known a lot about what individuals read as well as why -- this knowledge is the basis of common war stories. And what we call tracking is typically anonymized because those doing the tracking don't care who it is but use the tracking to provide personalization services. Yes, there are ways of deanonymizing data, but if you want to spy on people, the library ILS is one of the worst sources possible.

      There are so many other services that the library facilitates access to that give away so much more information.

      Truly protecting privacy renders most of the Internet practically unusable.

    3. Umm, @banerjek, I did a search for "venereal disease" on the Primo I think your library has, and I got an Amazon cover image.

    4. You are correct -- my error. I will raise the issue, as I do think people should be aware of how things work.

      I still do think that this is a minor privacy issue compared to many others patrons face while using library services.

      It reminds me of recently having to secure a server in response to our central IT. They were upset that a machine that existed only to serve public images over HTTPS, with no sensitive data on it, supported SSLv3 (weak encryption). Meanwhile, email remains the preferred channel for critical communication and the pagers that our clinical operations rely on are not encrypted.

    5. Pagers are a great example.

      I like to compare privacy threats to global warming. Each individual action is minuscule; we can only solve the problem by changing the larger "context". How do we do that? "Think globally, act locally" is a good start.

    6. Fully agreed. But I'd also want to direct limited resources where they have the most impact.

      I do think raising awareness of these things is good (as well as the need to read TOS since Amazon makes it clear what's going on). I have serious concerns about the widespread practice of depending on "free" commercial products for production services.

  7. Thought I'd update that ExLibris decided to proxy all their requests for Book Covers for Primo, which solves the problem for all their users.

  8. As we communicated to the entire Primo customer community a few weeks ago I would like to update on the measures Ex Libris already took.
    In order to protect privacy in Primo searches, we have redirected all requests for book covers from third party providers such as Amazon and Google through a proxy on the Ex Libris cloud data center. This way, there is no transfer of client IP data or cookies to these providers’ systems.
    This solution was rolled out to all cloud environments during February.

    Yuval Kiselstein
    Director of Product Management,
    Ex Libris Discovery and Delivery solutions
    yuval.kiselstein@exlibrisgroup.com
