Saturday, September 27, 2014

Online Bookstores to Face Stringent Privacy Law in New Jersey

Before you read this post, be aware that this web page is sharing your usage with Google, Facebook, StatCounter.com, unglue.it and Harlequin.com. Google because this is Blogger. Facebook because there's a "Like" button, StatCounter because I use it to measure usage, and Harlequin because I embedded the cover for Rebecca Avery's Maid to Crave directly from Harlequin's website. Harlequin's web server has been sent the address of this page along with your IP address as part of the HTTP transaction that fetches the image, which, to be clear, is not a picture of me.

I'm pretty sure that having read the first paragraph, you're now able to give informed consent if I try to sell you a book (see unglue.it embed -->) and constitute myself as a book service for the purposes of a New Jersey "Reader Privacy Act", currently awaiting Governor Christie's signature. That act would make it unlawful to share information about your book use (borrowing, downloading, buying, reading, etc.) with a third party, in the absence of a court order to do so. That's good for your reading privacy, but a real problem for almost anyone running a commercial "book service".

Let's use Maid to Crave as an example. When you click on the link, your browser first sends a request to Harlequin.com. Using the instructions in the returned HTML, it then sends requests to a bunch of web servers to build the web page, complete with images, reviews and buy links. Here's the list of hosts contacted as my browser builds that page:

  • www.harlequin.com
  • stats.harlequin.com
  • seal.verisign.com (A security company)
  • www.goodreads.com  (The review comes from GoodReads. They're owned by Amazon.)
  • seal.websecurity.norton.com (Another security company)
  • www.google-analytics.com
  • www.googletagservices.com
  • stats.g.doubleclick.net (Doubleclick is an advertising network owned by Google)
  • partner.googleadservices.com
  • tpc.googlesyndication.com
  • cdn.gigya.com (Gigya’s Consumer Identity Management platform helps businesses identify consumers across any device, achieve a single customer view by collecting and consolidating profile and activity data, and tap into first-party data to reach customers with more personalized marketing messaging.)
  • cdn1.gigya.com
  • cdn2.gigya.com
  • cdn3.gigya.com
  • comments.us1.gigya.com
  • gscounters.us1.gigya.com
  • www.facebook.com (I'm told this is a social network)
  • connect.facebook.net
  • static.ak.facebook.com
  • s-static.ak.facebook.com
  • fbstatic-a.akamaihd.net (Akamai is here helping to distribute facebook content)
  • platform.twitter.com (yet another social network)
  • syndication.twitter.com
  • cdn.api.twitter.com
  • edge.quantserve.com (QuantCast is an "audience research and behavioural advertising company")

All of these servers are given my IP address and the URL of the Harlequin page that I'm viewing. All of these companies except Verisign, Norton and Akamai also set tracking cookies that enable them to connect my browsing of the Harlequin site with my activity all over the web. The Guardian has a nice overview of these companies that track your use of the web. Most of them exist to better target ads at you. So don't be surprised if, once you've visited Harlequin, Amazon tries to sell you romance novels.
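A rough way to summarize a host list like the one above is to group the hosts by base domain and set aside the first party. This is only a sketch: `base_domain` is a hypothetical helper that naively keeps the last two labels of a hostname, which is wrong for suffixes such as `.co.uk` (a real tool would consult the Public Suffix List), and the host list is abbreviated from the one above.

```python
from collections import defaultdict

def base_domain(host):
    """Naive guess at the registrable domain: the last two labels.
    Hypothetical helper; incorrect for suffixes like .co.uk."""
    return ".".join(host.lower().split(".")[-2:])

def third_parties(first_party_host, hosts):
    """Group contacted hosts by base domain, leaving out the first party."""
    own = base_domain(first_party_host)
    groups = defaultdict(list)
    for host in hosts:
        domain = base_domain(host)
        if domain != own:
            groups[domain].append(host)
    return dict(groups)

# Abbreviated from the list of hosts contacted by the Harlequin page.
hosts = [
    "www.harlequin.com", "stats.harlequin.com", "www.goodreads.com",
    "www.google-analytics.com", "stats.g.doubleclick.net",
    "cdn.gigya.com", "comments.us1.gigya.com",
    "www.facebook.com", "connect.facebook.net",
    "platform.twitter.com", "edge.quantserve.com",
]

groups = third_parties("www.harlequin.com", hosts)
for domain, members in sorted(groups.items()):
    print(domain, len(members))
```

Note that grouping by domain undercounts ownership: facebook.com and facebook.net come out as separate groups even though one company runs both.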

Certainly Harlequin qualifies as a commercial book service under the New Jersey law. And certainly Harlequin is giving personal information (IP addresses are personal information under the law) to a bunch of private entities without a court order. And most certainly it is doing so without informed consent. So its website is doing things that will be unlawful under the New Jersey law.

But it's not alone. Almost any online bookseller uses services like those used by Harlequin. Even Amazon, which is pretty much self-contained, has to send your personal information to Ingram to fulfill many of the book orders sent to it. Under the New Jersey law, it appears that Amazon will need to get your informed consent to have Ingram send you a book. And really, do I care? Does this improve my reading privacy?

The companies that can ignore this law are Apple, Target, Walmart and the like. Book services are exempt if they derive less than 2% of their US consumer revenue from books. So yay Apple.

Other internet book services will likely respond to the law with pop-up legal notices like those you occasionally see on sites trying to comply with European privacy laws. "This site uses cookies to improve your browsing experience. OK?" They constitute privacy theater, a stupid legal show that doesn't improve user privacy one iota.

Lord knows we need some basic rules about privacy of our reading behavior. But I think the New Jersey law does a lousy job of dealing with the realities of today's internet. I wonder if we'll ever start a real discussion about what and when things should be private on the web.

Wednesday, September 24, 2014

Emergency! Governor Christie Could Turn NJ Library Websites Into Law-Breakers

Nate Hoffelder over at The Digital Reader highlighted the passage of a new "Reader Privacy Act" by the New Jersey State Legislature. If signed by Governor Chris Christie, it would take effect immediately. It was sponsored by my state senator, Nia Gill.

In light of my writing about privacy on library websites, this poorly drafted bill, though well intentioned, would turn my library's website into a law-breaker, subject to a $500 civil fine for every user. (It would also require us to make some minor changes at Unglue.it.)
  1. It defines "personal information" as "(1) any information that identifies, relates to, describes, or is associated with a particular user's use of a book service; (2) a unique identifier or Internet Protocol address, when that identifier or address is used to identify, relate to, describe, or be associated with a particular user, as related to the user’s use of a book service, or book, in whole or in partial form; (3) any information that relates to, or is capable of being associated with, a particular book service user’s access to a book service."
  2. “Provider” means any commercial entity offering a book service to the public.
  3. A provider shall only disclose the personal information of a book service user [...] to a person or private entity pursuant to a court order in a pending action brought by [...] by the person or private entity.
  4. Any book service user aggrieved by a violation of this act may recover, in a civil action, $500 per violation and the costs of the action together with reasonable attorneys’ fees.
My library, Montclair Public Library, uses a web catalog run by Polaris, a division of Innovative Interfaces, a private entity, for BCCLS, a consortium serving northern New Jersey. Whenever I browse a catalog entry in this catalog, a cookie is set by AddThis (and probably other companies) identifying me and the web page I'm looking at. In other words, personal information as defined by the act is sent to a private entity, without a court order.

And so every user of the catalog could sue Innovative for $500 each, plus legal fees.

The only out is "if the user has given his or her informed consent to the specific disclosure for the specific purpose." Having a terms of use and a privacy policy is usually not sufficient to achieve "informed consent".

Existing library privacy laws in NJ have reasonable exceptions for "proper operations of the library". This law does not have a similar exemption.

I urge Governor Christie to veto the bill and send it back to the legislature for improvements that take account of the realities of library websites and make it easier for internet bookstores and libraries to operate legally in the Garden State.

You can contact Gov. Christie's office using this form.

Update: Just talked to one of Nia Gill's staff; they're looking into it. Also updated to include the 2nd set of amendments.

Update 2: A close reading of the California law on which the NJ statute was based reveals that poor wording in section 4 is the source of the problem. In the California law, it's clear that it pertains only to the situation where a private entity is seeking discovery in a legal action, not when the private entity is somehow involved in providing the service.

Where the NJ law reads
A provider shall only disclose the personal information of a book service user to a government entity, other than a law enforcement entity, or to a person or private entity pursuant to a court order in a pending action brought by the government entity or by the person or private entity.  
it's meant to read
In a pending action brought by the government entity other than a law enforcement entity, or by a person or by a private entity, a provider shall only disclose the personal information of a book service user to such entity or person pursuant to a court order. 

Monday, September 22, 2014

Attribution Meets Open Access

Credits Dancer (see on YouTube)
It drives my kids crazy, but I always stay for the credits after the movie. I'm writing this while on a plane over the Atlantic, and I just watched Wes Anderson's Grand Budapest Hotel. Among the usual credits for the actors, the producers, the directors, writers, editors, composers, designers, musicians, key grips, best boys, animators, model makers and the like, Michael Taylor is credited as the painter of "Johannes von Hoytl's Boy with Apple" along with his model, Ed Munro. "The House of Waris" is credited for "Brass Knuckle-dusters and Crossed Key Pins". There's a "Drapesmaster", a Milliner and two "Key Costume Cutters". There are even "Photochrom images courtesy of The Library of Congress". To reward me for watching to the end there's a funny Russian dancer over the balalaika chorus.

It says a lot about the movie industry that so much work has gone into the credits. They are a fitting recognition of the miracle of a myriad of talents collaborating to result in a Hollywood movie. But the maturity of the film industry is also reflected in the standardization of the form of this attribution.

The importance of attribution is similarly reflected by its presence in each of the Creative Commons licenses. But many of the digital media that have adopted Creative Commons licensing have not reached the sort of attribution maturity seen in the film industry. The book publishing industry, for example, hides the valuable contributions of copy editors, jacket designers, research assistants and others. It's standard practice to attribute a work to the author alone. If someone spends time to make an ebook work well, that generally doesn't get a credit alongside the author.

The Creative Commons licenses require attribution, but don't specify much about how the attribution is to be done, and it's taken a while for media-specific conventions to emerge. It seems to be accepted practice, for example, that CC-licensed blog posts require a back-link to the original blog post. People who use CC-licensed photos to illustrate a slide presentation typically have a credits page with links to the sources at the end.

Signs of maturation were omnipresent at the 6th Conference for Open Access Scholarly Publishing, which I'm just returning from. Prominent in the list of achievements was the announcement of a "Shared Statement and Community Principles on Expectations of Scholarly Standards on Attribution", a set of attribution principles for open access scholarly publications, signed by all the important open access scholarly publishers.

The four agreed-upon principles are as follows:

  1. Researchers choosing Open Access and using liberal licenses do so because they wish to maximise access to and re-use of their work. We acknowledge the tradition of both freely giving knowledge to our communities and also the expectation that contributions will be respected and that full credit is given according to scholarly norms.
  2. Authors choose Creative Commons licenses in part to ensure attribution and the assignment of credit. The community expects that where a work is reprinted, collected, aggregated or otherwise re-used substantially as a whole that the original source, location and free availability of the original version will be both made explicit and emphasised.
  3. The community expects that where modifications have been made to an article that this will be made explicit and every practicable effort will be made to make the nature and scope of modifications explicit. Where a derivative is digital all practicable efforts should be made to make comparison with the original version as easy as possible for the user.
  4. The community assumes, consistent with the terms of the Creative Commons licenses, that unless noted otherwise authors have not endorsed any republication or modification of their original work. Where authors have explicitly endorsed the republication or modified version this should be made explicit in a way which is separate to the attribution.

These principles, and the implementation guidelines that will result from further consultations, are particularly needed because many scholars, while supporting the reuse enabled by CC BY licenses, are concerned about possible misuse. The principles reinforce that when a work is modified, the substance of the modifications should be made clear to the end user, and that further, there must be no implication that republication carries any endorsement by the original authors.

One thing that is likely to emerge from this process is the use of CrossRef DOIs as attribution URLs. DOIs can be resolved (via redirection) to an authoritative web page and can be maintained by the publisher so that links needn't break when content moves.
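To make the idea concrete, here is a minimal sketch of turning a DOI into an attribution link. It assumes the public resolver at https://doi.org/, which redirects to wherever the publisher currently hosts the work. The helper names `doi_to_url` and `attribution`, the credit-line format and the example DOI are all made up for illustration; none of them come from the Shared Statement.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public resolver; answers with a redirect

def doi_to_url(doi):
    """Turn a bare DOI into a resolver URL suitable for an attribution link."""
    doi = doi.strip()
    # Accept the common "doi:" prefix.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(doi, safe="/")

def attribution(title, authors, doi, license_name):
    """Format a one-line credit (hypothetical format, for illustration)."""
    return f'"{title}" by {authors}, {license_name}, {doi_to_url(doi)}'

# The DOI below is invented for the example.
print(attribution("An Example Article", "Alice and Bob",
                  "doi:10.1234/example.5678", "CC BY 4.0"))
```

Because the link points at the resolver rather than at the publisher's server, the credit line stays valid when the publisher moves the content and updates the DOI record.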

As scholarly content gets remixed, revised and repurposed, there will increasingly be a need to track contributions every bit as elaborate as for Grand Budapest Hotel. Imagine a paper by Alice analyzing data from Bob on a sample by Carol, with later corrections by Eve. Luckily we live in the future and there's already a technology and user framework that shows how it can be done. That technology, the future of attribution (I hope), is Distributed Version Control. A subsequent post will discuss why every serious publisher needs to understand GitHub.

The emphasis on community in the "Shared Statement" is vitally important. With consultation and shared values, we'll soon all be dancing at the end of the credits.

Monday, September 15, 2014

Analysis of Privacy Leakage on a Library Catalog Webpage

My post last month about privacy on library websites, and the surrounding discussion on the Code4Lib list, prompted me to do a focused investigation, which I presented at last week's Code4Lib-NYC meeting.

I looked at a single web page from the NYPL online catalog. I used Chrome developer tools to trace all the requests my browser made in the process of building that page. The catalog page in question is for The Communist Manifesto. It's here: http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto .
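If you want to repeat this kind of trace, the browser's network log can be exported as a HAR file (a JSON document) and the distinct hosts pulled out with a few lines of code. The sketch below assumes the standard HAR layout, where each request URL sits at `log.entries[].request.url`. The sample data is a tiny hand-made stand-in, not my actual trace, and the resource paths in it are invented.

```python
import json
from urllib.parse import urlsplit

def hosts_in_har(har):
    """Return the distinct hostnames requested, in first-seen order."""
    seen = []
    for entry in har["log"]["entries"]:
        host = urlsplit(entry["request"]["url"]).hostname
        if host and host not in seen:
            seen.append(host)
    return seen

# Hand-made stand-in for an exported HAR file (paths are invented).
sample_har = json.loads("""
{"log": {"entries": [
  {"request": {"url": "http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto"}},
  {"request": {"url": "http://cdn.bibliocommons.com/example.css"}},
  {"request": {"url": "http://www.google-analytics.com/example.js"}},
  {"request": {"url": "http://cdn.bibliocommons.com/example.js"}}
]}}
""")

for host in hosts_in_har(sample_har):
    print(host)
```

With a real export, you would replace the `json.loads` call with `json.load` on the saved file.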

You can imagine how reading this work might have been of interest to government investigators during the early fifties when Sen. Joe McCarthy was at the peak of his power. Note that, following good search-engine-optimization practice, the URL embeds the title of the resource being looked at.

I chose the NYPL catalog as my example, not because it's better or worse than any other library catalog with respect to privacy, but because it's exemplary. The people building it are awesome, and the results are top-notch. I happen to know the organization is working on making privacy improvements. Please don't take my investigation to be a criticism of NYPL. But it was Code4Lib-NYC, after all.

As an example of how far ahead of the curve the NYPL catalog is, note that the webpage offers links to free downloads at Project Gutenberg. The Communist Manifesto is in the public domain, so any library catalog that tells you that no ebook is available is lying. The majority of library catalogs today lie about this.

So here are the results.

In building the Communist Manifesto catalog page, my browser contacts 11 different hosts from 8 different companies.
  • nypl.secure.bibliocommons.com
  • cdn.bibliocommons.com
  • api.bookish.com
  • contentcafe2.btol.com
  • www.google-analytics.com
  • www.googletagmanager.com
  • cdn.foxycart.com
  • idreambooks.com
  • ws.sharethis.com
  • wd-edge.sharethis.com
  • b.scorecardresearch.com
Each of these hosts is informed of the address of the web page that generates the request. They are told, essentially, "this user is looking at our Communist Manifesto page". Some of the hosts need this information to deliver the services they contribute. Others get the same information via the "referer" header generated as part of the HTTP protocol. If the catalog were served with the more secure protocol "HTTPS", the referer header would not be sent to hosts contacted over plain HTTP.
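Whether the referer header gets sent depends on both the page and the resource being fetched. The sketch below models the long-standing browser default, under which the header is withheld only when an HTTPS page requests a plain-HTTP resource. The function name `sends_referer` is mine, the resource URLs are bare hostnames chosen for illustration, and sites can change this behavior with a referrer policy, so treat it as a simplified model rather than a description of any particular browser.

```python
from urllib.parse import urlsplit

def sends_referer(page_url, resource_url):
    """Simplified model of the classic default: send Referer unless the
    request goes from an HTTPS page to a plain-HTTP resource."""
    page_scheme = urlsplit(page_url).scheme
    resource_scheme = urlsplit(resource_url).scheme
    return not (page_scheme == "https" and resource_scheme == "http")

page_http = "http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto"
page_https = page_http.replace("http://", "https://", 1)

resources = [
    "http://ws.sharethis.com/",           # third party over plain HTTP
    "https://www.google-analytics.com/",  # third party over HTTPS
]

for page in (page_http, page_https):
    for resource in resources:
        print(urlsplit(page).scheme, "->", resource, sends_referer(page, resource))
```

In this model, moving the catalog to HTTPS hides the page address from third parties loaded over HTTP, but a third party that is itself loaded over HTTPS still receives it.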

The first of these is Bibliocommons. I've written about Bibliocommons before. They host the NYPL catalog "in the cloud". I'm not particularly concerned about Bibliocommons with respect to privacy, because they contract directly with NYPL, and I'm pretty sure that contracts are in place that bind Bibliocommons to the privacy policies in place at NYPL. But since HTTP is used rather than HTTPS, every host between me and the bibliocommons server can see and capture the URL of the web page I'm looking at. At the moment, I'm using the wifi in a Paris cafe, so the hosts that can see that are in the proxad.net, aas6453.net, level3.net, firehost.com and other domains. I don't know what they do with my browsing history.

I've previously written about the NYPL's use of the Bookish recommendation engine.  The BTOL.com link is for Baker&Taylor's "Content Cafe" service that provides book covers for library catalogs. I'm guessing (but don't know for sure) that these offerings have privacy policies that are aware of the privacy expectations of library users.

Yes, Google is one of the companies that NYPL tells about my web browsing. I'm pretty sure that Google knows who I am. A careful look at the Google Analytics privacy policy suggests that they can't share my browsing history outside Google. Unless required to by law.

Foxycart is not a company I was familiar with. They provide the shopping cart technology that lets me buy a book from the NYPL website and benefit them with part of the proceeds. I've been in favor of enabling such commerce on library sites because libraries need to do it to participate fully in the modern reading ecosystem. But it's still controversial in the library world.

Foxycart's privacy policy, like all privacy policies ever written, takes your privacy very seriously. Some excerpts:
When you visit this website, some information, such as the site that referred you to us, your IP and email address, and navigational and purchase information, may be collected automatically as part of the site’s operation. This information is used to generate user profiles and to personalize the web site to your particular interests. 
The information collected online is stored indefinitely and is used for various purposes. 
Cookies offer you many conveniences. They allow FoxyCart.com LLC, and certain third party content providers, to recognize information, and so can determine what content is best suited to your needs.  
We also reserve the right to disclose your personal information if required to do so by law, or in the good faith belief that such action is reasonably necessary to comply with legal process, respond to claims, or protect the rights, property or safety of our company, employees, customers or the public.

Here I need to explain about cookies. When a website gives you a cookie, it acquires the ability to track you across all the websites that company serves. This can be a great convenience for you. When you fill out a credit card form with your name and address, Foxycart can remember it for you so you don't have to type it in again when you come back to order something else. You might find that creepy if the last order you placed was on a porn site. But while NYPL hasn't told FoxyCart anything that could identify you personally, your interaction with FoxyCart is such that you may well choose to identify yourself. And all that information is stored forever. And FoxyCart can pass that information to all the Sen. Joe McCarthys of 2020. As well as certain 3rd party content providers. FoxyCart probably doesn't give away your information today, but will they even be around in 2020?
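The mechanics can be shown with a toy simulation. Everything here is hypothetical: the `Tracker` and `Browser` classes, the `uid-N` cookie format and the example.org-style site addresses are invented, and no real company's code looks like this. The point is only that one cookie, sent back on every request to the same third party, is enough to stitch visits to unrelated sites into a single history.

```python
import itertools

class Tracker:
    """Toy model of a third-party service embedded on many websites."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.log = {}  # cookie value -> list of page URLs seen

    def handle_request(self, cookie, referer):
        """Record the page that embedded us; return the cookie to store."""
        if cookie is None:
            cookie = f"uid-{next(self._ids)}"  # first visit: assign an ID
        self.log.setdefault(cookie, []).append(referer)
        return cookie

class Browser:
    """Toy browser that keeps one cookie per third-party service."""

    def __init__(self):
        self.cookies = {}  # tracker -> cookie value

    def visit(self, page_url, embedded_trackers):
        for tracker in embedded_trackers:
            cookie = self.cookies.get(tracker)
            self.cookies[tracker] = tracker.handle_request(cookie, page_url)

tracker = Tracker()
browser = Browser()
browser.visit("http://library.example/catalog/communist_manifesto", [tracker])
browser.visit("http://shop.example/checkout", [tracker])
print(tracker.log)
```

After the two visits, the tracker holds both page addresses under one ID, even though the library site and the shop never exchanged anything with each other.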

IdreamBooks syndicates book reviews. I don't know anything about them, and their homepage doesn't seem to have a privacy policy.

ScorecardResearch "conducts research by collecting Internet web browsing data and then uses that data to help show how people use the Internet, what they like about it, and what they don’t." They probably know whether I like ScorecardResearch. Their cookie is set by the ShareThis software.

ShareThis was one of the companies I mentioned in my last post. ShareThis provides social sharing buttons for the NYPL catalog. They also take your privacy very seriously. Some more excerpts:
In addition to the sharing service offered directly to users, the technology we use to assist with user sharing also allows us to gather information from publisher Web sites that include our ShareThis Sharing Icon or use our advertising technology, and enables ShareThis and our partner publishers and advertisers to use the value of the shared content and other information gathered through our technology to facilitate the delivery of relevant, targeted advertising (the ShareThis Services). 
we also receive certain non-personally identifiable information (e.g., demographic information such as zip code) from our advertisers, ad network and publisher partners, and we may combine this information with what we have collected. We also collect information from third-party Web sites with whom you have registered, like social networks, that those third parties make publicly available. 
While using the ShareThis Services, We may place third party advertisers’ and publishers’ cookies and pixels on their behalf regarding Usage Information. 
We are not responsible for the information practices of these third parties and the cookies placed by ShareThis on behalf of those third parties.
So ShareThis turns out to be in the business of advertising. They use your browsing behavior over thousands of websites to help advertisers target advertising and content to you. That scene in Minority Report where Tom Cruise gets personalized ads on the billboards he walks by? That's what ShareThis is helping to make happen today, and the NYPL website is helping them.
Ad Mall from Minority Report
They do this by cookie-sharing. In addition to setting a sharethis.com cookie, they set cookies for other companies, so they also get to know what you're reading. And when they do this, they enable other companies to connect your browsing behavior at NYPL with information you've provided to social networks. The result is that it's possible for a company selling Karl Marx merch to target ads at you based on browsing the Communist Manifesto catalog page.
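Here is a minimal sketch of why shared cookies matter, assuming two companies that each know the same browser under a different ID. The function names, the IDs and the data are all invented for illustration, and I don't know how any particular company implements this. Once a match table links the two IDs, a browsing history held by one company can be joined to a profile held by the other.

```python
def sync(match_table, widget_id, partner_id):
    """Record that two companies' cookie IDs belong to the same browser."""
    match_table[widget_id] = partner_id

def join_histories(match_table, widget_history, partner_profiles):
    """Attach each partner profile to the pages seen under the matching ID."""
    joined = {}
    for widget_id, pages in widget_history.items():
        partner_id = match_table.get(widget_id)
        if partner_id in partner_profiles:
            joined[partner_id] = {
                "profile": partner_profiles[partner_id],
                "pages": pages,
            }
    return joined

# Invented data: what a sharing widget saw, and what an ad partner knows.
widget_history = {
    "w-17": ["http://library.example/catalog/communist_manifesto"],
    "w-18": ["http://library.example/catalog/cookbook"],
}
partner_profiles = {"p-903": {"zip": "07042", "interests": ["politics"]}}

match_table = {}
sync(match_table, "w-17", "p-903")  # happens when both cookies are set together

print(join_histories(match_table, widget_history, partner_profiles))
```

Only the synced ID gets joined; the second browser's history stays unlinked until some later request gives the two companies a chance to match it.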

But it's not like ShareThis is completely promiscuous. Their privacy agreement limits their cookie sharing to an exclusive group of advertising companies. Here's the beginning of the list:
  • 33across.png
  • accuen.png
  • Adap.png
  • adaramedia.com
  • adblade.com
  • addthis.com
  • adroll.com
  • aggregateknowledge.com
  • appnexus.com
  • atlassolutions.com
  • AudienceScience.com
That's just the A's.

In 1972, Zoia Horn, a librarian at Bucknell University, was jailed for almost three weeks for refusing to testify at the trial of the Harrisburg 7 concerning the library usage of one of the defendants. That was a long time ago. No longer is there a need to put librarians in jail.



Wednesday, August 13, 2014

Libraries are Giving Away the User-Privacy Store

AddThis makes some really nice widgets. Here are some for sharing this blogpost:

ShareThis is another company that does pretty much the same thing. Their share buttons are down at the end of the post. AddThis is bigger. It provides "behavioral, contextual, and interest based data that spans across hundreds of content categories and topics, reaching 1.7 billion uniques a month."

The widgets help users share your content. At the same time, AddThis and ShareThis widgets help a publisher figure out who is sharing what, while distributing the content into other websites. To do this, they track users and see what sorts of websites they like. They can also work with advertising networks to improve the relevancy of ads shown to users. The user tracking works by setting user cookies or "web beacons" that enable the tracking of users across websites. In the case of AddThis, users are also tracked using "Canvas Fingerprinting", a technique that works even when a user blocks cookie tracking. ProPublica recently wrote about this technology, calling it the "Online Tracking Device that's Nearly Impossible to Block".

Here's what the ShareThis Privacy Policy says:
In some cases, if you have chosen to make PII (like your name) publicly available through third party sites like social networks, we may seek your consent to use that PII in connection with services we offer in conjunction with our partners. We will not disclose your PII without your consent.
We and our publisher, advertiser and ad network partners also use this data for other related purposes (for example, to do research regarding the results of our online advertising campaigns or to better understand the interests or activities of users of the ShareThis Services).
Similarly, AddThis says:
When an End User downloads a page that contains an AddThis Button, we may deploy a cookie on our own behalf or on behalf of our data partners, to record information about how an End User uses the web, such as the web search that landed the End User on a particular page or categories of the End User's interests. We may use the Data to target advertising toward the End User or authorize others to do the same. 
Many websites are using Google Analytics to measure usage; they let Google track their users in the same way (the website I run, Unglue.it, uses Google Analytics). However, the Analytics terms of service seem not to allow Google to share the collected data as freely as AddThis and ShareThis do.

Both AddThis and ShareThis assert in the legal terms that they mustn't collect usage information from children, so if children use your site, you're not supposed to use these services. Google Analytics does not have this restriction, which presumably means they can't use their data to advertise to children.

Together with "Cookie Syncing" and "Evercookies", the cumulative effect of all this tracking is that website users can be pretty comprehensively tracked, and if need be, identified, whether they like it or not. In exchange for deploying the trackers, websites get access to the valuable pool of information about their users.

Matt Mullenweg (of WordPress) has an interesting perspective:
services like AddThis and ShareThis will always spy on and tag your audience when you use their widgets, and you should avoid them if you care about that sort of thing.
This puts libraries in somewhat of a quandary. Traditionally, libraries have been havens of privacy for their users. Librarians have famously gone to jail for their refusal to turn over circulation records to law enforcement. But it seems that libraries are not much protecting their users from the sort of information gathering done by AddThis, ShareThis, and Google. For example, New York Public Library uses Google Analytics and ShareThis. OCLC and Worldcat use AddThis. My own public library catalog (hosted by BCCLS) sets cookies for AddThis. I suppose they don't consider that their websites could be directed at children. Even the American Library Association's webpage extolling the importance of privacy in libraries makes use of Google Analytics. (Ironically, the link to a website privacy policy is broken on that page!)

It's true that these trackers are very common; even WhiteHouse.gov has employed AddThis buttons. But it seems to me that if libraries still think that user privacy is valuable in this age of social media, they need to rethink their use of web user tracking companies. What disturbs me most is that there hasn't been much public discussion about the future role of privacy in library websites, even as it's rapidly being lost.

Update (Aug 15): AddThis says they're not using canvas fingerprinting and have terminated their test of it. I don't think this really changes the cost/benefit analysis for libraries. It remains true that libraries that use AddThis or ShareThis are allowing a third party to track their patrons' catalog browsing (not just their social sharing), under terms which permit the companies to use the data for advertising purposes. Use of Google Analytics allows Google to do the same tracking, but does not appear to permit use for advertising. Either way, libraries need to make informed choices and communicate those choices to their users. Same for Facebook "Like" buttons. Commercial sites, obviously, have different priorities and responsibilities.

Update (Aug 19): There are a number of free open-source solutions available both for social sharing and for analytics. There's a very useful discussion of these issues on Hacker News.