Sunday, August 30, 2015

Update on the Library Privacy Pledge

The Library Privacy Pledge of 2015, which I wrote about previously, has been finalized. We got a lot of good feedback, and the big changes have focused on the schedule.

Now, any library, organization or company that signs the pledge will have 6 months to implement HTTPS from the effective date of their signature. This should give everyone plenty of margin to do a good job on the implementation.

We pushed back our launch date to the first week of November. That's when we'll announce the list of "charter signatories". If you want your library, company or organization to be included in the charter signatory list, please send an e-mail to pledge@libraryfreedomproject.org.

The Let's Encrypt project will be launching soon. They are just one certificate authority that can help with HTTPS implementation.

I think this is a very important step for the library information community to take, together. Let's make it happen.

Here's the finalized pledge:

The Library Freedom Project is inviting the library community - libraries, vendors that serve libraries, and membership organizations - to sign the "Library Digital Privacy Pledge of 2015". For this first pledge, we're focusing on the use of HTTPS to deliver library services and the information resources offered by libraries. It’s just a first step: HTTPS is a privacy prerequisite, not a privacy solution. Building a culture of library digital privacy will not end with this 2015 pledge, but committing to this first modest step together will begin a process that won't turn back.  We aim to gather momentum and raise awareness with this pledge; and will develop similar pledges in the future as appropriate to advance digital privacy practices for library patrons.

We focus on HTTPS as a first step because of its timeliness. The Let's Encrypt initiative of the Electronic Frontier Foundation will soon launch a new certificate infrastructure that will remove much of the cost and technical difficulty involved in the implementation of HTTPS, with general availability scheduled for September. Due to a heightened concern about digital surveillance, many prominent internet companies, such as Google, Twitter, and Facebook, have moved their services exclusively to HTTPS rather than relying on unencrypted HTTP connections. The White House has issued a directive that all government websites must move their services to HTTPS by the end of 2016. We believe that libraries must also make this change, lest they be viewed as technology and privacy laggards, and dishonor their proud history of protecting reader privacy.

The 3rd article of the American Library Association Code of Ethics sets a broad objective:

We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.

It's not always clear how to interpret this broad mandate, especially when everything is done on the internet. However, one principle of implementation should be clear and uncontroversial:

Library services and resources should be delivered, whenever practical, over channels that are immune to eavesdropping.

The current best practice dictated by this principle is as follows:
Libraries and vendors that serve libraries and library patrons, should require HTTPS for all services and resources delivered via the web.

The Pledge for Libraries:

1. We will make every effort to ensure that web services and information resources under direct control of our library will use HTTPS within six months. [ dated______ ]

2. Starting in 2016, our library will assure that any new or renewed contracts for web services or information resources will require support for HTTPS by the end of 2016.

The Pledge for Service Providers (Publishers and Vendors):

1. We will make every effort to ensure that all web services that we (the signatories) offer to libraries will enable HTTPS within six months. [ dated______ ]

2. All web services that we (the signatories) offer to libraries will default to HTTPS by the end of 2016.

The Pledge for Membership Organizations:

1. We will make every effort to ensure that all web services that our organization directly controls will use HTTPS within six months. [ dated______ ]

2. We encourage our members to support and sign the appropriate version of the pledge.

There's a FAQ available, too. All this will soon be posted on the Library Freedom Project website.

Sunday, July 26, 2015

Library Privacy and the Freedom Not To Read

One of the most difficult privacy conundrums facing libraries today is how to deal with the data that their patrons generate in the course of using digital services. Commercial information services typically track usage in detail, keep the data indefinitely, and regard the data as a valuable asset. Data is used to make many improvements, often to personalize the service to best meet the needs of the user. User data can also be monetized; as I've written here before, many companies make money by providing web services in exchange for the opportunity to track users and help advertisers target them.

A Maginot Line fortification. Photo from the US Army.
The downside to data collection is its impact on user privacy, something that libraries have a history of defending, even at the risk of imprisonment. Since the Patriot Act, many librarians have believed that the best way to defend user privacy against legally sanctioned intrusion is to avoid collecting any sensitive data. But as libraries move onto the web, that defense seems more and more like a Maginot Line, impregnable, but easy to get around. (I've written about an effort to shore up some weak points in library privacy defenses.)

At the same time, "big data" has clouded the picture of what constitutes sensitive data. The correlation of digital library use with web activity outside the library can impact privacy in ways that never would occur in a physical library. For example, I've found that many libraries unknowingly use Amazon cover images to enrich their online catalogs, so that even a user who is completely anonymous to the library ends up letting Amazon know what books they're searching for.

Recently, I've been serving on the Steering Committee of an initiative of NISO to try to establish a set of principles that libraries, providers of services to libraries, and publishers can use to support patron privacy. We held an in-person meeting in San Francisco at the end of July. There was solid support from libraries, publishers and service companies for improving reader privacy, but some issues were harder than others. The issues around data collection and use attracted the widest divergence in opinion.

One approach that was discussed centered on classifying different types of data depending on the extent to which they impact user privacy. This is also the approach taken by most laws governing privacy of library records. They mostly apply only to "Personally Identifiable Information" (PII), which usually would mean a person's name, address, phone number, etc., but sometimes is defined to include the user's IP address. While it's important to protect this type of information, in practice this usually means that less personal information lacks any protection at all.

I find that the data classification approach is another Maginot privacy line. It encourages the assumption that collection of demographics data – age, gender, race, religion, education, profession, even sexual orientation – is fair game for libraries and participants in the library ecosystem. I raised some eyebrows when I suggested that demographic groups might deserve a level of privacy protection in libraries, just as individuals do.

OCLC's Andrew Pace gave an example that brought this home for us all. When he worked as a librarian at NC State, he tracked usage of the books and other materials in the collection. Every library needs to do this for many purposes. He noticed that materials placed on reserve for certain classes received little or no usage, and he thought that faculty shouldn't be putting so many things on reserve, effectively preventing students not taking the class from using these materials. And so he started providing usage reports to the faculty.

In retrospect, Andrew pointed out that, without thinking much about it, he might have violated the privacy of students by informing their teachers that they weren't reading the assigned materials. After all, if a library wants to protect a user's right to read, they also have to protect the right not to read. Nobody's personally identifiable information had been exposed, but the combination of library data – a list of books that hadn't circulated – with some non-library data – the list of students enrolled in a class and the list of assigned reading – had intersected in a way that exposed individual reading behavior.
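The intersection described above can be made concrete with a small Python sketch. Everything in it is hypothetical: the function name `students_exposed`, the course, the titles and the student labels are invented for illustration and do not come from NC State or any real library. It only shows how three lists that each look harmless combine into a statement about individuals.

```python
def students_exposed(circulation_counts, assigned_reading, enrollment):
    """Return {student: [assigned titles they cannot have borrowed]}.

    A title with zero circulation was borrowed by nobody, so it was
    not borrowed by any student enrolled in a class that assigned it.
    """
    exposed = {}
    for course, titles in assigned_reading.items():
        # Library data: assigned titles that never circulated.
        unread = [t for t in titles if circulation_counts.get(t, 0) == 0]
        if not unread:
            continue
        # Non-library data: the class roster turns an aggregate into names.
        for student in enrollment.get(course, []):
            exposed.setdefault(student, []).extend(unread)
    return exposed


# Hypothetical sample data, invented for this sketch.
circulation_counts = {"Intro to Semiconductors": 0, "Crystal Growth": 14}
assigned_reading = {"PHYS 401": ["Intro to Semiconductors", "Crystal Growth"]}
enrollment = {"PHYS 401": ["student_a", "student_b"]}

report = students_exposed(circulation_counts, assigned_reading, enrollment)
```

No input here contains a borrowing record for any individual, yet `report` names each enrolled student alongside a title they did not borrow from the library.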

What this example illustrates is that libraries MUST collect at least SOME data that impinges on reader privacy. If reader privacy is to be protected, a "privacy impact assessment" must be made on almost all uses of that data.  In today's environment, users expect that their data signals will be listened to and their expressed needs will be accommodated. Given these expectations, building privacy in libraries is going to require a lot of work and a lot of thought.

Sunday, July 12, 2015

The Library Digital Privacy Pledge

I've been busy since my last post! We've created the Free Ebook Foundation, which will be the home for Unglue.it and GITenberg. I helped with the NISO "Consensus Framework to Support Patron Privacy in Digital Library and Information Systems", which I'll write more about soon. And some coding.


But I've also become a volunteer for the Library Freedom Project, run by radical librarian Alison Macrina. The project I'm working on is the "Library Digital Privacy Pledge."

The Library Digital Privacy Pledge is a result of discussions on several listservs about how libraries and the many organizations that serve libraries could work cooperatively to (putting it bluntly) start getting our shit together with regard to patron privacy.

I've talked to a lot of people about privacy in digital libraries, and there's remarkable unity about its importance. There's also a lot of confusion about some basic web privacy technology, like HTTPS. My view is that HTTPS sets a foundation for all the other privacy work that needs doing in libraries.

Someone asked me why I'm so passionate about working on this. After a bit of thought, I realized that the one thing that gives me the most satisfaction in my professional life is eliminating bugs. I hate bugs. Using HTTP for library services is a bug.

The draft of the Library Digital Privacy Pledge is open for comment and improvement for a few more weeks. We want all sorts of stakeholders to have a chance to improve it. The current text (July 12, 2015) is as follows:

The Library Digital Privacy Pledge of 2015

The Library Freedom Project is inviting the library community - libraries, vendors that serve libraries, and membership organizations - to sign the "Library Digital Privacy Pledge of 2015". For this first pledge, we're focusing on the use of HTTPS to deliver library services and the information resources offered by libraries. Building a culture of library digital privacy will not end with this 2015 pledge, but committing to this first modest step together will begin a process that won't turn back.  We aim to gather momentum and raise awareness with this pledge; and will develop similar pledges in the future as appropriate to advance digital privacy practices for library patrons.
We focus on HTTPS as a first step because of its timeliness. At the end of July the Let's Encrypt initiative of the Electronic Frontier Foundation will launch a new certificate infrastructure that will remove much of the cost and technical difficulty involved in the implementation of HTTPS, with general availability scheduled for September. Due to a heightened concern about digital surveillance, many prominent internet companies, such as Google, Twitter, and Facebook, have moved their services exclusively to HTTPS rather than relying on unencrypted HTTP connections. The White House has issued a directive that all government websites must move their services to HTTPS by the end of 2016. We believe that libraries must also make this change, lest they be viewed as technology and privacy laggards, and dishonor their proud history of protecting reader privacy.
The 3rd article of the American Library Association Code of Ethics sets a broad objective:
We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.


It's not always clear how to interpret this broad mandate, especially when everything is done on the internet. However, one principle of implementation should be clear and uncontroversial:
Library services and resources should be delivered, whenever practical, over channels that are immune to eavesdropping.
The current best practice dictated by this principle is as follows:
Libraries and vendors that serve libraries and library patrons, should require HTTPS for all services and resources delivered via the web.

The Pledge for Libraries:

1. All web services and resources that this library directly controls will use HTTPS by the end of 2015.
2. Starting in 2016, this library will not sign or renew any contracts for web services or information resources that do not commit to use HTTPS by the end of 2016.

The Pledge for Service Providers (Publishers and Vendors):

1. All web services that we (the signatories) control will enable HTTPS by the end of 2015.
2. All web services that we (the signatories) offer will require HTTPS by the end of 2016.

The Pledge for Membership Organizations:

1. All web services that this organization directly controls will use HTTPS by the end of 2015.
2. We encourage our members to support and sign the appropriate version of the pledge.

Schedule:

This document will be open for discussion and modification until finalized by July 27, 2015. The finalized pledge will be published on the website of the Library Freedom Project. We expect a number of discussions to take place at the Annual Conference of the American Library Association and associated meetings.
The Library Freedom Project will broadly solicit signatures from libraries, vendors and publishers.
In September, in coordination with the Let's Encrypt project, the list of charter signatories will be announced and broadly publicized to popular media.

FAQ

Q: What is HTTPS and what do we need to implement it?
A: When you use the web, your browser software communicates with a server computer through the internet. The messages back and forth pass through a series of computers (network nodes) that work together to pass messages. Depending on where you and the server are, there might be 5 computers in that chain, or there might be 50, each possibly owned by a different service provider. When a website uses HTTP, the content of these messages is open to inspection by each intermediate computer, like a postcard sent through the postal system, as well as by any other computer that shares a network with those computers. If you’re connecting to the internet over wifi in a coffee shop, everyone else in the coffee shop can see the messages, too.


When a website uses HTTPS, the messages between your browser software and the server are encrypted so that none of the intermediate network nodes can see the content of the messages. It’s like sending sealed envelopes through the postal system.


Your web site and other library services may be sending sensitive patron data across the internet: often bar codes and passwords, but sometimes also catalog searches, patron names, contact information, and reading records. This kind of data ought to be inside a sealed envelope, not exposed on a postcard.


Most web server software supports HTTPS, but to implement it, you’ll need to get a certificate signed by a recognized authority. The certificate is used to verify that you are who you say you are. Certificates have added cost to HTTPS, but the Electronic Frontier Foundation is implementing a certificate authority that will give out certificates at no charge. To find out more, go to Let’s Encrypt.


Q: Why the focus on HTTPS?
A: We think this issue should not be controversial and is relatively easy to explain. Libraries understand that circulation information can’t be sent to patrons on postcards. Publishers don’t want their content scooped up by unauthorized entities. Service providers don’t want to betray the trust of their customers.
Q. How can my library/organization/company add our names to the list of signatories?
A. Email us at pledge@libraryfreedomproject.org. Please give us contact info so we can verify your participation.
Q. Is this the same as HTTPS Everywhere?
A. No, that's a browser plug-in which enforces use of HTTPS.
Q. My Library won't be able to meet the implementation deadline. Can we add our name to the list once we've completed implementation?
A. Yes.
Q. A local school uses an internet filter that blocks https websites to meet legal requirements. Can we sign the pledge and continue to serve them?
A. Most of the filtering solutions include options that will whitelist important services. Work with the school in question to implement a work-around.


Q. What else can I read about libraries using HTTPS?
A. The Electronic Frontier Foundation has published "What Every Librarian Needs to Know About HTTPS".
Q. How do I know if I have implemented HTTPS correctly?
A. The developers behind the “Let’s Encrypt” initiative are ensuring that best practices are used in setting up the HTTPS configuration.  If you are deploying HTTPS on your own, we encourage you to use the Qualys SSL Labs SSL Server Test service to review the performance of your implementation.  You should strive for at least a “B” rating with no major security vulnerabilities identified in the scan.
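As a quick first check before running a full scan, a short Python script can confirm that a server presents a certificate that verifies and report how long it remains valid. This is a minimal sketch using only the standard library; the function names `check_https` and `days_until_expiry` are illustrative assumptions, and the check is far less thorough than the SSL Labs test, which it does not replace.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Convert a certificate 'notAfter' string into whole days remaining."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expires - now) // 86400)


def check_https(host, port=443, timeout=10):
    """Connect to host over TLS and report the protocol and certificate expiry.

    The default context verifies the certificate chain and the host name,
    so a failed verification raises an ssl.SSLError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
            return {
                "protocol": tls.version(),
                "days_left": days_until_expiry(cert["notAfter"]),
            }
```

Calling `check_https` with the host name of your catalog returns a small dictionary when the connection and verification succeed. It says nothing about cipher choices or known vulnerabilities, which is what the letter rating from the scan covers.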


Q. Our library subscribes to over 200 databases, only a fraction of which are currently delivered via HTTPS. We might be able to say we will not sign new contracts, but the renewal requirement could be difficult for an academic library like ours. Can we sign the pledge?
A. No one is going to penalize libraries that aren’t able to comply 100% with their pledge. One way to satisfy the ethical imperatives of the pledge would be to clearly label for users the small number of insecure library resources that remain after 2016 as being subject to surveillance.


Q. I/We can contribute to the effort in a way that isn’t covered well by the pledges. Can I add another pledge?

A. We want to keep this simple, but we welcome your support. Email us with your individualized statement, and we may include it on our website when signatories are announced.

Wednesday, June 10, 2015

Protect Reader Privacy with Referrer Meta Tags

Back when the web was new, it was fun to watch a website monitor and see the hits come in. The IP address told you the location of the user, and if you turned on the referer header display, you could see what the user had been reading just before. There was a group of scientists in Poland who'd be on my site regularly; I reported the latest news on nitride semiconductors, and my site was free. Every day around the same time, one of the Poles would check my site, and I could tell he had a bunch of sites he'd look at in order. My site came right after a Russian web site devoted to photographs of unclothed women.

The original idea behind the HTTP referer header (yes, that's how the header is spelled) was that webmasters like me needed it to help other webmasters fix hyperlinks. Or at least that was the rationalization. The real reason for sending the referer was to feed webmaster narcissism. We wanted to know who was linking to our site, because those links were our pats on the back. They told us about other sites that liked us. That was fun. (Still true today!)

The fact that my nitride semiconductor website ranked up there with naked Russian women amused me; reader privacy issues didn't bother me because the Polish scientist's habits were safe with me.


Twenty years later, the referer header seems like a complete privacy disaster. Modern web sites use resources from all over the web, and a referer header, including the complete URL of the referring web page, is sent with every request for those resources. The referer header can send your complete web browsing log to websites that you didn't know existed.

Privacy leakage via the referer header plagues even websites that ostensibly believe in protecting user privacy, such as those produced by or serving libraries. For example, a request to the WorldCat page for What you can expect when you're expecting results in the transmission of referer headers containing the user's request to the following hosts:
  • http://ajax.googleapis.com
  • http://www.google.com (with tracking cookies)
  • http://s7.addthis.com (with tracking cookies)
  • http://recommender.bibtip.de
None of the resources requested from these third parties actually need to know what page the user is viewing, but WorldCat causes that information to be sent anyway. In principle, this could allow advertising networks to begin marketing diapers to carefully targeted WorldCat users. (I've written about AddThis and how they sell data about you to advertising networks.)
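To see which third parties a page hands its URL to, one can list the hosts of the resources the page loads automatically. The following is a minimal Python sketch using only the standard library. The names `ResourceCollector` and `third_party_hosts` are illustrative assumptions, and the sketch only looks at scripts, images, iframes and stylesheets in the HTML itself, so it will miss anything loaded later by JavaScript or CSS.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ResourceCollector(HTMLParser):
    """Collect URLs that a browser fetches without the user clicking anything."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img", "iframe"):
            if attrs.get("src"):
                self.urls.append(attrs["src"])
        elif tag == "link":
            # Only stylesheet links are fetched; rel="canonical" and the like are not.
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel and attrs.get("href"):
                self.urls.append(attrs["href"])


def third_party_hosts(page_url, html):
    """Return the sorted hosts, other than the page's own, that receive requests."""
    collector = ResourceCollector()
    collector.feed(html)
    own_host = urlsplit(page_url).hostname
    hosts = set()
    for url in collector.urls:
        # Resolve relative and protocol-relative URLs against the page URL.
        host = urlsplit(urljoin(page_url, url)).hostname
        if host and host != own_host:
            hosts.add(host)
    return sorted(hosts)
```

Each host in the returned list receives a request from the user's browser, and under the default browser policy that request carries the address of the page being viewed.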

It turns out there's an easy way to plug this privacy leak in HTML5. It's called the referrer meta tag. (Yes, that's also spelled correctly.)

The referrer meta tag is put in the head section of an HTML5 web page. It allows the web page to control the referer headers sent by the user's browser. It looks like this:

<meta name="referrer" content="origin" />

If this one line were used on WorldCat, only the fact that the user is looking at a WorldCat page would be sent to Google, AddThis, and BibTip. This is reasonable: library patrons typically don't expect their visits to a library to be private; they do expect that what they read there should be private.

Because use of third party resources is often necessary, most library websites leak lots of privacy in referer headers. The meta referrer policy is a simple way to stop it. You may well ask why this isn't already standard practice. I think it's mostly lack of awareness. Until very recently, I had no idea that this worked so well. That's because it's taken a long time for browser vendors to add support. Although Chrome and Safari have been supporting the referrer meta tag for more than two years, Firefox only added it in January of 2015. Internet Explorer will support it with the Windows 10 release this summer. Privacy will still leak for users with older browser software, but this problem will gradually go away.

There are 4 options for the meta referrer tag, in addition to the "origin" policy. The origin policy sends only the origin of the originating page (its scheme and host name), not the full URL.

For the strictest privacy, use

<meta name="referrer" content="no-referrer" />

If you use this setting, other websites won't know you're linking to them, which can be a disadvantage in some situations. If the web page links to resources that still use the archaic "referer authentication", they'll break.

The prevailing default policy for most browsers is equivalent to

<meta name="referrer" content="no-referrer-when-downgrade" />

Here, "downgrade" refers to HTTP links in HTTPS pages.

If you need the referer for your own website but don't want other sites to see it, you can use

<meta name="referrer" content="origin-when-cross-origin" />

Finally, if you want the user's browser to send the full referrer, no matter what, and experience the thrills of privacy brinksmanship, you can set

<meta name="referrer" content="unsafe-url" />
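The five policies above can be compared side by side with a small model of what the browser sends. This is a simplified Python sketch, not a full implementation of the referrer policy rules. The function names `origin_of` and `referer_sent` and the example.org URLs are illustrative assumptions, and details such as stripping URL fragments are ignored.

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Return scheme://host[:port]/ for a URL, dropping the path and query."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def referer_sent(policy, page_url, target_url):
    """Return the Referer header value for a request, or None if none is sent.

    page_url is the page the user is viewing; target_url is the resource
    or link being requested from that page.
    """
    page, target = urlsplit(page_url), urlsplit(target_url)
    same_origin = (page.scheme, page.netloc) == (target.scheme, target.netloc)
    downgrade = page.scheme == "https" and target.scheme != "https"

    if policy == "no-referrer":
        return None
    if policy == "origin":
        return origin_of(page_url)
    if policy == "no-referrer-when-downgrade":
        return None if downgrade else page_url
    if policy == "origin-when-cross-origin":
        return page_url if same_origin else origin_of(page_url)
    if policy == "unsafe-url":
        return page_url
    raise ValueError(f"unknown referrer policy: {policy!r}")
```

For a catalog search page requesting a third-party widget, "origin" and "origin-when-cross-origin" both reduce the header to the catalog's address, while "no-referrer-when-downgrade" still sends the full search URL to any HTTPS third party.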

Widespread deployment of the referrer meta tag would be a big boost for reader privacy all over the web. It's easy to implement, has little downside, and is widely deployable. So let's get started!

Thursday, June 4, 2015

Towards the Post-Privacy Library?

I have an article in this month's Digital Futures, a supplement to American Libraries magazine. The full issue is an important one, so go take a look. In addition to my article, be sure to read the article starting on page 20 entitled "Empowering Libraries to Innovate" in which I am quoted. Here's the web version.

I'm reprinting the article here so as to have a good place for discussion.



Alice, a 17 year old high school student, goes to her local public library and reads everything she can find about pregnancy. Noticing this, a librarian calls up some local merchants and tells them that Alice might be pregnant. When Alice visits her local bookstore, the staff has some great suggestions about newborn care for her. The local drugstore sends her some coupons for scent-free skin lotion. She reads "what you can expect..." at the library and a few months later she starts getting mail about diaper services.

Unthinkable? In the physical library, I hope this never happens. It would be too creepy!

In the digital library, this future could be happening now. Libraries and their patrons are awash in data that really isn't sensitive until aggregated, and the data is getting digested by advertising networks and flowing into "big data" archives. The scenario in which advertisers exploit Alice's library usage is not only thinkable, it needs to be defended against. It's a "threat model" that's mostly unfamiliar to libraries.

Recently, I read a book called Half Life. Uranium theft, firearms technology and computer hacking are important plot elements, but I'm not worried about people knowing that I loved it. The National Security Agency (NSA) is not going to identify me as a potential terrorist because I'm reading Half Life. On the contrary, I'd love for my reading behavior to be broadcast to the entire world, because maybe more people would discover what a wonderful writer S.L. Huang is. A lot of a library user's digital usage data is like that. It's not particularly private, and most would gladly trade usage information for convenience or to help improve the services they rely on. It would be a waste of time and energy for a library to worry much about keeping that information secret. Quite the opposite, libraries are helping users share their behavior with things like Facebook Like buttons and social media widgets.

Which is why Alice should be very worried and why it's important for libraries to understand new threat models. What breaches of user privacy are most likely to occur and which are most likely to present harm?

A 2012 article in the New York Times Magazine described a real situation involving Target (the retailer).  Target's "big data" analytics team developed a customer model that identified pregnant women based on shopping behavior. Purchases of scent-free skin lotion, vitamin supplements, and cotton balls turned out to be highly predictive of subsequent purchases of baby diapers. Using the model, Target sent ads for baby-oriented products to the customers their algorithm had identified. In one case, an irate father whose daughter had received ads for baby clothes and cribs accused the store of encouraging his daughter to get pregnant. When a manager called to apologize, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

Among the companies collecting "big data" about users are the advertising networks, companies that sit in between advertisers and websites. They use their data to decide which ad from a huge inventory is most likely to result in a user response. If I were Alice, I don't think I would want my search for pregnancy books broadcast to advertising networks. Yet that's precisely what happens when I do a search on my local public library's online catalog. I very much doubt that many advertisements are being targeted based on that searching ... yet. But the digital advertising industry is extremely competitive, and unless libraries shift their practices, it's only a matter of time before library searches get factored into advanced customer models.

But it doesn't have to happen that way. Libraries have a strong tradition of protecting user privacy. Once all the "threat models" associated with the digital environment are considered, practices will certainly change.

So let's get started. In the rest of this article, I'll examine the process of borrowing and reading an ebook, and identify privacy weaknesses in the processes that advertisers and their predictive analytics modeling could exploit.
  1. Most library catalogs allow non-encrypted searches. This exposes Alice's ebook searches to internet providers between Alice and the library's server. The X-UIDH header has been used by providers such as Verizon and AT&T to help advertisers target mobile users. By using HTTPS for their catalogs, libraries can limit this intrusion. This is relatively easy and cheap, and there's no good excuse in 2015 for libraries not to make the switch.

  2. Some library catalogs use social widgets such as AddThis or ShareThis that broadcast a user's search activity to advertising networks. Similarly, Facebook "Like" buttons send a user's search activity to Facebook whether or not the user is on Facebook. Libraries need to carefully evaluate the benefits of these widgets against the possibility that advertising networks will use Alice's search history inappropriately.

  3. Statistics and optimization services like Google Analytics and NewRelic don't currently share Alice's search history with advertising networks, but libraries should evaluate the privacy assurances from these services to see if they are consistent with their own policies and local privacy laws.

  4. When Alice borrows a book from a vendor such as OverDrive or 3M, the vendor monitors Alice's reading behavior, albeit anonymously. At this date, it's very difficult for an advertiser to exploit Alice's use of reading apps from OverDrive or 3M. Although many have criticized the use of Adobe digital rights management (DRM) in these apps, both 3M and OverDrive use the "vendorID" method, which avoids the disclosure of user data to Adobe, and at this date, there is no practical way for an advertising network to exploit Alice's use of these services. Here again, libraries should review their vendor contracts to make sure that can't change.
  5. If Alice reads her ebook using a 3rd party application such as Adobe Digital Editions (ADE), the privacy behavior of the third party comes into play. Last year, ADE was found to be sending user reading data back to Adobe without encryption;  even today, it's known to phone home with encrypted reading data. Other applications, such as Bluefire Reader, have a better reputation for privacy, but as they say "past performance is no guarantee of future returns".

  6. If Alice wants to read her borrowed ebook on a Kindle (via OverDrive), it's very likely that Amazon will be able to exploit her reading behavior for marketing purposes. To avoid it, Alice would need to create an anonymous account on Amazon for reading her library books. Most people will just use their own (non-anonymous) accounts for convenience. If Alice shares her Amazon account with others, they'll know what she reads.

    This is a classic example of the privacy vs. convenience tradeoff that libraries need to consider. A Kindle user trusts that Amazon will not do anything too creepy, and Amazon has every incentive to make that user comfortable with their data use. Libraries need to let users make their own privacy decisions, but at the same time libraries need to make sure that users understand the privacy implications of what they do.

  7. The library's own records are also a potential source of a privacy breach. This "small-data" threat model is perhaps more familiar to librarians. Alice's parents could come in and demand to know what she's been reading. A schoolmate might hack into the library's lightly defended databases looking for ways to embarrass Alice. A staff member might be a friend of Alice's family. Libraries need clear policies and robust processes to be worthy of Alice's trust.

In the digital environment, it's easy for libraries to be unduly afraid of using the data from Alice's searches and reading to improve her experience and make the library a more powerful source of information. Social networks are changing the way we think about our privacy, and often the expectation is that services will make use of personal information that's been shared. Technologies exist to protect the user's control over that data but advertising networks have no incentive to employ them. I want my library to track me, not advertising networks! I want great books to read, and no, I'm not in the market for uranium-238!