Sunday, December 7, 2014

Stop Making Web Surveillance Bugs by Mistake!

Since I've been writing about library websites that leak privacy, I figured it would be a good idea to do an audit of to make sure it wasn't leaking privacy in ways I wasn't aware of. I knew that some pages leak some privacy via referer headers to Google, to Twitter, and to Facebook, but we force HTTPS and make sure that user accounts can be pseudonyms. We try not to use any services that push ids for advertising networks. (Facebook "Like" button, I'm looking at you!)

I've worried about using static assets loaded from third party sites. For example, we load jQuery from (it's likely to be cached, and should load faster) and Font Awesome from (ditto). I've verified that these services don't set any cookies and allow caching, which makes it unlikely that they could be used for surveillance of users.

It turned out that my worst privacy leakage was to Creative Commons! I'd been using the button images for the various licenses served from I was surprised to see that id cookies were being sent in the request for these images.
In theory, the folks at Creative Commons could track the usage for any CC-licensed resource that loaded button images from Creative Commons! And it could have been worse. If I had used the HTTP version of the images, anyone in the network between me and Creative Commons would be able to track what I was reading!

Now, to be clear, Creative Commons is NOT tracking anyone. The reason my browser is sending id cookies along with button image requests is that the Creative Commons website uses Google Analytics, and Google Analytics sets a domain-wide id cookie. Google Analytics doesn't see any of this traffic- it doesn't have access to server logs. But without anyone intending it, the combination of Creative Commons, Google Analytics, and websites like mine that want to promote use of Creative Commons have conspired to build a network of web surveillance bugs BY MISTAKE.

When I inquired about this to Creative Commons, I found out they were way ahead of the issue. They've put in redirects to the HTTPS versions of their button images. This doesn't plug any privacy leakage, but it discourages people from using the privacy-spewing HTTP versions. In addition, they'd already started the process of moving static assets like button images to a special-purpose domain. The use of this domain will ensure that id cookies aren't sent and nobody could use them for surveillance.
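
Why does a separate domain help? A browser attaches a cookie to any request whose host matches the cookie's domain, so a domain-wide analytics cookie rides along with image requests to that same domain. Here's a minimal sketch of that matching rule, roughly following RFC 6265. It ignores IP addresses and public-suffix checks, and the host names are placeholders I made up, not the real Creative Commons domains.

```python
def domain_matches(request_host: str, cookie_domain: str) -> bool:
    """Simplified cookie domain matching: exact host or any subdomain of it."""
    host = request_host.lower().rstrip(".")
    domain = cookie_domain.lower().strip(".")
    return host == domain or host.endswith("." + domain)


# Hypothetical hosts for illustration only.
print(domain_matches("images.example.org", ".example.org"))   # True: the id cookie is sent
print(domain_matches("buttons-example.net", ".example.org"))  # False: no cookie is sent
```

Serving the images from a domain that never sets cookies means there is nothing for the browser to send.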

If you care about user privacy and you have a website, here's what you should do:
  1. Avoid loading images and other assets from 3rd party sites. Consider self-hosting these.
  2. When you use 3rd party hosted assets, use HTTPS references only!
  3. Avoid loading static assets from domains that use Google Analytics and set id domain cookies.
For Creative Commons license buttons, use the buttons from If you use the Creative Commons license chooser, replace "" in the code it makes for you with "". This will help the web respect user privacy. The buttons will also load faster, because the "" requests will get redirected there anyway.
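
If you want to check your own pages against the list above, here's a rough sketch using only the Python standard library. It lists assets loaded from other hosts and flags the ones that aren't explicitly HTTPS. The class name, the set of tags it checks, and the example hosts are my own assumptions. It only sees static HTML, so assets added by scripts won't show up, and `link` tags can include things that aren't really asset loads.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AssetAuditor(HTMLParser):
    """Collects third-party asset references found in a page's HTML."""

    # Tag -> attribute holding the asset URL (an illustrative, incomplete set).
    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href", "iframe": "src"}

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.findings = []  # (host, url, not_explicitly_https)

    def handle_starttag(self, tag, attrs):
        attr_name = self.ASSET_ATTRS.get(tag)
        if attr_name is None:
            return
        url = dict(attrs).get(attr_name)
        if not url:
            return
        parsed = urlparse(url)
        if not parsed.netloc or parsed.hostname == self.own_host:
            return  # relative or self-hosted, nothing to report
        # Protocol-relative URLs ("//host/path") are flagged too, since they
        # fall back to HTTP whenever the page itself is served over HTTP.
        self.findings.append((parsed.hostname, url, parsed.scheme != "https"))


def audit(html, own_host):
    auditor = AssetAuditor(own_host)
    auditor.feed(html)
    return auditor.findings


page = (
    '<img src="http://buttons.example.net/by.png">'
    '<script src="https://cdn.example.com/lib.js"></script>'
    '<img src="/static/logo.png">'
)
for host, url, insecure in audit(page, "www.example.org"):
    print(host, url, "NOT HTTPS" if insecure else "https")
```

Whether a third-party host actually sets id cookies still has to be checked in the browser's network inspector, which is how I found the Creative Commons problem.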

Saturday, November 22, 2014

NJ Gov. Christie Vetoes Reader Privacy Act, Asks for Stronger, Narrower Law

According to New Jersey Governor Chris Christie's conditional veto statement, "Citizens of this State should be permitted to read what they choose without unnecessary government intrusion." It's hard to argue with that! Personally, I think we should also be permitted to read what we choose without corporate surveillance.

As previously reported in The Digital Reader, the bill passed in September by wide margins in both houses of the New Jersey State Legislature and would have codified the right to read ebooks without the government and everybody else knowing about it.

I wrote about some problems I saw with the bill. Based on a California law focused on law enforcement, the proposed NJ law added civil penalties on booksellers who disclosed the personal information of users without a court order. As I understood it, the bill could have prevented online booksellers from participating in ad networks (they all do!).

Governor Christie's veto statement pointed out more problems. The proposed law didn't explicitly prevent the government from asking for personal reading data, it just made it against the law for a bookseller to comply. So, for example, a local sheriff could still ask Amazon for a list of people in his town reading an incriminating book. If Amazon answered, somehow the reader would have to:
  1. find out that Amazon had provided the information
  2. sue Amazon for $500.
Another problem identified by Christie was that the proposed law imposed privacy burdens on booksellers stronger than those on libraries. Under another law, library records in New Jersey are subject to subpoena, but bookseller records wouldn't be. That's just bizarre.

In New Jersey, a governor can issue a "Conditional Veto". In doing so, the governor outlines changes in a bill that would allow it to become law. Christie's revisions to the Reader Privacy Act make the following changes:
  1. The civil penalties are stripped out of the bill. This allows Gov. Christie to position himself and NJ as "business-friendly".
  2. A requirement is added preventing the government from asking for reader information without a court order or subpoena. Christie gets to be on the side of liberty. Yay!
  3. It's made clear that the law applies only to government snooping, and not to promiscuous data sharing with ad networks. Christie avoids the ire of rich ad network moguls.
  4. Child porn is carved out of the definition of "books". Being tough on child pornography is one of those politically courageous positions that all politicians love.
The resulting bill, which was quickly reintroduced in the State Assembly, is stronger but narrower. It wouldn't apply in situations like the recent Adobe Digital Editions privacy breach, but it should be more effective at stopping "unnecessary government intrusion". I expect it will quickly pass the Legislature and be signed into law. A law that properly addresses the surveillance of ebook reading by private companies will be much more complicated and difficult to achieve.

I'm not a fan of his by any means, but Chris Christie's version of the Reader Privacy Act is a solid step in the right direction and would be an excellent model for other states. We could use a law like it on the national level as well.

(Guest posted at The Digital Reader)

Wednesday, November 5, 2014

If your website still uses HTTP, the X-UIDH header has turned you into a snitch

Does your website still use HTTP? If so, you're a snitch.

As I talk to people about privacy, I've found a lot of misunderstanding. HTTPS applies encryption to the communication channel between you and the website you're looking at. It's an absolute necessity when someone's making a password or sending a credit card number, but the modern web environment has also made it important for any communication that expects privacy.

HTTP is like sending messages on a postcard. Anyone handling the message can read the whole message. Even worse, they can change the message if they want. HTTPS is like sending the message in a sealed envelope. The messengers can read the address, but they can't read or change the contents.

It used to be that network providers didn't read your web browsing traffic or insert content into it, but now they do so routinely. This week we learned that Verizon and AT&T were inserting an "X-UIDH" header into your mobile phone web traffic. So for example, if a teen was browsing a library catalog for books on "pregnancy" using a mobile phone, Verizon's advertising partners could, in theory, deliver advertising for maternity products.

The only way to stop this header insertion is for websites to use HTTPS. So do it. Or you're a snitch.
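
If you run a website, you can see for yourself whether the header is arriving with your visitors' requests. Here's a small sketch of a helper that picks tracking headers out of a request's headers. The function name and the sample value are made up, and the only header name it knows is the one discussed in this post.

```python
# Header names to look for, lower-cased. Only X-UIDH comes from this post.
TRACKING_HEADERS = {"x-uidh"}


def injected_tracking_headers(request_headers):
    """Return any known tracking headers in a request, matched case-insensitively."""
    return {
        name: value
        for name, value in request_headers.items()
        if name.lower() in TRACKING_HEADERS
    }


# A made-up request as the server might see it over plain HTTP.
headers = {"Host": "catalog.example.org", "X-UIDH": "made-up-identifier"}
print(injected_tracking_headers(headers))
```

Over HTTPS the carrier can't see or modify the request, so a check like this should come back empty.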

Sorry, doesn't support HTTPS. So if you mysteriously get ads for snitch-related products, or if the phrase "Verizon and AT&T" is not equal to "V*erizo*n and A*T*&T" without the asterisks, blame me and blame Google.

Here's more on the X-UIDH header.

Tuesday, November 4, 2014

Reading Privacy Enables Reader Sharing

Digital privacy is a weird thing. People confuse it for digital security, but it's much more than that. Privacy isn't keeping secrets, it's controlling the information we share. What we think of as privacy depends on trusting that the people we share with won't do bad things. Privacy isn't digital at all. Maybe instead of "digital privacy" we should talk about "digital discretion".

The recent revelations of how Adobe Digital Editions was spewing the users' reading activity, unencrypted, to a logging server are an instructive example of poor digital discretion. I thought that Adobe was working on an ebook synchronization system, but it now looks like ADE was doing the logging "to support new business models" rather than for ebook sync.  It got me thinking about how ebook synchronization can and should be done.

Synchronization is a useful function. I'd like to be able to start reading a book on my iPhone while on the train in the morning, then pick up reading where I left off in the evening using my iPad. But to accomplish this function, I need to trust someone with information that discloses what I'm reading. It's easy to design a centralized sync system that requires a reader to register who they are,  what book they're reading and an activity stream of what pages are being read.

But a sync system designed for privacy doesn't need all that information. The central server doesn't even need to know the identity of the book! As Jason Griffey pointed out in his article on Adobe's spyware, the book's identifier could be hashed with a password, effectively hiding its identity from the central server.
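
Here's a minimal sketch of that idea. It isn't necessarily the scheme Griffey had in mind: I'm assuming HMAC-SHA256 keyed with the reader's password, and the function name and identifier format are hypothetical.

```python
import hashlib
import hmac


def sync_key(book_id: str, password: str) -> str:
    """Derive an opaque key the server can store reading positions under."""
    return hmac.new(
        password.encode("utf-8"),
        book_id.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


# The same reader gets the same key on every device, so positions line up.
phone = sync_key("urn:example:book:12345", "correct horse battery staple")
tablet = sync_key("urn:example:book:12345", "correct horse battery staple")
print(phone == tablet)  # True

# Another reader of the same book gets an unrelated key.
print(phone == sync_key("urn:example:book:12345", "someone else's password"))  # False
```

The server only ever sees the derived key. Because book identifiers are easy to enumerate, a weak password could still be guessed, so a real system would want a deliberately slow key derivation function rather than a single HMAC.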

I wrote about how Bluefire is doing sync for their apps while trying their best to respect user privacy. Rather than obscuring the identity of the book, they focus on making it hard to identify users in their system.

Adobe was justifiably criticized for sending lots of information back to its central server without encryption. Although their version 4.0.1 sends less information, mostly Adobe is just encrypting the stream and claiming the privacy problem is solved. The core privacy problem remains- when a DRM ebook is read, an encrypted activity stream is sent back to Adobe. If the information is sensitive or useful, why should Adobe get the benefit of this information at all? At the very least, providing your activity stream to Adobe should be opt-in.

There's a second privacy problem that hasn't been discussed anywhere. It may seem contradictory, but central-server synchronization systems impose TOO MUCH privacy. In many situations, a reader will want to share their reading stream. Look at GoodReads - you can share your opinions with friends. Look at Kobo Reading Life - you get awards and statistics in return for your stream. In classroom situations, students could sync their readers with the instructors'. These sorts of affordances can't be developed without access to the reading-activity stream, and won't work unless everyone participating in the stream is in the same reading ecosystem, using the same central server.

If instead of encrypting the reading-event stream, encryption were applied to the events themselves, the events could be shared over most any messaging system, and distributed according to the user's choices and desired application. In fact, you could use Twitter.

Every user of a Twitter-reading-sync system would create a Twitter feed to publish their reading activity. Other users could subscribe to the event stream. Direct messages could be used to send decryption data for private reading streams. The system could be engineered so that even Twitter would be unable to know what's being read privately. And the whole world would have access to reading that's being done publicly. In addition to page turning,  bookmarking and annotation activity could be of interest.

It's interesting to think about what might happen in a reading ecosystem where readers, not corporations, control the access to their reading activity streams. Publishers and authors might provide incentives to readers who share their reading-events with them. Social networks might match users reading the same page of the same book. Libraries could learn how to meet the needs of their communities. Teachers might be alerted to passages that students find to be difficult. Ironically, these public uses are enabled by a system design which puts a premium on privacy for the reader.

Dave Eggers' novel "The Circle" gave us the expression "Privacy is Theft". The novel imagines social norms that consider privacy to be a reflection of selfishness. But in the real world, it's the lack of discretion by companies building up vast private collections of personal information that's the true threat to social sharing. Too bad that theft is not a crime.

Wednesday, October 29, 2014

GITenberg: Modern Maintenance Infrastructure for Our Literary Heritage

One day back in March, the Project Gutenberg website thought I was a robot and stopped letting me download ebooks. Frustrated, I resolved to put some Project Gutenberg ebooks into GitHub, where I could let other people fix problems in the files. I decided to call this effort "Project Gitenhub". On my second or third book, I found that Seth Woodworth had had the same idea a year earlier, and had already moved about a thousand ebooks into GitHub. That project was named "GITenberg". So I joined his email list and started submitting pull requests for PG ebooks that I was improving.

Recently, we've joined forces to submit a proposal to the Knight Foundation's News Challenge, whose theme is "How might we leverage libraries as a platform to build more knowledgeable communities?". Here are some excerpts:
Project Gutenberg (PG) offers 45,000 public domain ebooks, yet few libraries use this collection to serve their communities. Text quality varies greatly, metadata is all over the map, and it's difficult for users to contribute improvements. 
We propose to use workflow and software tools developed and proven for open source software development- GitHub- to open up the PG corpus to maintenance and use by libraries and librarians. 
The result- GITenberg- will include MARC records, covers, OPDS feeds and ebook files to facilitate library use. Version-controlled fork and merge workflow, combined with a change-triggered back-end build environment, will allow scalable, distributed maintenance of the greatest works of our literary heritage.
Libraries need metadata records in MARC format, but in addition they need to be able to select from the corpus those works which are most relevant to their communities. They need covers to integrate the records with their catalogs, and they need a level of quality assurance so as not to disappoint patrons. Because this sort of metadata is not readily available, most libraries do not include PG records in their catalogs, resulting in unnecessary disappointment when, for example, a patron wants to read Moby Dick from the library on their Kindle.
43,000 books and their metadata have been moved to the git version control software; this will enable librarians to collaboratively edit and control the metadata. The GITenberg website, mailing list and software repository have been launched at . Software for generating MARC records and OPDS feeds has already been written.
Modern software development teams use version control, continuous integration, and workflow management systems to coordinate their work. When applied to open-source software, these tools allow diverse teams from around the world to collaboratively maintain even the most sprawling projects. Anyone wanting to fix a bug or make a change first forks the software repository, makes the change, and then makes a "pull request". A best practice is to submit the pull request with a test case verifying the bug fix. A developer charged with maintaining the repository can then review the pull request and accept or reject the change. Often, there is discussion asking for clarification. Occasionally versions remain forked and diverge from each other. GitHub has become the most popular site for this type of software repository because of its well-developed workflow tools and integration hooks.
The leaders of this team recognized the possibility to use GitHub for the maintenance of ebooks, and we began the process of migrating the most important corpus of public domain ebooks, Project Gutenberg, onto GitHub, thus the name GITenberg. Project Gutenberg has grown over the years to 50,000 ebooks, audiobooks, and related media, including all the most important public domain works of English language literature. Despite the great value of this collection, few libraries have made good use of this resource to serve their communities. There are a number of reasons why. The quality of the ebooks and the metadata around the ebooks is quite varied. MARC records, which libraries use to feed their catalog systems, are available for only a subset of the PG collection. Cover images and other catalog enrichment assets are not part of PG. 
To make the entire PG corpus available via local libraries, massive collaboration among librarians and ebook developers is essential. We propose to build integration tools around GitHub that will enable this sort of collaboration to occur.
  1. Although the PG corpus has been loaded into GITenberg, we need to build a backend that automatically converts the version-controlled source text into well-structured ebooks. We expect to define a flavor of MarkDown or Asciidoc which will enable this automatic, change-triggered building of ebook files (EPUB, MOBI, PDF). (MarkDown is a human-readable plain text format used on GitHub for documentation; MarkDown for ebooks is being developed independently by several teams of developers. Asciidoc is a similar format that works nicely for ebooks.)
  2. Similarly, we will need to build a parallel backend server that will produce MARC and XML formatted records from version-controlled plain-text metadata files.
  3. We will generate covers for the ebooks using a tool recently developed by NYPL and include them in the repository.
  4. We will build a selection tool to help libraries select the records best suited to their libraries.
  5. Using a set of "cleaned up" MARC records from NYPL, and adding custom cataloguing, we will seed the metadata collection with ~1000 high quality metadata records.
  6. We will provide a browsable OPDS feed for use in tablet and smartphone ebook readers.
  7. We expect that the toolchain we develop will be reusable for creation and maintenance of a new generation of freely licensed ebooks.
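
To give a flavor of the metadata-to-feed step (this is my illustration, not part of the proposal and not GITenberg's actual code), here's a sketch that turns a small metadata record into an OPDS catalog entry, which is an Atom entry with an acquisition link. The field names in the record and the URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace


def opds_entry(meta: dict) -> str:
    """Build one OPDS (Atom) entry from a plain metadata dict."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = meta["title"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = meta["id"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = meta["updated"]
    author = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author, f"{{{ATOM_NS}}}name").text = meta["author"]
    # The acquisition link tells a reading app where to download the ebook file.
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", {
        "rel": "http://opds-spec.org/acquisition",
        "type": "application/epub+zip",
        "href": meta["epub_url"],
    })
    return ET.tostring(entry, encoding="unicode")


# A made-up record, standing in for a version-controlled metadata file.
record = {
    "title": "Moby Dick",
    "author": "Herman Melville",
    "id": "urn:example:gitenberg:moby-dick",
    "updated": "2014-10-29T00:00:00Z",
    "epub_url": "https://example.org/books/moby-dick.epub",
}
print(opds_entry(record))
```

A change-triggered build would rerun something like this whenever a metadata file changes in the repository.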

The rest of the proposal is on the Knight News Challenge website. If you like the idea of GITenberg, you can "applaud" it there. The "applause" is not used in the judging of the proposals, but it makes us feel good. There are lots of other interesting and inspiring proposals to check out and applaud, so go take a look!