Sunday, April 2, 2017

Copyrighted Clickstream Poetry to Stop ISP Click-Selling

Congress won't let the Federal Communications Commission (FCC) protect users from Internet Service Provider (ISP) snooping-for-cash. My ISP could decide to sell a list of all the websites I visit to advertisers, and the FCC can't stop them. I wondered if there was some way I could use copyright law to prevent my ISP from selling copies of my clickstream.

So I invented "clickstream poetry". Here is my first clickstream poem, entitled My clicks are mine:
{
    "content":       
        [
        "https://roses.com",
        "http://are.com",
        "https://reddit.com",
        "http://theultraviolets.net",
        "http://are.com",
        "https://moo.com",
        "http://this.is",
        "http://work.org",
        "http://is.com",
        "https://copyright.com",
        "https://ted.com",
        "https://www.so.ch",
        "http://verizon.com",
        "http://www.faa.gov",
        "https://kyu.com",
        "https://copyright.com",
        "http://2o17.com",
        "http://eric.org",
        "http://hellman.net",
        "https://creativecommons.org/licenses/by-nc/4.0/legalcode"
        ],
    "copyright": "2017 Eric Hellman",
    "license": "https://creativecommons.org/licenses/by-nc/4.0/legalcode",
    "title": "My clicks are mine"
}

I wrote a python script that "performs" the poem for the benefit of anyone listening to my clickstream. The script requests the websites in the poem in a random order; the listener will see the website names requested, and this dataset comprises the "poem". I used a Creative Commons license that doesn't let anyone distribute copies of my poem for commercial purposes. If my ISP tries to sell a copy of my clickstream, they would be violating the license, and thus infringing my copyright to the poem. If you run the script to perform the poem (for non-commercial purposes, of course), your ISP would similarly be infringing my copyright if they try to sell your clickstream.
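Here is a minimal sketch of what such a performance script might look like, assuming the poem above is saved in a file named poem.json (the file name, the timing, and the details are illustrative choices, not a definitive implementation):

# perform_poem.py -- a sketch of a clickstream poem performance
import json
import random
import time

import requests  # third-party library: pip install requests

with open("poem.json") as f:
    poem = json.load(f)

urls = list(poem["content"])
random.shuffle(urls)  # the poem is performed in a random order

for url in urls:
    try:
        requests.get(url, timeout=10)  # the request itself is the "performance"
    except requests.RequestException:
        pass  # an unreachable site doesn't spoil the poem
    time.sleep(random.uniform(1, 5))  # pause so the clicks look like browsing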

If I tried to sue an ISP for copyright infringement, they would likely argue that though my creation is original and used in its entirety, selling my clickstream is a "fair use". They would assert that advertising optimization (or whatever) is a "transformative use" and that it didn't affect the market for my poem. Who would pay anything for a stupid clickstream poem? How would a non-existent, hypothetical market for clickstream poetry be harmed by use in their big data algorithms?

That's why I'm offering commercial licenses to the clickstream poem My clicks are mine. This will demonstrate that a commercial market for clickstream poetry licenses exists. For only $10, you can use a copy of my poem for any purpose whatsoever, for a period of 24 hours. If an ad network wants to use my clickstream to optimize the ads they show me, more power to them, as long as they pay for a license. I imagine that, over the lifetime of my poem's copyright protection (into the 22nd century), clickstream poetry will become increasingly valuable because of uses that haven't been invented yet.

To acquire a commercial license to my poem, support my work at the Free Ebook Foundation, a 501(c)(3) not-for-profit corporation, by making a donation. Or don't. I have no idea if a court would take my side against a big company (and against Congress). I'm told that judges are generally skeptical of clever "legal hacks" unless they are crafted by lawyers instead of engineers.

ISPs would probably figure out a legal or technical subterfuge around the copyright of my clickstream poem; but if they have to worry even a little, this effort will have been worth my time.

Update: I have now paid $35 to register my copyright to My clicks are mine.

Wednesday, March 22, 2017

Reader Privacy for Research Journals is Getting Worse

Ever hear of Grapeshot, Eloqua, Moat, Hubspot, Krux, or Sizmek? Probably not. Maybe you've heard of Doubleclick, AppNexus, Adsense or Addthis? Certainly you've heard of Google, which owns Doubleclick and Adsense. If you read scientific journal articles on publisher websites, these companies that you've never heard of will track and log your reading habits and try to figure out how to get you to click on ads, not just at the publisher websites but also at websites like Breitbart.com and the Huffington Post.

Two years ago I surveyed the websites of 20 of the top research journals and found that 16 of the top 20 journals placed trackers from ad networks on their web sites. Only the journals from the American Physical Society (2 of the 20) supported secure (HTTPS) connections, and even now APS does not default to being secure.

I'm working on an article about advertising in online library content, so I decided to revisit the 20 journals to see if there had been any improvement. Over half the traffic on the internet now uses secure connections, so I expected to see some movement. One of the 20 journals, Quarterly Journal of Economics, now defaults to a secure connection, significantly improving privacy for its readers. Let's have a big round of applause for Oxford University Press! Yay.
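Checking whether a journal defaults to a secure connection is easy to automate. Here is a rough sketch of the kind of check involved: request the plain http:// homepage and see whether it redirects to https:// (the journal URLs below are placeholders, not the exact list I surveyed):

# https_default_check.py -- does an http:// request end up at an https:// page?
import requests

JOURNALS = [
    "http://academic.oup.com/qje",   # placeholder example
    "http://www.cell.com",           # placeholder example
]

for url in JOURNALS:
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
        scheme = "HTTPS" if resp.url.startswith("https://") else "HTTP"
        print("{} -> {} ({})".format(url, resp.url, scheme))
    except requests.RequestException as err:
        print("{} -> error: {}".format(url, err))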

So that's the good news. The bad news is that reader privacy at most of the journals I looked at got worse. Science, which could be loaded securely 2 years ago, has reverted to insecure connections. The two Annual Reviews journals I looked at, which were among the few that did not expose users to advertising network tracking, now have trackers for AddThis and Doubleclick. The New England Journal of Medicine, which deployed the most intense reader tracking of the 20, is now even more intense, with 19 trackers on a web page that had "only" 14 trackers two years ago. A page from Elsevier's Cell went from 9 to 16 trackers.

Despite the backwardness of most journal websites, there are a few signs of hope. Some of the big journal platforms have begun to implement HTTPS. Springer Link defaults to HTTPS, and Elsevier's Science Direct is delivering some of its content with secure connections. Both of them place trackers for advertising networks, so if you want to read a journal article securely and privately, your best bet is still to use Tor.

Thursday, February 2, 2017

How to enable/disable privacy protection in Google Analytics (it's easy to get wrong!)

In my survey last year of ARL library web services, I found that 72% of them used Google Analytics. So it's not surprising that a common response to my article about leaking catalog searches to Amazon was to wonder whether the same thing is happening with Google Analytics.

The short answer is "It Depends". It might be OK to use Google Analytics on a library search facility, if the following things are true:
  1. The library trusts Google on user privacy. (Many do.)
  2. Google is acting in good faith to protect user privacy and is not acting under legal compulsion to act otherwise. (We don't really know.)
  3. Google Analytics is correctly doing what its documentation says it does and is not being circumvented by the rest of Google. (It doesn't always.)
  4. The library has implemented Google Analytics correctly to enable user privacy.
There's an entire blog post to write about each of the first three conditions, but I have only so many hours in a day. Given that many libraries have decided that the benefits of using Google Analytics outweigh the privacy risks, the rest of this post concerns only this last condition. Of the 72% of ARL libraries that use Google Analytics, I find that only 19% have implemented Google Analytics with privacy-protection features enabled.

So, if you care about library privacy but can't do without Google Analytics, read on!

Google Analytics has a lot of configuration options, which is why webmasters love it. For the purposes of user privacy, however, there are just two configuration options to pay attention to, the "IP Anonymization" option and the "Display Features" option.

IP Anonymization says to Google Analytics "please don't remember the exact IP address of my users". According to Google, enabling this mode masks the least significant bits of the user's IP address before the IP address is used or saved. Since many users can be identified by their IP address, this prevents anyone from discovering the search history for a given IP address. But remember, Google is still sent the IP address, and we have to trust that Google will obscure the IP address as advertised, and not save it in some log somewhere. Even with the masked IP address, it may still be possible to identify a user, particularly if a library serves a small number of geographically dispersed users.
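To make the masking concrete, here is an illustrative sketch of the anonymization Google describes (its documentation says the last octet of an IPv4 address, or the last 80 bits of an IPv6 address, is set to zero). This is not Google's code, just the same arithmetic:

# anonymize_ip.py -- illustrative sketch of anonymizeIp-style masking (IPv4 only)
import ipaddress

def anonymize_ipv4(address):
    """Zero the least significant octet of an IPv4 address."""
    ip = ipaddress.IPv4Address(address)
    return str(ipaddress.IPv4Address(int(ip) & 0xFFFFFF00))

print(anonymize_ipv4("192.0.2.123"))  # prints 192.0.2.0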

"Display Features" says to Google to that you don't care about user privacy, and it's OK to track your users all to hell so that you can get access to "demographic" information. To understand what's happening, it's important to understand the difference between "first-party" and "third-party" cookies, and how they implicate privacy differently.

Out of the box, Google Analytics uses "first party" cookies to track users. So if you deploy Google Analytics on your "library.example.edu" server, the tracking cookie will be attached to the library.example.edu hostname. Google Analytics will have considerable difficulty connecting user number 1234 on the library.example.edu domain with user number 5678 on the "sci-hub.info" domain, because the user ids are chosen randomly for each hostname. But if you turn on Display Features, Google will connect the two user ids via a third party tracking cookie from its Doubleclick advertising service. This enables both you and Google to know more about your users. Anyone with access to Google's data will be able to connect the catalog searches saved for user number 1234 to that user's searches on any website that uses Google advertising or any site that has Display Features turned on.

IP Anonymization and Display Features can be set in three ways, depending on how Google Analytics is deployed. The instructions here apply to the "Universal Analytics" script. You can tell a site uses Universal Analytics because the pages execute a javascript named "analytics.js". An older "classic" version of Google Analytics uses a script named "ga.js"; its configuration is similar to that of Universal Analytics. More complex websites may use Google Tag Manager to deploy and configure Google Analytics.

Google Analytics is usually deployed on a web page by inserting a script element that looks like this:
<script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
    ga('create', 'UA-XXXXX-Y', 'auto');
    ga('send', 'pageview');
</script>
IP Anonymization and Display Features are turned on with extra lines in the script:
    ga('create', 'UA-XXXXX-Y', 'auto');
    ga('require', 'displayfeatures');  // starts tracking users across sites
    ga('set', 'anonymizeIp', true); // makes it harder to identify the user from logs
    ga('send', 'pageview');
The Google Analytics Admin also allows you to turn on cross-site user tracking, though the privacy impact of doing so is not made clear. In the "Data Collection" item of the Tracking Info pane, look at the toggle switches for "Remarketing" and "Advertising Reporting Features": if these are switched to "ON", then you've enabled cross-site tracking and your users can expect no privacy.

Turning on IP anonymization is not quite as easy as turning on cross-site tracking. You have to add it explicitly in your script or turn it on in Google Tag Manager (where you won't find it unless you know what to look for!).

To check whether cross-site tracking has been turned on in your institution's Google Analytics, use the procedures I described in my article on How to check if your library is leaking catalog searches to Amazon. First, clear the cookies for your website, then load your site and look at the "Sources" tab in Chrome developer tools. If there's a resource from "stats.g.doubleclick.net", then your website is asking Google to track your users across sites. If your institution is a library, you should not be telling Google to track your users across sites.
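If you'd rather script a first pass before digging through developer tools, here is a rough sketch that just fetches a page and looks for the telltale strings. It only scans the delivered HTML, so the developer-tools check above remains the authoritative one; the URL below is a placeholder:

# tracking_signs.py -- crude scan of a page's HTML for signs of cross-site tracking
import requests

PAGE = "https://library.example.edu/search"  # placeholder: your own search page
SIGNS = ["displayfeatures", "stats.g.doubleclick.net"]

html = requests.get(PAGE, timeout=10).text
for sign in SIGNS:
    if sign in html:
        print("possible cross-site tracking: '{}' found in page source".format(sign))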

Bottom line: if you use Google Analytics, always remember that Google is fundamentally an advertising company and it will seldom guide you towards protecting your users' privacy.

Thursday, January 26, 2017

Policy-based Privacy is Over


Yesterday, President Donald Trump issued an executive order to enhance "Public Safety in the Interior of the United States".

Of interest here is section 14:
Sec. 14.  Privacy Act.  Agencies shall, to the extent consistent with applicable law, ensure that their privacy policies exclude persons who are not United States citizens or lawful permanent residents from the protections of the Privacy Act regarding personally identifiable information.  
What this means is that the executive branch, including its websites, libraries, and information systems, may not use privacy policies to protect users other than US citizens and green card holders. Since websites, libraries, and information systems typically don't keep track of user citizenship status, this makes it very difficult to have any privacy policy at all.

Note that this executive order does not apply to the Library of Congress, an organ of the legislative branch of the US government. Nevertheless, it demonstrates the vulnerability of policy-based privacy. Who's to say that Congress won't enact the same restrictions for the legislative branch? Who's to say that Congress won't enact the same restrictions on any website, library, or information system that operates in multiple states?

Lawyering privacy won't work any more. Librarianing privacy won't work any more. We need to rely on engineers to build privacy into our websites, libraries and information systems. This is possible. Engineers have tools such as strong cryptography that allow privacy to be built into systems without compromising functionality. It's not that engineers are immune from privacy-breaking mandates, but it's orders of magnitude more difficult to outlaw privacy engineering than it is to invalidate privacy policies. A system that doesn't record what a user does can't produce user activity records. Some facts are not alternativable. Math trumps Trump.

Friday, January 13, 2017

Google's "Crypto-Cookies" are tracking Chrome users

Ordinary HTTP cookies are used in many ways to make the internet work. Cookies help websites remember their users. A common use of cookies is for authentication: when you log into a website, you stay logged in because of a cookie that contains your authentication info. Every request you make to the website includes this cookie; the website then knows to grant you access.

But there's a problem: someone might steal your cookies and hijack your login. This is particularly easy for thieves if your communication with the website isn't encrypted with HTTPS. To address the risk of cookie theft, the security engineers of the internet have been working on ways to protect these cookies with strong encryption. In this article, I'll call these "crypto-cookies", a term not used by the folks developing them. The Chrome user interface calls them Channel IDs.


Development of secure "crypto-cookies" has not been a straight path. A first approach, called "Origin Bound Certificates", has been abandoned. A second approach, "TLS Channel IDs", has been implemented, then superseded by a third approach, "TLS Token Binding" (nicknamed "TokBind"). If you use the Chrome web browser, your connections to Google web services take advantage of TokBind for most, if not all, Google services.

This is excellent for security, but might not be so good for privacy; 3rd-party content is the culprit. It turns out that Google has not limited crypto-cookie deployment to services like Gmail and YouTube that have log-ins. Google hosts many popular utilities that don't get tracked by conventional cookies. Font libraries such as Google Fonts, javascript libraries such as jQuery, and app frameworks such as Angular are all hosted on Google servers. Many websites load these resources from Google for convenience and fast load times. In addition, Google utility scripts such as Analytics and Tag Manager are delivered from separate domains so that users are only tracked across websites if so configured. But with Google Chrome (and Microsoft's Edge browser), every user who visits any website using Google Analytics, Google Tag Manager, Google Fonts, jQuery, Angular, etc. is subject to tracking across websites by Google. According to Princeton's OpenWPM project, more than half of all websites embed content hosted on Google servers.
[Figure: Top 3rd-party content hosts, from Princeton's OpenWPM. Note that most of the hosts labeled "Non-Tracking Content" are at this time subject to "crypto-cookie" tracking.]


While using 3rd-party content hosted by Google was always problematic for privacy-sensitive sites, the impact on privacy was blunted by two factors: caching and statelessness. If a website loads fonts from fonts.gstatic.com, or style files from fonts.googleapis.com, the files are cached by the browser and only loaded once per day. Before the rollout of crypto-cookies, Google had no way to connect one request for a font file with the next: the requests were stateless, and the domains never set cookies. In fact, Google says:
Use of Google Fonts is unauthenticated. No cookies are sent by website visitors to the Google Fonts API. Requests to the Google Fonts API are made to resource-specific domains, such as fonts.googleapis.com or fonts.gstatic.com, so that your requests for fonts are separate from and do not contain any credentials you send to google.com while using other Google services that are authenticated, such as Gmail. 
But if you use Chrome, your requests for these font files are no longer stateless. Google can follow you from one website to the next, without using conventional tracking cookies.

It gets worse. Crypto-cookies aren't yet recognized by privacy plugins like Privacy Badger, so you can be tracked even when you're trying not to be. The TokBind RFC also includes a feature called "Referred Token Binding", which is meant to allow federated authentication (so you can sign into one site and be recognized by another). In the hands of the advertising industry, this will get used to share crypto-cookies across domains.

To be fair, there's nothing in the crypto-cookie technology itself that makes the privacy situation any different from the status quo. But as the tracking mechanism moves into the web security layer, control of tracking is moved away from application layers. It's entirely possible that the parts of Google running services like gstatic.com and googleapis.com have not realized that their infrastructure has started tracking users. If so, we'll eventually see the tracking turned off.  It's also possible that this is all part of Google's evil master plan for better advertising, but I'm guessing it's just a deployment mistake.

So far, not many companies have deployed crypto-cookie technology on the server-side. In addition to Google and Microsoft, I find a few advertising companies that are using it.  Chrome and Edge are the only client side implementations I know of.

For now, web developers who are concerned about user privacy can no longer ignore the risks of embedding third party content. Web users concerned about being tracked might want to use Firefox for a while.

Notes:

  1. This blog is hosted on a Google service, so assume you're being watched. Hi Google!
  2. OS X Chrome saves the crypto-cookies in an SQLite file at "~/Library/Application Support/Google/Chrome/Default/Origin Bound Certs" (see the sketch after these notes).
  3. I've filed bug reports/issues for Google Fonts, Google Chrome, and Privacy Badger. 
  4. Dirk Balfanz, one of the engineers behind TokBind, has a really good website that explains the ins and outs of what I call crypto-cookies.
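As promised in note 2, here is a rough sketch for peeking at the file where OS X Chrome keeps its Channel IDs. The table layout isn't documented here, so the script just lists the tables and row counts instead of assuming a schema; quit Chrome (or work on a copy of the file) first, since Chrome may hold a lock on it.

# channel_id_peek.py -- list the tables in Chrome's "Origin Bound Certs" database
import sqlite3
from pathlib import Path

db_path = Path.home() / "Library/Application Support/Google/Chrome/Default/Origin Bound Certs"
conn = sqlite3.connect(str(db_path))
for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    count = conn.execute('SELECT COUNT(*) FROM "{}"'.format(table)).fetchone()[0]
    print("{}: {} rows".format(table, count))
conn.close()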