Showing posts with label privacy. Show all posts

Wednesday, June 12, 2024

The PII Figleaf

The Internet's big lie is "we respect your privacy". Thanks to cookie banners and such things, the Internet tells us this so many times a day that we ignore all the evidence to the contrary. Sure, there are a lot of people who care about our privacy, but they're often letting others violate our privacy without even knowing it. Sometimes this just means that they are trying to be careful with our "PII". And guess what? You know those cookies you're constantly blocking or accepting? Advertisers like Google have mostly stopped using cookies!!!

fig leaf covering id cards

"PII" is "Personally Identifiable Information" and privacy lawyers seem to be obsessed with it. Lawyers, and the laws they care about, generally equate good PII hygiene with privacy. Good PII hygiene is not at all a bad thing, but it protects privacy the same way that washing your hands protects you from influenza. Websites that claim to protect your privacy are often washing the PII off their hands while at the same time coughing data all over you. They can and do violate your privacy while at the same time meticulously protecting your PII.

Examples of PII include your name, address, social security number, telephone number and email address. The IP address that you use can often be traced to you, so it's sometimes treated as PII, but often isn't. The fact that you love paranormal cozy romance novels is not PII, nor is the fact that you voted for Mitt Romney. That you have an 18-year-old son and an infant daughter is also not PII. But if you've checked out a paranormal cozy romance from your local library, and then start getting ads all over the internet for paranormal cozy romances set in an alternate reality where Mitt is President and the heroine has an infant and a teenager, you might easily conclude that your public library has sold your checkout list and your identity to an evil advertising company.

That's a good description of a recent situation involving the San Francisco Public Library (SFPL). As reported by The Register:

In April, attorney Christine Dudley was listening to a book on her iPhone while playing a game on her Android tablet when she started to see in-game ads that reflected the audiobooks she recently checked out of the San Francisco Public Library.

Let me be clear. There's no chance that SFPL has sold the check-out list to anybody, much less evil advertisers. However, it DOES appear to be the case that SFPL and their online ebook vendors, Overdrive and Baker and Taylor, could have allowed Google to track Ms. Dudley, perhaps because they didn't fully understand the configuration options in Google Analytics. SFPL offers ebooks and audiobooks from Overdrive, "Kindle Books from Libby by Overdrive", and ebooks and audiobooks from Baker and Taylor's "Boundless" platform. There's no leakage of PII or check-out lists, but Google is able to collect demographics and interests from the browsing patterns of users with Google accounts.

A few years ago, I wrote an explainer about how to configure Google Analytics to protect user privacy. That explainer is obsolete, as Google is scrapping the system I explained in favor of a new system, "Google Analytics 4" (GA-4), that works better in the modern, more privacy-conscious browser environment. To their credit, Google has made some of the privacy-preserving settings the default - for example, they will no longer store IP addresses. But reading the documentation, you can tell that they're not much interested in Privacy with a capital P, because they want to be able to serve relevant (and thus lucrative) ads, even if they're for paranormal cozy romances. And Google REALLY doesn't want any "PII"! PII doesn't much help ad targeting, and there are places that regulate what they can do with PII.

We can start connecting the dots from the audiobook to the ads in the Register's reporting by understanding a bit about Google Analytics. Google Analytics helps websites measure their usage. When you visit a webpage with Google Analytics, a JavaScript tracking script sends information back to one or more Google trackers: the address of the webpage, your browser environment, and maybe more data that the webpage publisher is interested in. Just about the only cookie being set these days is one that tells the website not to show the cookie banner!
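To make that concrete, here's a sketch of the kind of "hit" a GA-4 tracker phones home with. The endpoint and parameter names (v, tid, cid, dl, dt) are what gtag.js is commonly observed sending in network logs; treat the details as illustrative rather than a specification.

```javascript
// Rough sketch of a GA4 "collect" hit. Parameter names match what
// gtag.js is commonly observed sending; illustrative, not a spec.
function buildCollectHit(measurementId, clientId, page) {
  const params = new URLSearchParams({
    v: "2",               // protocol version
    tid: measurementId,   // the site's GA4 measurement ID
    cid: clientId,        // random per-browser client identifier
    dl: page.url,         // document location: the page you visited
    dt: page.title        // document title
  });
  return "https://www.google-analytics.com/g/collect?" + params.toString();
}

const hit = buildCollectHit("G-XXXXXXX", "555.123", {
  url: "https://sfpl.bibliocommons.com/item/show/1234",
  title: "Paranormal Cozy Romance | SFPL"
});
console.log(hit);
```

Notice that nothing in the hit is PII in the lawyer's sense; it's the page address (which says what book you were looking at) plus a stable client identifier that lets hits be joined into a session.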

From the Register:

The subdomain SFPL uses for library member login and ebook checkout, sfpl.bibliocommons.com, has only a single tracker, from Alphabet, that communicates with the domains google-analytics.com and googletagmanager.com.

The page is operated by BiblioCommons, which was acquired in 2020 by Canada-based Constellation Software. BiblioCommons has its own privacy policy that exists in conjunction with the SFPL privacy policy.

In response to questions about ad trackers on its main website, Wong acknowledged that SFPL does use third-party cookies and provides a popup that allows visitors to opt-out if they prefer.

With regard to Google Analytics, she said that it only helps the library understand broad demographic data, such as the gender and age range of visitors.

"We are also able to understand broad interests of our users, such as movie, travel, sports and fitness based on webpage clicks, but this information is not at all tied to individual users, only as aggregated information," said Wong.

The statement from Jaime Wong, deputy director of communications for the SFPL, is revealing. The Google Analytics tracker only works within a website, and neither SFPL nor its vendors are collecting demographic information to share with Google. But Google Analytics has options to turn on the demographic information that libraries think they really need. (It helps to get funding, for example.) It used to be called "Advertising Reporting Features" and "Remarketing" (I called these the "turn off privacy" switches) but now it's called "Google Signals". It works by adding the Google advertising tracker, DoubleClick, alongside the regular Analytics tracker. This allows Google to connect the usage data from a website to its advertising database, the one that stores demographic and interest information. This gives the website owners access to their user demographics, and it gives the Google advertising machine access to the users' web browsing behavior.
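The "Google Signals" switch is, at bottom, a configuration flag in the site's tag code. The gtag field names below (allow_google_signals, allow_ad_personalization_signals) are from Google's gtag.js documentation; the stub gtag function mirrors Google's standard snippet so the example runs outside a browser, and the measurement ID is a placeholder.

```javascript
// Minimal stand-in for the browser environment and Google's gtag snippet.
const window = globalThis;
window.dataLayer = window.dataLayer || [];
function gtag() { dataLayer.push(arguments); }

// The "turn off privacy" switch: allow_google_signals controls whether
// the DoubleClick/ads tracker rides along with the Analytics tracker.
gtag("config", "G-XXXXXXX", {
  allow_google_signals: false,            // no Google Signals / demographics
  allow_ad_personalization_signals: false // no ad-personalization data
});
```

A library that wants usage statistics but not the advertising tie-in would want both flags false, as shown; the demographics reports Wong describes only work when Google Signals is on.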

I have examined the relevant webpages from SFPL, as well as the customized pages that BiblioCommons, Overdrive, and Baker and Taylor provide for SFPL, looking for trackers. Here's what I found:

  • The SFPL website, SFPL.org, has Analytics and DoubleClick ad trackers enabled.
  • The BiblioCommons website, sfpl.bibliocommons.com, has two analytics trackers enabled, but no advertising tracker. Probably one tracker "belongs" to SFPL while the other "belongs" to BiblioCommons.
  • The Overdrive website, sfpl.overdrive.com, has Analytics and DoubleClick ad trackers enabled.
  • The Baker and Taylor website, sfpl.boundless.baker-taylor.com, has Analytics and DoubleClick ad trackers enabled.
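A check like this can be approximated in a few lines: collect the script URLs a page loads and match their hostnames against known tracker domains. The domain list here is illustrative, not exhaustive, and real tools like Privacy Badger are far more thorough.

```javascript
// Toy tracker check: classify a page's script URLs against the tracker
// domains discussed above. The list is illustrative, not exhaustive.
const TRACKER_DOMAINS = {
  "www.google-analytics.com": "analytics",
  "www.googletagmanager.com": "analytics",
  "stats.g.doubleclick.net": "advertising"
};

function classifyTrackers(scriptUrls) {
  return scriptUrls
    .map(u => new URL(u))
    .filter(u => u.hostname in TRACKER_DOMAINS)
    .map(u => ({ host: u.hostname, kind: TRACKER_DOMAINS[u.hostname] }));
}

// The sort of mix a page with both trackers enabled might load:
const found = classifyTrackers([
  "https://www.googletagmanager.com/gtag/js?id=G-XXXXXXX",
  "https://stats.g.doubleclick.net/dc.js",
  "https://example.org/site.js"
]);
console.log(found);
```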

So it shouldn't be surprising that Ms. Dudley experienced targeted ads based on the books she was looking at on the San Francisco Public Library website. Libraries and librarians everywhere need to understand that reader privacy is not just about PII, and that the sort of privacy that libraries have a tradition of protecting is very different from the privacy that Google talks about when it says "Google Analytics 4 was designed to be able to evolve for the future and built with privacy at its core." At the end of this month, earlier versions of Google Analytics will stop "processing" data. (I'm betting the trackers will still fire!)

What Google means by that is that in GA-4, trackers continue to work despite browser restrictions on 3rd party cookies, and the tracking process is no longer reliant on data like IP addresses that could be considered PII. To address those troublesome regulators in Europe, they only distribute demographic data and interest profiles for people who've given Google permission to do so. Do you really think you haven't somewhere given Google permission to collect your demographic data and interest profiles? You can check here.

Here's what Google tells Analytics users about the ad trackers:

When you turn on Google signals, Google Analytics will associate the session data it collects from your site and apps with Google's information from accounts of signed-in, consented users. By turning on Google signals, you acknowledge you adhere to the Google Advertising Features Policy, including rules around sensitive categories, have the necessary privacy disclosures and rights from your end users for such association, and that such data may be accessed and deleted by end users via My Activity.

In plain English, that means that if a website owner flips the switch, it's the website's problem if the trackers accidentally capture PII or otherwise violate privacy, because the website is responsible for asking for permission.

Yep. GA-4 is engineered with what I would call "figleaf privacy" at its core. Google doesn't have fig leaves for paranormal cozy romance novels!


Wednesday, May 8, 2019

RA21's recommended technical approach is broken by emerging browser privacy features


This is my third comment about the recently published NISO draft "Recommended Practice" (RP) on "Improved Access to Institutionally-Provided Information Resources", a.k.a. "Resource Access in the 21st Century" (RA21). Official comments can be submitted until May 17th. My first comment concerned the use of secure communication channels. The second looked at potential phishing attacks on the proposed system. I'm posting the comments here so you can easily comment.

RA21's recommended technical approach is broken by emerging browser privacy features

Third-party cookies are widely used on the web as trackers, or "web bugs", by advertising networks wishing to target users with advertising. The impact of these trackers on privacy has been widely reported and decried. Browser local storage deployed using 3rd-party iframes is similarly employed for user tracking by ad networks. Browser vendors, led by Apple, have fought back against user tracking by providing user options to limit third-party information sharing. Apple's "Intelligent Tracking Protection" has progressively increased the barriers to cross-site information storage, for example, by partitioning local storage according to third-party context.

Unfortunately for RA21, the draft recommended practice (RP) has endorsed a technical approach which mirrors the tactics used for user tracking by the advertising industry. For this reason, users of Safari who choose to enable the "prevent cross-site tracking" option may not benefit from the "seamless" access promised by RA21 if implemented with the endorsed technical approach.

Wikimedia commons
The optimistically acronymed "P3W" pilot used a javascript library called "Krakenjs/zoid" (According to the Norse sagas, the kraken is a squidlike monster that terrorizes voyagers) to exchange data between cross-domain contexts. The limitations on krakenjs in Safari are acknowledged by the library's developer.  It works by having the host webpage create an iframe loaded from a P3W website. With privacy controls off, the web page posts to the iframe, which answers with a reference to the user's identity provider. The service provider website uses that information to help the user authenticate without having to search through a huge list of identity providers. With Safari privacy features turned on, the search process must be repeated for each and every service provider domain.
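Here's a toy model of that exchange, with the iframe side reduced to a plain function so the effect of storage partitioning is easy to see. All names are invented for illustration; the real pilot used krakenjs/zoid and window.postMessage between actual browser windows.

```javascript
// Toy model of the P3W iframe exchange. The "iframe" side remembers
// which identity provider this browser used, keyed by storage partition.
// With tracking protection OFF there is one shared partition; with it
// ON (Safari ITP-style), each embedding site gets its own.
function makeDiscoveryFrame({ partitioned }) {
  const storage = {}; // stands in for browser local storage
  return {
    handleMessage(embeddingSite, msg) {
      const key = partitioned ? embeddingSite : "shared";
      if (msg.type === "remember-idp") storage[key] = msg.idp;
      if (msg.type === "which-idp") return storage[key] || null;
    }
  };
}

// Shared partition: the IdP chosen on one site is visible on the next.
const open = makeDiscoveryFrame({ partitioned: false });
open.handleMessage("publisherA.example", { type: "remember-idp", idp: "uni.edu" });
console.log(open.handleMessage("publisherB.example", { type: "which-idp" })); // "uni.edu"

// Partitioned: each publisher's site starts from scratch.
const part = makeDiscoveryFrame({ partitioned: true });
part.handleMessage("publisherA.example", { type: "remember-idp", idp: "uni.edu" });
console.log(part.handleMessage("publisherB.example", { type: "which-idp" })); // null
```

That final null is the "seamless" access failing: the user has to repeat identity-provider discovery on every service provider domain.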

Other browser vendors have moved towards restricting tracking behaviour. Firefox has announced that it will phase in "enhanced tracking protection", and even Google's Chrome browser is moving towards restrictions on tracking technologies.

The bottom line is that if RA21 is implemented with the recommended technical approach, library users will probably be required to turn off privacy enhancing features of their browser software to use resources in their library. As a result, RA21 will have difficulty moving forward with community consensus on this technical approach.

Browser software is much more tolerant of cross-domain communication when the information "hub" is a first-party context (i.e. a window of its own, not an embedded iframe), as is done in more established authentication schemes such as OpenID Connect and SAML flow. RA21 should refocus its development effort on these technical approaches.
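For comparison, the first-party approach amounts to a redirect: the service provider sends the user's whole window to the identity provider with a standard OpenID Connect authorization request. A minimal sketch, with placeholder endpoint and client values:

```javascript
// Sketch of an OpenID Connect authorization request URL. The user's
// window navigates here (a first-party context), authenticates, and is
// redirected back - no embedded iframe needed. All values are placeholders.
function oidcAuthorizationUrl({ issuer, clientId, redirectUri, state }) {
  const params = new URLSearchParams({
    response_type: "code",     // authorization-code flow
    scope: "openid",
    client_id: clientId,
    redirect_uri: redirectUri, // must be pre-registered with the IdP
    state                      // anti-CSRF token, checked on return
  });
  return issuer + "/authorize?" + params.toString();
}

const url = oidcAuthorizationUrl({
  issuer: "https://idp.uni.edu",
  clientId: "library-portal",
  redirectUri: "https://sp.example.org/callback",
  state: "af0ifjsldkj"
});
console.log(url);
```

Because the identity provider's page is loaded in its own window, the user can also inspect the URL before typing a password, which addresses the iframe-spoofing concern raised in the comments below.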

Update July 5, 2019:

RA21's official response to this comment is:
Future work includes storage policy notification. Also, we are not actually using third party cookies even though this term is often used to describe several cross-domain access patterns; instead, RA21 recommends using web storage (aka, browser local storage) together with HTML5 post-message for cross-domain access. This is the same mechanism (and indeed the same implementation) that PayPal uses, thus demonstrating broad browser support. A description of web storage has been added to the Terminology section. We are aware that by turning off "third party cookies" it is possible for the user to partly or completely disable the call to action button but in those cases the user experience degrades gracefully to a classical SAML/OpenIDC discovery flow.
Essentially the same response was made to three other submitted comments. Two of them, from Duke's Tim McGeary, called out two sections of the recommended practice and noted:
Word of caution: this login specifically cannot happen in an iFrame to meet SSO security protocol
The third, from Cornell University Library, submitted by Adam Chandler, amplified on McGeary:
Comment from Cornell University Library Privacy as a Service Working Group. Our group includes membership drawn from Library IT, Library Licensing, Library Public Services, Cornell IT Security, and Cornell Privacy Office.

Under 2.4.: We agree with Tim McGreary's comment (#862 or #863 - seems that he double-posted it) that the SSO login shouldn't be inside a frame on another page. There are security issues with that kind of approach. The users can't see the login page URL to verify that the page is a x.uni.edu page before entering their passwords, so it makes it easier to spoof the login page. Generally, login pages use "framebusting" to prevent this kind of possibility.
RA21's response on this issue is alarming, and suggests that the whole project is in danger of failure. RA21 seems to be unaware that using HTML5 web storage is worse than 3rd party cookies in many respects - particularly privacy and security. Currently, only Safari defaults to "a classical SAML/OpenIDC discovery flow", but that still means that if they want to be accurate, they'll have to rename the implementing organization "The Coalition for Seamless Access but Not on iOS" or "The Coalition for Problematic Access".

I hope that the beta implementation will be executed by a team with the experience and competence to override or at least effectively mitigate RA21's technical blunder.

Monday, December 31, 2018

On the Surveillance Techno-state

I used to run my own mail server. But then came the spammers. And  dictionary attacks. All sorts of other nasty things. I finally gave up and turned to Gmail to maintain my online identities. Recently, one of my web servers has been attacked by a bot from a Russian IP address which will eventually force me to deploy sophisticated bot-detection. I'll probably have to turn to Google's recaptcha service, which watches users to check that they're not robots.

Isn't this how governments and nations formed? You don't need a police force if there aren't any criminals. You don't need an army until there's a threat from somewhere else. But because of threats near and far, we turn to civil governments for protection. The same happens on the web. Web services may thrive and grow because of economies of scale, but just as often it's because only the powerful can stand up to storms.  Facebook and Google become more powerful, even as civil government power seems to wane.

When a company or institution is successful by virtue of its power, it needs governance, lest that power go astray. History is filled with examples of power gone sour, so it's fun to draw parallels. Wikipedia, for example, seems to be governed like the Roman Catholic Church, with a hierarchical priesthood, canon law, and sacred texts. Twitter seems to be a failed state with a weak government populated by rival factions demonstrating against the other factions. Apple is some sort of Buddhist monastery.

This year it became apparent to me that Facebook is becoming the internet version of a totalitarian state. It's become so ... needy. Especially the app. It's constantly inventing new ways to hoard my attention. It won't let me follow links to the internet. It wants to track me at all times. It asks me to send messages to my friends. It wants to remind me what I did 5 years ago and to celebrate how long I've been "friends" with friends. My social life is dominated by Facebook to the extent that I can't delete my account.

That's no different from the years before, I suppose, but what we saw this year is that Facebook's governance is unthinking. They've built a machine that optimizes everything for engagement, and it's been so successful that they don't know how to re-optimize it for humanity. They can't figure out how to avoid being a tool of oppression and propaganda. Their response to criticism is to fill everyone's feed with messages about how they're making things better. It's terrifying, but it could be so much worse.

I get the impression that Amazon is governed by an optimization for efficiency.

How is Google governed? There has never existed a more totalitarian entity, in terms of how much it knows about every aspect of our lives. Does it have a governing philosophy? What does it optimize for?

In a lot of countries, it seems that the civil governments are becoming a threat to our online lives. Will we turn to Wikipedia, Apple, or Google for protection? Or will we turn to civil governments to protect us from Twitter, Amazon and Facebook? Will democracy ever govern the Internet?

Happy 2019!

Thursday, August 2, 2018

My Face is Personally Identifiable Information


Facial recognition technology used to be so adorable. When I wrote about it 7 years ago, the facial recognition technology in iPhoto was finding faces in shrubbery, but was also good enough to accurately see family resemblances in faces carved into a wall. Now, Apple thinks it's good enough to use for biometric logins, bragging that "your face is your password".

I think this will be my new password:


The ACLU is worried about the civil liberty implications of facial recognition and the machine learning technology that underlies it. I'm worried too, but for completely different reasons. The ACLU has been generating a lot of press as they articulate their worries - that facial recognition is unreliable, that it's tainted by the bias inherent in its training data, and that it will be used by governments as a tool of oppression. But I think those worries are short-sighted. I'm worried that facial recognition will be extremely accurate, that its training data will be complete and thus unbiased, and that everyone will be using it everywhere on everyone else and even an oppressive government will be powerless to preserve our meager shreds of privacy.

We certainly need to be aware of the ways in which our biases can infect the tools we build, but the ACLU's argument against facial recognition invites the conclusion that things will be just peachy if only facial recognition were accurate and unbiased. Unfortunately, it will be. You don't have to read Cory Doctorow's novels to imagine a dystopia built on facial recognition. The progression of technology is such that multiple face-recognizer networks could soon be observing us wherever we go in the physical world - the same way that we're recognized at every site on the internet via web beacons, web profilers and other spyware.

The problem with having your face as your password is that you can't keep your face secret. Faces aren't meant to be secret. Our faces co-evolved with our brains to be individually recognizable; evidently, having an identity confers a survival advantage. Our societies are deeply structured around our ability to recognize other people by their faces. We even put faces on our money!

Facial recognition is not new at all, but we need to understand the ways in which machines doing the recognizing will change the fabric of our societies. Let's assume that the machines will be really good at it. What's different?

For many applications, the machine will be doing things that people already do. Putting a face-recognizing camera on your front door is just doing what you'd do yourself in deciding whether to open it. Maybe using facial recognition in place of a paper driver's license or passport would improve upon the performance of a TSA agent squinting at that awful 5-year-old photo of you. What's really transformative is the connectivity. That front-door camera will talk to Fedex's registry of delivery people. When you use your face at your polling place, the bureau of elections will make sure you don't vote anywhere else that day. And the ID-check that proves you're old enough to buy cigarettes will update your medical records. What used to identify you locally can now identify you globally.

The reason that face-identity is so scary is that it's a type of identifier that has never existed before. It's globally unique, but it doesn't require a central registry to be used. It's public, easily collected and you can't remove it. It's as if we all had to tattoo our prisoner social security numbers on our foreheads! Facial profiles can be transmitted around the world, and used to index ALL THE DATABASEZ!

We can't stop facial recognition technology any more than we can reverse global warming, but we can start preparing today. We need to start by treating facial profiles and photographs as personally identifiable information. We have some privacy laws that cover so-called "PII", and we need to start applying them to photographs and facial recognition profiles. We can also impose strict liability for the misuse of biased, inaccurate facial recognition; slowing down the adoption of facial recognition technology will give our society a chance to adjust to its consequences.

Oh, and maybe Denmark's new law against niqabs violates GDPR?

Friday, May 4, 2018

Choose Privacy Week: Your Library Organization Is Watching You

Choose Privacy Week
T. J. Lamana and I have written a post for Choose Privacy Week. It's mirrored here, but be sure to check out all the great posts there.

Your Library Organization Is Watching You

We commonly hear that 'Big Brother' is watching you, in the context of digital and analog surveillance such as Facebook advertising, street cameras, E-ZPass highway tracking or content sniffing by internet service providers. But it's not only Big Brother: there are a lot of "Little Brothers" as well, smaller and less obvious watchers that wittingly or unwittingly funnel data, including personally identifiable information (PII), to massive databases. Unfortunately, libraries (and related organizations) are a part of this surveillance environment. In the following we'll break down two example library organization websites, both from the American Library Association (ALA): the ALA Office for Intellectual Freedom's Choose Privacy Week website (ChoosePrivacyWeek.org) and ALA's umbrella site (ala.org).

Before we dive too deeply, let's review some basics about the data streams generated by a visit to a website. When you visit a website, your browser software - Chrome, Firefox, Safari, etc. - sends a request containing your IP address, the address of the webpage you want, and a whole bunch of other information. If the website supports "SSL", most of that information is encrypted. If not, network providers are free to see everything sent or received. Without SSL, bad actors who share the networks can insert code or other content into the webpage you receive. The easiest way to see if a site has a valid SSL certificate is to look at the protocol identifier of a URL. If it's 'HTTPS', that traffic is encrypted; if it's 'HTTP', DO NOT SEND any personally identifiable information (PII), as there is no guarantee that traffic is being protected. If you're curious about the quality of a site's encryption, you can check its "Qualys report", offered by SSL Labs, which checks the website's configuration and assigns a letter grade. ALA.org gets a B; ChoosePrivacyWeek gets an A. The good news is that even ALA.org's B is an acceptable grade. The bad news is that the B grade is for "https://www.ala.org/", whose response is reproduced here in its entirety:

Unfortunately the ALA website is mostly available only without SSL encryption.

You don’t have to check the SSL Labs to see the difference. You can recognize ChoosePrivacyWeek.org as a “secure” connection by looking for the lock badge in your browser; click on that badge for more info. Here’s what the sites look like in Chrome:

Don’t assume that your privacy is protected just because a site has a lock badge, because the web is designed to spew data about you in many ways. Remember that “whole bunch of other information” we glossed over above? Included in that “other information” are “cookies” which allow web servers to keep track of your browsing session. It’s almost impossible to use the web these days without sending these cookies. But many websites include third-party services that track your session as well. These are more insidious, because they give you an identifier that joins your activity across multiple websites. The combination of data from thousands of websites often gives away your identity, which then can be used in ways you have no control over.

Privacy Badger is a browser extension created by the Electronic Frontier Foundation (EFF) which monitors the embedded code in websites that may be tracking your web traffic. You can see a side-by-side comparison of ALA.org on the left and ChoosePrivacyWeek on the right:
ALA.org
ChoosePrivacyWeek.org

The 2 potential trackers identified by Privacy Badger on ChoosePrivacyWeek are third-party services: fonts from Google and an embedded video player from Vimeo. These are possibly tracking users, but are not optimized to do so. The 4 trackers on ALA.org merit a closer look. They’re all from Google; the ones of concern are placed by Google Analytics. One of us has written about how Google Analytics can be configured to respect user privacy, if you trust Google’s assurances. To its credit, ALA.org has turned on the “anonymizeIP” setting, which in theory obscures a user’s identity. But it also has “demographics” turned on, which causes an advertising (cross-domain) cookie to be set for users of ALA.org, and Google’s advertising arm is free to use ALA.org user data to target advertising (which is how Google makes money). Privacy Badger allows you to disable any or all of these trackers and potential trackers (though doing so can break some websites).
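For reference, here's roughly what those two switches look like in analytics.js-era tag code of the kind ALA.org would have been using. The command names ("anonymizeIp", "displayfeatures") are from Google's analytics.js documentation; the stub ga queue stands in for Google's loader so the snippet runs anywhere, and the property ID is a placeholder.

```javascript
// Stub for Google's analytics.js command queue.
const commands = [];
function ga(...args) { commands.push(args); }

ga("create", "UA-XXXXXX-1", "auto"); // placeholder property ID
ga("set", "anonymizeIp", true);      // good: truncate the visitor's IP
ga("require", "displayfeatures");    // the "demographics" switch: loads
                                     // the DoubleClick advertising tracker
ga("send", "pageview");
```

A privacy-minded configuration would keep the anonymizeIp line and drop the displayfeatures line.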

Apart from giving data to third parties, any organization has to have internal policies and protocols for handling the reams of data generated by website users. It’s easy to forget that server logs may grow to contain hundreds of gigabytes or more of data that can be traced back to individual users. We asked ALA about their log retention policies with privacy in mind. ALA was kind enough to respond:
“We always support privacy, so internal meetings are occurring to determine how to make sure that we comply with all applicable laws while always protecting member/customer data from exposure. Currently, ALA is taking a serious look at collection and retention in light of the General Data Protection Regulation (GDPR) EU 2016/679, a European Union law on data protection and privacy for all individuals within the EU. It applies to all sites/businesses that collect personal data regardless of location.”
Reading between the lines, it sounds like ALA does not yet have log retention policies or protocols. It’s encouraging that these items are on the agenda, but disappointing that it’s 2018 and these items are only now on the agenda. ALA.org has a 4-year-old privacy policy on its website that talks about the data it collects, but has no mention of a retention policy, or of third-party service use.

The ChoosePrivacyWeek website has a privacy statement that’s more emphatic:
We will collect no personal information about you when you visit our website unless you choose to provide that information to us.
The lack of tracking on the site is aligned with this statement, but we’d still like to see a statement about log retention. ChoosePrivacyWeek is hosted on a DreamHost WordPress server, and usage log files at Dreamhost were recently sought by the Department of Justice in the Disruptj20.org case.

Organizations express their priorities and values in their actions. ALA’s stance toward implementing HTTPS will be familiar to many librarians; limited IT resources get deployed according to competing priorities. In the case of ALA, a sorely needed website redesign was deemed more important to the organization than providing incremental security and privacy to website users by implementing HTTPS. Similarly, the demographic information provided by Google’s advertising tracker was valued more than member privacy (assuming ALA is aware of the trade-off). The ChoosePrivacyWeek.org website has a different set of values and objectives, and thus has made some different choices.

In implementing their websites and services, libraries make many choices that impact user privacy. We want librarians, library administrators, library technology staff and library vendors to be aware of the choices they are making, and aware of the values they are expressing on behalf of an organization or of a library. We hope that they will CHOOSE PRIVACY.

Monday, September 11, 2017

Prepare Now for Topical Storm Chrome 62

Sometime in October, probably the week of October 17th, version 62 of Google's Chrome web browser will be declared "stable". When that happens, users of Chrome will get their software updated to version 62 when they restart.

One of the small but important changes that will occur is that many websites that have not implemented HTTPS to secure their communications will be marked in a subtle way as "Not Secure". When such a website presents a web form, typing into the form will change the appearance of the website URL. Here's what it will look like:

Unfortunately, many libraries, and the vendors and publishers that serve them, have not yet implemented HTTPS, so many library users that type into search boxes will start seeing the words "Not Secure" and may be alarmed.

What's going to happen? Here's what I HOPE happens:
  • Libraries, Vendors, and Publishers that have been working on switching their websites for the past two years (because usually it's a lot more work than just pushing a button) are motivated to fix the last few problems, turn on their secure connections, and redirect all their web traffic through their secure servers before October 17.
          So instead of this:

           ... users will see this:

  • Library management and staff will be prepared to answer questions about the few remaining problems that occur. The internet is not a secure place, and Chrome's subtle indicator is just a reminder not to type sensitive information, like passwords, personal names and identifiers, into "not secure" websites.
  • The "Not Secure" animation will be noticed by many users of libraries, vendors, and publishers that haven't devoted resources to securing their websites. The users will file helpful bug reports and the website providers will acknowledge their prior misjudgments and start to work carefully to do what needs to be done to protect their users.
  • Libraries, vendors, and publishers will work together to address many interactions and dependencies in their internet systems.


Here's what I FEAR might happen:
  • The words "Not Secure" will cause people in charge to think their organizations' websites "have been hacked". 
  • Publishing executives seeing the "Not Secure" label will order their IT staff to "DO SOMETHING" without the time or resources to do a proper job.
  • Library directors will demand that Chrome be replaced by Firefox on all library computers because of a "BUG in CHROME". (creating an even worse problem when Firefox follows suit in a few months!) 
  • Library staff will put up signs instructing patrons to "ignore security warnings" on the internet. Patrons will believe them.
Back here in the real world, libraries are under-resourced and struggling to keep things working. The industry in general has been well behind the curve of HTTPS adoption, needlessly putting many library users at risk. The complicated technical environment, including proxy servers, authentication systems, federated search, and link servers has made the job of switching to secure connections more difficult.

So here's my forecast of what WILL happen:
  • Many libraries, publishers and vendors, motivated by Chrome 62, will finish their switch-over projects before October 17. Users of library web services will have better security and privacy. (For example, I expect OCLC's WorldCat, shown above in secure and not secure versions, will be in this category.)
  • Many switch-over projects will be rushed, and staff throughout the industry, both technical and user-facing, will need to scramble and cooperate to report and fix many minor issues.
  • A few not-so-thoughtful voices will complain that this whole security and privacy fuss is overblown, and blame it on an evil Google conspiracy.

Here are some notes to help you prepare:
  1. I've been asked whether libraries need to update links in their catalog to use the secure version of resource links. Yes, but there's no need to rush. Website providers should be using HTTP redirects to force users into the secure connections, and should use HSTS headers to make sure that their future connections are secure from the start.
  2. Libraries using proxy servers MUST update their software to reasonably current versions, and update proxy settings to account for secure versions of provider services. In many cases this will require acquisition of a wildcard certificate for the proxy server.
  3.  I've had publishers and vendors complain to me that library customers have asked them to retain the option of insecure connections ... because reasons. Recently, I've seen reports on listservs that vendors are being asked to retain insecure server settings because the library "can't" update their obsolete and insecure proxy software. These libraries should be ashamed of themselves - their negligence is holding back progress for everyone and endangering library users. 
  4. Chrome 62 is expected to reach beta next week. You'll then be able to install it from the beta channel. (Currently, it's in the dev channel.) Even then, you may need to set the #mark-non-secure-as flag to see the new behavior. Once Chrome 62 is stable, you may still be able to disable the feature using this flag.
  5. A screen capture using Chrome 62 now might help convince your manager, your IT department, or a vendor that a website really needs to be switched to HTTPS.
  6. Mixed content warnings are the result of embedding not-secure images, fonts, or scripts in a secure web page. A malicious actor can insert content or code in these elements, endangering the user. Much of the work in switching a large site from HTTP to HTTPS consists of finding and addressing mixed content issues.
  7. Google's Emily Schechter gives an excellent presentation on the transition to HTTPS, and how the Chrome UI is gradually changing to more accurately communicate to users that non-HTTPS sites may present risks: https://www.youtube.com/watch?v=GoXgl9r0Kjk&feature=youtu.be (discussion of Chrome 62 changes starts around 32:00)
  8. (added 9/15/2017) As an example of a company that's been working for a while on switching, Elsevier has informed its ScienceDirect customers that ScienceDirect will be switching to HTTPS in October. They have posted instructions for testing proxy configurations.
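To make note 1 concrete, here's a minimal sketch of what "redirect plus HSTS" means in practice, using only the Python standard library. The hostname and path are hypothetical; real deployments would configure this in the web server rather than in application code.

```python
# Sketch of the redirect-plus-HSTS behavior described in note 1.
# The hostname below is a placeholder, not a real library catalog.
from http import HTTPStatus

def https_redirect(host: str, path: str) -> tuple[int, dict]:
    """Build a permanent redirect from an HTTP request to its HTTPS twin.

    Returns (status_code, headers). The HSTS header tells the browser to
    go straight to HTTPS for the next year, so the insecure first hop
    only ever happens once per browser.
    """
    headers = {
        "Location": f"https://{host}{path}",
        # max-age is in seconds; one year is a common choice.
        "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    }
    return HTTPStatus.MOVED_PERMANENTLY.value, headers

status, headers = https_redirect("catalog.example.edu", "/search?q=privacy")
print(status, headers["Location"])
```

This is why old catalog links don't need a panicked rewrite: the redirect catches them, and the HSTS header keeps returning browsers on the secure connection from the start.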






Monday, August 14, 2017

PubMed Lets Google Track User Searches

CT scan of a Mesothelioma patient.
CC BY-SA by Frank Gaillard
If you search on Google for "Best Mesothelioma Lawyer" and then click on one of the ads, Google can earn as much as a thousand dollars for your click. In general, Google can make a lot of money if it knows you're the type of user who's interested in rare types of cancer. So you might be surprised that Google gets to know everything you search for when you use PubMed, the search engine offered by the National Center for Biotechnology Information (NCBI), a service of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). Our tax dollars work really hard and return a lot of value at NCBI, but I was surprised to discover Google's advertising business is getting first crack at that value!

You may find this hard to believe, but you shouldn't take my word for it. Go and read the NLM Privacy Policy, in particular the section on "Demographic and Interest Data":
On some portions of our website we have enabled Google Analytics and other third-party software (listed below), to provide aggregate demographic and interest data of our visitors. This information cannot be used to identify you as an individual. While these tools are used by some websites to serve advertisements, NLM only uses them to measure demographic data. NLM has no control over advertisements served on other websites.
DoubleClick: NLM uses DoubleClick to understand the characteristics and demographics of the people who visit NLM sites. Only NLM staff conducts analyses on the aggregated data from DoubleClick. No personally identifiable information is collected by DoubleClick from NLM websites. The DoubleClick Privacy Policy is available at https://www.google.com/intl/en/policies/privacy/
You can opt-out of receiving DoubleClick advertising at https://support.google.com/ads/answer/2662922?hl=en.
I will try to explain what this means and correct some of the misinformation it contains.

DoubleClick is Google's display advertising business. DoubleClick tracks users across websites using "cookies" to collect "demographic and interest information" about users. DoubleClick uses this information to improve its ad targeting. So for example, if a user's web browsing behavior suggests an interest in rare types of cancer, DoubleClick might show the user an ad about mesothelioma. All of this activity is fully disclosed in the DoubleClick Privacy Policy, which approximately 0% of PubMed's users have actually read. Despite what the NLM Privacy Policy says, you can't opt out of receiving DoubleClick advertising; you can only opt out of DoubleClick ad targeting. So instead of mesothelioma ads, you'd probably be offered deals at Jet.com.

It's interesting to note that before February 21 of this year, there was no mention of DoubleClick in the privacy policy (see the previous policy). Despite the date, there's no reason to think that the new privacy policy is related to the change in administrations, as NIH Director Francis Collins was retained in his position by President Trump. More likely it's related to new leadership at NLM. In August of 2016, Dr. Patricia Flatley Brennan became NLM director. Dr. Brennan, a registered nurse and an engineer, has emphasized the role of data in the Library's mission. In an interview with the Washington Post, Brennan noted:
In the 21st century we’re moving into data as the basis. Instead of an experiment simply answering a question, it also generates a data set. We don’t have to repeat experiments to get more out of the data. This idea of moving from experiments to data has a lot of implications for the library of the future. Which is why I am not a librarian.
The "demographic and interest data" used by NLM is based on individual click data collected by Google Analytics. As I've previously written, Google Analytics only tracks users across websites if the site-per-site tracker IDs can be connected to a global tracker ID like the ones used by DoubleClick. What NLM is allowing Google to do is to connect the Google Analytics user data to the DoubleClick user data. So Google's advertising business gets to use all the Google Analytics data, and the Analytics data provided to NLM can include all the DoubleClick "demographic and interest" data.

What information does Google receive when you do a search on Pubmed?
For every click or search, Google's servers receive:
  • your search term and result page URL
  • your DoubleClick user tracking ID
  • your referring page URL
  • your IP address
  • your browser software and operating system
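To see how the items in this list travel together, here's a rough reconstruction of the kind of hit a page-view tracker sends home. The parameter names (v, tid, cid, dl, dr) follow Google's publicly documented Measurement Protocol, but the values and the exact payload sent from PubMed are my illustration, not a capture.

```python
# Reconstructing a tracker "hit": one GET request that bundles the
# search URL, the user tracking ID, and the referrer together.
from urllib.parse import urlencode, parse_qs, urlparse

def tracker_hit(tracking_id: str, client_id: str, page_url: str, referrer: str) -> str:
    params = {
        "v": "1",            # protocol version
        "t": "pageview",     # hit type
        "tid": tracking_id,  # the site's tracker ID
        "cid": client_id,    # the per-browser user tracking ID
        "dl": page_url,      # full page URL -- including the search term
        "dr": referrer,      # referring page URL
    }
    return "https://www.google-analytics.com/collect?" + urlencode(params)

hit = tracker_hit(
    "UA-00000000-1",
    "555.12345",
    "https://www.ncbi.nlm.nih.gov/pubmed/?term=mesothelioma",
    "https://www.ncbi.nlm.nih.gov/",
)
# The search term rides along inside the 'dl' parameter.
print(parse_qs(urlparse(hit).query)["dl"][0])
```

The IP address, browser software, and operating system aren't in the query string at all; they arrive for free with the HTTP request itself, which is why a one-pixel tracker reveals so much.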
While "only NLM staff conducts analyses on the aggregated data from DoubleClick", the DoubleClick tracking platform analyzes the unaggregated data from PubMed. And while it's true that "the demographic and interest data" of PubMed visitors cannot be used to identify them as individuals, the data collected by the Google trackers can trivially be used to identify, as individuals, any PubMed users who have Google accounts. Last year, Google changed its privacy policy to allow it to associate users' personal information with activity on sites like PubMed:
"Depending on your account settings, your activity on other sites and apps may be associated with your personal information in order to improve Google’s services and the ads delivered by Google.
So the bottom line is that Google's stated policies allow Google to associate a user's activity on PubMed with their personal information. We don't know if Google makes use of PubMed activity or if the data is saved at all, but NLM's privacy policy is misleading at best on this fact.

Does this matter? I have written that commercial medical journals deploy intense advertising trackers on their websites, far in excess of what NLM is doing. "Everybody" does it. And we know that agencies of the US government spend billions of dollars sifting through web browsing data looking for terrorists, so why should NLM be any different? So what if Google gets a peek at PubMed user activity - they see such a huge amount of user data that PubMed is probably not even noticeable.

Google has done some interesting things with search data. For example, the "Google Flu Trends" and "Google Dengue Trends" projects studied patterns of searches for illness-related terms. Google could use PubMed searches for similar investigations into health provider searches.

The puzzling thing about NLM's data surrender is the paltry benefit it returns. While Google gets un-aggregated, personally identifiable data, all NLM gets is some demographic and interest data about their users. Does NLM really want to better know the age, gender, and education level of PubMed users??? Turning on the privacy features of Google Analytics (i.e. NOT turning on DoubleClick) has a minimal impact on the usefulness of the usage data it provides.

Lines need to be drawn somewhere. If Google gets to use PubMed click data for its advertising, what comes next? Will researchers be examined as terror suspects if they read about nerve toxins or anthrax? Or perhaps inquiries into abortifacients or gender-related hormone therapies will become politically suspect. Perhaps someone will want a list of people looking for literature on genetically modified crops, or gun deaths, or vaccines? Libraries should not be going there.

So let's draw the line at advertising trackers in PubMed. PubMed is not owned by a publishing company; PubMed belongs to all of us. PubMed has been a technology leader worthy of emulation by libraries around the world. They should be setting an example. If you agree with me that NLM should stop letting Google track PubMed users, let Dr. Brennan know (politely, of course).

Notes:
  1. You may wonder if the US government has a policy about using third party services like Google Analytics and DoubleClick. Yes, there is a policy, and NLM appears to be pretty much in compliance with that policy.
  2. You might also wonder if Google has a special agreement for use of its services on US government websites. It does, but that agreement doesn't amend privacy policies. And yes, the person signing that policy for Google subsequently became the third CTO of the United States.
  3. I recently presented a webinar which covered the basics of advertising in digital libraries in the National Network of Libraries of Medicine [NNLM] "Kernel of Knowledge" series.
  4. (8/16) Yes, this blog is served by Google. So if you start getting ads for privacy plug-ins...
  5. (8/16) urlscan.io is a tool you can use to see what goes on under the cover when you search on PubMed. Tip from Gary Price.

Monday, June 12, 2017

Book Chapter on "Digital Advertising in Libraries"

I've written a chapter for a book, edited by Peter Fernandez and Kelly Tilton, to be published by ACRL. The book is tentatively titled Applying Library Values to Emerging Technology: Tips and Techniques for Advancing within Your Mission.

Digital Advertising in Libraries: or... How Libraries are Assisting the Ecosystem that Pays for Fake News

To understand the danger that digital advertising poses to user privacy in libraries, you first have to understand how websites of all stripes make money. And to understand that, you have to understand how advertising works on the Internet today.


The goal of advertising is simple and is quite similar to that of libraries. Advertisers want to provide information, narratives, and motivations to potential customers, in the hope that business and revenue will result. The challenge for advertisers has always been to figure out how to present the right information to the right reader at the right time. Since libraries are popular sources of information, they have long provided a useful context for many types of ads. Where better to place an ad for a new romance novel than at the end of a similar romance novel? Where better to advertise a new industrial vacuum pump than in the Journal of Vacuum Science and Technology? These types of ads have long existed without problems in printed library resources. In many cases the advertising, archived in libraries, provides a unique view into cultural history. In theory at least, the advertising revenue lowers the acquisition costs for resources that include the advertising.

On the Internet, advertising has evolved into a powerful revenue engine for free resources because of digital systems that efficiently match advertising to readers. Google's AdWords service is an example of such a system. Advertisers can target text-based ads to users based on their search terms, and they only have to pay if the user clicks on their ad. Google decides which ad to show by optimizing revenue—the price that the advertiser has bid times the rate at which the ad is clicked on. In 2016, Search Engine Watch reported that some search terms were selling for almost a thousand dollars per click. [Chris Lake, “The most expensive 100 Google Adwords keywords in the US,” Search Engine Watch (May 31, 2016).] Other types of advertising, such as display ads, video ads, and content ads, are placed by online advertising networks. In 2016, advertisers were projected to spend almost $75 billion on display ads; [Ingrid Lunden, “Internet Ad Spend To Reach $121B In 2014, 23% Of $537B Total Ad Spend, Ad Tech Boosts Display,” TechCrunch, (April 27, 2014).] Google's DoubleClick network alone is found on over a million websites. [“DoubleClick.Net Usage Statistics,” BuiltWith (accessed May 12, 2017).]

Matching a user to a display ad is more difficult than search-driven ads. Without a search term to indicate what the user wants, the ad networks need demographic information about the user. Different ads (at different prices) can be shown to an eighteen-year-old white male resident of Tennessee interested in sports and a sixty-year-old black woman from Chicago interested in fashion, or a pregnant thirty-year-old woman anywhere. To earn a premium price on ad placements, the ad networks need to know as much as possible about the users: age, race, sex, ethnicity, where they live, what they read, what they buy, who they voted for. Luckily for the ad networks, this sort of demographic information is readily available, thanks to user tracking.

Internet users are tracked using cookies. Typically, an invisible image element, sometimes called a "web bug," is placed on the web page. When the page is loaded, the user's web browser requests the web bug from the tracking company. The first time the tracking company sees a user, a cookie with a unique ID is set. From then on, the tracking company can record the user's web usage for every website that is cooperating with the tracking company. This record of website visits can be mined to extract demographic information about the user. A weather website can tell the tracking company where the user is. A visit to a fashion blog can indicate a user's gender and age. A purchase of scent-free lotion can indicate a user's pregnancy. [Charles Duhigg, “How Companies Learn Your Secrets,” The New York Times Magazine, (February 16, 2012).] The more information collected about a user, the more valuable a tracking company's data will be to an ad network.

Many websites unknowingly place web bugs from tracking companies on their websites, even when they don't place advertising themselves. Companies active in the tracking business include AddThis, ShareThis, and Disqus, which provide functionality to websites in exchange for website placement. Other companies, such as Facebook, Twitter, and Google, similarly track users to benefit their own advertising networks. Services provided by these companies are often placed on library websites. For example, Facebook’s “like” button is a tracker that records user visits to pages offering users the opportunity to “like” a webpage. Google’s “Analytics” service helps many libraries understand the usage of their websites, but is often configured to collect demographic information using web bugs from Google’s DoubleClick service. [“How to Enable/Disable Privacy Protection in Google Analytics (It's Easy to Get Wrong!)” Go To Hellman (February 2, 2017).]

Cookies are not the only way that users are tracked. One problem that advertisers have with cookies is that they are restricted to a single browser. If a user has an iPhone, the ID cookie on the iPhone will be different from the cookie on the user's laptop, and the user will look like two separate users. Advanced tracking networks are able to connect these two cookies by matching browsing patterns. For example, if two different cookies track their users to a few low-traffic websites, chances are that the two cookies are tracking the same user. Another problem for advertisers occurs when a user flushes their cookies. The dead tracking ID can be revived by using "fingerprinting" techniques that depend on the details of browser configurations. [Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz, “The Web Never Forgets: Persistent Tracking Mechanisms in the Wild.” In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14). ACM, New York, NY, USA, 674-689. DOI] Websites like Google, Facebook, and Twitter are able to connect tracking IDs across devices based on logins. 
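The "matching browsing patterns" idea above can be sketched as a simple set comparison: if two cookie IDs both turn up at the same few low-traffic sites, they probably belong to one person. The similarity measure and threshold here are my illustration; real matching systems weight rare sites more heavily and use far richer signals.

```python
# Sketch of cross-device cookie linking: compare the sets of sites two
# cookie IDs have visited. Site names and the 0.5 threshold are invented.

def likely_same_user(sites_a: set, sites_b: set, threshold: float = 0.5) -> bool:
    """Compare two browsing histories using Jaccard similarity
    (size of the overlap divided by size of the union)."""
    if not sites_a or not sites_b:
        return False
    overlap = len(sites_a & sites_b) / len(sites_a | sites_b)
    return overlap >= threshold

phone = {"obscure-knitting.example", "local-pta.example", "big-news.example"}
laptop = {"obscure-knitting.example", "local-pta.example", "shopping.example"}
print(likely_same_user(phone, laptop))
```

Two devices sharing visits to an obscure knitting blog and one local PTA site are far more likely to be one person than two devices that merely both visit a major news site, which is why low-traffic sites do the heavy lifting in this kind of matching.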

Once a demographic profile for a user has been built up, the tracking profile can be used for a variety of ad-targeting strategies. One very visible strategy is "remarketing." If you've ever visited a product page on an e-commerce site, only to be followed around the Internet by advertising for that product, you've been the target of cookie-based remarketing.

Ad targeting is generally tolerated because it personalizes the user's experience of the web. Men, for the most part, prefer not to be targeted with ads for women’s products. An ad for a local merchant in New Jersey is wasted on a user in California. Prices in pounds sterling don't make sense to users in Nevada. Most advertisers and advertising networks take care not to base their ad targeting on sensitive demographic attributes such as race, religion, or sexual orientation, or at least they try not to be too noticeable when they do it.

The advertising network ecosystem is a huge benefit to content publishers. A high traffic website has no need of a sales staff—all they need to do is be accepted by the ad networks and draw users who either have favorable demographics or who click on a lot of ads. The advertisers often don't care about what websites their advertising dollars support. Advertisers also don't really care about the identity of the users, as long as they can target ads to them. The ad networks don't want information that can be traced to a particular user, such as email address, name or home address. This type of information is often subject to legal regulations that would prevent exchange or retention of the information they gather, and the terms of use and so-called privacy policies of the tracking companies are careful to specify that they do not capture personally identifiable information. Nonetheless, in the hands of law enforcement, an espionage agency, or a criminal enterprise, the barrier against linking a tracking ID to the real-world identity of a user is almost non-existent.

The amount of information exposed to advertising networks by tracking bugs is staggering. When a user activates a web tracker, the full URL of the referring page is typically revealed. The user's IP address, operating system, and browser type are sent along with a simple tracker; the JavaScript trackers that place ads typically send more detailed information. It should be noted that any advertising enterprise requires a significant amount of user information collection; ad networks must guard against click-jacking, artificial users, botnet activity, and other types of fraud. [Samuel Scott, “The Alleged $7.5 Billion Fraud in Online Advertising,” Moz, (June 22, 2015).]

Breitbart.com is a good example of a content site supported by advertising placed through advertising networks. A recent visit to the Breitbart home page turned up 19 advertising trackers, as characterized by Ghostery: [Ghostery is a browser plugin that can identify and block the trackers on a webpage.]
  • 33Across
  • [x+1]
  • AddThis
  • adsnative
  • Amazon Associates
  • DoubleClick
  • eXelate
  • Facebook Custom Audience
  • Google Adsense
  • Google Publisher Tags
  • LiveRamp
  • Lotame
  • Perfect Market
  • PulsePoint
  • Quantcast
  • Rocket Fuel
  • ScoreCard Research Beacon
  • Taboola
  • Tynt

While some of these will be familiar to library professionals, most of them are probably completely unknown, or at least their role in the advertising industry may be unknown. Amazon, Facebook and Google are the recognizable names on this list; each of them gathers demographic and transactional data about users of libraries and publishers. AddThis, for example, is a widget provider often found on library and publishing sites. They don't place ads themselves, but rather, they help to collect demographic data about users. When a library or publisher places the AddThis widget on their website, they allow AddThis to collect demographic information that benefits the entire advertising ecosystem. For example, a visitor to a medical journal might be marked as a target for particularly lucrative pharmaceutical advertising.

Another tracker found on Breitbart is Taboola. Taboola is responsible for the "sponsored content" links found even on reputable websites like Slate or 538.com. Taboola links go to content that is charitably described as clickbait and is often disparaged as "fake news." The reason for this is that these sites, having paid for advertising, have to sell even more click-driven advertising. Because of its links to the Trump Administration, Breitbart has been the subject of attempts to pressure advertisers to stop putting advertising on the site.  A Twitter account for "Sleeping Giants" has been encouraging activists to ask businesses to block Breitbart from placing their ads. [Osita Nwanevu, “‘Sleeping Giants’ Is Borrowing Gamergate’s Tactics to Attack Breitbart,” Slate, December 14, 2016.] While several companies have blocked Breitbart in response to this pressure, most companies remain unaware of how their advertising gets placed, or that they can block such advertising. [Pagan Kennedy, “How to Destroy the Business Model of Breitbart and Fake News,” The New York Times (January 7, 2017).] 

I'm particularly concerned about the medical journals that participate in advertising networks. Imagine that someone is researching clinical trials for a deadly disease. A smart insurance company could target such users with ads that mark them for higher premiums. A pharmaceutical company could use advertising targeting researchers at competing companies to find clues about their research directions. Most journal users (and probably most journal publishers) don't realize how easily online ads can be used to gather intelligence as well as to sell products.

It's important to note that reputable advertising networks take user privacy very seriously, as their businesses depend on user acquiescence. Google offers users a variety of tools to "personalize their ad experience." [If you’re logged into Google, the advertising settings applied when you browse can be viewed and modified.] Many of the advertising networks pledge to adhere to the guidance of the "Network Advertising Initiative" [“The NAI Code and Enforcement Program: An Overview”], an industry group. However, the competition in the web-advertising ecosystem is intense, and there is little transparency about enforcement of the guidance. Advertising networks have been shown to spread security vulnerabilities and other types of malware when they allow JavaScript in advertising payloads. [Randy Westergren, “Widespread XSS Vulnerabilities in Ad Network Code Affecting Top Tier Publishers, Retailers,” (March 2, 2016).]

Given the current environment, it's incumbent on libraries and the publishing industry to understand and evaluate their participation in the advertising network ecosystem. In the following sections, I discuss the extent of current participation in the advertising ecosystem by libraries, publishers, and aggregators serving the library industry.

Publishers

Advertising is a significant income stream for many publishers providing content to libraries. For example, the Massachusetts Medical Society, publisher of the New England Journal of Medicine, takes in about $25 million per year in advertising revenue. Outside of medical and pharmaceutical publishing, advertising is much less common. However, advertising networks are pervasive in research journals.

In 2015, I surveyed the websites of twenty of the top research journals and found that sixteen of the top twenty journals placed ad network trackers on their websites. [“16 of the Top 20 Research Journals Let Ad Networks Spy on Their Readers,” Go To Hellman (March 12, 2015). ]
Recently, I revisited the twenty journals to see if there had been any improvement. Most of the journals I examined had added tracking on their websites. The New England Journal of Medicine, which employed the most intense reader tracking of the twenty, is now even more intense, with nineteen trackers on a web page that had "only" fourteen trackers two years ago. A page from Elsevier's Cell went from nine to sixteen trackers. [“Reader Privacy for Research Journals is Getting Worse,” Go To Hellman (March 22, 2017). ] Intense tracking is not confined to subscription-based health science journals; I have found trackers on open access journals, economics journals, even on journals covering library science and literary studies.

It's not entirely clear why some of these publishers allow advertising trackers on their websites, because in many cases, there is no advertising. Perhaps they don’t realize the impact of tracking on reader privacy. Certainly, publishers that rely on advertising revenue need to carefully audit their advertising networks and the sorts of advertising that comes through them. The privacy commitments these partners make need to be consistent with the privacy assurances made by the publishers themselves. For publishers who value reader privacy and don't earn significant amounts from advertising, there's simply no good reason for them to continue to allow tracking by ad networks.

Vendors

The library automation industry has slowly become aware of how the systems it provides can be misused to compromise library patron privacy. For example, I have pointed out that cover images presented by catalog systems were leaking search data to Amazon, which has resulted in software changes by at least one systems vendor. [“How to Check if Your Library is Leaking Catalog Searches to Amazon,” Go To Hellman (December 22, 2016).] These systems are technically complex, and systems managers in libraries are rarely trained in web privacy assessment. Development processes need to include privacy assessments at both component and system levels.

Libraries

There is a mismatch between what libraries want to do to protect patron privacy and what they are able to do. Even when large amounts of money are at stake, there is often little leverage for a library to change the way a publisher delivers advertising-bearing content. Nonetheless, together with cooperating IT and legal services, libraries have many privacy-protecting options at their disposal:
  1. Use aggregators for journal content rather than the publisher sites. Many journals are available on multiple platforms, and platforms marketed to libraries often strip advertising and advertising trackers from the journal content. Reader privacy should be an important consideration in selecting platforms and platform content.
  2. Promote the use of privacy technologies. Privacy Badger is an open-source browser plugin that recognizes trackers and blocks them. Similar tools include uBlock Origin and the aforementioned Ghostery.
  3. Use proxy servers. Rewriting proxy servers such as EZProxy are typically deployed to serve content to remote users, but they can also be configured to remove trackers, or to forcibly expire tracking cookies. This is rarely done, as far as I am aware.
  4. Strip advertising and trackers at the network level. A more aggressive approach is to enforce privacy by blocking tracker websites at the network level. Because this can be intrusive (it affects subscribed content and unsubscribed content equally) it's appropriate mostly for corporate environments where competitive-intelligence espionage is a concern.
  5. Ask for disclosure and notification. During licensing negotiations, ask the vendor or publisher to provide a list of all third parties who might have access to patron clickstream data. Ask to be notified if the list changes. Put these requests into requests for proposals. Sunlight is a good disinfectant.
  6. Join together with others in the library and publishing industry to set out best practices for advertising in web resources.
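Option 3's "configure the proxy to remove trackers" can be sketched as a rewrite pass over pages in flight: drop script and image tags that point at known tracker hosts, leave everything else alone. The blocklist and the regex-based approach are simplifications for illustration; a production rewriting proxy would use a real HTML parser and a maintained tracker list.

```python
# Sketch of tracker stripping in a rewriting proxy. The blocklist below
# is a tiny invented sample, not a real filter list.
import re

TRACKER_HOSTS = {"doubleclick.net", "addthis.com", "scorecardresearch.com"}

TAG_RE = re.compile(
    r'<(?:script|img)[^>]*src="https?://([^/"]+)[^"]*"[^>]*>(?:</script>)?',
    re.IGNORECASE,
)

def strip_trackers(html: str) -> str:
    """Remove script/img tags whose src points at a blocklisted host."""
    def replace(match: re.Match) -> str:
        host = match.group(1).lower()
        if any(host == h or host.endswith("." + h) for h in TRACKER_HOSTS):
            return ""  # drop the tracker tag entirely
        return match.group(0)  # leave non-tracker resources alone
    return TAG_RE.sub(replace, html)

page = ('<p>Article text</p>'
        '<script src="https://stats.doubleclick.net/tag.js"></script>'
        '<img src="https://images.example.org/cover.jpg">')
print(strip_trackers(page))
```

Because the proxy sits between the patron and the publisher anyway, this kind of filtering needs no cooperation from the publisher, which is precisely what makes it one of the few levers a library controls unilaterally.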

Conclusion

The widespread infusion of the digital advertising ecosystem into library environments presents a new set of challenges to the values that have been at the core of the library profession. Advertising trackers introduce privacy breaches into the library environment and help to sustain an information-delivery channel that operates without the values grounding that has earned libraries and librarians a deep reserve of trust from users. The infusion has come about through a combination of commercial interest in user demographics, consumer apathy about privacy, and general lack of understanding of a complex technology environment. The entire information industry needs to develop understanding of that environment so that it can grow and evolve to serve users first, not the advertisers.

Tuesday, May 30, 2017

Readium's New Licensed Content Protection May Result in Better Reader Privacy

Libraries offering ebook lending are between a rock and a hard place. They know in their heart of hearts that digital rights management (DRM) software is evil, but not allowing users to borrow the ebooks they want to read is not exactly the height of virtue. Saintly companies like Amazon will be happy to fill the gaps if libraries can't lend ebooks. The fundamental problem is that "borrowing" is a fiction, a conceptual construct, when applied to the ones and zeroes of a digital book. An ebook loan is really a short-term license. Under today's copyright law, a reader must have a license to read an ebook, and ebook rights-holders don't trust users to adhere to short-term licenses without some sort of software to enforce the license.

Unless the rock becomes a marshmallow, libraries that want to improve the ebook lending experience are hoping to make the hard place a bit softer. The most common DRM system used in libraries is run by Adobe. Adobe Content Server (ACS) is used by Overdrive, Proquest, EBSCO and Bibliotheca's Cloud Library. Adobe Content Server is a hard place for libraries in two ways. First, a payment must be made to Adobe for every lending transaction processed through ACS. Second, use of ACS affects reader privacy. When ACS first came out, Adobe got to know the identity of every borrower. Adobe says this about these records:
"Adobe keeps internet protocol (IP) address logs related to Adobe ID sign-ins for 90 days"
I wish they also said they destroyed these logs. Their privacy policy says:
"Your personal information and files are stored on Adobe’s servers and the servers of companies we hire to provide services to us. Your personal information may be transferred across national borders because we have servers located worldwide and the companies we hire to help us run our business are located in different countries around the world."
... and generally says that readers should trust Adobe not to betray them.

Thanks in part to demand from libraries and the companies that serve them, Adobe changed ACS so that borrower identities could be de-identified by intermediaries such as Overdrive. So instead of relying on Adobe's sometimes lax privacy protections, libraries could rely on vendors more responsive to library concerns. But still, the underlying DRM technology was designed to trust Adobe, and to distrust readers. Its centralized architecture requires everyone to trust participants closer to the center. A reader's privacy requires trust of the library or bookstore, which in turn has to trust a vendor, which in turn has to trust Adobe.

This state of affairs has been the motivation for the Readium Foundation's new DRM technology, called Readium Licensed Content Protection (LCP). LCP's developers claim that it offers libraries a low-cost way to improve the library ebook lending experience while providing readers with the privacy assurances they expect from libraries. In addition, Readium describes LCP as Open Source... except for a few lines of code. To understand LCP, and to see if it delivers on the developers' claims, I took a close look at the recently released spec. The short description of what I found is that it can do what it claims to do... but everything depends on the implementation. Also, DRM may be a Hofstadter-Moebius loop.

Now for the longer description:

Every DRM system uses encryption and secrets. Centralized DRM systems such as ACS keep a centralized secret, and use that secret to generate, distribute and control keys that lock and unlock content. LCP takes a somewhat different approach. It uses two secrets to lock and unlock content: a user secret and an ecosystem secret. An "ecosystem" is all the libraries, booksellers, and reading system vendors who agree to interoperate. Any software that knows the ecosystem secret can combine it with a user's secret to unlock content that has been locked for a user. This way multiple content providers in an ecosystem can independently lock content for a user; there's no requirement for a central key server.
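The two-secret idea can be sketched in a few lines. To be clear, this is my own illustration of the architecture, not the actual LCP algorithm (which uses real AES encryption and a published key-derivation scheme); the toy cipher below is for demonstration only. The point is that any provider holding the shared ecosystem secret can independently derive the same content key from a user's secret, with no central key server in the loop.

```python
import hashlib

# Hypothetical ecosystem secret, shared by all compliant apps and providers.
ECOSYSTEM_SECRET = b"shared-by-all-compliant-apps"

def derive_content_key(user_secret, ecosystem_secret=ECOSYSTEM_SECRET):
    """Combine the user's secret with the ecosystem secret into a 32-byte key.
    Any two providers with both secrets derive the identical key."""
    user_digest = hashlib.sha256(user_secret.encode()).digest()
    return hashlib.sha256(ecosystem_secret + user_digest).digest()

def toy_cipher(key, data):
    """Toy symmetric cipher (XOR with a hash-derived keystream) for
    illustration only -- a real system like LCP uses AES."""
    blocks = (hashlib.sha256(key + i.to_bytes(8, "big")).digest()
              for i in range(0, len(data), 32))
    keystream = b"".join(blocks)[:len(data)]
    return bytes(a ^ b for a, b in zip(data, keystream))
```

A library could lock a file with `toy_cipher(derive_content_key(passphrase), book)`, and a bookstore's reading app, knowing only the same two secrets, would unlock it without ever contacting the library.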

The LCP DRM system has some interesting usability and privacy features. If you want to read on several devices, you just need to remember your encryption secret, and you can move files from one device to another. If you want to share an ebook with a family member or close friend, that's ok too, as long as you're comfortable sharing your encryption secret. If you want to read anonymously, you can have a trusted friend borrow the book on your behalf. But to get publisher buy-in for these usability features, the system has to have a way for content providers to limit oversharing. Content providers don't want you to just post the file and the password on a pirate file-sharing service. So ecosystem software applications are required to "phone home" with a device identifier and license identifier when they are connected to the internet.
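Here's a sketch of what a minimal phone-home report might contain. The field names are my assumptions, not the LCP wire format; what matters is how little a privacy-respecting implementation needs to send: a device identifier, a license identifier, and an event type, with no user name, no book title, and no reading position.

```python
import json

def build_status_report(device_id, license_id, event):
    """Assemble a minimal phone-home payload. 'event' might be something
    like "register" or "return"; anything beyond these three fields is
    information the provider doesn't strictly need."""
    return json.dumps({
        "device": device_id,    # identifies the reading device/app install
        "license": license_id,  # identifies this particular loan license
        "event": event,
    })
```

Of course, as discussed below, even this minimal payload needs to travel over an encrypted connection, or it leaks to anyone watching the network.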

As you might imagine, the LCP phone-home information could have an impact on reader privacy, depending on the implementation. So for example, if you borrow a book from the library, and your reader app contacts the library to say you've opened the book, your privacy is minimally impacted since the library already knows you borrowed the book. But if the phone-home transaction is unencrypted, or if it contains too much information, then your employer might be able to find out about the union-organizer book you're reading. If the libraries or booksellers can aggregate all their phone-home logs, then your detailed reading profile could be compiled and exploited. Or if users are not permitted to select their own encryption secret, it might be much harder to read a book anonymously. (Note: my suggested changes for improving these parts of the spec were accepted by the spec's authors.) But if everything is implemented with a view to reader privacy, LCP should offer much better reader privacy than possible with existing systems.

There's some bad news, however. Because the ecosystem secret has to be protected, the openness of the reader software is not quite what it seems. The code will need to be obfuscated before distribution, and the secret will only be available to developers and to distribution channels that are willing and able to "harden" their software. If you want to fork the software to add a feature, your build will not be able to unlock ecosystem content until the ecosystem overlords deign to approve your changes. So don't expect reader software with lots of plugins and options. Don't expect a javascript web-reader.

The code obfuscation raises another issue: it will be difficult to audit reader software to make sure it doesn't harbor spyware, even if the source code is open (except for the ecosystem secret). You still have to trust the app provider, your library and the people who sell you books. But it's hard to get far without trusting somebody, so this isn't a new problem, and when was the last time anyone audited library software? And because the ecosystem overlords distribute the ecosystem secrets to trusted developers, the topology of trust and accountability is very different from Adobe's centralized system.

If you didn't like that bad news, that cloud may have a silver lining, or maybe a lead lining, depending on your perspective. If LCP becomes widely used, the ecosystem secret will inevitably leak, and an anti-ecosystem could form. There will be a Calibre plugin to strip encryption. There will be grayware that does everything that the ecosystem software isn't permitted to do. And it might even be sort-of legal to use. Library ebook lending might flourish. Or collapse. Because in the end, ebook lending requires trust to flow in both directions; while it's not perfect, LCP is a baby step in the direction of mutual trust between readers and content providers.

In Stanley Kubrick's 2001: A Space Odyssey, the computer HAL 9000 goes insane. The reason:
HAL's crisis was caused by a programming contradiction: he was constructed for "the accurate processing of information without distortion or concealment", yet his orders, directly from Dr. Heywood Floyd at the National Council on Astronautics, required him to keep the discovery of the Monolith TMA-1 a secret for reasons of national security. This contradiction created a "Hofstadter-Moebius loop", reducing HAL to paranoia. 
Readium LCP software is sort of like HAL 9000. It's charged with opening up information to readers, with expanding minds everywhere, transporting them to worlds of new knowledge and imagination, yet it must work to keep a secret and prevent users from doing things that copyright owners don't want them to do. Let's hope that the P in LCP doesn't stand for "Paranoia".