I'm reprinting the article here so as to have a good place for discussion.
Alice, a 17-year-old high school student, goes to her local public library and reads everything she can find about pregnancy. Noticing this, a librarian calls up some local merchants and tells them that Alice might be pregnant. When Alice visits her local bookstore, the staff has some great suggestions about newborn care for her. The local drugstore sends her some coupons for scent-free skin lotion. She reads "what you can expect..." at the library and a few months later she starts getting mail about diaper services.
Unthinkable? In the physical library, I hope this never happens. It would be too creepy!
In the digital library, this future could be happening now. Libraries and their patrons are awash in data that really isn't sensitive until aggregated, and the data is getting digested by advertising networks and flowing into "big data" archives. The scenario in which advertisers exploit Alice's library usage is not only thinkable, it needs to be defended against. It's a "threat model" that's mostly unfamiliar to libraries.
Recently, I read a book called Half Life. Uranium theft, firearms technology and computer hacking are important plot elements, but I'm not worried about people knowing that I loved it. The National Security Agency (NSA) is not going to identify me as a potential terrorist because I'm reading Half Life. On the contrary, I'd love for my reading behavior to be broadcast to the entire world, because maybe more people would discover what a wonderful writer S.L. Huang is. A lot of a library user's digital usage data is like that. It's not particularly private, and most would gladly trade usage information for convenience or to help improve the services they rely on. It would be a waste of time and energy for a library to worry much about keeping that information secret. Quite the opposite, libraries are helping users share their behavior with things like Facebook Like buttons and social media widgets.
Which is why Alice should be very worried and why it's important for libraries to understand new threat models. What breaches of user privacy are most likely to occur and which are most likely to present harm?
A 2012 article in the New York Times Magazine described a real situation involving Target (the retailer). Target's "big data" analytics team developed a customer model that identified pregnant women based on shopping behavior. Purchases of scent-free skin lotion, vitamin supplements, and cotton balls turned out to be highly predictive of subsequent purchases of baby diapers. Using the model, Target sent ads for baby-oriented products to the customers their algorithm had identified. In one case, an irate father whose daughter had received ads for baby clothes and cribs accused the store of encouraging his daughter to get pregnant. When a manager called to apologize, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
Among the companies collecting "big data" about users are the advertising networks, companies that sit in between advertisers and websites. They use their data to decide which ad from a huge inventory is most likely to result in a user response. If I were Alice, I don't think I would want my search for pregnancy books broadcast to advertising networks. Yet that's precisely what happens when I do a search on my local public library's online catalog. I very much doubt that many advertisements are being targeted based on that searching ... yet. But the digital advertising industry is extremely competitive, and unless libraries shift their practices, it's only a matter of time before library searches get factored into advanced customer models.
But it doesn't have to happen that way. Libraries have a strong tradition of protecting user privacy. Once all the "threat models" associated with the digital environment are considered, practices will certainly change.
So let's get started. In the rest of this article, I'll examine the process of borrowing and reading an ebook, and identify privacy weaknesses in the processes that advertisers and their predictive analytics modeling could exploit.
- Most library catalogs allow non-encrypted searches. This exposes Alice's ebook searches to internet providers between Alice and the library's server. The X-UIDH header has been used by providers such as Verizon and AT&T to help advertisers target mobile users. By using HTTPS for their catalogs, libraries can limit this intrusion. This is relatively easy and cheap, and there's no good excuse in 2015 for libraries not to make the switch.
Some library catalogs use social widgets such as AddThis or ShareThis that broadcast a user's search activity to advertising networks. Similarly, Facebook "Like" buttons send a user's search activity to Facebook whether or not the user is on Facebook. Libraries need to carefully evaluate the benefits of these widgets against the possibility that advertising networks will use Alice's search history inappropriately.
Statistics and optimization services like Google Analytics and NewRelic don't currently share Alice's search history with advertising networks, but libraries should evaluate the privacy assurances from these services to see if they are consistent with their own policies and local privacy laws.
- When Alice borrows a book from a vendor such as OverDrive or 3M, the vendor monitors Alice's reading behavior, albeit anonymously. Although many have criticized the use of Adobe digital rights management (DRM) in these apps, both 3M and OverDrive use the "vendorID" method, which avoids the disclosure of user data to Adobe, and at this date there is no practical way for an advertising network to exploit Alice's use of these services. Here again, libraries should review their vendor contracts to make sure that can't change.
If Alice reads her ebook using a third-party application such as Adobe Digital Editions (ADE), the privacy behavior of the third party comes into play. Last year, ADE was found to be sending user reading data back to Adobe without encryption; even today, it's known to phone home with encrypted reading data. Other applications, such as Bluefire Reader, have a better reputation for privacy, but as they say, "past performance is no guarantee of future returns".
If Alice wants to read her borrowed ebook on a Kindle (via OverDrive), it's very likely that Amazon will be able to exploit her reading behavior for marketing purposes. To avoid it, Alice would need to create an anonymous account on Amazon for reading her library books. Most people will just use their own (non-anonymous) accounts for convenience. If Alice shares her Amazon account with others, they'll know what she reads.
This is a classic example of the privacy vs. convenience tradeoff that libraries need to consider. A Kindle user trusts that Amazon will not do anything too creepy, and Amazon has every incentive to make that user comfortable with their data use. Libraries need to let users make their own privacy decisions, but at the same time libraries need to make sure that users understand the privacy implications of what they do.
- The library's own records are also a potential source of a privacy breach. This "small-data" threat model is perhaps more familiar to librarians. Alice's parents could come in and demand to know what she's been reading. A schoolmate might hack into the library's lightly defended databases looking for ways to embarrass Alice. A staff member might be a friend of Alice's family. Libraries need clear policies and robust processes to be worthy of Alice's trust.
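For readers who want to check their own catalog, the two search-side weaknesses in the list above (unencrypted searches and third-party widgets or analytics scripts) can be spotted with a few lines of code. What follows is only a rough sketch in Python, not a real audit tool. The function name `audit_catalog_page`, the helper class, and every `example.org`, `example.net`, and `example.com` address are made up for illustration. It looks at a saved copy of a search results page, reports whether the page address uses HTTPS, and lists the outside hosts that a browser would contact automatically while rendering it. It compares exact hostnames, so a library's own subdomains would be flagged too, and it does not see resources added later by JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _ResourceCollector(HTMLParser):
    """Collects URLs a browser fetches automatically when it renders a page."""

    # Tag -> attribute holding a URL that is loaded without any user action.
    AUTO_LOADED = {"script": "src", "img": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.AUTO_LOADED.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def audit_catalog_page(page_url, html):
    """Report whether the page is served over HTTPS and which outside hosts it loads.

    Hypothetical helper: exact hostname comparison is a simplification.
    """
    page = urlparse(page_url)
    collector = _ResourceCollector()
    collector.feed(html)

    third_party = set()
    for raw in collector.urls:
        # Resolve relative and protocol-relative URLs against the page address.
        host = urlparse(urljoin(page_url, raw)).hostname
        if host and host != page.hostname:
            third_party.add(host)

    return {
        "encrypted": page.scheme == "https",
        "third_party_hosts": sorted(third_party),
    }


if __name__ == "__main__":
    # Made-up search results page with one sharing widget and one social button.
    sample_html = """
    <html><body>
      <script src="//widgets.example.net/share.js"></script>
      <img src="/covers/123.jpg">
      <iframe src="https://social.example.com/like?href=x"></iframe>
    </body></html>
    """
    report = audit_catalog_page(
        "http://catalog.example.org/search?q=pregnancy", sample_html
    )
    print(report)
```

Each host in `third_party_hosts` is a party that may learn about Alice's search, typically because the browser sends the page address (search terms included) along with the request. An `encrypted` value of `False` means network providers along the way can read the search as well.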