Friday, August 7, 2009

What the Google Books Settlement Agreement Says About Privacy

Here's what Google thinks I'm interested in:
  • Computers & Electronics - Enterprise Technology - Data Management
  • Computers & Electronics - Software - ... - Content Management
  • Entertainment - Movies
  • Finance & Insurance - Investing
  • Internet - Web Services
  • Lifestyles - Clubs & Organizations
  • Lifestyles - Parenting & Family
  • News & Current Events - Technology News
  • Reference - Libraries & Museums
  • Social Networks & Online Communities - Social Networks
If you want to find out what Google thinks you're interested in, go to and find out. Does anything there disturb you? Can you imagine items that might appear on your list that would disturb you, or that you wouldn't want to post on your Facebook profile or in a comment to this post?

Last Friday I participated in a workshop sponsored by Harvard's Berkman Center focusing on Google Books and Google's settlement agreement with authors and publishers. The meeting was very well tweeted, so I won't bother to summarize or comment on the blog, at least for now. In the afternoon, I participated in a breakout session on privacy issues facilitated by Marc Rotenberg from EPIC. Recently, the Electronic Frontier Foundation (EFF) has focused attention on the neglect of patron privacy in the settlement agreement. At the NYPL panel I attended, the view expressed by the participants was that though privacy was a very important issue, the settlement agreement was not the place to address privacy concerns. At the Harvard workshop, the opposite view was predominant.

In the EFF posting, I was struck by the fact that they suggest that Google should be required to
allow users of anonymity providers, such as Tor, proxy servers, and anonymous VPN providers, to access Google Book Search
but they don't seem to expect libraries to be participating in the digital environment for books. In fact, I doubt that many libraries today view themselves as potential anonymity providers, despite the deep-seated respect for patron privacy that is part of the inherited culture of librarianship. I had wondered whether libraries would be able to use technological means, such as proxy servers, to ensure the privacy of their patrons who use Google Book Search. With some inspiration from the workshop, I've spent some time closely examining the agreement to see what it really says about privacy and what libraries might be able to do to enhance patron privacy.

Nowadays, electronic resources librarians can't help but focus more concern on monitoring for misuse of resources than on patron privacy issues. The obligation to do so is built into most license agreements for electronic resources. The following passage is from section 5.2 of the CLIR/DLF Model License:
Protection from Unauthorized Use. Licensee shall use reasonable efforts to inform Authorized Users of the restrictions on use of the Licensed Materials. In the event of any Authorized User makes an unauthorized use of the Licensed Materials, the parties may take the following actions as a cure:
  1. Licensor may terminate such Authorized User's access to the Licensed Materials;
  2. Licensor may terminate the access of the Internet Protocol (“IP”) address(es) from which such unauthorized use occurred; or
  3. Licensee may terminate such Authorized User’s access to the Licensed Materials upon Licensor’s request. Licensor shall take none of the steps described in this paragraph without first providing reasonable notice to Licensee (in no event less than [time period]) and cooperating with the Licensee to avoid recurrence of any unauthorized use.
Libraries frequently have to negotiate to get this language, as publisher licenses often have more burdensome requirements.

The sorts of things that actually happen, and that librarians worry about, are of two types. The first is when a student with legitimate credentials "loans" them to a friend and in a week or two thousands of "friends" (often in another country) are using a resource through the campus proxy server. The librarian's obligation is to identify and disable the rogue credentials. In the other scenario, a student or faculty member tries to use some type of downloading tool to download an entire journal or database for some sort of offline use. In this case as well, the offending user must be identified and told "don't do that". Publishers of electronic resources typically have monitoring tools (and bot traps and poison pills) in place so that they can detect such misuse and shut off a customer's access when this sort of thing occurs. A call to a support desk is typically needed to restore access. Publishers realize that these things happen, and that on a campus with 20,000 students, there are limits to how much librarians can control what their patrons do. I do not know of any case where legal proceedings or demands for compensation have resulted from such incidents, but I do know that one publisher cut off access to all of China for several months when a breach occurred there.

There clearly exists tension between a library's obligations to prevent unauthorized use and its obligations to protect the privacy of users. In the CLIR/DLF model license, there is a mutual obligation that balances the licensee obligation to control unauthorized use:
Confidentiality of User Data. Licensor and Licensee agree to maintain the confidentiality of any data relating to the usage of the Licensed Materials by Licensee and its Authorized Users. Such data may be used solely for purposes directly related to the Licensed Materials and may only be provided to third parties in aggregate form. Raw usage data, including but not limited to information relating to the identity of specific users and/or uses, shall not be provided to any third party.
The balance between providing for security against unauthorized use and confidentiality of user data is the practical determinant of the degree to which a patron can expect to have real privacy. To provide security against unauthorized use of electronic resources, a library needs to generate logs for any proxy servers that it operates. To assure patron privacy, a library must be diligent in limiting the retention of those log files and of any other records that might be used to identify and track users and their usage of particular resources.
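To make the retention idea concrete, here is a minimal sketch of purging proxy log entries once they pass a retention window. Everything in it is an assumption for illustration: the 14-day window, the function name purge_old_entries, and the (timestamp, patron_id, resource) entry layout are hypothetical and not taken from any real proxy server or license.

```python
from datetime import datetime, timedelta

# Hypothetical retention window; a real library would set this by policy.
RETENTION = timedelta(days=14)

def purge_old_entries(entries, now, retention=RETENTION):
    """Return only the log entries recorded within the retention window.

    Each entry is a (timestamp, patron_id, resource) tuple; this layout is
    an assumption for illustration, not any real proxy's log format.
    """
    cutoff = now - retention
    return [entry for entry in entries if entry[0] >= cutoff]

now = datetime(2009, 8, 7)
log = [
    (datetime(2009, 7, 1), "patron-17", "journal-a"),   # older than 14 days
    (datetime(2009, 8, 1), "patron-42", "database-b"),  # still inside the window
]
kept = purge_old_entries(log, now)
```

The window is the whole trade-off: long enough to trace a misuse report back to a set of credentials, short enough that there is little left to hand over if someone asks for the records later.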

The settlement agreement (available here) says very little about patron privacy. (In fact, the only users whose privacy is mandated are users with print disabilities who access Library Digital Copies in Fully Participating Libraries under the special access provisions of section 7.2(b)(i) of the agreement.) (I'm capitalizing terms defined in the agreement.) It says quite a lot about security, however, and thus many aspects of patron privacy will be effectively governed by the provisions for security. The use of proxy servers by libraries is implicitly mentioned in two places. In section 4.1(a)(iv) pricing bands are specified for government, public and school library subscriptions with the qualifier "no remote access without Registry approval", while higher education and corporate pricing bands are specified without the remote access qualifier. Remote access is most typically provided by libraries in higher education through the use of proxy servers. Additionally, the "Security Standard" set out in Attachment D to the agreement specifies that
Google shall use commercially reasonable efforts to authenticate individual End Users for access to Books in an Institutional Subscription by verifying that an individual is affiliated with an institution with an active subscription. Google’s efforts will be in partnership with the subscribing institutions in a manner consistent with, or otherwise equivalent to, generally accepted industry standards for authentication of use of subscriptions. Techniques used may include IP address authentication, user login, and/or leveraging authentication systems already in place at an individual institution.
Since the current "industry standard" is to allow users to authenticate through a proxy server against an institutional id/password service, it would seem that proxy servers would be permitted under the agreement, at least for higher education settings.

There is a specific security requirement set by the agreement that is likely to result in increased user tracking by Google. Google is required to make sure that each user cannot preview more than a certain number of pages of a book. Thus, Google must keep track of the books that a user has viewed, and stop the preview once the quota is reached. For the purposes of this requirement, Google is supposed to treat multiple users of a given computer as a single user. Assuming it is possible to do so, this would have some odd consequences for computers in a library. A patron would be able to move from computer to computer and view more than their quota, but might not be able to view any pages from a book popular enough to have been previously viewed by another patron. In the current version of Google Book Search, cookies, not IP addresses, are used to track users, but a user is not required to log into the service at all unless they want to access personalization features. Google sets a 2-year cookie when you use the service, but the service can be used without cookies. To fulfill the terms of the settlement agreement, it appears to me likely that Google would have to either require users to log into personally identifiable accounts, or use the IP addresses of individual computers to allow unidentified users to access the service. Either way, libraries would be limited in their ability to use proxy servers to protect patron privacy (for example, by blocking cookies), and it's quite clear that what EFF has proposed with respect to anonymity providers is incompatible with the agreement.
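The odd consequences of counting per computer rather than per person are easier to see in a toy model. The sketch below is only my illustration of the idea, not Google's implementation: the PreviewQuota class, the computer and book identifiers, and the quota of two pages are all hypothetical, and the agreement's real limits are not modeled.

```python
from collections import defaultdict

class PreviewQuota:
    """Toy preview quota keyed by computer, not by person (an assumption)."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        # (computer_id, book_id) -> set of page numbers already shown there
        self.seen = defaultdict(set)

    def request_page(self, computer_id, book_id, page):
        """Return True if the page may be shown on this computer."""
        pages = self.seen[(computer_id, book_id)]
        if page in pages:
            return True   # re-reading a page does not use more quota
        if len(pages) >= self.max_pages:
            return False  # quota for this computer and book is used up
        pages.add(page)
        return True

quota = PreviewQuota(max_pages=2)  # hypothetical limit

# The first patron uses up the quota for one book on terminal 1.
first = [quota.request_page("terminal-1", "popular-book", p) for p in (1, 2)]

# A second patron sits down at terminal 1 and is refused a new page.
second_patron_blocked = not quota.request_page("terminal-1", "popular-book", 3)

# The first patron walks over to terminal 2 and keeps reading.
first_patron_continues = quota.request_page("terminal-2", "popular-book", 3)
```

Keying on a login instead of a computer would remove both quirks, but only by tying every page view to an identifiable account, which is exactly the privacy cost discussed above.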

I do not know of any resources currently licensed to libraries that are comparable to the post-settlement Google Book Search in their requirements for user tracking to prevent excessive use, so it's not clear to me how much guidance is really given by the phrase "generally accepted industry standards for authentication of use of subscriptions". Less widely deployed authentication methods, such as Shibboleth, may provide more patron privacy than the use of cookies and IP addresses and/or proxy servers, while at the same time allowing Google to satisfy the terms of the settlement agreement. It is also likely that authentication technology, or modifications of existing authentication technologies, could be developed and specifically tailored to meet both the security requirements of licensors and the privacy requirements of libraries.

It is worth noting that the requirement for user tracking is not found in the main part of the settlement agreement, but rather in "Attachment D", the Security Standard. Interestingly, the settlement agreement includes a provision for the Security Standard to be reviewed and revised every two years, by "Google, the Registry and up to a total of four (4) representatives on behalf of the Fully Participating Libraries" to allow for changes in Technology. Note that although the libraries are included because of their role in allowing use of Library Digital Copies, there is a single Security Standard which applies to both Google and library-provided services. Thus there will be four library representatives who must agree to revisions in the security policy (and thus to the privacy that it allows) to be implemented in Google services. In theory at least, libraries could use the review of the Security Standard to introduce security/privacy technologies suited to the special characteristics of Google Book Search Subscriptions.

It is unclear what security requirements apply to the "Public Access Service" which would put free terminals (with paid printing available) in any US library that wanted it, because the settlement agreement treats the Public Access Service as something separate from the Institutional Subscription, while the Security Standard makes no mention at all of the Public Access Service. It seems possible that adding coverage of the Public Access Service to the Security Standard would also have to be addressed by the security review group that includes the library representatives.

In any case, it is clear that the power of the Book Rights Registry to review and approve the security implementation plans of Google gives it a great deal of leeway to set standards for patron privacy. Since the primary duty of the Registry is to serve rights-holders, its intrinsic motivation for protecting privacy would be only to see that privacy intrusions do not act to depress revenue significantly. Strong oversight by the court, such as has been requested by the library associations, may also promote attention to privacy concerns. Finally, it is likely that the Registry will need to pay close attention to state patron privacy laws. The library-registry agreements explicitly allow for state laws to trump any library obligations under the settlement agreement, so there can be no provision of the Security Standard that is incompatible with state privacy laws.

Google, as presently constituted, has every reason to be concerned about user privacy and guard it vigilantly; its business would be severely compromised by any perception that it intrudes on the privacy of its users. As Larry Lessig pointed out at the Berkman workshop, that doesn't mean that the Google of the future will behave similarly. Privacy concerns should be addressed; the main question has been how and where to address them. My reading of the settlement agreement is that it may be possible to address these concerns through the agreement's Security Standard review mechanism, through oversight of the Registry, and through state and federal laws governing library patron privacy.

And I am still not a lawyer.