Showing posts with label Proxy server. Show all posts

Monday, December 12, 2011

SOPA Could Put Common Library Software in the Soup


The "Stop Online Piracy Act", or SOPA, is promoted as something that will... stop online piracy. So I was a bit surprised when I learned how it's supposed to work. A key provision of SOPA will shut down "notorious" websites by setting up a national web filter based on domain names. I'm sure the pirates had a great laugh about that one. They'll be the ones benefiting while the rest of us figure out how to avoid collateral damage. Members of Congress should consult the nearest available 14-year-old on the ease of web filter evasion: school teachers in my town routinely access their filter-blocked Facebook accounts by asking students to show them how it's done.

Rerouting domain names to alternate IP addresses is pretty easy to do, and can be very useful as well. One type of software used to accomplish this is called a "proxy server". It's called that because it acts as your web browser's proxy in requesting files from a web site. For example, after connecting to a proxy server in Stockholm, my requests for web pages would appear to issue from a computer in Sweden instead of from my computer in New Jersey.

Libraries often use proxy servers to simplify IP authentication of their networks to digital information providers. When an academic library buys access to a database, for example, they'll give the IP address of their proxy-server to the database provider, which then puts the IP address on an "allow" list. Then everyone at the school accesses the database through the address of the proxy server. In effect, those proxy-authenticated users circumvent the IP address-based filter that blocks unauthorized users.
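
To make the allow-list idea concrete, here is a minimal Python sketch of the check a database provider might run. The function name and the addresses (taken from ranges reserved for documentation) are assumptions for illustration, not any vendor's actual configuration.

```python
import ipaddress

# Hypothetical allow list: the single address of a library's proxy server.
# 192.0.2.0/24 is reserved for documentation, so this is not a real host.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.10/32")]


def is_authorized(client_ip: str) -> bool:
    """Return True if the request comes from an allow-listed address."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


# First call: a request relayed by the library proxy.
# Second call: a direct request from some other address.
print(is_authorized("192.0.2.10"))
print(is_authorized("198.51.100.7"))
```

Because the provider only ever sees the proxy's address, every patron who connects through the proxy looks the same from the provider's side.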

Passage of SOPA would inevitably spawn the creation of a network of proxy servers hosted in countries that reject filtering of the internet. Users in the US could then connect transparently to blocked sites by connecting through a constantly shifting network of proxy servers. The key to that connection would be a Proxy Auto-config, or PAC, file: essentially a mini DNS file installed in the user's web browser software.
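
A real PAC file is a small piece of JavaScript that defines a function named FindProxyForURL(url, host) and returns a string such as "DIRECT" or "PROXY host:port". The sketch below mimics that decision logic in Python rather than JavaScript; the host pattern and the proxy address are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: host pattern -> PAC-style directive.
PROXY_RULES = [
    ("*.blocked.example", "PROXY proxy.example.net:8080"),
]


def find_proxy_for_url(url: str, host: str) -> str:
    """Mimic a PAC file's FindProxyForURL: pick a proxy or go direct."""
    for pattern, directive in PROXY_RULES:
        if fnmatchcase(host.lower(), pattern):
            return directive
    # No rule matched, so the browser would connect without a proxy.
    return "DIRECT"
```

Pointing users at a different proxy only means editing the rule table, which is why a PAC file suits a constantly shifting set of proxy servers.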

SOPA contains provisions that allow the US Attorney General to
bring an action for injunctive relief against any entity that knowingly and willfully provides or offers to provide a product or service designed or marketed for the circumvention or bypassing of [domain name blocking] and taken in response to a court order issued pursuant to this subsection, to enjoin such entity from interfering with the order by continuing to provide or offer to provide such product or service.
Proxy servers meet the condition of being designed to route around filters and therefore fall into the category of services that could be subject to injunctive action under SOPA. The proxy servers most frequently used in libraries are OCLC's EZproxy and the open-source software known as SQUID, but there are many others in use.

In particular, SQUID makes use of PAC files, and thus could be vulnerable if the Justice Department decides that PAC files make it too easy to evade SOPA blockages. Conceivably, the Justice Department could force browser developers to omit support for PAC files, or perhaps to restrict their transmission.

Similar concerns about important software have been raised by Jim Fruchterman on behalf of Benetech, a non-profit that, among other things, provides ebooks to the reading disabled. Benetech is also one of the largest developers of software for human rights activists around the world. They operate Tor servers designed to foster anonymous communications. On Beneblog, Fruchterman worries that Benetech services could be impacted by SOPA. In response, a commenter signing in as "Copyright Alliance" argues that such action would be unlikely because "The State Department is strongly committed to advancing both Internet freedom and the protection and enforcement of intellectual property rights on the Internet." Too bad it's the Justice Department that gets to decide which services constitute circumvention.

I don't think that libraries will have their proxy servers taken away anytime soon, even if SOPA is enacted. But it's likely that the widespread development of SOPA-circumventing infrastructure would degrade the ability of rights holders to find and prosecute copyright violators. Knowledge of the actual locations of unauthorized files would be hidden offshore in distributed proxy servers, completely out of the reach of US law enforcement. The "file lockers" of today would dissolve into ungraspable bit vapors, and the online piracy problem would just get worse and worse.

There are many ways to address the online piracy problem, too many to list in this post. My own company is working on a piracy-neutering business model for ebooks. I don't know enough to evaluate the possible effectiveness of the payment and advertising network components of SOPA. But it appears to me that from the technical point of view, the internet filter component of SOPA will be a charm of powerful trouble, like a hell-broth, boil and bubble.

Notes:
  1. @amac has a good post on SOPA's scope issues, as well as links to other articles.
  2. I focus here on SOPA, but there are similar issues with PROTECT IP, as described by Steve Crocker and 4 other prominent internet engineers.
  3. The Crocker paper describes a number of other ways that domain name filtering might be circumvented. These include replacing the hosts file on the user's computer (similar to PAC file installation) and switching the user to a non-filtered DNS server. Apparently this is done transparently by some types of computer malware. This can only end badly.



Friday, August 7, 2009

What the Google Books Settlement Agreement Says About Privacy

Here's what Google thinks I'm interested in:
  • Computers & Electronics - Enterprise Technology - Data Management
  • Computers & Electronics - Software - ... - Content Management
  • Entertainment - Movies
  • Finance & Insurance - Investing
  • Internet - Web Services
  • Lifestyles - Clubs & Organizations
  • Lifestyles - Parenting & Family
  • News & Current Events - Technology News
  • Reference - Libraries & Museums
  • Social Networks & Online Communities - Social Networks
If you want to find out what Google thinks you're interested in, go to http://www.google.com/ads/preferences/view and find out. Does anything there disturb you? Can you imagine items that might appear on your list that would disturb you, or that you wouldn't want to post on your Facebook profile or in a comment to this post?

Last Friday I participated in a workshop sponsored by Harvard's Berkman Center focusing on Google Books and Google's settlement agreement with authors and publishers. The meeting was very well tweeted, so I won't bother to summarize or comment on the blog, at least for now. In the afternoon, I participated in a breakout session on privacy issues facilitated by Marc Rotenberg from EPIC. Recently, the Electronic Frontier Foundation (EFF) has focused attention on the neglect of patron privacy in the settlement agreement. At the NYPL panel I attended, the view expressed by the participants was that though privacy was a very important issue, the settlement agreement was not the place to address privacy concerns. At the Harvard workshop, the opposite view was predominant.

In the EFF posting, I was struck by the fact that they suggest that Google should be required to
allow users of anonymity providers, such as Tor, proxy servers, and anonymous VPN providers, to access Google Book Search
but they don't seem to expect libraries to be participating in the digital environment for books. In fact, I doubt that many libraries today view themselves as potential anonymity providers, despite the deep-seated respect for patron privacy that is part of the inherited culture of librarianship. I had wondered whether libraries would be able to use technological means, such as proxy servers, to ensure the privacy of their patrons who use Google Book Search. With some inspiration from the workshop, I've spent some time closely examining the agreement to see what it really says about privacy and what libraries might be able to do to enhance patron privacy.

Nowadays, electronic resources librarians can't help but focus more concern on monitoring for misuse of resources than on patron privacy issues. The obligation to do so is built into most license agreements for electronic resources. The following passage is from section 5.2 of the CLIR/DLF Model License:
Protection from Unauthorized Use. Licensee shall use reasonable efforts to inform Authorized Users of the restrictions on use of the Licensed Materials. In the event of any Authorized User makes an unauthorized use of the Licensed Materials, the parties may take the following actions as a cure:
  1. Licensor may terminate such Authorized User's access to the Licensed Materials;
  2. Licensor may terminate the access of the Internet Protocol (“IP”) address(es) from which such unauthorized use occurred; or
  3. Licensee may terminate such Authorized User’s access to the Licensed Materials upon Licensor’s request. Licensor shall take none of the steps described in this paragraph without first providing reasonable notice to Licensee (in no event less than [time period]) and cooperating with the Licensee to avoid recurrence of any unauthorized use.
Libraries often have to negotiate to get this language, as publisher licenses frequently have more burdensome requirements.

The sorts of things that actually happen, and that librarians worry about, are of two types. The first is when a student with legitimate credentials "loans" them to a friend and in a week or two thousands of "friends" (often in another country) are using a resource through the campus proxy server. The librarian's obligation is to identify and disable the rogue credentials. In the other scenario, a student or faculty member tries to use some type of downloading tool to download an entire journal or database for some sort of offline use. In this case as well, the offending user must be identified and told "don't do that". Publishers of electronic resources typically have monitoring tools (and bot traps and poison pills) in place so that they can detect such misuse and shut off a customer's access when this sort of thing occurs. A call to a support desk is typically needed to restore access. Publishers realize that these things happen, and that on a campus with 20,000 students, there are limits to how much librarians can control what their patrons do. I do not know of any case where legal proceedings or demands for compensation have resulted from such incidents, but I do know that one publisher cut off access to all of China for several months when a breach occurred there.

There clearly exists tension between a library's obligations to prevent unauthorized use and its obligations to protect the privacy of users. In the CLIR/DLF model license, there is a mutual obligation that balances the licensee obligation to control unauthorized use:
Confidentiality of User Data. Licensor and Licensee agree to maintain the confidentiality of any data relating to the usage of the Licensed Materials by Licensee and its Authorized Users. Such data may be used solely for purposes directly related to the Licensed Materials and may only be provided to third parties in aggregate form. Raw usage data, including but not limited to information relating to the identity of specific users and/or uses, shall not be provided to any third party.
The balance between providing for security against unauthorized use and confidentiality of user data is the practical determinant of the degree to which a patron can expect to have real privacy. To provide security against unauthorized use of electronic resources, a library needs to generate logs for any proxy servers that it operates. To assure patron privacy, a library must be diligent to limit the retention of those log files and of any other records that might be used to identify and track users and their usage of particular resources.
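
As a rough illustration of what limiting retention could look like in practice, here is a minimal Python sketch that discards proxy log entries once they pass a retention window. The 14-day window, the function name, and the shape of the log entries are all assumptions; a real proxy server would have its own log format and rotation settings.

```python
from datetime import datetime, timedelta

# Hypothetical policy: keep proxy log entries for 14 days, then discard.
RETENTION = timedelta(days=14)


def prune_log(entries: list[dict], now: datetime) -> list[dict]:
    """Return only the entries still inside the retention window."""
    cutoff = now - RETENTION
    return [entry for entry in entries if entry["time"] >= cutoff]
```

The window has to be long enough to trace a misuse report back to a set of credentials, and short enough that old browsing records are not sitting around to be requested later.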

The settlement agreement (available here) says very little about patron privacy. (In fact, the only users whose privacy is mandated are users with print disabilities who access Library Digital Copies in Fully Participating Libraries under the special access provisions of section 7.2(b)(i) of the agreement.) (I'm capitalizing terms defined in the agreement.) It says quite a lot about security, however, and thus many aspects of patron privacy will be effectively governed by the provisions for security. The use of proxy servers by libraries is implicitly mentioned in two places. In section 4.1(a)(iv) pricing bands are specified for government, public and school library subscriptions with the qualifier "no remote access without Registry approval", while higher education and corporate pricing bands are specified without the remote access qualifier. Remote access is most typically provided by libraries in higher education through the use of proxy servers. Additionally, the "Security Standard" set out in Attachment D to the agreement specifies that
Google shall use commercially reasonable efforts to authenticate individual End Users for access to Books in an Institutional Subscription by verifying that an individual is affiliated with an institution with an active subscription. Google’s efforts will be in partnership with the subscribing institutions in a manner consistent with, or otherwise equivalent to, generally accepted industry standards for authentication of use of subscriptions. Techniques used may include IP address authentication, user login, and/or leveraging authentication systems already in place at an individual institution.
Since the current "industry standard" is to allow users to authenticate through a proxy server against an institutional id/password service, it would seem that proxy servers would be permitted under the agreement, at least for higher education settings.

There is a specific security requirement set by the agreement that is likely to result in increased user tracking by Google. Google is required to make sure that each user cannot preview more than a certain number of pages of a book. Thus, Google must keep track of the books that a user has viewed, and stop the preview once the quota is reached. For the purposes of this requirement, Google is supposed to treat multiple users of a given computer as a single user. Assuming it is possible to do so, this would have some odd consequences for computers in a library. A patron would be able to move from computer to computer and view more than their quota, but might not be able to view any pages from a book popular enough to have been previously viewed by another patron. In the current version of Google Book Search, cookies, not IP addresses, are used to track users, but a user is not required to log into the service at all unless they want to access personalization features. Google sets a 2-year cookie when you use the service, but the service can be used without cookies. To fulfill the terms of the settlement agreement, it appears to me that it's likely that Google would have to either require users to log into personally identifiable accounts, or to use IP addresses of individual computers to allow unidentified users to access the service. Either way, libraries would be limited in their ability to use proxy servers to protect patron privacy (for example, by blocking cookies), and it's quite clear that what EFF has proposed with respect to anonymity providers is incompatible with the agreement.
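
The odd consequences of counting per computer rather than per person are easy to see in a toy model. The Python sketch below is my own illustration, not Google's implementation: the class name, the identifiers, and the quota value are hypothetical, and the agreement's actual limits are more involved.

```python
from collections import defaultdict

# Hypothetical quota: pages of one book viewable from one computer.
PAGE_QUOTA = 2


class PreviewTracker:
    """Track previewed pages per (computer, book), ignoring who is sitting there."""

    def __init__(self, quota: int = PAGE_QUOTA) -> None:
        self.quota = quota
        # (computer_id, book_id) -> set of page numbers already shown
        self.pages_seen: dict[tuple[str, str], set[int]] = defaultdict(set)

    def request_page(self, computer_id: str, book_id: str, page: int) -> bool:
        """Return True if the page may be shown from this computer."""
        seen = self.pages_seen[(computer_id, book_id)]
        if page in seen:
            return True  # re-viewing a page does not consume quota
        if len(seen) >= self.quota:
            return False  # quota for this computer and book is exhausted
        seen.add(page)
        return True
```

With a quota of two, one patron can use up a book's allowance on a library terminal, after which the next patron at that terminal is refused any new page of that book, while the first patron can simply walk to another terminal and keep reading.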

I do not know of any resources currently licensed to libraries that are comparable to the post-settlement Google Book Search in the requirements for user tracking to prevent excessive uses, so it's not clear to me how much guidance is really given by the phrase "generally accepted industry standards for authentication of use of subscriptions". Authentication methods less widely deployed, such as Shibboleth, may provide more patron privacy than use of cookies and IP addresses and/or proxy servers, while at the same time allowing Google to satisfy the terms of the settlement agreement. It is also likely that authentication technology, or modifications of existing authentication technologies, could be developed and specifically tailored to meet both the security requirements of licensors and the privacy requirements of libraries.

It is worth noting that the requirement for user tracking is not found in the main part of the settlement agreement, but rather in "Attachment D", the Security Standard. Interestingly, the settlement agreement includes a provision for the Security Standard to be reviewed and revised every two years, by "Google, the Registry and up to a total of four (4) representatives on behalf of the Fully Participating Libraries" to allow for changes in technology. Note that although the libraries are included because of their role in allowing use of library digital copies, there is a single Security Standard which applies to both Google and library-provided services. Thus there will be four library representatives who must agree to revisions in the security policy (and thus on the privacy that it allows) to be implemented in Google services. In theory at least, libraries could use the review of the Security Standard to introduce use of security/privacy technologies suited to the special characteristics of Google Book Search Subscriptions.

It is unclear what security requirements apply to the "Public Access Service" which would put free terminals (with paid printing available) in any US library that wanted it, because the settlement agreement treats the Public Access Service as something separate from the Institutional Subscription, while the Security Standard makes no mention at all of the Public Access Service. It seems possible that adding coverage of the Public Access Service to the Security Standard would also have to be addressed by the security review group that includes the library representatives.

In any case, it is clear that the power of the Book Rights Registry to review and approve the security implementation plans of Google gives it a great deal of leeway to set standards for patron privacy. Since the primary duty of the Registry is to serve rights-holders, its intrinsic motivation for protecting privacy would be only to see that privacy intrusions do not act to depress revenue significantly. Strong oversight by the court such as has been requested by the library associations may also promote attention to privacy concerns. Finally, it is likely that the Registry will need to pay close attention to state patron privacy laws. The library-registry agreements explicitly allow for state laws to trump any library obligations under the settlement agreement, so there can be no provision of the security standard that is incompatible with state privacy laws.

Google, as presently constituted, has every reason to be concerned about user privacy and guard it vigilantly; its business would be severely compromised by any perception that it intrudes on the privacy of its users. As Larry Lessig pointed out at the Berkman workshop, that doesn't mean that the Google of the future will behave similarly. Privacy concerns should be addressed; the main question has been how and where to address them. My reading of the settlement agreement is that it may be possible to address these concerns through the agreement's Security Standard review mechanism, through oversight of the Registry, and through state and federal laws governing library patron privacy.

And I am still not a lawyer.