Thursday, February 25, 2010

Named Graphs, Argleton and the Truth Economy

Depending on the map provider you're using, there may be a street running through my kitchen. After driving through my kitchen, perhaps you'd like to visit Argleton, a town in Lancashire, UK, that exists only on Google Maps. I expect the street through my kitchen is a genuine mistake, but map companies are known to intentionally insert "trap streets" into their maps to help expose competitors who are just copying their maps.

Errors in information sources can be inadvertent or intentional, but either way, on the internet the errors get copied, propagated and multiplied, resulting in what I call the Information Freedom Corollary:
Information wants to be free, but the truth'll costya.
If you accept the idea that technologies such as Linked Data, web APIs and data spidering are making it much easier to distribute and aggregate data and facts on the internet, you come to the unmistakable conclusion that it will become harder and harder to make money by selling access to databases. Data of all types will become more plentiful and easier to obtain, and by the laws of supply and demand, the price for data access will drop to near zero. In fact, making data free can increase its value in many ways, because of the many benefits of combining data from different sources.

If you want a successful business, it's best to be selling a scarce commodity. Chris Anderson and others have been promoting "free" as a business model for media, on the idea that attention is an increasingly scarce commodity (an observation attributed to Nobel prize-winning economist Herbert Simon). John Hagel has a good review of discussions about "the Economics of Attention." Whether or not this is true, business models that sell attention are very hard to execute when the product is factual information. Data is more of a fuel than a destination.

There is something that becomes scarce as the volume and velocity of information flow increase, and that's the ability to tell fact from fiction. As data becomes plentiful, verifiable truth becomes scarce.

Let's suppose we want to collect a large quantity of information, and think about the ways that we might practically reconcile conflicting assertions. (We're also assuming that it actually matters to someone that the information is correct!)

One way to resolve conflicting assertions is to evaluate the reputation of the sources. The New York Times has a pretty good reputation for accuracy, so an assertion by the Times might be accepted over a conflicting assertion by the Drudge Report. An assertion about the date of an ancestor's death might be accepted if it's in the LDS database, and might be trusted even more strongly if it cites a particular gravestone in a particular cemetery (i.e., has provenance information). But reputation is imperfect. I am absolutely, positively sure that there's no street through my kitchen, but if I try to say that to one of the mapping data companies, why should they believe me in preference to a planning map filed in my town's planning office? What evidence are they likely to accept? Try sending a correction to Google Maps, and see what happens.

Another method to resolve conflicts is voting. If two or more independent entities make the same assertion, you can assign higher confidence to that assertion. But as it becomes easier to copy and aggregate data, it becomes harder and harder to tell whether assertions from different sources are really independent, or whether they're just copied from the same source. The more that data gets copied and reaggregated, the more that the truth is obscured.
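To make the copying problem concrete, here's a toy sketch (all source names and the copy-chain map are invented for illustration) of how naive voting overcounts copied assertions, while a count that follows provenance back to the original source does not:

```python
# Hypothetical sketch: naive voting vs. provenance-aware voting.
# Source names and the copied_from map are invented for illustration.

assertions = {
    "mapco": "street X exists",
    "aggregator1": "street X exists",   # copied from mapco
    "aggregator2": "street X exists",   # copied from mapco
    "surveyor": "street X does not exist",
}

copied_from = {"aggregator1": "mapco", "aggregator2": "mapco"}

def root(source):
    """Follow the copy chain back to the original source."""
    while source in copied_from:
        source = copied_from[source]
    return source

def votes(assertions, independent=False):
    tally = {}
    seen_roots = set()
    for source, claim in assertions.items():
        if independent:
            r = root(source)
            if r in seen_roots:
                continue  # discard copies of a source we already counted
            seen_roots.add(r)
        tally[claim] = tally.get(claim, 0) + 1
    return tally

print(votes(assertions))                    # naive count: 3 to 1
print(votes(assertions, independent=True))  # provenance-aware: 1 to 1
```

The catch, of course, is that in the real world the `copied_from` map is exactly the provenance information that gets lost as data is reaggregated.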

The semantic web offers another method of resolving conflicting assertions: consistency checking. Genealogy offers many excellent examples of how data consistency can be checked against models of reality. A death date needs to come after a person's birth date, and if someone's mother was younger than 12 or older than 60 at their birth, some data is inconsistent with our model of human fertility. Whatever the topic area, a good ontological model will allow consistency checks of data expressed using the model. But even the best knowledge model will be able to reconcile only a small fraction of conflicts: a birth date listed as 03-02 could be either February or March.
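As a minimal sketch of such consistency checks, here's a hypothetical record-checking function; the record format and function name are invented, but the rules (death after birth, maternal age between 12 and 60) are the ones from the model of human reality described above:

```python
from datetime import date

# Hypothetical sketch of ontology-style consistency checks for genealogy
# records. The record format is invented; the rules come from the model
# of reality described above.

def check_person(birth, death=None, mother_birth=None):
    """Return a list of inconsistencies found in one person's record."""
    problems = []
    if death is not None and death < birth:
        problems.append("death date precedes birth date")
    if mother_birth is not None:
        mother_age = (birth - mother_birth).days / 365.25
        if mother_age < 12 or mother_age > 60:
            problems.append(f"implausible maternal age: {mother_age:.0f}")
    return problems

# A consistent record passes; an impossible one is flagged twice.
print(check_person(date(1850, 3, 2), death=date(1920, 1, 1),
                   mother_birth=date(1825, 6, 10)))   # []
print(check_person(date(1850, 3, 2), death=date(1849, 1, 1),
                   mother_birth=date(1845, 6, 10)))
```

Note that checks like these are silent on genuinely ambiguous data: a date recorded as 03-02 passes every rule whichever month it means.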

Since none of these methods is a very good solution, I'd like to suggest that many information providers should stop trying to sell access to data, and start thinking of themselves as truth providers.

How does an information provider become a truth provider? A truth provider is a verifier of information. A truth provider will try to give not only the details of Barack Obama's birth, but also a link to the image of his certificate of live birth. Unfortunately, the infrastructure for information verification is poorly developed compared to the infrastructure for data distribution, as exemplified by standards developed for the Semantic Web. Although the existing Semantic Web technology stack is incomplete, it comes closer than any other deployed technology to making "truth provision" a reality.

Although there have been a number of efforts to develop vocabularies for provenance of Linked Data (mostly in the context of scientific data), I view "named graphs" as an essential infrastructure for the provision of truth. Named graphs are beginning to emerge as vital infrastructure for the semantic web, but they have not been standardized (except obliquely by the SPARQL query specification). This means that they might not be preserved when information is transferred from one system to another. Nonetheless, we can start to think about how they might be used to build what we might call the "true" or "verified" semantic web.

On the Semantic Web, named graphs can be used to collect closely related triples. The core architecture of the Semantic Web uses URIs to identify the nouns, verbs, and adjectives; named graphs allow URIs to identify the sentences and paragraphs of the semantic web. Once we have named graphs, we can build machinery to verify the sentences and paragraphs.
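Here's a minimal in-memory sketch of the idea (the `QuadStore` class and all URIs are invented for illustration): each triple is tagged with the URI of the named graph it belongs to, so the graph URI identifies a whole "paragraph" of assertions:

```python
# Minimal in-memory sketch of a quad store: each triple carries the
# URI of the named graph it belongs to. Names and data are invented.

class QuadStore:
    def __init__(self):
        self.graphs = {}  # graph URI -> set of (subject, predicate, object)

    def add(self, graph_uri, s, p, o):
        self.graphs.setdefault(graph_uri, set()).add((s, p, o))

    def triples_in(self, graph_uri):
        return self.graphs.get(graph_uri, set())

store = QuadStore()
g = "http://maps.example.com/graphs/my-town"   # the "paragraph" URI
store.add(g, "ex:KitchenStreet", "rdf:type", "ex:Street")
store.add(g, "ex:KitchenStreet", "ex:locatedIn", "ex:MyTown")

# Because the whole graph has one URI, those two sentences can be
# verified (or disputed) as a unit rather than triple by triple.
print(len(store.triples_in(g)))  # 2
```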

The simplest way to verify named graphs using their URIs is to use the mechanism of the web to return authoritative graph data in response to an http request at the graph URI. Organizations that are serious about being "truth providers" may want to do much more. Some data consumers may need much more extensive verification (and probably updates) of a graph- they may need to know the original source, the provenance, the change history, the context, licensing information, etc. This information might be provided on a subscription basis, allowing the truth provider to invest in data quality, while at the same time allowing the data consumer to reuse, remix, and redistribute the information without restriction, even adding new verification layers.
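A sketch of that simplest mechanism might look like the following; since this is only an illustration, a dictionary stands in for real HTTP GETs, and all URIs and graph data are invented:

```python
import hashlib

# Sketch of the simplest verification mechanism: dereference the graph
# URI and compare the authoritative copy against a local copy. THE_WEB
# is a stand-in for real HTTP requests; all URIs and data are invented.

THE_WEB = {
    "http://provider.example.org/graphs/argleton":
        "ex:Argleton rdf:type ex:Town .",
}

def digest(graph_text):
    return hashlib.sha256(graph_text.encode("utf-8")).hexdigest()

def verify_local_copy(graph_uri, local_text):
    """True if our copy matches what the authoritative URI serves."""
    authoritative = THE_WEB.get(graph_uri)
    if authoritative is None:
        return False  # URI does not dereference: nothing to verify against
    return digest(authoritative) == digest(local_text)

print(verify_local_copy("http://provider.example.org/graphs/argleton",
                        "ex:Argleton rdf:type ex:Town ."))      # True
print(verify_local_copy("http://provider.example.org/graphs/argleton",
                        "ex:Argleton rdf:type ex:TrapTown ."))  # False
```

A real deployment would need a canonical serialization of the graph before hashing, since the same triples can be serialized in many byte-different ways.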

Consumers of very large quantities of information may need to verify and update information without polling each and every named graph. This might be done using RSS feeds or other publish/subscribe mechanisms. Another possible solution is to embed digital signatures for the graph in the graph URI itself, allowing consumers possessing the appropriate keys to cryptographically distinguish authentic data from counterfeit or "trap street" data.
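One way the signature-in-the-URI idea might work is sketched below, using an HMAC for brevity; a real deployment would more likely use public-key signatures so that consumers can verify without holding the signing secret. Every name, key, and URI here is invented:

```python
import hashlib
import hmac

# Hypothetical sketch: embed a signature over the graph's contents in
# the graph URI itself. HMAC is used for brevity; public-key signatures
# would let consumers verify without holding the signing secret.

KEY = b"provider-signing-key"   # invented secret, shared with consumers

def canonical(triples):
    """A toy canonical serialization: sorted triples, one per line."""
    return "\n".join(sorted(" ".join(t) for t in triples)).encode("utf-8")

def signed_graph_uri(base, triples):
    sig = hmac.new(KEY, canonical(triples), hashlib.sha256).hexdigest()[:16]
    return f"{base}#sig={sig}"

def is_authentic(uri, triples):
    """Recompute the signed URI and compare in constant time."""
    base, _, _ = uri.partition("#sig=")
    return hmac.compare_digest(uri, signed_graph_uri(base, triples))

triples = {("ex:Argleton", "rdf:type", "ex:Town")}
uri = signed_graph_uri("http://provider.example.org/graphs/argleton", triples)

print(is_authentic(uri, triples))                                       # True
print(is_authentic(uri, {("ex:Argleton", "rdf:type", "ex:TrapTown")}))  # False
```

A counterfeit graph (a "trap street") fails the check because its contents no longer match the signature baked into the URI under which it claims to live.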

Named graphs and data verification. I think this is the beginning of a beautiful friendship.

Friday, February 19, 2010

Notes from the Google Books Fairness Hearing

The Fairness Hearing was even more interesting than I expected; every time a speaker started droning on about something we'd all heard ten times before, Judge Chin would interrupt with a snippy or pointed comment. Judge Chin definitely runs a no-nonsense courtroom.

ResourceShelf has a nice roundup of the news reporting from the fairness hearing; the best summaries are from Norman Oder at Library Journal: Part One and Part Two.

Here are some of my observations.

How Many Books?

In Dan Clancy's declaration (PDF, 149 KB) in support of the settlement, there are some interesting numbers (which actually come from Google's Jon Orwant).
  • Google pays approximately $2.5 million per year to license metadata from 21 commercial databases of information about books.
  • Google has gathered 3.27 billion records about Books, and analyzed them to identify more than 174 million unique works.
These numbers seemed to cause a great deal of confusion at the hearing. Several speakers opposed to the settlement combined this number with information from the Declaration of Tiffaney Allen, Settlement Administrator for Rust Consulting (PDF, 2.1 MB), that
As of February 8, 2010, Rust Consulting has received 1,846 completed hard copy claim forms, and 42,604 claim forms were completed using the settlement website. The total number of Books claimed by those 44,450 claimants is 1,125,339. [...]

Of the 1,107,620 Books claimed online, 619,531 are classified as out-of-print (not Commercially Available) and 488,089 are classified as in-print (Commercially Available).
Some objectors subtracted 1 million claimed books from 174 million unique works to get the eye-opening number of 173 million unclaimed works supposedly being exploited by Google. This is silly math, and the use of silly math is a good indicator of speakers not doing their homework.

It's known that one of the bibliographic databases licensed by Google is OCLC's Worldcat; it's probably not a coincidence that Worldcat currently contains 174,618,797 bibliographic records. There's a big difference between a bibliographic record and a book subject to the settlement. Later in the day, Daralyn Durie, an attorney representing Google, tried to clarify what the numbers meant. (updated February 22 with text from the transcript)
  • 174 million is NOT the number of books in the settlement. 
  • Google estimates that there are 42 million different books in US libraries. 
  • 20% of these are in the public domain.
  • About half of those left are written in foreign languages.
  • Of the 42 million, less than 10 million of these works are affected by the settlement in any way. 
  • Of these, about 5 million are out-of-print books implicated by the settlement. 
These numbers are in line with reality. Michael Cairns, a veteran of the book data supply chain business, has published his own estimates of the number of orphan works, which more or less square with these numbers.

So what are the other 160 million works? They're duplicates (different editions of the same work), works that aren't books, and works published in countries excluded from the agreement and not registered with the US copyright office.
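The arithmetic behind Durie's estimates can be laid out as a quick sanity check; all figures here are the ones reported above, not new data:

```python
# Sanity check on the numbers from Durie's argument, as reported above.
# The percentages are her stated estimates, not independent data.

total_records      = 174_000_000   # bibliographic records, NOT settlement books
books_in_libraries =  42_000_000   # Google's estimate of distinct books
public_domain      = int(books_in_libraries * 0.20)   # ~20% public domain
in_copyright       = books_in_libraries - public_domain
english_language   = in_copyright // 2   # about half are foreign-language

print(public_domain)     # 8400000  (8.4 million)
print(english_language)  # 16800000 (16.8 million, an upper bound)
```

Even this upper bound of 16.8 million English-language, in-copyright books is far below the 174 million records, and Durie's figure of fewer than 10 million books actually affected by the settlement fits comfortably inside it.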

Update, February 20: Jon Orwant was kind enough to send me some clarifications.
The only correction I'd make is that it actually *is* a coincidence that OCLC cites 174M records and we cite 174M books. 

One thing to add to your "silly math" bit is that the 174M number also includes public domain books (hence not part of the settlement), and (this is the part that everyone messes up, and was ambiguous in Dan's declaration) 174M is a count of *manifestations*, not *works*.  Hamlet is one work but hundreds of manifestations.  The actual number of works is closer to 120M, but I haven't checked our most recent analysis.

Phrase of the Day: "Identical Factual Predicate"

It became clear at the hearing that Judge Chin's decision would turn on a determination of whether the settlement and the complaint it is meant to resolve have "identical factual predicates." I'll do my best to explain why.

A significant hurdle that the parties (i.e., Google, the Authors, and the Publishers) have to overcome is that the settlement is truly innovative and forward looking, and seeks to bind absent class members to business models that would not otherwise be allowed under copyright law. In their brief justifying the use of a class action, the parties cite a 1986 Supreme Court decision nicknamed "Firefighters", Local Number 93, Int’l Assoc. of Firefighters v. City of Cleveland. In this case, in which the petitioner tried to overturn a consent decree designed to redress past racial discrimination using ongoing obligations, the Court clarified that a judicial decree may go beyond the bounds of an original complaint.

In their filings, objectors countered with the “identical factual predicate” doctrine. This doctrine arises from a case known as "Super Spuds" in which it was held that a class action settlement could not go beyond the complaint of the original lawsuit. Judge Chin seemed interested in the apparent conflict and even asked Amazon's lawyer, famed copyright attorney David Nimmer, for his views on how to reconcile the precedents.

Nonetheless, attorneys from both sides wanted to argue whether the settlement satisfied the "identical factual predicate" test. Michael Boni, attorney for the Authors Guild, appeared to be digging himself deep into a hole when Judge Chin asked him "Isn't it true that this case started out about snippets?" Boni argued that the case was really about the fears that publishers had about the scanning that Google was doing, and who knew what else? I thought to myself that publishers seem to fear much about the future of their industry, and following Boni's line of reasoning, the settlement could have included air rights because authors and publishers feared that the sky was falling.

Daralyn Durie's subsequent argument went a long way to recovering the ground lost by Boni. Of all the hot-shot lawyers making arguments at the hearing, Durie was by far the most impressive. She persuasively argued that since the original complaint included Google's distribution of scan files to the libraries that contributed books for scanning, the settlement's provisions for selling access to scan files indeed constituted an identical factual predicate.

Judge Chin's eventual decision will turn on his evaluation of the "factual predicates".

What, Exactly, is Copyright's "Head"?

By the end of the hearing, I was sick and tired of hearing the phrase "turning copyright on its head". Even Bruce Keller, attorney for the Publishers' Association, was eager to use the phrase in its negative form. Have you ever tried repeating a word over and over again, so that its sound becomes grotesquely detached from its meaning? That's my feeling about the copyright-head phrase. It's meant to express that copyright usually means that copying requires the rightsholder's permission, while the settlement would allow Google to make copies unless the rightsholder refuses permission.

On repetition, I began to ask myself: What part of copyright is the head? Are there brains in copyright? Is copyright blind? Does copyright have legs? Is there an invisible hand of copyright? When you eviscerate copyright, do copyright intestines spill out onto the floor?

Judge Chin Wants to Fix It

I got the impression that Judge Chin would like to approve a settlement. At least twice he asked objectors how they would "fix" the settlement to remove their objections. He asked EFF's Cindy Cohn how to fix the privacy problems she called attention to, and he sounded unhappy when EPIC's Marc Rotenberg told him that privacy problems with the settlement couldn't be cured. He asked Irene Pakuscher (representing the Federal Republic of Germany) if the settlement could be fixed to satisfy Germany's concerns about treaty compliance and effective representation. He also wanted to explore with more than one questioner Hadrian Katz's suggestion that all problems would go away if the settlement shifted from being opt-out to being opt-in.

State Laws Aren't Relevant

In an article last year, I suggested that Judge Chin might be tempted to use state unclaimed property laws as an alternate way to unravel the Orphan Works mess. Looks like I was wrong: he expressed open skepticism at the argument of Norman Marden, representing the Commonwealth of Pennsylvania, that the settlement should be rejected because of incompatibility with state laws.

Blind People had the Best View

The National Federation of the Blind made sure to have a very visible presence at the hearing to emphasize the benefits of the settlement for the reading disabled. It worked- photographs of blind people made the New York Times.

Spectators for the hearing filled two courtrooms. For the morning, I was in the overflow room, which featured a video screen too small for the room and a distorted sound system. The view of the courtroom was fixed, and omitted any view of Judge Chin. Ironically, the seats closest to the video screen were filled with people who couldn't see it. Let's hope that's not emblematic of the case.

Thursday, February 18, 2010

Settlement Lawyers Say Real Authors Don't Advocate Fair Use

Today, February 18, 2010, in the US District Court, Southern District of New York, Judge Denny Chin will hear arguments for and against approval of an agreement to settle the lawsuit against Google by a class of book rightsholders formed by the American Association of Publishers and the Authors Guild. The unlikely alliance of publishers, authors, and Google will try to push through a settlement that would provide increased access to millions of books that Google has scanned and digitized in cooperation with libraries.

You can read about the pros and cons, the benefits and controversy of the settlement on a variety of blogs, websites and news outlets, but if you want to read one paragraph (with footnote) from the thousands of pages filed with the court that embodies all the issues, contradictions and complexities of the Google Books Settlement, here it is:
Some object to the entire ASA because it does not ensure that scientific or academic works are freely accessible under “Open Access” principles. They have claimed that if those works remain unclaimed, then they should be freely made available for use. These arguments run counter to the economic interests of members of the Class.146 That the reading public may wish to have free access to scientific and other academic works covered by the ASA, or that some academic authors may not want to exploit their works through the Revenue Models, should not supersede the economic interests of members of the Class.

146 That the interests motivating these objections runs contrary to the interests of the Class is best illustrated by their preference that Google should prevail on the merits of this litigation. See, e.g., D.I. 336 at 2-3 (“we believe . . . that scanning books to index them and make snippets available is likely and should be considered fair use”).
This comes from the Supplemental Memorandum Responding to Specific Objections filed by lawyers for the plaintiffs in the case. This 187-page document, available from the Public Index (PDF, 856 KB), presents legal arguments countering objections to the agreement filed with the court. Just in case you've not had a chance to follow all the issues surrounding the case, I'll try to explain some of this crankiness.

In this excerpt, the "some" who object in "D.I. 336" (PDF, 287 KB) to the entire ASA (Amended Settlement Agreement) is Pamela Samuelson, Professor of Law at the University of California. Samuelson writes on behalf of a long list of academic authors, who believe that many absent rightsholders would want their books to be made as freely available as possible, and object to Google's exclusive monetization of those works.

I can speak to this belief from personal experience. My wife's father was a history professor, and wrote a small number of scholarly monographs published by university presses. These monographs, representing a significant part of his life's work, are unavailable to many scholars in his field. If he were still alive, we are sure that he would have wanted his books to be digitized and made freely available. I've advised the family that the Google settlement would allow these works to become much more available, something that would be difficult to achieve without the settlement because we have no documentation of the relevant publication contracts. Nonetheless, my father-in-law's interests would have closely aligned with those of the academics represented by Samuelson, in favor of free access, and siding with Google on the fair use arguments.

A large fraction of authors write books for reasons other than to profit from book sales, and only a very small number of authors are able to make a living publishing books. In addition to academic authors, who publish to advance their careers, there are authors who publish to advance a political or social agenda, or as a means of personal expression. It seems bizarre to me that the legal representatives of the entire class of authors should just dismiss these motivations as running counter to the "economic interests of members of the Class".

Since the lawsuit is configured as a Class Action, the central issue that Judge Chin must consider is whether all authors and publishers are properly represented by attorneys for the class, and whether the settlement deals fairly with them. The provisions of the settlement are unusually broad, so Judge Chin will need to give detailed scrutiny to the provisions of the settlement which impact some class members differently from others.

It seems to me that footnote 146 argues too much. In attacking Samuelson and the academics she represents for siding with Google on the fair use issue, the footnote undermines the plaintiffs' core argument that Boni & Zack LLC and Debevoise & Plimpton LLP, the authors of the Memorandum, are fairly representing their interests in the lawsuit.

At the fairness hearing, I don't expect to hear any new arguments or experience any legal drama (although I expect vitriolic verbal grenades from Lynne Chu). I'll mostly be looking for signs of interest, impatience, or annoyance from Judge Chin.


Thursday, February 11, 2010

Blog Post Number One Hundred

In eleventh grade one of my English teachers predicted that I would become a writer. I scoffed. I was going to be a scientist, or an engineer. I attributed his prediction to projection, the same sort of thinking that led the minister of our church to predict that I would become a preacher of the gospel. Of course, when I was four years old, my ambition was to become either a doctor or a garbage man.

I did the scientist thing and the engineer thing, but recently I've become a blogger. When I started, I resolved to write at least ten posts, but believe it or not, this is my 100th. I think that qualifies me to say a bit about the blog as a literary form, although it doesn't qualify me to say anything original. I pity the English Ph.D. student 50 years from now whose dissertation topic is "the blog as a literary form", trying to come up with something original to say. (Assuming, of course, that the Ph.D. dissertation is still extant as a literary form in 50 years.)

The blog is perhaps the first literary form native to the web. It's not a news story, though it can be news. It's something that can't be done in print. For example, a blog post without hyperlinks is like a Super Bowl without commercials. It exists as part of a web, like a conversation. My smash-hit article on "Offline Book Lending" could not have existed without a Publisher's Weekly article to bounce off of; as for my posts on Dung Beetle Armament and Dancing Parrots- well, what more could you want?

The creative use of multimedia is integral to a good blog post, though much of my subtlety is rarely appreciated. I'm especially proud of the Duchampian picture in The Illusion of Internet Identity and the punk rock references in The Rock-Star Librarian and Objective Selector.

The blog post is also the first literary form that is optimized for search engines, with embedded metadata that's integral to the content. I'm proud that I have a post ranking #2 in Google for "hashtags for conferences" and #4 in Google for "bird shit antenna".

Writing as many articles as I have has made me much more aware of their construction. My favorite construction pattern seems to be [odd story]-[dry exposition]-[surprising connection], but somehow I never intend it that way to start. I also find that my best and most popular posts are the ones that I write quickly; I'm never very happy with the ones that I research heavily and work hard on.

If you've ever left a comment, thanks! The opportunity to have my thoughts enriched and corrected by the range of experts who have done so is a real privilege. I've also received some wonderful private comments and links from other blogs, which are deeply appreciated.

Now that the blog has had over 55,000 "unique visitors", I've started to get a few Facebook friend requests from people I don't think I know. I'm sort of old fashioned about only friending people I've met (unless they're cousins!), but it's nice to see people so interested. As a response, I've started a Facebook fan page as a place for blog readers who are active Facebookers to interact.

I'm not exactly sure where this will lead, but so far it's been both fun and worthwhile. I'll leave lucrative to the imaginary future.

That is all.

Wednesday, February 10, 2010

Branches of Koha

An arborist recently came to look at the white oak tree in my back yard. The tree is about 80 years old and is the biggest in the neighborhood. According to the arborist, our tree was in excellent health because of its large number of leaders, or main branches. Even in the strongest wind, these leaders will bend and a few might even break, but the tree itself is very unlikely to topple. Some neighbors have a very tall, scary tulip tree with only one main branch. I fear it will come down one day very suddenly.

In my discussion of Koha and LibLime (part 1, part 2), I promised to write more about the so-called "forking" of the Koha development process. What happened was that about 10 months ago LibLime stopped participating in the Koha open development community. According to Josh Ferraro, LibLime's CEO, this happened because LibLime developers were having trouble completing development projects that LibLime had committed to doing for customers. Together with his development partners, Ferraro judged that LibLime's developers were spending too much time providing support to non-customers and that the overhead of the community development process was slowing down development more than it was contributing to LibLime's development objectives.

This judgment is hotly contested by developers advocating an open community development process. (See, for example, this post by Chris Cormack, or Owen Leonard's post on the Koha List.) It's not hard to imagine that an open community might pose difficulties during focused development. Two coders may disagree about how a task is to be done, and depending on the personalities and skills involved, such disputes might easily become major time-sinks. Implementing a feature important to US libraries might break a feature important to European libraries, for example, and making both features work at the same time might be a lot of work. On the other hand, a small community such as the one working on Koha can ill afford to fragment into factions and start working at cross purposes.

When one branch of code diverges too far from another, the branches can become incompatible, or forked. When this happens, effort applied to one branch may have to be duplicated for the other branch. Forking is a common occurrence in open source projects, and can be evidence of a project's health. Such a fork in the Linux kernel came to light just last week, as some drivers added to support Google's Android system were deleted from the project's main tree. The downside is that forks increase the maintenance burden. It's often worthwhile for developers to work hard to join their code to a main branch so that others can maintain the contributed code and keep it from breaking.

Open Source projects can be motivated in many ways. Some projects have their origin in proprietary software, when the developers decide their businesses would benefit from wider adoption or support. Or perhaps the emphasis of the developer's business has changed. Etherpad is a recent example- the company was acquired by Google, whose main interest was to improve Google Wave rather than to continue the Etherpad service.

Other Open Source projects arise as "calling cards". That's how IndexData started doing Open Source. Sebastian Hammer, IndexData's Founder and President, told me that when he started, he just wanted his software to be widely used. His business was primarily custom development, and companies who were using his software because it was free began using his company for development because the free software worked well.

Only a small percentage of open source software projects are supported by more than just a few developers, and even fewer survive without an acknowledged lead developer. Koha has been blessed with significant contributions from a number of developers (and it uses free open source components such as Apache, MySQL, Perl and IndexData's Zebra).

You can imagine that Koha contributors outside LibLime would be very upset at LibLime's withdrawal from community development. Their contributions to the project were made with an understanding of Koha as an inherently community-driven effort for the benefit of all Koha libraries, and LibLime's withdrawal from the community process implicitly minimizes the value of their ongoing contributions. In fact, several contributors within LibLime were upset at the changes, and are now working for competing companies.

The fact of competition among Koha project participants inevitably leads to conflicting incentives. While proprietary software creates incentives for vendors to compete for initial sales with a robust platform and advanced features at the expense of ongoing service, open source software creates incentives for companies to focus on service and custom development. A company that puts a lot of effort into the core software may gain no advantage from that work if it has competitors which instead focus on services. Competitive considerations may certainly have been an important factor in the manner of LibLime's withdrawal from the community process.

It's interesting to see the messaging that LibLime's competitors used to respond to LibLime's withdrawal from the community development process. These ranged from the sunny "Equinox Promise" to a pointed post from BibLibre and a worried post from ByWater. There is clearly a struggle among all these companies to resolve the tension between competition and the need for cooperation that underlies Open Source support businesses. (Note: Equinox supports a different library system, Evergreen, so it's not so directly competitive with LibLime. Update Feb. 11- Equinox announced its entry into the Koha support market.)

Another reason given by LibLime for the change in its development process was that they wanted development customers to be able to test and approve new functionality before it would be released to the world. In the words of a LibLime press release:
"A public software release of each version of LibLime Enterprise Koha will occur periodically, after the sponsoring library and LibLime's customers have had adequate time to ensure that the codebase is of sufficient quality and stability to be contributed back to the Koha Community."
One Koha developer described this rationale to me as "nonsensical" and pointed out that the code quality seemed to be good enough for LibLime production customers. A look through the Koha developer wiki gives the impression that an elaborate QA process has been built by the community; I don't know how well it is followed.

My perspective, as someone who has managed a development project of similar scope, is that testing and quality assurance require a fair amount of discipline and attention to process. Getting developers to comment their code, do proper testing, and keep documentation up to date (i.e. adhere to a documented QA process) is not always easy, even if you're signing their paychecks. So while I have little insight into whether LibLime's new internal development processes are in fact resulting in better or more timely code, I think that the explanation given is at least plausible.

Managing a software development project is really, really hard. A lot of people imagine that their success in managing one project is evidence of superior process or ability, when really they were just lucky to have the right people at the right time. So I'm really skeptical when someone says that "community development" is the best way to build software, or that "agile methodology" is the one true way. In the real world, development managers may have the skills to succeed in one style of development (or group of developers) and be lacking in the skills needed to succeed in another style.  Software development projects only work if they work. In the case of the two branches of Koha, only time will tell whether one branch will wither and die, or whether two branches will end up diverging, both healthy.

While the people in charge at LibLime and PTFS have been in no position to comment on what they will do before their transaction is complete, other Koha stakeholders that I talked to were "hopefully optimistic" that PTFS would ultimately decide to rejoin the community development process and help reunify the Koha code base. PTFS developers have been active with contributions during the period that LibLime has pursued separate development. At ALA Midwinter, PTFS' John Yokley emphasized that a decision as to the extent of PTFS participation in Koha community development had not yet been made. In the meantime, Koha stakeholders other than LibLime have launched a new website to be the "Temporary home of the Koha Community."

 (Update Feb.12 - the acquisition is not happening.) (Update Mar. 16 - the acquisition closed after all.)

You can look at my tree analogy in two ways. You could say that having multiple branches of the Koha code is good for the project, as it is for my oak tree. You could also say that concentrating development of Koha in one company is dangerous, and worry about it as I do about the tulip tree.

Or you could just be happy that spring is coming and buds are already appearing on the trees.

This is the third part of a series. Also see Part 1 and Part 2

Thursday, February 4, 2010

Copyright-Safe Full-Text Indexing of Books

As the February 18 hearing on the revised Google Books Settlement Agreement draws near, I think it's timely to explore some issues surrounding full-text indexing of books. It's important to realize that when Google began its program of scanning books in libraries, it chose to do so in a way that entered the gray zone of fair use. Google continues to maintain that its scanning activities are perfectly legal, and fair use advocates welcomed the Publishers' and Authors' lawsuit because it had the potential to clarify ambiguities around fair use. No matter where the court decided to draw the line, both fair use and rightsholder control would be able to extend into the zone of current uncertainty.

Overlooked in the controversy is the fact that Google could have chosen a safer course in its effort to make full-text indices of books. In this article, I'll argue that it's possible to make full-text indices of books in a way that steers well clear of copyright infringement. But first, I should note that playing it safe would not have been a good plan for Google. By pushing fair use to its limits, Google assured itself a favorable competitive position. In a lawsuit, Google could have lost on 90% of the fair use they were claiming and would still have ended up 10% ahead of where a safe course would have taken them. Google is large enough that even a 10% victory in court would have paid off in the long run. As it is, Google chose to settle the lawsuit under terms that put them in a better position than they would have occupied by playing it safe, and potential competitors don't gain the benefits of a fair-use precedent.

I make two assumptions about copyright in devising a copyright-safe indexing method:
  1. You can't infringe the copyright to a work if you don't copy the work.
  2. If you can't reconstruct a work from its index, then distributing copies of the index doesn't infringe on the work's copyright.
Just in case these assumptions are weak, my fall-back position is that indexing is clearly a fair use under US copyright law.

First, the fall-back assumption: full-text indexing is allowed as fair use under US copyright law. Indices are allowed as "transformative uses". Judge Robert Patterson's decision (pdf, 195K) in the "Harry Potter Lexicon" case gives an excellent background of this jurisprudence and concludes:
The purpose of the Lexicon’s use of the Harry Potter series is transformative. Presumably, Rowling created the Harry Potter series for the expressive purpose of telling an entertaining and thought provoking story centered on the character Harry Potter and set in a magical world. The Lexicon, on the other hand, uses material from the series for the practical purpose of making information about the intricate world of Harry Potter readily accessible to readers in a reference guide. To fulfill this function, the Lexicon identifies more than 2,400 elements from the Harry Potter world, extracts and synthesizes fictional facts related to each element from all seven novels, and presents that information in a format that allows readers to access it quickly as they make their way through the series. Because it serves these reference purposes, rather than the entertainment or aesthetic purposes of the original works, the Lexicon’s use is transformative and does not supplant the objects of the Harry Potter works.
The author of the Lexicon lost his case not because his indexing was not allowed, but rather because he copied too much of J. K. Rowling's creative expression in doing so.

Second, you have to copy to infringe copyright. A more accurate statement is this: You have to either make a copy or a derivative work to infringe copyright. The second piece of this can be a bit more confusing, because "derivative work" has a specific meaning in copyright law. A translation into another language is an example of a derivative work. Indices are not derivative works. The law considers indices to be more akin to metadata. I might need access to a book to count the number of figures it contains, but a report of the number of figures in a book and what page they're on is in no way a derivative work. The copyright act defines a derivative work as
a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.
If you make copies by scanning, however, as Google is doing, you must also establish that your use is allowed as fair use. If you never make a copy, you don't even need to reach the fair use provision.

The last assumption gets more technical. The simplest form of a word index is a sorted list of words with pointers to the occurrence of the word within the text. So an index of that last sentence might look like this:
a    5,9
form    3
index    7
is    8
list    11
occurrence    18
of    4,12,19
pointers    15
simplest    2
sorted    10
text    24
the    1,17,20,23
to    16
with    14
within    22
word    6,21
words    13
It doesn't take a computer science degree to see that it's easy to reconstruct the sentence from this index. For that reason this form of index is equivalent to a copy. If you remove the position pointers, however, the index loses enough information that the sentence cannot be reconstructed. So if we take the words on a page of text and sort the words in each sentence, then sort the word-sorted sentences, we get an index of a page that can't be used to reconstruct text, but can be used to build a useful full-text index of a book.
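To make the distinction concrete, here's a short Python sketch (my own illustration; the post contains no code, and the function names are invented for this example). It builds the positional index shown above, demonstrates that the positional form can be inverted back into the original text, and then builds the "safe" form by sorting the words within each sentence and discarding positions:

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z']+")

def positional_index(text):
    """Map each word to its 1-based positions: the 'simplest form' above."""
    index = defaultdict(list)
    for pos, word in enumerate(WORD.findall(text.lower()), start=1):
        index[word].append(pos)
    return dict(index)

def reconstruct(index):
    """Invert a positional index back into the word sequence,
    showing why this form of index is equivalent to a copy."""
    slots = {pos: word for word, positions in index.items() for pos in positions}
    return " ".join(slots[pos] for pos in sorted(slots))

def safe_page_index(sentences):
    """Sort the words within each sentence, then sort the sentences.
    Positions are discarded, so the text cannot be rebuilt, but the
    result still answers 'does this page contain word X?'."""
    return sorted(tuple(sorted(WORD.findall(s.lower()))) for s in sentences)

sentence = ("The simplest form of a word index is a sorted list of words "
            "with pointers to the occurrence of the word within the text.")
idx = positional_index(sentence)
print(idx["the"])                      # [1, 17, 20, 23], as in the table above
print(reconstruct(idx) == sentence.lower().rstrip("."))   # True: a full copy
page = safe_page_index([sentence])
print(any("pointers" in words for words in page))  # True: search still works
```

A book-level index would simply merge these per-page word lists, recording which pages contain which words; no amount of merging reintroduces the word order that was thrown away.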

The trickiest step of completely copyright-safe indexing is producing the page index from a book without making intermediate copies of the pages. In a conventional scanning process, a digital image of a page is stored to disk and the copy is passed to OCR software. Indexing software then works on the OCR text. A scanning process that was fastidious about copyright, however, could scan lines of text word by word and never acquire an image large enough to be subject to copyright.

US courts have considered the loading of a copyrightable work into a computer's RAM storage to constitute copying, but scanning sufficient to produce an index can in principle be done without requiring that to occur. (For an excellent law review article on the RAM-copying situation, read Jonathan Band and Jeny Marcinko's article in Stanford Technology Law Review.) Also, even sentences of more than a few words can be considered copyrightable works, as I discussed in an article from November.

Another possible way to avoid copying is to build a black-box indexer. A closer look at the RAM-copying precedent, MAI Systems v. Peak Computer, suggests that a non-copying scanning indexer can be built even if page images exist somewhere in RAM. In that case, the court reasoned that the software copy could be viewed via terminal readouts, system logs, and that sort of thing. If a closed-box indexing system were built so that page images resident in RAM could never be "perceived, reproduced, or otherwise communicated", then there is a fair chance that a court would find that copying was not occurring.

I'm a technologist, not a lawyer. I would welcome comment and criticism from experts of all stripes on this analysis. For example, I've not considered international aspects at all. There are many technical aspects of copyright-safe indexing that would need to be sorted out, but doing so could open the way to countless transformative uses of all the books in the world.

Tuesday, February 2, 2010

Back to the Future at the Storefront Library

I wish library I can buy book.

I wish we had a permanent library.

I wish to be happy and proud of my accomplishments.

In the window of the Chinatown Storefront Library in Boston stood a Wish Tree. Modeled after Yoko Ono's Wish Tree Project, the tree was meant to allow patrons to pass on a spirit of energy and hope. The instructions were:
Make a wish. Write it down on a piece of paper. Fold it and tie it around a branch of a wish tree. Ask your friend to do the same. Keep wishing until the branches are covered.
The Chinatown Storefront Library closed its doors on January 17, 2010, the Sunday that ALA Midwinter was in town. Always meant to be a temporary library, the Storefront Library was an expression by Boston's Chinatown community of its need and support for a library of its own. The Chinatown neighborhood of Boston has been without a branch of the Boston Public Library since 1956, when the branch was closed and demolished to make way for a highway.

Without a local branch, Chinatown residents needing library services have to go to the main library in Copley Square, which, though a beautiful building, may seem rather imposing and hard to navigate for someone looking for Chinese language materials.

The founders of the Chinatown Storefront Library, Sam and Leslie Davol, had been involved in community meetings surrounding the proposed design and construction of a new branch of Boston Public Library, and in that process had gotten to know faculty at Harvard's Graduate School of Design. With a new branch on hold for budgetary reasons, the Davols decided to take action. A local developer offered to let them use a vacant storefront for free. Design students made some gorgeous, modernistic shelving pieces for the library, enabling it to create an inviting environment in a bare commercial space. Library students from Simmons paired with Cantonese- and Mandarin-speaking community volunteers to staff the facility. Donations of over 5,000 books were solicited, and for twelve weeks, a community library came into existence. The operating budget for the entire project was about $10,000.

The day before the closing, I had a chance to tour the Storefront Library and sit down with Sam Davol. Formerly a legal-aid lawyer in New York, he and his wife moved back to Boston with their two children, partly so that Sam could devote more time to music. The Library project was an outgrowth of their involvement in the community and other cultural programming they've produced.

In just a few short months, the Storefront Library has had a clear impact on its neighborhood. People who used to avoid the block because of its vacant, spooky feel began to feel welcomed by the activity surrounding the library. Cultural activities, language classes and storytimes attracted people from the community and passersby.

Initially, the Storefront Library did not plan to circulate books, but in the first week of operation patrons told them that they really wanted to take books home with them. A makeshift paper-based circulation system was implemented, and 1,374 books were circulated in 11 weeks of operation, over half of them in Chinese. Over 4,000 books were catalogued using LibraryThing.

In talking to librarians in general about the storefront library concept, I've gotten a consistent reaction that small storefront spaces could not offer sufficient room to provide internet access; terminals take up more room than books. At the Storefront Library, the computers tended to be lightly used. When I was there, some older gentlemen were reading newspapers, some children were reading books, but no one was using the computers or internet access. This could be because the Library did not subscribe to electronic resources.

I think the most important lesson that can be learned from the Storefront Library experiment is that even small temporary libraries can be powerful agents of community development. In Boston, this role was accentuated by a location in close proximity to people's everyday lives. While I've written that the future of public libraries may be in smaller locations, the Chinatown Storefront Library reminded me that many public libraries began as grassroots efforts to promote knowledge and culture.

Now that the Storefront Library has closed, its books will be going to a new reading room, to local schools, and a few to the Chinese Historical Society of New England. The furniture will be going to local schools and daycare facilities. Information about the project will be published on the website so that similar projects in other communities can learn from their experiences.

As for Sam Davol, he goes on tour. He plays cello with the indie-pop band "The Magnetic Fields", which has a new CD out, Realism. I just got my tickets for one of the shows at New York's Town Hall in March.
I wish there were more people experimenting with libraries.
