Tuesday, September 29, 2009

It Shall Be Unlawful to Sell Non-accessible Books

Here's an assertion that a few years ago would have been preposterous. Today, it's still pretty far out on the fringe. In ten years I think it will be obvious:
It shall be unlawful to sell books in the United States without making accessible versions available on the same terms.
The reason that this assertion has not been true in the past and may well be true in the future is the continuing progress in e-reader and digitization technology, combined with the structure of the Americans with Disabilities Act (ADA). To review, the ADA forbids discrimination against the disabled, and includes the following in its definition of discrimination:
a failure to take such steps as may be necessary to ensure that no individual with a disability is excluded, denied services, segregated or otherwise treated differently than other individuals because of the absence of auxiliary aids and services, unless the entity can demonstrate that taking such steps would fundamentally alter the nature of the good, service, facility, privilege, advantage, or accommodation being offered or would result in an undue burden;
Offering a book for sale without offering an accessible version can be considered discriminatory only if
  1. Making a book accessible would not "fundamentally alter its nature" and
  2. Making a book accessible would not pose an "undue burden" on the entity offering the book for sale.
Let's look at both of those points a bit.

Today, most people think of digitized books as being rather different things from the print products. The way you use them is different, the way you acquire them is different, and the way they are produced is different. But for many people, a threshold has already been passed, in part due to the adoption of reader devices. I used to wake up to the sound of my early-rising wife flipping the pages of a novel; now it's the click of a button on her Kindle.

Think of scholarly journals 10 years ago. They were printed things that libraries put on their shelves. E-journals were a different thing entirely, and though most of the important publishers were focused on moving their services online, revenue streams were dominated by print. Ten years later, there is nothing remarkable about having a digital version of a print journal, and while libraries are still considered daring when they choose to do away with print subscriptions entirely, there are few libraries not considering that as a possibility. Many journal publishers now have digital revenue larger than print, and almost all of them view the future as being purely digital.

I don't think it's a huge stretch to think that in 10 years, it will seem natural to think of the digitized book as having fundamentally the same nature as a printed book.

Today, it may no longer be an undue burden to require a producer of a printed book to also produce a digitized (accessible) book. Ten years ago, the cost to digitize a book would have been roughly $10/page. The digitization and conversion process has since become dramatically less expensive. Brewster Kahle estimates the Internet Archive's cost for book scanning at about 10 cents/page, and although you'd also need to add costs for an accessible book delivery system, it's no longer prohibitively expensive. Imagine how cheap it will be to digitize and deliver an accessible book ten years from now!
Already, publishers are making accessible versions of their most popular books available through organizations that provide services for the reading disabled. For example, even though J. K. Rowling refuses to allow the Harry Potter series to be sold as ebooks, accessible ebook versions ARE available through Bookshare and Accessible Book Collection.

If you accept the premise that book publishers will at some point be forced to deeply confront the requirements of the Americans with Disabilities Act, many features of the proposed (and currently in limbo) Google Book Search Settlement Agreement make a huge amount of sense. As part of the agreement, Google would have been required to develop a service providing access to a huge collection of digitized books for the reading disabled. This would be simultaneously a huge benefit for the disabled and a benefit for booksellers, publishers and libraries, who would most likely be spared the expense and burden of deep ADA compliance. Google has downplayed the notion that the Settlement Agreement would shape the future of the book, but it seems to me that it would most certainly have shaped the future of the accessible book.

The critical importance of ADA compliance for the Google Book Search Agreement was confirmed by the Statement of Interest (pdf) filed by the US Department of Justice (DOJ). After dealing with its concerns about antitrust and sufficiency of class notice and representation, DOJ emphasizes its interest in accessibility:
First, the parties have sought to maximize accessibility in a way that will provide individuals with print disabilities with a substantially similar experience as users without print disabilities. In the Proposed Settlement, Google has committed to providing accessible formats and comparable user experience to individuals with print disabilities – and if these goals are not realized within five years of the agreement, Google will be required to locate an alternative provider who can accomplish these accommodations. Along with many in the disability community, the United States strongly supports such provisions.

Second, given the nature of the digital library the Proposed Settlement seeks to create, the United States believes that, if the settlement is ultimately approved, data provided should be available in multiple, standard, open formats supported by a wide variety of different applications, devices, and screens. Once these books are digitized, the format in which they are made available should not be a bottleneck for innovation. Google has made clear in the past that it started this project on the premise that anyone, anywhere, anytime should have the tools to explore the great works of history and culture. However the Proposed Settlement is modified by the parties, this approach should continue to be at its heart.
The DOJ's second point is somewhat puzzling unless you consider technical issues surrounding accessibility. Disabilities come in many varieties, and it is unlikely that a single accessibility service would meet the needs of every reading disabled user. A book that is accessible to a dyslexic would not be accessible to a blind person; only through the use of open, published formats can we be assured that there will be a route to accessibility for every disabled user.

The DOJ expressed concerns about other aspects of the agreement, but unlike many opponents of the agreement, it urged the parties not to lose momentum towards reaching a workable agreement for lawful, large scale digitization of books. The Authors, Publishers, and Google appear to be working to do just that; while they certainly are trying to live up to Google's motto, "Don't Be Evil", there are also laws encouraging them to do the right thing.

Forty years ago, there was no such thing as an ebook. Soon, all books may need to have ebook versions. It's another example of how, over the short term, things change more slowly than you expect, while over the long term, things change more than you can imagine. (If you know who said that, please leave a comment!)

Friday, September 25, 2009

A Reading Miracle. It May Be Legal, but Don't Ask the Grox

Over the summer I witnessed a miracle.

Do you know which book was the first you ever read on your own? I'm not sure about mine, but my younger brother's first was definitely Go, Dog. Go! by P. D. Eastman. He was under 4 years old when he started reading it. If you haven't read it, Go, Dog. Go! is a 62-page multiculturalist masterpiece with engaging illustrations, first published in 1961. Here is the complete text of pages 3-9:
Dog. Big dog. Little dog. Big dogs and little dogs. Black dogs and white dogs. "Hello!" "Hello!" "Do you like my hat?" "I do not." "Good-by!" "Good-by!"
This summer, I witnessed my own son "reading" his first "book". It wasn't written by a single author and it wasn't published by Random House. It wasn't printed on paper, and it wasn't even what we might call an "e-book". It was a website devoted to the game "Spore" that currently consists of 3,819 articles written by website users, and over the course of the summer, my son read a majority of those articles. Here is a sample passage:
The Grox are a sentient species of cyborg aliens generally considered to be the most evil and hostile in the galaxy. They are most notable for their evil and hostility, but are also notable for their asymmetric, weak impish appearance.
Needless to say, my son's outlook on the world and his ability to explore it have dramatically changed.

This miracle was made possible by text-to-speech (TTS) software. You see, my son has a disability that makes normal reading excruciatingly difficult for him. Through a great deal of work, and some considerable courage, he is now able to read, with great effort, printed sentences and short paragraphs on his own. But as a bright 11-year-old sixth grader, books like Go, Dog. Go! and others that pose little reading difficulty hold little interest for him, and so he won't read the words in printed books on his own. He likes computers, though. At the beginning of the summer, I showed my son how to activate the text-to-speech features of his Mac. Mac OS X has text-to-speech capabilities built in, and because Mac applications are built using standardized text display objects with hooks that allow access to the system TTS services, there's a uniform, cross-application way to have text spoken. (In contrast, TTS on Windows Vista is almost useless!) Similarly, Wiki-based websites present content in uniform ways that made it easy for my son to interact with text.

I was amazed by the way my son began to devour the content that interested him. Every day after coming home from camp, he would spend hours staring at the screen and listening to the Mac's robotic voice speak the text to him. Then he would watch some YouTube videos and play some Spore. I realized that TTS had given my son a way to fully satisfy, for the first time in his life, his hunger for information.

People who see miracles tend to develop intense beliefs. I am no exception. I am no longer an objective observer of digital copyright issues when they relate to access by the reading disabled. When I want to feel some anger (it helps me run faster) I think about people and institutions who try to use copyright law in ways that prevent people like my son from being able to read what they want to read.

After having moral imperatives made clear to me, I've spent some time learning about the relevant technology and laws, and I find that these include many of the issues I've been working on and learning about. For example, last year, before I started paying attention, Amazon faced criticism from authors and publishers who argued that text to speech on the Kindle DX constituted a performance that Amazon did not have the rights to deliver. Could publishers similarly enjoin Apple from allowing my son to use its TTS on copyrighted material? With my new perspective, I cannot talk about this without fuming at the blatant immorality of some of the arguments being made.

When Amazon relented, Random House (publisher of Go, Dog. Go!) asked Amazon to turn off text-to-speech on the Kindle DX for its books, which sparked considerable controversy. This led the National Federation of the Blind and American Council of the Blind to file a discrimination lawsuit against Arizona State University, which intended to test the Kindle DX as a means of distributing textbooks. The basis of this lawsuit is the Americans with Disabilities Act (ADA), which bars discrimination against people with disabilities in any public accommodation, a term which would include libraries and bookstores. The ADA has been used to force e-commerce websites to make their websites accessible to people with disabilities.

Unfortunately, the laws on accommodating disabled users have not kept up with changing technology. In 1996, the "Chafee Amendment" changed US Copyright law to allow "authorized entities" to make reproductions of previously published nondramatic literary works for the purpose of producing formats used exclusively by the disabled. Unfortunately, the possibility that all the world's books might someday be digitized and thus made available to those with reading disabilities was remote at that time. As a result, the amendment's language is ambiguous enough that the Association of American Publishers was able to argue that the Chafee Amendment could not be used by libraries to help them comply with the ADA. Luckily, organizations like Benetech and its Bookshare website are working with publishers to get around this sort of conflict. I hope my son will be able to read the books he needs to read through Bookshare.

It's my considered opinion that the Google Book Search digitization project has created the potential for a direct collision between book publishers and the ADA, and that this prospect has played a significant role in shaping the controversial agreement to settle the publishers' and authors' lawsuit against Google, but that's a topic for another article.

Now that I know a bit more about the potential legal obstacles to my son's reading, I'm wondering what I should be doing to make sure those obstacles disappear. I'm still hoping to see more reading miracles. "Good-by!"

Thursday, September 24, 2009

Nambu Gets Better and Shortener User Tracking is Undermined

When I last wrote about the tribulations of tr.im and the business of bit.ly, our heroes had just stepped away from their nose-to-nose struggle, with Nambu founder Eric Woodward having announced the shut-down of tr.im, his URL shortener, only to vow its revival a few days later. Bit.ly, with a cozy relationship with Twitter, seemed to have taken a dominant position in the URL shortening business, whatever that turned out to be. I speculated that Bit.ly would use its position in the ocean of usage data to build psychographic profiles of users to help target advertising.

Since then, Woodward decided not to sell the tr.im business and has instead released the tr.im software as free open source, making it that much easier for websites to do their own URL shortening. He's also focused his company's attention on its Nambu Twitter clients (Mac and iPhone), the development of which suffered a major setback when his Chinese developers left for richer opportunities as soon as their contract was up. Nambu for Mac OS X has been my preferred Twitter client; it has a much more Mac-like user interface than others I've tried. When I updated my system to Snow Leopard last week, I was disappointed to find that Nambu had not survived the system update.

After unhappily revisiting Tweetdeck, I decided to try the beta version of Nambu, even though it's described as being not quite done. So far, it looks pretty solid. One change in particular pleased me, and that's the way the new Nambu works with URL shorteners. It seems that by surrendering URL shortening to bit.ly, Nambu is now freer to innovate in the user experience. Nambu now pre-expands all the shortened links so that the user can see the hostname that the links are pointing to. This has a number of consequences:
  1. The user can tell where a link will go. This will help avoid wasted clicks, and will help the user avoid spam and malware sites.
  2. Because all of the links are dereferenced before use, the URL shortening sites will no longer be able to track the user's reading preference. The business model I previously suggested for bit.ly will be defeated, and the user's reading privacy will be protected.
  3. The URL shortener will have to deal with an increased load. Nambu's going to make bit.ly work harder for the privilege of dominating the URL shortening space.
Now I understand why bit.ly has been registering a bunch of instantaneous hits whenever I tweeted a link: it was robot agents, not people, that were clicking the links.

I was curious to see if Nambu was querying the URL shorteners directly or whether Nambu was trying to aggregate and cache the expanded links. I installed a nifty program called "Little Snitch" to see the outbound connections being made by programs on my laptop. It turns out that Nambu is doing a direct check for redirection on ALL of the links that it shows me, not just the shortened ones. Although this could break links that are routed as part of a redirect chain, I imagine that sort of link occurs rarely in a Twitter stream.
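
A direct check for redirection can be sketched in a few lines of Python. This is only an illustration of the general technique, not Nambu's actual code (which I haven't seen): the function names are made up, and it assumes the client sends a single HEAD request and reads the Location header without following it.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# HTTP status codes that announce a redirect via a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(url, status, location):
    """Return the absolute URL a response redirects to, or None if it does not redirect."""
    if status in REDIRECT_CODES and location:
        return urljoin(url, location)  # a Location header may be relative
    return None


def check_redirect(url, timeout=5.0):
    """Send one HEAD request and report the redirect target without following it."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return redirect_target(url, response.status, response.getheader("Location"))
    finally:
        conn.close()


def display_host(url, status, location):
    """Hostname to show the user: the redirect target's if there is one, else the link's own."""
    return urlsplit(redirect_target(url, status, location) or url).hostname
```

Note that this sketch checks only a single hop, so for a link that passes through several redirectors it would report the second server in the chain rather than the final destination.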

The new behavior of Nambu and its effects on usage tracking point up a general problem faced by any system designed to measure and track internet usage. In my post on "bowerbird privacy", I mentioned that I use StatCounter to measure usage on this blog. StatCounter works quite well for now, but I imagine that its methods (based on javascript) might well stop working so well as web client technology evolves. That's one reason I expect that efforts to standardize measurements of usage in the publishing community, such as Projects "COUNTER" and "USAGE FACTOR", are doomed to rapid obsolescence.

Will bit.ly ever get a business model? Will Nambu find peace with the chilly kitty? Find out in next month's installment of... As th URL Trns

Saturday, September 19, 2009

In Which Judge Denny Chin Becomes an Orphan Works Hero

In all the commentary I've read about the Google Book Search Settlement, I don't think I've seen a single article that tries to examine the issues from the point of view of the Judge who has to either approve or reject the agreement. Since the news that Judge Denny Chin is up for promotion to the Court of Appeals broke last week (and who knows, in a few years, maybe to the Supreme Court) I've tried to imagine what Judge Chin's perspective might be on the Google case, especially in the context of a possible confirmation hearing. However, since I'm not a lawyer, my imaginings have focused on political angles rather than the law. Certainly, if he could come up with a way to solve the orphan works problem in the Google Books case without messing with the copyright law morass, he would emerge a hero.

To review briefly, orphan works are those that are still under copyright, but whose rightsholders cannot be found. As a result, their works cannot, under the law, be copied for most purposes, because no rightsholder can be found to authorize the use. As part of its Book Search Service, Google digitized millions of books in libraries, and was subsequently sued for copyright infringement by the Authors Guild and the Association of American Publishers. In the proposed agreement to settle the lawsuit, Google agrees to pay rightsholders a sum of money for the release of copyright infringement claims; the agreement also authorizes Google to make use of the digitized books in new services and share revenue with rightsholders. Since the rightsholders to orphan works can't be found, the money they would be owed will be held for a period of time in case the rightsholders appear, after which it would be distributed to charities or to other rightsholders, depending on the source.

The treatment of orphan works in the settlement agreement has led to considerable controversy. As I discussed in a previous post, critics have pointed out that the settlement agreement would give Google an exclusive right to use orphan works in ways that would normally require the approval of a rightsholder. This sort of arrangement could result in antitrust issues, international treaty issues, and also seems to change copyright law in ways that have led the Register of Copyrights to express concern that it is impinging on an issue that should properly be the domain of Congress.

This last concern should be the most worrisome to Judge Chin. The Congress has to approve appointments to the federal bench, and the recent preferences of both parties have been for judges who "interpret the law" rather than "legislate from the bench". A ruling on copyright from Judge Chin that was interpreted as creating new law could doom Judge Chin's chances of being confirmed. Rulings grounded in Antitrust Law or in International Treaties could be politically controversial no matter which way they went; Americans don't like to be told what they can or can't do based on international treaties or doctrines that baffle them. Rulings based in copyright law are no safer. Try to tell a typical American that Google can't let them read a book because some 120-year-old rightsholder inconsiderately died 50 years ago without giving Google permission, and they'll look at you as if you were a lunatic. For these reasons I imagine that Judge Chin might look for a clear and uncontroversial way to avoid ruling on issues that he doesn't have to rule on.

To make his situation worse, the Department of Justice submitted its "statement of interest" last night. (For a play-by-play, read Danny Sullivan's summary, then read the expert color commentary by James Grimmelmann.) The DOJ's brief displays all of the annoying competence and adroit politics that we've come to expect from the Obama administration. Grimmelmann calls it "a really, really good brief," and while I agree with that assessment, I can't imagine that it's much help to Judge Chin, apart from a few very helpful footnotes and citations (according to Grimmelmann). It is a sweeping brief which examines all the core issues in the agreement, not just those, such as antitrust, where the department's position as enforcer of the law gives it a clear role in the case. On every issue it addresses, it makes excellent sense, but when you step back and look at the filing as a whole, it doesn't do what it says it does. It first praises the benefits that would accrue from the Settlement, then recommends its rejection. It next purports to lay out a course of modifications which would allow the agreement to proceed. But when you look closer at the recommendations, it's hard to chart out a realistic chain of events where they would lead to a judge's approval any time in the near future. It's as if the DOJ had intervened in the Barry Bonds case and said "it's important to our beloved national pastime and to baseball fans around the world that Barry be allowed to continue his assault on the record books, and we don't want this steroids matter to take away his momentum, but we feel that to comply with the law, the Court should consider stripping away any of Barry's muscles that might have been chemically enhanced."

As an example, the DOJ has suggested that several classes of rightsholders have not been adequately represented in the settlement, and points to orphan rightsholders and foreign rightsholders as examples. Now imagine inserting two more legal teams into the negotiations. That authors and publishers were able to join in a single agreement strikes me as being a minor miracle; it goes against my scientific training to expect additional miracles. At the very least, we can expect the additional legal fees to eat up more of any settlement.

The DOJ wants Judge Chin to do a lot of searching. It says the "Court should undertake a particularly searching analysis to ensure that the requirements of Federal Rule of Civil Procedure 23 (“Rule 23”) are met and that the settlement is consistent with copyright law and antitrust law". It also thinks the "Court should engage in a careful and searching examination of the Proposed Settlement and any revised version that may be submitted." And of course, "it believes the Court should undertake a searching inquiry to ensure both that a sufficient number of class members will be reached and that the notice provided gives a complete picture of the broad scope of the Proposed Settlement." If I were Judge Chin, I would be a bit put off by the suggestion that I need some other branch of government to tell me that I need to do a good job of analyzing the case; since I'm not Judge Chin, I just wonder how he would do all this searching without access to Google.

The objections to the agreement filed by several State Attorneys General take a rather different approach. (Here's the pdf of the Objection from Connecticut) Instead of filing sweeping briefs, they address specific concerns germane to the states. They lay out a surprisingly simple argument:
  1. Orphan rights are abandoned property.
  2. State laws, not Federal laws, govern abandoned property, often giving the state custody/ownership.
  3. States have sovereign immunity and cannot be subject to a federal court's order.
Thus orphan rights cannot be governed by the settlement agreement approved by a federal court.

What I think Judge Chin might be able to do is rule that the states are the only ones with the right to manage the rights to orphan works. (I am not a lawyer, so take this with a grain of salt.) After all, the 10th Amendment of the US Constitution states
The powers not delegated to the United States by the Constitution, nor prohibited by it to the States, are reserved to the States respectively, or to the people.
The right to grant Copyrights is one of the powers given to the federal government, but inheritance of those rights is governed by state laws. While this sort of ruling would probably be a rejection of the settlement agreement, it would instruct Google and the States that they could go ahead and negotiate agreements for the use of orphan works. The States could set up an "Orphan Works Registry" and empower it to make deals with companies other than Google.

This sort of ruling would only lead to a solution to the orphan works problem if the states are more eager to address the problem than Congress has been. Realistically, in 2009, Congress is fully occupied with 2 wars, a global financial crisis and figuring out how to solve health care. The states, on the other hand, are mostly trying to figure out how to close huge budget gaps; I imagine that most State Attorneys General (and state legislatures, where applicable) would love to be able to deliver both the money and the benefits that would accrue from a non-exclusive deal with Google. Another advantage of a state-run rights registry is that it might avoid some of the liability for errors that a privately run registry would have. Or this might be just an ill-informed fantasy of mine.

If Judge Chin follows DOJ's course, the big question is whether Google would be willing to continue pursuit of an agreement. Will Google think it could live with the costs and risks of the opt-in mechanisms supported by DOJ? Maybe I'll try to address that question in a future post.

Wednesday, September 16, 2009

The Redirector Chain Mashup Design Pattern

The most exciting things to learn are the ones that you already knew, but didn't know that you knew. An example for me was when I learned about design patterns in the context of Java programming. In an instant, I had both a framework for thinking about solutions I had figured out on my own, and a mechanism for discovery of solutions ready to be reused. You've surely encountered design patterns in one form or another, though perhaps you haven't thought of them as such. For example, the inverted pyramid and the Five Paragraph Essay are design patterns for writing. In software development there are many more design patterns than there are in writing, which is why the concept is so useful.

A post by Owen Stephens about managing link persistence using OpenURL got me thinking about design patterns used for the composition of web services. Most developers of web services tend to think in terms of container-oriented solutions to composed services, or to use another term, mash-ups. The container can be either server side or client-side and may be composing data, software services, or neither. In a post more than two years old, Alex Barnett enumerated 5 different design patterns for mash-ups. Missing from that list is what we might call the redirection chain pattern, of which Owen Stephens' proposed service for the Telstar Project would be an example.

In a redirection chain, a user traverses a link to one or more servers that redirect the user to a different address. At each step in the chain, services can be performed or functions added. In Owen's post, the service being contemplated is link maintenance. In most cases, the user will be redirected to a target URL embedded in the link. If for some reason the URL no longer works, a new URL can be provided to the user. For example, suppose a website goes out of business and the expired domain name is taken by a pornography site. The redirector provides a single point of maintenance for the link. Website publishers routinely use this type of redirector to enable them to move content around without breaking links. In the world of scholarly publishing, Crossref has provided an invaluable service of this type that enables doi-based e-journal links to continue to work in the face of publisher mergers, acquisitions, migrations and bankruptcy. PURL.org provides a similar service aimed at archives in libraries. The use of a NISO OpenURL link would provide a standardized way to add metadata to the link, and would allow an easy way to mash up, using the redirector design pattern, the URL maintenance redirector with a link-to-full-text redirector service used by the library.
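
To make the link maintenance idea concrete, here is a minimal sketch in Python. It is not the Telstar design; the `url` query parameter, the hostnames, and the `OVERRIDES` table are all assumptions made up for illustration. The redirector normally sends the user to the target embedded in the link, and consults a maintenance table only when that target is known to have gone bad.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical maintenance table: dead or hijacked targets mapped to replacements.
OVERRIDES = {
    "http://defunct.example.com/article/42":
        "http://archive.example.org/defunct/article/42",
}


def resolve(redirector_link, overrides=OVERRIDES):
    """Return (status, location) for a link like http://r.example.net/go?url=<target>."""
    query = parse_qs(urlsplit(redirector_link).query)
    targets = query.get("url")
    if not targets:
        return 400, None  # no embedded target: nothing to redirect to
    target = targets[0]
    # Normal case: pass the embedded target through unchanged.
    # Maintenance case: substitute the replacement recorded for a broken target.
    return 302, overrides.get(target, target)
```

Every link that was ever published keeps working as long as the one table is kept up to date, which is the "single point of maintenance" property.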

In the years that I've worked on linking technology, I've come across a considerable diversity of redirector based services, but I don't think I've ever seen a list of things they are used for. So here goes:
  1. Session initiation. If a website needs to maintain state, visitors may need to first acquire a session. This is very often done by greeting a new visitor with a redirect that carries a cookie, or if that doesn't work, a session token in the link.
  2. Referrer tracking. Many web services need to keep track of the source of visitors, for example, in the context of an affiliate marketing program such as Amazon Associates. (Take a look at the link to Amazon Associates if you want an example of an affiliate link)
  3. Customized resolution.
    • In the world of libraries, this is referred to as solving the "appropriate copy problem", and almost all research libraries today make use of specialized redirectors that handle links conforming to the OpenURL standard mentioned above. Libraries subscribe to many electronic resources, and library patrons that want a particular article need to be directed to the one resource among many that the library has subscribed to.
    • An internet business may want to deliver different resources depending on where in the world the user is. A good example of this is the GeoDirection service provided by GeoBytes. A business with a global presence might need to do this to comply with local laws.
    • Language customization. Websites are often maintained in multiple languages. A language redirector might inspect Accept-Language headers and redirect the user to a language-appropriate service.
  4. User authentication. Almost all modern single-sign-on user authentication systems employ redirection in some form. For example, in the Shibboleth system, an unauthenticated user is redirected to a "where are you from" (WAYF) service that then redirects to an authentication form, which then adds an authentication token to the URL and redirects the user again back to the original target resource.
  5. URL shortening. I've previously written about tr.im and bit.ly and the challenges for the URL shortener businesses. In brief, the advent of Twitter has created a need for short URLs.
  6. Usage tracking. Although I became aware of tr.im and bit.ly from their use as short URLs, the reason I continue to use them is that they provide a handy way to see if anyone is clicking on the links. Similarly, libraries that have deployed OpenURL link servers are finding that the usage logs they generate provide invaluable information about the usage of digital collections.
  7. User tracking. This is really the same as the previous use, except with a different focus. Advertising networks often serve ads through redirectors and try to deliver the most relevant ads depending on what they know about the user.
  8. Agent based metadata delivery. This is an application that has been exploited less often for good than for deception. You may have heard of "cloaking" which is the practice of providing keyword-filled pages to search engines so that they don't find out that a website is just a bunch of advertisements. A similar practice is recommended by the W3C in the context of Semantic Web metadata for "things".
  9. Link enhancement. Often, a redirector will be configured to change the format of a link or to add information into a link. An example of a service that does this is OCLC's LibraryLookup service. It not only translates a simple ISBN-based link into something a library catalog can understand, it also adds alternate ISBNs into the link.
  10. Link Framing. Some redirectors put a frame around the linked content. This can be done to provide the user a path back to the referrer or to present more services (and advertising) to the user.
Looking over this list, we can see that there are 3 types of things that redirectors can do.
  1. Dynamic routing.
  2. Data collection.
  3. Link enhancement.
Many redirectors perform more than one of these functions.
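To make the "dynamic routing" function concrete, here's a minimal sketch of the language redirector from the list above. The site table, the example.org URLs, and the function names are all made up for illustration; a real redirector would also have to deal with wildcards and regional variants more carefully.

```python
def parse_accept_language(header):
    """Return the language tags in an Accept-Language header, best first."""
    choices = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0                      # no q-value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0              # treat a malformed q-value as "not wanted"
        if quality <= 0:
            continue                       # q=0 means the language is unacceptable
        # Sort by descending quality, then by original position in the header.
        choices.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(choices)]


def language_redirect(header, sites, default):
    """Pick the redirect target for the first preferred language we serve."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]        # "fr-ca" falls back to "fr"
        if primary in sites:
            return sites[primary]
    return sites[default]


# Hypothetical table of language-specific services.
SITES = {
    "en": "https://example.org/en/",
    "fr": "https://example.org/fr/",
}

target = language_redirect("fr-CA,fr;q=0.9,en;q=0.8", SITES, default="en")
```

The redirector would then answer with a redirect to `target`; collecting data about the request or enhancing the link could happen in the same step.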

In a redirector chain mashup, one redirector points to a second, which points to a third, etc. These mashups are often composed on an ad hoc basis, or even inadvertently. For example, if you think that an OpenURL link is too horribly long and ugly to put in an email, you might consider shortening it with a shortening redirector. On clicking the link, the recipient of your email visits the library link server, which might repoint the link to a redirecting proxy server to check if the user is locally authenticated. The proxy server may in turn redirect the user to doi.org's global redirector, which will next point the user to a publisher's linking hub, which finally redirects the user to the full text content server. The distributed and minimally coordinated development and deployment allowed by the redirector chain is perhaps its greatest advantage.

The serial invocation of redirectors is the design pattern's main weakness. The processing times for each of the composed services add together and multiply transit time latencies. There is also an impact on reliability, as failure of any redirector component will result in a failure of the link. Similarly, the throughput of the composed service will be equal to that of the component with the lowest throughput. In most cases the technical performance and reliability of redirector services are not a big issue compared to institutional issues. In Owen Stephens's example, building and deploying a redirector is not hard, but assuring that the institution providing the redirector will be willing to continue to do so for the life of the deployed links is probably above his pay grade.
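The arithmetic behind those three claims is simple enough to sketch. The numbers below are invented purely for illustration (they are not measurements of any real service), and the availability calculation assumes that the hops fail independently of one another.

```python
import math

# Hypothetical figures for the five hops in the chain described above.
CHAIN = [
    {"name": "shortener",     "latency_ms": 40,  "availability": 0.999,  "throughput_rps": 5000},
    {"name": "link server",   "latency_ms": 120, "availability": 0.995,  "throughput_rps": 200},
    {"name": "proxy",         "latency_ms": 60,  "availability": 0.998,  "throughput_rps": 800},
    {"name": "DOI resolver",  "latency_ms": 80,  "availability": 0.9995, "throughput_rps": 3000},
    {"name": "publisher hub", "latency_ms": 150, "availability": 0.997,  "throughput_rps": 1000},
]


def chain_performance(hops):
    """Combine per-hop figures for redirectors that are invoked one after another."""
    return {
        # Every hop is visited in turn, so the delays add up.
        "latency_ms": sum(hop["latency_ms"] for hop in hops),
        # The link works only if every hop works (independent failures assumed).
        "availability": math.prod(hop["availability"] for hop in hops),
        # The slowest component caps the whole chain.
        "throughput_rps": min(hop["throughput_rps"] for hop in hops),
    }


summary = chain_performance(CHAIN)
```

With these made-up inputs the chain is slower than any single hop, less available than its least available hop, and no faster than the link server, which is the component with the lowest throughput.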

Using a standard format for redirection URLs could make it easier to swap one redirection service for another, partly addressing the institutional commitment issue. The idea of using the OpenURL standard for the URL persistence application seems promising, but as someone who served on the NISO committee that standardized OpenURL, I must admit that the existing standard falls a bit short of what would be needed for an application such as Telstar's. Still, there's not much competition. Proposals such as ARK, which is focused more on archives, might be worth a look. Although the developer of urlShort has called for a shortener standard, the shortening application has rather different constraints.

What sort of things might a broader redirector standard include? Here's my very short list:
  1. A standard query parameter for target urls. The most common format seems to be [baseurl]?url=[url] .
  2. A recommendation on what to do with that standard query parameter and any others that might be sent.
  3. Something about preventing loops.
Doesn't seem too hard.
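Here's a rough sketch of how those three items might fit together. Only the `url` parameter comes from existing practice; the `hops` counter, the limit of five, and the choice to pass other parameters through to the target are my own assumptions for illustration, not part of any standard.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TARGET_PARAM = "url"   # the common convention noted above
HOPS_PARAM = "hops"    # hypothetical loop-guard parameter (an assumption)
MAX_HOPS = 5           # arbitrary illustrative limit


def resolve(request_url):
    """Return the URL a redirector would send the user to, or None to refuse."""
    request = urlsplit(request_url)
    target, hops, extras = None, 0, []
    for name, value in parse_qsl(request.query, keep_blank_values=True):
        if name == TARGET_PARAM and target is None:
            target = value
        elif name == HOPS_PARAM:
            # An unreadable counter is treated as if the limit had been reached.
            hops = int(value) if value.isdigit() else MAX_HOPS
        else:
            extras.append((name, value))   # item 2: pass the others through
    if target is None:
        return None
    dest = urlsplit(target)
    if dest.scheme not in ("http", "https") or not dest.netloc:
        return None
    # Item 3, first guard: never redirect to ourselves.
    if (dest.netloc.lower(), dest.path) == (request.netloc.lower(), request.path):
        return None
    # Item 3, second guard: give up once the chain has grown too long.
    if hops >= MAX_HOPS:
        return None
    query = [(n, v) for n, v in parse_qsl(dest.query, keep_blank_values=True)
             if n != HOPS_PARAM]
    query += extras
    if any(name == TARGET_PARAM for name, _ in query):
        # The target looks like another redirector, so hand it the hop count.
        query.append((HOPS_PARAM, str(hops + 1)))
    return urlunsplit((dest.scheme, dest.netloc, dest.path,
                       urlencode(query), dest.fragment))


location = resolve("https://r.example/go?url=https%3A%2F%2Fexample.org%2Fpaper&ref=blog")
```

A hop counter only works if every redirector in the chain honors it, which is exactly the kind of thing a standard would have to spell out.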

I wonder if "developing a standard" could itself be considered a design pattern?

Friday, September 11, 2009

Public Identity and Bowerbird Privacy

My legal name is "Eric Sven Hellman". On Twitter, I'm using "gluejar". On Facebook, I took the username "eshellman", which I also use in a number of other places. For the most part, while I do try to separate my work from my personal life, I don't try to isolate my online identity from my "real world" identity. I used to use the identity "openly" in some circumstances for my work identity, but I sold that name as part of my previous company. My work identities can be easily connected to my personal identity, and as a result, the use of different identities affords me negligible privacy.

The very concept of privacy has changed a great deal over the last 20 years, in large part due to the internet-induced shrinkage of the world and the relentlessly growing power of large databases. Our traditional notions of privacy have had embedded within them an implicit equation of privacy with obscurity. Public documents with records of where we lived, what we owned, who we were married to and how much our house was worth would be available in the sense that anyone could go to a county registrar and get the information. If I walked to town, anyone who saw me and knew me would know where I was. In the not so distant future, it's not hard to imagine that an internet connected camera could see me, recognize my face, and post my whereabouts on the internet so that anyone searching for me on Google could discover my whereabouts. It doesn't really matter whether that happens through Linked Data or by discovering my GPS coordinates on Twitter. As any computer security expert will tell you, security-by-obscurity is ultimately doomed to failure, and I'm pretty sure that the same is true of privacy-by-obscurity.

Yesterday, Wired Magazine writer Evan Ratliff was found. As part of reporting an article about how hard it is for someone to "disappear" in the digital age, Wired had offered $5000 to anyone who could track down Ratliff during 30 days starting August 15. Ratliff's downfall was partly that he "followed" a vegan pizza restaurant in New Orleans on Twitter. It should not be surprising that Ratliff was found, given that a Facebook group with 1,000 members formed in an effort to track him down, so the relevance to our everyday privacy is a bit tenuous. Hollywood celebrities are only too aware that privacy retention is much harder for famous people.

Partly inspired by Ratliff's article, I decided to do a bit of investigation of my own. If you've been reading blogs around the topic of e-book technology, you probably have encountered posts by someone that signs posts with the name "bowerbird". Bowerbird's posts are always on-topic, but they are written with oddly short lines, as if bowerbird was typing on a 40 character-wide terminal. Bowerbird's posts are often impolite and sometimes really insulting, and in several forums, the posts have provoked complaints of trolling or that bowerbird is "hiding behind a pseudonym". Bowerbird replies that "bowerbird" is his real identity "in many versions of reality". It's clear that bowerbird is an iconoclast. When bowerbird posted a comment on one of my recent posts, I decided to see what I could find out about him or her.

It turns out that "bowerbird" is really the first name of "bowerbird intelligentleman". He has used this as his professional name since at least the late eighties. The name is written with lowercase letters in the manner of e e cummings, and Mr. intelligentleman, as the New York Times might refer to him, is a performance poet, among other things. In fact, he claims to have started performance poetry as an art form in 1987, and was an early promoter of "poetry jams". In a charming, self-deprecating bio (PDF), he writes "bowerbird is also one of the world's worst poetry producers" and describes how his forays into computer typesetting of poetry magazines led him into the world of electronic publishing and ebooks. He was very active on the Project Gutenberg volunteer discussion list, where his talent for provocation prompted Marcello Perathoner to cathartically excerpt a collection of his postings. Much of his energy in the ebook arena was spent promoting his ideas about "Zen Markup Language" (z.m.l.) whose philosophy can be summed up as "the best mark-up is no mark-up". The short line endings in bowerbird's posts appear to be his insistence on using z.m.l. for his posts. Or perhaps they're performance poetry. It's a cute idea, but personally, I find that the formatting makes the posts hard to read in their context.

When bowerbird posted his comment, he left digital footprints. He visited the blog on a link from LanguageLog. He lives in the Los Angeles area (he's posted elsewhere that he can be found in Santa Monica), uses Verizon DSL, and uses version 4.0 of Safari on the Mac as his browser. The blog uses statcounter.com to monitor usage, so a cookie has been placed in his browser so I can tell if he returns for a visit; bowerbird is able to control these cookies using privacy controls in Safari. DSL lines use a pool of IP addresses, so although I know the IP address he used, I can't use that IP address to persistently track him. However, StatCounter can follow him to other sites that use StatCounter. In principle, StatCounter could report his interest in my blog to other sites and perhaps even connect him to other identities he might have, which would bother me a lot and prompt me to stop using StatCounter.

What's interesting to me is that bowerbird has had an online public identity for over 20 years, and although his entire online life, warts and all, is open for examination (how many of us can say the same?) it appears as though he has successfully walled it off from his private life. Even if I go to the register of deeds in Santa Monica, I probably won't be able to discover whether he owns a house. I can't find out from fundrace if he has donated to a political candidate. I can find his cell phone number because he's chosen to post it, but I don't know anything he hasn't chosen to divulge. (He once owned 1-800-GET-POEM!) In the course of leading a poet's life, bowerbird has been living an experiment in public identity and privacy for 20 years!

I've previously written about the evolution and fluidity of personal names. The use of professional names for public identity is quite common in our society. Women who marry and take their husband's family name routinely retain their names professionally. Use of professional names is particularly common among authors, actors, and musicians. For them, the additional privacy afforded by the use of a professional name is particularly valuable. It strikes me that the separation and isolation of identities may become an essential privacy curtain even for people who aren't celebrities.

It's probably too late for me and most people of my generation. But "bowerbird privacy" could be a reasonable solution for the next generation. A significant number of my son's friends use Facebook under not-their-real-names, and I say more power to them. I think that privacy advocacy organizations should be working to put rules in place to prevent Facebook from enforcing its "only your real name" terms of service and prohibit companies such as Twitter and Google and Yahoo (and StatCounter) from working with ISPs to connect online identities with offline identities.

Nature's bowerbird gets its name from the bower, a structure that male bowerbirds construct to attract females. You might think of it as the bird's public identity. It's not clear why the females are attracted to the bower. Maybe it's privacy?


Saturday, September 5, 2009

RDF Properties on Magic Shelves

Book authors and politicians who go on talk shows, whether it's the Daily Show, Charlie Rose, Fresh Air, Oprah, Letterman, whatever, seem to preface almost every answer with the phrase "That's a really good question, (Jon|Teri|Stephen|Conan)". The Guest never says why it's a good question because the real meaning of that phrase is "Thanks for letting me hit one out of the ballpark." Talk shows have so little in common with baseball games or even tennis matches. On the rare occasion when a guest doesn't adhere to form, the video goes viral.

I've been promising to come back to my discussion of Martha Yee's questions on putting bibliographic data on the semantic web. Karen Coyle has managed to discuss all of them at least a little bit, so I'm picking and choosing just the ones that interest me. In this post, I want to talk about Martha's question #11:
Can a property have a property in RDF?
The rest of my post is divided into two parts. First, I will answer the question, then in the second part, I will discuss some of the reasons that it's a really good question.

Yes, a property can have a property in RDF. The W3C Recommendation entitled RDF Semantics states: "RDF does not impose any logical restrictions on the domains and ranges of properties; in particular, a property may be applied to itself." So not only can a property have a property in RDF, it can even use itself as a property!

OK, that's done with. Not only is the answer yes, but it's yes almost to the point of absurdity. Why would you ever want a property to be applied to itself? How can a hasColor property have a hasColor property? If you read and enjoyed Gödel, Escher, Bach, you're probably thinking that the only use for such a construct is to define a self-referential demonstration of Gödel's Incompleteness Theorem. But there actually are uses for properties which can be applied to themselves. For example, if you want to use RDF properties to define a schema, you probably want to have a "documentation" property, and certainly the documentation property should have its own documentation.

If you're starting to feel queasy about properties having properties, then you're starting to understand why Yee question 11 is a good one. Just when you think you understand the RDF model as being blobby entities connected by arcs, you find out that the arcs can have arcs. Our next question to consider is whether properties that have properties accomplish what someone with a library metadata background intends them to accomplish, and even if they do so, is it the right way to accomplish it?

In my previous post on the Yee questions, I pointed out that ontology development is a sort of programming. One of the most confusing concepts that beginning programmers have to burn into their brains is the difference between a class and a class instance. In the library world, there are some very similar concepts that have been folded up into a neat hierarchy in the FRBR model. Librarians are familiar with expressions of works that can be instantiated in multiple manifestations, each of which can be instantiated in multiple items. Each layer of this model is an example of the class/instance relationship that is so important for programmers to understand. This sort of thinking needs to be applied to our property-of-a-property question. Are we trying to apply a property to an instance of a property, or do we want to apply properties to property "classes"?

Here we need to start looking at examples, or else we will get hopelessly lost in abstraction-land. Martha's first example is a model where the dateOfPublication is a property of a publishedBy relationship. In this case, what we really want is a property instance from the class of publishedBy properties that we modify with a dateOfPublication property. Remember, there is a URI associated with the property piece of any RDF triple. If we were to simply hang a dateOfPublication on a globally defined publishedBy we would have made that modification for every item in our database using the publishedBy attribute. That's not what we want. Instead, for each publishedBy relation we want to assert, we need to create a new property, with a new URI, related to publishedBy using the RDF Schema property subPropertyOf.

Let's look at Martha's other example. She wants to attach a type to her variantTitle property to denote spine title, key title, etc. In this case, what we want to do is create global properties that retain variantTitleness while making the meaning of the metadata more specific. Ideally, we would create all our variant title properties ahead of time in our schema or ontology. As new cataloguing data entered our knowledgebase, our RDF reasoning machine would use that schema to infer that spineTitle is a variantTitle so that a search on variantTitle would automatically pick up the spineTitles.
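To see what that inference amounts to, here's a toy sketch that treats triples as plain Python tuples rather than using a real RDF library. The `ex:` names, the sample titles, and the helper functions are all made up for illustration; an actual reasoner does a great deal more than this.

```python
SUB_PROPERTY_OF = "rdfs:subPropertyOf"

# Schema: the variant title properties, defined ahead of time.
SCHEMA = {
    ("ex:spineTitle", SUB_PROPERTY_OF, "ex:variantTitle"),
    ("ex:keyTitle", SUB_PROPERTY_OF, "ex:variantTitle"),
}

# Cataloguing data (placeholder values, not real records).
DATA = {
    ("ex:book1", "ex:spineTitle", "A spine title"),
    ("ex:book1", "ex:keyTitle", "A key title"),
    ("ex:book2", "ex:variantTitle", "Some other variant title"),
}


def super_properties(prop, schema):
    """Return prop plus every property it is, directly or indirectly, a subproperty of."""
    found = {prop}
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in schema:
            if s == current and p == SUB_PROPERTY_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def infer(data, schema):
    """Add the triples implied by subPropertyOf: (s, p, o) also holds for p's parents."""
    inferred = set(data)
    for s, p, o in data:
        for parent in super_properties(p, schema):
            inferred.add((s, parent, o))
    return inferred


def search(triples, prop):
    """Return the sorted (subject, value) pairs asserted with the given property."""
    return sorted((s, o) for s, p, o in triples if p == prop)


before = search(DATA, "ex:variantTitle")
after = search(infer(DATA, SCHEMA), "ex:variantTitle")
```

Searching the raw data for variantTitle finds only the one title that was entered that way; after inference the spine title and key title show up too, without anyone having to re-enter them.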

Is making new properties by adding a property to a subproperty the right way to do things? In the second example, I would say yes. The new properties composed from other properties make the model more powerful, and allow the data expression to be simpler. In the first example, where a new property is composed for every assertion, I would say no. A better approach might be to make the publication event a subject entity with properties including dateOfPublication, publishedBy, publishedWhat, etc. The resulting model is simpler, flatter, and more clearly separates the model from the data.
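The difference between the two approaches in the first example shows up if you count the properties each one needs. This is again a toy sketch with triples as Python tuples; the `ex:` names, the numbering scheme for minted URIs, and the sample records are assumptions made for illustration.

```python
def per_assertion_property(book, publisher, date, n):
    """First approach: mint a new subproperty of publishedBy for every assertion."""
    prop = f"ex:publishedBy_{n}"
    return [
        (prop, "rdfs:subPropertyOf", "ex:publishedBy"),
        (prop, "ex:dateOfPublication", date),   # a property of a property
        (book, prop, publisher),
    ]


def publication_event(book, publisher, date, n):
    """Alternative: make the publication event a subject with ordinary properties."""
    event = f"ex:publication_{n}"
    return [
        (event, "ex:publishedWhat", book),
        (event, "ex:publishedBy", publisher),
        (event, "ex:dateOfPublication", date),
    ]


def predicates(triples):
    """Return the set of distinct properties used in a list of triples."""
    return {p for _, p, _ in triples}


# Placeholder records: (book, publisher, date).
RECORDS = [
    ("ex:book1", "ex:publisherA", "1999"),
    ("ex:book2", "ex:publisherA", "2004"),
    ("ex:book3", "ex:publisherB", "2009"),
]

minted, events = [], []
for n, (book, publisher, date) in enumerate(RECORDS):
    minted += per_assertion_property(book, publisher, date, n)
    events += publication_event(book, publisher, date, n)

minted_vocabulary = predicates(minted)   # grows with every record
event_vocabulary = predicates(events)    # stays the same size
```

Both versions use three triples per record, but the first one adds a property to the vocabulary every time a book is catalogued, while the event version gets by with the same three properties no matter how much data there is. That's what I mean by separating the model from the data.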

We can contrast the RDF approach of allowing new properties to be created and modified by other properties to that of MARC. MARC makes you put data in fields and subfields and subfields with modifiers, but the effect is sort of like having lots of dividers on lots of shelves on a bookcase- there's one place for each and every bit of data- unless there's no place. RDF is more like a magic shelf that allows things to be in several places at once and can expand to hold any number of things you want to put there.

"Thanks for having me, Martha, it's been a real pleasure."

Thursday, September 3, 2009

White Dielectric Substance in Library Metadata

A bad scientist assumes his data is pristine. A good scientist understands all the shit in her data. When I worked at Bell Labs, we listened repeatedly to Arno Penzias talk about his Nobel prize. In 1965, Arno and Robert Wilson were trying to understand the noise sources in radio signals received from communication satellites. Because they worked for the research arm of a big, rich monopoly company, they had big, expensive ultrasensitive microwave horn antennas to play with and the fanciest, most expensive electronics. But however much work they did, there was more noise in their signal than they wanted. Noise would limit the amount of data that could be transmitted, and therefore the amount of money that AT&T would be able to make from their communications satellites. At one point, Penzias and Wilson thought that the noise was coming from an unwanted "white dielectric substance" being deposited by birds that were perching on their antenna. They cleaned out the antenna, but the noise was still there. Diligent and focused work on problems related to the business of AT&T, we learned, could lead to great discoveries.

When I was a student at Princeton, we listened repeatedly to the story of how Bob Dicke and Jim Peebles tried to find the 3 degree background radiation predicted by George Gamow if the big bang theory were really true. They were unsuccessful, but when they happened to hear about the noise problem down the road at Bell Labs, they knew immediately what it really was. It was the afterglow left over from the very beginning moments of our universe, an unmistakable sign that the Big Bang had really occurred. Penzias and Wilson had worked so hard at understanding and eliminating the other noise in their radio receiver that they could confidently publish their findings, after being clued in by their Princeton colleagues. Penzias and Wilson eventually received the Nobel Prize in Physics that eluded Dicke and Peebles. Understanding the noise turned out to be more important than understanding the signal.

I was reminded of this story on reading the accounts of Friday afternoon's session at The Google Books Settlement and the Future of Information Access Conference. I attended in the morning, but due to family commitments I missed the entertainment in the afternoon, which was highlighted by Geoff Nunberg's report on mistakes in the metadata in Google Book Search. Reading Nunberg's blog post corresponding to the talk is very entertaining in a juvenile sort of way. The poor guy has been trying to use Google Books as a linguistic research corpus, and has discovered to his professed horror that there are all sorts of errors, many of them humorous, in its metadata.

I must now let you in on a closely held secret among library metadata technologists which due to the Google Books metadata fiasco must now be revealed to the general public. There is some crappy data in library catalogs. How much is an interesting question, and my ability to comment on how much is limited by confidentiality obligations. However, I am free to observe that studies have been published on the error rate in OpenURL linking. OpenURL linking usually depends on matching of metadata between a source metadata file and a target metadata file; errors in either file can cause a linking error. Reported error rates are in excess of 1%. In his response to Nunberg's blog post, Jon Orwant points out that a one in a million error occurs a million times if you have a trillion metadata items; my guess is that an error rate of one part per million may be overly optimistic by four orders of magnitude when applied to library metadata.
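For anyone who wants to check the arithmetic, here it is spelled out. The only inputs are the figures quoted above; the 1% rate is the reported OpenURL linking figure, which I'm using as a stand-in for library metadata in general, and that is my guess rather than a measurement.

```python
import math
from fractions import Fraction

ONE_IN_A_MILLION = Fraction(1, 10**6)   # Orwant's illustrative error rate
ONE_PERCENT = Fraction(1, 100)          # reported OpenURL linking error rate
TRILLION = 10**12                       # Orwant's count of metadata items


def expected_errors(items, error_rate):
    """Return the expected number of bad items at a given error rate."""
    return items * error_rate


orwant_errors = expected_errors(TRILLION, ONE_IN_A_MILLION)
one_percent_errors = expected_errors(TRILLION, ONE_PERCENT)

# How many powers of ten separate the two rates?
orders_of_magnitude = math.log10(ONE_PERCENT / ONE_IN_A_MILLION)
```

At one in a million, a trillion items yields a million errors, as Orwant says. At 1%, the same trillion items would contain ten billion errors, and the gap between the two rates is a factor of ten thousand, or four orders of magnitude.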

In my post on "collecting too much data", I wrote that a huge challenge of maintaining a large metadata database is battling entropy as the collection grows. I've observed that most people trying to collect metadata go through an early period of thinking it's easy, and then gradually gain understanding of the real challenges. Google has certainly been no exception to this pattern. When they first started dealing with book metadata, they were oblivious to the difficulties of maintaining a large metadata database. As Orwant's response to Nunberg shows, they are currently in the phase of understanding the true difficulties of what they need to do. They have most certainly become attuned to the importance of keeping track of the source (provenance) of their metadata, if for no other reason than to have someone to blame for the inevitable metadata stupidities. Much of the "Linked Data" crowd has yet to digest this lesson fully.

Nunberg's thesis is that Google Books will be the "Last Library" and that it would be a disaster for society if Google does a bad job of it. He does not consider the converse possibility. What if Google manages to do a better job of it than libraries have done? If that happens, all of the library world could be turned upside down. Existing metadata maintenance cooperatives would vanish overnight and libraries around the world would become dependent on Google's metadata prowess. Google would acquire a legal metadata monopoly through technical merit rather than through class action maneuvering. What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so? Is this even conceivable?

Arno Penzias would say YES.