Friday, August 28, 2009

Third Wheels on Class Action Coffee Search


Although I'm not a lawyer (IANAL), I've worked on a fair number of legal agreements. Typically, I would spend far more time working out an agreement than I ever spent consulting its terms afterwards. That's because it's much easier to work out the hard-core details of implementation without the lawyers reviewing everything.

In my post on privacy and Google Book Search, I spent a fair amount of time looking at what the settlement agreement said about security and privacy. From the perspective of a few weeks later, it looks to me like most of the hard-core details that matter are in fact not in there and will need to be worked out by the parties to the agreement.

This morning, I took some vacation from my vacation and went over to Berkeley to participate in a conference called "The Google Books Settlement and the Future of Information Access". I was struck by a comment that Jason Schultz, Associate Director of the Samuelson Law, Technology & Public Policy Clinic at U.C. Berkeley School of Law, made at least twice. He said it was important to get more assurances on privacy into the settlement agreement because "at least then we'll have something on paper that we can enforce". As I mentioned before, IANAL, so the prospect of having something on paper that I'd have to pay a lawyer lots of money to attempt to enforce doesn't have a lot of appeal to me. And I have questions.
  1. Can third parties enforce "pieces of paper"? Suppose Joe and Mary, instead of hiring a divorce lawyer, sign an agreement to have coffee together once a week at Starbucks. Can Starbucks sue them if one or both of them breaches the agreement?
  2. Can third parties sue to enforce a class action "piece of paper" that has been approved by a court? Suppose a class of baseball widows files suit against the class of Yankees ticket holders and gets a settlement that offers them a cup of coffee at Starbucks with their spouses once a week. Can Starbucks file suit to get better compliance with the agreement?
  3. Do opt outs affect the ability of class members to enforce the class action agreements? If the Yankee widows can opt out of the settlement, can they sue to get better coffee agreement compliance even if they opt out?
  4. If you are an author concerned about privacy, do you have more or less leverage on the devilish details if you opt in or opt out of the settlement agreement?
  5. Why am I so focused on coffee?

Friday, August 14, 2009

Tr.im's Brief Demise and the Privacy Implications of the Bit.ly Monopoly


In case you missed it, last Sunday the URL shortener Tr.im announced that it was going to close. Then, on Wednesday, it announced that it would keep the service open after all. I spent yesterday morning listening to the TechZing interview with Tr.im and Nambu founder Eric Woodward, which was interesting in a lot of ways. His perspective on having worked with a Chinese development team, though not relevant for this post, explains why development on Nambu (an OS X and iPhone Twitter client) has been slow in coming: the Chinese developers left to get rich making iPhone apps. What I was more interested in was the discussion of business models for URL shorteners in particular and the Twitter ecosystem in general. It was the lack of any plausible business model for Tr.im that led to the decision to close the service. According to Woodward, there are only three plausible business models for a legitimate URL shortener:
  1. you can charge users
  2. you can sell advertising
  3. you can sell data that you generate
and given that Bit.ly has an inside track with Twitter and offers everything to users for free, none of the three was going to work for Tr.im.

Part of the problem that Tr.im has experienced is that the cost of running a URL shortener scales with the amount of usage. (In a good web business, you have a high fixed cost and very sublinear scaling of cost with usage.) Woodward says that he has to spend an hour a day just dealing with spam, and this problem is getting worse. Why is spam a problem for URL shortening? It's because spammers (and presumably phishers, too) will use URL shorteners to hide links to porn, scams, malicious content, etc. Afflicted users will report the links to the URL shortener's hosting provider (Tr.im uses Rackspace), and the URL shortener will be shut down unless the spamming links are turned off. Another problem that afflicts popular URL shorteners is the problem of popular twitterers. When Ashton Kutcher or Shaquille O'Neal tweets a link to their millions of followers, the URL shortener can suddenly be hit with 10,000 hits per minute. A read-only website can handle this traffic easily by spreading service over multiple servers, but a URL shortener such as Tr.im, architected to generate dynamic usage data by writing to a MySQL database, needs beefy hardware to avoid getting overloaded.

In addition to the problems highlighted by Woodward, I see three core difficulties for URL shortener businesses:
  1. It's really easy to build a small-scale URL shortener; a good web developer could probably build one in a day (see the sketch after this list). Because of this low barrier to entry, casual users will always be able to find free URL shorteners.
  2. There are plenty of illegitimate (or at least annoying) business models for URL shorteners. These involve stealing traffic, stealing "Google juice", putting interstitial advertising in links, framing links, etc. This attracts entrants who make it harder for someone wanting to run a non-annoying business to attract paying users.
  3. Links need to be reliable, because if your shortener fails, the user doesn't get the content they are trying to access. In a lot of applications, links are meant to keep working forever. Reliability and persistence are expensive.
So basically, URL shorteners are a high-cost, tiny-revenue business.
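
To make the first point in the list above concrete, here is a minimal sketch of a URL shortener in Python, assuming Flask and SQLite (all names are hypothetical, and the spam filtering, analytics, and redundancy that make a real service expensive are deliberately omitted). Notice that every click performs a database write to record usage, which is exactly the property that makes a celebrity tweet so punishing for a shortener compared with a read-only site.

import sqlite3, string, random
from flask import Flask, redirect, request

app = Flask(__name__)
db = sqlite3.connect("shortener.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS links (code TEXT PRIMARY KEY, url TEXT)")
db.execute("CREATE TABLE IF NOT EXISTS clicks (code TEXT, ts DATETIME DEFAULT CURRENT_TIMESTAMP)")

@app.route("/shorten")
def shorten():
    # e.g. /shorten?url=http://example.com/a-very-long-url
    code = "".join(random.choices(string.ascii_letters + string.digits, k=6))
    db.execute("INSERT INTO links VALUES (?, ?)", (code, request.args["url"]))
    db.commit()
    return request.host_url + code

@app.route("/<code>")
def follow(code):
    row = db.execute("SELECT url FROM links WHERE code = ?", (code,)).fetchone()
    if row is None:
        return "unknown link", 404
    db.execute("INSERT INTO clicks (code) VALUES (?)", (code,))  # the write that makes scaling costly
    db.commit()
    return redirect(row[0], code=301)

A day's work, as advertised; everything this sketch leaves out is where the costs live.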

Nonetheless, URL shorteners are very useful in today's 140-character world. Woodward has concluded, and I agree, that there is no room for more than one URL shortener business in the Twittersphere, and that Bit.ly has won. Bit.ly thus finds itself with an odd sort of natural monopoly. Of the three plausible business models for Bit.ly, it seems to me that generating and selling data is the only one that would maintain the monopoly, and thus the business. What kind of data might Bit.ly sell? I'll place my bet on "psychographic data". With its URL shortening monopoly, Bit.ly has access to a huge number of clicks. Bit.ly knows who I am, because I signed up for an account. Whenever I click a bit.ly link, my browser sends a cookie to Bit.ly, which it could be using to track my interests and what I read. Aggregated over all the people who click Bit.ly links, the dataset of who clicked what could be very interesting to advertisers. Just as Google has the ability to tailor advertising to me based on my search history, Bit.ly could use a psychographic profile of me to help advertisers do targeting. Bit.ly has an interesting advantage over Google, however. Because so many of Google's properties rely on users' perception of it as a company that can be trusted with sensitive information, Google is quite limited in how far it can go in tracking users. Bit.ly, in contrast, is not inhibited in this way. In fact, Bit.ly's ability to profile users could make it even more attractive to people putting links into tweets. Bit.ly could even provide this sort of profiling without violating its privacy policy, which promises that it
... discloses potentially personally-identifying and personally-identifying information only to those of its employees, contractors and affiliated organizations that (i) need to know that information in order to process it on Bitly, Inc.'s behalf or to provide services available at Bitly, Inc. websites, and (ii) that have agreed not to disclose it to others. Some of those employees, contractors and affiliated organizations may be located outside of your home country; by using Bitly, Inc. websites, you consent to the transfer of such information to them. Bitly, Inc. will not rent or sell potentially personally-identifying and personally-identifying information to anyone. Other than to its employees, contractors and affiliated organizations, as described above, Bitly, Inc. discloses potentially personally-identifying and personally-identifying information only when required to do so by law, or when Bitly, Inc. believes in good faith that disclosure is reasonably necessary to protect the property or rights of Bitly, Inc., third parties or the public at large.
Ironically, URL shorteners could also be used in ways that enhance user privacy from a different direction. As I discussed in my post on the semantics of redirectors, most URL shorteners use HTTP redirects. Although behavior is browser-dependent, either 301 or 302 redirects will generally result in the originating page being sent in the referrer header. "META refresh" redirects, on the other hand, can be used to wipe (or replace) the value of the referrer header. (Unfortunately, they can also be used, annoyingly, to cause "referrer spam".)
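
The difference is easy to see in code. In this sketch (again assuming Flask; the URLs are hypothetical), the first handler issues an HTTP 301, so browsers typically pass the page containing the short link along in the Referer header; the second serves a META refresh page, which breaks that chain.

from flask import Flask

app = Flask(__name__)
TARGET = "http://example.com/target"  # hypothetical destination

@app.route("/r")
def http_redirect():
    # 301/302: the URL of the page containing the link usually survives
    # in the Referer header seen by the destination
    return "", 301, {"Location": TARGET}

@app.route("/m")
def meta_refresh():
    # META refresh: the destination sees this intermediate page as the
    # referrer (or no referrer at all), not the page with the short link
    return '<html><head><meta http-equiv="refresh" content="0; url=%s"></head></html>' % TARGET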

Redirectors deployed for purposes other than URL shortening also have market-share related privacy implications. For example, the dx.doi.org redirector which handles most DOI traffic could be a very useful vantage point for industrial or technology espionage. Because this redirector serves so many scientific article links, a spy agency might be able to monitor everyone in the world doing research on nuclear fission or anthrax weaponization, to give two examples.

In preparing my post on privacy mechanisms for Google Book Search, I was struck by the many directions that someone intent on privacy intrusion could take to collect potentially sensitive information. Part of the skepticism I expressed about being able to "sell" privacy comes from a feeling that privacy as traditionally conceived is pretty much a lost cause on the internet, no matter what Google, Bit.ly, or anyone else does. Somehow, traditional concepts of privacy need to be recast into something that people still value as they use the internet.

Tuesday, August 11, 2009

Shibboleth, Google Book Search, and the Hello Kitty Diary

I don't think I can sell privacy. In fact, it's hard to think of any technology that has succeeded in the marketplace because of its privacy attributes. Swiss banks don't count as technology; strong encryption succeeded in the market for its security attributes rather than its privacy attributes. (The same is true of locks on houses: it's true that they provide privacy, but people use them against thieves, not against snoops.) OK, maybe curtains, but if there's a trade-off between privacy and style in curtain technology, style usually wins. Even the Hello Kitty Electronic Password Diary does not appear to be a big commercial success.


In my post on privacy and Google Book Search, I alluded to technological solutions libraries could use to enhance patron privacy while also protecting against unauthorized access. I thought it would be useful to elaborate on this comment with some details. In general, there is no reason that privacy and security objectives can't both be met in a properly engineered solution, other than the fact that it's hard to find someone willing to pay for the properly engineered solution. For example, I mentioned Shibboleth as a possible solution for providing security and privacy. Shibboleth is an open-source single-sign-on authentication system developed as part of the Internet2 project. It uses strong cryptographic techniques to delegate trust over a network, and in so doing, allows for significantly enhanced privacy.

Think about the situation where a company has licensed some content to a university. The licensor wants to make sure that only persons associated with the university are allowed to access the content. It doesn't need to know who the user is; it only needs to know that the user is properly entitled. The Shibboleth system allows the institutional user to sign in once at an authentication point using their institutional credentials; any licensed resource can then check with that central authentication point that the user is accredited by virtue of institutional affiliation. Shibboleth also allows users and institutions to disclose attributes to providers of their choosing. Attributes might include a name, preferred language, subject areas of interest, subgroup membership, etc. Security is preserved because the institution still knows the identity of the users, and is enhanced because the Shibboleth system is designed to be much harder to defeat than competing solutions.
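
Real Shibboleth does this with SAML assertions and XML signatures, but the core idea can be sketched in a few lines of Python (the key and names here are hypothetical): the institution vouches for an attribute, and the licensor verifies the voucher without ever learning who the user is.

import hmac, hashlib, json, secrets

SHARED_KEY = b"key-established-between-institution-and-licensor"  # hypothetical

def issue_assertion(affiliation):
    # Run by the institution after it authenticates the user locally;
    # no name, email, or other identity goes into the payload.
    payload = json.dumps({"affiliation": affiliation, "nonce": secrets.token_hex(8)})
    sig = hmac.new(SHARED_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return payload, sig

def verify_assertion(payload, sig):
    # Run by the licensor: is this anonymous user vouched for?
    expected = hmac.new(SHARED_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and json.loads(payload)["affiliation"] == "member"

payload, sig = issue_assertion("member")
assert verify_assertion(payload, sig)  # access granted, identity never disclosed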

As far as I understand, Shibboleth would only slightly enhance privacy in the specific scenario created by Google Book Search, where users have to be tracked as to how much of individual books they have viewed. However, a system could be built that distributes this information over a network. Here's how it would work (a sketch in code follows below):
  1. When the user is authenticated by the institution, a session id would be sent to Google. The session id tracks the user, but only the institution knows the identity of the user.
  2. When the user views a page in a book, Google sends a message to the institution to increment a named counter associated with the user. The name of the counter identifies a book, but only Google knows which book is associated with the counter.
  3. When the user asks to view another page, Google asks the institution for the page count associated with the book and the user, and grants access accordingly.
Such a system works to enhance privacy by storing the identity of the person reading the book separately from the identity of the book. Only if Google and the institution agree to exchange information can the reading history of an identified patron be revealed. This results in even stronger privacy than we have in the print world: a government request for a patron's GBS reading habits would have to be made to two separate entities, probably in two different jurisdictions.
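
Here is the three-step scheme as a sketch in Python (my illustration, with hypothetical names; not anything Google or any institution has built). The institution's tables map session ids to users and opaque counter names to counts; the book service's table maps books to opaque counter names. Neither side's tables alone can link a reader to a book.

import secrets
from collections import defaultdict

PAGE_QUOTA = 20  # an assumed preview limit

class Institution:
    def __init__(self):
        self.sessions = {}                 # session id -> user identity
        self.counters = defaultdict(int)   # (session id, opaque name) -> count

    def authenticate(self, username):
        sid = secrets.token_hex(16)
        self.sessions[sid] = username      # only the institution knows who
        return sid                         # step 1: the session id goes to Google

    def increment(self, sid, name):
        self.counters[(sid, name)] += 1    # step 2: institution can't tell which book

    def count(self, sid, name):
        return self.counters[(sid, name)]  # step 3: reported back on request

class BookService:
    def __init__(self, institution):
        self.institution = institution
        self.counter_names = {}            # book -> opaque counter name (only Google knows)

    def view_page(self, sid, book):
        name = self.counter_names.setdefault(book, secrets.token_hex(8))
        if self.institution.count(sid, name) >= PAGE_QUOTA:
            return "quota reached"
        self.institution.increment(sid, name)
        return "page served"

inst = Institution()
gbs = BookService(inst)
sid = inst.authenticate("patron@university.edu")
print(gbs.view_page(sid, "Adventures of Tom Sawyer"))  # "page served"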

What is the likelihood that such a system can be created and adopted? On this score I am very skeptical. Who would pay for the enhanced privacy afforded by such a system? The success of a variety of Web 2.0 services seems to indicate that users are almost eager to give up privacy to gain the ability to communicate. As Randal Picker has discussed in a recent paper, consumers have significant incentives to give up their privacy to online advertising networks, because doing so amounts to advertising by the consumer that results in a more efficient market. The history of Shibboleth can be used as an indicator of market behavior. Although it can provide enhanced privacy and strong security, these advantages have not been able to counteract its implementation and usability costs compared to competing technologies, and Shibboleth has not been widely adopted. When Peter Brantley raised the specific question of using Shibboleth for Google Book Search, Google's Dan Clancy commented that "Some institutions use Shiboleth and we will support this although most institutions prefer IP authentication". Google is known for putting a very high priority on usability, which is an area of significant weakness for Shibboleth.

On second thought, maybe the Swiss banks are onto something. Maybe the best target market for ultimate privacy is ultra rich people. Sergey and Larry, Warren and Bill, might I sell you a bit of privacy?

Friday, August 7, 2009

What the Google Books Settlement Agreement Says About Privacy

Here's what Google thinks I'm interested in:
  • Computers & Electronics - Enterprise Technology - Data Management
  • Computers & Electronics - Software - ... - Content Management
  • Entertainment - Movies
  • Finance & Insurance - Investing
  • Internet - Web Services
  • Lifestyles - Clubs & Organizations
  • Lifestyles - Parenting & Family
  • News & Current Events - Technology News
  • Reference - Libraries & Museums
  • Social Networks & Online Communities - Social Networks
If you want to find out what Google thinks you're interested in, go to http://www.google.com/ads/preferences/view and find out. Does anything there disturb you? Can you imagine items that might appear on your list that would disturb you, or that you wouldn't want to post on your Facebook profile or in a comment to this post?

Last Friday I participated in a workshop sponsored by Harvard's Berkman Center focusing on Google Books and Google's settlement agreement with authors and publishers. The meeting was very well tweeted, so I won't bother to summarize or comment on the blog, at least for now. In the afternoon, I participated in a breakout session on privacy issues facilitated by Marc Rotenberg from EPIC. Recently, the Electronic Frontier Foundation (EFF) has focused attention on the neglect of patron privacy in the settlement agreement. At the NYPL panel I attended, the view expressed by the participants was that though privacy was a very important issue, the settlement agreement was not the place to address privacy concerns. At the Harvard workshop, the opposite view was predominant.

In the EFF posting, I was struck by the fact that they suggest that Google should be required to
allow users of anonymity providers, such as Tor, proxy servers, and anonymous VPN providers, to access Google Book Search
but they don't seem to expect libraries to be participating in the digital environment for books. In fact, I doubt that many libraries today view themselves as potential anonymity providers, despite the deep-seated respect for patron privacy that is part of the inherited culture of librarianship. I had wondered whether libraries would be able to use technological means, such as proxy servers, to ensure the privacy of their patrons who use Google Book Search. With some inspiration from the workshop, I've spent some time closely examining the agreement to see what it really says about privacy and what libraries might be able to do to enhance patron privacy.

Nowadays, electronic resources librarians can't help but focus more concern on monitoring for misuse of resources than on patron privacy issues. The obligation to do so is built into most license agreements for electronic resources. The following passage is from section 5.2 of the CLIR/DLF Model License:
Protection from Unauthorized Use. Licensee shall use reasonable efforts to inform Authorized Users of the restrictions on use of the Licensed Materials. In the event any Authorized User makes an unauthorized use of the Licensed Materials, the parties may take the following actions as a cure:
  1. Licensor may terminate such Authorized User's access to the Licensed Materials;
  2. Licensor may terminate the access of the Internet Protocol (“IP”) address(es) from which such unauthorized use occurred; or
  3. Licensee may terminate such Authorized User’s access to the Licensed Materials upon Licensor’s request. Licensor shall take none of the steps described in this paragraph without first providing reasonable notice to Licensee (in no event less than [time period]) and cooperating with the Licensee to avoid recurrence of any unauthorized use.
Libraries frequently have to negotiate to get this language, as publishers' licenses often have more burdensome requirements.

The sorts of things that actually happen, and that librarians worry about, are of two types. The first is when a student with legitimate credentials "loans" them to a friend, and in a week or two thousands of "friends" (often in another country) are using a resource through the campus proxy server. The librarian's obligation is to identify and disable the rogue credentials. In the other scenario, a student or faculty member tries to use some type of downloading tool to download an entire journal or database for some sort of offline use. In this case as well, the offending user must be identified and told "don't do that". Publishers of electronic resources typically have monitoring tools (and bot traps and poison pills) in place so that they can detect such misuse and shut off a customer's access when this sort of thing occurs. A call to a support desk is typically needed to restore access. Publishers realize that these things happen, and that on a campus with 20,000 students, there are limits to how much librarians can control what their patrons do. I do not know of any case where legal proceedings or demands for compensation have resulted from such incidents, but I do know that one publisher cut off access to all of China for several months when a breach occurred there.

There clearly exists tension between a library's obligations to prevent unauthorized use and its obligations to protect the privacy of users. In the CLIR/DLF model license, there is a mutual obligation that balances the licensee obligation to control unauthorized use:
Confidentiality of User Data. Licensor and Licensee agree to maintain the confidentiality of any data relating to the usage of the Licensed Materials by Licensee and its Authorized Users. Such data may be used solely for purposes directly related to the Licensed Materials and may only be provided to third parties in aggregate form. Raw usage data, including but not limited to information relating to the identity of specific users and/or uses, shall not be provided to any third party.
The balance between providing for security against unauthorized use and confidentiality of user data is the practical determinant of the degree to which a patron can expect to have real privacy. To provide security against unauthorized use of electronic resources, a library needs to keep logs for any proxy servers that it operates. To assure patron privacy, a library must be diligent in limiting the retention of those log files and of any other records that might be used to identify and track users and their usage of particular resources.

The settlement agreement (available here) says very little about patron privacy. (In fact, the only users whose privacy is mandated are users with print disabilities who access Library Digital Copies in Fully Participating Libraries under the special access provisions of section 7.2(b)(i) of the agreement. I'm capitalizing terms defined in the agreement.) It says quite a lot about security, however, and thus many aspects of patron privacy will be effectively governed by the provisions for security. The use of proxy servers by libraries is implicitly mentioned in two places. In section 4.1(a)(iv), pricing bands are specified for government, public and school library subscriptions with the qualifier "no remote access without Registry approval", while higher education and corporate pricing bands are specified without the remote access qualifier. Remote access is most typically provided by libraries in higher education through the use of proxy servers. Additionally, the "Security Standard" set out in Attachment D to the agreement specifies that
Google shall use commercially reasonable efforts to authenticate individual End Users for access to Books in an Institutional Subscription by verifying that an individual is affiliated with an institution with an active subscription. Google’s efforts will be in partnership with the subscribing institutions in a manner consistent with, or otherwise equivalent to, generally accepted industry standards for authentication of use of subscriptions. Techniques used may include IP address authentication, user login, and/or leveraging authentication systems already in place at an individual institution.
Since the current "industry standard" is to allow users to authenticate through a proxy server against an institutional id/password service, it would seem that proxy servers would be permitted under the agreement, at least for higher education settings.

There is a specific security requirement set by the agreement that is likely to result in increased user tracking by Google. Google is required to make sure that each user cannot preview more than a certain number of pages of a book. Thus, Google must keep track of the books that a user has viewed, and stop the preview once the quota is reached. For the purposes of this requirement, Google is supposed to treat multiple users of a given computer as a single user. Assuming it is possible to do so, this would have some odd consequences for computers in a library. A patron would be able to move from computer to computer and view more than their quota, but might not be able to view any pages from a book popular enough to have been previously viewed by another patron. In the current version of Google Book Search, cookies, not IP addresses, are used to track users, but a user is not required to log into the service at all unless they want to access personalization features. Google sets a 2-year cookie when you use the service, but the service can be used without cookies. To fulfill the terms of the settlement agreement, it appears to me that Google would likely have to either require users to log into personally identifiable accounts, or use the IP addresses of individual computers to allow unidentified users to access the service. Either way, libraries would be limited in their ability to use proxy servers to protect patron privacy (for example, by blocking cookies), and it's quite clear that what EFF has proposed with respect to anonymity providers is incompatible with the agreement.
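
For contrast with the split-counter scheme sketched earlier, here is roughly what centralized, cookie-based quota tracking looks like (a hypothetical sketch in Python with Flask, certainly not Google's actual implementation). One party sees the user token and the book together, which is precisely the privacy exposure discussed above.

import secrets
from collections import defaultdict
from flask import Flask, request, make_response

app = Flask(__name__)
views = defaultdict(int)  # (user cookie, book) -> pages viewed
QUOTA = 20                # an assumed preview limit

@app.route("/preview/<book>")
def preview(book):
    uid = request.cookies.get("uid") or secrets.token_hex(16)
    if views[(uid, book)] >= QUOTA:
        return "preview limit reached", 403
    views[(uid, book)] += 1  # the service sees user and book together
    resp = make_response("page %d of %s" % (views[(uid, book)], book))
    resp.set_cookie("uid", uid, max_age=2 * 365 * 24 * 3600)  # a 2-year cookie
    return resp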

I do not know of any resources currently licensed to libraries that are comparable to the post-settlement Google Book Search in their requirements for user tracking to prevent excessive use, so it's not clear to me how much guidance is really given by the phrase "generally accepted industry standards for authentication of use of subscriptions". Less widely deployed authentication methods, such as Shibboleth, may provide more patron privacy than the use of cookies, IP addresses, and/or proxy servers, while at the same time allowing Google to satisfy the terms of the settlement agreement. It is also likely that authentication technology, or modifications of existing authentication technologies, could be developed and specifically tailored to meet both the security requirements of licensors and the privacy requirements of libraries.

It is worth noting that the requirement for user tracking is not found in the main part of the settlement agreement, but rather in "Attachment D", the Security Standard. Interestingly, the settlement agreement includes a provision for the Security Standard to be reviewed and revised every two years, by "Google, the Registry and up to a total of four (4) representatives on behalf of the Fully Participating Libraries", to allow for changes in technology. Note that although the libraries are included because of their role in allowing use of Library Digital Copies, there is a single Security Standard which applies to both Google and library-provided services. Thus there will be four library representatives who must agree to revisions in the security policy (and thus to the privacy that it allows) as implemented in Google services. In theory at least, libraries could use the review of the Security Standard to introduce security/privacy technologies suited to the special characteristics of Google Book Search subscriptions.

It is unclear what security requirements apply to the "Public Access Service" which would put free terminals (with paid printing available) in any US library that wanted it, because the settlement agreement treats the Public Access Service as something separate from the Institutional Subscription, while the Security Standard makes no mention at all of the Public Access Service. It seems possible that adding coverage of the Public Access Service to the Security Standard would also have to be addressed by the security review group that includes the library representatives.

In any case, it is clear that the power of the Book Rights Registry to review and approve Google's security implementation plans gives it a great deal of leeway to set standards for patron privacy. Since the primary duty of the Registry is to serve rights-holders, its intrinsic motivation for protecting privacy would be only to see that privacy intrusions do not act to depress revenue significantly. Strong oversight by the court, such as has been requested by the library associations, may also promote attention to privacy concerns. Finally, it is likely that the Registry will need to pay close attention to state patron privacy laws. The library-registry agreements explicitly allow for state laws to trump any library obligations under the settlement agreement, so there can be no provision of the Security Standard that is incompatible with state privacy laws.

Google, as presently constituted, has every reason to be concerned about user privacy and guard it vigilantly; its business would be severely compromised by any perception that it intrudes on the privacy of its users. As Larry Lessig pointed out at the Berkman workshop, that doesn't mean that the Google of the future will behave similarly. Privacy concerns should be addressed; the main question has been how and where to address them. My reading of the settlement agreement is that it may be possible to address these concerns through the agreement's Security Standard review mechanism, through oversight of the Registry, and through state and federal laws governing library patron privacy.

And I am still not a lawyer.

Tuesday, August 4, 2009

Can Librarians Be Put Directly Onto the Semantic Web?


The professor who taught "Introduction to Computer Programming" my freshman year of college told us that it was easier to teach a (doctor, lawyer, architect) to program a computer than it was to teach a computer programmer to be a (doctor, lawyer, architect). I was never really sure whether he meant that it was easy to teach people programming, or whether he meant that it was impossible to teach programmers anything else. Many years later, I met the doctor he collaborated a lot with, and decided that my professor's conclusion was based on an unrepresentative data set, because the doctor had the personality of a programmer who accidentally went to medical school.

I was reminded of that professor by one of Martha Yee's questions in her article "Can Bibliographic Data Be Put Directly Onto the Semantic Web?":
Do all possible inverse relationships need to be expressed, or can they be inferred? My model is already quite large, and I have not yet defined the inverse of every property as I really should to have a correct RDF model. In other words, for every property there needs to be an inverse property; for example, the property isCreatorOf needs to have the inverse property isCreatedBy; thus "Twain" has the property isCreatorOf, while "Adventures of Tom Sawyer" has the property isCreatedBy. Perhaps users and inputters will not actually have to see the huge, complex RDF data model that would result from creating all the inverse relationships, but those who maintain the model will need to deal with a great deal of complexity. However, since I'm not a programmer, I don't know how the complexity of RDF compares to the complexity of existing ILS software.
Although there are many incorrect statements in this passage, the most important one to correct here is in the last sentence. Whether she likes it or not, Martha Yee has become a programmer. Congratulations, Martha!

In many respects, the most important question for the library world in examining semantic web technologies is whether librarians can successfully transform their expertise in working with metadata into expertise in working with ontologies or models of knowledge. Whereas traditional library metadata has always been focused on helping humans find and make use of information, semantic web ontologies are focused on helping machines find and make use of information. Traditional library metadata is meant to be seen and acted on by humans, and as such has always been an uncomfortable match with relational database technology. Semantic web ontologies, in contrast, are meant to make metadata meaningful and actionable for machines. An ontology is thus a sort of computer program, and the effort of making an RDF schema is the first step of telling a computer how to process a type of information. Martha Yee's development of an RDF class to represent an Author is precisely analogous to a Java programmer's development of a Java class to do the same thing.

RDF is the first layer of the program; OWL (Web Ontology Language) is the next layer. In OWL, you can describe relationships and constraints on classes and properties. For example, an ontology could contain the statement:
<owl:ObjectProperty rdf:ID="isCreatorOf">
  <owl:inverseOf rdf:resource="#isCreatedBy" />
</owl:ObjectProperty>
which defines isCreatorOf as the inverse of isCreatedBy. With this definition, a reasoning engine that encounters an isCreatorOf relationship will know that it can simplify the data graph by replacing it with the inverse isCreatedBy relationship. This does NOT mean that a good ontology should have inverses of all the properties it defines; in fact, quite the opposite is true. The OWL property inverseOf (like sameAs) is meant to make it easier to link separate ontologies, not to encourage ontologies to have redundant property definitions.
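
A small Python sketch using the rdflib library shows what a reasoner does with such a declaration (the namespace is hypothetical; a real reasoner, such as the owlrl package, performs this inference automatically):

from rdflib import Graph, Namespace
from rdflib.namespace import OWL

EX = Namespace("http://example.org/catalog#")
g = Graph()
g.add((EX.isCreatorOf, OWL.inverseOf, EX.isCreatedBy))
g.add((EX.Twain, EX.isCreatorOf, EX.AdventuresOfTomSawyer))

# Hand-rolled inference: for every owl:inverseOf declaration, assert the
# reversed triple for each statement that uses the declared property.
for prop, _, inverse in g.triples((None, OWL.inverseOf, None)):
    for s, _, o in list(g.triples((None, prop, None))):
        g.add((o, inverse, s))

assert (EX.AdventuresOfTomSawyer, EX.isCreatedBy, EX.Twain) in g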

I'm not sure where the notion that "for every property there needs to be an inverse property" came from, but I'll venture two guesses. It's true that if you want to browse easily in both directions from one entity to a related entity, you need to have the relationship expressed at both ends, particularly in a distributed data environment. Most application scenarios for RDF data involve gathering the data into large datastores for this reason. But you don't need an inverse property to be defined for this purpose.

Another possible source for the inverse property confusion is the way that relational databases work. In order to efficiently display sorted lists using a relational database, you need to have prepared indices for each field you want to sort on. So if you want to display an author's books alphabetically by title, and also a book's authors alphabetically by name, you need indices defined in both directions. If you're using an RDF triple store, by contrast, all the data goes in a single table, and the indices are all predefined.
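
In SQLite terms (a toy sketch), the relational designer must create each browsing direction by hand, whereas a triple store keeps one table of (subject, predicate, object) and typically indexes the useful permutations up front:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE creations (author TEXT, title TEXT)")
# Each browsing direction needs its own hand-built index:
db.execute("CREATE INDEX by_author ON creations (author, title)")  # an author's books, sorted
db.execute("CREATE INDEX by_title ON creations (title, author)")   # a title's authors, sorted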

The fact that ontologies are programs that encode domain knowledge should remove a lot of mechanical drudgery for "users and inputters". To take a trivial example, the cataloguer of a new version of "Adventures of Tom Sawyer" would not have to enter "Samuel Clemens" as an alternate author name for "Mark Twain" once the isCreatedBy relationship has been made. In fact, if the ontology contained a relationship "isVersionOf", then the cataloguer wouldn't even need to enter the title or create a new isCreatedBy relationship. A library catalog that used semantic web technologies wouldn't need separate programming to make these relationships; they would come directly from the ontology being used.
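
Continuing the rdflib sketch from above (names still hypothetical), a single isVersionOf assertion is enough for the authorship of a new edition to be inferred rather than re-entered by the cataloguer:

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/catalog#")
g = Graph()
g.add((EX.AdventuresOfTomSawyer, EX.isCreatedBy, EX.Twain))
g.add((EX.TomSawyer2009Edition, EX.isVersionOf, EX.AdventuresOfTomSawyer))

# Inference rule: a version of a work shares the work's creator.
for version, _, work in g.triples((None, EX.isVersionOf, None)):
    for _, _, creator in g.triples((work, EX.isCreatedBy, None)):
        g.add((version, EX.isCreatedBy, creator))  # inferred, not catalogued

assert (EX.TomSawyer2009Edition, EX.isCreatedBy, EX.Twain) in g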

To some extent, the success of the semantic web in any domain is predicated on the successful embodiment of that domain's knowledge in ontological code. Either coders need to learn the domain knowledge, or domain experts need to learn to code. People need to talk.