Tuesday, December 3, 2019

Your Identity, Your Library

Today, your identity on the Internet is essentially owned by the big email providers and social networks. Google, Yahoo, Facebook, Twitter - chances are you use one of these services to conveniently log into other services as YOU. You don't need to remember a new password for each service, and the service providers don't have to verify your "identity". What you gain in convenience, you lose in privacy, and that's turned out really well, hasn't it?

The "flow" you use to take advantage of this single sign-in is a "dance" that takes you from website to website and back to the site you're logging into. A similar dance occurs to secure access to resources licensed on you behalf by libraries, institutions, corporations, etc.. I wrote a bunch of articles about "RA21" (now rebranded as the vaguely NSFW "SeamlessAccess"), an effort spearheaded by STM publishers to improve the user experience of that dance. (It can be complicated and confusing because there are lots of potential dance partners!)

Henri Matisse, La danse (first version) 1909

These dance partners style themselves as "identity providers". That label makes me uncomfortable. Identity can't be something that can be stripped from you by on the whim of a megacorporation. Instead, internet identity should be woven from a web of relationships. These can be formed digitally or face-to-face, global or local, business or personal.

You'd have thunk that the whole identity-on-the-internet thing would have improved in the 13 years since that login dance was first rolled out. And you'd be almost right, because a new architecture for internet identity is now on the horizon. Made possible by many of the same technologies that are securing the internet and inflating the blockchain bubble, massively distributed and even "self-sovereign identity" are becoming real-ish.

These technologies will inevitably be applied to the access authorization problem. Access via distributed identity replaces the website-to-website dance with the presentation of some sort of signed credential. A service provider verifies the signature against the signer's public key. It's like showing a passport that can't be forged. A tricky bit is that the credential also needs to be checked against a list of revoked credentials. This would have been cumbersome even ten years ago, but distributed databases are now a mature technology, versions of which underpin the internet itself.

Interlinked with the concept of distributed identity is the notion that users of the web should be able to securely control their data, and that decisions about what a web site gets to know about you should not be delegated to advertising networks.

Unfortunately, we're not quite ready for distributed identity, in the sense that implementation for today's web would require users to install plugin software, which has its own set of usability, privacy and security issues. The ideal situation would be for some sort of standardized distributed identity and secure data management capability to be installed in browser software - Chrome, Firefox, Safari, etc.

There's a lot of work going on to make this happen.
  • ID2020 has put out an identity manifesto that starts with the declaration that "The ability to prove one’s identity is a fundamental and universal human right."
  • Tim Berners-Lee is leading the Solid Project, which let's you "move freely between services, reuse data across apps, connect with anyone, and select what you share precisely".
  • The W3C Verifiable Claims Working Group has published Technical Recommendations for "Verifiable Credential Use Cases and a "Verifiable Credential Data Model". They observe that "from educational records to payment account access, the next generation of web applications will authorize entities to perform actions based on rich sets of credentials issued by trusted parties."
  • The Sovrin Network is a "new standard for digital identity – designed to bring the trust, personal control, and ease-of-use of analog IDs – like driver’s licenses and ID cards – to the Internet."
  • Kaliya Young, Doc Searls and Phil Windley have been convening the Internet Identity Workshop twice a year since 2005 to create a community centered around internet identity. A glance at prior year proceedings gives a flavor of how much is happening in the field
The common thread here is that users, not unaccountable third parties, should be able to manage their identity on the internet, while at the same time creating a global chain of trust.

It seems to me that there's a last-mile problem with all these schemes. If identity is really a universal human right, how do we create a chain of trust that can include every human? That problem becomes a lot easier to solve if there were some sort of organization with a physical presence in communities all over, trusted by the community and by other organizations. A sort of institution experienced in managing information access and privacy, and devoted to the needs of all sorts of users.

In other words, what if "libraries" existed?

The federated authentications systems used by libraries today - Shibboleth, Athens, and related systems use a dance similar to what you do with Google or Facebook. It's a big step that moves your internet identity away from "surveillance capitalists" towards community institutions. But you still don't have control over what data your institution give away, as you will in the next-generation internet identity systems I describe here. (RA21 is no different from Shib or Athens in this respect.)

What might libraries do to prepare for the age of distributed identity? The first step is not about technology, it's about mission. I believe libraries should start to think of themselves as internet relationship providers for their communities. When I get access to a resource though my library, I won't be "logging in",  I'll be asserting a relationship with a library community, and the library will be standing behind me. Joining an identity federation is a good next step for libraries. But the library community needs to advocate for user identity as a basic human right and prepare their systems to support a future where no dancing is required.

Update 12/5/2019: revised last two paragraphs to be less mystifying.


Friday, July 26, 2019

Four-Leaf Clovers

It seems a friend of mine collects four-leaf clovers.

When I was a kid, I loved looking for four-leaf clovers in the lawn.  It was the same sort of relaxing concentration and observation you use to find a piece of a jigsaw puzzle. But one day, I found a clover plant in front of the garage that had multiple four-leaf clovers. Looking carefully, I found that not only were there four leaf clovers, but there were FIVE-LEAF-CLOVERS. I had hit the jackpot. And even a six-leaf clover!!!! I swear to all of God's integers that I even found a SEVEN leaf clover. I saved that seven leaf clover in my box of treasures for years, until I just had seven crumbling leafs of a clover.

I never looked for a four-leaf clover again.

Now, whenever I remember that clover plant (and that garage), I think of the toxins that must have caused the polyfoliate abomination.

Please don't let my story stop you looking for four-leaf clovers! Happy summer!


Thursday, May 30, 2019

Responding to Critical Reviews

The first scientific paper I published was submitted to Physical Review B, the world's leading scientific journal in condensed matter physics. Mailing in the manuscript felt like sending my soul into a black hole, except not even Hawking radiation would came back. A seemingly favorable review returned a miraculous two months later:
"I found this paper interesting, and I think it probably eventually it should be published - but only after Section II is revamped and section III clarified."
I made a few minor revisions and added some computations that had been left out of the first version, then confidently resubmitted the paper. But another two months later, I received the second review. The referee hadn't appreciated that I had deflected the review's description of "fundamental logic flaws and careless errors" that made my paper "extremely confusing". The reviewer went on to say "I do not think the authors' new variational calculation is correct" and suggested that my approach was completely wrong.
A ridiculously long equation

My thesis advisor suggested that I go and talk to Bob Laughlin in the Physics department about how to deal with the stubborn referee. I had been collaborating with Bob and one of his students on a related project, and he had become a surrogate advisor for my theoretical endeavors. During that time, Bob had acquired a reputation among my fellow students for asking merciless questions at oral exams; many of us were scared of him.

Bob's lesson on how to deal with a difficult referee turned out to be one of the most useful things I learned in grad school. Referees, he told me, come in 2 varieties, complete idiots, and not-complete-idiots. (Yes, Bob was merciless.) If your referee is a complete idiot, all you can do is ask for a different referee. If your referee has the least bit of sense, then you have to take the attitude that either the referee is somewhat correct, and you think YES-SIR MISTER REFEREE SIR! (Bob had been in the Army) and do whatever the referee says to do, or you take the point of view that you have explained something so poorly that the referee, who is an excellent representative of your target audience, had no hope of understanding it. Either way, there was a lot of work to do. We decided that this referee was not an idiot, and I needed to go back to the drawing board and re-do my calculation, figuring out how to be clearer and more correct in my exposition.

A third review came back with the lovely phrase "The significance of the calculation of section II, which is neither fish nor fowl, remains unclear." Using Bob's not-idiot rule, I recognized that my explanation was still unclear and I worked even harder to improve the paper.

My third revised version was accepted and published. Bob later won the Nobel Prize. I'm here writing blog posts for you about RA21.

RA21 received 120 mostly critical reviews from a cross-section of referees, not a single one of whom is the least bit an idiot. Roughly half the issues fell into the badly-explained category, while the other half fell in the "fundamental flaws and careless errors" category. RA21 needs to go back to the chalkboard and rethink even their starting assumptions before they can move forward with this much-needed effort.

Friday, May 17, 2019

RA21: Technology is not the problem.

RA21 vows to "improve access to institutionally-provided information resources". The barriers to access are primarily related to the authorization of such access in the context of licensing agreements. In a perfect world, trust and consensus between licensors and licensing communities would render authorization technology irrelevant. In the real world, technological controls need to build upon good-faith agreements and the consent of community members. Also in the real world, poorly implemented technology erodes that good-faith and consent.

The RA21 draft recommended practice focuses on technology and technology implementations, all the while failing to consider how to build the trust that underpins good-faith and consent. Service providers need to trust that identity providers faithfully facilitate authorized users and that the communities that identity providers serve will adhere to licensing agreements; users of information resources need to trust that their usage data will not be tracked and sold to the highest bidder.

Trust is not created out of thin air and certainly not by software. Technology can provide tools that facilitate trust, but shared values and communication between parties is the raw material of trust. An effective program to improve access must include processes and procedures that develop shared values and promote cooperation.

I recognize that RA21 has chosen to consider only the authentication intercourse as in-scope. But the draft recommendation has identified several areas of "further work". Included in this further work should be areas where community standards and best practices can enhance trust around authentication and authorization. To name two examples:
  1. A set of best practices around "incident response" would in practice work much better than a "guiding principle" of "end-to-end traceability".
  2. A set of best practices around auditing of security and privacy procedures and technology at service providers and identity providers would materially address the privacy and security concerns that the draft recommendation punts over to cited reports and studies.

This is the fifth and last of my comments submitted as part of the NISO standards process. The 102+ comments that have been submitted so far represent a great deal of expertise and real-world experience. My previous comments were  about secure communication channels, potential phishing attacks, the incompatibility of the recommended technical approach with privacy-enhancing browser features, and the need for radical inclusiveness. I've posted the comments here so you can easily comment.

Update July 22, 2019:



RA21's official response to this comment is:
We agree that technology is not the primary problem. There are two core issues that RA21 is seeking to address - firstly the current user experience of federated authentication needs to be improved, and this comprises the bulk of our recommendations. Secondly, considerable trust has been established between identity providers and service providers through their mutual particpation in identity federations and we are recommending broader particpation in identity federations where they do not exist. The understanding and acceptance of this trust model is not universal among all stakeholder groups particularly withing IdP organisations and through ongoing dialog and outreach during the implementation phase, RA21 hopes to address this deficit. Finally, we have added a section to the recommendations addressing security incident response and adoption of an operational security baseline by particpants.
OK.

Monday, May 13, 2019

RA21 doesn't address the yet-another-WAYF problem. Radical inclusiveness would.

The fundamental problem with standards is captured by XKCD 927.
XKCD https://xkcd.com/927/
Single sign-on systems have the same problem. The only way for a single sign-on system to deliver a seamless user experience is to be backed by a federated identity system that encompasses all use cases. For RA-21 to be the single button that works for everyone, it must be radically inclusive. It must accommodate a wide variety of communities and use cases.

Unfortunately, the draft recommended practice betrays no self-awareness about this problem. Mostly, it assumes that there will be a single "access through your institution" button. While it is certainly true that end-users have more success when presented with a primary access method, it's not addressed how  RA-21 might reach that state.

Articulating a radical inclusiveness principle would put the goal of single-button access within reach. Radical inclusiveness means bringing IP-based authentication, anonymous access, and access for walk-ins into the RA-21 tent. Meanwhile the usability and adoption of of SAML-based systems would be improved; service providers who require "end-to-end traceability" could achieve this in the context of their customer agreements; it needn't be a requirement for the system as a whole.

Radical inclusiveness would also broaden the user base and thus financial support for the system as a whole. We can't expect a 100,000 student university library in China to have the same requirements or capabilities as a small hospital in New Jersey or a multinational pharmaceutical company in Switzerland, even though all three might need access to the same research article.



This is my fourth comment on the RA-21 draft "Recommended Practices for Improved Access toInstitutionally-Provided Information Resources". The official comment period ends Friday. This comment, 57 others, and the add-comment form can be read here. My comments so far are about secure communication channelspotential phishing attacks, and the incompatibility of the recommended technical approach with privacy-enhancing browser features. I'm posting the comments here so you can easily comment. I'll have one more comment, and then a general summary.

Update July 10, 2019:

RA21's official response to this comment is:
RA21 envisages supporting the anonymous and walk-in use cases via federated authentication. It is anticpated that federated authentication and IP authentication will exist side-by-side during a transition phase. The specifics of the User Experience during the transition phase will need to be determined during implementation; however it is likely that the RA21 button will simply not need to be displayed to users who are IP authenticated.
I suppose self-awareness was a big ask. The revised recommendation includes some "envisaging" of use cases that was glaring by omission in the draft recommendation. The added section 2.1.1., Employ appropriate authentication mechanisms for specific use cases, is an improvement on the draft; but the revised recommendation has not retreated from its end-to-end traceability "guiding principle".

RA21 used the same response for a comment by Ohio State's, Jennifer Vinopal:
I want to reiterate a point that a number of commenters have already mentioned: there is no discussion of how public or walk-in (or other unauthenticated/unauthenticating) users will get access to resources through RA21. Public libraries, as well as many college and research libraries, negotiate our e-resource licenses to provide access to walk-in users who aren?t represented in our IdM systems.
Don't forget, EZProxy was supposed to be a transition phase!




Wednesday, May 8, 2019

RA21's recommended technical approach is broken by emerging browser privacy features


This is my third comment about the recently published NISO draft "Recommended Practice" (RP) on "Improved Access to Institutionally-Provided Information Resources" a. k. a. "Resource Access in the 21st Century" (RA21). Official comments can be submitted until May 17th.  My first comment concerned the use of secure communication channels. The second looked at potential phishing attacks on the proposed system. I'm posting the comments here so you can easily comment.

RA21's recommended technical approach is broken by emerging browser privacy features

Third party cookies are widely on the web used as trackers, or "web bugs", by advertising networks wishing to target users with advertising on the web. The impact of these trackers on privacy has been widely reported and decried. Browser local storage deployed using 3rd-party iframes is similarly employed for user tracking by ad networks. Browser vendors, led by Apple, have fought back against user tracking by providing user options to limit third party information sharing. Apple's "Intelligent Tracking Protection"  has progressively increased the barriers to cross-site information storage, for example, by partitioning the local storage according to third-party context.

Unfortunately for RA21, the draft recommended practice (RP) has endorsed a technical approach which mirrors the tactics used for user tracking by the advertising industry. For this reason, users of Safari who choose to enable the "prevent cross-site tracking" option may not benefit from the "seamless" access promised by RA21 if implemented with the endorsed technical approach.

Wikimedia commons
The optimistically acronymed "P3W" pilot used a javascript library called "Krakenjs/zoid" (According to the Norse sagas, the kraken is a squidlike monster that terrorizes voyagers) to exchange data between cross-domain contexts. The limitations on krakenjs in Safari are acknowledged by the library's developer.  It works by having the host webpage create an iframe loaded from a P3W website. With privacy controls off, the web page posts to the iframe, which answers with a reference to the user's identity provider. The service provider website uses that information to help the user authenticate without having to search through a huge list of identity providers. With Safari privacy features turned on, the search process must be repeated for each and every service provider domain.

Other browser vendors have moved towards restricting tracking behaviour. Firefox has announced that it will phase in "enhanced tracking protection"
Even Google's Chrome browser is moving towards restrictions on tracking technologies.

The bottom line is that if RA21 is implemented with the recommended technical approach, library users will probably be required to turn off privacy enhancing features of their browser software to use resources in their library. As a result, RA21 will have difficulty moving forward with community consensus on this technical approach.

Browser software is much more tolerant of cross-domain communication when the information "hub" is a first-party context (i.e. a window of its own, not an embedded iframe), as is done in more established authentication schemes such as OpenID Connect and SAML flow. RA21 should refocus its development effort on these technical approaches.

Update July 5, 2019:

RA21's official response to this comment is:
Future work includes storage policy notification. Also, we are not actually using third party cookies even though this term is often used to describe several cross-domain access patterns; instead, RA21 recommends using web storage (aka, browser local storage) together with HTML5 post-message for cross-domain access. This is the same mechanism (and indeed the same implementation) that PayPal uses, thus demonstrating broad browser support. A description of web storage has been added to the Terminology section. We are aware that by turning off "third party cookies" it is possible for the user to partly or completely disable the call to action button but in those cases the user experience degrades gracefully to a classical SAML/OpenIDC discovery flow.
Essentially the same response was made to three other submitted comments. Two of them, from Duke's Tim McGeary, called out two sections of the recommended practice and noted:
Word of caution: this login specifically cannot happen in an iFrame to meet SSO security protocol
The third, from Cornell University Library, submitted by Adam Chandler, amplified on McGeary:
Comment from Cornell University Library Privacy as a Service Working Group. Our group includes membership drawn from Library IT, Library Licensing, Library Public Services, Cornell IT Security, and Cornell Privacy Office.

Under 2.4.: We agree with Tim McGreary's comment (#862 or #863 - seems that he double-posted it) that the SSO login shouldn't be inside a frame on another page. There are security issues with that kind of approach. The users can't see the login page URL to verify that the page is a x.uni.edu page before entering their passwords, so it makes it easier to spoof the login page. Generally, login pages use "framebusting" to prevent this kind of possibility.
RA21's response on this issue is alarming, and suggests that the whole project is in danger of failure. RA21 seems to be unaware that using HTML5 web storage is worse than 3rd party cookies in many respects - particularly privacy and security. Currently, only Safari defaults to "a classical SAML/OpenIDC discovery flow", but that still means that if they want to be accurate, they'll have to rename the implementing organization "The Coalition for Seamless Access but Not on iOS" or "The Coalition for Problematic Access".

I hope that the beta implementation will be executed by a team with the experience and competence to override or at least effectively mitigate RA21's technical blunder.

Monday, May 6, 2019

RA21 Draft RP session timeout recommendation considered harmful

Hey everybody, I implemented RA21 for access to the blog!































Well, that was fun.

I'm contributing comments about the recently published NISO draft "Recommended Practice" (RP) on "Improved Access to Institutionally-Provided Information Resources" a. k. a. "Resource Access in the 21st Century" (RA21). Official comments can be submitted until May 17th. The draft has much to recommend it, but it appears to have flaws that could impair the success of the effort. My first comment concerned the use of secure communication channels. I expect to write two more. I'm posting the comments here so you can easily comment.

RA21 Draft RP session timeout recommendation considered harmful

RA21 hopes to implement a user authentication environment which allows seamless single sign-on to a large number of service provider websites. Essential to RA21's vision is to replace a hodge-podge of implementations with a uniform, easily recognizable user interface.

While a uniform sign-in flow will be a huge benefit to end users, it introduces an increased vulnerability to an increasingly common type of compromise, credential phishing.  A credential phishing attack exploits learned user behavior by presenting the user with a fraudulent interface cloned from a legitimate service. The unsuspecting user enters credentials into the fraudulent website without ever being aware of the credential theft. RA21 greatly reduces the difficulty of a phishing attack in three ways:
  1. Users will learn and use the same sign-in flow for many, perhaps hundreds, of websites. Most users will occasionally encounter the RA21 login on websites they have never used before.
  2. The uniform visual appearance of the sign-in button and identity provider selection step will be trivial to copy. Similarly, a user's previously selected identity provider will often be easy for an attacker to guess, based on the user's IP address.
  3. If successful, RA21 may be used by millions of authorized users, making it difficult to detect unauthorized use of stolen credentials.
If users are trained to enter password credentials even once per day, they are unlikely to notice when they are asked for identity provider credentials by a website crafted to mimic a real identity provider.

For this very reason, websites commonly used for third party logins, such as Google and Facebook, use timeouts much longer than the 24 hour timeouts recommended by the RA21 draft RP. To combat credential theft, they add tools such as multi-factor authentication and insert identity challenges based on factors such as user behavior and the number of devices used by an account.

Identity providers participating in RA21 need to be encouraged to adopt these and other anti-phishing security measures; the RA21 draft's recommended identity provider session timeout (section 2.7) is not in alignment with these measures and is thus counterproductive. Instead, the RP should encourage long identity provider session timeouts, advanced authentication methods, and should clearly note the hazard of phishing attacks on the system. Long-lived sessions will result in better user experience and promote systemic security. While the RP cites default values used in Shibboleth, there is no published evidence that these parameters have suppressed credential theft; the need for RA21 suggests that the resulting user experience has been far from "seamless".

Update July 3, 2019:

RA21's official response to this comment is:
We disagree with premise that consumer websites adopt long sign-in timeouts as a Phishing protection measure. That said, IdPs should follow best practices such as HTTPS so users can verify that they are on a valid sign in page. Length of validity of sign-in is also by necessity context dependent.
Well, yeah. I wasn't expecting them to actually consult real people who battle identity theft on consumer websites. I was mostly amazed that sign-in timeouts would be considered in-scope for RA21 while HTTPS, which will be essential to RA21's success or failure, was not. But the RA21 recommendation will have no effect whatsoever on what identity providers do, unless perhaps existing identity providers are making timeouts ridiculously short. Identity providers know their context much better than any committee and they will do what they want to do. And they should!

Interestingly, a section (2.8. Establish Security Incident Reporting Frameworks) has been added to the revised recommendation that acknowledges credential phishing as a motivation for RA21! So, yay RA21!
 

Sunday, May 5, 2019

RA21 RP does not require secure protocols. It should.

As I've written, "RA21" could be a Good Thing, or it could be a disaster. The RA21 working group has released its "Recommended Practice" draft for comments, until May 17. The draft has much to like, but also has significant flaws. I will be contributing comments to address the flaws I see, which I will also publish here so we can discuss and comment. My official comments, and many others worth reading are here.

Here's my first comment, perhaps the most predictable:

RA21 RP does not require secure protocols. It should.

RA21 envisions the creation of a widely deployed authentication and authorization for resources and tools serving the research community. In such an ecosystem, the health and security of the entire system can be degraded by a small number of weak implementations. In particular, delivering resources over insecure unencrypted channels will be harmful.

In this context it is surprising that the RA21 recommended practice (RP) fails to directly address the need for service providers and identity providers to use secure channels such as HTTPS for websites. The recommended practice makes indirect reference to this need by citing another document, "WAYF Cloud and P3W Security & Privacy Recommendations". This document fails to treat secure channels as a requirement, saying in analyzing the pilot implementations (italics added):
"All browser traffic should use secured protocols, such as https, to prevent unauthorized access and to preserve confidentiality." (WAYF cloud, page 13)
"All browser traffic should use secured protocols such as https to prevent unauthorized access and to preserve confidentiality." (P3W, page 18)
In contrast to the "should" used for secure communications, the analysis uses the stronger "must" in other places, for example,
"Therefore, applications must include strong controls to prevent user ID tampering and abuse "(Information Disclosure, page 7)
Security and privacy issues essential to the success of RA21 should not be buried in technical analyses of uncertain normativity. Secure channels should not be optional, they must be required.

Update July 2, 2019:

RA21's official response to this comment is
We agree that HTTPS everywhere is a good idea for tools and resources serving the research community. However, a specific recommendation on this would be outside of the scope of RA21.
This response strikes me as uninformed, considering that the recommendation promotes a technical solution that will likely require publishers to adopt HTTPS. Either the committee is unaware of the technical ramifications of their recommendations (very likely), or they're trying to hide from the publishing community the inconvenient fact that RA21 will require all of them to go HTTPS (I wish).

Really, all I was hoping for some bland indication that RA21 will not compromise system privacy and security to accommodate the laggards of the service provider community. Since that didn't happen, I'll do some shouting here:

ATTENTION PUBLISHERS! RA21 WON'T TELL YOU IT'S GOING TO REQUIRE HTTPS ON YOUR SITE. I JUST DID.

Wednesday, April 3, 2019

Fudge, and open access ebook download statistics

If you found out that the top 50 authors born in Gloucestershire, England average over 10 million copies sold, you might think that those authors are doing pretty well. But it's silly to compute averages like that. When you compute an average over a population, you're making an assumption that the quantity you're averaging over is statistically distributed somehow over the population. Unless of course you don't care if the average means anything, and you just want numbers to help justify an agenda.

Most folks would look at the list of Gloucestershire authors and say that one of the authors is an outlier, not representative of Gloucestershire authors in general. And so J.K. Rowling, with more than 500 million copies sold, would get removed from the data set, revealing the presumably unimpressive book selling record of the "more representative" authors. Scientists refer to this process as "fudging the data". It's done all the time, but it's not honest.

There's a better way. If a scientific study presents averages across a population, it should also report statistical measures such as variance and standard deviation, so the audience can judge how meaningful the reported averages are (or aren't!).

Other times, the existence of "outliers" is evidence that the numbers are better measured and compared on a different scale. Often, that's a logarithmic scale. For example, noise is measured on a logarithmic scale, in units of decibels. An ambulance siren has a million times the noise power of normal conversation, but it's easier to make sense of that number if we compare the 60 dB sound volume of conversation to the 90 dB of a hair dryer, the 120 dB of the siren and the 140 dB of a jet engine. Similarly, we can understand that while J.K. Rowling's sales run into 8 figures, most top Gloucestershire-born authors are probably 3, 4 and or maybe 5 figure sellers.

Over the weekend, I released a "preprint" on Humanities Commons, describing my analysis of open-access ebook usage data. I worked with a wonderful team including two open-access publishers, University of Michigan Press and Open Book Publishers, on this project, which was funded by the Mellon Foundation. To boil down my analysis to two pithy points, the preprint argues:

  1. Free ebook downloads are best measured on a logarithmic scale, like earthquakes and trade publishing sales.
  2. We shouldn't average download counts.

If you take the logarithm of book downloads, the histogram looks like a bell curve!
For example, if someone tries to tell you that "Engineering, mathematics and computer science OA books perform much better than the average number of downloads for OA books across all subject areas" without telling you about variances of the distributions and refusing to release their data, you should pay them no mind.

Next week, I'll have a post about why logarithmic scales makes sense for measuring open-access usage, and maybe another about how log-normal statistics could save civilization.