Friday, December 4, 2009

My Funnest Bugfix Ever: the Double Relative Redirect

Intel has been running an ad called "Our Jokes Aren't Like Your Jokes". In it, an actor portrays Mario Paniccia, Director of Intel's Photonics Technology Lab, mischievously altering an equation on a futuristic see-through noteboard. Hilarity ensues.



There's a lot to remark on here. In what sense is the actor portraying Mario Paniccia? In what way is this ad promoting Intel's corporate image? Can you believe that I'm putting it on my blog, totally without getting paid for it?

The ad is hilarious because many of us recognize the truth it expresses. The very best geek humor is funny exactly because so few can appreciate it.

I am now going to attempt to detail the funnest bug I ever found. I've encountered fewer than ten people in the world who have understood it, and of those, only a few could laugh, because the others were complicit.

Five years ago, one of my linking technology company's users, a librarian-customer who was quite adept technically, had noticed that the links we were providing for one publisher were "flaky", and wondered if we could do anything about it. Our first two layers of support had trouble reproducing the problem, but they agreed that there was something flaky about the links, and they had a high regard for the librarian in question, so the problem got escalated to me (in a small company, even the President should be part of the customer support team). As it turned out, I had configured the link in question myself a year or two earlier, and I had a vague recollection that there was some hard-to-put-your-finger-on flakiness with it. So I started poking around.

The linking technologist's most important tool is a Firefox plugin called LiveHTTPHeaders. It lets you see how your browser is communicating with web servers. When I ran this tool, I immediately saw the problem. The links being sent to this publisher were being routed through a redirection chain. The first stage of the chain converted metadata-based links to database key links and set a session cookie. The next and final stage turned the key link into the actual content link for the user. The problem was that both of these redirects failed to comply with the HTTP/1.1 standard (RFC 2616) for redirects.

When a link is embedded in a web page, it can be one of two types, relative or absolute. An absolute link uses a fully spelled out address, like http://go-to-hellman.blogspot.com/2009/09/redirector-chain-mashup-design-pattern.html. A relative link specifies the address of the link target relative to the current address. So, from this page, the same address could be written in relative form as /2009/09/redirector-chain-mashup-design-pattern.html. The browser has to do some computation to figure out the expanded address from a relative link.
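For the curious, here is a minimal sketch of that computation in Python, using the standard library's urljoin. The address of "this page" (current_page) is a stand-in I made up for illustration; only the host name matters for this kind of relative link.

```python
from urllib.parse import urljoin

# Assumed address of the page the reader is currently on (hypothetical path).
current_page = "http://go-to-hellman.blogspot.com/2009/12/some-post.html"

# The relative form of the link, as it might appear in the page's HTML.
relative_link = "/2009/09/redirector-chain-mashup-design-pattern.html"

# The browser expands the relative link against the current address.
expanded = urljoin(current_page, relative_link)
print(expanded)
```

Because the relative link starts with a slash, only the scheme and host of the current address are reused; the rest of the current path is thrown away.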

The problem with the flaky links was that the server was sending back relative links in the redirect. Although you can do that in a web page, it's not allowed in the HTTP standard for the server. The most popular web server software, Apache, won't even permit a web developer to send a relative link; in fact, it will helpfully do the expansion calculation automatically. Our flaky link, however, was being served by Microsoft's IIS server software, which doesn't do the expansion. It's possible that the bug arose when a web application designed for Apache was deployed in IIS.
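As a rough illustration of the kind of expansion a server could do before sending a redirect, here is a small Python sketch. It is not Apache's actual code; the function name absolutize_location and the publisher.example.com addresses are my own inventions for the example.

```python
from urllib.parse import urljoin, urlsplit

def absolutize_location(request_url, location):
    """Return a Location value that is safe to send in a redirect.

    If `location` already has a scheme and host, it is returned unchanged.
    Otherwise it is expanded against the URL of the request being answered.
    """
    parts = urlsplit(location)
    if parts.scheme and parts.netloc:
        return location
    return urljoin(request_url, location)

# Hypothetical request and relative redirect target.
print(absolutize_location("http://publisher.example.com/openurl?id=1", "/key/42"))
```

The server always knows the full address of the request it is answering, so it is in a much better position to do this calculation than the client is.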

Although I could see where the problem was coming from, and I could see that our linking software was thankfully not to blame, I was thoroughly puzzled. Why did the link only seem flaky, instead of just failing? The answer must lie in the client software: the bits of code on the user's computer that talk to the server.

It's usually a good idea when implementing a communications protocol to be strict about what you send and forgiving about what you receive. So when a relative address is received by an HTTP client, it tries its best to recover from the error. It takes the most recent address it has received and uses that to expand the relative address into an absolute address. So, in the vast majority of cases, users don't experience a problem when a server sends a relative redirect to their browser.

The flaky link problem I was looking at resulted in link failure because the server was sending two relative links in a row. The first link could get expanded just fine, but the second link only had a relative address to expand against, resulting in complete failure of the link.

How could a problem like this have escaped notice? And here we have the geek punch line: the expanded address gets put in the redirect cache! So the second time the user clicks on the link, the initial address is found in the cache, and the first relative redirect is skipped over. Only the second relative redirect is performed, and this time there's a full address from the cache to expand against, and the link works as intended.
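To make the mechanism concrete, here is a toy Python model of the behavior as I've described it. It is a sketch, not real browser code: the ToyClient class, the SERVER table, and the journals.example.org addresses are all made up for illustration, and the only thing it tries to capture is "expand against the most recent address received, and cache the expanded result".

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical server: each address answers with a *relative* redirect.
SERVER = {
    "http://journals.example.org/openurl?issn=1234-5678": "/key/42",
    "http://journals.example.org/key/42": "/content/article42.html",
}

def is_absolute(url):
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc)

class ToyClient:
    def __init__(self):
        self.redirect_cache = {}  # requested address -> expanded redirect target

    def click(self, url):
        """Follow redirects from `url`; return the final address, or None on failure."""
        current = url
        last_received = url  # most recent address the client has been handed
        if current in self.redirect_cache:
            # Cache hit: skip the first redirect, start from a full address.
            current = self.redirect_cache[current]
            last_received = current
        while current in SERVER:
            location = SERVER[current]
            if is_absolute(location):
                target = location
            elif is_absolute(last_received):
                # Forgiving client: expand against the last address received.
                target = urljoin(last_received, location)
                self.redirect_cache[current] = target
            else:
                # Two relative redirects in a row: nothing to expand against.
                return None
            last_received = location
            current = target
        return current

client = ToyClient()
link = "http://journals.example.org/openurl?issn=1234-5678"
first_click = client.click(link)   # fails: second redirect has no absolute base
second_click = client.click(link)  # works: the cache supplies a full address
print(first_click, second_click)
```

In this model the first click comes back empty and the second lands on the content page, which is exactly the "flaky" behavior the librarian reported.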

If you are an HTTP geek of any sort, you can get up off the floor and stop laughing already. If you are not, read on, and I'll try to explain why it is truly hilarious.

Once I figured this bug out, I did some digging. I found that the same thing was happening not only for the OpenURL links that my software made for linking within a library, but also for CrossRef DOI links, which are used pervasively for linking between publishers. I determined that it was not just one publisher that had the problem; it was every publisher that used the same e-journal hosting service. There were about a thousand e-journals affected. And it had been that way for at least a year. My guess is that this link failure had frustrated users at least a million times.

Ninety-nine times out of a hundred, the user would just try clicking the link again, and wouldn't think twice about the problem. But that other 1% (10,000 users) would ask a Local Expert for help: perhaps a librarian, perhaps a more tech-savvy friend, whatever. The Local Expert would extend her aura over the user's computer, the user would try it again, and the link would work like a charm.

Ninety-nine times out of a hundred, the Local Expert would just smile, having "solved" the problem. But 1% of them (100) would smell something fishy, and ask the user to try another link, and it would fail. The Local Expert would go back to her desk, and try it a few more times, getting more and more mystified at the flakiness of the link, and decide to report it to a Responsible Party.

Ninety-nine times out of those hundred, the Responsible Parties would, for a variety of reasons, fail to do anything about the flaky links. Maybe it was another company's problem. Maybe the problem was too hard, or not bad enough. Maybe they didn't really care.

The hundredth time, it was me. I actually enjoyed finding the bug. I knew the top technical guy at the e-journal hosting company, and I just sent him a description of the issue. Because he knew me, he paid attention, and the problem was fixed in their next update. Millions and millions of the links have worked since.

Isn't that hilarious?

3 comments:

  1. Thanks! from a person who manages one academic library's link resolver and is often the 'local expert' called on to solve these problems
