September 19, 2013

Our Fragile Web of Dead Domains and Lapsing Links

For all our talk about how much information is produced every year, and how every little piece of our lives is being shared and instantly discoverable, it is surprising how difficult it can be to find information on the Web in its original state from just five years ago, let alone ten or fifteen.

While we were once regaled with stories of how the Internet could withstand nuclear war, thanks to a complicated structure of redundancy and geographical backups, simple human mistakes, combined with the occasional act of malice, have reduced our expectation that data, once posted, will be forever stable.

I believe strongly in the concept of the cloud, and have moved practically all my data to it, relying on a cloud-centric laptop each day and saving my files in the cloud. But not everyone is as careful about selecting providers and maintaining platforms and domains as I have been, and it's not too uncommon for entire sites and bookmarks to vanish from the Web, with only Archive.org and other clever cachers left to tell the tale.

Before this sounds like a sordid tale or anti-Web screed from someone embedded in the Web, let me give you a specific example. Imagine, if you would, a unique URL pointing back to the original dotcom boom - a simple one like .com.com. No, I didn't type it twice. The domain com.com has long been owned by CNET, part of CBS Interactive. While CNET has had a number of owners over the decades since its launch, it has also varied wildly in how it marketed its main site and URLs. For those of us who just wanted the news, it was News.com. Later News.com would redirect to News.Cnet.com, as it does today.

Archive.org Shows News.com.com in a Previous Known Good State

But at one point, News.com redirected to News.com.com, and for whatever reason, that's the bookmark I've had follow me from browser to browser for years. Suddenly, a couple months ago, this bookmark stopped working, instead showing me a directory of links, which looked like a squatter had seized the URL and taken it over. Archive.org shows the same. News.com.com worked and then in July... it stopped redirecting. So that's annoying. While it's simple enough for me to update my bookmark, and no doubt I'm in the small minority of users who kept that URL, to have a potentially high profile URL like .com.com do absolutely nothing ... seems foolish.

But Now The Old News.com.com URL is Pure Rubbish

Enough about .com.com. What's just as frustrating at times is the short shelf life for links and images from years past. My own blog has been around a little over seven years, and each of the 3,000 or so posts I've made since the start has had many links. As companies come and go, their websites and the links to their subpages go away. News media sites, which one would hope would present a timeless archive, a long tail of information for the truth seekers out there, are often the worst offenders, as articles of a certain age fall behind a paywall, or site platforms change, breaking the link structure for good and rendering previous links useless.
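For anyone who wants to audit their own archive for this kind of decay, here is a minimal sketch of how the results of a link check might be sorted into buckets. It assumes some HTTP client has already followed each link and reported the final URL and status code; the function name, the bucket labels, and the choice of which status codes count as a likely paywall are all my own illustrative assumptions, not a standard.

```python
from urllib.parse import urlsplit


def classify_link(original_url, final_url, status):
    """Label one link-check result (hypothetical helper).

    original_url: the link as it appears in an old post
    final_url:    where the request ended up after redirects
    status:       the final HTTP status code, or None if the request failed
    """
    if status is None:
        return "unreachable"  # DNS failure, timeout, dead domain
    if status in (404, 410):
        return "gone"  # the page itself no longer exists
    if status in (401, 402, 403):
        return "blocked"  # possibly a paywall or login wall (an assumption)
    if status >= 400:
        return "error"  # any other client or server error
    if urlsplit(original_url).hostname != urlsplit(final_url).hostname:
        # Landed on a different host: could be a rename, could be a squatter.
        return "moved-host"
    if original_url != final_url:
        return "moved"  # same host, different path
    return "ok"


print(classify_link("http://news.com.com/", "http://news.cnet.com/", 200))
```

A "moved-host" result is the one that deserves a manual look, since a friendly redirect and a squatter's directory page look the same to a script that only sees a 200 response.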

One of the biggest regrets I have, as far as the Web is concerned, is one that +Ryan Tate (now with +WIRED) and I share. In the Web's more prehistoric times, back in the late 1990s, he and I both worked at the Daily Californian student newspaper at +UC Berkeley. Both of us wrote hundreds of news stories, covering everything from student elections to campus crazies and the occasional homicide. But at one point, after I'd left the paper, our site was hacked/corrupted, and all the existing content was lost - shockingly without backup. So more than 99% of that data is gone for good, and anyone who wants to see our work will have to travel to the Berkeley campus and pick up a hardbound paper. While few of my stories are worth reading about 15 years later, they are part of my own personal (and work) history, which has little record.

From writing on actual death to link death... while CNET's link lapse with .com.com is a surprise, it could possibly be due to a sale, to be used by an unknown acquirer, or simple neglect. Worse is when one sees links automatically shortened, only for the URL shortener to disappear, or for the hosting service to invalidate other short links. While I've made my case for URLs to look good and be intuitive, we've gotten used to seeing smaller URLs, best exemplified by the t.co from Twitter, helpful on their service, along with goo.gl from Google, bit.ly, and others. But by buying into a short URL service, you depend on its owners to keep it running and to keep its lookup tables intact. So for those of us who long shared with ff.im from FriendFeed, it is by the grace of Facebook that those old links are still around, and practically nobody would be surprised if they went the way of the dodo in the next couple of years.
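To make that dependency concrete, here is a toy sketch of what a URL shortener boils down to: a table mapping short codes to long URLs. This is not how t.co, goo.gl, bit.ly or ff.im are actually built; the class, its methods, and the counter-based code scheme are all hypothetical, chosen only to show that the short code carries no information about its destination.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n):
    """Encode a non-negative integer as a compact base-36 string."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


class ShortLinkTable:
    """Hypothetical in-memory shortener: the table is the whole service."""

    def __init__(self):
        self._targets = {}  # short code -> original long URL

    def shorten(self, long_url):
        # Hand out the next sequential code; nothing in it encodes the URL.
        code = to_base36(len(self._targets) + 1)
        self._targets[code] = long_url
        return code

    def resolve(self, code):
        # Returns None when the code is unknown, i.e. the link is dead.
        return self._targets.get(code)

    def wipe(self):
        # Simulates the service shutting down or losing its data.
        self._targets.clear()


table = ShortLinkTable()
code = table.shorten("http://example.com/a/very/long/permalink")
print(code, "->", table.resolve(code))
table.wipe()
print(code, "->", table.resolve(code))
```

Once the table is gone, every short link ever shared resolves to nothing, and there is no way to reconstruct the destination from the code alone. The long URL, ugly as it may be, at least names where it is going.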

My argument is that the Web should be built for permanence. A link I post today should be a link that works later. A permalink to a dedicated page with content should produce that same content in the future, even if the surrounding frame has been upgraded. And short links and domains should behave in a trustworthy, user-friendly manner. It would be a minor tragedy if the start page you use daily suddenly became something else, and a larger one if the domain on which you host your personal stories simply closed shop because the host didn't find it financially feasible to keep going any more. So while the magic of the Web is real, and it sometimes does seem you can practically find anything out there and get it instantly, assuming speedy broadband, the gaps have me thinking we can do better. And yeah +CNET, what's up?