June 04, 2011

A Scorched Data Policy Is Bad for Web, Bad for History

In a world where the cost of storage is practically zero and the incentive to delete data keeps declining, I'm befuddled by two things: that some companies don't make a complete search history a priority, and that, in parallel, more prominent content producers intentionally erase information, seeming to arbitrarily decide that the long tail of the Web and the interest of future readers are less important than being seen as participating in the latest hip wave.

Even as I've changed technologies, blog providers and structures, I make extra effort to not lose historical archives and keep comment streams intact, not for ego purposes or for SEO, but because it's the right thing to do.

Steve Rubel, an executive vice president of global strategy for Edelman and a longtime blogger who was among the first to espouse the benefits of blogging and social media, and who was often at the leading edge of tech only a few years ago, has more recently swayed to and fro based on the hot startup of the day, leaving thousands of broken links in the process. After running out of time to update his blog regularly, back in 2009 he ditched the blog to run a lifestream based on Posterous (Google cache). Last month, he pivoted again (another hot thing to do in the Valley these days for those with unsuccessful ideas) and is now the proud owner of a Tumblr-powered blog.

Pivot Number One Saw Steve Go to Posterous

Pivot Number Two Saw Steve Go to Tumblr, and Delete Everything

But instead of leaving the older sites open, he deleted everything, proudly stating:
"With just two clicks of a mouse I rid the web of literally thousands of blog posts, some of which I am proud of - others less so - and redirected the URLs to the new site."
While no doubt many of his older posts (like mine) have limited or no value to today's readers or those in the future, they are an insightful historical record of one of the more visible bloggers of a specific period, one who is now in the process of erasing his tracks.

With more than five years of blogging myself here, the blog archives now take on their own role as a personal reference desk. When Kevin Rose left Digg, I was able to go back to my comments on Digg in 2006 to see my thoughts at the time. When TweetDeck was sold to Twitter, I could go back to 2008 and see my first thoughts on the service. With PostRank selling to Google yesterday, I found a post I made two years ago that had comments on it from people who now work at Google. If I hadn't kept the posts and the comment threads alive, none of this would be possible.

Steve's Community Is Very Unhappy With the Deletions, Calling it "Nutty"

Others Say Erasing "A Bit Much", Want Their Comments Out

For those of you who don't consume my blog exclusively by RSS, you might have noticed some recent changes to the look and feel (Go ahead and look). It's not a major change, but an upgrade nonetheless. Part of the reason for my slow migration was how critical it was for me not to lose the existing posts, structure, external links and attached discussions. I know some comments from a few years ago are still out of reach, but I'm hoping to bring them back.

Of all the Web content I have produced or managed in the last fifteen years, one of my biggest regrets is the complete void from my time in college, both from my own personal home page, and from the student newspaper where I was the online editor and also wrote hundreds of posts, most on the front page. Current Valleywag editor Ryan Tate, who picked up the online editor job at the paper after I had left, struggled with a series of malicious hacks, and all our collective work was gone, erased from the Web like a bad memory. Whenever we trade emails or talk in person, we both lament this loss.

Steve's Original Micro Persuasion Blog, Pre-Deletion

This data obliteration is something that really is avoidable now, and yet, we let it happen on a near-constant basis. Most newspaper stories from the terrorist attacks in 2001 come up as 404s, beyond the reaches of the Archive.org project, or Google's search engine cache.

What I would like to see is a proposal from Google, or some other well-intended Web entity, such as Amazon, to offer a solution, embedded in today's modern browsers, as an option, that solves for intentional or unintentional content deletion. All those links that I provided back to Steve Rubel's MicroPersuasion blog from 2006 to 2009 should automatically be detected as dead, and then presented, to the best of the tech's ability, as they originally were, using Google cache or S3, Archive.org or something. And yes, I'd love it if somebody like Google or Microsoft would also give Twitter or Facebook a helping hand to get their own search archives into something useful.
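To make the idea a little more concrete, here is a minimal sketch in Python of the detection step, assuming the simplest case where a dead link answers with a 404 or 410. The function names (`http_status`, `wayback_url`, `resolve_link`) are my own illustrations, not part of any browser or Google product, and the Archive.org address pattern used for the fallback is an assumption about how the Wayback Machine exposes captures.

```python
import urllib.error
import urllib.request
from typing import Callable

# Status codes treated as "this page is gone" (an assumption for this sketch).
DEAD_STATUSES = {404, 410}


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a HEAD request to `url`.

    Network-level failures (DNS errors, timeouts) are not handled here
    and will raise.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; keep just the code.
        return err.code


def wayback_url(url: str, near: str = "") -> str:
    """Build an Archive.org address for `url`.

    `near` is an optional timestamp prefix such as "2008", asking for a
    capture close to that date.
    """
    prefix = "https://web.archive.org/web/"
    return f"{prefix}{near}/{url}" if near else f"{prefix}{url}"


def resolve_link(
    url: str,
    fetch_status: Callable[[str], int] = http_status,
    near: str = "",
) -> str:
    """Return `url` if it still resolves, otherwise an archived address."""
    if fetch_status(url) in DEAD_STATUSES:
        return wayback_url(url, near)
    return url
```

Passing a stand-in for `fetch_status` lets the logic be tried without touching the network. A case like Steve's, where the old URLs redirect to a new site instead of returning a 404, would slip past this check, so a real tool would also have to notice when an old link lands somewhere unrelated.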

Steve Thinks The World Won't Care About His Old Posts.

The issue I have with Steve's approach of pouring gasoline on his past and then lighting it on fire is not his choice of platform. While seeing him join Tumblr is about as hip as your dad trying to snowboard with the cool kids, the more important part is that it eliminates the choice for readers, present and future, to ever get that data and fill in the blanks. It's not his call to decide what has value for others, even if he sent me a tweet saying "we're foolish if we think the world really cares."

The world should care about walking through the historical record - be it on Steve's blog, or Dave's blog, or Robert's blog, or Mike's blog, or Penelope's, or any of the people who have been chronicling the world they see around them. I wish we had full archives from newspapers for years and decades backward, or the personal journals of people famous and ordinary from centuries past. What they might have found mundane is intriguing to others of us - maybe not massive populations, but to one person, they could contain serious insight. The Web is supposed to cater to the long tail, and the history of what we've produced should be there when they come looking for it.