March 06, 2007

Maintaining Integrity of Web Archives Is Essential

On Monday, IDC announced that an estimated 161 exabytes of digital information were created in 2006, by far a record. The growth is expected to continue, approaching a zettabyte, roughly six times last year's number, by the year 2010. Increasingly, the world's information is digital, and a great deal of it is online.

As more information related to news, commerce and personal histories goes online, it becomes increasingly essential to ensure the long-term integrity of that data, not just for today, but for future years, to accurately record a history of our time, and to complete the massive linkage needed for the Internet to act as a true conduit of a culture.

Even in the last decade or so, we can see how data has been presented for the short term, without much care as to future access. It's not uncommon to see links to news articles from months or years ago return 404 errors, as sites change directory structures or delete old data in a wholly misguided effort to save space. In other cases, domain names go unrenewed and their data is lost, or hackers maliciously change, corrupt or delete sites that cannot then be recovered. Even some of the most popular sites, like Yahoo! News, which carries articles from the Associated Press, give their content a limited shelf life: articles rise to prominence and fall away quickly, to be forgotten forever as their URLs expire.

I strongly believe that every webmaster, blogger, and news medium needs to pay as much attention to archival data as they do to present-day data and future plans. The very Web-like nature of the World Wide Web demands that one link lead to another and to another, with multiple hyperlinks reaching out like strands from the center of the Web to the edge, each offering multiple, optional paths from one location to another. If webmasters and site owners opt to make wholesale changes to directory structures, domain names, or server names, they should do so in a way that does not impact historical data. External hyperlinks should remain unbroken. Search engine archives should return true data, and access to those archives should remain clear. This line of thought matches the thinking of Web pioneer Dave Winer, author of Scripting.com.
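One way to change a directory structure without breaking old links is to keep a table of retired paths and answer requests for them with a permanent (301) redirect to the new location. What follows is only a minimal sketch of that idea, not a description of how any particular site does it. The prefixes in MOVED_PREFIXES and the function name resolve_legacy_path are made up for illustration, and a real site would wire this lookup into whatever Web server or publishing system it already runs.

```python
from typing import Optional, Tuple

# Hypothetical mapping from retired path prefixes to their new homes.
# These prefixes are invented for illustration only.
MOVED_PREFIXES = {
    "/archive/": "/news/archive/",
    "/~staff/": "/people/",
}


def resolve_legacy_path(path: str) -> Optional[Tuple[int, str]]:
    """Return (301, new_path) if path sits under a retired prefix, else None."""
    # Check longer prefixes first so the most specific rule wins.
    for old in sorted(MOVED_PREFIXES, key=len, reverse=True):
        if path.startswith(old):
            # Keep everything after the old prefix so deep links survive.
            return 301, MOVED_PREFIXES[old] + path[len(old):]
    return None


if __name__ == "__main__":
    print(resolve_legacy_path("/archive/1997/story.html"))
    print(resolve_legacy_path("/about.html"))
```

The point of the design is that the old address keeps working indefinitely: anyone following a years-old link lands on the same article, and search engines learn the new address from the 301 status.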

If you run a wholly Web-based blog, make backups to your local computer or offline, to CDs, DVDs or even an iPod. I back up to my laptop, to my iPod, and through Apple's .Mac Backup service.
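For anyone who keeps a local copy of their site's files, even a few lines of script can turn "make backups" into a habit. The sketch below assumes the site has already been downloaded or exported to a local folder; the function name backup_site and the "site-backup-" file name are my own inventions, and it is not tied to any particular blog host or backup service.

```python
import shutil
import time
from pathlib import Path


def backup_site(source_dir: str, backup_dir: str) -> Path:
    """Write a dated .zip of source_dir into backup_dir and return its path."""
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # A timestamp in the name keeps older backups from being overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = dest / f"site-backup-{stamp}"
    # make_archive appends the ".zip" extension and returns the full file name.
    archive = shutil.make_archive(str(base), "zip", root_dir=source_dir)
    return Path(archive)
```

Keep backup_dir outside of source_dir, ideally on a separate drive, so that the archive does not end up inside the folder it is copying.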

From 1996 to 1998, I worked for the student newspaper at UC Berkeley, and set up a search engine to index all articles on the newspaper's online site from the previous three to four years. Later, due to a series of site overhauls and the occasional webmaster error, all was lost. Now, the Daily Cal's Web site offers a search engine option for "Before 1999", and it returns nothing. The 200+ stories I wrote for the paper can only be found sporadically in Google's archives, or where they were reposted on other sites.

Others, like Robert Scoble, note that their own history was erased when blog hosts made wide-scale changes. That is something that should never happen if the Web is to continue forward for years to come.

Now, at the office, I find that media coverage of our company that I linked to in 2001 or later is gone, due to media mergers and acquisitions, site changes, or other reasons. In other areas, early personal home pages I wrote, both in college and after, are only preserved in patchwork fashion on The Internet Archive, one gallant attempt at bridging the gap between publishers' holes and the true nature of the Web.

There are countless examples of one day's treasures becoming tomorrow's trash. But if the Web is ever to be considered a historical record, we must treat it as such. Back up the data - there's tons of storage space. Preserve URL structures. Preserve domain names, and constantly look both forward and backward, or that next 404 you find could be your own.