May 08, 2009

Every Piece of the Infrastructure Carries Potential to Fail

Though it may turn out to be a temporary blip, FriendFeed is down at the moment, following a scheduled outage at Twitter this afternoon. That alone is not really news, but it comes on the heels of much discussion about the potential for failure that third-party URL shorteners bring to the Web. For every fan of TinyURL or bit.ly, there are others who say that relying on another service as a go-between for the user and the intended data is just begging for trouble. But the truth is that in any network with multiple potential points of failure between the user and the data, any one of those pieces can, in many cases, bring the entire system to its knees.
  • Storage can fail.
  • Servers can fail.
  • Networks can fail.
  • Routers can fail.
  • Lines can be cut.
  • Services can close down.
  • Users can delete images or pages.
It happens, and until we control all aspects of the system, there will be outages.
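To put rough numbers on that, here is a minimal sketch in Python. It assumes each piece fails independently and that a request succeeds only if every piece in the chain is up, so the end-to-end availability is the product of the individual ones. The component names and percentages are made up for illustration; they are not measurements of any real service.

```python
from math import prod  # Python 3.8+


def chain_availability(availabilities):
    """Availability of components in series: all of them must be up."""
    return prod(availabilities)


# Hypothetical per-component availabilities (illustrative assumptions only).
components = {
    "storage": 0.999,
    "server": 0.999,
    "network": 0.999,
    "router": 0.999,
    "url_shortener": 0.99,
}

total = chain_availability(components.values())
hours_down_per_year = (1 - total) * 24 * 365

print(f"end-to-end availability: {total:.4f}")
print(f"expected downtime: {hours_down_per_year:.0f} hours/year")
```

Under these assumptions the chain as a whole is less available than its weakest link, and every extra hop can only push the number lower.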

On Wednesday, in the middle of testing a third-party Twitter service, I linked to the Guardian using a URL shortener called tr.im, which the service required in order to work. Later that night, tr.im failed, breaking every tr.im link in use.


The conversation (in Google cache)

In response, Paul Buchheit, co-founder of FriendFeed, with a long history at Google, Microsoft and Intel prior to his latest efforts, referenced the break, calling it "another reason why url shorteners are annoying."

But FriendFeed itself has a URL shortener, called ff.im, which it uses when sending updates to Twitter. Paul added in the thread, "Except ff.im of course :)"

But guess what? Because FriendFeed is down (for now), the ff.im links are down too, which makes them as likely to fail as those of any other third-party shortener. I could rant up and down saying that FriendFeed and ff.im should be served from different data centers, or offer better redundancy, but I won't. Nobody loves downtime, and FriendFeed by and large has had a fantastic track record of staying up. But as they become a more integrated part of the ecosystem, they too will get more opportunities to fail, and they will need to take the same safeguards to protect the infrastructure as all the other players do.
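For what redundancy could buy in principle, here is a rough Python sketch. It assumes identical copies of a service that fail independently of one another, so the service is down only when every copy is down at once. The 99% figure is a hypothetical, not FriendFeed's actual uptime, and real data centers often share failure causes, which makes the true gain smaller.

```python
def redundant_availability(single, replicas):
    """Availability of independent replicas in parallel: down only if all are down."""
    return 1 - (1 - single) ** replicas


single_site = 0.99  # hypothetical availability of one data center
two_sites = redundant_availability(single_site, 2)

print(f"one site:  {single_site:.4f}")
print(f"two sites: {two_sites:.4f}")
```

The catch is the independence assumption: a shortener hosted alongside the service it points to goes down with it, so it adds a hop without adding any redundancy.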

Things will fail. We will live, but we know that there is no such thing as a fully redundant failsafe machine. Every hop delivers the potential to turn into a skip, and not in a good way.