April 30, 2007

Google News Search Removes Duplicates

One of my biggest pet peeves with Google News search has long been its inability to recognize when a story had already been tracked. It would spawn new "alerts" every hour while the content stayed the same, the only difference being that the story registered as having been posted "five hours ago" versus "four hours ago", "three hours ago", et cetera. Now, in a quiet upgrade today, it looks like Google has heard our cries and eliminated the issue once and for all.

As the two screenshots below show, Google News now offers three ways to sort search results:

1. Sorted by relevance
2. Sort by date
3. Sort by date with duplicates included

The second option, without duplicates, replaces the previous version, which showed all duplicates. That version, in turn, slid over to the right as the third option.

A quick search for Global Warming, a hot topic at Google these days, easily demonstrates the power of the new filter. Under the old system, it would spawn 34,065 results. With the deduplication filter applied, that number is slashed in half, to 17,247.


Before: Searching for Global Warming returns 34,000 results


After: The same search yields 17,000 results


And in case you think that's an isolated case, here are some more searches, before and after.

* Apple: 17,568 with the filter, and 30,419 without.
* Intel: 6,411 with the filter, and 12,783 without.
* Microsoft: 20,897 with the filter, and 40,847 without.

Looks like a fairly standard 2 to 1 reduction across the board. How does the new filter fare on current events, or "real news"?

* Virginia Tech: 35,325 with the filter, and 82,270 without.
* Freeway Collapse: 234 with the filter, and 1,640 without.
* Golden State Warriors: 1,959 with the filter, and 8,182 without.

There, the ratio is even more dramatic, anywhere from about 2.3 to 7 times.
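For anyone who wants to check the arithmetic, here is a minimal Python sketch that divides the unfiltered count by the filtered count for each search quoted in this post. The names `counts` and `reduction_ratio` are my own, made up for illustration; the numbers are simply the result counts listed above, not anything pulled from Google programmatically.

```python
# Result counts quoted in this post, as (with filter, without filter).
# These are hand-copied figures, not live data from Google News.
counts = {
    "Global Warming": (17_247, 34_065),
    "Apple": (17_568, 30_419),
    "Intel": (6_411, 12_783),
    "Microsoft": (20_897, 40_847),
    "Virginia Tech": (35_325, 82_270),
    "Freeway Collapse": (234, 1_640),
    "Golden State Warriors": (1_959, 8_182),
}


def reduction_ratio(with_filter: int, without_filter: int) -> float:
    """Return how many times larger the unfiltered result count is."""
    return without_filter / with_filter


if __name__ == "__main__":
    for query, (filtered, unfiltered) in counts.items():
        ratio = reduction_ratio(filtered, unfiltered)
        print(f"{query}: {ratio:.1f}x fewer results with the filter")
```

Running it prints one line per search, which makes it easy to compare the company-name searches against the breaking-news ones.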

As I rely on Google News searches to keep me on top of my company and its competitors, and have imported these news searches into my RSS feeds in Google Reader, the deduplication effort by Google is much anticipated and appreciated. Even if Google wants to find and manage all the world's information, they know the value of reducing the noise.