
Detecting Duplicates Within XML Feeds

[Screenshot: the same web page shown within Google Reader multiple times]

At the end of January, I commented on and offered a suggestion to the Google Reader team about how to improve their product by removing duplicate feed items.

At the time, I didn’t think to post a screenshot to aid in my explanation but remembered to grab one recently and felt it would help explain just how annoying this can be within Google Reader.

From the screenshot, you can see that I have highlighted eight different references to an article by Simon Willison about jQuery-style chaining with the Django ORM. When a human looks at that image, it is abundantly clear that each of the eight highlighted references ultimately links through to the same page.

The Google Reader team could use a duplicate-detection feature like this to their advantage by collapsing the duplicates and offering a visual cue that an item is hot or popular, based on the number of references found to the same article. Google search already knows the date/time when content is published, so combining that information with the number of inbound references discovered, the count of duplicates collapsed within your RSS streams could be quite a useful signal.
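To make the idea concrete, here is a minimal sketch of how collapsing might work. It assumes each feed item exposes the URL it links to; the normalisation rules (lower-casing the host, stripping fragments and trailing slashes) are my own illustrative guesses, not anything Google Reader actually does.

```python
from collections import OrderedDict
from urllib.parse import urlsplit, urlunsplit

def normalise(url):
    """Strip fragments and trailing slashes so trivially different
    links to the same page compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

def collapse(items):
    """Group feed items by the page they link to; keep the first item
    seen for each page and count the duplicates as a popularity score."""
    groups = OrderedDict()
    for item in items:
        key = normalise(item["link"])
        if key in groups:
            groups[key]["count"] += 1
        else:
            groups[key] = {"item": item, "count": 1}
    return list(groups.values())
```

The `count` field is exactly the "how hot is this?" signal described above: eight bookmarks of the same Simon Willison article would collapse to one entry with a count of eight.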

I know I would really love better facilities within Google Reader for detecting duplicates within RSS; it would remove so much noise from the information stream when you're trying to keep an eye on what is happening within the community.

Google Reader Duplicate Item Improvement

One of the features that I love about RSS is that it allows users to keep their finger on the pulse of certain topics very easily. As some people may know, I quite like the Python web framework Django, and I use Google Reader to help me stay up to date with what is happening within the greater Django community. I do this by subscribing to the RSS feeds of people who I know regularly write about it, but also by utilising online social bookmarking sites such as del.icio.us and ma.gnolia.

I recently read an article by James Holderness about detecting duplicate content within an RSS feed (via). For those not bothered with the jump, James outlines the different techniques that a number of popular feed-reading products use to detect duplicate RSS content, ranging from using the id (guid) field within the RSS down to comparing title and description information.
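The two families of techniques James describes can be sketched in a few lines. This is my own illustrative simplification, not code from his article: prefer the feed's own id/guid when one is present, and otherwise fall back to fingerprinting the title and description together.

```python
import hashlib

def fingerprint(item):
    """Prefer the feed's own id/guid; fall back to hashing the
    title and description, as some aggregators reportedly do."""
    if item.get("guid"):
        return ("guid", item["guid"])
    digest = hashlib.sha1(
        (item.get("title", "") + "\x00" + item.get("description", "")).encode("utf-8")
    ).hexdigest()
    return ("content", digest)

def is_duplicate(item, seen):
    """Track fingerprints already encountered in a `seen` set."""
    key = fingerprint(item)
    if key in seen:
        return True
    seen.add(key)
    return False
```

Note that neither technique helps with the cross-feed case described below, where eight different bookmarking accounts produce eight items with eight distinct guids that all point at the same page.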

Back to the improvement, which is related to the information that James provided. When I subscribe to the social bookmarking sites, they provide back a huge stream of content matching certain criteria. The ones I'm subscribing to at the moment for Django are:

As you can imagine, each of these services has a different and overlapping user base, whose members will find common items throughout the internet and bookmark them each day. When that stream of information is received by Google Reader, it will display half a dozen copies of the same resource, each masked behind a different user account within their bookmarking tool of choice.

What would be a great optional feature to add into Google Reader is the ability to detect duplicate items whether they are sourced from the same domain or from different domains.

The trick to something like this would be identifying the pattern, so that Google could flag it algorithmically. For the sake of this concept, I think it'd be reasonable to consider an item posted to a social bookmarking site and an aside or link drop in a blog to be reasonably similar.

My initial concept would involve comparing the amount of content within an item. If an item contains fewer than a predefined number of words and only a small number of links, then it might be considered a link drop. You could apply that logic not only to social bookmarking sites but also to a standard blog, where an author might find something cool they want to link to.
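That heuristic is simple enough to sketch directly. The thresholds below are illustrative guesses of my own; the point is only the shape of the test, namely few words plus few links equals probable link drop.

```python
import re

# Thresholds are illustrative guesses, not anything Google has published.
MAX_WORDS = 50
MAX_LINKS = 3

def looks_like_link_drop(description_html):
    """Heuristic: short item descriptions containing only a link or
    two are probably bookmarks/asides rather than original articles."""
    links = len(re.findall(r"<a\s", description_html, re.IGNORECASE))
    text = re.sub(r"<[^>]+>", " ", description_html)  # crude tag strip
    words = len(text.split())
    return words < MAX_WORDS and links <= MAX_LINKS
```

A del.icio.us item ("Neat post about Django: [link]") passes the test, while a full article body fails it, so only the former would be eligible for duplicate collapsing.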

The next thing up for consideration might be which items to remove as duplicates and which to keep. News of any kind on the internet tends to reverberate quite quickly, so it's common to find the same information posted many times. As the vibrations are felt throughout, people will tend to link back to where they found that information (as I did above). Google Reader could leverage the minty fresh search engine index to help with this, by using the number of attribution links passed around. As a quick and simple example, imagine the following scenario:

  • Site A has the unique Django content that I’m interested in
  • Sites B through Z all link to site A directly
  • Some sites from C through Z also link back to B, which is where they found that information

I don’t subscribe to site A directly; however, some of the sites B through Z have been picked up by the social networks. Using the link graph throughout those sites, it'd be possible to work out which one(s) among that list are considered authoritative (based on attribution or back links) and start filtering based on that. It might then be possible to use other features of the Google Search index relating to theme, quality, and trust to filter it further.
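The scenario above can be sketched with a toy link graph. Scoring each site by in-degree (how many others link to it) is a deliberately crude stand-in for whatever authority signals Google's real index would bring to bear, but it is enough to show site A rising to the top.

```python
from collections import Counter

def authority_scores(link_graph):
    """link_graph maps each site to the set of sites it links to.
    Score each site by how many others link to it (in-degree), a
    crude stand-in for attribution-based authority."""
    scores = Counter()
    for source, targets in link_graph.items():
        for target in targets:
            if target != source:  # ignore self-links
                scores[target] += 1
    return scores

# Mirrors the scenario above: A holds the original content, the other
# sites all link to A, and a couple also credit B where they found it.
graph = {
    "B": {"A"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
}
```

Running `authority_scores(graph)` ranks A first and B second, so a reader could keep the item from the most authoritative source and collapse the rest as duplicates.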

I think a feature like that within Google Reader would be fantastic, especially if I could apply those sorts of options on a per-feed or per-folder basis. That way, I could group all of the common information together (Django) and have Google Reader automatically filter out the duplicates that matched the above criteria.

I’m sure the development team from Google Reader will hear my call; who knows, in a few months maybe a feature like this could work its way into the product.