One of the features I love about RSS is that it lets users keep a finger on the pulse of particular topics very easily. As some people may know, I quite like the Python web framework Django, and I use Google Reader to keep up to date with what is happening in the wider Django community. I do this by subscribing to the RSS feeds of people I know write regularly about the product, but also by using online social bookmarking sites such as del.icio.us and ma.gnolia.
I recently read an article by James Holderness about detecting duplicate content within an RSS feed (via). For those not bothered with the jump, James outlines the different techniques that the top feed reading products use to detect duplicate RSS content, which range from using the id field within the RSS down to comparing title and description information.
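To make that range of techniques a little more concrete, here is a minimal sketch of the general idea, not the method of any particular feed reader. It assumes each item is a plain dictionary with optional "id", "title" and "description" keys; the function names dedupe_key and drop_duplicates are made up for illustration.

```python
import hashlib


def dedupe_key(item):
    """Build a comparison key: prefer the id field, else fall back to content."""
    item_id = (item.get("id") or "").strip()
    if item_id:
        return "id:" + item_id
    # No usable id, so compare on whitespace-normalised, lower-cased text.
    title = " ".join((item.get("title") or "").split()).lower()
    description = " ".join((item.get("description") or "").split()).lower()
    digest = hashlib.sha1((title + "\n" + description).encode("utf-8")).hexdigest()
    return "content:" + digest


def drop_duplicates(items):
    """Keep the first item seen for each key, preserving the original order."""
    seen = set()
    unique = []
    for item in items:
        key = dedupe_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

Falling back to a hash of the title and description only when the id is missing is one possible ordering; a real reader might well weigh these fields differently.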
Back to the improvement, which is related to the information that James provided. When I subscribe to the social bookmarking sites, they end up returning a huge range of content matching certain criteria. The ones I’m subscribing to at the moment for Django are:
As you can imagine, each of these services has a different but overlapping user base, and each user base will find common items throughout the internet and bookmark them each day. When that stream of information is received by Google Reader, it will display the same resource half a dozen times, each copy masked by a different user account within that user’s bookmarking tool of choice.
A great optional feature to add to Google Reader would be the ability to detect duplicate items, whether they are sourced from the same domain or from different domains.
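One simple way to picture this, and it is only my assumption about how it might be approached, is to group items by the resource they point at rather than by the feed they arrived in. The sketch below assumes each item is a dictionary with a "link" key holding the bookmarked URL; normalise_url and group_by_target are hypothetical names, and the normalisation rules (dropping "www.", trailing slashes and fragments) are deliberately crude.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url):
    """Reduce trivially different URLs for the same resource to one form."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    # Keep the query string, drop the fragment.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def group_by_target(items):
    """Map each normalised target URL to the items that point at it."""
    groups = defaultdict(list)
    for item in items:
        groups[normalise_url(item["link"])].append(item)
    return dict(groups)
```

Any group holding more than one item is then a candidate set of duplicates, regardless of which bookmarking service or user account each copy came from.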
The trick to something like this would be identifying the pattern, so as to allow Google to use an algorithm to flag it. For the sake of this concept, I think it’d be reasonable to consider items posted into social bookmarking sites and an aside or link drop in a blog to be reasonably similar.
My initial concept would involve comparing the amount of content within an item. If an item has fewer words than a predefined limit and only a small number of links, then it might be considered to be a link drop. You could apply that logic not only to social bookmarking sites but also to the standard blog, where an author might find something cool they want to link to.
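As a rough sketch of that heuristic: the thresholds (50 words, 2 links), the name is_link_drop and the requirement that there be at least one link are all my own assumptions, and the regular expressions are a crude stand-in for proper HTML parsing.

```python
import re

# Crude patterns: good enough for a sketch, not for arbitrary HTML.
LINK_PATTERN = re.compile(r"<a\s[^>]*href=", re.IGNORECASE)
TAG_PATTERN = re.compile(r"<[^>]+>")


def is_link_drop(html, max_words=50, max_links=2):
    """Guess whether an item is a short note wrapped around one or two links."""
    link_count = len(LINK_PATTERN.findall(html))
    # Strip the tags so that only the visible words are counted.
    text = TAG_PATTERN.sub(" ", html)
    word_count = len(text.split())
    return word_count < max_words and 1 <= link_count <= max_links
```

For example, is_link_drop('Cool: <a href="http://example.com/">Django tip</a>') is true, while a long article containing the same link is not.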
The next thing up for consideration might be which items to remove as duplicates and which to include. News of any kind on the internet tends to reverberate throughout it quite quickly, so it’s common to find the same information posted many times. As the vibrations are felt throughout, people will tend to link back to where they found that information (as I did above). Google Reader could leverage off the minty fresh search engine index to help with this by using the number of attribution links passed around. As a quick and simple example, imagine the following scenario:
- Site A has the unique Django content that I’m interested in
- Sites B through Z all link to site A directly
- Some sites from C through Z also link back to B, which is where they found that information
I don’t subscribe to site A directly; however, some of the sites B through Z have been picked up by the social networks. Using the link graph throughout those sites, it’d be possible to find out which one(s) among that list are considered authoritative (based on attribution or back links) and start filtering based on that. It might then be possible to use other features of the Google Search index, such as theme, quality and trust, to filter it further.
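The scenario above can be sketched with a plain back link count. This is only an illustration under my own assumptions: the link graph is a dictionary mapping each site to the sites it links to, rank_by_backlinks and most_authoritative are hypothetical names, and a raw count is far simpler than whatever a real search index would use.

```python
from collections import Counter


def rank_by_backlinks(links):
    """Count how many distinct other sites link to each site."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def most_authoritative(candidates, links):
    """Pick the candidate with the most back links (first one wins a tie)."""
    counts = rank_by_backlinks(links)
    return max(candidates, key=lambda site: counts[site])


# A cut-down version of the scenario: B, C and D all link to A,
# and C and D also credit B as the place they found it.
links = {"B": ["A"], "C": ["A", "B"], "D": ["A", "B"]}
subscribed = ["B", "C", "D"]
best = most_authoritative(subscribed, links)
```

Here best is "B", since it is the subscribed site that the others attribute, so the copies from C and D would be the ones filtered out.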
I think a feature like that within Google Reader would be fantastic, especially if I could apply those sorts of options on a per-feed or per-folder basis. That way, I could group all of the common information together (Django) and have Google Reader automatically filter out the duplicates that matched the above criteria.
I’m sure the development team from Google Reader will hear my call; who knows, in a few months maybe a feature like this could work its way into the product.