Archive for January, 2008

Google Reader Duplicate Item Improvement

One of the features I love about RSS is that it lets users keep their finger on the pulse of certain topics very easily. As some people may know, I quite like the Python web framework Django, and I use Google Reader to keep up to date with what is happening within the greater Django community. I do this by subscribing to the RSS feeds of people who I know regularly write about the product, and also by using online social bookmarking sites such as del.icio.us and ma.gnolia.

I recently read an article by James Holderness about detecting duplicate content within an RSS feed (via). For those not bothered with the jump, James outlines the different techniques that the top feed reading products use to detect duplicate RSS content, which range from using the id field within the feed down to comparing title and description information.
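As a rough illustration of that range of techniques, here is a minimal sketch that prefers an item's id and falls back to a hash of its title and description. The dictionary keys and the function names item_key and drop_duplicates are assumptions made for this example; it is not how any particular feed reader is known to work.

```python
import hashlib


def item_key(item):
    """Return a key identifying a feed item for duplicate detection.

    `item` is assumed to be a dict with optional "id", "title" and
    "description" keys (a hypothetical shape chosen for this sketch).
    """
    item_id = (item.get("id") or "").strip()
    if item_id:
        # Preferred: the id supplied by the feed itself.
        return "id:" + item_id
    # Fallback: hash the title and description together.
    title = (item.get("title") or "").strip()
    description = (item.get("description") or "").strip()
    text = title + "\n" + description
    return "hash:" + hashlib.sha1(text.encode("utf-8")).hexdigest()


def drop_duplicates(items):
    """Keep the first item seen for each key, preserving order."""
    seen = set()
    unique = []
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

Note that a check like this only catches repeats of the same item; it would not notice two different bookmarks pointing at the same article.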

Back to the improvement, which is related to the information that James provided. When I subscribe to the social bookmarking sites, they return a huge range of content matching certain criteria. The ones I’m subscribing to at the moment for Django are:

As you can imagine, each of these services has a different but overlapping user base, and each day those users find and bookmark many of the same items from around the internet. When that stream of information is received by Google Reader, it will display half a dozen copies of the same resource, each masked by a different user account within their bookmarking tool of choice.

A great optional feature to add to Google Reader would be the ability to detect duplicate items, whether they are sourced from the same domain or from different domains.

The trick to something like this would be identifying the pattern, so as to allow Google to use an algorithm to flag it. For the sake of this concept, I think it’d be reasonable to consider items posted into social bookmarking sites and an aside or link drop in a blog to be reasonably similar.

My initial concept would involve comparing the amount of content within an item. If it has fewer than a predefined limit of words and only a small number of links, then that item might be considered to be a link drop. You could apply that logic not only to social bookmarking sites but also to the standard blog, where an author might find something cool they want to link to.
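A minimal sketch of that heuristic might look like the following. The thresholds (50 words, 2 links), the function name is_link_drop, and the crude regular expressions used to strip markup are all assumptions chosen for illustration; a real implementation would want a proper HTML parser and tuned limits.

```python
import re

# Crude patterns, good enough for a sketch only.
LINK_RE = re.compile(r"<a\s[^>]*href=", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def is_link_drop(html, max_words=50, max_links=2):
    """Guess whether an item body is a short 'link drop'.

    An item counts as a link drop when it has at least one link,
    no more than `max_links` links, and fewer than `max_words` words
    once the markup has been stripped.
    """
    link_count = len(LINK_RE.findall(html))
    text = TAG_RE.sub(" ", html)
    word_count = len(text.split())
    return word_count < max_words and 1 <= link_count <= max_links
```

For example, is_link_drop('Cool: <a href="http://example.com/">Django tips</a>') would flag the item, while a several hundred word post containing the same link would not be flagged.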

The next thing up for consideration might be which items to remove as duplicates and which to include. News of any kind on the internet tends to reverberate throughout it quite quickly, so it’s common to find the same information posted many times. As the vibrations are felt throughout, people will tend to link back to where they found that information (as I did above). Google Reader could leverage off the minty fresh search engine index to help with this by using the number of attribution links passed around. As a quick and simple example, imagine the following scenario:

  • Site A has the unique Django content that I’m interested in
  • Sites B through Z all link to site A directly
  • Some sites from C through Z also link back to B, which is where they found that information

I don’t subscribe to site A directly, however some of the sites B through Z have been picked up by the social networks. Using the link graph throughout those sites, it’d be possible to find out which one(s) among that list are considered authoritative (based on attribution or back links) and start filtering based on that. It might then be possible to use other features of the Google Search index to do with theme, quality, trust to filter it further.
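To make the scenario concrete, here is a small sketch that counts inbound links among a set of sites and picks the subscribed site with the most attribution. The link data, the site names and the functions count_inbound_links and pick_authoritative are hypothetical; this is a plain inbound link count, not anything Google is known to use.

```python
from collections import Counter


def count_inbound_links(links):
    """Count how many distinct sites link to each site.

    `links` maps a site name to the collection of sites it links to.
    Self-links are ignored.
    """
    inbound = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                inbound[target] += 1
    return inbound


def pick_authoritative(candidates, links):
    """Return the candidate with the most inbound links.

    Ties are resolved in favour of the earliest candidate listed.
    """
    inbound = count_inbound_links(links)
    return max(candidates, key=lambda site: inbound[site])


# Site A is the original; B to E all link to it, and C and D also
# credit B as the place they found it.
links = {
    "B": {"A"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
}
subscribed = ["B", "C", "D", "E"]
best = pick_authoritative(subscribed, links)
```

With this data, site A has the most inbound links overall, but among the subscribed sites B is the one the others attribute, so its item would be kept and the rest filtered.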

I think a feature like that within Google Reader would be fantastic, especially if I could apply those sorts of options on a per feed or folder basis. That way, I could group all of the common information together (Django) and have Google Reader automatically filter out the duplicates that matched the above criteria.

I’m sure the development team from Google Reader will hear my call; who knows, in a few months maybe a feature like this could work its way into the product.


2 Comments

Matt Mullenweg Changes Domain

Blogging master and WordPress founder Matt Mullenweg has changed domain.

Matt has been blogging for the last seven years under http://photomatt.net, which was an appropriate domain at the time. Early on in the piece, Matt would post photos regularly, and any photos of him often included his own camera.

Since leaving CNet and founding Automattic, Matt has been fiercely committed to developing the blogging platform WordPress and its associated products Ping-o-matic, Akismet and recently Gravatar.

How times have changed for Matt. After taking the initial gamble of starting Automattic, the company has just closed a US$29.5 million Series B funding round. The new funding will allow the team to not worry about money for salaries for the next few years and really focus on enhancing their current product line and building out new ones.

With the change, the new internet home of Matt Mullenweg is http://ma.tt


No Comments

Akismet Losing Its Mojo?

I have long praised the free spam fighting service Akismet, but yesterday a horribly obvious spam comment wasn’t filtered, which is very unusual. I’ll include the comment here so people can see what I’m referring to:

Name: Armond
Site: http://groups.google.com/group/otekal/web/free-bestiality-sex-stories
Message: free bestiality sex stories…
accessories distributed at most major retailers for such
…

Automattic have never disclosed with any specificity how the internals of Akismet work, however it is more than reasonable to assume that Bayesian filtering is in their spam fighting tool belt. For those that aren’t aware, Bayesian filtering works by learning, or being told, which messages are spam, and then analysing how often each word appears in spam versus non-spam messages. If a given message contains enough words associated with spam to push its score above a threshold, the message is considered spam.
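For the curious, a toy version of that idea can be sketched as a naive Bayes classifier with add-one smoothing. This is a generic textbook approach under my own assumptions (the class name TinyBayesFilter, whitespace tokenising and a 0.9 threshold are all made up for the example); it says nothing about how Akismet is actually built.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """A toy naive Bayes spam filter, for illustration only."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of a message known to be 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that `text` is spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Smoothed log prior for this class.
            score = math.log(
                (self.message_counts[label] + 1) / (total_messages + 2)
            )
            # Smoothed log likelihood of each word given the class.
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability, guarding overflow.
        diff = min(log_score["ham"] - log_score["spam"], 700)
        return 1 / (1 + math.exp(diff))

    def is_spam(self, text, threshold=0.9):
        """Flag the message when its spam probability exceeds the threshold."""
        return self.spam_probability(text) > threshold
```

After training on a handful of labelled messages, calling is_spam on a new comment compares how well its words fit the spam examples versus the legitimate ones; words never seen in training pull the score towards neither side.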

Given that it is a learning based system, so to speak, I find it hard to believe that the words bestiality and sex, in both the URL and the body of the comment, aren’t throwing up great big red flags. I’m going to put this slip up down to one of two things:

  1. I was one of the first people to receive and register that particular spam signature
  2. When the comment was submitted, the Akismet service wasn’t able to be contacted

I’m heavily leaning towards the latter, for no other reason than there are literally hundreds of thousands of blogs on the internet, so it is highly unlikely that I was one of the first to receive that particular spam signature.

Long live Akismet!


No Comments

Web Design Faux Pas

Over the last three months, Queensland Teachers’ Credit Union have been rolling out a series of small changes to some of their online services. I first noticed the updates on their internet banking site, when they removed the ability to log in with only the keyboard. It now requires that the password is entered via the mouse and a ‘moving’ keyboard.

As a by-product of recently rebuilding my home machine, I don’t have my bookmarks set up and needed to navigate to the Queensland Teachers’ Credit Union home page to find my way into their netbanking. Suffice to say, I was shocked when I was confronted by a welcome page. Apparently I missed the memo that said that welcome pages were an acceptable design decision for a web site. Not only is the welcome page poorly designed, the next page you’re presented with after clicking through isn’t a whole lot better. In my opinion, if Queensland Teachers’ Credit Union are set on having client testimonials on their site, they should remove the annoying welcome page and integrate the testimonials into the slightly better ‘home’ page.

Since the Queensland Teachers’ Credit Union are a financial institution, I would have expected that anything presented on their web site would have to go through many stages of checking and verification by various teams before it was published on their production site. If that is the case, I’m surprised that the welcome page made it into production. I wonder who considered it to be a good design decision?

3 Comments