Archive for the ‘Services’ Category

Detecting Duplicates Within XML Feeds

Thursday, May 8th, 2008

The same web page, shown within Google Reader multiple times

At the end of January, I commented on and offered a suggestion to the Google Reader team about how to improve their product by removing duplicate feed items.

At the time, I didn’t think to post a screenshot to aid in my explanation but remembered to grab one recently and felt it would help explain just how annoying this can be within Google Reader.

From the screenshot, you can see that I have highlighted eight different references to an article by Simon Willison about jQuery-style chaining with the Django ORM. When a human looks at that image, it is abundantly clear that each of the eight highlighted references is ultimately going to link through to the same page.

The Google Reader team could use this new feature to their advantage by collapsing the duplicates and offering a visual clue that the item is hot/popular based on the number of references found to the same article. Google search already has the notion of the date/time when content is published, so using that information along with the number of inbound references they discover, the number of duplicates collapsed within your RSS streams could be quite useful.
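
As a rough illustration of the collapsing idea, the sketch below groups feed items by the page they link to and counts the references to each page. The `(source, link)` item shape and the URL normalisation rules are assumptions made for this example; it says nothing about how Google Reader is actually built.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url):
    """Reduce a URL to a comparable form.

    Lower-cases the scheme and host, drops the fragment and any trailing
    slash. These rules are an assumption for this sketch, not a standard.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def collapse_duplicates(items):
    """Group feed items by the page they link to.

    `items` is a list of (source, link) tuples, a hypothetical shape.
    Returns (normalised_link, reference_count) pairs, most referenced first.
    """
    counts = Counter(normalise_url(link) for _source, link in items)
    return counts.most_common()
```

The reference count returned for each link is the number a reader could surface as the "hot/popular" cue next to the collapsed item.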

I know I would really love better facilities within Google Reader for detecting duplicates within RSS. It’d remove so much noise from the information stream when you’re trying to keep an eye on what is happening within the community.

Google Analytics Benchmarking

Sunday, April 20th, 2008

Google have announced a new feature for Google Analytics named Benchmarking. The Google Analytics Benchmarking service is still in its beta phase, but it aims to allow analytics users to compare, or benchmark, their web sites against other web sites.

The benchmarking service from Google is opt-in, not default-in. If a user would like to view benchmarking data for their sites, they must first opt in to allow Google to use their own web statistics. Of interest, opting in is on a per account basis, not per web profile. As such, if you have 50 web profiles set up within your account, opting in will share the data from all of your web profiles with Google.

After a user opts into the benchmarking service, Google proceed to anonymise the user’s web statistics. What this means is that any identifiable information within the web statistics is removed and only aggregate information is held; as such it isn’t possible to spy on your competitor directly or vice versa.

At this early stage, the benchmarking data is fairly high level but provides you comparative metrics on:

  • Visits
  • Pageviews
  • Pages/visit
  • Average Time on Site
  • Bounce Rate
  • Percentage New Visits

The usefulness, and ultimately the success, of the benchmarking service is reliant on how many Google Analytics users opt in to sharing their web statistics with Google. If the greater user base don’t feel inclined to share their web statistics with Google in this manner, then the comparative nature of what they are offering is hamstrung to some degree.

Django Friendly Hosting

Wednesday, March 26th, 2008

If you’re about to purchase hosting for your Django application, everything you’ll need to make a good decision is in one place at Django Friendly.

Ryan Berg is the man behind Django Friendly and put together the site as a way to consolidate the plethora of hosting options available for Django. As with some other scripting languages such as Ruby, Python has some special hosting requirements which make it inconvenient to host within standard hosting configurations. The development of mod_wsgi for Apache is aiming to provide a simple, high performance option for hosting Python applications within shared hosting environments.

Allowing users the ability to filter web hosts by price and shared/dedicated hosting types is a great step forward. My suggested improvement for the site would be the ability for users to dynamically build a search query, for example by filtering on server location, hosting type, price range, ratings and so on. Maybe an interface similar to the custom ticket search within the popular Trac software could be used as a starting point.

Either way, it’s another fantastic looking Django specific site which has been offered up to the greater community - you’ve got to love it.

Google Account Signin With CAPTCHA

Wednesday, March 5th, 2008

Google Account login featuring CAPTCHA for additional security

Tonight I was presented with a Google login page which was different in a few ways:

  • size and shape of the control were different
  • instead of using an in page control, it took me to a completely new page
  • required additional CAPTCHA validation

I suspect this may have been triggered by logging in and out of various Google products tonight: I closed a tab without closing the browser, opened new tabs and logged in and out, so the browser might have had conflicting session information.
Does anyone know what causes this type of login prompt to be thrown up by Google?

Google Reader Duplicate Item Improvement

Sunday, January 27th, 2008

One of the features that I love about RSS is that it allows users to keep their finger on the pulse of certain topics very easily. As some people may know, I quite like the Python web framework Django and I use Google Reader to help me keep up to date with what is happening within the greater Django community. I do this by subscribing to the RSS feeds of people who I know regularly write about the product, but also by utilising online social bookmarking sites such as del.icio.us and ma.gnolia.

I recently read an article by James Holderness about detecting duplicate content within an RSS feed (via). For those not bothered with the jump, James outlines the different techniques that the top feed reading products use to detect duplicate RSS content, which range from using the id field within the RSS down to comparing title and description information.
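
To make that range of techniques concrete, here is a minimal sketch of one possible fallback chain: use the item's id when it has one, otherwise fall back to a hash of its title and description. The dictionary keys (`id`, `title`, `description`) and the choice of SHA-1 are assumptions for illustration; this is not a description of what any particular feed reader does.

```python
import hashlib


def item_key(item):
    """Return a key identifying a feed item.

    Prefers the item's id; falls back to a hash of title and description.
    `item` is a plain dict, a hypothetical shape chosen for this sketch.
    """
    item_id = (item.get("id") or "").strip()
    if item_id:
        return "id:" + item_id
    text = (item.get("title") or "").strip() + "\n" + (item.get("description") or "").strip()
    return "hash:" + hashlib.sha1(text.encode("utf-8")).hexdigest()


def drop_duplicates(items):
    """Keep the first item seen for each key, preserving feed order."""
    seen = set()
    unique = []
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

Note that a key like this only catches duplicates within a feed, or across feeds that share ids; it would not catch the same page bookmarked by different people, which is the problem described below.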

Back to the improvement, which is related to the information that James provided. When I subscribe to the social bookmarking sites, they end up providing back a huge range of content matching certain criteria. The ones I’m subscribing to at the moment for Django are:

As you can imagine, each of these services has a different but overlapping user base, and each will find common items throughout the internet and bookmark them each day. When that stream of information is received by Google Reader, it will display half a dozen copies of the same unique resource, masked by different user accounts within their bookmarking tool of choice.

A great optional feature to add to Google Reader would be the ability to detect duplicate items, whether they are sourced via the same domain or different domains.

The trick to something like this would be identifying the pattern, so as to allow Google to use an algorithm to flag it. For the sake of this concept, I think it’d be reasonable to consider items posted into social bookmarking sites and an aside or link drop in a blog to be reasonably similar.

My initial concept would involve comparing the amount of content within an item. If there are fewer than a predefined number of words and only a small number of links, then that item might be considered to be a link drop. You could apply that logic not only to social bookmarking sites but also to the standard blog, where an author might find something cool they want to link to.
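
A minimal sketch of that heuristic might look like the following. The thresholds (50 words, 2 links), the requirement that there is at least one link, and the crude regular expressions used to count words and links are all assumptions picked for the example; a real implementation would want a proper HTML parser and tuned limits.

```python
import re

MAX_WORDS = 50  # assumed threshold, not taken from any real product
MAX_LINKS = 2   # assumed threshold, not taken from any real product

TAG_RE = re.compile(r"<[^>]+>")
LINK_RE = re.compile(r"<a\s[^>]*href=", re.IGNORECASE)


def is_link_drop(html, max_words=MAX_WORDS, max_links=MAX_LINKS):
    """Guess whether an item's HTML body is a short 'link drop'.

    True when the body has fewer than `max_words` words and between
    one and `max_links` links.
    """
    link_count = len(LINK_RE.findall(html))
    text = TAG_RE.sub(" ", html)  # strip tags so only visible words are counted
    word_count = len(text.split())
    return word_count < max_words and 1 <= link_count <= max_links
```

Only items flagged this way would then be candidates for collapsing, so a long article that happens to mention the same link is left alone.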

The next thing up for consideration might be which items to remove as duplicates and which to include. News of any kind on the internet tends to reverberate throughout it quite quickly, so it’s common to find the same information posted many times. As the vibrations are felt throughout, people will tend to link back to where they found that information (as I did above). Google Reader could leverage off the minty fresh search engine index to help with this by using the number of attribution links passed around. As a quick and simple example, imagine the following scenario:

  • Site A has the unique Django content that I’m interested in
  • Sites B through Z all link to site A directly
  • Some sites from C through Z also link back to B, which is where they found that information

I don’t subscribe to site A directly, however some of the sites B through Z have been picked up by the social networks. Using the link graph throughout those sites, it’d be possible to find out which one(s) among that list are considered authoritative (based on attribution or back links) and start filtering based on that. It might then be possible to use other features of the Google Search index to do with theme, quality, trust to filter it further.
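
The scenario above can be sketched with a simple inbound link count. This is a deliberately naive stand-in for whatever a search engine really does: the `links` mapping (each site to the set of sites it links to) and both function names are hypothetical, and counting links is only one crude measure of authority.

```python
from collections import Counter


def rank_by_inbound_links(links):
    """Rank sites by how many distinct other sites link to them.

    `links` maps each site to the set of sites it links to.
    Returns (site, inbound_count) pairs, highest count first.
    """
    inbound = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self links
                inbound[target] += 1
    return inbound.most_common()


def pick_authoritative(candidates, links):
    """From the sites I actually receive items from, keep the best attributed one."""
    scores = dict(rank_by_inbound_links(links))
    return max(candidates, key=lambda site: scores.get(site, 0))
```

With sites C, D and E linking to A, and C and D also crediting B, `pick_authoritative(["B", "C", "D"], links)` would keep the item from B and the others could be filtered as duplicates.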

I think a feature like that within Google Reader would be fantastic, especially if I could apply those sorts of options on a per feed or folder basis. That way, I could group all of the common information together (Django) and have Google Reader automatically filter out the duplicates that matched the above criteria.

I’m sure the development team from Google Reader will hear my call; who knows, in a few months maybe a feature like this could work its way into the product.

Akismet Losing Its Mojo?

Monday, January 14th, 2008

I have long praised the free spam fighting service Akismet, but yesterday a horribly obvious spam comment wasn’t filtered, which is very unusual. I’ll include the comment here so people can see what I’m referring to:

Name: Armond
Site: http://groups.google.com/group/otekal/web/free-bestiality-sex-stories
Message: free bestiality sex stories…
accessories distributed at most major retailers for such
…

Automattic have never disclosed with any specificity how the internals of Akismet work as a service; however, it is more than reasonable to assume that Bayesian filtering is in their spam fighting tool belt. For those that aren’t aware, Bayesian filtering works by learning, or being told, which messages are spam and then analysing how often each word appears in spam versus non-spam messages. If the words in a given message make it more likely than some threshold to be spam, the message is considered spam.
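
For the curious, here is a toy naive Bayes filter that follows the description above. It is only a sketch of the general technique and not a claim about how Akismet works; the class name, the whitespace tokenising, the add-one smoothing and the 0.9 threshold are all assumptions made for the example.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy word-based spam filter; labels are 'spam' and 'ham' (non-spam)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one message known to be 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that `text` is spam."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Smoothed prior: how common this kind of message is overall.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            for word in text.lower().split():
                # Add-one smoothing so unseen words never give a zero probability.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            log_scores[label] = score
        # Convert the two log scores into a probability without underflow.
        highest = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - highest)
        ham = math.exp(log_scores["ham"] - highest)
        return spam / (spam + ham)

    def is_spam(self, text, threshold=0.9):
        """Flag the message when its spam probability exceeds the threshold."""
        return self.spam_probability(text) > threshold
```

After training on a handful of labelled comments, calling `is_spam()` on a new comment applies the threshold test described above. A filter like this only knows the words it has been shown, which is why a brand new spam signature can slip through.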

Given that it is a learning based system, so to speak, I find it hard to believe that the words “bestiality” and “sex” in the URL and within the body of the comment aren’t throwing up great big red flags. I’m going to put this slip up down to one of two things:

  1. I was one of the first people to receive and register that particular spam signature
  2. When the comment was submitted, the Akismet service wasn’t able to be contacted

I’m heavily leaning towards the latter, for no other reason than that there are literally hundreds of thousands of blogs on the internet, so it is highly unlikely that I was one of the first to see that particular spam signature.

Long live Akismet!

You Know You’re Popular When

Saturday, December 29th, 2007

Today my personal site was pinged by Live Business Radio. As I do as a matter of course, I checked out the Live Business Radio web site and was disappointed to find that it’s nothing more than your average run of the mill site ridden with advertising, spam and buy this crap product now.

I get pinged regularly by web sites that don’t have anything to do with me, and when I saw the site I was about to abandon it immediately. Just before I did though, I scanned over the article and noticed that I’d been featured in a list of sites with a high Google PageRank which offer links which are ‘followed’. It wouldn’t be a good filthy spammer’s site if they didn’t offer you software (for a fee) which you could use to take advantage of the followed links.

If you’re not quite sure what I’m referring to regarding the ‘followed’ remark, you can read about it on my personal site:

I should feel so honoured.

Google Alerts Getting Smarter

Monday, December 24th, 2007

Google Alerts lets the user define keyword lists and phrases; when Google finds them while crawling and indexing web sites, it sends the user a notification about that particular occurrence.

Historically, it always appeared as though the technology behind the alerting system was quite simple, literally matching the keywords and phrases that the user had nominated. Recently alerts have been generated for pages that don’t strictly meet the keyword list and phrase requirements. It seems as though Google are using all of the additional metadata about a web site and its content to infer certain pieces of information.

As an example, I was recently notified about my name being used within the post about extending the Nintendo Wii. If you view that particular item, you will not find the phrase “alistair lattimore” anywhere within it. Just to be sure, I have also ruled out my name appearing within the RSS feeds generated by the site.

Putting the tinfoil hat on for a second, there is a raft of information that Google know about me already:

  • I have a Gmail account
  • The same Gmail account is associated with Google Analytics, Google Reader, Google Webmasters, Google AdWords and Google AdSense.
  • Within Google Webmasters, I monitor my personal blog and this site.
  • Within Google Analytics, I monitor my personal site and this site.
  • Within Google Reader, I subscribe to the feed of both sites.
  • Google are a domain registrar, which means they could theoretically see that I purchased both domains.
  • I have linked in both directions between the two sites in the past.

When you start to see how all of that information is inter-linked, it becomes quite easy to see how Google can provide insightful results through their various services. Of course if you take the tinfoil hat off and look at the more standard items such as web site content, my name is listed in the title on the front page and also on the about page. Those two bits of information might have been all it took, who knows.

If the technology behind that flexibility has a high level of accuracy in determining or inferring that information, it really is an excellent service. In the above example, if Google hadn’t inferred my name as being associated with that document, I would never have found out about it via the alerting system. Granted, in this particular example it makes no difference as I know I wrote it; however, for all other content on the internet it really lifts the product’s capability.

Google Analytics & URL Rewriting Caveats

Thursday, December 13th, 2007

As the internet has matured and web sites have aged and expanded over the years, it has now become commonplace for web site owners to restructure their web sites to increase the site’s accessibility and search engine effectiveness.

During the restructuring process, less savvy web masters reorganise their web sites without any concern for the impact it might have on their search engine rankings, referrals and user experience, while more savvy web masters understand that cool URLs don’t change. That isn’t to say that the content that was originally published against that URL must remain there, just that the URL continues to exist so that anyone linking to it doesn’t receive a missing document or HTTP 404 error.

When restructuring web sites, the savvy web master mentioned earlier requires a way to make an existing URL redirect to its new URL after the restructure. The two common methods to handle the redirection are:

  • It is perfectly acceptable to use a standard HTML web page with the tracking code installed and a meta refresh to redirect a user from the old to the new. This method does have the downside that all of the redirections for the web site are scattered throughout the site.
  • Another solution is offloading the redirection into a utility such as the Apache mod_rewrite module or the equivalent ISAPI_Rewrite for IIS. Using this method allows the web master to place all of the URL redirection in one place for easy management.

Under normal conditions such as option one above, where Google Analytics is installed on every web page within a site, it’s possible for the service to collect a complete click stream for the site. Google Analytics is also capable of handling standard HTTP redirects, so long as the tracking code is installed on both the referring and destination pages.

While it is convenient to use URL rewriting, there is a caveat which reduces the amount of information that Google Analytics can collect. The redirection will happen before any content is returned to the user, which means there is no opportunity for Google Analytics tracking code to fire. This results in Google Analytics reporting zero activity against the redirecting URL.
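
To see why the tracking code never gets a chance to run, it helps to look at what a server-side redirect actually sends back. The sketch below is a tiny Python WSGI application standing in for a rewrite rule; it is not mod_rewrite or ISAPI_Rewrite configuration, and the paths, the `REDIRECTS` table and the placeholder tracking snippet are all made up for the example.

```python
# Hypothetical old-to-new URL map, kept in one place as a rewrite rule file would be.
REDIRECTS = {
    "/old-page.html": "/articles/new-page/",
}

# Stand-in for a normal page that carries the analytics tracking snippet.
TRACKED_PAGE = (
    b"<html><body>Content"
    b"<script>/* analytics tracking snippet goes here */</script>"
    b"</body></html>"
)


def app(environ, start_response):
    """Minimal WSGI application: redirect known old paths, serve a tracked page otherwise."""
    path = environ.get("PATH_INFO", "/")
    if path in REDIRECTS:
        # The response is only a status line and headers. There is no HTML body
        # for the browser to render, so no tracking script can run for this URL.
        headers = [("Location", REDIRECTS[path]), ("Content-Length", "0")]
        start_response("301 Moved Permanently", headers)
        return [b""]
    headers = [("Content-Type", "text/html"), ("Content-Length", str(len(TRACKED_PAGE)))]
    start_response("200 OK", headers)
    return [TRACKED_PAGE]
```

A request for the old path gets back an empty 301 response, and the browser moves straight on to the new URL, which is the first point at which a page containing the tracking snippet is loaded.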

WordPress Drop Technorati For Incoming Links

Wednesday, November 28th, 2007

WordPress has a feature in it which shows activity surrounding your particular blog, named “Incoming Links”. For a long time, WordPress has been using the services of blog search engine and aggregator Technorati to deliver this feature. Using Technorati was an excellent decision for quite some time, especially when blogging was still relatively new and Technorati were blazing their own trail in that space. It made even more sense when Automattic released Pingomatic, as virtually all blogging platforms sent activity notifications to that and Technorati subscribed to that stream of data.

Things started to change and the usefulness of Technorati started to fade as the big guns entered into the blog search space, namely Google. Google Blog Search was a great service on its own, using the incredible infrastructure behind Google to keep their blog search index fresh. Not being content with great, Google set out to make their Google Blog Search index exceptionally fresh as they started accepting ping notifications. Of course, as soon as that happened, Pingomatic started sending notifications into Google, which has yielded an index which is minty fresh, usually showing only minutes of delay.

With the recent release of WordPress 2.3, the WordPress team have now switched from Technorati to Google Blog Search for their “Incoming Links” feature. This single link change could have a fairly profound impact on Technorati, as with literally hundreds of thousands of blogs running WordPress, they were getting traffic for free. With the loss of the link from WordPress, coupled with the superior fire power of Google, tongues have to be wagging about the future of blog search engine Technorati.