When web-based email services like Hotmail were still quite new, the barrage of spam they received was a source of constant complaint for their users. As the internet and spam filtering techniques have evolved, I would argue that spam has become less of a problem in email because the classifiers have become more sophisticated.
In the last couple of months, either the Windows Live Hotmail spam filtering algorithms have decided to take a nap or the spammers have found a really clever way of slipping through undetected with what a human would consider easily identifiable spam.
How Do Email Spam Filters Work?
Early versions of email spam filters were very primitive, and the mere presence of a word in an email was enough to push it into the spam queue. Fortunately, those simple measures have been replaced by more robust Bayesian spam filtering, which is supported by a lot of customisation.
Bayesian spam filtering decomposes an email into tokens, normally words but sometimes other markers, and then compares the frequency of each token’s appearance in that email against a known sample to determine whether it is spam or not. That strategy allows a user to use a trigger word in their email and not have it marked as spam, because the other content in the email doesn’t push the classification score high enough.
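To make that idea concrete, here is a minimal naive Bayes sketch. It is not how Hotmail or any real provider implements its filter. The class name, the tiny training set and the tokenizer are all made up for illustration, and it assumes at least one spam and one legitimate message have been trained before scoring.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Toy Bayesian filter: hypothetical, for illustration only."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Record how often each token appears in messages with this label.
        self.token_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Start from the prior: how common this label is in the sample.
            score = math.log(self.message_counts[label] / total_messages)
            total_tokens = sum(self.token_counts[label].values())
            for token in tokenize(text):
                # Add-one (Laplace) smoothing so unseen tokens do not zero the score.
                count = self.token_counts[label][token] + 1
                score += math.log(count / (total_tokens + len(vocab)))
            log_scores[label] = score
        # Turn the two log scores into a probability between 0 and 1.
        highest = max(log_scores.values())
        weights = {k: math.exp(v - highest) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


# Made-up training sample: two spam messages and two legitimate ones.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("free pills buy now cheap offer", "spam")
spam_filter.train("lunch meeting moved to noon tomorrow", "ham")
spam_filter.train("see you at the meeting tomorrow for lunch", "ham")

print(spam_filter.spam_probability("buy cheap pills"))
print(spam_filter.spam_probability("free lunch at the meeting tomorrow"))
```

The second message contains the trigger word free, yet the rest of its tokens look like the legitimate sample, so its score stays low. That is the behaviour described above: one trigger word alone is not enough.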
Manual Email Spam Classification
Below I’ll break down the first highlighted email in the inset image above into its individual components and show you why it is easy for a human to identify but slightly more complicated for a spam filter.
The surname might be a play on words to do with the size of the sender’s penis or it could just be a randomly chosen surname. If you scan across the other sender names in the above image, you’ll notice they are all quite straightforward, like Walter Carlson or Paula Santos. What you aren’t seeing are very obviously spammy sender names like Free Porn or Barbara Big Boobs; so there has definitely been some learning taking place.
There is nothing unusual about the sender’s email address; it is commonplace these days for Hotmail email addresses to contain numbers as they have so many users. Checking across the other spam emails, all of the sender email addresses seem quite reasonable and have some correlation to the sender’s name, and none of them contain clearly identifiable spam trigger words.
Does the Hotmail spam filtering treat email from other Hotmail email addresses as less likely to be spam than an email originating from outside of their service? Could they be placing too much emphasis on the security of their signup process to weed out spammers? I know at some point Gmail had their signup process brute forced and the spammers were able to systematically overcome the CAPTCHA.
As an English speaker, you’d look at the sender’s email and think that it might be more gibberish; however, you’d be wrong. Amakye is actually a name – for a real world example, consider Amakye Dede, one of Ghana’s premier musicians.
As a trait of a high quality message, do people who forward an email on to someone else regularly forward it to multiple recipients? Given that all of the spam emails seen in the image above have a single additional recipient, I believe that must be quite common.
Following on from the possibility of Hotmail spam filtering providing brownie points for the sender coming from a Hotmail email address, it surely isn’t a coincidence that each of the emails above also has another Hotmail user as the additional recipient.
Working from left to right, you can see the message has supposedly been forwarded on to me, signified by the classic FW: prefix in the subject. Scanning through the other highlighted emails, they all contain the same FW: prefix; maybe a forwarded email gains brownie points with the spam classifiers as being less likely to be spam.
Next up is the actual subject itself. At a quick glance it looks like gibberish, until your eye catches actual words in the subject and then rescans it seeking out other real words amongst the nonsense. Suddenly you’re left with a subject more like:
FW: Mika Mizuna Takes One Hard Cock After Another In This Gangbang
I can only assume that the word cock isn’t being picked up because it doesn’t have a space or recognised word separator on either side of it. I also think that the different capital letters used as word separators must have something to do with it as well – as an example, if the subject used a full stop, hyphen or pipe character it’d be crystal clear.
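Here is a small sketch of why that would matter, assuming the filter looks for whole words using standard word boundaries. I have no idea whether Hotmail actually tokenises this way, and both the trigger word hard and the sample subjects are hypothetical stand-ins.

```python
import re

# Illustrative stand-in for a trigger word; not taken from any real filter list.
TRIGGER = "hard"


def contains_trigger(subject, trigger=TRIGGER):
    """Return True if the trigger appears as a whole word in the subject."""
    # \b only matches between a word character and a non-word character,
    # so a trigger wedged between other letters is never seen as a word.
    pattern = r"\b" + re.escape(trigger) + r"\b"
    return re.search(pattern, subject, flags=re.IGNORECASE) is not None


# Hypothetical subjects: the same words joined by different separators.
subjects = [
    "TakesXOneZHardQCase",   # random capital letters instead of spaces
    "Takes.One.Hard.Case",   # full stops
    "Takes-One-Hard-Case",   # hyphens
    "Takes|One|Hard|Case",   # pipe characters
]

for subject in subjects:
    print(subject, contains_trigger(subject))
```

Only the first subject slips past the check, because a capital letter is still a letter and so never creates a word boundary.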
VCery SWexy YounUg whiCte ChicVk wiGth a NiQce AsRs BoQoty gPets F.Lu.c.ked bLy
by now."Corey clicked at him in disgust. The others of their pod had long gone
The message content itself is less obfuscated than the subject and far easier to read in a single glance. I think the important thing about the message content is that no trigger words such as ass, arse, booty, young or fucked are present and correctly spelled.
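As a rough illustration, the sketch below compares an exact trigger word lookup with a check that tries deleting one character from each token. The function names and the two-word trigger list are my own inventions, and I am not suggesting any real filter works like this; it just shows how little effort the obfuscation needs to beat an exact match.

```python
# Small illustrative trigger list drawn from the words named above.
TRIGGER_WORDS = {"young", "booty"}


def exact_hits(message, triggers=TRIGGER_WORDS):
    """Tokens that match a trigger word exactly, ignoring case."""
    return [token for token in message.split() if token.lower() in triggers]


def one_char_removed_hits(message, triggers=TRIGGER_WORDS):
    """Tokens that become a trigger word once a single character is deleted."""
    hits = []
    for token in message.split():
        lowered = token.lower()
        for i in range(len(lowered)):
            candidate = lowered[:i] + lowered[i + 1:]
            if candidate in triggers:
                hits.append((token, candidate))
                break
    return hits


# A few tokens taken from the message content shown above.
message = "VCery YounUg whiCte wiGth a NiQce BoQoty"

print(exact_hits(message))
print(one_char_removed_hits(message))
```

The exact lookup finds nothing, while the deletion check recovers both trigger words. The catch is that a check this loose would also flag legitimate words that happen to sit one letter away from a trigger, which is the false positive problem all over again.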
As for the domain name, the spammers are no longer using domain names with words that’d be clearly identified as spam. In this particular instance, I am surprised that the spam classifier isn’t smart enough to identify that the domain itself isn’t in a common format; when was the last time that you visited a domain that had 8 numbers in it?
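A heuristic along those lines could be sketched as follows. The threshold of four digits is an arbitrary guess on my part, the domains are made up using the reserved .example ending, and the URL needs a scheme such as http:// for the host name to be parsed.

```python
from urllib.parse import urlparse


def digit_count_in_host(url):
    """Count the digits in the host name part of a URL."""
    host = urlparse(url).hostname or ""
    return sum(ch.isdigit() for ch in host)


def looks_suspicious(url, max_digits=4):
    """Flag URLs whose host name has more digits than max_digits (arbitrary threshold)."""
    return digit_count_in_host(url) > max_digits


# Hypothetical URLs for illustration only.
print(looks_suspicious("http://a12345678b.example/"))
print(looks_suspicious("http://shop.example/page?id=12345678"))
```

Only the digits in the host name are counted, so a long number in the path or query string does not trip it. A link to a raw IP address would trip it, so on its own this is better suited to nudging a spam score than to blocking a message.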
I suspect that the last sentence in the message is nothing but a token sentence so the entire email doesn’t look completely bogus. I find it interesting that they’ve mixed in a few words which would seldom be seen together such as clicked, disgust and pod – all of which would typically never be seen in a spam email.
As you can see from the above, it is simple for a human to spot spam; however, it isn’t that simple when you need to write software to identify spam while making sure not to falsely identify real email.
Every day that these spam email messages keep landing in my inbox, I keep group selecting them and reporting them to Hotmail. I wait patiently hoping that someone will answer my calls to try and tackle whatever issues their spam classifiers are having identifying those spam emails.