13 January 2007

All Your Bayes Are Belong To Us

The war went badly for us in 2006. Here I'm referring to the global war of all computer users against the vile Spammers. Our inboxes' best defense against villainy, the Bayesian filter, was soundly defeated by the spamming hordes. First they started adding "found text" to their illicit emails -- text random enough to confuse our filters into thinking the mail is probably legit. But by the summertime I started seeing their latest bit of simple devilry -- the image spam. Instead of placing their pharmaceutical ads or "pump and dump" schemes in text, they use a jpeg. Naturally, this completely bypasses a text-based spam filter, which has no words left to score.

Apparently anti-spam companies are trying to adapt by developing apps that use OCR to identify image spam (http://lwn.net/Articles/196704/). I admit to being astonished by this. It seems an utter waste of processing power to bring such heavy weaponry to bear. Far simpler, and less CPU-intensive, methods should suffice. The strategy depends on how the image is sent. One article I read said that the images are sent as a link and automatically displayed via most mail clients' HTML capability. Another article said that the actual jpegs are sent as, basically, an attachment (and again, automatically displayed by the mail client).

If the images are sent as a link, that means that the spammers have servers somewhere, or at least some Web repository (perhaps on a free personal site) where the image is stored. Those servers/sites then become targets we can shut down. More immediately, the links themselves are text, which Bayesian filters should learn to reject.

If the images are sent directly with the emails, then we have an additional problem. Since images are much larger than text, these image spams are at least 10X the size of text spams. This means that 10X the Internet bandwidth is taken up sending them everywhere. Such spam poses problems not just for the recipient but, in aggregate, for all users of the public Internet. As for stopping this spam from reaching our inbox, again -- we don't need to waste cycles doing OCR. Mail clients can simply checksum all attachments. That checksum can then be used as a word for the Bayesian filter. This is going to be most effective at the mail server level -- if the same checksum shows up on emails going to a lot of users, you can be pretty sure of its spamminess. To defeat this simple strategy, the spammers would have to steganographically alter (slightly) each image so that every copy checksums differently. While not difficult, at least it puts the CPU burden on the spammers, not on us.
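
To make "checksum all attachments" concrete, here is a minimal sketch in Python. It assumes CRC-32 as the checksum and an "att:" prefix for the resulting pseudo-word; both are my choices for illustration, as is the function name attachment_tokens. It is not a description of what any real mail client does.

```python
import zlib
from email.message import EmailMessage

def attachment_tokens(msg):
    """Return one pseudo-word per non-text MIME part (hypothetical helper)."""
    tokens = []
    for part in msg.walk():
        # Skip multipart containers and ordinary text bodies.
        if part.is_multipart() or part.get_content_maintype() == "text":
            continue
        payload = part.get_payload(decode=True)  # raw bytes after transfer decoding
        if payload:
            # CRC-32 is an assumption; any checksum rendered as text would do.
            tokens.append("att:" + format(zlib.crc32(payload), "08x"))
    return tokens

# A toy message with fake "jpeg" bytes so the sketch runs on its own.
msg = EmailMessage()
msg["Subject"] = "hot stock tip"
msg.set_content("see picture")
msg.add_attachment(b"\xff\xd8fake jpeg bytes", maintype="image",
                   subtype="jpeg", filename="ad.jpg")

print(attachment_tokens(msg))
```

The tokens would simply be appended to the words the filter already extracts from the message text.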

I scanned the "bulk" folder of my Yahoo mail account. In the last 24 hours I received 12 spams. Two of them were simple text (one was so-called "empty spam" -- just words without a spam payload). The other ten were image spam with the actual images sent. The image spams were about 30K each. So that's 300 kilobytes of spam waste that had to traverse the Internet to get to me. Just some back-of-the-envelope work: if there are on the order of a billion email inboxes worldwide, and each receives as much spam daily as I do, then we are talking about 300 terabytes of image spam devouring our bandwidth each day. Nasty.
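
For anyone who wants to check the envelope, the arithmetic above works out as follows. The inputs are the figures from my one-day sample plus the billion-inbox guess; treating 1 KB as 1000 bytes is a simplifying assumption.

```python
image_spams_per_day = 10        # image spams in the one-day sample
bytes_per_image_spam = 30_000   # "about 30K each", taking 1 KB = 1000 bytes
inboxes = 1_000_000_000         # order-of-magnitude guess, not a measured figure

per_inbox = image_spams_per_day * bytes_per_image_spam
total = per_inbox * inboxes

print(per_inbox / 1e3, "KB per inbox per day")
print(total / 1e12, "TB worldwide per day")
```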


The market is Open.


Sam said...


Great post. I had a sixth sense about the switch to image spam, and this nicely spells it out. I have a comment (well, this is a comment by definition, right, so it must be a meta-comment), and a question:

1. Rad use of steganographically.
2. Can you elaborate on this point:
'Mail clients can simply checksum all attachments. That checksum can then be used as a word for the Bayesian filter'? I just don't fully follow you here: what does it mean to 'checksum an attachment'?

Periapse said...

A checksum is a simple operation that takes an input file (any file type -- it just uses the raw bits) and returns a number. The number is essentially a fingerprint for the file. The same file always gives the same number. The fingerprint is not strictly unique, but how rarely two files collide is set by the size of the number. For example, a 32-bit checksum has about 4 billion possible values, so the odds of two *different* files giving the same number are on the order of one in 4 billion.
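
A quick illustration, using CRC-32 from Python's zlib module as one example of a 32-bit checksum (my pick for the sketch; any checksum behaves the same way for this purpose). The helper name checksum_hex is made up.

```python
import zlib

def checksum_hex(data: bytes) -> str:
    """Render the 32-bit CRC of the raw bytes as 8 hex digits."""
    return format(zlib.crc32(data), "08x")

a = checksum_hex(b"same bytes")
b = checksum_hex(b"same bytes")   # identical input, identical fingerprint
c = checksum_hex(b"same bytez")   # one byte changed, different fingerprint

print(a == b, a == c)
print(2 ** 32, "possible 32-bit values")
```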

My observation is that the checksum number can be represented as a text string, say in hexadecimal, e.g. "1a3fd07c". Bayesian filters don't care what text they use to filter, so we could train a filter to recognize specific numbers just as it does specific words. Thus the presence of an attachment checksum "1a3fd07c" raises the probability of it being spam just as the word "viagra" does.
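
Here is a toy version of that idea. TinyBayes is a deliberately simplified naive Bayes scorer written for this comment, not how any particular spam filter is implemented; the "att:" prefix and the training messages are invented for the example.

```python
import math
from collections import Counter

class TinyBayes:
    """Toy naive Bayes over token presence (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        self.counts[label].update(set(tokens))
        self.docs[label] += 1

    def spam_probability(self, tokens):
        # Sum log-likelihood ratios; add-one smoothing avoids zero counts.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for t in set(tokens):
            p_spam = (self.counts["spam"][t] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][t] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))

f = TinyBayes()
f.train(["cheap", "meds", "att:1a3fd07c"], "spam")
f.train(["hot", "stock", "att:1a3fd07c"], "spam")
f.train(["lunch", "friday"], "ham")
f.train(["meeting", "notes", "att:77aa0013"], "ham")

with_image = f.spam_probability(["hello", "att:1a3fd07c"])
without_image = f.spam_probability(["hello"])
print(with_image, without_image)
```

The filter never needs to know that "att:1a3fd07c" came from an image; it is just another token whose spam count went up.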

While this may not be useful for an individual inbox (unless you continually receive the same spam over and over), it could be applied on email servers. Once a particular checksum is identified as corresponding to an image spam, all emails containing an image with that checksum are likely spam as well, even if the text contents and header info are different.
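
A rough sketch of what the server side might look like. ChecksumTracker, its threshold of three distinct recipients, and the example addresses are all hypothetical; a real server would need expiry, whitelisting of legitimately popular attachments, and so on.

```python
class ChecksumTracker:
    """Flag an attachment checksum once enough distinct users have received it."""

    def __init__(self, threshold=3):
        self.threshold = threshold   # assumed cutoff, chosen for illustration
        self.recipients = {}         # checksum -> set of recipients seen so far
        self.flagged = set()         # checksums already judged spammy

    def observe(self, checksum, recipient):
        seen = self.recipients.setdefault(checksum, set())
        seen.add(recipient)
        if len(seen) >= self.threshold:
            self.flagged.add(checksum)
        # Once flagged, every later message with this checksum is suspect,
        # whatever its text or headers say.
        return checksum in self.flagged

tracker = ChecksumTracker(threshold=3)
users = ["ann@example.com", "bob@example.com", "ann@example.com",
         "cy@example.com", "dee@example.com"]
results = [tracker.observe("1a3fd07c", user) for user in users]
print(results)
```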

Sam said...

Sorry to be late in saying this, but THANKS - I get it now!