phroggy.com: Weblog Archive: Lovely Spam, Wonderful Spam

Lovely Spam, Wonderful Spam (Tuesday, November 21st, 2006)

I've been working on fighting spam lately. I've had some pretty serious measures in place, but the recent increase in spam levels has been driving me nuts, so I've been devoting more time to actively fighting it.

Most people don't have any idea how serious the problem is, because all they see is a few messages per day that get past the filters. It only takes a few seconds to delete them, so what's the big deal? Well, those of us who actually administer mail servers see quite a different picture. The mail servers I currently maintain are very small (only a few dozen users total), so keep in mind that these numbers are very small compared to what most mail servers see.

My first line of defense against spam is a handful of blacklists of IP addresses known to be sources of spam. Any connection attempt from an IP address on a blacklist is immediately denied. I don't know anything about the message they would have sent, and often they reconnect to try again, so the number of blocked connections is probably a lot higher than the number of spam messages that would have been received if they hadn't been blocked. However, the total so far in the month of November is over 40,000 blocked connections.
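For the curious, here is a minimal sketch of how a blacklist check can work, assuming a DNS-based blacklist (a common arrangement, though not necessarily the only kind in use here). The connecting IP's octets are reversed and looked up under the blacklist's zone; if the name resolves, the address is listed. The zone name `dnsbl.example.org` and the function names are made up for illustration.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the blacklist zone has an A record for this IP.

    `resolve` is injectable so the logic can be tried without network access.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # The name does not resolve, so the address is not on this list.
        return False


if __name__ == "__main__":
    # Hypothetical zone name; substitute a real blacklist zone to try it.
    print(dnsbl_query_name("192.0.2.1", "dnsbl.example.org"))
```

A mail server would run a check like this when the connection arrives and refuse it before any message data is sent, which is why nothing is known about the blocked messages.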

The second line of defense is some custom scripts I wrote that block messages based on what's called the message envelope information. This is roughly equivalent to looking at something you get in the mail and throwing it out because you can tell it's another credit card offer without even opening it, just because of where it came from. I reject messages based on the envelope sender and recipient, as well as the DNS hostname of the connecting server and the greeting (HELO/EHLO line) it sent. This method has blocked about 5,300 connections so far this month, but again, this could include repeated attempts, so it's not necessarily meaningful.
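As a rough illustration of envelope-based rejection, here is a sketch of the kind of check such a script might make. It is not the actual script: the specific rules, the `Envelope` type, and every hostname and address below are hypothetical, chosen only to show decisions being made from the four pieces of information mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Envelope:
    client_hostname: str  # reverse-DNS name of the connecting server, "" if none
    helo: str             # name given in the HELO/EHLO greeting
    mail_from: str        # envelope sender
    rcpt_to: str          # envelope recipient


# Hypothetical rule data, for illustration only.
OWN_HOSTNAMES = {"mail.example.org"}
BLOCKED_SENDER_DOMAINS = {"spam.example"}
VALID_RECIPIENTS = {"alice@example.org"}


def reject_reason(env: Envelope) -> Optional[str]:
    """Return a reason to reject the message, or None to accept it."""
    if not env.client_hostname:
        return "no reverse DNS"
    if env.helo.lower() in OWN_HOSTNAMES:
        return "HELO claims to be us"
    if "." not in env.helo:
        return "HELO not fully qualified"
    sender_domain = env.mail_from.rpartition("@")[2].lower()
    if sender_domain in BLOCKED_SENDER_DOMAINS:
        return "blocked sender domain"
    if env.rcpt_to.lower() not in VALID_RECIPIENTS:
        return "unknown recipient"
    return None
```

Everything here is decided before the message body is transferred, which is the point of the unopened-envelope comparison.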

Each message is scanned for viruses by ClamAV, which also identifies some phishing scams. These are moved to a folder I have access to, so users never see them. Also, suspicious-looking attachments are removed and replaced with a warning message.

But then we get to the interesting part. SpamAssassin analyzes the content of the message to determine whether it matches a whole slew of spam-like characteristics. Some of the rules are included with SpamAssassin; others come from the SpamAssassin Rules Emporium and are updated frequently. Image-based spam is analyzed by the FuzzyOcr plugin, which processes each frame of an animation, applies various color filters, runs it through optical character recognition software, and compares the result to a word list. Several custom rules I've written recently look for specific spam I've received. All these things add points to a score, and if the score is above 5, the message gets moved to a quarantine folder for the user to review. If the score is above 15, we assume it's definitely spam, and the message is moved into a system-wide spam folder that users never see.
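The scoring idea can be sketched in a few lines. This is a toy model, not SpamAssassin itself: the three rules, their names, and their point values are invented, and treating a score of exactly 5 or 15 as over the line is an assumption. Only the two thresholds come from the setup described above.

```python
import re

QUARANTINE_THRESHOLD = 5.0   # at or above: per-user quarantine folder
SYSTEM_SPAM_THRESHOLD = 15.0  # at or above: system-wide spam folder

# Hypothetical rules: (name, pattern, points added when the pattern matches).
RULES = [
    ("HYPOTHETICAL_PILLS", re.compile(r"cheap\s+pills", re.I), 6.0),
    ("HYPOTHETICAL_STOCK_TIP", re.compile(r"hot\s+stock\s+tip", re.I), 4.5),
    ("HYPOTHETICAL_ACT_NOW", re.compile(r"act\s+now", re.I), 5.0),
]


def score_message(text: str) -> float:
    """Add up the points of every rule whose pattern appears in the text."""
    return sum(points for _name, pattern, points in RULES if pattern.search(text))


def route(score: float) -> str:
    """Pick a destination folder from the total score."""
    if score >= SYSTEM_SPAM_THRESHOLD:  # boundary handling is an assumption
        return "system_spam"
    if score >= QUARANTINE_THRESHOLD:
        return "quarantine"
    return "inbox"


if __name__ == "__main__":
    for text in ("Lunch tomorrow?", "Cheap pills here",
                 "Hot stock tip: act now for cheap pills"):
        print(route(score_message(text)), "<-", text)
```

No single rule has to be decisive; a message lands in the system-wide folder only when several independent signs of spam pile up.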

So how many messages go into this system-wide spam folder? Over 4,000 so far this month scored 15 or higher. Add to that about 1,200 spams scored under 15 that I've received myself, plus whatever other users have been getting, and you're looking at spam coming in about every 5 minutes on average, 24 hours a day, 7 days a week.
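The arithmetic behind that estimate, as a quick check. The message counts are the ones given above; the 21 elapsed days is an assumption based on the date of this post, and other users' spam is left out because I don't have a count for it.

```python
# Counts from the paragraph above ("over 4,000" is taken as a floor of 4000).
system_wide_spam = 4000   # scored 15 or higher
personal_spam = 1200      # scored under 15, one mailbox only
days_elapsed = 21         # assumption: November 1st through the 21st

minutes_elapsed = days_elapsed * 24 * 60
minutes_per_spam = minutes_elapsed / (system_wide_spam + personal_spam)

print(f"One spam roughly every {minutes_per_spam:.1f} minutes")
```

That works out to about one every 5.8 minutes from these two counts alone, so adding the rest of the users' spam brings it to roughly one every 5 minutes.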

The frustrating thing is that the spammers have access to the same filtering software that I'm using. They can check to see whether the filters will catch their message before they send it, and then tweak it until it passes. One spammer has recently been using very specific patterns that I can easily match, but he keeps changing patterns every couple of days, so I have to keep updating my rules just as often. The general-case rules aren't working, so I'm basically looking for specific subject lines.

It used to be that a 486 with 32 MB of RAM was perfectly adequate for handling SMTP (as long as that's all it was doing). Now, because of all this advanced spam filtering, a Pentium 4 with 1.5 GB of RAM (or more) is more reasonable. And after all of that, we still get spam in our inboxes. I wish Congress would start paying attention; they're the only ones with the power to fix this mess, by earmarking funding for enforcement. I'm not holding my breath.