The YMB Antispam Project

After reading Paul Graham's article on fighting spam, I became interested in trying out his techniques. I cobbled together some filter programs in Perl, since Perl is made for just this kind of text processing. The suite consists of two scripts: one generates the table of per-word spam probabilities, and one classifies a mail as spam or not.

My first set of scripts came together in a couple of hours and depended on the Storable Perl module; I figured it might be faster than parsing a text representation every time the program ran. The generator parses all your saved mail, builds a hash of every word and its spam probability, and Storable-izes the hash into a file. It seemed to work well enough this way. The only problem was that Storable, not being in the Perl core, wasn't available everywhere.
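
To make the "hash of words and probabilities" concrete, here is a minimal sketch of the per-word calculation as described in Paul Graham's article. It is written in Python purely for illustration and is not the buildfilter.pl code; the tokenizer, the function names, and the choice to follow Graham's published constants (good counts doubled, a minimum of 5 occurrences, probabilities clamped to 0.01..0.99) are all assumptions about how such a generator might work.

```python
import re
from collections import Counter

# Assumption: tokens are runs of letters, digits, dashes, apostrophes and
# dollar signs, lower-cased, with all-digit tokens dropped (per Graham's article).
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    return [t for t in TOKEN.findall(text.lower()) if not t.isdigit()]


def word_probabilities(good_msgs, spam_msgs):
    """Map each sufficiently common word to its estimated spam probability."""
    good = Counter(t for m in good_msgs for t in tokenize(m))
    bad = Counter(t for m in spam_msgs for t in tokenize(m))
    ngood = max(len(good_msgs), 1)
    nbad = max(len(spam_msgs), 1)

    table = {}
    for word in set(good) | set(bad):
        g = 2 * good[word]          # good counts are doubled to bias against false positives
        b = bad[word]
        if g + b < 5:               # too rare to say anything useful
            continue
        pg = min(1.0, g / ngood)
        pb = min(1.0, b / nbad)
        table[word] = max(0.01, min(0.99, pb / (pg + pb)))
    return table
```

Words that only ever show up in spam end up at 0.99, words that only show up in good mail end up at 0.01, and everything else lands somewhere in between.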

Since Storable wasn't necessarily supported everywhere, I figured I should go with something else that was nice and fast and was supported everywhere: DB_File. I'm pretty sure it's in the Perl core, so everybody'll have it. And hey, another bonus: the score files are tiny compared to either the Storable or plain text versions. So the new versions now use DB_File, though I'm still testing them.

Turns out that DB_File isn't part of the Perl core installation either, but hey, that's the way things go. My mailserver has DB_File, so I'm good to go.
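
The appeal of an on-disk hash is that the classifier can look up just the words in one message without loading or parsing the whole table. A rough sketch of that idea follows, using Python's standard `dbm` module as a stand-in for DB_File. The function names and the file layout are hypothetical, and the 0.4 default for never-seen words is taken from Graham's article rather than from these scripts.

```python
import dbm


def save_table(path, table):
    """Write a {word: probability} dict into a fresh on-disk key/value file."""
    with dbm.open(path, "n") as db:          # "n" always starts a new, empty database
        for word, prob in table.items():
            db[word.encode("utf-8")] = repr(prob).encode("ascii")


def lookup(path, word, default=0.4):
    """Fetch one word's probability; unseen words get the default."""
    with dbm.open(path, "r") as db:
        raw = db.get(word.encode("utf-8"))
    return float(raw.decode("ascii")) if raw is not None else default
```

A real filter would keep the database open for the whole message instead of reopening it per word; the per-call open here just keeps the sketch short.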

NOTE! The runfilter script has changed. These instructions do not apply to version 3 or earlier of runfilter; please get the newer version before following them.

Steps to getting this thing to work:

  1. Create a directory called $HOME/.spamfilter
  2. Copy the appropriate scripts somewhere you can access; I suggest $HOME/bin
  3. The probability-table building script expects to find your mail folders in $HOME/mail. If that won't work for you, there's a config option near the top of the file which you can change.
  4. The probability-table building script also expects your spam folder to be called, inventively enough, 'spam'. Again, if that's not what you're using, change it at the top of the script.
  5. Run the buildfilter.pl script; it'll generate some output, just to let you know what's going on. There will be some (possibly quite large, depending on how much mail you're using to train your filters) files in the $HOME/.spamfilter directory you created above.
  6. Install the procmail recipe below in your $HOME/.procmailrc, preferably as the first recipe (if you filter the spam first, your subsequent recipes will have fewer mails to process). If you're using something other than mbox format, or something other than $HOME/mail as your mail directory, you will have to tweak the recipe, and possibly use some other programs to save the output.
  7. Keep an eye on the maybe-spam and spam folders, since every message the filter thinks might be spam will land there. Move the maybe-spams to your spam folder once you've verified their spamminess. Currently I don't know of a way to make this work with formats other than mbox, but surely there's a way.
  8. Re-run the buildfilter.pl script periodically (daily, weekly, whatever), so it can learn from the new spam you've collected.
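
Steps 3 and 4 boil down to "every mbox file in the mail directory is good mail, except the one named spam". As a rough illustration of that convention (in Python, not the actual buildfilter.pl logic), the sketch below collects the two training sets. The function name is made up, and skipping dotfiles and the unverified maybe-spam folder is an assumption about what a sensible builder would do.

```python
import mailbox
from pathlib import Path


def load_corpora(mail_dir, spam_folder="spam", skip=("maybe-spam",)):
    """Return (good, spam) lists of message texts from the mbox files in mail_dir."""
    good, spam = [], []
    for path in sorted(Path(mail_dir).iterdir()):
        # Assumption: ignore subdirectories, dotfiles, and unverified folders.
        if not path.is_file() or path.name.startswith(".") or path.name in skip:
            continue
        box = mailbox.mbox(str(path), create=False)
        target = spam if path.name == spam_folder else good
        for msg in box:
            target.append(msg.as_string())
        box.close()
    return good, spam
```

Calling `load_corpora(Path.home() / "mail")` would match the default layout described in the steps.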

The most important part of the whole process is having the filter run whenever a new mail arrives, and procmail is the answer here. It took me some time with the Procmail documentation project to get my procmail recipe written, but if you can believe it, I actually got it right on the first try. You can remove the LOGFILE, LOGABSTRACT, and VERBOSE lines if you don't want the log file. Here's the recipe:

SHELL=/bin/sh
LOGFILE=$HOME/pm.log
LOGABSTRACT="all"
VERBOSE="on"

:0 fW : maybe-spam
| $HOME/bin/runfilter.pl

The only downside to this filter is that you will have to save all the spam you get (in fact, you'll need to save a couple of weeks' worth before you start, to have a good working set of bad emails to train on). I'm still doing some testing, and I've got a few other tweaks I'd like to try, but I'm pretty happy with its performance for now.

My TODO list:

Version 4 has arrived. It adds support for using the Mail::Box suite in the filter builder. I'm trying to pare down the memory usage of this thing (my mail causes this program to use about 355MB of core). Version 4 of the runfilter is now here too, and it adds multiple levels of filtering: if a mail scores 99%+, it's saved straight to the spam folder; if it scores 90%+, it's saved to maybe-spam; and if it scores less than that, it's just spit back out. It also adds an X-Spamfilter-Score header line to every mail, so we can tell how spammy the filter thinks a given mail is.
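
For the curious, the multi-level decision can be sketched roughly like this. This is a Python illustration, not the runfilter.pl source: the 99% and 90% cutoffs and the header name come from the description above, while the scoring formula (combining the 15 most "interesting" word probabilities, with 0.4 for unknown words) is taken from Graham's article, and the header's number format and all function names are assumptions.

```python
import email


def combined_probability(words, table, unknown=0.4, top_n=15):
    """Combine the most extreme per-word probabilities into one spam score."""
    probs = [table.get(w, unknown) for w in set(words)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)   # most interesting first
    prod = inv = 1.0
    for p in probs[:top_n]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)


def route(score):
    """Pick a destination using the two thresholds described in the text."""
    if score >= 0.99:
        return "spam"
    if score >= 0.90:
        return "maybe-spam"
    return "pass"            # spit the mail back out for normal delivery


def tag(raw_message, score):
    """Return the message with an X-Spamfilter-Score header added."""
    msg = email.message_from_string(raw_message)
    msg["X-Spamfilter-Score"] = "%.4f" % score   # assumed format
    return msg.as_string()
```

Because the score header goes on every mail, even the ones that pass, you can eyeball how close a legitimate message came to being filed as spam.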

Older files:

Have fun and let me know if you find these useful!

ymb.net