
After reading Paul Graham's article on fighting spam, I became interested in trying out his techniques. I cobbled together some filter programs in perl, since perl is made for just this kind of text processing. The suite consists of two scripts: one generates the table of per-word spam probabilities, and one actually classifies a mail as spam or not.
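For a rough idea of what the two scripts have to compute, here is a minimal sketch of the arithmetic as Graham's essay describes it. It is not the actual code from the scripts: the sub names build_table and spam_score, the shape of the input hashes, and the assumption that both corpora are non-empty are made up for illustration. The constants (good counts doubled, words seen fewer than 5 times skipped, probabilities clamped to 0.01..0.99, 0.4 for unknown words, the 15 most interesting words) come from the essay.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Generator side: turn per-word counts into a word => spam-probability table.
# $good and $bad map each word to its count in good mail and in spam;
# $ngood and $nbad are the number of messages in each pile (assumed non-zero).
sub build_table {
    my ($good, $bad, $ngood, $nbad) = @_;
    my %seen = map { $_ => 1 } keys %$good, keys %$bad;
    my %prob;
    for my $word (keys %seen) {
        my $gc = 2 * ($good->{$word} || 0);   # good counts are doubled to bias against false positives
        my $bc = $bad->{$word} || 0;
        next if $gc + $bc < 5;                # too rare to say anything useful
        my $gr = min(1, $gc / $ngood);
        my $br = min(1, $bc / $nbad);
        $prob{$word} = max(0.01, min(0.99, $br / ($gr + $br)));
    }
    return \%prob;
}

# Classifier side: combine the probabilities of the most interesting words.
sub spam_score {
    my ($prob, @words) = @_;
    my %uniq = map { $_ => 1 } @words;
    # Words missing from the table get a mildly innocent 0.4.
    my @p = map { exists $prob->{$_} ? $prob->{$_} : 0.4 } keys %uniq;
    # "Interesting" means far from the neutral 0.5; keep the top 15.
    @p = sort { abs($b - 0.5) <=> abs($a - 0.5) } @p;
    splice(@p, 15) if @p > 15;
    my ($spam, $ham) = (1, 1);
    for my $p (@p) {
        $spam *= $p;
        $ham  *= 1 - $p;
    }
    return $spam / ($spam + $ham);
}
```

A mail whose words are all unknown scores 0.4, and a mail with no words at all comes out at a neutral 0.5.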
My first set of scripts came together in a couple of hours, and depended on the Storable perl module; I figured it might be faster than parsing a text representation every time the program ran. The generator parses all your saved mail, builds a hash of every word and its spam probability, and Storable-izes it into a file. It seemed to work well enough this way. The only problem was that Storable, not being in the perl core, wasn't available everywhere.
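In case the Storable step is unfamiliar, this is roughly what saving and reloading such a hash looks like. It is a sketch under assumptions, not the original generator: save_table and load_table are hypothetical names, and the file name is whatever you pass in. The store and retrieve functions themselves are Storable's documented interface.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Write the whole word => probability hash to disk in Storable's binary format.
sub save_table {
    my ($prob, $file) = @_;
    store($prob, $file) or die "Cannot store table in $file: $!";
    return;
}

# Read the whole table back; retrieve() returns a reference to a fresh hash.
sub load_table {
    my ($file) = @_;
    my $prob = retrieve($file) or die "Cannot retrieve table from $file: $!";
    return $prob;
}
```

Note that retrieve reads the entire table into memory on every run, even though a single mail only needs a handful of its words.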
Since Storable wasn't necessarily supported everywhere, I figured I should probably go with something else that was nice and fast and was supported everywhere: DB_File. I'm pretty sure it's in the perl core, so everybody'll have it. And hey, another bonus, the score files are tiny compared to either the Storable or plain text versions. So the new versions now use DB_File, though I'm still testing it.
Turns out that DB_File isn't part of the perl core installation, but hey, that's the way things go. My mailserver has DB_File, so I'm good to go.
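For comparison, a DB_File-backed table can be sketched like this. Again this is an illustration rather than the code in the scripts: open_table is a hypothetical name and the file name is left to the caller. The tie call follows the form shown in the DB_File documentation.

```perl
use strict;
use warnings;
use Fcntl;      # for O_RDWR and O_CREAT
use DB_File;

# Tie a hash to an on-disk Berkeley DB hash file, creating it if needed.
# Reads and writes on the returned hash go to the file, so a lookup
# does not require loading the whole table first.
sub open_table {
    my ($file) = @_;
    my %table;
    tie %table, 'DB_File', $file, O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot open $file: $!";
    return \%table;
}
```

The generator would fill the hash with ordinary assignments such as $t->{$word} = $p, and either script should call untie %$t when done so everything is flushed to disk. Values come back as strings, which perl converts to numbers as needed.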
NOTE! The runfilter script has changed. If you're using version 3 or earlier of the runfilter, the instructions below do not apply to it. Please get the newer version before you try to use these directions.
Steps to getting this thing to work:
The most important part of the whole process is having it run on every mail as it arrives, and procmail is the answer here. It took me some time with the Procmail documentation project to get my procmail recipe written, but if you can believe it, I actually got it right on the first try. You can remove the LOGFILE, LOGABSTRACT, and VERBOSE lines if you don't want the log file. Here's the recipe:

    SHELL=/bin/sh
    LOGFILE=$HOME/pm.log
    LOGABSTRACT="all"
    VERBOSE="on"

    :0 fW : maybe-spam
    | $HOME/bin/runfilter.pl
The only downside to this filter is that you will have to save all the spam you get (actually you'll need to save a couple of weeks' worth before you start, to have a good working set of bad emails to categorize). I'm still doing some testing, and I've got a few other tweaks I'd like to try, but I'm pretty happy with its performance for now.
My TODO list:
Version 4 has arrived. It adds support for using the Mail::Box suite in the filter builder. I'm trying to pare down the memory usage of this thing (my mail causes this program to use about 355MB of core). Version 4 of the runfilter is now here too, and it adds multiple levels of filtering. If the mail scores 99% or higher, it's saved straight to the spam folder; if it scores 90% or higher, it's saved to maybe-spam; and if it scores less than that, it's just spit back out. We also add an X-Spamfilter-Score header line to all our mails now, so we can tell how spammy the filter thinks a given mail is.
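The three levels and the score header boil down to something like the following sketch. It is not lifted from runfilter: route_for_score and add_score_header are hypothetical names, the score is assumed to be a number between 0 and 1, and the four-decimal format of the header value is a guess, since the only thing stated above is the header's name.

```perl
use strict;
use warnings;

# Decide where a mail goes, given its spam score (0..1).
sub route_for_score {
    my ($score) = @_;
    return 'spam'       if $score >= 0.99;   # 99%+: straight to the spam folder
    return 'maybe-spam' if $score >= 0.90;   # 90%+: set aside for a human look
    return 'pass';                           # anything else goes back out untouched
}

# Add the score header to a message held in a single string.
# The header block ends at the first blank line, so the new line goes just before it.
sub add_score_header {
    my ($message, $score) = @_;
    my $header = sprintf("X-Spamfilter-Score: %.4f\n", $score);   # value format is an assumption
    unless ($message =~ s/\n\n/\n$header\n/) {
        # No blank line found: the message is headers only, so append.
        $message .= "\n" unless $message =~ /\n\z/;
        $message .= $header;
    }
    return $message;
}
```

Putting the header at the end of the header block, rather than at the very top, leaves the mbox "From " line as the first line of the message.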
Older files:
Have fun and let me know if you find these useful!