Over a Decade of Spam and I Still Haven't Killed Anyone (Yet)

November 17, 2006

I've been using SpamProbe to separate the wheat from the chaff for the last four years. That, along with the fact that I rarely delete email, gives me a reasonable set of data to analyze the performance of a spam filter. So, how does SpamProbe stack up?

Graphs With Lines and Stuff

SpamProbe: Classifications per Month (Count)

The exponential increase has flattened the numbers we really care about, and logarithmic-scale plotting in Ploticus has failed me, so here's the same graph with correct classifications omitted:

SpamProbe: False Classifications per Month, 2002-2006 (Count)

That second graph is mildly depressing, but it reflects my day-to-day experience. Namely, more and more spam messages seem to be sneaking by SpamProbe and being incorrectly classified as legitimate messages. But how does the increase in false negatives stack up compared to the total amount of spam I'm getting? Let's take a look at the data again, but this time as a percentage rather than a sum:

SpamProbe: Classifications per Month, 2002-2006 (Percent)

And the same data again, without the correctly classified spam:

SpamProbe: False Classifications per Month, 2002-2006 (Percent)

As you can see from the graphs, the percentage of false positives, or legitimate mail incorrectly classified as spam, sits pretty steady around 0%, while the percentage of false negatives, or spam incorrectly classified as legitimate mail, has hovered below 5% for just over two years. Not too shabby for a lowly Bayesian classifier. By the way, the large peaks in the percentage graphs are mostly anomalous (see below).
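For the curious, the percentages above are just the false classifications divided by total classifications, with ham excluded (see the caveats below). Here's a quick sketch of the arithmetic; the counts are made up for illustration, since the real numbers live in the downloadable raw data:

```python
def rates(counts):
    """Return (false-negative %, false-positive %) over all classified
    spam-related messages. Ham is excluded, matching the graphs above."""
    total = counts["correct_spam"] + counts["false_neg"] + counts["false_pos"]
    fn_pct = 100.0 * counts["false_neg"] / total
    fp_pct = 100.0 * counts["false_pos"] / total
    return fn_pct, fp_pct

# Hypothetical tallies for one month, not the actual data.
month = {"correct_spam": 4200, "false_neg": 180, "false_pos": 1}
fn, fp = rates(month)
print(f"false negatives: {fn:.1f}%, false positives: {fp:.2f}%")
```

Including ham in the denominator would shrink both percentages, which is exactly why the graphs look worse than my inbox feels.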

Caveats

Are aphorisms about liars and statistics bouncing around in your head right now? Good. Here are some of the gotchas with this data:

  • The graphs above do not include "ham". Ham is correctly classified, non-spam messages. Including ham would flatten the percentage graphs by increasing the percent of correctly classified messages and decreasing the percent of falsely classified messages. If there's any interest, I can add additional graphs which include correctly classified, non-spam messages.
  • The false negative peaks in months 24 and 28 weren't due to any mistakes on the part of SpamProbe; I managed to break SpamProbe and/or fill up the disk where my mail is stored on a couple of occasions.
  • I have catch-all addresses enabled for some of my domains (e.g. foo@example.com, bar@example.com, and asdf200notarealname@example.com are all routed to my inbox). This doesn't necessarily affect the accuracy of SpamProbe, but it certainly increases the amount of spam I receive.
  • I purchased a few additional domains between 2002 and 2006. I haven't added any within the last year, though, so new domains don't account for the exponential increase in spam in the last 12 months.
  • I upgraded SpamProbe a handful of times, and re-trained the classifier once.
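Since I keep calling SpamProbe a "lowly Bayesian classifier," here's roughly what that means. Per-token spam probabilities are learned from training data, then combined under a naive independence assumption. This is a Graham-style sketch with invented token probabilities, not SpamProbe's actual implementation, which differs in its token selection and combining details:

```python
def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities via Bayes' rule, naively
    assuming the tokens are independent of one another."""
    prod_spam = 1.0  # P(tokens | spam)
    prod_ham = 1.0   # P(tokens | ham)
    for p in token_probs:
        prod_spam *= p
        prod_ham *= (1.0 - p)
    return prod_spam / (prod_spam + prod_ham)

# Tokens like "viagra" carry probabilities near 1.0; tokens like
# "ploticus" sit near 0. A few strong tokens dominate the score.
spammy = combined_spam_probability([0.99, 0.95, 0.20])
hammy = combined_spam_probability([0.10, 0.20, 0.40])
print(spammy, hammy)
```

The upshot is that classification quality rides entirely on the training corpus, which is why re-training the classifier (and breaking the mail spool) shows up as spikes in the graphs.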

Conclusions

If I wanted to be scientific and objective and all that crap, or at least methodical and thorough, I would take several competing spam classifiers and feed them the same corpus, then compare the results. I'm not really trying to be objective, though; SpamProbe seems to be working pretty well, at least for now. Oh yeah, if you're interested in playing with the actual numbers, or if you're curious how I processed the data and generated the graphs, feel free to download the raw data.