It is a matter of going to war with the software you have, not the software you wish you had. SpamAssassin allows you to do plug-and-play logic for tests, but for implementation reasons it only accepts regular expressions. Those are Good Enough for most email filtering tasks, very flexible, and have fairly predictable security and resource consequences. This is the architecture that lets SpamAssassin subject an email to literally hundreds of tests (though in fairness they're probably less efficient than a plain naive Bayesian approach; don't take my word for it, my anti-spam researcher days are almost three years in the rear-view mirror at this point) and evolve quickly with the fast-changing, particularized nature of the spam threat at any given installation.
Not to say that it's optimal -- it is not -- but there is a reason it is done that way, as opposed to a fully executable plugin architecture which would have access to your date-parsing library of choice.
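To make that architecture concrete, here is a minimal sketch in Python (the rule names, patterns, and scores are made up for illustration; SpamAssassin itself is written in Perl and its scoring is far more involved): each test is just a regex with a score, and a message's hits are summed against a threshold.

    import re

    # Toy rule table: each entry is (name, pattern, score). Real SpamAssassin
    # ships hundreds of these, re-tuned as the spam mix changes.
    RULES = [
        ("SUBJ_ALL_CAPS", re.compile(r"^Subject: [A-Z0-9 !]+$", re.M), 1.0),
        ("FREE_MONEY",    re.compile(r"free money", re.I),             2.5),
        ("LOTTERY_WIN",   re.compile(r"you have won", re.I),           2.0),
    ]

    def score_message(raw, threshold=5.0):
        hits = [(name, pts) for name, pat, pts in RULES if pat.search(raw)]
        total = sum(pts for _, pts in hits)
        return total, total >= threshold, hits

The appeal is that adding, removing, or re-scoring a test is a one-line change with a predictable cost, which is how you end up with hundreds of them.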
If you look at the bug, it's not a problem caused by using regular expressions as such, but rather by the choice of which dates count as "grossly in the future" (i.e. so far into the future that it couldn't be a legitimate date at the time the software is running).
The regex was chosen to match 2010 to 2099, which it did just fine.
So the problem wasn't in choosing to use regexes but in choosing 2010 as the cutoff. I'm sure that date was "grossly in the future" at some point (probably when the regex was first written), but obviously we are living in the future now and the date needs to be moved forward.
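For the curious, the pattern at issue behaves roughly like this, reconstructed here in Python (the shipped rule is a Perl regex in SpamAssassin's rule files, so treat the exact pattern as an approximation):

    import re

    # Matches any year 2010-2099 in a Date header: a reasonable stand-in for
    # "grossly in the future" when it was written, wrong once 2010 arrived.
    past_20xx = re.compile(r"20[1-9][0-9]")

    for header in ("Date: Fri, 01 Jan 2010 09:00:00 +0000",
                   "Date: Thu, 31 Dec 2009 09:00:00 +0000"):
        print(header, "->", bool(past_20xx.search(header)))
    # The 2010 header matches (false positive in 2010); the 2009 one does not.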
I don't see why. The heuristic here was fine (is the date too far in the future?), and not part of the problem. The bug was in the implementation of the heuristic ("any date after 2010 is too far in the future").
Maybe you're making a software-maintainability argument instead? But that's clearly not a good argument for classifiers over heuristics.
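A sketch of what keeping the heuristic but fixing the implementation might look like, computed against the clock instead of a hardcoded cutoff (plain Python, purely to illustrate the argument; it is not how SpamAssassin rules are written, since those are regexes rather than arbitrary code):

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    def date_grossly_in_future(date_header, slack_years=2):
        """True if the Date header is more than slack_years ahead of now."""
        try:
            sent = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            return False  # unparseable dates are a different rule's problem
        return sent.year > datetime.now(timezone.utc).year + slack_years

    # A 2010 date stops being "grossly in the future" once it is 2010.
    print(date_grossly_in_future("Fri, 01 Jan 2010 09:00:00 +0000"))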
Is rule-based spam filtering still helpful? Wouldn't a big default spam database + machine learning work much better than rules + an empty default spam database + machine learning?
The rules help a lot, especially during early training of the Bayes database. It's amazing how much stuff is still caught by them... Over time, the SA Bayesian recognizer gets good enough that the rules play a relatively small role. I think I had no false positives from this little bug, thanks to the Bayesian counterweight.
A default database is a poor idea, though. One thing I've learned in helping folks with SA is that people get very different mixtures of spam and mail. I don't think you'd like my database at all. There's really no substitute for making your own...
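For anyone who hasn't peeked inside a Bayes database: it's essentially per-user token statistics learned from mail you classified yourself, which is why mine wouldn't transfer well to you. A toy Python sketch of the idea (nothing like SpamAssassin's actual implementation, which is considerably more sophisticated):

    import math
    import re
    from collections import Counter

    # Per-user "database": token counts from mail *you* marked as spam or ham.
    spam_tokens, ham_tokens = Counter(), Counter()
    spam_msgs = ham_msgs = 0

    def train(text, is_spam):
        global spam_msgs, ham_msgs
        tokens = re.findall(r"[a-z0-9$]+", text.lower())
        if is_spam:
            spam_tokens.update(tokens); spam_msgs += 1
        else:
            ham_tokens.update(tokens); ham_msgs += 1

    def spam_probability(text):
        log_odds = 0.0
        for tok in set(re.findall(r"[a-z0-9$]+", text.lower())):
            p_spam = (spam_tokens[tok] + 1) / (spam_msgs + 2)
            p_ham  = (ham_tokens[tok] + 1) / (ham_msgs + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))

Two people feeding this different mail streams end up with very different token tables, and therefore very different classifiers.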
Probably a lot of the effectiveness comes from user spam-classification feedback. The same message is generally mass-mailed, so the chance that someone else received and flagged the spam you got before you even check your mail is pretty high.
Summary: FH_DATE_PAST_20XX matches on years 2010-2099.
As a workaround until the SpamAssassin rules are updated, the score can be lowered in local.cf:

    score FH_DATE_PAST_20XX 0.0
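If it's not obvious where that line goes: local.cf typically lives somewhere like /etc/mail/spamassassin/ (the path varies by install). After editing, running spamassassin --lint will catch syntax mistakes, and spamd needs a restart before it picks up the new score.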