auto-mass-check is a simple script that runs SpamAssassin's mass-check on defined folders of ham and spam, generates log files, and uploads those logs to the SpamAssassin project. It is meant to run unattended from cron once per day as a non-root user. SpamAssassin upstream uses these logs nightly in RuleQA to assess the safety and efficacy of anti-spam rules. Mass-checking is crucial to the maintenance and improvement of SpamAssassin in the battle against spammers.
You can read more about mass-checking at SpamAssassin. That page is confusing, which is why this tool exists: it wraps the complexity of upstream's tools. I intend for this site to make mass-checking easy to understand, and eventually for this code and its updated documentation to replace the upstream pages.
auto-mass-check is currently in development. We intend to make it easy to deploy by packaging it as an RPM in the standard Fedora repository.
- git clone git://git.fedorahosted.org/auto-mass-check.git
- Git Browser
How to Use
- The server that runs the nightly masscheck should have the entire stack of SpamAssassin plus plugin prerequisites. The easiest way to be sure all of it is installed is to follow this guide to install a full SpamAssassin stack.
- As a non-root user, put auto-mass-check.sh in ~/bin/
- Copy auto-mass-check.cf to ~/.auto-mass-check.cf
- Modify ~/.auto-mass-check.cf to point at your ham and spam folders. Be sure to set the folder format correctly (mbox or Maildir). Leave the RSYNC options unchanged for now, because you will be running auto-mass-check in test mode at first.
- Optionally set TRUSTED_NETWORKS and INTERNAL_NETWORKS in ~/.auto-mass-check.cf
- Run auto-mass-check.
- Look in ~/masscheckwork/nightly_mass_check/ for ham-*.log and spam-*.log files. (On Saturday, look in weekly_mass_check instead.)
- Are the filenames good? They should be named something like ham-username.log or ham-net-username.log.
- Read CorpusCleaning and HandClassifiedCorpora for guidelines on how to identify ham in your spam folder and spam in your ham folder, and on which messages you should simply delete.
- If you move/delete messages, do not forget to "Compact Folder" to be sure they are actually gone.
- Re-run auto-mass-check and repeat the cleaning until you are certain both folders are clean.
- As noted in upstream's documentation, send an e-mail to private@… requesting an rsync account that you will use for uploads. Wait until they send you a username and password.
- Edit ~/.auto-mass-check.cf and set RSYNC_USERNAME and RSYNC_PASSWORD.
- Run auto-mass-check again. This time it will upload your results.
- Ask a more experienced participant (probably the person who recruited you) to check your results on the server. They can see the uploaded log files by running a command like rsync --old-d username@…::corpus/
- If your upload looks good, then you're probably ready to automate nightly checks. Configure auto-mass-check to run as a cron job as your non-root user at roughly 9AM UTC.
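The final cron step can be sketched as follows. This is only an illustration, not part of auto-mass-check: `cron_entry` is a hypothetical helper, the default path assumes the script was installed to ~/bin/ as described above, and the schedule assumes cron on your server evaluates times in UTC. Cron normally uses the system's local time zone, so adjust the hour if your server is not set to UTC.

```shell
#!/bin/sh
# Sketch: print a crontab entry that runs auto-mass-check daily at 09:00.
# Assumption: the script lives at ~/bin/auto-mass-check.sh and the
# server clock is set to UTC (otherwise change the hour field).
cron_entry() {
    script=${1:-"$HOME/bin/auto-mass-check.sh"}
    # Fields: minute hour day-of-month month day-of-week command
    printf '0 9 * * * %s\n' "$script"
}

# Show the entry so it can be reviewed before installing it.
cron_entry

# To append it to the current (non-root) user's crontab, one option is:
#   ( crontab -l 2>/dev/null; cron_entry ) | crontab -
```

Printing the entry first, rather than installing it directly, leaves room to check the path and the hour by hand.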
- Start with only a ham folder. Ham is easier to sort properly than spam. Include a variety of mail that you WANT to receive. Start with your ham from the past few months. Be sure that you have inspected each message manually. Make sure the script is working properly in cron with this small amount of mail. Once it is working, you can expand the corpus by adding more ham or your spam.
- DO NOT INCLUDE DISCUSSION MAILING LISTS IN YOUR HAM. Personal mail only is a good general rule. The best ham sample comes from a variety of different senders, not all from the same mail server, as happens with a discussion list.
- DO NOT INCLUDE DISCUSSION ABOUT SPAM IN YOUR HAM. Try to minimize technology mail in your ham.
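One rough way to spot-check the sender variety recommended above is to count messages per From: address. The sketch below is not part of auto-mass-check: `sender_counts` is a hypothetical helper, it assumes a Maildir folder (with cur/ and new/ subdirectories), and it ignores folded headers, so treat the output as a hint only. An mbox folder would need a different approach.

```shell
#!/bin/sh
# Sketch: list From: addresses in a Maildir folder with message counts,
# most frequent sender first. Assumes one message per file (Maildir).
sender_counts() {
    maildir=$1
    for msg in "$maildir"/cur/* "$maildir"/new/*; do
        [ -f "$msg" ] || continue
        # Print the first From: header; stop at the blank line ending the headers.
        sed -n -e '/^$/q' -e '/^[Ff]rom:/{p;q;}' "$msg"
    done |
        tr -d '\r' |
        # Drop the header name, then keep only the address inside <...> if present.
        sed -e 's/^[Ff]rom:[[:space:]]*//' -e 's/.*<\([^>]*\)>.*/\1/' |
        tr '[:upper:]' '[:lower:]' |
        sort | uniq -c | sort -rn
}

# Example (the folder path is a placeholder):
#   sender_counts ~/Maildir/.ham | head -20
```

If one address or one list server dominates the top of the output, the ham folder probably contains mailing-list traffic that should be removed.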
To Do
- Use svn instead of rsync to obtain the nightly masscheck like this.
- Implement a --rules=PATTERN option that is simply passed through to mass-check, making it easy to test only specific rules.
- Package into rpm and put into Fedora repos. Perhaps make deb packages too.
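The --rules item above is not implemented yet. As a minimal sketch of what the pass-through could look like, the hypothetical function below picks a --rules=PATTERN argument out of the script's command line and keeps it for the later mass-check invocation. The names `parse_rules_option` and `RULES_ARG` are assumptions made for illustration and do not exist in auto-mass-check.

```shell
#!/bin/sh
# Sketch only: remember a --rules=PATTERN option so it can be handed
# to mass-check unchanged. Other arguments are ignored here.
parse_rules_option() {
    RULES_ARG=""
    for arg in "$@"; do
        case $arg in
            --rules=?*)
                RULES_ARG=$arg ;;
            --rules=|--rules)
                echo "error: --rules requires a pattern" >&2
                return 2 ;;
        esac
    done
    return 0
}

# In the script this would be called as:  parse_rules_option "$@"
# and the option appended to the mass-check command only when it was given:
#   ./mass-check ${RULES_ARG:+"$RULES_ARG"} ...other options...
```

Keeping the whole `--rules=PATTERN` string intact and quoting it on expansion avoids the shell re-interpreting characters in the pattern.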
Talk to wtogami or nb.