pigmail.rb version 0.01
~~~~~~~~~~~~~~~~~~~~~~~

Table of Contents
~~~~~~~~~~~~~~~~~

 0. License
 1. pre-Spiel
 2. Features
 3. Justification
 4. Algorithm
 5. Quick Start
 6. Known Bugs
 7. TODO list
 8. I'd like to thank...


0. License
~~~~~~~~~~

This utility is released under the terms of the BSD license:

| Copyright (c) Tim Haynes , 2002
| All rights reserved.
|
| Redistribution and use in source and binary forms, with or without
| modification, are permitted provided that the following conditions are met:
|
| * Redistributions of source code must retain the above copyright
|   notice, this list of conditions and the following disclaimer.
| * Redistributions in binary form must reproduce the above copyright
|   notice, this list of conditions and the following disclaimer in the
|   documentation and/or other materials provided with the distribution.
| * Neither the name of the nor the names of its
|   contributors may be used to endorse or promote products derived from
|   this software without specific prior written permission.
|
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
| IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
| TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
| PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
| OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
| EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
| PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
| PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
| LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
| NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
| SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


1. pre-Spiel
~~~~~~~~~~~~

There are several Bayesian and probabilistic spam-identifying utilities
available at large already.
See Paul Graham's "A Plan for Spam" document if you want more about that.
This implementation is slightly different for reasons that will become
apparent later on :)


2. Features
~~~~~~~~~~~

a) multiple categories;
   You can have as many categories as you want, so you're not limited to
   `ham' or `spam'. However, you'll need a larger representative sample,
   and the mails should be significantly different in all the categories
   you use, for maximum benefit.

b) multiple token-classes;
   Other implementations do not distinguish words (`tokens' to be more
   accurate) by where they appear in a mail. With pigmail, you can define
   your own token-classes such as bodyWords, bodyWordPairs, Subject,
   Sender, Recipients, and so on.

c) backend storage is done with lots of DBM files;
   I considered using a PostgreSQL database for this, but it really
   doesn't gel very nicely with the two parts above.

d) It's written in Ruby.


3. Justification
~~~~~~~~~~~~~~~~

Let's say now, we're going to be precise about the terminology used.

A `category' is one axis of comparison, normally "ham" or "spam".

A "token-class" (or class of tokens) is something like "bodyWords" or
"rblInfo" or "Subject" or "Sender"; it defines a token as a function of
the input stream - quite often this may be related to the location of a
part of the mail object, but it doesn't have to be that way.

Once you've implemented something generic in terms of the categories it
implements, there's no reason to limit oneself to just two of the things.
Some people might want multiple categories; for what it's worth, _ifile_
already does this but with a relatively crude "database" system as the
backend.

The multiple token-classes might require more thought.

Consider your old anti-spam procmail rules. If you did it right, you had
white-lists for known-good senders whereby their mails were not subject
to other processing. You picked out certain domains, senders and subjects
and considered them "probably spam".
You used the RBL as an indicator, and got bored with the inaccuracy of
the data, especially on mailing lists. You got really desperate and
started checking for generic things like "Subject ends in a lot of
white-space plus at least 4 numbers", and even "body contains certain
phrases", at a rate of a procmail rule per phrase, relegated to the
bottom of the filtering pile because checking the body is comparatively
expensive.

Then along came SpamAssassin. This took all the above and assigned scores
to the results, for various key "triggers", and added up the scores for
each to determine an overall spammishness level. It integrates all the
above in a particular way into one utility you can use without any more
procmail recipes around it. Unfortunately, people start plugging data
into razor that doesn't belong there, which reduces the effectiveness of
SA using that backend. And it couldn't really be run reliably on many
mailing lists, because of the low threshold required to catch more spam
elsewhere - it has a tendency to report false positives, making online
ordering quite hard work, etc.

Well I did all this, anyway.

Now consider a probabilistic approach. All you're interested in is the
likelihood that a given mail is "probably spam", or "probably ham", or
probably whatever-other-categories you've created. Within that, various
token-classes contribute to the overall score. The fact that a word
appears in the body of a spam more often than it does in a ham mail is
indicative. The fact that a given sender emits more spam than they do ham
is very telling. The fact that a mail talks about the RBL is different to
a given mail being flagged with an X-RBL-Warning header. It is not at all
clear that all these tokens carry equal "weight", and I would in fact
maintain that they don't.

Pigmail is written such that you can create your own token-classes, and
you can assign weights to each of them.

As an example, consider the case of there being an X-RBL-Warning header
in the mail.
Such a thing normally takes this sort of form:

| (bl.spamcop.net) Blocked - see http://spamcop.net/bl.shtml?211.187.28.222

or

| (relays.osirusoft.com) [1] AsiaSpam-211, see http://spews.org/ask.cgi?S429
| (relays.osirusoft.com) Open Proxy: http(8080)
| (relays.osirusoft.com) Open Proxy: http(3128)

All these things contain verbose text, not all of which is indicative of
spam - the specific URLs given, especially spamcop.net with the IP# on
the end, are irrelevant to the quantity "spammishness". Maintaining a
count of how many times you see a header as verbose as each of the above
will not help at all.

One possible answer to this is to create a token-class for handling the
X-RBL-Warning header separately (ignoring any references to it in the
body of a mail, where the thing is being discussed - quite possibly by
clueful non-spammers), writing a method that extracts just the first word
in parentheses (the RBL *domain* that identified the sending host) and
then the word "open" or "blocked" later in the string. Then when you
teach the database about a lot of mails, you'll build up data concerning
how spammy and how like-ham the fact that a given mail contains a
specific RBL header is - you're both monitoring the effectiveness of each
RBL domain and using the result of that monitoring to assign a
probability for future use.

The same can be done for Senders, so you automatically build up a
shades-of-grey list (in practice, I've seen up to 10 spams coming from
one sender - that now counts against him; I also get lots of ham mail
from repeat good senders, and that counts in their favour - hence
weighting up the Sender token-class to have more influence on the final
decision than it might otherwise do is quite a reasonable idea IMO).

And it's written in Ruby because I'm making an attempt to learn the
language, and I haven't seen a similar project yet.


4. Algorithm
~~~~~~~~~~~~

a) Learning

   For each mail specified on the commandline, the token-classes are
   computed and added to the values in the given category. A `Total
   Score' value is incremented, to maintain a sum of all the tokens'
   incidence counts in a given token-class.

b) Checking

   For each mail specified on the commandline, the token-classes are
   computed. For each category C, token-class tc and token t, if C+tc
   contains a match, the score of match-against-category is increased by
   the incidence-count divided by the Total Score in tc. The degree of
   success matching each token-class against a category is the sum of
   weighted tc scores.

   The most likely category is the one with the highest resultant score.
   This is returned as the program's output.

c) Feedback loops

   Once a new mail has been identified as belonging to a particular
   category based on the strength of certain tokens present in that mail,
   it's desirable to add *all* the tokens back into that category, both
   to reinforce the existing tokens' values, and to add new tokens as
   well. (E.g., if you receive a mail talking about PHP, that's probably
   a good thing. If you receive another mail talking about PHP and MySQL,
   then MySQL has become a good token by association. This is reflected
   by having scores in the "ham" category for those tokens go up by +2
   and +1 respectively.)

   Feedback loops may be implemented either externally (e.g. using
   procmail rules - "if I've matched it once and concluded it's good, add
   it back into the good category" and so on), or internally to pigmail.
   If you specify both a --check and a --learn option, the check will be
   performed first, and whatever category it determines matches best will
   have the mail "learnt" against it. (Specifying a category on the
   commandline is meaningless in this case, and will be overwritten with
   one of pigmail's own choosing.)


5. Quick Start
~~~~~~~~~~~~~~

First, you need a body of mails, very well separated into multiple
categories, to start with.
You train the database by running pigmail with the --learn option, thus:

  #ML, MH one-file-per-mail systems:
  find probably-spam -type f ! -name \*.gz ! -name .\* | \
    xargs -n 10 pigmail --learn=spam

(choose any number of mails per execution of the command here - up to 20
is feasible IME.)

  #mbox
  formail -ds pigmail --learn=spam < spammboxfile

Repeat for as many categories and mail-sources as you have.

Note that the mbox approach here is really slow and inefficient, as it
spawns a complete new ruby interpreter for each and every mail going
past, whereas the one-file-per-mail approach wins because the backend DBM
files are held open for all mails on the commandline. Experimental
evidence suggests that 685 mails can be learnt in under 5 minutes with
the xargs approach, while formail got boring after 20 minutes.

To test what it thinks of a given mail, use:

  pigmail --check [mailfile1 [mailfile2...]] [ < mailfile ]

Output will be like:

| zsh/scr, potato 1:23PM mail/ % pigmail.rb --check mail.securityfocus/6868
| Matching file: mail.securityfocus/6868: best category = ham

If no mails are specified on the commandline, it reads one mail (and only
one mail) from stdin.

Procmail:
Pipe the mail through pigmail and note the output results, then use
formail to add a header if you wish. The feedback looping will be much
quicker if the builtin version is used; just specify both --check and
--learn on the commandline in procmail's calls to pigmail.


6. Known Bugs
~~~~~~~~~~~~~

TMail, which is used for storing the mail as an object in the background,
doesn't always parse a mail correctly; if you see problems such as
.multipart? attempting to call downcase on a nil object, this is not
pigmail's fault.


7. TODO list
~~~~~~~~~~~~

make the list of token-classes dynamic - it should be necessary only to
add a token-class to the list at the top of the script and implement a
method of that name, not to add a call to that method in the initialize
method as well.

implement a generic level-aware `debug' class - there are too many
`if deBug==2' statements littered about the code at present.
(Note: this is now done.)

migrate code to TMail 0.10 instead of 0.9.3 (there may be some API
changes required).

start using getopt.rb instead of reinventing the wheel.

Consider using an RDBMS as the backend.

Addition of these commandline options:

  --exitstatus   # set exit-status according to category number
  --unlearn      # obvious
  --vacuum       # some set of criteria, least-frequent, least useful in
                 # distinguishing...


8. I'd like to thank...
~~~~~~~~~~~~~~~~~~~~~~~

Paul Graham for setting so many people off on this Bayesian matching /
filtering idea.

The folks who wrote and work on Ruby, as it's such a .. fun .. OO
scripting language to write in.

Friend Dave Pearson for bouncing ideas off and asking lots of questions.

Mum for baking decent fruit-cake.
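

As a footnote to sections 3 and 4, here is a minimal sketch of the
learning and checking steps, including an X-RBL-Warning token-class of
the kind described in the Justification. It is not pigmail's actual code:
every name in it (TOKEN_CLASSES, WEIGHTS, new_db, learn, check) is made
up for illustration, a mail is assumed to be a plain Hash rather than a
TMail object, in-memory Hashes stand in for the DBM files, the weights
are arbitrary, and the `Total Score' is assumed to be kept per category
and token-class.

```ruby
# Hypothetical sketch of section 4; no name here is taken from pigmail.rb.
# A mail is a plain Hash (:body string, :rbl array of X-RBL-Warning values)
# and in-memory Hashes stand in for the DBM files.

# Each token-class maps a mail to a list of tokens.
TOKEN_CLASSES = {
  # Lower-cased words of the body.
  "bodyWords" => ->(mail) { mail.fetch(:body, "").downcase.scan(/[a-z0-9']+/) },
  # Keep only the RBL domain in parentheses, plus "open" or "blocked" if
  # either word appears later in the header; drop the verbose remainder.
  "rblInfo" => lambda do |mail|
    mail.fetch(:rbl, []).flat_map do |header|
      m = header.match(/\A\(([^)]+)\)(.*)\z/m)
      next [] unless m
      verdict = m[2][/\b(?:open|blocked)\b/i]
      [verdict ? "#{m[1]} #{verdict.downcase}" : m[1]]
    end
  end
}.freeze

# Assumed weights; any token-class not listed counts as 1.0.
WEIGHTS = { "bodyWords" => 1.0, "rblInfo" => 2.0 }.freeze

def new_db
  { counts: Hash.new(0), totals: Hash.new(0) }
end

# Learning: bump each token's incidence count in the given category and
# the Total Score of that category + token-class.
def learn(db, category, mail)
  TOKEN_CLASSES.each do |tc, extract|
    extract.call(mail).each do |token|
      db[:counts][[category, tc, token]] += 1
      db[:totals][[category, tc]] += 1
    end
  end
end

# Checking: per category, sum weight * incidence-count / Total Score over
# all token-classes, then return the category with the highest score.
def check(db, categories, mail)
  scores = {}
  categories.each do |category|
    scores[category] = TOKEN_CLASSES.sum do |tc, extract|
      total = db[:totals][[category, tc]]
      next 0.0 if total.zero?
      hits = extract.call(mail).sum { |t| db[:counts][[category, tc, t]] }
      WEIGHTS.fetch(tc, 1.0) * hits / total
    end
  end
  scores.max_by { |_, score| score }.first
end

db = new_db
learn(db, "spam", { body: "buy cheap pills now",
                    rbl: ["(bl.spamcop.net) Blocked - see http://spamcop.net/bl.shtml?211.187.28.222"] })
learn(db, "ham", { body: "ruby patch for the mailing list" })
puts check(db, %w[ham spam], { body: "cheap pills", rbl: ["(bl.spamcop.net) Blocked"] })  # prints "spam"
```

The built-in feedback loop of 4c would then amount to calling learn with
whatever category check has just returned.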