Work: Looking Through Our Mail

Jeffreys Copeland & Haemer

(SunExpert, March 1998)



Letters, We get letters, We get stacks and stacks of letters ...
                        --- Perry Como





Early this month, we got a letter from a corporate attorney advising us that the courts had ordered his client to produce all our correspondence with that client.

We were surprised by this -- wouldn't you be? -- if for no other reason than that we didn't think we'd had any.

But with as much mail as we get, who could tell?

Curious, we tried using grep, to look through all our old, saved mail, but the strings we were looking for were pretty common. Worse, grep really didn't give us what we wanted: mail messages.

Stepping back a moment, we realized that the problem was that we needed a grep that understood the semantics of mail. In fact, we reasoned, this wasn't even a limited-utility tool. Questions like, ``Where the heck are those messages from Charlie about the Viper project?'' are pretty frequent.

How long would it take to write such a tool from scratch? A morning, as it turns out. What's more, doing it is a pretty good illustration of how to build useful tools quickly, so in this column, we'll walk you through the process.

As a side benefit, the next time you know an attorney who's trying to look through large volumes of electronic mail, you'll be prepared to provide a line-by-line explanation of this tool.


Design by Analogy

Start by asking how the tool should look to the user. Even if you're making something that no one else will ever use, design the interface as though you're writing it for someone else. Chances are, six months later, you'll forget why you made the choices you did. A good rule of thumb is to make the tool look as much like something else familiar as you can. Because ``theft'' is a poor choice of words around lawyers, we like the phrase ``design by analogy.'' Here, we want something grep-like so we'll make the interface look like grep's.

We'll even call our tool mgrep, for mailgrep.

This decision means that in two years, after we've forgotten what we did, when we sit down to look for all mail messages containing the string zzazz in the mailboxes Feb.mbox and Mar.mbox, we can start out trying

mgrep zzazz Feb.mbox Mar.mbox
and not be surprised by the result.

What's more, a little attention to detail now will let us pull out all messages containing zzazz or ZZAZZ or Zzazz with

mgrep -i zzazz Feb.mbox Mar.mbox
Why -i? Because that's the flag grep uses for case-insensitive matches.

Moreover, this choice saves us from having to design all features and options either from scratch or at once. If we do a bare-bones implementation now, and a year later want to add an option that means ``report all mailboxes that contain messages with this pattern,'' we don't have to convene a design committee. All we need to do is glance through the grep man page, note the -l option (which prints only the names of files containing lines that match the pattern), and write code to add a -l flag to mgrep.


More Theft


Man is a tool-using animal... Without tools he is nothing, with tools he is all. -- THOMAS CARLYLE, Sartor Resartus

On to implementation.

Our goal is to get something working quickly, so Perl seems like a good tool to use. What's more, this is a text-processing problem -- one of Perl 's strengths. Plus, we know Perl \m familiarity is never a factor to be ignored.

If we're going to use Perl , our first impulse is to continue to steal by raiding the CPAN: the Comprehensive Perl Archive Network. Oh. Sorry. Make that, ``... build on the work of others.''

We'll begin by perusing the modules list, at http://www.perl.com/CPAN/modules/ Well, not actually there, since going to http://www.perl.com/CPAN/ takes you to a multiplexer that then throws you automatically to a mirror near you. This trick gets you good performance without making you memorize a lot of URLs. (NOTE: The trailing slash is important. Without it, you don't get the multiplexer.)

Once there, we quickly find our way to http://www.perl.com/CPAN/modules/00modlist.long.html, the current module list, and begin looking.

Modules are Perl's rough equivalent of Ada packages or C++ classes: language extensions, often object-oriented, to handle specific problems.

What will we need? Well, let's see ...

Something to handle argument parsing would be nice, so we don't have to hand-craft our emulation of grep's flags. A search for ``option'' quickly yields up an entire section of the CPAN, which begins like this:

12) Option, Argument, Parameter and Configuration File Processing

Name           DSLI  Description                                  Info
-----------    ----  -------------------------------------------- -----
Getopt::
::EvaP         Mdpr  Long/short options, multilevel help          LUSOL
::Gnu          adcf  GNU form of long option handling             WSCOT
::Help         bdpf  Yet another getopt, has help and defaults    IANPX
::Long         Supf  Advanced option handling                     JV
::Mixed        Rdpf  Supports both long and short options         CJM
::Regex        ad    Option handling using regular expressions    JARW
::Simple       RdpO  A simplified interface to Getopt::Long       RSAVAGE +
::Std          Supf  Implements basic getopt and getopts          P5P
::Tabular      adpr  Table-driven argument parsing with help text GWARD +

Getopt::Std (``Implements basic getopt and getopts'') should do the trick, working to match the familiar POSIX.1 call, getopt(). What's more, the `S' in the second column means that this module is a standard part of the Perl 5 distribution, so we don't even have to pull a copy off of the archive.

What else? Well, section 19 looks pretty good: It begins like this

19) Mail and Usenet News

Name           DSLI  Description                                  Info
-----------    ----  -------------------------------------------- -----
Mail::
::Address      adpf  Manipulation of electronic mail addresses    GBARR
::Alias        adpO  Reading/Writing/expanding of mail aliases    GBARR
::Cap          adpO  Parse mailcap files as specified in RFC 1524 GBARR
::Field        RdpO  Base class for handling mail header fields   GBARR +
::Folder       adpO  Base-class for mail folder handling          KJOHNSON
::Header       RdpO  Manipulate mail RFC822 compliant headers     GBARR +
::Internet     adpO  Functions for RFC822 address manipulations   GBARR
::MH           adcr  MH mail interface                            MRG
::Mailer       adpO  Simple mail agent interface (see Mail::Send) GBARR
::POP3Client   bdpO  Support for clients of POP3 servers          SDOWD
::Send         adpO  Simple interface for sending mail            GBARR
::Util         adpf  Mail utilities (for by some Mail::* modules) GBARR

Most of these are packaged up in a single tar file, by Graham Barr, called ``MailTools,'' so we pull over the most recent version, MailTools-1.1003.tar.gz.

After we unpack it like this:

tar -zxvf MailTools-1.1003.tar.gz  # unpack the archive
cd MailTools-1.1003           # enter the source directory
(assuming we have GNU tar with a decompressor -- the -z flag -- otherwise, we can pipe gzcat into tar) and can use the -z flag building and installing the module and its documentation only requires following the instructions in the README:
perl Makefile.PL              # build a Makefile for your system
make                     # build the package
make test                # test (!) it
make install                  # install it



Craft

A little manual page perusal reveals that this is enough. A small amount of work (some of it cut-and-paste, which one of our colleagues calls ``snarf-and-barf'') gives us this:

   1   #!/usr/local/bin/perl -w
   2   # $Id: mgrep,v 1.4 1997/12/27 00:11:13 jsh Exp $

   3   require 5.004;
   4   use strict;

   5   use Getopt::Std;
   6   use Mail::Util qw(read_mbox);
   7   use Mail::Internet;

   8   my ($options, $parts, $pattern, $usage);
   9   my %opt_args;

  10   # parse args and check for proper invocation
  11   $usage = "usage: $0 [-b|-h|-H Header_field|-W] [-i] [-v] pattern [mailbox ...]";
  12   $options = 'H:Wbhiv';
  13   getopts $options, \%opt_args or die $usage;
  14   foreach (split //, $options) {
  15    $opt_args{$_} ||= 0;
  16   }

  17   $parts = $opt_args{'b'} + $opt_args{'h'} + ($opt_args{'H'} ? 1 : 0);
  18   $parts < 2 or die $usage;
  19   $opt_args{'W'} = ! $parts;

  20   $pattern = shift;
  21   if ($opt_args{'i'}) {
  22    $pattern = "(?i)$pattern";
  23   }

  24   if (@ARGV == 1) {push @ARGV, "/dev/stdin";}
  25   while (@ARGV) {
  26    my $mbox = shift;
  27    $^W = 0;
  28    my @msgs = read_mbox $mbox or die "Can't read $mbox:$!";
  29    $^W = 1;
  30    foreach (@msgs) {
  31     my ($tgt, $mail);
  32     $mail = Mail::Internet->new($_);

  33     my $head = $mail->head;
  34     my $body = $mail->body;

  35     if ($opt_args{'W'}) {# the default
  36      $tgt = [@$body, @{$head->header}];
  37     } elsif ($opt_args{'b'}) {
  38      $tgt = $body
  39     } elsif ($opt_args{'h'}) {
  40      $tgt = $head->header;
  41     } elsif ($opt_args{'H'}) {
  42      $tgt = [ $head->get($opt_args{'H'}) ];
  43     } else {
  44      die $usage;
  45     }

  46     $mail->print
  47      if ( grep /$pattern/, @$tgt xor $opt_args{'v'} );
  48    }
  49   }


  50   =head1 NAME

  51   mgrep - look through mailboxes for messages containing a string

  52   =head1 SYNOPSIS

  53     mgrep [-bhiv] pattern [mailbox ...]

  54   =head1 DESCRIPTION

  55   =over 2

  56   I<mgrep> looks for mail messages containing a pattern,
  57   and prints the resulting messages on standard out.

  58   By default looks in both header and body for the specified pattern.

  59   When redirected to a file, the result is another mailbox,
  60   which can, in turn, be handled by standard User Agents,
  61   such as I<elm>,
  62   or even used as input for another instance of I<mgrep>.

  63   =back

  64   =head1 OPTIONS AND ARGUMENTS

  65   Many of the options and arguments are analogous to those of grep.

  66   =over 8

  67   =item B<pattern>

  68   The pattern to search for in the mail message.
  69   May be any Perl regular expression,
  70   but should be quoted on the command line
  71   to protect against globbing (shell expansion).

  72   =item B<mailbox>

  73   Mailboxes must be traditional, UNIX C</bin/mail> mailbox format.
  74   If no mailbox is specified, takes input from stdin.

  75   =item B<-b>

  76   Look only in the bodies of mail messages.

  77   =item B<-h>

  78   Look only in the headers of mail messages.

  79   =item B<-H>

  80   Look in the specified header of mail messages.
  81   Field names are case-insensitive.

  82   =item B<-i>

  83   Make the search case-insensitive (by analogy to I<grep -i>).

  84   =item B<-v>

  85   Invert the sense of the search, (by analogy to I<grep -v>).

  86   =item B<-W>

  87   Look through the entire mail message (default)

  88   =back

  89   =head1 EXAMPLE

  90     find . -name '*mbox' -print | xargs mgrep -i alstadt > /tmp/alstadt.mbox

  91   This finds every file whose name ends in C<mbox>
  92   under the current directory, searches each for messages containing
  93   the strings "alstadt," "ALSTADT," "Alstadt," etc.,
  94   and puts a copy of everything it finds into the mailbox C</tmp/alstadt.mbox>

  95     find . -name '*mbox' -print | xargs mgrep -H to brother

  96   This searches the same set of files for messages containing
  97   the string "brother" in the "To:" field.

  98   =head1 AUTHOR

  99     Jeffrey S. Haemer, <jsh@boulder.qms.com>

 100   =head1 SEE ALSO

 101   elm(1), mail(1), grep(1), perl(1), printmail(1), Mail::Internet(3)
 102   Crocker,  D.  H.,
 103   Standard for the Format of Arpa Internet Text Messages, RFC822.

 104   =cut


Exegesis

Lines 1 through 4 are boilerplate. They guarantee that the script is interpreted by a version of perl that has enough features to support it, provide an RCS Id string, to let us know what revision of our code we're dealing with, and turn on lots of warnings, both at compile time and run time, to help keep us from wasting time debugging really stupid mistakes. We are trying to minimize development time, not running time.

Lines 5, 6, and 7 pull in the three modules that we'll use functions from. Lines 8 and 9 declare variables. (Line 4 told the compiler to complain about undeclared variables, which helps catch typos.) We could declare them as we use them, but have found that if we collect most of our declarations in one place it's easier to notice when we're using several different variables to do almost the same thing. We also like to declare our scalar, hash, and array variables in separate statements, but that's idiosyncrasy, not Perl.

Lines 10 through 16 process the command-line arguments and give them default values. After the call to Getopt::Std::getopts(), all option values are contained in the hash %opt_args, and the only things left in @ARGV are the pattern to search for and the file names. No muss, no fuss, nothing to tidy up. The loop beginning on line 14 gives any unselected options the value zero. (For the really nit-picky, we are aware that it also sets $opt_arg{':'} to 0: meaningless, but harmless.)

Lines 17 through 23 actually interpret some of the options. As long as we are looking for mail messages that contain strings, why not let users specify what part of the mail message to look in? As it turns out, the Mail::Internet module lets us get each of these, separately, so we use the options -b, -h, -H, and -W to tell mgrep to look in the body, header, specific header field, or whole message respectively. (The same flags given to grep aren't all that interesting, so we'll use the letters for something more meaningful.) Lines 17 and 18 enforce a prohibition against mixing these options, and make -W the default.

Line 20 grabs the pattern. This pattern can be any Perl regular expression, which means that our tool will actually be a little easier to use than grep, or egrep, which only understand POSIX regular expressions. Lines 21 through 23 implement the -i (case-insensitive matching) option, with Perl 5's new syntax for regular-expression extensions. Unlike the syntax /pattern/i which specifies case-insensitivity at compile time, putting (?i) at the beginning of a pattern makes case-insensitivity part of the pattern itself, so you can specify case sensitivity at run-time.

Lines 24, 27, and 29 are hacks, impelled by the current implementation of Mail::Util::read_mbox(). Lines 27 and 29, which bracket the call are stuck in to temporarily turn off the -w flag and block a complaint about the internals of read_mbox().

Line 24 lets mgrep read from standard input if no files are named in the argument list. Here again, the normal Perl idiom, while(<>), is unavailable because of a detail of the implementation of read_mbox(). This brings up an important point: we could avoid having to put in these three hacks, by writing our own replacement for Mail::Util. But how much work would that be?

Lines 26 and 28 grab the mailbox named on the command line, and transform the mailbox into an internal form -- an array of individual mail messages -- for processing.

The remainder of the program loops through that array, looking inside each message for the pattern, and printing the requested messages.

Line 32 turns an individual mail message into an object with methods listed in the Mail::Internet module, and lines 33 and 34 use these methods to extract the header and body.

Lines 35 through 45 use other methods from the same module to create an array of text lines to search for the pattern. What gets stuffed into the array depends on the value of the -b, -h, -H, and -W flags, but the end result is a reference to the target array, $tgt. (You might think that lines 43 through 45 are superfluous. You might think you could even prove, mathematically, that earlier code has guaranteed that one of these flags has to be set. Sure it has. We put stuff like this in because experience tells us it has always saved us a lot of debugging time.)

Whew.

Now, if you look back, you'll see that most of what we've done so far is really just argument handling. We've tried to do things sturdily, so that when the code gets this far, it's likely to look for what we think we were asking for. Moreover, we've tried to do things professionally enough that when the code doesn't get this far, it fails cleanly. Even so, it's only taken us 45 lines of code and comments.

But what about the real work? Oh, you mean the remainder of the program: lines 46 and 47.

Line 47 uses Perl's built-in grep function, which searches an array of lines for a Perl regular expression, to impose selection criteria on each message, (The xor implements the -v flag.)

Line 46 uses one of the mail-message methods to print any selected messages in RFC822 format.

Done.

Lines 50 through 104, over half the total lines in the file, are documentation. Even though we've designed this so that you shouldn't need to look at a man page often, that doesn't keep us from writing one. As is usual for Perl utilities, the documentation is in pod (plain old documentation) form. Not only does pod documentation live in the same file as the code it describes, -- it's normally ignored by perl-- but tools that come with the standard distribution let you transform such code-documentation chimeras into a wide variety of attractive documents, including flat text, UNIX man pages, and web pages. The CPAN even has a module that will let the code part use the documentation part to generate run-time usage and help messages.

And that, gentle reader, is that. We'll be back next month with more amazing programmer tricks. Until then, happy trails.