Sample solutions and discussion
Perl Quiz of The Week #20 (20040721)

        I run mailing lists. People subscribe, people unsubscribe, and
        people get unsubscribed automatically when their addresses
        generate too many bounces.

        I run these mailing lists using SmartList.

        I'd like to find out how my lists are being used - do people
        unsubscribe in a bunch when a flame war happens, or do they
        just drift in and out over time? What does the
        total-membership graph look like?

        You are to write a function, parse_smartlist_log. It takes
        three parameters:

        (1) the name of a SmartList log file.
        (2) the total current membership of the list.
        (3) the base name of the output file.

        It should parse a SmartList log file and generate a graph of
        total list membership against time.

        Note that not all subscriptions and unsubscriptions will be in
        the log; it's possible that the listmaster has added or
        removed addresses without using the administrative interface,
        especially when the list was first set up. This is the reason
        for the second parameter. Take whatever action seems
        appropriate.

        (The graph can be a bitmap, ASCII, or whatever else - just
        give it a sensible filename based on the third parameter.)

        A log file includes lines such as:

                subscribe: foo@bar.com by: foo@bar.com  Thu Mar 21 15:30:35 GMT 2002

                unsubscribe:   9 foo@bar.com                   32760 foo@bar.com by: foo@bar.com Sat Mar 23 16:27:35 GMT 2002

                procbounce: Removed: foo@bar.com            32718

        SmartList has fuzzy matching on unsubscription requests - if
        the addresses in the line differ, use the first one.

        There are many other lines that may appear in the log file.

        Sometimes, as seen above for procbounce, there may be no date
        on the log line.

        Some sample log files may be obtained from
                http://firedrake.org/roger/sample_logs.zip

        or from

                http://perl.plover.com/qotw/misc/r020/sample_logs.zip
                http://perl.plover.com/qotw/misc/r020/sample_logs.tgz


----------------------------------------------------------------

Only two solutions were submitted on the discuss list.

The only external solution which solved the problem came from Jesper 
Dalberg. It uses Text::Graph, a CPAN module of which I was not 
previously aware (thanks!), and Date::Manip. Date::Manip is a relatively 
inefficient way to parse dates; as it happens, all the dates I have 
observed in SmartList log files are in a format which Date::Parse can 
handle.

This solution takes the sensible approach of latching a date value when 
it is spotted and using it for subsequent undated lines. However, it 
ignores dates on lines which do not themselves record a mailing list
transaction.
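
As an illustration of date latching (not Jesper's code, and not the
solution below), here is a minimal sketch that takes dates from every
line, transaction or not. The helper names find_date and latch_dates
are made up for this example, and for simplicity it ignores the zone
name and treats every timestamp as UTC, which is only right for logs
written in GMT.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH = (Jan=>0, Feb=>1, Mar=>2, Apr=>3, May=>4,  Jun=>5,
             Jul=>6, Aug=>7, Sep=>8, Oct=>9, Nov=>10, Dec=>11);

# Hypothetical helper: return epoch seconds for a ctime-style date found
# anywhere in the line, or undef if there is none.  Assumption: the zone
# name (if any) is skipped and the time is taken to be UTC.
sub find_date {
  my ($line) = @_;
  return unless $line =~
    /\b[A-Z][a-z]{2}\s+([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d\d):(\d\d):(\d\d)(?:\s+[A-Z]{3,4})?\s+(\d{4})\b/;
  my ($mon, $day, $h, $m, $s, $year) = ($1, $2, $3, $4, $5, $6);
  return unless exists $MONTH{$mon};
  # timegm croaks on impossible dates; treat those as "no date".
  my $t = eval { timegm($s, $m, $h, $day, $MONTH{$mon}, $year) };
  return $t;
}

# Hypothetical helper: pair each transaction line with the most recent
# date seen on ANY line so far.  Returns a list of [epoch, +1 or -1].
sub latch_dates {
  my @lines = @_;
  my $date;
  my @events;
  for my $line (@lines) {
    my $found = find_date($line);
    $date = $found if defined $found;    # latch, even on other lines
    next unless $line =~
      /^(?:subscribe:|unsubscribe:\s+\d+|procbounce: Removed:)\s+\S+\@\S+/;
    next unless defined $date;           # nothing to latch onto yet
    push @events, [$date, $line =~ /^subscribe:/ ? 1 : -1];
  }
  return @events;
}
```

A transaction that appears before any date has been seen is dropped
here; whether to drop it or to backfill it from the next date is a
judgment call.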

The totalling logic seems broken: a date with no recorded value
defaults to the final membership figure instead of carrying forward
the total from the previous date. (Was there perhaps a missing
reassignment to $cnt?)
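
For comparison, a carry-forward total might look like the sketch
below. This is my own illustration, not a repair of Jesper's code; the
name daily_totals and the day-number keys are assumptions made for the
example.

```perl
use strict;
use warnings;

# Hypothetical helper: %$delta maps a day number to that day's net change
# in membership.  Returns [day, total] pairs for every day from $first to
# $last, where a day with no activity inherits the previous day's total.
sub daily_totals {
  my ($delta, $first, $last) = @_;
  my $cnt = 0;
  my @totals;
  for my $day ($first .. $last) {
    $cnt += $delta->{$day} || 0;   # quiet days leave $cnt unchanged
    push @totals, [$day, $cnt];
  }
  return @totals;
}
```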

MJD submitted a solution which, while appealing (I am a great fan of
PostScript and would love to see a Perl-PostScript Quiz of the Week),
does not actually solve the problem.  He is correct in that the
SmartList log format is not particularly well-designed, and indeed
that was part of the reason why I chose it for this quiz; it is the
output of a variety of separate programs, including procmail, rather
than coming from an integrated system. In any case, working from the
provided PostScript output it appears that axes are unlabelled and
unscaled.

My own solution is designed for clarity. It parses every line in
search of a date (fed to Date::Parse), and looks for specific patterns
for subscription/unsubscription information. (It also looks for
something vaguely resembling an email address in the line; as David
Jones pointed out, not every line matching /^unsubscribe:/ will be an
unsubscription.)  After parsing, the data are rebased to give the
correct final value. The code then uses George A. Fitch's
GD::Graph::xylines module to provide a graph with labelled, scaled
axes.
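
The rebasing step is easiest to see with small numbers. The helper
name rebase is made up for this worked example; it applies the same
shift-and-clamp idea to a bare list of running totals.

```perl
use strict;
use warnings;

# Hypothetical helper: shift a list of running totals so that the last
# one equals the known current membership, clamping at zero.
sub rebase {
  my ($final, @totals) = @_;
  return () unless @totals;
  my $offset = $final - $totals[-1];
  return map { my $v = $_ + $offset; $v < 0 ? 0 : $v } @totals;
}

# Three subscriptions and one removal were logged, but the list really
# has 10 members now, so 8 must have been added outside the log.
my @rebased = rebase(10, 1, 2, 3, 2);
print "@rebased\n";   # prints "9 10 11 10"
```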

Possible refinements would be:

* choose a strftime format based on the sample's date span (e.g.
"%H:%M" if the whole logfile only covers a day, "%b %Y" if it spans
several years).

* if a subscriber is unsubscribed twice without an intervening
resubscription, discount the earlier unsubscription (as he was clearly
re-added without showing up in the log).
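
Neither refinement is in the solution below. As a rough sketch of how
they might look: the names choose_date_format and
discount_repeat_unsubs, the span thresholds, and the [epoch, address,
+1/-1] event layout are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: pick a strftime format from the time span covered
# by the log.  The one-day and two-year thresholds are arbitrary.
sub choose_date_format {
  my ($first, $last) = @_;
  my $span = $last - $first;
  return '%H:%M' if $span <= 24 * 60 * 60;
  return '%b %Y' if $span >= 2 * 365 * 24 * 60 * 60;
  return '%d %b %Y';
}

# Hypothetical helper: each event is [epoch, address, +1 or -1], in log
# order.  If an address is unsubscribed twice with no subscription in
# between, the earlier unsubscription is dropped.
sub discount_repeat_unsubs {
  my @events = @_;
  my %pending;                  # address => index of last unsubscription
  my @keep = (1) x @events;
  for my $i (0 .. $#events) {
    my (undef, $addr, $n) = @{ $events[$i] };
    $addr = lc $addr;
    if ($n > 0) {
      delete $pending{$addr};   # resubscribed: the earlier removal stands
    } else {
      $keep[ $pending{$addr} ] = 0 if exists $pending{$addr};
      $pending{$addr} = $i;
    }
  }
  return @events[ grep { $keep[$_] } 0 .. $#events ];
}
```

The format chooser would be called with the first and last latched
dates and its result passed to strftime in place of a fixed format.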

#! /usr/bin/perl -w

use strict;

sub parse_smartlist_log {
  use Date::Parse;
  use GD::Graph::xylines;
  use POSIX qw(strftime);
  my ($logfile,$final,$outputfile)=@_;
  my $total=0;
  my (@x,@y);
  my $date=0;
  open IN,'<',$logfile or die "Cannot read $logfile: $!";
  while (<IN>) {
    chomp;
    my $n=0;
    # Allow an optional zone name (e.g. GMT) between the time and the year.
    if (/([A-Z][a-z][a-z]\s+[A-Z][a-z][a-z]\s+\d+\s+\d+:\d+:\d+(?:\s+[A-Z]{3,4})?\s+\d+)/) {
      $date=str2time($1) || $date;
    }
    if (/^subscribe: (\S+\@\S+)/) {
      $n=1;
    } elsif (/^(unsubscribe:\s+\d+|procbounce: Removed:)\s+(\S+\@\S+)/) {
      $n=-1;
    }
    if ($n && $date) {
      $total+=$n;
      push @x,$date;
      push @y,$total;
    }
  }
  close IN;
  my $offset=$final-$total;
  if ($offset) {
    foreach my $n (0..$#y) {
      $y[$n]+=$offset;
      if ($y[$n]<0) {
        $y[$n]=0;
      }
    }
  }
  my $graph=GD::Graph::xylines->new;
  $graph->set(
    x_label => 'date',
    y_label => 'subscribers',
    title => $logfile,
    x_number_format => sub{strftime('%d %b %Y',localtime(shift))},
    y_min_value => 0,
    transparent => 0
  );
  my $img=$graph->plot([\@x,\@y]) or die $graph->error;
  open OUT,'>',"$outputfile.png" or die "Cannot write $outputfile.png: $!";
  binmode OUT;
  print OUT $img->png;
  close OUT;
}

__END__

[ Thanks to Roger Burton West for running the QOTW this week.  The
  solution was delayed because I was away at OSCON.  I will send the
  new quiz tomorrow. -MJD ]

