Sample solutions and discussion
Perl Quiz of The Week #1 (20021016)

----------------------------------------------------------------

Write a function, 'center', whose argument is a list of strings, which
will be lines of text.  'center' should insert spaces at the beginning
of the lines of text so that if they were printed, the text would be
centered, and return the modified lines.

For example,

        center("This", "is", "a test of the", "center function");

should return the list:

        "     This", "      is", " a test of the", "center function"

because if these lines were printed, they would look like:

             This
              is
         a test of the
        center function

I had in mind a solution something like this one:

        sub center {
          my $maxwidth = 0;
          my @s = @_;
          for (@s) {
            $maxwidth = length($_) if length($_) > $maxwidth;
          }
          for (@s) {
            $_ = " " x (($maxwidth - length($_))/2) . $_;
          }
          @s;
        }

Most of the solutions posted on the perl-qotw-discuss list did look
very much like this.  Peter Haworth used 'substr' to insert the spaces
at the beginning of the lines.  Using this technique in the code above
would replace the "$_ = ..." line with:

        substr($_, 0, 0) = " " x (($maxwidth - length($_))/2);

Other notes:

1. A lot of people seem not to understand what it means to "write a
function which will return [some value]".  Several people posted
functions which would *print* centered text; this was not what the
function was supposed to do.  This always surprises me when I'm doing
training classes, and it continued to surprise me when it came up this
week.

A function which prints out centered text is only about one
one-hundredth as useful as a function which centers strings and
returns them.  If the function returns the values without printing
them, the centered values can be modified and printed later; they can
be incorporated into some larger data structure; they can be printed
to STDERR as part of a warning message or to a file; they can be sent
into some other command via a pipe; they can be sent over the network;
they can be analyzed by some other part of the program and then
discarded.  A function which prints out the lines cannot be used in
any of these ways.

If I were teaching a class in how to be a professional programmer, I
would mention this in the first week and then harp on it every day for
the next two years.  Most of what a programmer does is to design
functions to be used by other programmers, and one of the biggest
obstacles to making good software is that many functions are designed
with broken interfaces so that they can't be used in more than one
way.  Functions should almost never print *anything* out.
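To make the point concrete, here is a small illustration of my own
(not from any submitted solution) of two of the reuses that a
value-returning center() permits and a printing version forbids:

        # Print the centered text to a file instead of STDOUT.
        open my $fh, ">", "banner.txt" or die "banner.txt: $!";
        print $fh map "$_\n", center("This", "is", "a test");

        # Or embed the centered lines in a warning message.
        warn join("\n", center("Something", "went wrong")), "\n";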
2. Several people asked how wide a line the text should be centered
in.  The problem statement makes no remark about centering the text
within a fixed-width line.  In particular, the example should make
clear that the text is not to be centered within an 80-character line,
since if it were then the sample output would look like this:

        "                                      This",
        "                                       is",
        "                                 a test of the",
        "                                center function",

But it doesn't; as I said in the original message, it looks like this:

        "     This", "      is", " a test of the", "center function"

Assuming that the output device is always 80 characters wide was a bad
practice even in the 1970s when there was a physical reality that
underlay it.  Now, when I can have a 93-character-wide terminal at the
touch of a button, it makes no sense at all.

If you did decide to center the text within a fixed-width line, you
should have provided the function with a parameter to determine the
line width, perhaps something like this:

        sub center {
          my $linewidth = 80;
          if (ref $_[0]) {
            my $args = shift;
            $linewidth = $args->{LINEWIDTH};
          }
          ...
        }

Then the user has the option of calling 'center(...)' to center the
text within your default-width line, or 'center({LINEWIDTH => 17},
...)' to choose a different width.

3. Here's one minor point I found puzzling.  Several people wrote
expressions like this one:

        ( $maxlen / 2 ) - ( length( $_ ) / 2 )

When I see something like this, I usually want to rewrite it as

        ( $maxlen - length( $_ ) ) / 2

unless for some reason the first one is a lot more perspicuous.  Here
the second one seems simpler to me:  How many spaces?  Take the length
of the line, and subtract that from the width of the line; that says
how much extra space is on the line; the amount of space to the left
of the text is exactly half the extra space.

For some reason this reminds me of a time when my aunt came to me to
ask me for help with a math problem.  She wanted a method for finding
the number halfway in between two other numbers.  Her method was to
find the difference between the two numbers, divide it by 2, and then
add the result back to the smaller number.  She wanted to know if
there was an easier way.

There were two things about this that surprised me.  The first was
that she did not recognize that the average of two numbers is exactly
the same as the number that is halfway in between them.  The other
surprising thing is that in spite of years of mathematical training in
high school and college, she was not able to frame her original method
algebraically, discover that she was computing the expression
x + (y-x)/2, and then reduce this to 2x/2 + (y-x)/2, to (2x+y-x)/2,
and finally to (x+y)/2.

I don't know what the point of this story is, except perhaps that
something similar is going on with the (maxlen/2)-(length/2)
expressions.  A couple of people wrote something like:

        ' ' x ($max - length($_) / 2)

apparently not realizing that this inserts far too many spaces.

4. Steve Smoot pointed out something I hadn't considered:  The input
strings themselves might contain newlines.  The problem statement
seems to preclude this, since it says that the argument strings "will
be lines of text".  But if you want to do something reasonable, it's
quite easy:

        sub center {
          my @lines = map { split /\n/ } @_;
          # now adjust @lines...
        }

5. If a line couldn't be exactly centered in the space available, most
people just shifted it a half-space to the left.  Some people used
'int' to throw away the remainder after dividing by 2; some just took
advantage of the implicit behavior of the 'x' operator to do the same
thing.  One person used 'use integer', which I think might be risky.
'integer' doesn't just mean 'integer'; it really means 'use the
underlying C semantics for your operators', and so it may also change
the behavior of operators like '&' and '%'.
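A small example of the sort of surprise I mean (mine, not from any
submission): Perl's '%' normally yields a result with the sign of the
right-hand operand, but under 'use integer' it takes on the C
behavior, which follows the left-hand operand instead:

        print -11 % 2;        # prints 1
        {
            use integer;      # lexically scoped to this block
            print -11 % 2;    # prints -1, the underlying C result
        }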
6. Some people were tempted to use the centering in Perl's built-in
'format' feature.  This seemed to be more trouble than it was worth.
One such solution went like this:

        sub center {
          eval ("format STDOUT = \n"
                . ( "@" . ( "|" x (length (join "", @_ ))))
                . "\n" . '$_' . "\n.\n\n" );
          for (@_) { write }
        }

The idea here is to build up a format definition of a format that
looks like this:

        format STDOUT =
        @|||||||||||||||||||||||||
        $_
        .

The '@|||||' tells the 'format' system that you want text centered.

There are a number of defects here.  One, easy to correct, is that it
centers the text within a column whose width is equal to the *sum* of
the lengths of the input.  This means that if you ask it to center
fifty strings of ten characters each, the strings each get 245 spaces
inserted at the beginning.  Another defect is that the centered data
is printed, instead of being returned as a list.  A more problematic
defect is that if there was a 'STDOUT' format before, the subroutine
has destroyed it.

A better approach when dealing with dynamic formats is to use the
'formline' function, which provides access to the same internal
formatting functions that Perl uses.  Ron Isaacson did this, basing a
solution on an 'swrite' function.  'swrite' formats data in the same
way that Perl's 'format' feature would have, but returns the resulting
string instead of printing it.

        use Carp;
        sub swrite {
          croak "usage: swrite PICTURE ARGS" unless @_;
          my $format = shift;
          $^A = "";
          formline($format, @_);
          return $^A;
        }

        sub center {
          my @in = @_;
          my $len = (sort {$b <=> $a} map (length, @in))[0];
          my $format = '@' . '|' x $len;
          map { swrite($format, $_) } @in;
        }

This works, but it does seem to use a lot of code and obscure features
in proportion to the amount of work it does.

7. Randy J. Ray wondered if there wasn't any way to get the result
with a single pass over the argument list instead of two passes.  If
there was, nobody found it.

8. Aaron Crane pointed out that instead of scanning the arguments to
find the longest one, you could use the List::Util::max function.

9. As Tom Phoenix pointed out,

        Thereisneitherbonusnorhonoraccruedforomittingmostofthewhitespace.

Perl Golf is three doors down on the left; the obfuscated contest is
at the end of the hall down the stairs, and watch that first step;
it's a doozy.  I've decided it's too much trouble to decipher
obfuscated solutions, so there may have been points of interest in
some of them, but I don't really care.

Thanks to everyone who has subscribed to this list, and to everyone
who participated in the discussion.  I'll send another quiz on
Wednesday.


Sample solutions and discussion
Perl Quiz of The Week #2 (20021023)

Write a function, days_diff, to compute the time difference, in days,
between two dates.  The dates will be strings in the format

        Wed Oct 16 2002

For example:

        days_diff("Wed Oct 16 2002", "Wed Oct 23 2002")

should return 7.

        days_diff("Wed Oct 16 2002", "Tue Oct 16 2001")

should return -365.

I thought this would be an easy problem.  But as I should have
remembered, almost nothing to do with date calculations is easy or
simple.  Some of the varied complications are discussed below.

I had originally imagined two types of solution.  One might use one of
the heavy-duty CPAN date calculation modules, such as Date::Calc or
Date::Manip; the other would use the standard Time::Local module.  The
Time::Local solution I produced looked like this:

        use Time::Local 'timegm';
        my @mon = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
        my %m2n = map {$mon[$_] => $_} 0 .. 11;

        sub days_diff {
          my @times;
          for (0 .. 1) {
            my ($dw, $mn, $dm, $yr) = split /\s+/, $_[$_];
            push @times, timegm(0, 0, 0, $dm, $m2n{$mn}, $yr-1900);
          }
          ($times[1] - $times[0])/86400;
        }

Here I used the 'timegm' function to turn each date into a Unix epoch
time (number of seconds since the start of 1970), subtract the two
epoch times to find the difference in seconds, and then divide the
result by 86400 to get the number of days difference.
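As a concrete check (my own arithmetic, not part of the original
message), the first example from the quiz works out like this:

        # timegm(0,0,0,16,9,102) -- midnight UTC, Oct 16 2002 --
        # is epoch time 1034726400; Oct 23 2002 is 1035331200.
        # (1035331200 - 1034726400) / 86400 = 604800 / 86400 = 7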
1. It might seem as though this could produce a fractional result.
The question is:  Is the interval between consecutive midnights in GM
(Greenwich Mean) time always exactly 86400 seconds?  The answer is no,
not exactly, because GMT days are occasionally 86401 seconds long.
Astronomers throw in an extra 'leap second' at the end of June or
December to keep the actual solar noon synchronized with chronological
noon; the extra second is occasionally necessary because the Earth's
rotation is gradually slowing.  (They might in principle subtract a
second sometimes because of random rotational changes called
'nutations', but it's never happened.)  In particular, there were
extra seconds at the end of June 1997 and December 1998.

However, as far as I know, no Unix system actually uses GMT, and no
'gmtime()' function actually calculates Greenwich mean time!  Instead,
they all use UTC (Coordinated Universal Time) which is just like GM
time, except without the leap seconds; UTC days are always exactly
86400 seconds long.  My system even documents this:

        The ctime(), gmtime() and localtime() functions all take an
        argument of data type time_t which represents calendar time.
        When interpreted as an absolute time value, it represents the
        number of seconds elapsed since 00:00:00 on January 1, 1970,
        Coordinated Universal Time (UTC).

So the function above works even across intervals where the GM
calendar contains leap seconds, supposing that the 'timegm' function
is actually producing UTC times.  If not, the final difference should
simply be rounded off to the nearest integer.

2. A second issue is that in many localities, mostly in Europe and
North America, there's a practice called 'Daylight Saving Time' where
the clock is set forward in the spring and backward in the autumn.
For example, in Philadelphia, there was no local time 02:30 Sunday
April 7 2002:

        plover% perl -le 'print scalar localtime(1018160911)'
        Sun Apr  7 01:28:31 2002

But an hour later it was:

        plover% perl -le 'print scalar localtime(1018160911 + 3600)'
        Sun Apr  7 03:28:31 2002

April 7 was only 23 hours long, because 2:00-2:59 was missing;
similarly, yesterday (Sunday, October 27) was 25 hours long.

GM time does not have any such adjustments, so they don't affect the
sample function above.  However, several people used the 'timelocal'
function instead of 'timegm' in similar solutions, leading to
incorrect results.  For example, when called like this

        days_diff("Mon Oct 21 2002", "Mon Oct 28 2002");  # Daylight saving

their functions would return 7.04166666666606 instead of 7, because of
the extra hour.  Again, rounding off would have solved the problem,
but the solutions I saw that used timelocal() didn't round off either.

3. Here's a related issue.  One poster on the -discuss list presented
a solution that would check the day of the week for validity, a
reasonable addition:

        my $sttime = timelocal(0,0,1,$stdayno,$months{$stmonth},$styear);
        my @sttime = gmtime($sttime);
        if ($stday ne $days[ $sttime[6] ]) {   # (gmtime)[6] is the dayname
          return "start date was invalid\n";
        }

But there's an inconsistency in the implementation.  See it?  He uses
timelocal() to convert the argument to epoch time, but then gmtime()
to convert it back and get the day of the week.  That works OK where I
live, because 1AM local time is either 5AM or 6AM UTC the same day, so
the day of the week is the same.  But had this poster run his test
program in Tokyo, where 1AM local time is 6PM UTC the previous day, he
would never have had a success!
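The repair is just to use matching conversion functions in both
directions.  A minimal sketch, using timegm() with gmtime() so that
both directions use the same calendar (variable names are the original
poster's, and @days is assumed to hold Sun through Sat):

        use Time::Local 'timegm';
        # Convert with timegm() and back with gmtime(); no local-zone
        # skew can creep in between the two conversions.
        my $sttime = timegm(0, 0, 1, $stdayno, $months{$stmonth}, $styear);
        if ($stday ne $days[ (gmtime $sttime)[6] ]) {
          return "start date was invalid\n";
        }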
4. Here's a puzzling issue.  For various reasons, none of the
'timelocal()' solutions posted to the -discuss list actually works.
The one I excerpted above assumes that the dates are in the format

        Wednesday 16 October 2002

but the problem specification calls for

        Wed Oct 16 2002

Two others assume that the dates will be in the format

        Wed 9 16 2002

which seems rather unlikely.  I'm curious about why the authors of
these functions didn't get this right.  Was it carelessness, or a
deliberate modification of the spec?

5. The major defect of the sample solution above is that the timegm()
function has a limited range.  It returns its result as an integer
number of seconds since 1970; on machines with 32-bit integers, this
covers a range of about 136 years, from Fri Dec 13 20:45:52 1901
through Tue Jan 19 03:14:07 2038.  Outside this range, it throws an
exception.  This may be acceptable to the accounting department, but
the limitation should be noted.

6. Solutions using one of the heavyweight CPAN date calculation
modules don't have this range limitation.  These modules also do all
the difficult work.  Here's a solution I produced that uses the
'Date::Calc' module:

        use Date::Calc 'Delta_Days', 'Decode_Date_US';

        sub days_diff {
          my @d = @_;
          s/^\w+// for @d;   # Remove day of the week
          Delta_Days(map Decode_Date_US($_), @d);
        }

'Decode_Date_US' attempts to parse and translate a date in US format,
where the month precedes the day number.  Unfortunately, the days of
the week confuse this function, so I have to strip them out first.
The function returns a year number, month number (1-12) and day
number.  The 'Delta_Days' function takes two dates in year, month, day
format and computes the number of days difference between them.

Steve Smoot produced essentially the same solution:

        sub days_diff {
          return Delta_Days(Decode_Date_US(substr($_[0],4)),
                            Decode_Date_US(substr($_[1],4)));
        }

7. Shawn Carroll benchmarked a Date::Calc solution against a
Date::Manip solution and found that Date::Calc was about 80 times
faster.  This is probably because Date::Calc is written in C, while
Date::Manip is in pure Perl.  The Date::Manip manual contains an
extensive discussion of this point, and the tradeoffs between
Date::Calc and Date::Manip.

8. Some people coded the date calculations by hand.  This is tricky to
get right, but has the benefit that if you do get it right, it doesn't
have the range limitations of Time::Local.  Date calculations are
really complicated, and tend to end up looking like a big stew, so I
didn't bother to debug the ones that didn't work.  Here's one of the
less stewish examples, provided by G. Rommel:

        sub days_diff {
          my ($start, $end) = @_;
          my ($wd1, $mo1, $day1, $yr1) = split ' ', $start;
          my ($wd2, $mo2, $day2, $yr2) = split ' ', $end;

          # Convert the month.
          my %mnum = ('Jan'=>0, 'Feb'=>1, 'Mar'=>2, 'Apr'=>3,
                      'May'=>4, 'Jun'=>5, 'Jul'=>6, 'Aug'=>7,
                      'Sep'=>8, 'Oct'=>9, 'Nov'=>10, 'Dec'=>11);

          # Days before this month this year.
          my @db = qw(0 31 59 90 120 151 181 212 243 273 304 334);

          $mn1 = $mnum{$mo1};
          $startday = int(($yr1 - 1601) * 365.2425) + $db[$mn1] + $day1;
          $startday++ if $mn1 > 1 && ($yr1%4==0)
                      && (($yr1%400==0) || ($yr1%100!=0));
          $mn2 = $mnum{$mo2};
          $endday = int(($yr2 - 1601) * 365.2425) + $db[$mn2] + $day2;
          $endday++ if $mn2 > 1 && ($yr2%4==0)
                    && (($yr2%400==0) || ($yr2%100!=0));
          return $endday - $startday;
        }

Rommel transforms each input date into a count of the number of days
since the beginning of 1601.  (The 365.2425 is the average number of
days in the Gregorian calendar year.)  He then subtracts the counts to
get the difference.
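A quick sanity check on that constant (my own note): the Gregorian
calendar repeats on a 400-year cycle containing exactly 146097 days
(400*365 plus 97 leap days), and 146097/400 is exactly 365.2425, so
multiplying by the year count and truncating accumulates no drift
across whole cycles:

        print 365.2425 * 400;   # prints 146097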
Rommel notes that this approach fails to detect invalid dates ("Sep 37
2002" is interpreted as the same as "Oct 7 2002", for example) and
that the function won't work after the year 5881210 because the number
of days will no longer fit into an integer variable.

9. There was a long discussion on the -discuss list about Julian
vs. Gregorian dates.  The calendar presently in use in most of the
world is the Gregorian calendar, first introduced by Pope Gregory XIII
in 1582.  Prior to this, most European countries used the Julian
calendar, almost the same but with a different leap day schedule.
When you see a date like "Tuesday July 2 1776" there is a question
about what it means; the same label may be applied by the Julian and
Gregorian calendars to different days.

There are some complications on top of this:

* Not every country switched to the Gregorian calendar at the same
  time.  Most Catholic countries switched immediately; other countries
  held out.  Great Britain and its colonies (including what would
  eventually become the USA) switched calendars in 1752.  Russia
  switched in 1918; this is why the October Revolution is now
  celebrated in November.

* The switch was accompanied by a one-time modification of the
  calendar, to bring the dates back into line with the seasons.  In
  Spain, for example, October 1582 had only 21 days.  In Great
  Britain, September 1752 had only 19 days.  In Sweden there was a big
  mix-up too complicated to explain here.

http://www.geocities.com/CapeCanaveral/Lab/7671/gregory.htm contains
some interesting details about these issues.

So one might ask:  What is

        days_diff("Fri Sep 1 1752", "Sun Oct 1 1752")?

In Spain, it's 30, as one would expect.  In England or the USA, one
might like the answer to be 19---except that in England, Sep 1 1752
was a Tuesday, not a Friday.

Several people tried to take this into account.  All of the solutions
were locality-specific.  For example, one gentleman wrote a version
that was accurate in France, taking into account the fact that in
France, Dec 10 1582 was followed immediately by Dec 20 1582.

I feel that this is misguided.  It's interesting, but it's a lot of
work and the payoff seems small.  The gentleman I mentioned before who
included the correct adjustment for France had his function deliver an
error if you asked for Dec 15 1582, which didn't exist in France (or,
more precisely, there was no date with that name):

        if ($jd[$i] > 15821210 and $jd[$i] < 15821220) {
          print "Not a valid date:\nFrance switched to the Gregorian calendar in 1582\n";
          print "and 10 Dec 1582 was followed immediately by 20 Dec\n";
          exit;
        }

This person would have had a much easier time if he had lived in
Israel rather than in France.  The entire function could have been
replaced with:

        sub days_diff {
          my $date = shift;
          print "There is no date called '$date'\n";
          exit;
        }

(In the Hebrew calendar, "Mon Oct 28 2002" is called "Heshvan 22,
5763".)

If one is going to reject historically inaccurate date names, then why
not also throw an error for 10 Dec 1793?  There was no date with that
name either, in France, because the Gregorian calendar was abolished
for 13 years after the Revolution and was replaced by a new calendar,
in which 10 Dec 1793 was known instead as

        Decade II, Decadi de Frimaire, de l'Annee 2 de la Revolution

or more succinctly, "20 Frimaire II", perhaps.  And this says nothing
about the question of whether France will still be using the Gregorian
calendar 7,000 years from now.
It's my feeling that if you're really trying to convert historic
dates, the interface presented by days_diff() is hopelessly
inadequate.  The sample solutions pretend that the Gregorian calendar
was in use everywhere at all times, which isn't historically accurate,
but it's probably the best you can do without expending an enormous
amount of effort; see the GNU Emacs 'calendar' package, for example.

It didn't occur to me when I posed the question that people would get
worried about this.  But the issue could turn out to be important for
some applications.  For example, if you're trying to compute interest
payments for money borrowed before the calendar change, it would be
unfair to charge a full month's interest for September 1752 or October
1582 or whatever, when those months were ten or eleven days short.
But I think in such a case, you would really have to go back to the
people requesting the function and ask what they wanted it to do.

10. Last week I observed with some surprise that when some code failed
in some circumstances, people tended to come up with very complicated
examples rather than simple ones.  I observed this again this week.
To illustrate the potential difficulty in handling Julian
vs. Gregorian dates, folks brought up the September 1752 oddity in the
Great Britain calendar.  A simpler example would be that

        days_diff("Xxx Feb 28 1700", "Xxx Mar 1 1700")

should return 1 if the dates are interpreted as Gregorian dates, but 2
if they are interpreted as Julian dates, since the Julian calendar has
Feb 29 1700 and the Gregorian calendar omits it.

11. Astronomers use their own modification of the Julian calendar;
they label the dates with numbers, with day 0 being a certain day
about 6700 years ago, and increasing by 1 each day afterwards.  If you
could convert a (presumably Gregorian) date like "Wed Oct 16 2002" to
astronomical form, you could then subtract the day numbers of two
dates to get the number of days in between.

Unfortunately, nobody implemented this right.  One programmer who
chose this path did this:

        sub days_diff {
          my ($day,  $month,  $mday,  $year)  = split(/\s+/, $_[0]);
          my ($day2, $month2, $mday2, $year2) = split(/\s+/, $_[1]);
          my $monthToNum = {
            Jan => 1, Feb => 2, Mar => 3, Apr => 4,
            May => 5, Jun => 6, Jul => 7, Aug => 8,
            Sep => 9, Oct => 10, Nov => 11, Dec => 12,
          };
          my $jd_1 = _jday($year,  $monthToNum->{$month},  $mday);
          my $jd_2 = _jday($year2, $monthToNum->{$month2}, $mday2);
          return $jd_2 - $jd_1;
        }

        sub _jday {
          my ($y, $m, $d) = @_;
          my $jd = ( 1461 * ( $y + 4800 + ( $m - 14 ) / 12 ) ) / 4
                 + ( 367 * ( $m - 2 - 12 * ( ( $m - 14 ) / 12 ) ) ) / 12
                 - ( 3 * ( ( $y + 4900 + ( $m - 14 ) / 12 ) / 100 ) ) / 4
                 + $d - 32075;
          return $jd;
        }

The '_jday' function here is supposed to convert a year, month, and
day to an astronomical Julian day number.  This programmer said:

        What I did do was Google for 'julian day' and I found
        http://hermetic.magnet.ch/cal_stud/jdn.htm

That's a good approach in general, but unfortunately, he cribbed the
code without reading the accompanying discussion on that page:

        Days are integer values in the range 1-31, months are integers
        in the range 1-12, and years are positive or negative
        integers.  Division is to be understood as in integer
        arithmetic, with remainders discarded.

This programmer's '_jday' function doesn't discard remainders, so it
produces mostly wrong answers.
A correct version:

        sub _jday {
          my ($y, $m, $d) = @_;
          my $a = int(($m - 14) / 12);
          my $b = int(( 1461 * ( $y + 4800 + $a ) ) / 4);
          my $c = int(( 367 * ( $m - 2 - 12 * $a ) ) / 12);
          my $e = int(( $y + 4900 + $a ) / 100);
          my $f = int(( 3 * $e ) / 4);
          my $jd = $b + $c - $f + $d - 32075;
          return $jd;
        }

One other person posted an astronomical Julian day solution to the
-discuss list, and made the same mistake.  I think the moral here is
something about how you can't just paste code into your program and
expect it to work.

12. Here's a small test suite:

        use Test;
        BEGIN { plan tests => 7 }
        END {
          ok(days_diff("Wed Oct 16 2002", "Wed Oct 23 2002"), 7);
          ok(days_diff("Wed Oct 16 2002", "Tue Oct 16 2001"), -365);
          ok(days_diff("Mon Oct 21 2002", "Mon Oct 28 2002"), 7); # Daylight saving
          ok(days_diff("Thu Oct 31 2002", "Fri Nov 1 2002"), 1);
          ok(days_diff("Sun Jun 29 1997", "Tue Jul 1 1997"), 2);  # Leap second
          ok(days_diff("Wed Dec 30 1998", "Fri Jan 1 1999"), 2);  # Last leap second
          ok(days_diff('Wed Jul 4 1776', 'Tue Jul 4 1976'), 73048);
        }
        1;

To use this, put it in a file called 'DiffTest.pm'; then add the line

        use DiffTest;

to the top of the file that contains your days_diff() function.

Thanks again to all the subscribers, and to those who participated in
the discussion.  I will send another quiz on Wednesday.  Sample
solutions for this week's 'expert' quiz may be slightly delayed, since
I have some other things to attend to this afternoon.


Sample solutions and discussion
Perl Quiz of The Week #3 (20021030)

Write a program, 'spoilers-ok'.  It will read the quiz-of-the-week
email message from the standard input, extract the date that the
message was sent, and print a message that says

        It is okay to send spoilers for this quiz

or

        It is too soon to send spoilers for this quiz.
        You may send spoilers in another 4 hours 37 minutes.

It becomes okay to send spoilers after 60 hours have elapsed from the
time the quiz was sent.

You can be sure that the 'Date' field of the QOTW email message will
always be in the same format, which is dictated by internet mail
format standards.  For those unfamiliar with this format:

        Date: Wed, 23 Oct 2002 16:10:15 -0400

The "16:10:15" is the time of day.  "-0400" means that the time of day
is in a time zone that is 4 hours behind Greenwich.

----------------------------------------------------------------

Effective use of modules turned out to be the key to the best
solutions to this quiz.  There are three key items:

        extract the date field from the mail header
        parse the date
        format the output

When I went to write up a sample solution last week, I knew there must
be a module for parsing mail headers, but I just couldn't find it.  I
spent some time looking for it, and then lost patience.  It turned out
to be Mail::Header.

This short and utterly straightforward solution was provided by Craig
Sanders:

        #! /usr/bin/perl -w

        use Mail::Header;
        use Date::Parse;
        use strict;

        my $head = new Mail::Header [<>], Modify => 0;
        my $date = $head->get("Date");

        my $message_time = str2time($date);
        my $ok_time = $message_time + 3600 * 60;
        my $now = time();

        if ($now >= $ok_time) {
          print "It is okay to send spoilers for this quiz\n";
        } else {
          my $diff = $ok_time - $now;
          my $hours = int($diff / 3600);
          my $minutes = int(($diff - $hours * 3600) / 60);
          print "It is too soon to send spoilers for this quiz.\n";
          print "You may send spoilers in another $hours hours $minutes minutes.\n";
        }

Some people (including me) wrote more than twice as much code to
accomplish the same thing.
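As an aside of my own: the single call here that does all of the date
work is str2time(), which understands the RFC-2822 date format, time
zone offset included, and returns an ordinary epoch time:

        use Date::Parse 'str2time';
        my $t = str2time("Wed, 23 Oct 2002 16:10:15 -0400");
        # 16:10:15 -0400 is 20:10:15 UTC, so $t is the
        # epoch time 1035403815.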
1. People used a variety of date-parsing modules.  In addition to
Date::Parse, people also used Date::Manip, Time::Local, and
HTTP::Date.  But if you use Time::Local, you must extract and combine
the parts of the date yourself; then there is a possibility of making
a mistake.  One of the submitted solutions that used Time::Local made
an error in the time zone handling:

        $release_time += ((-$3 * 36) + 216000);  # timezone, 60 hr. delay

The (-$3 * 36) is the time zone adjustment here; $3 contains the time
zone part of the date field.  This adjustment works for most time
zones, but not all.  For example, had the quiz-of-the-week been sent
from India, where the time zone is +0530 (five hours, thirty minutes)
the calculated adjustment would have been 19080 seconds, instead of
19800.  This is probably an argument in favor of the modules.

2. Craig's solution has a minor defect: at times, it will generate
outputs like

        You may send spoilers in another 1 hours 1 minutes.

This is bad English.  One easy way to take care of it:

        my $Hours   = $hours   == 1 ? 'hour'   : 'hours';
        my $Minutes = $minutes == 1 ? 'minute' : 'minutes';
        print "You may send spoilers in another $hours $Hours $minutes $Minutes.\n";

Seth Blumberg used Lingua::EN::Inflect to handle this.

3. Another possible defect in Craig's solution is that if $diff is
7379 seconds, the output is "... 2 hours 2 minutes"; but really it's 2
hours, 2 minutes, and 59 seconds.  There was a brief discussion of how
to round off times; Kevin Pfeiffer observed:

        For dividing seconds into hours and minutes, I believe that a
        normal rounding operation is wrong.  If you have 1.7 hrs, you
        don't want to round up, but rather take the 1 and leave the
        remainder to convert to minutes.

To handle this in Craig's code, you could use:

        my $hours = int($diff / 3600);
        my $minutes = int(($diff - $hours * 3600) / 60 + .5);

(The + .5 is the only new thing here.)

Iain Truskett used Time::Duration to format the output, which takes
care of the plural and the rounding issues.  It doesn't produce the
specified output format; it might say "1 day 17 hours" instead of "41
hours 17 minutes".  Whether this is a bug or a feature is up to you.

4. The solution I wrote up beforehand seems to me to be clearly
inferior to Craig's; it's longer and more complicated because it does
everything manually:

        #!/usr/bin/perl

        use Time::Local 'timegm';

        my $date_field;
        while (<>) {
          chomp;
          last unless /\S/;
          if (s/^Date:\s+//) {
            $date_field = $_;
            while (<>) {            # read continuation lines?
              last unless s/^\s//;
              chomp;
              $date_field .= $_;
            }
            last;
          }
        }
        die "No Date: field found\n" unless defined $date_field;

        # Typical value:
        # Wed, 30 Oct 2002 21:34:54 -0000
        my ($dy, $mo, $yr, $hr, $mn, $sc, $tzd, $tzh, $tzm) =
          $date_field =~ /\w\w\w,\                            # Day of week
                          ([\d\s]\d)\ (\w\w\w)\ (\d\d\d\d)\   # Day, month, year
                          (\d\d):(\d\d):(\d\d)\               # Time
                          ([+-])(\d\d)(\d\d)/x;               # Time zone

        unless (defined $dy) {
          die "Couldn't parse Date: field\n";
        }

        my %mo = qw(jan 0 feb 1 mar 2 apr 3 may 4  jun 5
                    jul 6 aug 7 sep 8 oct 9 nov 10 dec 11);
        die "Unknown month name '$mo'\n" unless exists $mo{lc $mo};

        my $msgtime = timegm($sc, $mn, $hr, $dy, $mo{lc $mo}, $yr-1900);
        my $tz_adjust = ($tzm * 60 + $tzh * 3600);
        $tz_adjust *= -1 if $tzd eq '+';
        $msgtime += $tz_adjust;   # msgtime is now adjusted for time zone

        my $time_left = ($msgtime + 60 * 3600) - time();
        if ($time_left < 0) {
          print "It is okay to send spoilers for this quiz\n";
        } else {
          print "It is too soon to send spoilers for this quiz.\n";
          my $hr  = int($time_left / 3600);
          my $min = int(($time_left - 3600*$hr)/60 + 0.5);
          my $hours   = ($hr  == 1 ? 'hour'   : 'hours');
          my $minutes = ($min == 1 ? 'minute' : 'minutes');
          print "You may send spoilers in another $hr $hours $min $minutes.\n";
        }
And sure enough, it did have a bug:  In the date-parsing regex, I
originally wrote (\d\d) to match the day of the month, instead of
([\d\s]\d); as a result, any message sent in the first 9 days of any
month would fail to match, and the program would die.

Thanks again for your interest.  I will send another quiz tomorrow; it
will not contain any date arithmetic.


Sample solutions and discussion
Perl Quiz of The Week #4 (20021106)

Two words are said to be 'anagrams' if the letters of one word can be
rearranged to form the other word.  For example, in English, 'ascot'
and 'tacos' are anagrams; so are 'tacos' and 'coats'.  A set of words
that are all anagrams of one another is an 'anagram set'.  For
example,

        ascot tacos coast coats

is an anagram set.  Letter case doesn't matter, so, for example,
'liberating' and 'Gilbertian' are considered to be anagrams.

Write a program, make_anagrams, which reads a list of words, one per
line, and finds all the anagrams in the word list.  It should output
an anagram listing, as follows:

* 'Words' that contain digits or punctuation characters should be
  ignored.

* Anagram sets that contain only one word should be omitted from the
  output.

* If an anagram set contains two words, say 'halls' and 'shall', the
  output should contain two lines:

        halls shall
        shall halls

* If an anagram set contains more than two words, the entire set
  should be listed under the alphabetically first word; the others
  should cross-reference it.  For example:

        headskin nakedish sinkhead
        nakedish (See 'headskin')
        sinkhead (See 'headskin')

* Finally, the output lines should be in alphabetic order.

For example, if the input was

        5th
        ascot
        ate
        carrot
        coast
        coats
        cots
        Dorian
        eat
        halls
        headskin
        inroad
        nakedish
        ordain
        Ronald's
        shall
        sinkhead
        tacos
        tea

then the output should be:

        Dorian inroad ordain
        ascot coast coats tacos
        ate eat tea
        coast (See 'ascot')
        coats (See 'ascot')
        eat (See 'ate')
        halls shall
        headskin nakedish sinkhead
        inroad (See 'Dorian')
        nakedish (See 'headskin')
        ordain (See 'Dorian')
        shall halls
        sinkhead (See 'headskin')
        tacos (See 'ascot')
        tea (See 'ate')

If you need a sample input, you may obtain English word lists from
http://perl.plover.com/qotw/words/

If you prefer to do this quiz in a language other than English, please
substitute whatever conventions are appropriate for that language.

----------------------------------------------------------------

Here's some sample code:

        #!/usr/bin/perl

        while (<>) {
          chomp;
          next if /[^A-Za-z]/;
          my $key = join "", sort(split //, lc $_);
          $sets{$key} .= "$_:";
        }

        for my $wl (values %sets) {
          my @wl = split /:/, $wl;
          if (@wl == 2) {
            push @output, "$wl[0] $wl[1]";
            push @output, "$wl[1] $wl[0]";
          } elsif (@wl > 2) {
            my ($first, @rest) = sort insensitive @wl;
            push @output, "$first @rest";
            for (@rest) {
              push @output, "$_ (See '$first')";
            }
          }
        }

        for my $line (sort insensitive @output) {
          print $line, "\n";
        }

        sub insensitive { lc($a) cmp lc($b) }

The key to solving this problem was figuring out how to decide if two
words were anagrams.  Two words are anagrams if they have the same
letters, possibly in a different order.  The easy way to find out if
this is the case is to take the letters and put them in some canonical
order, for example by sorting them; if the sorted letter lists are the
same, then the words were anagrams.
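Distilled into code, the canonical-form test looks something like this
(a sketch of the idea, not from any particular submission):

        # Two words are anagrams exactly when their sorted-letter
        # forms are equal.
        sub anagram_key {
          my $word = lc shift;
          return join "", sort split //, $word;
        }
        # anagram_key('ascot') and anagram_key('tacos')
        # both return 'acost'.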
Nearly everyone who solved this problem did so by splitting the
incoming words into letters, sorting the letters, and joining them
back together to form a hash key.  For example, the hash key for the
word 'Ethiopian' would be 'aehiinopt'.  Then you just list each word
under its own hash key, and you have a hash where the hash values are
lists of anagrammatic words.

The first 'while' loop reads the input, computes the key for each
word, and installs it into the %sets hash under the appropriate key.
At the end of the loop, the %sets values are lists of anagrams,
separated by ':' characters.  It would have been preferable to use a
hash of arrays, but _Learning Perl_ doesn't cover that.  A typical
value:  $sets{'einrs'} is "resin:rinse:risen:siren:".

The second loop prepares the output.  It scans over all the anagram
lists ('$wl' stands for 'word list') and decides what the output will
look like for each list.  It appends lines of output to @output; later
these lines will be sorted and printed.  If there are exactly two
words in $wl, then two output lines are appended.  If there are more
than two, then they're sorted into order, the entire word list is
listed under the first word, and the rest get cross-references.

Finally, the output lines are printed in alphabetical order.  The
utility function 'insensitive', which compares two strings without
regard to case, is used in two places to produce alphabetically sorted
lists.

There was some discussion about what to do when the input contained
two words with the same spelling but different case, as with 'Polish'
(pertaining to Poland) and 'polish' (to make something shiny.)  The
version above includes such words, which means that there are some
silly outputs that inform you that 'polish' is an anagram of 'Polish'.
If you don't like this, add

        next if $seen{lc $_}++;

in the top loop; this discards any word that has been seen before with
any capitalization.  (This is another application of the 'canonical
form' idea; this time the canonical form of a string is the
all-lowercase version.)

The 'expert' quiz postmortem is delayed because the results are so
very interesting.  There were many solutions posted to the -discuss
list and much discussion, and I don't want to leave out anything
interesting.  Expect it tomorrow, along with a pair of new quizzes.

Thanks to everyone who participated, whether or not they sent mail
about it.


Sample solutions and discussion
Perl Quiz of The Week #5 (20021113)

You will write a function to lay out crossword puzzles.  If you are
unfamiliar with American and British style crossword puzzles, an
example is at:

        http://perl.plover.com/qotw/misc/r005/puzzle.jpg

Your function, 'layout_crossword', will get an array argument which
represents the desired layout of the crossword puzzle.  It will then
return a display version of the puzzle.

For example, the diagram at the URL above would be represented like
this:

        @sample_puzzle = qw(
              ....X.....X....
              ....X.....X....
              ....X.....X....
              .......X.......
              XXX......X.....
              .....X......XXX
              ......X...X....
              ...X.......X...
              ....X...X......
              XXX......X.....
              .....X......XXX
              .......X.......
              ....X.....X....
              ....X.....X....
              ....X.....X....
        );

If given this array as argument, layout_crossword(@sample_puzzle)
would return an array containing the following 61 strings:

############################################################################
#1   #2   #3   #4   ######5   #6   #7   #8   #9   ######10  #11  #12  #13  #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#14  #    #    #    ######15  #    #    #    #    ######16  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#17  #    #    #    ######18  #    #    #    #    ######19  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#20  #    #    #    #21  #    #    ######22  #    #23  #    #    #    #    #
#    #    #    #    #    #    #    ######    #    #    #    #    #    #    #
#    #    #    #    #    #    #    ######    #    #    #    #    #    #    #
############################################################################
################24  #    #    #    #25  #    ######26  #    #    #    #    #
################    #    #    #    #    #    ######    #    #    #    #    #
################    #    #    #    #    #    ######    #    #    #    #    #
############################################################################
#27  #28  #29  #    #    ######30  #    #    #31  #    #    ###############
#    #    #    #    #    ######    #    #    #    #    #    ###############
#    #    #    #    #    ######    #    #    #    #    #    ###############
############################################################################
#32  #    #    #    #    #33  ######34  #    #    ######35  #36  #37  #38  #
#    #    #    #    #    #    ######    #    #    ######    #    #    #    #
#    #    #    #    #    #    ######    #    #    ######    #    #    #    #
############################################################################
#39  #    #    ######40  #    #41  #    #    #    #42  ######43  #    #    #
#    #    #    ######    #    #    #    #    #    #    ######    #    #    #
#    #    #    ######    #    #    #    #    #    #    ######    #    #    #
############################################################################
#44  #    #    #45  ######46  #    #    ######47  #    #48  #    #    #    #
#    #    #    #    ######    #    #    ######    #    #    #    #    #    #
#    #    #    #    ######    #    #    ######    #    #    #    #    #    #
############################################################################
################49  #50  #    #    #    #51  ######52  #    #    #    #    #
################    #    #    #    #    #    ######    #    #    #    #    #
################    #    #    #    #    #    ######    #    #    #    #    #
############################################################################
#53  #54  #55  #    #    ######56  #    #    #57  #    #    ###############
#    #    #    #    #    ######    #    #    #    #    #    ###############
#    #    #    #    #    ######    #    #    #    #    #    ###############
############################################################################
#58  #    #    #    #    #59  #    ######60  #    #    #    #61  #62  #63  #
#    #    #    #    #    #    #    ######    #    #    #    #    #    #    #
#    #    #    #    #    #    #    ######    #    #    #    #    #    #    #
############################################################################
#64  #    #    #    ######65  #    #66  #    #    ######67  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#68  #    #    #    ######69  #    #    #    #    ######70  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#71  #    #    #    ######72  #    #    #    #    ######73  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################

(There are 61 lines here, each 76 characters long; the function should
return a list of 61 strings, one for each line, with each string 76
characters long.)

layout_crossword() will make use of two auxiliary arrays that describe
what empty and full squares should look like.
In the example above, these arrays were:

        @empty_square = qw(######
                           #....#
                           #....#
                           #....#
                           ######
                          );

        @full_square = qw(######
                          ######
                          ######
                          ######
                          ######
                         );

layout_crossword() must scan over the input representation,
constructing an array of strings, which it will return.  It must
repeatedly insert the contents of @empty_square or @full_square into
the appropriate place in the output array, depending on whether the
corresponding character of the input was '.' (an empty square) or any
other character (a full square.)

layout_crossword() must also compute the appropriate numerals to
insert into the blank squares, and insert them into the right places
in the output.  A numeral should be placed into the upper-leftmost
empty part of its square.  If the numeral doesn't fit in the square,
the program should die.  Again, empty parts of a square are denoted by
'.' characters.

An empty square should receive a numeral if it is the first square in
a word; words run from left to right and from top to bottom.  See the
diagram for an example.

All '.' characters should be translated to real spaces on output, as
in the example.

The function should RETURN A LIST OF STRINGS, which might be printed
later.  It should not print anything.

----------------------------------------------------------------

There was some confusion about the purpose of the @empty_square and
@full_square arrays.  (The 'glyphs'.)  When I thought up the question,
I wanted the function to get three arguments: the crossword template,
and the empty and full glyphs.  But there was no good way to pass
these three arguments to a single function without using references,
and the rules of the game say that the regular quiz must be soluble
using only the techniques explained in _Learning Perl_.  That means no
references.  So I compromised and had the glyphs passed via two
external, global arrays.

Unfortunately, I didn't make my intentions clear enough when I posed
the problem, and a number of people thought that the glyph arrays were
private to the layout_crossword function.  This was a perfectly
reasonable conclusion, since the problem statement erroneously
declared the arrays as 'my' variables.  Oops!  My apologies to anyone
who was confused by this.  When I tested the programs that were sent
to the qotw-discuss list, I hacked them all to use global glyph
arrays.  I also did other minor hacking that was necessary to make the
programs work with my test harness, where such hacking suggested
itself.

Three programs consistently produced the best-looking output: Ron
Isaacson's, which was by far the best, Alex Lewin's, and mine.  But
mjd1.pl was 43% shorter than lewin.pl and 51% shorter than
isaacson.pl, so that's the one I'll discuss in detail.  (Also, both
isaacson.pl and lewin.pl use references, and lewin.pl is in
object-oriented style, and so outside the scope of _Learning Perl_.)

There are two small utility subroutines:

        sub add_numeral {
          my ($n, @sq) = @_;
          my $space = '\.' x length($n);
          for (@sq) {
            return @sq if s/$space/$n/;
          }
          die "Square was too small for numeral $n\n";
        }

Here $n is a numeral, and @sq is a glyph array.  add_numeral() inserts
the numeral $n into the glyph array and returns the result.  It first
assembles a regular expression that looks for L dots in a row, where L
is the length of $n.  Then it scans the glyph from top to bottom,
looking for the L dots, which represent a space large enough to hold
the numeral.  When it finds one, it replaces the dots with the numeral
and returns the result.  If there is no place to put the numeral, it
dies.
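For instance (my own illustration, not from the message), inserting
clue number 17 into a copy of the standard empty glyph:

        our @empty_square = qw(###### #....# #....# #....# ######);
        my @sq = add_numeral(17, @empty_square);
        # @sq is now:
        #   ######
        #   #17..#     <- the two dots matched '\.\.'
        #   #....#        and were replaced by '17'
        #   #....#
        #   ######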
One of the tests I ran used an 'empty' glyph that looked like this:

        ######
        ##  ##
        #    #
        #    #
        ######

A correct program will insert the numeral into the top space like
this:

        ######        ######
        ##7 ##        ##17##
        #    #        #    #
        #    #        #    #
        ######        ######

Several of the less-correct programs assumed that the empty glyphs
would be completely empty inside their borders, and produced outputs
like these:

        ######        ######
        #7  ##        #17  #
        #    #        #    #
        #    #        #    #
        ######        ######

The other utility subroutine just gets an X and Y coordinate and the
puzzle template, and returns true if the corresponding square is
empty.

        sub is_blank {
          my ($x, $y, @puzzle) = @_;
          return if $y < 0 || $x < 0
                 || $y >= @puzzle || $x > length($puzzle[0]);
          return substr($puzzle[$y], $x, 1) eq ".";
        }

The reason this is here is to establish the convention that squares
outside the puzzle are considered to be full, not empty.  This
simplifies the process of determining whether an empty square should
receive a numeral.  The complete rule for deciding whether a square
gets a numeral is that a blank square gets a numeral if the square
above is full, if the square to the left is full, if it is in the top
row, or if it is in the leftmost column.  By adopting the convention
that squares outside the diagram are considered full, we can simplify
the logic for numbering squares:  A square gets a numeral if the
square above or to the left is full.

The main function is fairly straightforward.  It loops over the rows
from top to bottom, and over the squares in each row from left to
right.  It sets '@square' to an appropriate glyph for the current
square, copying it from @full_square or @empty_square as appropriate,
and then, if the square is empty, it uses add_numeral() to add a
numeral to @square if the square above or to the left is full.

        use strict;
        our (@empty_square, @full_square);

        sub layout_crossword {
          my @puzzle = @_;
          my $N = 1;
          my @result;
          my ($h, $w) = (scalar(@puzzle), length($puzzle[0]));
          for my $y (0 .. $h-1) {
            my @row;
            for my $x (0 .. $w-1) {
              my @square;
              my $blank = is_blank($x, $y, @puzzle);
              if ($blank) {
                @square = @empty_square;
                unless (is_blank($x-1, $y, @puzzle)
                        && is_blank($x, $y-1, @puzzle)) {
                  @square = add_numeral($N++, @square);
                }
              } else {
                @square = @full_square;
              }

Now there's the interesting question of what to do with the
overlapping parts of adjacent squares.  This program uses an extremely
simple strategy:  It trims off the left-hand edge of the square, so
that:

        +----+            ----+     and     ######           #####
        |    |  becomes       |             #    #  becomes      #
        |    |                |             #    #               #
        +----+            ----+             ######           #####

Now the current square can borrow the right-hand edge of the square to
its left.  Squares in the leftmost column have nobody to borrow from,
so the trimming is not performed for those squares:

              # trim off overlap with square to left
              if ($x > 0) {
                s/^.// for @square;
              }

Then we similarly trim off the topmost edge of each square, except for
those in the topmost row of the diagram:

              # Now trim off overlap with square above
              if ($y > 0) {
                shift @square;
              }

Now that the square is complete, we append it to the right-hand end of
the current row of the output:

              # add square to output
              for (0 .. $#square) {
                $row[$_] .= $square[$_];
              }
            }

When we finish a row, we turn the dots into spaces, as required by the
spec, and insert the row into the return value array.  When we finish
the last row, we return the array:

            for (@row) { tr/./ /; }
            push @result, @row;
          }
          @result;
        }
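Putting the pieces together, a hypothetical driver (mine, not part of
the message) would look something like this:

        our @empty_square = qw(###### #....# #....# #....# ######);
        our @full_square  = qw(###### ###### ###### ###### ######);

        # A tiny 3x3 template: '.' is an empty square, anything
        # else is a full one.
        my @puzzle = qw(..X
                        ...
                        X..);

        print "$_\n" for layout_crossword(@puzzle);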
----------------------------------------------------------------

Notes:

1. The test data and the test results are at

        http://perl.plover.com/qotw/misc/r005/

The program above is

        http://perl.plover.com/qotw/misc/r005/mjd1.pl

The subdirectory 'templates' contains four sample crossword templates.
The subdirectory 'glyphs' contains ten pairs of glyphs.  The test
harness, TestXWord.pm, tries the function on each combination of
templates and glyphs, and deposits the output into the 'output'
directory.  To use it, say

        perl -MTestXWord yourprogram.pl < /dev/null

The 'check-results' program checks the outputs against the sample
results in the 'standard' directory.  Files in the 'standard'
directory have names of the form PUZZLE-glyphset-##.x where ## is a
number of points and 'x' is an arbitrary letter.  If a program's
output matches this file, it is awarded that many points.  If an
output doesn't match any of the 'standard' forms, it is copied to the
'mismatches' directory.  I went over 'mismatches' repeatedly and
copied all the 'mismatches' that actually looked good into the
'standards' directory.

2. The technique I used for abutting the squares is very simple, and
produces good-looking output most of the time.  For some examples, it
is not so good.  One of the test glyph sets, 'mixed', contains
mismatched glyphs:

        +----+        ######
        |....|        ######
        |....|        ######
        |....|        ######
        +----+        ######

With these glyphs, the asymmetry in my algorithm becomes obvious:

        ######----+----+----+#####
        ######1   |2   |3   |#####
        ######    |    |    |#####
        ######    |    |    |#####
        ######----+----+----+#####
        ######4   |    |    |5   |
        ######    |    |    |    |
        ######    |    |    |    |
        ######----+----+----+----+
        |6   |    |#####7   |    |
        |    |    |#####    |    |
        |    |    |#####    |    |
        +----+----+#####----+----+
        |8   |    |9   |    |#####
        |    |    |    |    |#####
        |    |    |    |    |#####
        +----+----+----+----+#####
        ######10  |    |    |#####
        ######    |    |    |#####
        ######    |    |    |#####
        ######----+----+----+#####

When there's a disagreement between two adjoining cells about what
their shared border should look like, the cell above or the cell to
the left always wins.

3. Peter Haworth's program does better with the mixed glyphs, and is
also very small.  I thought that getting this exactly right would be a
big pain.  I even wrote code to do it, and then decided to leave it
out of the question because it was too much code.  Peter cuts the
Gordian Knot here and uses a very simple method.  Peter's program
starts by filling the entire grid with full-square glyphs, and then
superimposes the empty-square glyphs on top of those.  Empty squares
win whenever there is a disagreement about the appearance of shared
territory.  For the template above, his program generates this output:

        #####+----+----+----+#####
        #####|1   |2   |3   |#####
        #####|    |    |    |#####
        #####|    |    |    |#####
        #####+----+----+----+----+
        #####|4   |    |    |5   |
        #####|    |    |    |    |
        #####|    |    |    |    |
        +----+----+----+----+----+
        |6   |    |####|7   |    |
        |    |    |####|    |    |
        |    |    |####|    |    |
        +----+----+----+----+----+
        |8   |    |9   |    |#####
        |    |    |    |    |#####
        |    |    |    |    |#####
        +----+----+----+----+#####
        #####|10  |    |    |#####
        #####|    |    |    |#####
        #####|    |    |    |#####
        #####+----+----+----+#####

The difference is subtle, but I think it is much handsomer.

4. The algorithm I used to determine whether a square should receive a
numeral fails in certain cases.  Consider this template:

        .....
        .#.#.
        .....
        .#.#.
        .....
My program generates this output:

        #####################
        #1  #2  #3  #4  #5  #
        #   #   #   #   #   #
        #####################
        #6  #####7  #####8  #
        #   #####   #####   #
        #####################
        #9  #10 #   #11 #   #
        #   #   #   #   #   #
        #####################
        #12 #####13 #####14 #
        #   #####   #####   #
        #####################
        #15 #16 #   #17 #   #
        #   #   #   #   #   #
        #####################

But it should be:

        #####################
        #1  #   #2  #   #3  #
        #   #   #   #   #   #
        #####################
        #   #####   #####   #
        #   #####   #####   #
        #####################
        #4  #   #   #   #   #
        #   #   #   #   #   #
        #####################
        #   #####   #####   #
        #   #####   #####   #
        #####################
        #5  #   #   #   #   #
        #   #   #   #   #   #
        #####################

The problem here is that I generate a numeral for any empty square
below (or to the right of) a full one, but it's only correct to
generate a numeral for an empty square below (or to the right of) a
full one that is not also above (or to the left of) a full one.  If
there are full squares on both sides, the numeral is inappropriate
because there is nowhere for the word to go.

In American-style crossword puzzles, this situation can never occur,
because there is an express prohibition on exactly this situation.
Every blank square must be at the intersection of two words, one
across and one down.  A square that is part of only a single word is
called an 'unkeyed square' and is strictly forbidden.  However, in
many British-style crossword puzzles, most famously the London Times
Sunday puzzle, there are unkeyed letters.  I specifically mentioned
"British style crossword puzzles" in the question, so this is a
defect.  To fix it, change

        unless (is_blank($x-1, $y, @puzzle)
                && is_blank($x, $y-1, @puzzle)) {
          @square = add_numeral($N++, @square);
        }

to

        if (  !is_blank($x-1, $y, @puzzle) && is_blank($x+1, $y, @puzzle)
           || !is_blank($x, $y-1, @puzzle) && is_blank($x, $y+1, @puzzle)) {
          @square = add_numeral($N++, @square);
        }

I forgot all about this until I looked closely at Peter Haworth's
contribution, which gets it right.  (Peter, of course, is a Brit.)

5. ensch2.pl is a peculiar case.  Faced with the problem of how to
overlap adjacent glyphs, Peter B. Ensch did something interesting.
His program ensch2.pl inserted backspace characters between the
glyphs:

        +---+^H+---+^H
        |   |^H|   |^H
        +---+^H+---+^H

(I have represented the backspaces by '^H'.)  I didn't realize this at
first.  When I went to look at the program's output, I used the 'less'
pager program, which interpreted the ^H's as requests to overstrike!
So when I used the pager, I got what looked like

        +---+---
        |   |
        +---+---

but with the middle vertical line in boldface.  When I just printed
the output to the terminal with 'cat', it looked normal.  But to the
automatic test suite, the answers looked completely wrong.  The test
suite hated it, but if the backspacing is allowed, ensch2.pl would be
one of the better performers.

6. One common problem was programs that assumed that the glyphs would
have a certain appearance, or would be a certain size.  Using the
glyphs

        ##   and   ..
        ##         ..

caused problems for many peoples' programs, which could not figure out
how to fit the numerals in, or which assumed the presence of a border.
The example program above completely botches these examples because it
insists on overlapping the adjacent glyphs, even though that means
stripping out 3/4 of each glyph.  For small glyphs, overlapping is a
mistake.  Ron Isaacson's program was the only one posted on the
qotw-discuss list that handled this case properly at all.  His program
overlaps squares only if the empty and full square borders match.
This leads to cluttered but reasonable behavior in the 'mixed' case
above:

        ######+----++----++----+######
        ######|1   ||2   ||3   |######
        ######|    ||    ||    |######
        ######|    ||    ||    |######
        ######+----++----++----+######
        ######+----++----++----++----+
        ######|4   ||    ||    ||5   |
        ######|    ||    ||    ||    |
        ######|    ||    ||    ||    |
        ######+----++----++----++----+
        +----++----+######+----++----+
        |6   ||    |######|7   ||    |
        |    ||    |######|    ||    |
        |    ||    |######|    ||    |
        +----++----+######+----++----+
        +----++----++----++----+######
        |8   ||    ||9   ||    |######
        |    ||    ||    ||    |######
        |    ||    ||    ||    |######
        +----++----++----++----+######
        ######+----++----++----+######
        ######|10  ||    ||    |######
        ######|    ||    ||    |######
        ######|    ||    ||    |######
        ######+----++----++----+######

and perfect behavior in the very-small-glyph case:

        ##1 2 3 ##
        ##      ##
        ##4     5
        ##
        6   ##7
            ##
        8   9   ##
                ##
        ##10    ##
        ##      ##

7. There were a few oddities that caused some programs to appear to
perform more poorly in the tests than was actually warranted.

The program sainio.pl used a global variable to store the current clue
number, and never reset it between calls to layout_crossword().  Since
the test harness called layout_crossword() forty times in a row, the
numbers grew to be four digits long and then wouldn't fit into the
boxes any more.  sainio2.pl is a corrected version.

schmidt.pl and schmidt2.pl copied the glyphs into two private arrays,
@UL_empty_square and @UL_full_square.  Unfortunately, they did so at
compile time, thus foreclosing the possibility that anyone could
change the glyphs later, and preventing the test harness from changing
the glyphs.  With this defect repaired, these two programs did well in
the testing.

As I mentioned above, a better design for this function would be for
it to have three arguments instead of using the two global glyph
arrays.  jones2.pm did do this, so it failed the tests.  I hacked it
so that it used the specified argument format instead.  But the output
was quite broken!  It didn't use the correct glyphs, and it only
numbered the 'across' clues!

        ++++++++++++++++
        +##+1 +  +  +##+
        +##+  +  +  +##+
        +##+  +  +  +##+
        ++++++++++++++++
        +##+2 +  +  +  +
        +##+  +  +  +  +
        +##+  +  +  +  +
        ++++++++++++++++
        +3 +  +##+4 +  +
        +  +  +##+  +  +
        +  +  +##+  +  +
        ++++++++++++++++
        +5 +  +  +  +##+
        +  +  +  +  +##+
        +  +  +  +  +##+
        ++++++++++++++++
        +##+6 +  +  +##+
        +##+  +  +  +##+
        +##+  +  +  +##+
        ++++++++++++++++

wolters.pl produced some reasonable-looking outputs, but did not
translate the dots to spaces, so the results looked like:

        +----+----+----+----+----+
        |####|1...|2...|3...|####|
        |####|....|....|....|####|
        |####|....|....|....|####|
        +----+----+----+----+----+
        |####|4...|....|....|5...|
        |####|....|....|....|....|
        |####|....|....|....|....|
        +----+----+----+----+----+
        |6...|....|####|7...|....|
        |....|....|####|....|....|
        |....|....|####|....|....|
        +----+----+----+----+----+
        |8...|....|9...|....|####|
        |....|....|....|....|####|
        |....|....|....|....|####|
        +----+----+----+----+----+
        |####|10..|....|....|####|
        |####|....|....|....|####|
        |####|....|....|....|####|
        +----+----+----+----+----+

I added the line

        tr/./ / for @puzzle;

just before the return from the function.  However, it didn't
correctly handle different-sized glyphs.

8. I got the idea for this problem from _The Art of Computer
Programming, Vol. 1: Fundamental Algorithms_, by Donald E. Knuth.  (In
the 3rd edition, it is problem 1.3.2.23, and is on page 163.)  I
remembered that there was some issue in the Knuth problem that made it
more difficult than the problem I was posing, but I didn't remember
what it was, and I didn't look it up until just now.
The Knuth version of the puzzle says that black squares at the border of the puzzle should be deleted from the output. His example: If the input was

        #....#
        ..#...
        ....#.
        .#....
        ...#..
        #....#

Then the output should be

+++++++++++++++++++++ +01 + +02 +03 + + + + + + +++++++++++++++++++++++++++++++ +04 + ++++++05 + +06 + + + ++++++ + + + +++++++++++++++++++++++++++++++ +07 + +08 + ++++++ + + + + + ++++++ + +++++++++++++++++++++++++++++++ + ++++++09 + +10 + + + ++++++ + + + + +++++++++++++++++++++++++++++++ +11 +12 + ++++++13 + + + + + ++++++ + + +++++++++++++++++++++++++++++++ +14 + + + + + + + + + +++++++++++++++++++++

Notice how the corner black squares have vanished. If there were other black squares next to these, they would vanish also. Knuth says: "The diagram... might have long paths of black squares that are connected to the outside in strange ways."

Although my version of the problem was missing this complication, it had an additional complication because the full and empty square glyphs were variable instead of fixed. My problem specification didn't provide much guidance about how to make the glyphs overlap, and in the case where the edges of the two glyphs didn't match, it wasn't immediately clear how to overlap them and still make the result look good.

9. It was pointed out on the -discuss list that the code I posted yields a warning, if warnings are enabled. Specifically,

        my @empty_square = qw(######
                              #....#
                              #....#
                              #....#
                              ######);

yields the warning "Possible attempt to put comments in qw() list". One poster to the list said:

        Seeing as much of the Perl community have been trying to get
        new Perl programmers to turn on warnings and strict, in an
        effort to highlight problems with their code, I have been
        surprised to see MJD's quiz this week.  In order to use the
        empty_square and full_square arrays, as included in the quiz
        text, you are actually inclining people to turn off strict
        and warnings.  Which IMHO is not good.

        http://perl.plover.com/~alias/list.cgi?1:mss:605

This remark, unfortunately, comes right at the intersection of several philosophical stances I hold, and that makes me very cranky.

First, the warning has nothing whatever to do with 'strict'. Throughout this message, the author says "warnings and strict", "strict and warnings", as if in one breath. The code in question is completely strict-safe. (It *shouldn't* be, but that is an unrelated matter.) Why mention 'strict' at all?

The Perl community has become increasingly dogmatic in the past few years about the use of 'strict'. It is common to see people ask questions in newsgroups, and to post four-line examples, and be criticized for failing to use 'strict'. "Why aren't you using 'strict'?" people ask. Well, because it is a four-line example posted in a Usenet article, obviously. 'strict' has no value in such cases, except perhaps to get people to shut up about it.

It is true that the Perl community has been trying to get new Perl programmers to turn on warnings and strict. I have no objection to this. What I do object to is that the community seems to be trying to get people to turn on warnings and strict without knowing why they are doing that, or what they are for. It is common to see people ask questions in newsgroups like this:

        I got the error "Global symbol "$z" requires explicit
        package name."  What does that mean?

This is like someone coming to say that there is a loud bell ringing in the hallway, and what should be done about it? Of course, it is the fire alarm.
They were told to always turn the fire alarm on, but nobody told them what it would mean if it began to ring.

I believe that one of the biggest problems with programming as a profession is that programmers are fearful and superstitious. Programming is only about sixty years old. When chemistry was sixty years old, practitioners were trying to turn lead into gold, to extract the essence of fire, and so forth. After a few hundred years they learned a little more and began to study phlogiston. So we are in the dark ages of programming, and we live in a dangerous world that we understand only poorly. Many people respond to this with superstition: "Always use objects." "Never use a global variable." "Perl is better than Python." "Always use strict."

We do not have to give in to this superstition. We don't have to say "To be safe, always use strict. And to be double safe, throw salt over your left shoulder." We are engineers, and programming is empirical. We should by all means encourage beginners to use the best possible engineering practices. But we should not encourage the blind use of certain programming features.

When a person says "warnings and strict" four times, when talking about a piece of code that emits a warning but is strict-safe, what is going on? Clearly this person is not thinking about the meaning of what he is saying. The code is also not ISO 9000 compliant; why not mention that also? I think this is a superstitious effect. Instead of looking at the reality of the situation, which is that the code raises certain specific warnings, and the warnings have a certain specific meaning, which suggests that the code should be examined for certain specific problems, the speaker has lumped all warnings and diagnostics together in one group and adopted the stance that all such warnings should be eliminated. This shows a lack of understanding.

I think almost anyone who says "always use strict" is suffering from this lack of understanding. "strict" is not one but three features, and none of these three features has anything at all to do with the other two. Saying "always use strict" is like saying "always use a hammer, a screwdriver, and a drill." For some projects, perhaps only the hammer and drill are appropriate, and the screwdriver is an irrelevant distraction. So it is too with "strict". People are being encouraged to load up with tools that they don't know how to use. The results of this are sometimes stunningly silly. I have many examples of programs that start by saying:

        use strict;
        my ($rounds, $round_temp, $squares, $page, $x, $y, $z, $cell,
            $player_move, @available_choices, $computer_move, @choices,
            $round, $winner, $player_move_pretty, $computer_move_pretty);
        my ($round_minus_one);

The programmer here wants to use global variables; she does not understand what lexical variables are for, or why they are preferred. But, at the advice of some well-meaning person, she has put 'use strict' at the top of the program, and now global variables are forbidden. So she declares every variable at the top of the program, effectively making them all global, and getting none of the encapsulation, reuse, or maintenance benefits that lexical variables are supposed to provide.

Another example:

        my @ret = eval "layout_tree_$format(\$tree)";

Why do this? There is a safer and more efficient method:

        my @ret = "layout_tree_$format"->($tree);

Perhaps the programmer didn't know about the safer and more efficient method.
Or perhaps he avoids it, as many people do, because it causes a 'strict refs' failure, while the 'eval' method, although inferior in every way, does not.

I don't think we need to do more to encourage people to usewarningsandstrict. I think we need to do more to encourage them to understand the warnings they get and to take appropriate action. When I teach programming classes, I am always astonished at how little attention the students pay to the error messages they receive. The compiler complains of a syntax error on line 197, and the programmer's response is not to look at line 197, but to eyeball a random portion of the program in the hope that the error is there. By encouraging people to "always use strict and warnings" and to think of diagnostic messages as bad, and something to avoid, we are doing the exact wrong thing. The right thing is to encourage people to pay attention to the messages, to try to understand them, and then to make considered judgements about what they mean. That is what I think beginners need to learn.

In this case, the warning is saying "Possible attempt to put comments in qw() list". What does that mean? It means that perl has seen a # sign in a qw(), and it is afraid that I might be trying to write something like this:

        my @array = qw(red
                       crimson   # But not scarlet
                       blue
                       azure
                       green);

Here the thing that looks like a comment is not a comment; instead, the @array gets nine elements, including 'But', 'not', 'scarlet', and '#'. It is good that perl warns us about this. In my example code, however, this is not the case:

        my @empty_square = qw(######
                              #....#
                              #....#
                              #....#
                              ######);

I am *not* trying to put a comment into a qw() list. Perl sees the '#' signs, and it is afraid that I *might* be doing that, so it warns me. But it is mistaken; the # signs are doing what I want here. In such a case, it is perfectly appropriate to ignore the warning. The compiler has had its say, and I have listened to it, but it is just a machine, and I know better than it does what I want.

If you are troubled by the warning message itself, the correct approach here is NOT to code around it by writing something like this:

        my @empty_square = ('######',
                            '#....#',
                            '#....#',
                            '#....#',
                            '######',
                           );

The correct response is to SHUT OFF THE WARNING:

        my @empty_square;
        {
          no warnings 'qw';
          @empty_square = qw(######
                             #....#
                             #....#
                             #....#
                             ######);
        }

(The "no warnings 'qw'" declaration shuts off only those warnings that pertain to the qw() operator, and only inside that one block. Elsewhere, all warnings will still be issued. Inside the block, all other warnings will still be issued.)

The thing that really irks me about the 'strict' dogmatism is how defective most of the dialog about it is. Last year I read a review of a book about using Perl to write CGI programs. The reviewer harshly criticized the author for not having used 'strict'. The reviewer did not say which of the three parts of 'strict' would have been valuable. His opinion was apparently that all programs should use 'strict', whether it would be valuable or not.
I objected, pointing out that none of the programs in the book used references, so that 'strict refs' would not be doing anything; that none of the example programs were more than twenty lines long, so there was no practical difference between global and lexical variables, and hence no reason to use 'strict vars' to forbid global variables; and that the only value of 'strict subs' is to prevent future maintenance problems when someone adds a subroutine whose name is the same as what was previously a bareword, a feature of small value at best and of less value in these tiny example programs.

But the reviewer did not address any of my specific technical points. Instead, he told an anecdote about a bad programmer, and said that we should teach everyone "good programming style" right from the start. That begs the question of what "good programming style" is. I realized then that the reason for our disagreement was that my idea of good programming style was motivated by what was useful and effective, whereas his was motivated by superstition. Considerations of usefulness did not come into play.

That is my opinion on "use warnings and strict". The short version is: No, I do not believe there is any inherent value in "keeping warnings and strict happy", and I am going to continue to try to do the most appropriate thing for the circumstances. I believe that that is the only way to set the best possible example for beginners.

Sorry to go on so long, but this has all been seething inside me for a long time.

New quizzes tomorrow. My grateful thanks to everyone who participated in the discussion, and also to those who quietly worked the problems on their own.

Sample solutions and discussion
Perl Quiz of The Week #6 (20021120)

Write a function, format_number_list, whose argument is a list of integers. It then returns a string which represents the input list in compact, human-readable form. For example,

        format_number_list(1, 2, 4, 5, 6, 7, 9, 13, 24, 25, 26, 27)

will return

        "1-2, 4-7, 9, 13, 24-27"

Also write a function, 'expand_number_list', which does the conversion in the opposite direction, so that

        expand_number_list("1-2, 4-7, 9, 13, 24-27")

will return

        (1, 2, 4, 5, 6, 7, 9, 13, 24, 25, 26, 27)

----------------------------------------------------------------

I'll show solutions for format_number_list first. I'm going to present two sample solutions this week. One was contributed by Andreas Koenig:

        use Set::IntSpan; # CPAN rules :-)
        use strict;

        sub format_number_list {
          my(@n) = @_;
          my $set = Set::IntSpan->new(join ",", @n);
          my $run = $set->run_list;
          $run =~ s/,/, /g;          # give them the spaces
          $run;
        }

To summarize this solution: Set::IntSpan is a CPAN module that already does almost exactly what was requested; Andreas simply wrapped it. Andreas said: "I surely was amazed that nobody found it till monday." I was amazed also.

Here's a synthetic but straightforward solution, from James Gray, slightly modified by me:

        sub format_number_list {
          my @output ;
          while (@_) {
            my $range_start = shift;
            my $range_end   = $range_start;

            # check if the numbers go in sequence from here
            $range_end = shift while (@_ && $_[0] == $range_end + 1);

            # ...and add to output accordingly
            if ($range_start == $range_end) {
              push @output, $range_start;
            }
            else {
              push @output, "$range_start-$range_end";
            }
          }
          join ", ", @output;
        }

1. This quiz generated a large amount of discussion about what to do if the input list was out of order, if the input list contained repeated numbers, if the input list contained negative numbers, and so on.
Had I thought about the problem more carefully before posting, I would have said something about some of these situations. The application I originally had in mind was .newsrc files. Lines of a .newsrc file indicate which news articles in a newsgroup have already been read. News articles are numbered starting from 1, so the negative-number issue never comes up, and the .newsrc line represents a set of integers, not a list, so order and repetition are nonissues. But I didn't say this in the question.

2. Then there was additional discussion about what the function should do if given numbers out of order. For example, given (1,2,3,7,4,5,6), should it produce "1-3, 7, 4-6" or "1-7" or something else? There were good arguments in both directions:

  * Maybe the order is significant; then you don't want to alter it.

  * If you sort the input before processing, you foreclose the
    possibility of the function ever treating (1, 3, 2) differently
    from (1, 2, 3). But if you're careful not to change the order,
    the user who wants you to treat them the same can still call
    format_number_list(sort numerically @nums); (a minimal definition
    of 'numerically' appears in the footnote at the end of these
    notes).

  * If the purpose of the function is to generate 'human-readable
    lists' then reordering the numbers for maximum compression is
    more likely to achieve that.

For the .newsrc case, since the lists are actually sets, reordering makes sense. For other applications, it may not. I tested these separately.

There was also some discussion about whether (3, 2, 1) should turn into "3-1" or "3, 2, 1", supposing that the input was not to be reordered.

3. If negative numbers are allowed, the required output format is ugly and hard to read. (-3, -2, -1) would turn into "-3--1". Some people opted for "-3..-1" instead.

4. Some people also modified the output formats in various other ways. Then I had to hack on them to get them to work with the test harness. You'd be surprised at how difficult it was to make trivial formatting changes in some of these programs. In several cases I had to hunt down and change the same punctuation in several places.

5. With all this discussion, I was surprised to see how few of the submitted solutions actually worked properly for the straightforward cases: Positive integers in ascending order with no repeats. Of the 21 samples I tested, several failed some of the basic cases. (See http://perl.plover.com/qotw/misc/r006/RESULTS/format/NOTOK .)

6. Even the programs which were advertised by their authors as handling certain special cases, often didn't.

7. My conclusion from all this is that maybe people would do well to pay more attention to basic correctness in the simple cases before worrying a lot about making the functionality as complete as possible.

Now 'expand_number_list'. Here's Andreas Koenig's Set::IntSpan version:

        use Set::IntSpan; # CPAN rules :-)
        use strict;

        sub expand_number_list {
          my $run = shift;
          my $set = Set::IntSpan->new($run);
          $set->elements;
        }

Here's Tom Varga's, cleaned up a little:

        sub expand_number_list {
          my @result ;
          for my $num (split(/\s* , \s*/x, $_[0])) {
            push(@result,
                 ($num =~ /(\d+) \s* - \s* (\d+)/x) ? ($1 .. $2) : $num ) ;
          }
          @result ;
        }

8. Several people seemed to misunderstand that 'expand_number_list' was supposed to return a list of numbers, not a string.

That's all for regular quiz #6. I'll send a postmortem of the expert quiz later on today, and new quizzes tomorrow. Thanks to everyone who participated.
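(A footnote on item 2 above: 'numerically' is not a Perl built-in; it is the conventional name for a comparison subroutine that the caller supplies. For anyone who wants to try the sort-first variation, a minimal sketch; the subroutine and variable names here are mine, not taken from any submitted program:

        sub numerically { $a <=> $b }

        my @compact = format_number_list(sort numerically @nums);

The '<=>' operator compares $a and $b as numbers, so that 10 sorts after 2; the default string comparison would put "10" before "2".)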
Sample solutions and discussion
Perl Quiz of The Week #7 (20021127)

A gentleman on the perl-qotw-discuss list reports:

> In two different companies that I've worked at, the policy has been
> that percentages in reports must always add up to 100% (at the cost
> of munging the actual data).  It seems that otherwise end users
> report it as a bug.

This means, for example, that if you survey 300 people and find that 100 prefer the color red, 100 prefer blue, and 100 prefer black, you are not allowed to report

        33.3 % prefer red
        33.3 % prefer blue
        33.3 % prefer black

Because then the percentages appear to add up to only 99.9%. Instead, you'll fib, by rounding one of the percentages up to 33.4% instead of down to 33.3%:

        33.3 % prefer red
        33.4 % prefer blue
        33.3 % prefer black

This, of course, is ridiculous, since it suggests that there were somehow more 'blue' responses than 'red' or 'black' responses, when there were in fact equal numbers of each. But in the world of business the appearance of correctness is sometimes more important than actual correctness.

Similarly, if you survey 70 people and find that 30 prefer red, 30 prefer blue, and 10 prefer black, you may not say that

        42.9 % prefer red
        42.9 % prefer blue
        14.3 % prefer black

because the percentages appear to add up to 100.1%. You must adjust one of the percentages down by 0.1%.

You will write a function, 'fudge_numbers', which takes the real data as input and returns the appropriate percentages. The first argument to fudge_numbers() will be special: It will be an integer, 0 or greater, indicating how many places after the decimal point will be retained after rounding. An argument of 1 will mean that the percentages you return would be rounded to the nearest tenth of a percent, as in the examples above. An argument of 0 will mean that the percentages should be rounded to the nearest percent; an argument of 2 will mean that the percentages should be rounded to the nearest hundredth of a percent.

The remaining arguments to fudge_numbers() will be the actual data, which will all be non-negative numbers. The return value of fudge_numbers() will be a list of numbers indicating relative percentages. There must be exactly one return value for each data argument. The return values must be rounded off as indicated by the rounding argument, and they must total exactly 100. (Or as near as possible within the computer's limits of precision.) For example, fudge_numbers(1, 100, 100, 100) should return (33.4, 33.3, 33.3) or (33.3, 33.4, 33.3) or (33.3, 33.3, 33.4) (All are equally acceptable.) Similarly:

        Arguments               Return values
        1, 100, 100, 100        33.3, 33.4, 33.3
        0, 100, 100, 100        33, 34, 33
        2, 100, 100, 100        33.33, 33.34, 33.33
        2, 7, 7, 7              33.33, 33.34, 33.33
        1, 30, 30, 10           42.9, 42.9, 14.2 or 42.9, 42.8, 14.3
                                or 42.8, 42.8, 14.3
        1, z                    100    (here 'z' is any number)

----------------------------------------------------------------

I got a pleasant surprise while I was testing these. I didn't solve the problem myself until very late, because I couldn't think of a solution, and because I hate floating-point numbers. But then when time came to write the report, I finally gave in and did it. Then I ran the test suite and started looking at the programs in order from shortest to longest. The shortest two didn't pass the tests. The third-shortest did, but when I read the code I scratched my head and said "That can't work, can it?" And then I added some more tests to the test suite and found that it *didn't* work.
Then the next four-shortest also didn't pass the tests, and that left my own late entry as the shortest version that did pass the tests. Of course, it's still possible that someone might see it, scratch their head, say "That can't work, can it?" and find the test that it fails. But until then, here it is:

        # Round $v to nearest integer
        sub round { sprintf("%.0f", shift) }

        # Add up the arguments
        sub sum { my $s = 0; $s += $_ for @_; $s; }

        sub fudge_numbers {
          my ($prec, @d) = @_;
          my $scale = 10 ** $prec;
          my $sum = sum(@d);

          # Scale data so that all significant digits are
          # *left* of the decimal point
          my @p = map $_*100*$scale/$sum, @d;
          my @r = map round($_), @p;              # rounded versions of @p
          my @e = map $p[$_]-$r[$_], (0 .. $#r);  # error

          # This is the number of jots by which the answer is too LOW.
          my $total_error = round(sum(@e));
          if ($total_error) {
            # Sign +1: numbers need to be increased.
            #      -1: numbers need to be decreased
            my $sign = $total_error < 0 ? -1 : 1;
            $total_error *= $sign;                # absolute value

            # We want total_error equal to zero.
            # To achieve this, we will add a jot to the low numbers,
            # or subtract a jot from the high numbers, as needful.
            for (0..$#r) {
              next unless $e[$_] * $sign > 0;  # Error goes the wrong way
              $r[$_] += $sign;                 # Adjust value
              $total_error--;
              last if $total_error == 0;
            }
          }
          map $_ / $scale, @r;   # Scale data back to percentages
        }

My background is in systems programming, and I think in my entire life as a systems programmer I only ever used a floating-point number once. I *hate* floating-point numbers, and I think it would be fair to criticize me for avoiding them out of fear and ignorance. But once again, avoiding them turned out to be a good strategy. I deal with integers throughout. If the input is (3, 50, 50, 50) then instead of trying to come up with 33.333 / 33.333 / 33.334, and worrying about the floating-point comparison issues, I try to come up with 33333 / 33333 / 33334 and then scale the answers back to percentages at the last moment. That way I don't have to worry about the fact that Señor Computadoro Estúpido thinks that 100 - (33.333+33.333+33.333) = 0.00100000000000477.

Let's consider (1, 2, 3, 5) with a precision of 2 as an example. The program first computes the percentages, but scaled so that all the significant figures are to the left of the decimal point. For the example, the percentage values are

        909.090909090909
        1818.18181818182
        2727.27272727273
        4545.45454545455

representing 9.090909%, 18.181818%, etc. This is the '@p' array. Then the program rounds off the percentages to the specified precision; this just means rounding them off to the nearest integer, since we scaled them for that exact purpose. This is the '@r' array:

        909 1818 2727 4545

The program then computes the difference between the true value (in @p) and the rounded value (in @r); this is the 'error', stored in @e. Since the true percentages must add up to 100%, and we want the rounded values to do the same, we need to adjust the rounded values so that the total error is 0. $total_error is the sum of the values in @e, and we would like it to be 0. If it *is* 0, we don't need to do any fudging at all, and we skip most of the rest of the function.

The big 'if' block in the middle of the function does the fudging. First it calculates $sign, which is +1 if the numbers need to be fudged upward (because the total is too small, as with 33% + 33% + 33%) and -1 if the numbers need to be fudged downward (because the total is too large, as with 17% + 17% + 17% + 17% + 17% + 17%.)
We'll choose some of the elements of @r and add $sign to them to make the total come out right. Because all the numbers have been scaled so that the least significant place is just to the left of the decimal point, we never need to consider a fudge amount other than +1 or -1.

Now we scan over the elements of @r looking for candidates for fudging. If the number is already too small, we mustn't fudge it still further downward, and vice versa; the "next unless $e[$_] * $sign > 0" line takes care of this check: the element must have been rounded in the *opposite* direction from the one we're fudging, so that its stored error in @e has the same sign as the fudge. (If we're fudging upward, we only bump elements that were rounded down, and vice versa.) When we find a fudging candidate, we fudge it ($r[$_] += $sign) and then decrement $total_error accordingly. When the total error reaches zero, no more adjustments are necessary.

After we've finished any necessary adjustments, we scale the adjusted elements of @r back to the right size for percentages and return the results.

*. This time there was no discussion of peculiar edge cases. Are negative numbers allowed? What if all the numbers are zero? Perhaps all the edge-case-fanciers were on vacation.

*. This problem turned out to be quite difficult to get right, much harder than I thought it would be. Of 17 programs posted to the -discuss list, only 4 (from 3 authors) passed all the tests! You should consider trying the test suite yourself. You can obtain it from

        http://perl.plover.com/qotw/misc/r007/TestFudge.pm

Then to use it, run the command

        perl -MTestFudge yourprogram.pl

and look for 'not ok' in the output. If your program fails a test, debugging it will probably be at least as instructive as doing the quiz in the first place. Thanks to Andreas Koenig for the tricky test case (#44) that caught out one of the submitted programs.

*. A very common error was to compute the fudge factor correctly and then to apply it to the wrong elements. Many people assumed that any of the result values could be fudged. But doing so can lead to bizarre results. Nobody would accept (37, 23, 40) as a valid fudging of (33.3, 33.3, 33.3). Similarly, one person said on the -discuss list:

        I do not think the improper result from [0 1 1 1 1 1 1] ->
        [15 17 17 17 17 17] is a bug so much as an issue with the
        constraints of the problem.

Maybe, but the problem said:

        The return value ... will be a list of numbers indicating
        relative percentages.... [which] must be rounded off as
        indicated by the rounding argument.

There is no way to interpret '15' as 16 2/3 % (the exact relative percentage) rounded off to 0 decimal places. (The person quoted above submitted a revised solution when this language was pointed out to him; nevertheless, even if I'd somehow left a loophole in the problem specification, what's the point of producing a solution that you know is defective just because you can weasel it through a loophole in the problem statement? The Quiz of the Week is not mandatory.)

Anyway, many results were misrounded even by solutions that were *not* deliberately ignoring the requirement to round off. For example, test 104 concerned the data (2 2 1 1 1 1) rounded off to 0 places. The exact percentages are (25, 25, 12.5, 12.5, 12.5, 12.5). There are a lot of reasonable answers here, all of the form (25, 25, 12, 13, 12, 13). But what you *cannot* do is alter the 25, which is already exact. There is no interpretation of 'round off' in which 25 is 'rounded off' to anything other than 25.
Nevertheless, among the solutions submitted on the -discuss list, the 25 was 'rounded off' to 23, 24, 26, and 27. (One hapless poster got the 25's right and then rounded off 12.5 to 14.) Altogether, 10 of the 17 posted solutions failed this test.

Similarly, faced with (2 1 1 1 1), where the exact answer was (33 1/3, 16 2/3, 16 2/3, 16 2/3, 16 2/3), and the correct result would have been something like (33, 16, 17, 17, 17), eight of the 17 programs produced (32, 17, 17, 17, 17) instead.

*. Randal Schwartz said:

        I thought about a test harness for easy#7, but when I
        realized that the numbers could come back in any order, I
        punted. :)

I had meant to require the percentages to be in the same order as the input data. That is, given data (1, 2, 3, 4), the return values MUST be (10, 20, 30, 40), and not some other permutation of those. But it turned out that nobody returned the results in the wrong order, so I didn't have to worry about it.

*. John Macdonald's third posted solution (http://perl.plover.com/qotw/misc/r007/macdonald3.pl) is worth study, because it relies on a clever insight:

        Using truncation has some advantages.
        - As you point out, the numbers are within 1 in the last
          decimal place.
        - Fudging will always be incrementing, never decrementing.

        Then a light bulb went on.  If you use the technique of
        applying the fudge to the elements that had the greatest
        error from the truncation, the fudging process will then
        minimise the final error and come to the same result as when
        you start with rounding.  And, because the fudging is always
        positive, the code is simpler.

I like that. This gave me that "Gosh, I wish I had thought of that" feeling. Modulo this insight, his program is very similar to mine.

*. Since at most one other person produced a correct answer, I thought I'd better look at it to see if it was doing anything different. It was sent in by Brian King. The first thing that grabbed my attention was:

        my $tolerance = 0.000_000_001;  # 1 one-billionth should suffice.

What's funny about this is that I had been about to write almost exactly the same thing in my own program, but then I got a nagging feeling about what would happen if the caller asked for their percentages rounded off to the nearest ten-billionth, and I couldn't see a way out, so I scrapped the whole idea and went with the always-use-integers approach that I showed above. And sure enough, Mr. King's program produces the wrong answer for (10, 1, 1, 1). The output is (33.3333333333 33.3333333333 33.3333333333), but it should be (33.3333333334 33.3333333333 33.3333333333). Darn!

That said, there were a couple of other things I found interesting about Mr. King's program. It was one of the longer ones (second-longest, in fact) and I wondered why. Mostly it seemed to be because of repeated code. For example:

        if ( $off_by > 0 && ( abs($off_by) > $tolerance ) ) {
            # if the overall difference is positive & we're still off...
            if ( $out > ( ( $this / $sum ) * 100 ) ) {
                # and if we rounded this one up
                $out    -= $precision;
                $off_by -= $precision;
                # round it down instead.  update how much we're still off
            }
        } ## end if ( $off_by > 0 && ( abs($off_by...
        elsif ( $off_by < 0 && ( abs($off_by) > $tolerance ) ) {
            if ( $out < ( $this / $sum ) * 100 ) {
                $out    += $precision;
                $off_by += $precision;
            }
        }

Here I would have at least eliminated the repeated abs($off_by) > $tolerance test:

        if (abs($off_by) > $tolerance) {
          if    ($off_by > 0) { ... }
          elsif ($off_by < 0) { ... }
        }

but I would have preferred to somehow merge the two blocks into one.
And in fact, I *did* merge the two blocks into one; this code corresponds closely with the

        next unless $e[$_] * $sign > 0;  # Error goes the wrong way
        $r[$_] += $sign;                 # Adjust value
        $total_error--;

section of my example program.

Along similar lines, Mr. King's program has

        if ( $precision == 0 ) {
            $precision = 1;
        }
        else {
            $precision = '.' . '0' x ( $precision - 1 ) . '1';
        }

but it would have been simpler to do

        $precision = 10 ** -$precision;

Don't take these criticisms too seriously, since I wouldn't even have been looking at the code so closely if it hadn't outperformed almost all the other submitted programs; the one real defect could be fixed (if necessary) by adjusting $tolerance to a more appropriate value.

That's all for this week's regular quiz. I'll send something about the frost simulators tomorrow, and new quizzes on Wednesday. My thanks to everyone who contributed to the discussion, and especially to the people who worked the problem on their own.

Sample solutions and discussion
Perl Quiz of The Week #8 (20021211)

Bill Gosper, a famous programmer, once said that a good way to manufacture word puzzles was to look through the dictionary for a word that contains a sequence of four letters that does not appear in any other word. Then the puzzle is to guess the word, given only the four letters. For example, what common English word contains the contiguous sequence of the four letters 'acur'? (Gosper says that you see this word every week, but that it will take you a month to figure out what it is.)

Write a Perl program which, given a dictionary, generates two output files, 'questions' and 'answers'. 'questions' should contain every sequence of four letters that appears in exactly one word of the dictionary, one sequence per line. 'answers' should contain the corresponding words that contain the sequences, in the same order, again one per line.

For example, given the trivial dictionary containing only

        arrows
        carrots
        give
        me

The outputs should be:

        'questions'     'answers'
        carr            carrots
        give            give
        rots            carrots
        rows            arrows
        rrot            carrots
        rrow            arrows

Of course, 'arro' does not appear in the output, since it is found in more than one word.

Here's a sample program, provided by Jonathan Scott Duff. I trimmed it a little.

        # Well, I see bunches of other people posting their solutions
        # to the regular quiz, so here's mine:

        #!/usr/bin/perl

        $SEG_LENGTH = 4;
        while (<>) {
          chomp;
          next if /\W/;
          $w = lc $_;
          %w = map { substr($w,$_,$SEG_LENGTH) => 1 }
               0..length($w)-$SEG_LENGTH;
          for $w (keys %w) {
            $wordmap{$w} = exists $wordmap{$w} ? undef : $_;
          }
        }
        open(Q,">questions") or die;
        open(A,">answers") or die;
        for (sort keys %wordmap) {
          next unless defined $wordmap{$_};
          print Q "$_\n";
          print A "$wordmap{$_}\n"
        }
        close Q; close A;

The main data structure in the program is the hash %wordmap. Keys in %wordmap are strings of length 4. The value associated with a key $k is the word in which $k appears, if $k appears in only one word, and an undefined value if $k appears in more than one word.

The program first converts each input word to all lowercase, and then uses 'map' to construct a hash, %w, whose keys are the length-4 segments of the word. For example, if the word is 'phlebotomy', the hash is

        ('phle' => 1, 'hleb' => 1, 'lebo' => 1, 'ebot' => 1,
         'boto' => 1, 'otom' => 1, 'tomy' => 1, )

The 1's aren't significant; they're just placeholders. Using a hash in this way is a common Perl idiom for representing a set of strings.

The program then loops over the keys, looking up each one in %wordmap.
If a key was already in %wordmap, then this is at least the second time it has been seen, so the program sets the associated value to 'undef', to indicate that it has appeared more than once. If the key isn't in %wordmap yet, then it's inserted into %wordmap, and the associated value is the single word in which it has appeared. After generating %wordmap, the program writes out the questions and answers files, skipping over any elements of %wordmap whose values are undefined.

*. Since the values in the %w hash are never used or examined at all, it might seem that we could dispense with them, replacing

        %w = map { substr($w,$_,$SEG_LENGTH) => 1 } ... ;
        for $w (keys %w) { ... }

with

        @w = map { substr($w,$_,$SEG_LENGTH) } ... ;
        for $w (@w) { ... }

This was a common error in the submitted programs. The problem it causes occurs with words like 'alfalfa' and 'lightweight' which contain the same sequence of four letters more than once. The second version of the code sets @w to ('alfa', 'lfal', 'falf', 'alfa') and then iterates over this list, processing 'alfa' twice. It then erroneously marks 'alfa' in %wordmap as appearing in two words when in fact it has appeared twice in only one word.

To avoid this, we must be sure to process each sequence of four letters at most once per word. Storing the sequences as keys in the %w hash ensures this, because hash keys are unique. The %w generated for 'alfalfa' is

        ('alfa' => 1, 'lfal' => 1, 'falf' => 1)

and so iterating over the keys processes 'alfa' only once.

1. A way to fix the problem without introducing another hash appears in Ronald Kimball's program. Ronald's program solves the problem more directly:

        for $w (@w) {
          $wordmap{$w} = exists $wordmap{$w} && $_ ne $wordmap{$w}
                           ? undef : $_;
        }

The second and subsequent times that the program sees a particular sequence, it throws away the stored word only if it's different from the current word. As written above, the program generates a huge number of 'uninitialized value' warnings because the 'undef' values stored in the hash to indicate a sequence that has been seen two or more times are compared with $_. Ronald's program uses 1 instead of undef, so it doesn't generate any warnings. Another way to silence the code above is to shut off warnings.

2. Mr. Duff's program, as submitted, actually finds unique sequences of 'n' letters, where 'n' defaults to 4, the number specified in the question. If it's run as

        duff.pl -n 2 < dictionary

it finds digraphs (pairs of letters) that occur in only one word each. To make the code simpler, I trimmed this out and replaced the $opt{'n'} parameter with $SEG_LENGTH.

3. As a trivium, here's the output for n=2:

        bg bw dz fc fj fp fw gj hq hv iy jr lj qa sj vk vs vz wz
        xb xn xq xs xv xx yj yq zd zg zm zp

Of these, 13 make good puzzles:

        bg bw fc fp fw gj hq lj sj wz xq xs yj

The rest are either proper nouns (English or otherwise)

        hv iy qa vs xb xn xv xx yq zd zg zm zp

or are of visibly foreign origin ('resident aliens'):

        dz fj vk vz

or are abbreviations:

        jr

I think my favorite one is probably 'hq'.

4. People sometimes suggest that Perl's '..' operator should construct a backwards-counting range if the second operand is smaller than the first. For example, they say that 4..0 should produce the list (4, 3, 2, 1, 0). At present, it produces the empty list. This program demonstrates one of the many reasons why this is a bad idea. Consider this part of the code:

        %w = map { substr($w,$_,$SEG_LENGTH) => 1 }
             0..length($w)-$SEG_LENGTH;

Suppose $SEG_LENGTH is 4 and $w is "cat".
The '..' expression is then 0..-1. With the existing semantics for '..', the '..' generates an empty list for 'map' to iterate over, the hash %w becomes empty, and the word is effectively skipped--just the right thing. With the defective alternative behavior, the '..' would generate the list (0, -1), and the 'map' generates the (bizarre) list ('cat', 't'). To get correct behavior, the code would have to be adjusted with a special case to check for length($w) < $SEG_LENGTH.

A similar example concerns this construction:

        @rest = @a[2..$#a];

Here the intent is to copy the third-through-last elements of @a. For example, if @a contains 10 elements, $#a is 9, and @rest gets elements 2 through 9. If @a contains only two elements, $#a is 1, the '2..1' expands to an empty list, and @rest is assigned nothing---which is just what was wanted. With the alternative semantics, the '2..$#a' expands to (2, 1), and @rest is assigned two undefined values. Again, a special case is necessary to guard against precisely the behavior of .. that was proposed. (If you do want to count backwards, use something like reverse(1..$n).)

5. As usual, many people submitted programs that did not adhere to the interface I asked for in the question, making various gratuitous changes to the input semantics, the output file names, the output format, or whatever. Unlike in the past, I decided not to repair these.

The changes that puzzled me most were the ones that replaced the two output files ('questions' and 'answers') with a single output file. I agree that this is simpler and more natural. Usually I would have specified a single file with two columns. But in this case that format is no good, because when you try to pick out a puzzle, you see the answer right next to it, which spoils the fun.

6. One common variation, particularly among the shorter programs, was to use a tricky regex to generate the substrings, instead of the loop shown above. For example

        while (/(?=(.{4}))/g) { ... }

was a popular trick. (This iterates the 'while' loop once for each four-letter sequence, with $1 set to each sequence in turn.)

7. When I first tested the programs, I got a surprise. Everyone's programs ran very quickly, except mine, which was by far the slowest of the bunch. I wondered what elementary mistake I must be making. Unfortunately, it turned out to be an error in the test apparatus, not a programming mistake in my program. (I had been looking forward to discussing it.) I had forgotten to trim the email headers out of the other programs, so mine was the only one that wasn't aborting immediately with multiple syntax errors.

Once again, my thanks to everyone who participated. I will send out a new quiz tomorrow.

Sample solutions and discussion
Perl Quiz of The Week #9 (20021218)

You will write a simple spelling checker program, similar to the Unix 'spell' utility. The program should be called 'spel'. It will read a document from standard input and print on standard output a list of all the misspelled words in the document. If any command line arguments are given, 'spel' should read those files instead of the standard input. The output words should be in lexicographic order, and no word should appear more than once in the output.

'spel' will be given one or more dictionaries of words that are already spelled correctly. It will always try to read the file '/usr/dict/words'. It will also try to read '.spel' files from certain directories.
If the user has set an environment variable SPELWORDS, 'spel' should interpret its value as a :-separated list of directories to be searched for '.spel' files. If no SPELWORDS variable is set, 'spel' should search in the user's home directory and in the current directory.

If you need a sample dictionary, you can obtain one from

        http://perl.plover.com/qotw/words/

----------------------------------------------------------------

Here's sample code, submitted by Abigail:

        #!/usr/bin/perl
        #
        # The exercise isn't clear what's to be considered a word,
        # or how to deal with capitalization.
        #
        # This program considers words to be substrings consisting of only
        # 'alpha' characters. This means that 'words' like "isn't" are
        # considered to be two words, 'isn' and 't'.
        #
        # As for capitalization, words in the text should have the same
        # capitalization as in the dictionary. However, since words starting
        # a sentence are capitalized, we permit the first letter of a word
        # to be capitalized, even if the dictionary only has the all lower
        # case version of the word. No attempt of parsing sentences, trying
        # to detect first words, has been made.
        #

        use strict;
        use warnings;

        my @std_dicts;   # The default dictionaries.
        my @spel_dirs;   # Directories to look for .spel files.
        my @dicts;       # List of dictionaries.
        my %words;       # Words found in the dictionaries.
        my %mistakes;    # Mistakes in the file(s).

        @std_dicts = ("/usr/dict/words",        # The exercise specifies this file,
                      "/usr/share/dict/words"); # but on my system, it's found here.

        @dicts = grep {-f} @std_dicts;          # So, we'll do some juggling,
        splice @dicts => 1 if @dicts;           # making sure we use at most 1 file.

        # Adding the ".spel" files.
        @spel_dirs = defined $ENV {SPELWORDS} ? split /:/ => $ENV {SPELWORDS}
                                              : ($ENV {HOME}, ".");

        push @dicts => grep {-f} map {"$_/.spel"} @spel_dirs;

        #
        # Init the dictionaries.
        #
        {
            local @ARGV = @dicts;
            while (<>) {
                chomp;
                $words {$_} = 1;
                $words {+ucfirst} = 1 unless /[[:upper:]]/;
            }
        }

        #
        # Read the text, record all words not found in a dictionary.
        #
        while (<>) {$words {$1} or $mistakes {$1} = 1 while /([[:alpha:]]+)/g}

        #
        # Print the mistakes, sorted.
        #
        print "$_\n" for sort keys %mistakes;

        __END__

----------------------------------------------------------------

Abigail's program has four phases. First there's an initialization section, in which it determines which dictionaries to use. Then it loads words from the dictionaries into a hash, %words. Third, the program loops over the manuscript input, checking each word against %words. Words not present in %words are noted in %mistakes. Finally, the program prints out the words from %mistakes.

The initialization section first decides where the standard dictionary is; my problem statement said it would be in '/usr/dict/words', but on many systems (including mine---hmmm) it's in '/usr/share/dict/words'. Abigail's code prefers the former if it exists, but if not it uses the latter:

        @std_dicts = ("/usr/dict/words",
                      "/usr/share/dict/words");
        @dicts = grep {-f} @std_dicts;
        splice @dicts => 1 if @dicts;

Note that the code does not depend on there being exactly two items in @std_dicts; you can list as many standard dictionaries as you want, in the order you would prefer to try them, and the program will use the first one it finds. But I wonder if it might not have been more perspicuous to write something like

        @std_dicts = ("/usr/dict/words", "/usr/share/dict/words");
        my ($std_dict) = grep {-f} @std_dicts;

and then use

        local @ARGV = (@dicts, $std_dict);

later on.
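One caveat if you try that variation: if none of the standard dictionaries exists, $std_dict is undefined, and an undefined value in @ARGV will make the <> loop emit warnings and fail to open anything for that entry. A guard along these lines (my addition, not part of Abigail's program) avoids the surprise:

        my ($std_dict) = grep {-f} @std_dicts;
        warn "spel: no system dictionary found\n" unless defined $std_dict;
        local @ARGV = (@dicts, defined $std_dict ? $std_dict : ());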
Initializing the dictionaries uses a handy trick that all Perl programmers should be aware of. Everyone knows about the <> operator, which reads a line of input from the files named on the command line, or from the standard input if none are named. What many people aren't aware of is that you can fool it about what the command-line files are. This is what Abigail is doing here:

        {
            local @ARGV = @dicts;
            while (<>) {
                chomp;
                $words {$_} = 1;
                $words {+ucfirst} = 1 unless /[[:upper:]]/;
            }
        }

'local @ARGV' temporarily resets the value of @ARGV, which is what <> looks at to determine the command-line arguments. Since the files named in @ARGV are exactly the names of the dictionaries, the <> operator reads one line at a time from each dictionary. At the end of the block, the effect of the 'local' is undone and @ARGV resumes its original value.

The 'ucfirst' code here takes care of a detail that several submitters forgot. If the dictionary contains the word 'carrot', we would like to accept both 'carrot' and 'Carrot' as correct. The 'ucfirst' takes care of this; if the word 'carrot' appears in the dictionary file, then both 'carrot' and 'Carrot' are placed in the hash.

For the dictionaries I supplied, the '/[[:upper:]]/' special case is never meaningfully exercised. It would become important if the dictionary contained a word like 'iMac' which contained uppercase letters but whose initial letter was not already uppercase. The guard condition would prevent 'IMac' from being added to the dictionary. It's not clear to me that this is really the right thing to do, however. (Does 'iMac' get capitalized at the start of a sentence? I don't know.) The '+' on the 'ucfirst' prevents Perl from taking 'ucfirst' as a literal hash key.

The manuscript input is read by another '<>' loop:

        while (<>) {$words {$1} or $mistakes {$1} = 1 while /([[:alpha:]]+)/g}

The order of control flow here may not be clear. It's equivalent to:

        while (<>) {
          while (/([[:alpha:]]+)/g) {
            unless ($words {$1}) {
              $mistakes {$1} = 1;
            }
          }
        }

(This sort of thing is the reason that some people love statement modifiers and other people hate them.) The interesting feature here is the use of '//g' to locate the words in each line.

        while (/..../g) { ... }

in general will search through $_, repeating the 'while' loop once for each time the pattern matches. The pattern '[[:alpha:]]+' will match sequences of one or more alphabetic characters. Note that this treats the word 'isn't' as two words, 'isn' and 't', and similarly 'pot-au-feu' as 'pot', 'au', and 'feu'. To add apostrophes and hyphens to the list of characters that may appear in a 'word', change this to

        while (/([[:alpha:]'-]+)/g) {

This is fraught with its own dangers; if the input now contains the line:

        to accept both 'carrot' and 'Carrot' as correct. The 'ucfirst' takes

then "'carrot'" will be recognized as a 'word' and looked up in the dictionary; but "'carrot'" isn't in the dictionary; only "carrot" is present. Handling all cases correctly can be rather tricky.

Finally, the list of misspellings is printed out with a straightforward loop:

        print "$_\n" for sort keys %mistakes;

----------------------------------------------------------------

1. Loading the dictionary was a little tricky. Several submitters wrote code like this to load the dictionary:

        while (<DICT>) {
          chomp;
          $WORDS{lc $_} = 1;
        }

The words are smashed to all-lowercase before being stored, which leads their programs to accept some rather peculiar words.
For example, one of the dictionary files I supplied contains the 'word' 'IEEE', the acronym for the Institute of Electrical and Electronics Engineers, which is likely to appear in many technical contexts. If the case is smashed, the spell-checker will silently accept the word 'ieee', and typically 'IEeE' and 'CArrot' as well.

Some submitters forgot that 'carrot' in the dictionary indicates that 'Carrot' is also acceptable. Some remembered, but got the code wrong. For example, the solution I wrote before I posed the problem loads the words into a hash exactly as they are given, and then checks for a word's presence with:

        $bad{$_}++ unless $d{$_} || $d{lcfirst $_} || $d{ucfirst $_};

This takes care of 'Carrot' properly, even if the dictionary contains only 'carrot'. Unfortunately, it also causes the program to silently accept 'larry', even though the dictionary contains only 'Larry'. Whoops! It also refuses to accept 'CARROT'; I would consider this a bug.

2. The punctuational issue is one of those problems that gets more and more complicated the longer you look at it. At first it seems that it can be solved by just treating hyphen and apostrophe as letters. But if you do that, your program fails on words that are quoted by being placed between apostrophes, as 'Carrot' is in this sentence. A second-order approximation is to trim punctuation from the beginning and the ending of each word before checking it, but then (as one submitter observed):

        Since I strip off all trailing punctuation, my program as it
        stands will flag 'words' such as 'etc.', 'a.c.', 'a.k.a.' as
        wrong, even if they are in the dictionaries used.

3. Nearly everyone's programs loaded the dictionary into a hash. Two submissions didn't. One loaded the dictionary into an array and did linear search on the array. On a 10-word file, this program took 9 sec to check the file; Abigail's program took 25 sec; most of the extra time was taken up by constructing the hash. (Some of the extra time occurred during the global destruction phase, after the program had completed; apparently the dictionary hash was destructed one key at a time, which I don't understand.) But even with this extra overhead, the hash approach won for any file that wasn't trivially small. For a 270-word file, the linear search program took 123 seconds; Abigail's program still took 25 sec.

Another submission generated an enormous tree structure with hashes as the nodes. This took a long time to build and search (17 seconds to load a small dictionary that Abigail's program dealt with in less than 2 seconds) and a humongous amount of memory (I could not load the Web2 dictionary file.)

4. One submission contained the following code:

        if ($#ARGV == -1) {
          foreach (grep {!exists($words{lc $_})} split /\W+/, <>) {
            print qq("$_"\n);
          }
        }
        else {
          foreach (@ARGV) {
            open FILE, "< $_" or die "Couldn't open input, $_. $!";
            print "\nMispellings in $_:\n";
            foreach (grep {!exists($words{lc $_})} split /\W+/, <FILE>) {
              print qq("$_"\n);
            }
            close FILE;
          }
        }

The repeated 'foreach' block is a red flag; it suggests that the programmer should look for a way to merge the two blocks, and then see if that makes the code any easier to understand. In this case, it's easy:

        foreach (grep {!exists($words{lc $_})} split /\W+/, <>) {
          print "\nMisspellings in $ARGV:\n" if $ARGV ne $prevARGV;
          print qq("$_"\n);
          $prevARGV = $ARGV;
        }

Or perhaps

        @ARGV = ('-') unless @ARGV;
        foreach (@ARGV) {
          open FILE, "< $_" or die "Couldn't open input, $_. $!";
$!"; print "\nMispellings in $_:\n"; foreach (grep {!exists($words{lc $_})} split /\W+/, ) { print qq("$_"\n); } close FILE; } 5. A number of programs had extra features that I though substantially reduced the usefulness of the program. For example, several of the programs produced extraneous output: print "\n$word misspelled at line $.\n"; or print "\nMispellings in $_:\n"; Extraneous output makes the program more difficult to use as a tool. As specified, the program produces a list of misspelled words. This list could be fed into another program, such as an editor, which could provide a convenient interface to correcting the misspellings. The list could be piped into a program which might make guesses about what words were intended. The output could be directed to the end of './.spel' and then edited. Extra output makes all of these things more difficult. At best, it would have to be filtered out before the output would be useful for anything other than human consumption. Diagnostic messages, if they appear at all, should be printed to STDERR; that is what it is for. Here's an example that's particularly egregious: print "Do you want to change the default spellcheck settings? (y/n): "; chomp (my $choice = ); if (uc $choice eq 'Y') { print "Ignore words all in CAPITALS? (y/n): "; chomp (my $ch1 = ); $ignore_caps = 'Y' if uc $ch1 eq 'Y'; print "Ignore words with numbers? (y/n): "; chomp (my $ch2 = ); $ignore_nums = 'Y' if uc $ch2 eq 'Y'; } Using this program in any way other than as an interactive application is very difficult. In spite of all the work that went into the interface, it remains inflexible; the program will only do the things that the programmer imagined. Options like whether to ignore numerals should be specified non-interactively, on the command line. 6. Robin Szemeti suggested using the Search::Dict module. Search::Dict is one of those very clever pieces of software that never seems to be useful for anything. The idea of Search::Dict is this: If the input file is in lexicographic order, items in it can be found with binary search; this should be quick and also memory-efficient, since the whole file needn't be loaded into memory at once. I tried out a limited version of this, which looks words up in a single dictionary but doesn't actually construct the list of misspellings: use Search::Dict; open W, "< Web2" or die $!; while (<>) { for (split /[^a-zA-Z'-]+/) { look *W, $_; } } On a 2387 word version of this postmortem file, Abigail's program took the usual 27 seconds; the program above took 13. I was not expecting Search::Dict to do so well. I worked up a working spelling checker based on Search::Dict and it still took about 13 seconds on the postmortem file. I was surprised; I had expected the file reading overhead to be much higher. On a 39,XXX word file, the hash approach was a big winner. Abigail's program still took about 27 seconds; the Search::Dict program took 192: #!/usr/bin/perl use Search::Dict; my @std_dicts = ("/usr/dict/words", "/usr/share/dict/words"); my ($std_dict) = grep {-f} @std_dicts; my @spel_dirs = defined $ENV {SPELWORDS} ? 
                            split /:/ => $ENV {SPELWORDS}
                          : ($ENV {HOME}, ".");
        my @dicts = grep {-f} map {"$_/.spel"} @spel_dirs;

        my @fh;
        for my $dict (@dicts, $std_dict) {
          open my $fh, $dict
            or die "Couldn't open $dict for reading: $!";
          push @fh, $fh;
        }

        while (<>) {
          my @words = split /[^a-zA-Z'-]+/;
          WORD:
          for (@words) {
            next if $missp{$_};
            for my $fh (@fh) {
              look $fh, $_;
              my $w = <$fh>;
              chomp $w;
              next WORD if $w eq $_;
            }
            $missp{$_} = 1;
          }
        }

        print join "\n", sort(keys %missp), "";

Search::Dict, by the way, has a lousy interface. It looks in the file to find the first word that is equal to or greater than its argument, and leaves the filehandle positioned at the place where it found that word; it also returns the position at which the handle was left. But it doesn't return the word itself, or any indication of whether it matched the argument or not! The return value, which is the file position, is useless, since you could have gotten it by doing tell() on the filehandle. So my program has to reread the word at the current file position and then compare it with the argument word, even though Search::Dict has just finished doing this.

7. There were a bunch of defects which made me think that programs had not been well-tested. One program wouldn't compile, because it had a missing semicolon. One program did this:

        push(@dictionaries, qw(~/.spel .spel/));

The author apparently didn't notice that neither '~/.spel' nor '.spel/' was ever read. One program had

        open DICT,"Web2" || die "Error: $!\n";

so the error message would never appear; if 'Web2' was not present in the current directory when the program was run, it would cheerfully report every word as misspelled.

----------------------------------------------------------------

Other notes:

* Regular quiz #7 was about taking a list of numbers and then fudging the percentages so that they would add up to exactly 100%. Douglas Wilson contributed a good explanation of why someone might really want to do this, apart from satisfying a demand from a clueless manager. Suppose your company is in the business of selling assemble-it-yourself kits. You would like to list a price for each individual part in the kit; the prices should add up to the cost of the whole kit exactly, even when they are rounded off to the nearest penny. Mr. Wilson's message about this is at

        http://perl.plover.com/~alias/list.cgi?1:msp:1183

* Happy new year, everyone! I will send out Quiz #10 later this evening.

Sample solutions and discussion
Perl Quiz of The Week #11 (20030206)

Question #1: Why does Perl have the 'defined' function? If you want to see if a variable contains an undefined value, why not just use something like this?

        if ($var == undef) { ... }

'==' is for comparing numbers. If its operands aren't numbers to begin with, they are converted to numbers before being compared. The 'undef' on the right is always converted to 0, so this test is the same as comparing for numeric equality with 0. In particular, the test returns true when $var is 0, even though it is not undefined. The test also gives the wrong answer for many strings:

        $var = "oops";
        if ($var == undef) { die }

This dies even though $var is certainly not undefined.

----------------------------------------------------------------

Question #2: What's wrong with this code?

        %hash = ...;
        while (<STDIN>) {
          chomp;
          for my $key (keys %hash) {
            if ($key eq $_) {
              print "$key: $hash{$key}\n";
            }
          }
        }

The 'for' loop scans the hash looking for a particular key.
7. There were a bunch of defects which made me think that programs had
not been well-tested.  One program wouldn't compile, because it had a
missing semicolon.  One program did this:

    push(@dictionaries, qw(~/.spel .spel/));

The author apparently didn't notice that neither '~/.spel' nor
'.spel/' was ever read.  One program had

    open DICT,"Web2" || die "Error: $!\n";

so the error message would never appear ('||' binds more tightly than
the comma, so it applies to the string "Web2", which is always true;
'or' was wanted here); if 'Web2' was not present in the current
directory when the program was run, it would cheerfully report every
word as misspelled.

----------------------------------------------------------------

Other notes:

* Regular quiz #7 was about taking a list of numbers and then fudging
the percentages so that they would add up to exactly 100%.  Douglas
Wilson contributed a good explanation of why someone might really want
to do this, apart from satisfying a demand from a clueless manager.
Suppose your company is in the business of selling
assemble-it-yourself kits.  You would like to list a price for each
individual part in the kit; the prices should add up to the cost of
the whole kit exactly, even when they are rounded off to the nearest
penny.  Mr. Wilson's message about this is at

    http://perl.plover.com/~alias/list.cgi?1:msp:1183

* Happy new year, everyone!  I will send out Quiz #10 later this
evening.

Sample solutions and discussion
Perl Quiz of The Week #11 (20030206)
----------------------------------------------------------------

Question #1: Why does Perl have the 'defined' function?  If you want
to see if a variable contains an undefined value, why not just use
something like this?

    if ($var == undef) { ... }

'==' is for comparing numbers.  If its operands aren't numbers to
begin with, they are converted to numbers before being compared.  The
'undef' on the right is always converted to 0, so this test is the
same as comparing for numeric equality with 0.  In particular, the
test returns true when $var is 0, even though it is not undefined.
The test also fails for many strings:

    $var = "oops";
    if ($var == undef) { die }

This dies even though $var is certainly not undefined.  The only
reliable test is the one 'defined' provides: if (defined $var) { ... }.

----------------------------------------------------------------

Question #2: What's wrong with this code?

    %hash = ...;

    while (<STDIN>) {
        chomp;
        for my $key (keys %hash) {
            if ($key eq $_) {
                print "$key: $hash{$key}\n";
            }
        }
    }

The 'for' loop scans the hash looking for a particular key.  But the
whole point of a hash is that you *don't* have to scan it to find out
if it contains a certain key or not.  Hashes are organized so that
Perl can look up any given key instantly, without having to examine
each one.  The code here is analogous to searching the telephone book
one name at a time, starting from the first page, even though the
telephone book is carefully organized (in alphabetical order) so that
you don't have to do that.  A better way to write the code would be:

    %hash = ...;

    while (<STDIN>) {
        chomp;
        print "$_: $hash{$_}\n" if exists $hash{$_};
    }

This error is common in code written by beginning Perl programmers.
Here's some code that one of my interns once wrote:

    foreach $k (keys %in) {
        if ($k eq q1) {
            if ($in{$k} eq agree)     { $count{q10} = $count{q10} + 1; }
            if ($in{$k} eq disaagree) { $count{q11} = $count{q11} + 1; }
        }
        if ($k eq q2) {
            @q2split = split(/\0/, $in{$k});
            foreach (@q2split) {
                $count{$_} = $count{$_} + 1;
            }
        }
        if ($k eq q3) {
            $count{$in{$k}} = $count{$in{$k}} + 1;
        }
        if ($k eq q4a) {
            if ($in{$k} eq care)     { $count{q4a0} = $count{q4a0} + 1; }
            if ($in{$k} eq dontcare) { $count{q4a1} = $count{q4a1} + 1; }
        }
        if ($k eq q4b) {
            if ($in{$k} eq wish)     { $count{q4b0} = $count{q4b0} + 1; }
            if ($in{$k} eq dontwish) { $count{q4b1} = $count{q4b1} + 1; }
        }
        if ($k eq q5) {
            if ($in{$k} eq yes)  { $count{q50} = $count{q50} + 1; }
            if ($in{$k} eq "no") { $count{q51} = $count{q51} + 1; }
        }
        if ($k eq q6) {
            if ($in{$k} eq yes)  { $count{q60} = $count{q60} + 1; }
            if ($in{$k} eq "no") { $count{q61} = $count{q61} + 1; }
        }
        if ($k eq q7) {
            if ($in{$k} eq "accept")   { $count{q70} = $count{q70} + 1; }
            if ($in{$k} eq understand) { $count{q71} = $count{q71} + 1; }
            if ($in{$k} eq other) {
                $count{q72} = $count{q72} + 1;
                $htmlout = comments;
                open(COMMENTS, ">> /tmp/comments") || die "cant open comments";
                print COMMENTS "$in{q7a}\n\n";
                close (COMMENTS);
            }
        }
        if ($k eq q8) {
            if ($in{$k} eq yes)  { $count{q80} = $count{q80} + 1; }
            if ($in{$k} eq "no") { $count{q81} = $count{q81} + 1; }
        }
    } #end of foreach loop

Larry Wall, the inventor of Perl, has said:

    Doing linear scans over an associative array is like trying to
    club someone to death with a loaded Uzi.

----------------------------------------------------------------

Question #3: What's wrong with this code?

    @matching_words = grep search_for($_, $text_file), @words;

    sub search_for {
        my ($target, $file) = @_;
        return unless open F, "<", $file;
        while (<F>) {
            return 1 if index($_, $target) >= 1;
        }
        close F;
        return;
    }

There are several things wrong with the code.  Probably the biggest
problem is that the search_for function inadvertently destroys the
contents of @words.  Inside a 'grep' loop or a 'foreach' loop with no
control variable, the $_ variable is 'aliased' to the elements of the
array.  This means that you can look at $_ to see the current array
element, and also that you can modify $_ to modify the current array
element.  A simpler example is:

    @n = (1,2,3);
    for (@n) { $_ = 'blah'; }
    print "@n\n";

This prints "blah blah blah".  Since $_ is a global variable, the
implicit assignment to $_ performed by 'while (<F>)' inside the
'search_for' function overwrites the aliased values in @words.

Other possible criticisms include:

(a) search_for performs a repeated search that is probably wasteful;
it would be better to convert it into a hash lookup of some sort, if
possible.

(b) If the rest of the program happened to have a filehandle named
'F', calling search_for will clobber it.  For example, this doesn't
work:

    open F, "myfile" or die ...;
    if (search_for("carrot", "otherfile")) {
        ...
    }
    my $next = <F>;

because F no longer refers to "myfile"; it has been reopened, and
probably closed, by 'search_for'.  This is a violation of function
encapsulation rules.  If the programmer who opened F is not the same
person who wrote search_for, this is going to create a bug that will
be very difficult to track down.

(c) The test 'index($_, $target) >= 1' misses matches at the beginning
of a line: index() returns the position of the match, which is 0 for a
match at the very start of the string and -1 for no match at all, so
the test should have been '>= 0'.
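All of these problems have a common cure: lexical variables.  Here's a
sketch of a repaired search_for -- mine, not from any submitted
solution -- with the same interface:

    sub search_for {
        my ($target, $file) = @_;
        open my $fh, "<", $file or return;    # lexical handle: nothing clobbered
        while (my $line = <$fh>) {            # lexical loop variable: $_ untouched
            return 1 if index($line, $target) >= 0;
        }
        return;
    }

The lexical $fh can't collide with any other filehandle, the lexical
$line leaves $_ (and therefore @words) alone, and the file is closed
automatically when $fh goes out of scope.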
Sample solutions and discussion
Perl Quiz of The Week #13 (20030528)
----------------------------------------------------------------

The 'MH' mail system stores email messages in a 'folder', which is
just a plain directory.  Messages are files in this directory whose
names are numerals.  The directory might contain other files or
subdirectories; these are not messages.

The 'scan' command summarizes the contents of a folder.  Here's a
typical output:

    1349  03/22 Yitzchak Scott-Th  Re: Hey, is this list alive?<<Proposed advanced problem for this week.
    1352  03/25 John_Wunderlich@C  Re: Hey, is this list alive?<<

Here's a solution that uses several of the Mail:: modules.  It starts
out by collecting the message files, which are the files in the
directory whose names contain only digits:

    #!/usr/bin/perl

    use Mail::Internet;
    use Mail::Field;
    use Mail::Address;
    use File::Spec;
    use POSIX 'strftime';

    my $dir = shift || '.';
    opendir D, $dir or die "Couldn't read directory $dir: $!; aborting";
    my @message_files = sort { $a <=> $b }
                        grep {!/\D/ && -f "$dir/$_" } readdir D;
    closedir D;

    my %me;
    {
        my @addresses = $ENV{ADDRESS} ? split(/,\s+/, $ENV{ADDRESS})
                                      : guess_addresses();
        for (@addresses) { $me{lc $_} = 1; };
    }

There was some question about how to decide if a message had been sent
by the user running the program.  I said on the -discuss list that
anything reasonable would suffice.  My 'guess_addresses' function
tries to guess the user's address from various system information, but
allows itself to be overridden by the contents of an ADDRESS
environment variable.  Here's 'guess_addresses':

    sub guess_addresses {
        my $username = $ENV{USER} || (getpwuid($<))[0] || return;
        my $host = $ENV{HOST} || do {
            require Config;
            "Config"->import;
            $Config{myhostname} . $Config{mydomain};
        } || return;
        "$username\@$host";
    }

I suppose 'Config' probably isn't in the Llama Book, but it's not an
essential part of the program, so if you don't like it, you can take
it out and replace it with something else.

Now the main loop of the program starts:

    for $msgno (@message_files) {
        local *F;
        my $file = File::Spec->catfile($dir, $msgno);
        unless (open F, "<", $file) {
            warn "Couldn't read message $msgno; skipping.\n";
            next;
        }

        my $mo = Mail::Internet->new(\*F);

Problem #1: Mail::Internet::new requires a glob reference argument,
which violates my Llama-features-only rule.

Problem #2: Mail::Internet is mighty slow.

        my $h = $mo->head;

        {
            my $datefield = $h->get('date');
            my $time = $datefield ? Mail::Field->new('date', $datefield)->time
                                  : (stat($file))[9];
            $date = strftime("%m/%d", localtime($time));
        }

Mail::Field is part of the MailTools package.  Here it returns a
Mail::Field::date object, which supports a ->time method that converts
the date into an epoch time.  I used strftime() (which may not be
available everywhere) to convert this back to a date.

This seems like an awful lot of machinery to use just to convert
something like "Sun, 15 Jun 2003 01:55:30 +0200" to "06/15".  The
possible upside is that the date in the output shows the date when the
message was actually sent, relative to the user of the program.  For
example, the "Sun, 15 Jun 2003 01:55:30 +0200" message was sent late
on the evening of 14 June, Philadelphia time, and I live in
Philadelphia.  However, that doesn't seem to me like much of a
benefit.  If there isn't a date in the message, we use the date that
the message file was written.  "??/??" would also be a reasonable
alternative.

Now the program deals with the sender's address:

        {
            for my $mf (Mail::Address->parse($h->get('from'))) {
                if ($me{lc $mf->address}) {
                    $whom = "To:" . $h->get('to');
                    last;
                } else {
                    $whom = $mf->phrase || $mf->comment || $mf->address;
                }
            }
        }

It took me much longer to figure this out than it should have.  I
tried using Mail::Field again, and discovered that it's a tremendous
pain in the ass.  Unfortunately I don't remember most of the details.
Mail::Address seems to do the job adequately, however.
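For what it's worth, here's a sketch -- mine, not part of the program
-- of what Mail::Address does with the sort of address that causes
trouble for hand-rolled parsers:

    use Mail::Address;

    my ($mf) = Mail::Address->parse('"Joseph (Joe) Smith" <jsmith@example.com>');
    print $mf->address, "\n";   # jsmith@example.com
    print $mf->phrase, "\n";    # the human-readable name part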
Finally, the main loop builds the subject-and-body column and writes
the output line:

        $subjcontent = $h->get('subject') . "<<" . join "", @{$mo->body};
        $subjcontent =~ tr/\n//d;
        write;
    }

    format STDOUT =
    @#### @<<<< @<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    $msgno,$date,$whom,           $subjcontent
    .

Does the Llama book cover formats?  I hope not.  Of course the format
here is not necessary; I could have used a simple 'print' or something
similar.  But opportunities to use formats come along so rarely that I
like to take advantage of them when they appear.

Now here's the non-modules version.  The biggest cost is that it's
about 50% longer than the other version.

    #!/usr/bin/perl

    use File::Spec;
    use POSIX 'strftime';

    my $dir = shift || '.';
    opendir D, $dir or die "Couldn't read directory $dir: $!; aborting";
    my @message_files = sort { $a <=> $b }
                        grep {!/\D/ && -f "$dir/$_" } readdir D;
    closedir D;

    my %me;
    {
        my @addresses = $ENV{ADDRESS} ? split(/,\s+/, $ENV{ADDRESS})
                                      : guess_addresses();
        for (@addresses) { $me{lc $_} = 1 };
    }

    sub guess_addresses {
        my $username = $ENV{USER} || (getpwuid($<))[0] || return;
        my $host = $ENV{HOST} || do {
            require Config;
            "Config"->import;
            $Config{myhostname} . $Config{mydomain};
        } || return;
        "$username\@$host";
    }

So far everything is the same.  But now, because I'm not using a
Mail:: module to deal with the RFC822-format date, I have to do this:

    my %m2n = (jan =>  1, feb =>  2, mar =>  3,
               apr =>  4, may =>  5, jun =>  6,
               jul =>  7, aug =>  8, sep =>  9,
               oct => 10, nov => 11, dec => 12,
              );

    for $msgno (@message_files) {
        local *F;
        my $file = File::Spec->catfile($dir, $msgno);
        unless (open F, "<", $file) {
            warn "Couldn't read message $msgno; skipping.\n";
            next;
        }

        my %mo = read_message(\*F);

Instead of using Mail::Internet to read in the email message, I wrote
a replacement function.  Here it is:

    sub read_message {
        my $fh = shift;
        my ($header);
        my %m;
        {
            local $/ = "";
            $header = <$fh>;
            undef $/;
            $m{BODY} = <$fh>;
        }
        my @fields = split /\n(?!\s)/, $header;
        for (@fields) {
            my ($t, $v) = split /:\s+/, $_, 2;
            $m{lc $t} = $v;
        }
        %m;
    }

The split /\n(?!\s)/ is a little tricky, and I suppose (?!\s) is
non-Llama.  \n(?!\s) matches only those newlines which are *not*
followed by whitespace.  When we split the message header on those
newlines, we get an array of fields; each field may contain one or
more physical lines of the header.
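Here's a sketch of what that split does with a header that contains
one folded field:

    my $header = "Subject: Re: Hey,\n is this list alive?\nFrom: mjd\@plover.com\n";
    my @fields = split /\n(?!\s)/, $header;

    # @fields now has two elements; the folded line stays attached
    # to its field:
    #   "Subject: Re: Hey,\n is this list alive?"
    #   "From: mjd\@plover.com"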
The main program then continues; here's my quick and dirty code to
deal with RFC822-format dates:

        {
            my $datefield = $mo{date};
            if ($datefield && $datefield =~ /(\d+) (\w+)/) {
                $date = sprintf "%02d/%02d", $m2n{lc $2}, $1;
            } else {
                $date = strftime("%m/%d", localtime((stat($file))[9]));
            }
        }

In retrospect, it probably would have been better to avoid strftime()
here, since I could have done something like:

    #   {
    #       my $datefield = $mo{date};
    #       my ($m, $d);
    #       if ($datefield && $datefield =~ /(\d+) (\w+)/) {
    #           ($m, $d) = ($m2n{lc $2}, $1);
    #       } else {
    #           ($m, $d) = (localtime((stat($file))[9]))[4,3];
    #           $m++;
    #       }
    #       $date = sprintf "%02d/%02d", $m, $d;
    #   }

instead.  The double-list-slice on the 'localtime' line should make
Randal happy.

Next is the section which extracts the addresses from the 'from' and
'to' fields.  It depends on a homemade 'parse_addr' function, which is
the dodgiest part of the program.

        {
            my ($phrase, $addr, $comment) = parse_addr($mo{from});
            if ($me{lc $addr}) {
                ($phrase, $addr, $comment) = parse_addr($mo{to});
                $whom = "To:" . ($comment || $phrase || $addr);
            } else {
                $whom = $comment || $phrase || $addr;
            }
        }

Here's the rather questionable 'parse_addr':

    sub parse_addr {
        my $x = shift;
        my ($phrase, $addr, $comment) =
            $x =~ /([^<(]*)
                   (?: < ( [^<\s]* ) > ) ?
                   \s*
                   (?: \( ([^\)]*) \)) ?
                  /x;
        $addr = $phrase unless defined $addr;
        # warn "$x -> '$phrase', '$addr', '$comment'\n";
        for ($phrase, $addr, $comment) {
            s/^\s+//;
            s/\s+$//;
        }
        return ($phrase, $addr, $comment);
    }

It's questionable because it doesn't always work.  For example, it'll
misparse

    "Joseph (Joe) Smith" <jsmith@example.com>

RFC822 address syntax is horrendously complicated and grossly
overengineered.  But parse_addr works well enough for almost all the
addresses that one encounters in practice.  (Which is why I say RFC822
is overengineered.)  Essentially, the idea is that an address will
have this format:

    PHRASE <ADDRESS> (COMMENT)

where the address and comment parts might be missing.  Addresses might
also have this format:

    ADDRESS (COMMENT)

in which case the address will be mistaken for a phrase; in that case
we use the phrase as the address.  We use a rather ordinary regex to
extract the three parts.
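Here, for the record, is roughly what parse_addr extracts from the
problem address above -- this is my trace, and worth double-checking:

    my ($phrase, $addr, $comment) =
        parse_addr(q{"Joseph (Joe) Smith" <jsmith@example.com>});

    # The leading ([^<(]*) stops at the '(' inside the quoted phrase,
    # so the <...> part never gets a chance to match.  We end up with:
    #   $phrase  is '"Joseph'
    #   $addr    is '"Joseph'   (copied from $phrase)
    #   $comment is 'Joe'       ('(Joe)' is mistaken for an RFC822 comment)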
Finally, the rest of the program is simple:

        $subjcontent = $mo{subject} . "<<" . join "", $mo{BODY};
        $subjcontent =~ tr/\n//d;
        write;
    }

    format STDOUT =
    @#### @<<<< @<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    $msgno,$date,$whom,           $subjcontent
    .

The costs for the homegrown version are that it's 50% longer, and that
it doesn't work on peculiar addresses.  We also lost the benefit of
displaying the local date on which the message was sent; "Sun, 15 Jun
2003 01:55:30 +0200" is displayed as 06/15 even though the message was
sent on 06/14 Philadelphia time.  The benefits are that the program
doesn't depend on a bunch of nonstandard modules, and that the program
runs *twelve* times faster.

Of course, anyone can write a program that runs really fast and
produces the wrong output.  But this program produces the right output
almost all the time, and it's hard to believe that fixing it would
slow it down by a factor of 12.  What went wrong?  I haven't looked
closely, but I suspect that Mail::Internet is way overwritten.

I've placed complete source code at

    http://perl.plover.com/qotw/misc/r013/

Thanks to everyone who participated quietly and said nothing.  I can
confidently predict that I will post new quizzes on June 11.

Sample solutions and discussion
Perl Quiz of The Week #14 (20030611)
----------------------------------------------------------------

Write a program which generates an HTML table of contents for a
directory of files.  It should read the directory, producing a list of
the files, and write out a file "toc.html" in that same directory, in
the following format:

    <html><head><title>Table of Contents for [Directory Name]</title></head>
    <body>
    <h1>Table of Contents for [Directory Name]</h1>
    <ul>
    <li><a href="file1">file1</a>
    <li><a href="file2">file2</a>
    ...
    <li><a href="lastfile">lastfile</a>
    </ul>
    </body></html>
The files should be listed in alphabetic order.  The program should
accept a command-line argument that tells it what directory to index;
if the argument is omitted, it should index the current directory.

----------------------------------------------------------------

I only saw one solution posted on the perl-qotw-discuss list, from Pr.
Offer Kaye; I also wrote one myself.  Offer's uses a number of
standard modules to take care of escaping HTML and URLs:

    use strict;
    use warnings;
    use Cwd;
    use URI::file;
    use HTML::Entities;

    my $toc_file = "toc.html";
    if ($#ARGV > 0) {
        die "Incorrect usage. Use \"$0 dir_name\" or just: \"$0\"\n";
    }
    my $dir_name = ($#ARGV == 0) ? $ARGV[0] : cwd();
    opendir(DIR, $dir_name) or die "Can't opendir $dir_name: $!\n";
    my @list = sort {lc($a) cmp lc($b)} readdir(DIR);
    chdir $dir_name or die "Couldn't chdir to $dir_name for some reason...\n";
    open(OUT,">$toc_file") or die "Couldn't open $toc_file for writing: $!\n";
    print OUT "<html><head><title>Table of Contents for ";
    print OUT encode_entities($dir_name),"</title></head>\n";
    print OUT "<body>\n<h1>Table of Contents for ";
    print OUT encode_entities($dir_name),"</h1>\n<ul>\n";
\n"; for (@list){ next if (($_ eq ".") or ($_ eq "..")); my $u = encode_entities(URI::file->new($_)); print OUT "\n"; } print OUT ""; Most of this seems completely straightforeard and I don't have much to say about it. I wonder about using 'cwd()' instead of just '.'. I suppose '.' might not be portable, but if you want it to be portable, it would seem to be to be simpler to use File::Spec->curdir(), which then just returns '.'. The output from Pr. Kaye's program is a little peculiar if the filenames are peculiar. For example, I created a file named 'ooky' to make sure that things didn't start blinking. The encode_entities call turns this into '%3Cblink%3Eooky'. This is fine when it appears as a URL, but as HTML text, it's wrong; you actually see '%3Cblink%3Eooky' on the page, percent signs and all. It should have been turned into '<blink>ooky', which would have displayed as 'ooky'. Since I expected everyone would use the modules, I thought I'd see if it was posible to do it without the modules, while still using only features from the Llama book. This turns out to be possible, but just barely. (The modules are *not* discussed in the Llama book.) The main part of the program, of course, is almost the same: for my $f (sort readdir D) { my $url = escape_url($f); my $html = escape_html($f); print "$html
\n"; } The big question is how to implement 'escape_url' and 'escape_html' with only Llama features. 'escape_html' is quite easy: sub escape_html { my $t = shift; $t =~ s/&/&/g; $t =~ s//>/g; return $t; } For this project, this is just fine. 'escape_url', however, is much trickier. Normally, I would write something like this: sub escape_url { my $url = shift; $url =~ s/([^0-9A-Za-z.-_])/sprintf "%%%2x", ord $1/ge; return $url; } This uses *three* non-Llama features: 1. The ord() function Page 166 says "While constructing and interpreting such a byte string is fairly straightforward using 'chr' and 'ord' (not presented here)...". Since they're not presented, I can't use them. 2. sprintf "%x" 'printf' and 'sprintf' are discussed, but never the "%x" escape. 3. s///e s/// is of course discussed in detail, but never /e. The book mentions it in passing at the very very end; there's an example on page 253. I almost went ahead and used it. Then I noticed that typeglobs are also mentioned on page 253, so that allowing page 253 features would make a mockery of the 'Llama only' restriction. So I decided that s///e was forbidden. (Reminder: The 'Llama only' rule applies only to me; everyone else can of course use whatever they want to.) The big difficulty: how to convert "x" to hexadecimal without using ord() and sprintf()? I briefly considered using something like s/%/%25/g; s/\./%2c/g; s/$title

$title

"; for my $f (sort readdir D) { my $url = escape_url($f); my $html = escape_html($f); print "$html
\n"; } print "\n\n"; ################################################################ sub escape_url { my @chars = split //, $_[0]; my $result = ""; for (@chars) { my $ord = Ord($_); if ($good_char{$_}) { $result .= $_; } else { my $h0 = $ord % 16; my $h1 = ($ord - $h0) / 16; $result .= "%$hex[$h1]$hex[$h0]"; } } return $result; } sub escape_html { my $t = shift; $t =~ s/&/&/g; $t =~ s//>/g; return $t; } ---------------------------------------------------------------- 1. Alert readers will notice that my program forgot to escape the directory name in the title of the document. I didn't realize this until I saw Pr. Kaye's solution. Whoops! 2. I violated my own spec, which said "[the program] should write out a file "toc.html" in that same directory...". When time came to write the program, I decided the spec was dumb, and opted to have it write the output to STDOUT instead. If you think the specified behavior is better, just add open STDOUT, ">", "$dir/toc.html" or die ...; near the top of the program. I still think the specified behavior is dumb. The program is more flexible without it. What was I thinking, anyway? 3. Pr. Kaye says: Note that my solution will not work for ALL cases- just the more common. Plus, the HTML file created is very simplistic. A better ('better' as in more complete/robust) solution would perhaps be to determine the proper encoding (either based on the filenames or through a command-line switch) and create a (valid) XHTML file, properly formatted for that encoding. Or perhaps use UTF-8 regardless. This is a huge issue that I never considered at all. It reminded me of the time I asked for some simple calendrical computation and the -discuss list was awash with people asking about the French revolutionary calendar. But Pr. Kaye raises a good point. Pr. Kaye lives in Israel, and may very well encounter files whose names contain Hebrew characters. I have no idea of the corect way to deal with this. I would not be surprised to learn that my program fails miserably when presented with filenames containing Hebrew characters. (I would also not be surprised to learn that it works perfectly as long as the output file contains a line specifying that the character encoding is UTF-16. I really can't exaggerate my ignorance here.) Pr. Kaye continues: Another point is that I haven't got a way to test this on different file systems, so I'm not sure it will work everywhere- although I've done my best to try to make the solution robust in that sense. But I'm pretty sure that the script will not currently work across file systems. 4. Robert Spier asks: At this point, I'm stumped. What does Pr. stand for? I liked refering to people as "Mr." in previous reports. But then I got worried, because this assumes that all these people are men. I do not want to assume that, and really, I don't know. So I decided to invent a new formal title for programmers that would evade the issue. 'Pr.' stands for 'Programmer'. Lawyers get to impress people by putting 'Esq.' after their names. Now programmers can impress people by using a special title too. Thanks again to everyone who particpated, including Pr. Kaye and also those people who worked the problem in private and said nothing. I will post another quiz tonight. [[ When I sent out this week's quiz, I forgot to mention that it had been contributed by Geoffrey Rommel, who also contributed the discussion below. Thank you, Pr. Rommel! - MJD ]] This quiz is phrased for Unix systems. 
If it makes sense to write a solution for Windows or other systems,
feel free to do so.

The usual way to look for a character string in files in Unix is to
use grep.  For instance, let's say you want to search for the word
'summary' without regard to case in all files in a certain directory.
You might say:

    grep -i summary *

But if there is a very large number of files in your directory, you
will get something like this:

    ksh: /usr/bin/grep: arg list too long

Now, you could just issue multiple commands, like this:

    grep -i summary [A-B]*
    grep -i summary [C-E]*
    etc. ...

but that's so tedious.  Write a Perl program that allows you to search
all files in such a directory with one command.

You're probably wondering:

  - Should I use grep?  egrep?  fgrep?  Perl's regex matching?
  - Should there be an option to make the search case-sensitive or not?
  - Should we traverse all files under all subdirectories?

You can decide for yourself on these questions.  There is one other
requirement, though: the program must not fail when it finds things
for which grepping does not make sense (e.g. directories or named
pipes).

----------------------------------------------------------------

This quiz was suggested to me by a directory on one of my servers
where all of our executable scripts are stored.  This directory now
has over 4200 scripts and has gotten too big to search.  The solution
shown here works for my purposes, but I do not wish to deprecate the
ingenious solutions found on the discussion list.  I will try to
evaluate and discuss them in a separate message.

As MJD mentioned, Perl regex matching is clearly superior to the
alternatives.  Since the original purpose was to search a directory of
scripts, the search is not case-sensitive; that option could be added
easily enough.  We search only files (-f) in the specified directory,
not in lower directories.  I also test for "text" files (-T) because
my Telnet client gets hopelessly confused if you start displaying
non-ASCII characters.

    #!/usr/bin/perl
    # The bin directory is too large to search all at once, so this
    # does it in pieces.

    ($PAT, $DIR) = @ARGV[0,1];
    $DIR ||= "";
    die "Syntax: q16 pattern directory\n" unless $PAT;

    open(LS, "ls -1 $DIR |") or die "Could not ls: $!";
    @list = ();
    while (<LS>) {
        chomp;
        push @list, (($DIR eq "") ? $_ : "$DIR/$_");
        if (@list >= 800) {
            greptext($PAT, @list);
            @list = ();
        }
    }
    greptext($PAT, @list);
    close LS;
    exit;

    sub greptext {
        my ($pattern, @files) = @_;
        foreach $fname (@files) {
            next unless -f $fname && -T _;
            open FI, $fname;
            while (<FI>) {
                chomp;
                print "$fname [$.]: $_\n" if m/$pattern/oi;
            }
            close FI;
        }
    }

----------------------------------------------------------------

[[ Administrative note: So far very few people have contributed
quizzes.  Right now we have one expert and one regular quiz ready to
go.  We need more, because unless more are contributed, we will run
out in two weeks.  This mailing list has 1257 people subscribed to it.
If each person contributed just one quiz, we would be all set for the
next 24 years.  Please send quizzes, or even just quiz ideas, to
perl-qotw-submit.  Thanks, - MJD ]]