From kragen@dnaco.net Wed Aug 26 12:04:05 1998
Date: Wed, 26 Aug 1998 12:04:03 -0400 (EDT)
From: Kragen
To: "Bradley M. Kuhn"
cc: clug-user@clug.org
Subject: Re: another way to do it (was Re: Web Page Help)
In-Reply-To: <19980826111830.46777@ebb.org>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Keywords:
X-UID: 1454
Status: O
X-Status:

On Wed, 26 Aug 1998, Bradley M. Kuhn wrote:
> Thus spoke Kragen:
> > while (<>) {
>
> ARGH!  Why not use:
>
>    while ( defined(my $line = <>))
>
> It handles the zero byte problem gracefully. :)

Why not?  Because I'd never heard of the zero byte problem.  Are you
saying that the string "\0" is false as a boolean?  Maybe we should fix
this problem here.

>
> > if (/^From.*\s+([A-Za-z][A-Za-z][A-Za-z])\s+\d+\s+[0-9:]+\s+(?:\S+\s+)?((19|20)\d\d)\s*$/)
>
> That From.*\s+ is a bit dangerous, don't you think? :)
>
> It will slow you down severely on non-match....

I thought about its slowness on match (since it'll start trying to match
near the end of the line -- maybe .*?\s+ would be better), but I didn't
think about the fact that it was exponential-time in the amount of
whitespace on a false From line, at least, with an NFA implementation.
Doesn't Perl use a DFA, though?  I'm not really familiar with optimizing
regexes (even Perl ones).

> How about this little rewrite:
>
> ###############################################################################
> use strict;
> use Date::Manip;
>
> my $lastOpened = "";

Optimization alert!

> while ( defined(my $line = <>)) {
>     if (my($date) = $line =~ /^From\s+\S+\s+(.+)$/) {
>         my $filename = &UnixDate($date, "%Y-%m");

I'm not familiar with UnixDate.  What does it do?  And why are you
explicitly &ing the routine?

Are you sure that \s+\S+\s+ will skip over everything before the date?
I wasn't, because I wasn't familiar with the UUCP-style "From" line's
standard.

>         if ($filename eq "") {
>             warn "Line $.: found a From line without a date!: $line";

Is good.

>             print OUTPUT $line unless ($lastOpened eq "");
>             next;
>         }
>         unless ($lastOpened eq $filename) {
>             close(OUTPUT) unless ($lastOpened eq "");

You can close an unopened filehandle safely, can't you?  And forgetting
to close a filehandle is safe, isn't it?

>             open(OUTPUT, ">>$filename") || die "Cannot open $filename: $!\n";
>         }
>         $lastOpened = $filename;
>     }
>     $lastOpened ? print OUTPUT $line :
>                   warn "Line $.: precedes any valid From lines: $line";
> }
> ###############################################################################
>
> The Date::Manip takes care of most of your worries with the date regex.
> Date::Manip can be found on CPAN.
>
> There are also a variety of Mail:: handling packages as well, but using the
> regex is probably just as good and much faster.
>
> My version has the following advantages over Kragen's (however, Kragen's was
> a fine start :):
>      - Does not reopen the same file as many times....keeps track to see
>        if the last file was the same

This ought to be a big win.

>      - has greater date handling functionality, using Date::Manip.

Is this a good thing?

>      - has more efficient regex for checking From line.

Definitely.

> BTW, I ran this on a 3.9MB mail file that split into 29 different months.
> It took 1.5 minutes on my Pentium 90.

That's interesting.  It took mine four seconds to do a 1.1MB file on a
5x86-133 -- which is approximately equal to a P75 -- but running yours on
a similar machine, with roughly four times as much mail, took 20 times as
long?  Are you sure it wasn't a 39MB mail file?

Maybe UnixDate() is drowning the win from the better regex and the saved
system calls.

Kragen

--
Kragen Sitaker
We are forming cells within a global brain and we are excited that we might
start to think collectively.  What becomes of us still hangs crucially on
how we think individually.  -- Tim Berners-Lee, inventor of the Web
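
P.S.  On the question of whether "\0" is false as a boolean: a minimal
sketch, not taken from either script in this thread, that checks how Perl
treats a few strings in boolean context.  The helper name is_true is made
up for illustration.  The only strings Perl treats as false are "" and
"0", so the risky case for a loop that tests the line itself appears to be
a last line holding a bare "0" with no trailing newline, not a NUL byte.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string is true in boolean context, else 0.
sub is_true {
    my ($s) = @_;
    return $s ? 1 : 0;
}

# Candidate "lines" as a read might return them.
my @cases = (
    [ 'NUL byte',           "\0"  ],
    [ 'zero with newline',  "0\n" ],
    [ 'bare zero',          "0"   ],
    [ 'empty string',       ""    ],
);

for my $case (@cases) {
    my ($label, $s) = @$case;
    printf "%-18s true=%d defined=%d\n",
        $label, is_true($s), defined($s) ? 1 : 0;
}
```

Every one of those strings is defined, which is why testing defined()
on the line only ends the loop at end of file.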