Oct 17

Glockentierchen

By polypompholyx in Teaching

It’s nice to find out that – old and jaded as I am – biology can still delight me.

Earlier this week, I covered a first-year practical teaching microscopy skills through the medium of pond water. As well as the usual desmids, rotifers, Daphnia and nematodes, one lucky student found a little clump of Vorticella (or some near relative: I’m not a protistologist!).

Their German name Glockentierchen (“little bell creatures”) delights me almost as much as they do:

Vorticella micrograph [CC-BY-SA-3.0 Steve Cook]

Vorticella clump. Sorry about the rather poor quality of the image: the trinocular microscope was broken, so this was taken with an iPhone pointed down the eyepiece. Miracle it worked at all!

They’re actually quite common in freshwater, but this is the first time I’ve seen them in our pond samples. Each one is a single cell, attached to the substrate by a thin stalk, which can contract like a spring. Despite their cute bounciness, they’re voracious predators: the ‘head’ end has a cell-sized mouth (cytostome) surrounded by tiny beating hairs (cilia) which help funnel prey into the cell to be digested. The diagram below is of Stentor rather than Vorticella, but gives you the rough idea:

Stentor diagram [CC-BY-SA-3.0 Steve Cook]

Stentor is another predatory ciliate with a similar form to Vorticella: the main difference being that the ‘stalk’ in Stentor is much fatter, whereas that of Vorticella is thin and contractile.

Vorticella is one example of a very large group of single-celled organisms called ciliates, which include the slightly better-known Paramecium. Their cells are amazingly intricate, and have sub-cellular compartments (organelles) playing similar roles to those that multicellular organisms have entire organs for. That such complexity can exist in a drop of pond-water smaller than your fingernail is one of the things that makes biology truly awesome.

Oct 05

Installing modules

By polypompholyx in Perl

CPAN

With bits and bobs, we’ve covered much of the core functionality of Perl, but perl also comes with dozens of useful modules in its standard distribution. CPAN expands this with an enormous wealth of other contributed modules that you can install. There are three ways of doing this:

The most direct method is to go directly to CPAN, download the module you want (as a zipped tarball), unzip it and untar it, change directory to the folder you’ve just unpacked, and build the module with dmake or whatever make program you have available. So, if you wanted to install the Parse::RecDescent module, you’d download the Parse-RecDescent-1.967009.tar.gz tarball from CPAN, and then:

gzip -d Parse-RecDescent-1.967009.tar.gz
tar -xf Parse-RecDescent-1.967009.tar
cd Parse-RecDescent-1.967009
perl Makefile.PL
dmake
dmake test
dmake install

However, this relies on a number of Unix utilities that you may not have if you’re on Windows (but can easily obtain), and will fail if the module has any dependencies that you haven’t already installed.

The other two ways are much easier. The first is to use the CPAN.pm module, which comes with the core perl distribution. If you type:

perl -MCPAN -e shell

at a command prompt, it will open a CPAN shell that can install modules for you. Accept the default configuration options. To install a module, all you then need type into the CPAN shell is:

install Parse::RecDescent

and this will be done automatically.
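If you want to check that an install actually worked, one quick sketch is to try loading the module and ask it for its version. The helper name module_version is made up for illustration, and File::Copy is only used here because it ships with perl:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the module's version string if it loads,
# or undef if it is not installed (or fails to compile).
sub module_version {
    my ($module) = @_;
    ( my $file = "$module.pm" ) =~ s{::}{/}g;    # Parse::RecDescent -> Parse/RecDescent.pm
    return unless eval { require $file; 1 };
    return $module->VERSION // 'unknown';
}

for my $module ( 'File::Copy', 'Parse::RecDescent' ) {
    my $version = module_version($module);
    print defined $version
        ? "$module $version is installed\n"
        : "$module is not installed\n";
}
```

The eval { require ... } wrapper stops a missing module from killing the script, which is the whole point of the check.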

The alternative for those using the ActiveState port of perl on Windows is to use the Perl package manager, ppm. If you type:

ppm install Parse-RecDescent

at a command prompt, ppm will do its best to install the module for you. This technique relies on the fact that someone else has done the equivalent of the Makefile.PL/dmake dance on Parse::RecDescent and uploaded the result to the ActiveState ppm repository as a file named Parse-RecDescent.ppd (note the hyphen replacing the double-colons). YMMV, as the coverage of CPAN by the ActiveState repository is not 100%.

Some Perl modules contain extensions written in C (so-called XS modules), usually to increase the speed with which they are able to run. When these modules are built with make, the C is compiled and linked, for which you need a C compiler. If you are on Windows and using the ActiveState distribution, you will find it useful to install the C compiler MinGW so that you can compile such modules. This is a simple matter of typing:

ppm install MinGW

at a command prompt.

Some modules I could not do without…

Command line option processing

The Getopt::Long module allows you to easily modify the behaviour of a script according to the switches you pass it:

use Getopt::Long;
GetOptions (
    # takes a hash of ( switch => reference in which to put its value ) pairs
    'help' => \ my $help,
        # simple boolean switch, sets $help to 1 if the --help switch
        # was present, and leaves it undefined otherwise
    'verbose!' => \ my $verbose,
        # the ! indicates that --verbose and --noverbose are both valid switches
    'source=s' => \ my $source,
        # =s indicates that the source option requires
        # a string argument, --source=c:/temp
    'number:i' => \ my $number,
        # :i indicates that the number option
        # takes an optional integer argument, --number=3
);
$number ||= 666;
    # $number is set to 666 if it's not been set by the switches
    # note that this would overwrite a switch of -n 0
$number = 666 unless defined $number;
    # if you would like -n 0 to not be ignored!

Note the direct use of references to lexically scoped variables (\ my $var) to save you having to declare the switch-recording variables before calling GetOptions() . Note also the use of ||= and unless defined to default the values of unset switch variables.
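If your perl is 5.10 or later, the defined-or operator //= does the job of unless defined in one go. Here is a minimal sketch; the parse_switches helper is invented for the example, and it uses GetOptionsFromArray (which Getopt::Long exports on request) so the switches can be passed in as a list rather than read from @ARGV:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse a list of switches and apply defaults.
sub parse_switches {
    my @args = @_;
    GetOptionsFromArray(
        \@args,
        'verbose!' => \ my $verbose,
        'number=i' => \ my $number,
    ) or die "Bad switches\n";
    $verbose //= 0;      # defined-or: only fills in a default when the value is undef
    $number  //= 666;    # so an explicit --number=0 survives
    return ( $verbose, $number );
}

my ( $verbose, $number ) = parse_switches( '--verbose', '--number=0' );
print "verbose=$verbose number=$number\n";
```

Because //= tests definedness rather than truth, a deliberate zero is not clobbered by the default.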

You can then run your script with the switches:

script.pl --verbose
script.pl --noverbose
script.pl --verbose --source="C:/temp/in.txt" --number=9
script.pl --help

to enable variations on a theme. You can also get away with shortening these to unambiguous strings, leaving out the equals signs and just using a single dash:

script.pl --verbose --number=9 --source="C:/temp/in.txt"

script.pl -v -n 9 -s "C:/temp/in.txt"

perldoc Getopt::Long for details.

Interacting with databases

The DBI (database interface) module allows you to talk with a database server, like MySQL, Oracle, etc. To use it, you use DBI; then tell the DBI module which driver (DBD:: module) you want to use to talk to the database server, e.g. DBD::mysql for MySQL. Here’s a brief foray:

use DBI;
my $sql_query_string = 
    "SELECT Genus, Species FROM Plants WHERE Genus = 'Drosera'";
my $dsn = "DBI:mysql:database=plants;host=localhost;port=3306";
    # $dsn contains the configurations for MySQL,
    # such as the server and database to use
my $dbh = DBI->connect( $dsn, "Username", "Password1" )
    or die "Can't connect to server\n";
    # create a database handle and connect it to the database
my $sth = $dbh->prepare( $sql_query_string );
$sth->execute();
while ( my $row = $sth->fetchrow_hashref() ) {
    print "$row->{Genus} $row->{Species}\n";
        # $row is a hashref keyed by column name
}
$dbh->disconnect();

Again, for more details, perldoc DBI.

Handling files portably

The File:: modules have various utilities for messing with files. Although Perl has the rename function, which can be used to both rename and move files about:

rename "C:/flunge.log", "D:/piffle.log";

it doesn’t have a native copy function (although it’s quite easy to roll your own badly):

local $/ = undef;          # set the input delimiter to nothing, so...
$whole_file = <$IN>;       # this slurps in the whole file from the IN filehandle
print $OUT $whole_file;    # print it to another filehandle OUT

File::Copy allows you to copy files about with ease and without worry:

use File::Copy;
copy( "D:/index.html", "F:/backup/index.html");

File::Find allows you to traverse a directory tree recursively, and do stuff to files in it:

use File::Find;
find( \&wanted, "D:/perl" );
    # find takes a list of arguments
    # the first is a reference to a subroutine to run each time a file is found
    # the rest are the directories to search, here just one item, D:/perl
sub wanted {
    # the name of the file found is put in $_
    # the current directory path and file is put in $File::Find::name
    if ( /\.(htm|html)$/ ) { print "Found an HTML file $File::Find::name\n"; }
}

You’ll also want to look in the File and Cwd namespaces if you ever find yourself wanting to create a temporary file, concatenate file and directory names in a platform-independent way, or parse a filename into drive, path and file:

use 5.14.1;
use File::Spec;
use Cwd;

my $cwd = getcwd;
    # imported from Cwd
say File::Spec->catfile( $cwd, 'subdirectory', 'filename.log' );
    # catfile concatenates a list of directories and a filename with appropriate / or \\
    # catdir does similar but for a list of directories

my $absolute_path = File::Spec->rel2abs( '..\Python' );
say $absolute_path;

my ($drive, $directories, $file ) =
    File::Spec->splitpath( 'H:\\Perl\\bin\\h2xs.bat' );
say "# $_" for $drive, $directories, $file;

H:\Perl\subdirectory\filename.log
H:\Python
# H:
# \Perl\bin\
# h2xs.bat

Using internet protocols through Perl

The Net modules are Perl implementations of Internet protocols like FTP:

use Net::FTP;
my $ftp = Net::FTP->new( "ftp.myhost.com" );
    # connect to server, note this is an OO module
$ftp->login( "Bob", "Password1" );
$ftp->cwd( "/files" );
$ftp->get( "the_one_i_want.txt" );
$ftp->quit();

Similar implementations of all the other Internet protocols are available: see perldoc Net::blah for the documentation of each.

Manipulating lists

List::Util is a good place to look to avoid wheel reimplementations:

use List::Util qw(first max min reduce shuffle sum);
my @list     = ( 1, 32, 8, 4, 16 );
my $max      = max @list;
my $min      = min @list;
my $sum      = sum @list;
my $first    = first { $_ > 10 } @list;
my @shuffled = shuffle @list;
my $product  = reduce { $a * $b } @list;
print <<"__REPORT__";
Max:      $max
Min:      $min
Sum:      $sum
First:    $first
Shuffled: @shuffled
Product:  $product
__REPORT__

Max:      32
Min:      1
Sum:      61
First:    32
Shuffled: 8 4 32 16 1
Product:  16384

reduce calls the block you pass it repeatedly (much like sort), so can be used to perform various map to scalar conversions, although the module already comes with five of the most useful, and List::MoreUtils has even more.
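To make that a little more concrete, here is a small sketch of two home-made reductions. The list of genera is just example data; inside the block, $a holds the running result and $b the next item:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

my @genera = qw(Drosera Sarracenia Nepenthes Utricularia);

# Keep whichever of the pair is longer; ties keep the earlier item.
my $longest = reduce { length($b) > length($a) ? $b : $a } @genera;

# Build up a single string from left to right.
my $joined = reduce { "$a, $b" } @genera;

print "Longest: $longest\n";
print "Joined:  $joined\n";
```

The second one is only there to show the mechanics: in real code join ', ', @genera is the obvious way to do it.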

Graphical user interfaces

I wouldn’t necessarily recommend Tk these days (I’d probably suggest Wx, but have not actually used this), but sometimes you want something a little easier on the eye than a black box with a prompt in it:

use Tk;
my $mw = MainWindow->new; # Make a new window
$mw->title( "My first little GUI" );
my $button = $mw->Button(
    # Create a button, configure it with a (-key => value) hash
    -text    => "Hello world",
    -command => sub { exit(0) },
        # the -command key takes a coderef as its value, here to exit
);
$button->pack();
    # The button needs to be packed by the geometry manager into
    # the MainWindow to be visible
MainLoop();
    # Start the main event loop that handles the button clicks, etc.

Templating

HTML::Template is useful for creating HTML files from templates (*surprise*), but it is also useful grounding for other, more complex templating engines. The module allows three main constructs in the HTML template: variables, loops and conditionals, which is about as complex as you can embed into HTML without severely entangling the design with the technology. Here is a simple template for a list of species in a genus of plants:

<html>
  <head>
    <title><TMPL_VAR NAME="GENUS"></title>
  </head>
  <body>
    <h1>Genus <TMPL_VAR NAME="GENUS"></h1>
    <p>Species:</p>
    <ul>
      <TMPL_LOOP NAME="SPECIES">
      <li><TMPL_VAR NAME="EPITHET"> <TMPL_VAR NAME="AUTHORITY">
      <TMPL_IF NAME="COMMON_NAME">
        [<TMPL_VAR NAME="COMMON_NAME">]
      </TMPL_IF>
      <TMPL_IF NAME="IUCN"> - IUCN status <TMPL_VAR NAME="IUCN">
        <TMPL_ELSE> - Conservation status unknown
      </TMPL_IF>
      </li>
      </TMPL_LOOP>
    </ul>
  </body>
</html>

Filling in the template is a straightforward matter:

use strict;
use HTML::Template;

my $template = HTML::Template->new( filename => "monograph.html" );
$template->param( GENUS => 'Sarracenia' );
my @species;
while ( <DATA> ) {
    chomp;
    next if /^\s*$/;
    my ( $epithet, $authority, $common_name, $iucn ) = split /\s*:\s*/;
    push @species, {
        EPITHET     => $epithet,
        AUTHORITY   => $authority,
        COMMON_NAME => $common_name,
        IUCN        => $iucn,
    };
}
@species = sort { $a->{'EPITHET'} cmp $b->{'EPITHET'} } @species;
$template->param( SPECIES => \@species );
print $template->output;

__DATA__
alata : Alph.Wood : Pale pitcher plant : NT
flava : L. : Yellow pitcher plant : LC
leucophylla : Raf. : White pitcher plant : VU
minor : Walt. : Hooded pitcher plant : LC
oreophila : (Kearney) Wherry : Green pitcher plant : CR
psittacina : Michx. : Parrot pitcher plant : LC
purpurea : L. : Purple pitcher plant :
rubra : Walt. : Sweet pitcher plant :

HTML::Template has three important methods. The first is new():

my $template = HTML::Template->new( filename => "monograph.html" );

This creates a templating object which will fill in the gaps in a file called monograph.html, which is the HTML-ish file shown above. The second important method is param(), which takes a hash of name => value pairs:

$template->param( TMPL_VARIABLE_NAME => "value to substitute in" );
$template->param( GENUS => 'Sarracenia' );

Any occurrence of the tag:

<TMPL_VAR NAME="GENUS">

in the template will be replaced with the value Sarracenia. When you come to use output,

print $template->output;

the template object will duly fill in the gap:

<h1>Genus Sarracenia</h1>

The module also allows for conditionals and loops. To create loops, rather than using a simple hash, you use a reference to an array of hashrefs instead:

my @species;
while ( <DATA> ) {
    my ( $epithet, $authority, $common_name, $iucn ) = split /\s*:\s*/;
    push @species, {
        EPITHET     => $epithet,
        AUTHORITY   => $authority,
        COMMON_NAME => $common_name,
        IUCN        => $iucn,
    };
}
$template->param( SPECIES => \@species );

which generates something like this in the output:

<li>alata Alph.Wood [Pale pitcher plant] - IUCN status NT</li>
<li>flava L. [Yellow pitcher plant] - IUCN status LC</li>
<li>...

If you pass the param() method a ( SPECIES =>\@array_of_hashrefs ) pair, the module will look for a corresponding <TMPL_LOOP NAME="SPECIES"></TMPL_LOOP> pair in the template. So in this case, we define an arrayref called SPECIES, which contains a number of { EPITHET => "flava", AUTHORITY => "L.", etc } hashrefs in the script. When we send this data to the template, it sets <TMPL_VAR NAME="EPITHET"> and <TMPL_VAR NAME="AUTHORITY"> to each of the corresponding values from the loop variable.

You’ll also notice that testing for conditionals is just as easy:

<TMPL_IF NAME="IUCN">
    IUCN status <TMPL_VAR NAME="IUCN">
<TMPL_ELSE>
    Conservation status unknown
</TMPL_IF>

We set a parameter in the template object called IUCN in the script. In the template, if this is TRUE, then the HTML between the <TMPL_IF NAME="IUCN"></TMPL_IF> will be filled in appropriately and outputted. You can also (as we have done here), specify a <TMPL_ELSE> within this structure to be filled in and outputted if IUCN is FALSE.

I also use Win32::OLE and Parse::RecDescent a huge amount, but they will be posts all of their very own.

Next up…packages and writing modules.

Oct 05

Command line

By polypompholyx in Perl

Command line Perl

Despite being suitable for large projects, Perl grew out of Unix shell scripting, so it allows you to run code from the command line directly:

perl -e "print 'Hello'; print 2 + 2;"

if you’re on Windows, or:

perl -e 'print q:Hello:; print 2 + 2;'

if you’re on Unix. The quotes matter. The -e is the execute switch, and there are several others:

perl -v

will get you information on the version of perl you’re running.

perl -h

will tell you all the switches you can use.

perl -w

turns on warnings (and note that all these switches can also be bunged on the shebang line of a normal script if you’d rather).

perl -c programfile.pl

checks the syntax of a perl script programfile.pl without actually executing it.

Taint

perl -T

turns on taint mode. When running under taint, perl will not allow you to do certain unsafe things. In particular, anything entered by users of the script will be considered tainted, and you will not be allowed to use it in potentially dangerous things like:

system( $x );

To untaint data, you need to run any externally sourced strings through a pattern-match-and-capture, which perl assumes you’re sensible enough to write so as to preclude the possibility of letting bad things through.
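As a rough sketch of what that match-and-capture looks like, the helper below only lets through simple filenames. Both the name untaint_filename and the whitelist of allowed characters are assumptions for the example: you would tighten the pattern to suit whatever you are going to do with the data.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only simple filenames (letters, digits, _ . -)
# and return the captured copy, which perl regards as untainted under -T.
sub untaint_filename {
    my ($input) = @_;
    return $1 if defined $input and $input =~ /\A([\w.-]+)\z/;
    return;    # anything else is rejected
}

for my $candidate ( 'report.txt', 'report.txt; rm -rf /' ) {
    my $clean = untaint_filename($candidate);
    print defined $clean ? "OK: $clean\n" : "Rejected: $candidate\n";
}
```

The important bit is that you use the captured $1, not the original variable: only the capture is considered clean.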

Loading modules on the command line

perl -MFile::Find -e "find( sub{ print qq($_\n) if /\.jpg$/i }, shift )" D:/pictures

The -M switch allows you to use a module from the command line, here File::Find. This one-liner just prints every JPEG file found in D:/pictures.

Text manipulation from the command line

Here’s a little one-liner that prints out the palindromes found in a dictionary file, using the fact that reverse will reverse a string in scalar context:

perl -l -n -e "print if $_ eq reverse" custom.dic

This one has two new switches, -n and -l. The -n switch assumes a:

while ( <> ) { ... }

loop around the program. The filehandle-less <> diamond operator is a shorthand for opening and reading in every file that follows on the command line, line-by-line. The -l switch does line-end processing (i.e. chomp): it’s quite clever, in that it removes trailing newlines from lines read in from these files, but it adds them back on again if you print them. So:

perl -lne "print if $_ eq reverse" custom.dic

is shorthand for something like:

$\= $/;
for my $file ( @ARGV ) {
    open my $FILE, '<', $file or die "Can't open '$file': $!\n";
    while ( defined ( $_ = <$FILE> ) ) {
        chomp;
        print if $_ eq reverse $_;
    }
}

Only much more convenient. The first line, $\= $/; sets the output record separator $\ to the input record separator $/ . You have met the second of these before: it contains the character that chomp will remove from strings, and that delimits lines when reading from a <FILEHANDLE>. It is ordinarily set to \n. The $\ variable is the mirror image of $/ : it’s what perl will append to every call to print. It is ordinarily set to undef. By setting the output record delimiter to the input record delimiter, our chomped newlines will be re-appended when we print $_.
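You can watch the two separators at work in an ordinary script too. This is a minimal sketch that prints to an in-memory filehandle (the $buffer variable and the two sample words are just for the example) so the result is easy to inspect:

```perl
use strict;
use warnings;

my $buffer = '';
open my $OUT, '>', \$buffer or die "Can't open in-memory filehandle: $!\n";
{
    local $\ = "\n";    # output record separator, appended by every print
    for my $line ( "level\n", "rotor\n" ) {
        my $word = $line;
        chomp $word;            # removes a trailing $/ (a newline by default)
        print {$OUT} $word;     # ...and $\ puts one back on
    }
}
close $OUT;
print $buffer;
```

Using local confines the change to the block, so the rest of the script still gets the normal behaviour of print.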

File manipulation from the command line

Another useful switch is -i (also accessible by setting the special variable $^I). It does an in-place edit of the files on the command line, so whenever you are thinking “Oh, I want to do blah to files foo, bar and baz, but want to back them up first”, look no further than:

perl -i.bak -p -e "blah" foo bar baz

The -i.bak switch tells perl to copy each file it opens to a back-up file with .bak appended to it (or whatever you specify). This effects our backing-up. We then use the -p and execute switches to manipulate the original files.

The -p switch is the same as -n, only it prints the result of the manipulation at the end. Not only this, but when you are doing in-place editing, it prints the result to a special filehandle called ARGVOUT, which is opened for writing on each file on the command-line. Under normal circumstances, a plain print; statement will print STDOUT $_;, but when you are in-place editing, the -p will do an implicit print ARGVOUT $_; instead.

So to replace every vowel in a file called abjadify.txt with x’s, we can use:

perl -i.bak -p -e "tr/aeiouAEIOU/xxxxxXXXXX/" abjadify.txt

Which will back up our original file to abjadify.txt.bak, transliterate all the vowels to x’s, then write the result to the file, producing something like:

Xt's qxxtx xxsy tx rxxd x lxngxxgx yxx knxw wxll whxn yxx rxplxcx xll thx vxwxls
wxth x's, xs lxngxxgxs xrx vxry rxdxndxnt xnd thxrx xrx xsxxlly lxts xf clxxs 
tx dxstxngxxsh bxtwxxn wxrds lxkx bxxts xnd bxxts, bxt nxt hxrx xbvxxxsly!
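The same trick works inside a script, by setting $^I and @ARGV yourself before looping over <>. The sketch below makes its own scratch file in a temporary directory (the sample text is made up) so it is safe to run anywhere:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Create a scratch file to edit.
my $dir  = tempdir( CLEANUP => 1 );
my $file = File::Spec->catfile( $dir, 'abjadify.txt' );
open my $OUT, '>', $file or die "Can't write '$file': $!\n";
print {$OUT} "Easy to read\n";
close $OUT;

{
    local $^I   = '.bak';     # same as -i.bak
    local @ARGV = ($file);    # the files that <> will read
    while (<>) {
        tr/aeiouAEIOU/xxxxxXXXXX/;
        print;                # goes to the edited file, not to STDOUT
    }
}

# Show the edited file.
open my $IN, '<', $file or die "Can't read '$file': $!\n";
print <$IN>;
close $IN;
```

Once the loop has run out of files, plain print goes back to STDOUT, and the original text is left alongside in abjadify.txt.bak.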

Next up…Debugging

Oct 04

Debugging

By polypompholyx in Perl

Debuggering

So now you know how Perl works, and how to use it both for scripts and one-liners. But what do you do when it doesn’t? And how do you use it for larger projects?

Perl has some bugs and misfeatures, but it’s extremely unlikely that you’ve found a new one that’s not in the docs, so if a program fails to run as you expect, the chances are it’s you that’s buggered up. How can you find out where you’ve inserted a problem?

Planning

The zeroth thing to do is make sure you don’t make life difficult for yourself in the first place: plan your code before you start. Work out what you want to do, how you plan to achieve it, and then write bare-bones prototype code:

#!/usr/bin/perl
use strict;
use warnings;

my ( $input_file, $output_file ) = @ARGV;
my @fields = parse ( $input_file );
open my $OUTPUT, ">", $output_file or die "Cannot open '$output_file' for writing:$!\n";
foreach ( @fields ) {
    print $OUTPUT "$_\n";
}

sub parse { 
    print "Got to the parser\n";
}

You can flesh out these bare bones later, testing each new bit of functionality as you go. This is especially useful if you have many largely independent subroutines to write. It’s easier to debug code when you know which block the mistake is in.

Modularisation

If you ever find yourself repeating a piece of code, it’s very likely you should be putting it in a subroutine. Which would you rather: debugging one occurrence of a possible bug (and all code is a possible bug), or debugging eighty? The same applies to subroutines themselves. If you ever use a subroutine across more than one script, perhaps you should be putting it in a module.

CPAN

Don’t reinvent the wheel: check out CPAN before you write a program that copies files (File::Copy), interfaces with a database (DBI), or traverses directory trees (File::Find). It’s extremely unlikely you will do a better job by hand-rolling these functions yourself.

Style

The other important thing is to make your code clean. Check out perldoc perlstyle, but – most importantly – be consistent no matter how you choose to write your code. Comment your code with #comments, but bear in mind that explaining the bigger picture, weird gotchas, and why you are doing things are all much more important than explaining the minutiae of how.

Documentation

Document your code and your API with POD. This will make life easier for anyone using or modifying your code later, which will probably include your own good self at some point. Although TIMTOWTDI, choose the most appropriate way. Which of these would you prefer to debug?

(open A,"<$ARGV[0]")||die($!);
($a,$b,$c,$d,$e)=split/\//,<A>;
print "$_\n" for($a,$b,$c,$d,$e);

or:

use strict;
use warnings;
my $input = shift;
open my $INPUT, '<', $input or die "Can't open '$input' for reading: $!\n";
print "$_\n" foreach split m{ / }x, <$INPUT>;

or:

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $input = shift; 
    # get the input file
open my $INPUT, '<', $input or die "Can't open '$input': $!\n";
    # open the input file
my $line = <$INPUT>; 
    # read a line from the input file
my @records = split m{ / }x, $line; 
    # split the records on slashes
foreach my $record ( @records ) { print "$record\n"; }
    # print them out \n delimited

These all do the same thing: the first is horrible: ugly formatting, leaning toothpicks /\//, no strict, $a, $b…where you really need an array (not to mention that using $a and $b is a bad idea because of sort), a nondescript A for a filehandle, (blah)|| instead of low precedence blah or.

The third buries its intent in spurious commentary: it’s obvious what it does, but I’ve used ten lines of wankingly self-indulgent verbosity when the second shows you can do the same thing much more clearly in just three. Well written code shouldn’t need many comments: it’s obvious already what the third code does without echoing every line in English. Reserve comments for nasty things like regexes, ugly-but-necessary constructs, and commenting the gist of paragraphs of code. Verbosity isn’t necessarily a good thing. It’s quite obvious what this does (read it backwards from split to foreach):

print "$_\n"
  foreach
    reverse
      sort { lc $a cmp lc $b }
        grep { ! /^#/ }
          split m{ / }x,
            ( "usr/bin/perl/#comment/blah" );

whereas:

my $string   = "usr/bin/perl/#comment/blah";
my @splat    = split m{ / }x, $string;
my @grepped  = grep { ! /^#/ } @splat;
my @sorted   = sort { lc $a cmp lc $b } @grepped;
my @reversed = reverse @sorted;
foreach my $item ( @reversed ) { print "$item\n"; }

has rather more chances of bugging up, if only from misspellings.

`use strict` and `use warnings`

Ensure the first lines of any code you write contain:

use strict;
use warnings;

These will catch some of the commonest mistakes, like trying to write to read-only filehandles, variables you only use once (probably misspellings), and so on. You may also find use diagnostics; helpful: it translates warnings into something more descriptive and gives you ideas on how to fix stuff.

Check your variables

OK, so you’ve not made life difficult for yourself in the first place. And it’s still not working. What next? Well, in general, you will know roughly where the cock up is: it’s probably close to the last bit of code you typed (or in a subroutine/modules used by that code you just typed). Sprinkling some print statements around liberally in the general area is a dirty but extremely effective way of debugging:

my $var = "this";
if ( $var = "that" ) { print "TRUE\n"; }

TRUE

Oops. Add print $suspect_variables:

my $var = "this";
if ( $var = "that" ) { print "$var\n"; print "TRUE\n"; }

that
TRUE

Assignment != equality

Ah. We have committed Perl goof number 1: getting = (assignment) and == (numerical equality, or eq for strings like these) mixed up:

$a = 20;
if ( $a = 2 ) { print "TRUE\n"; }

$a is assigned the value 2, which returns the value 2, which isn’t undef or 0, so it’s TRUE. D’oh! In a similar vein, don’t get eq and == mixed up, and don’t get = and =~ mixed up in regexes.

Dump your data

Sprinkling about print statements may be augmented by the use of any of the following.

For things nastier than a single string, like objects or hashrefs:

use Data::Dumper;
print Dumper( \$very_complex_data_structure );

For redirecting output:

open STDOUT, '>', "stdout.txt" or die $!;
open STDERR, '>', "stderr.txt" or die $!;
print "output this";
warn "whinge about this";

For avoiding Error 500: You can't write Perl when playing with CGI applications (if anyone actually does that any more!):

use CGI::Carp qw( fatalsToBrowser );

[{(“Balance”)}]

Besides messing up = and ==, some other frequent cock-ups include…

Messing up pairs and semicolons. It’s very easy to lose track of paired things like {} [] <> and (). Most text editors have a brace-matching function that will help you find missing braces and parentheses. Whether or not you’ve remembered to put a ; at the end of every line is also a common source of problems.

Quotes are even better at this; quotes in the general sense of " ", <<"HEREDOC"; HEREDOC\n, / /, s##!! , qx@@ , and tr||| . Don’t forget to escape quotes if you have to embed them. Furthermore, be warned: the ‘wrong’ line may be reported when an uncompilable program complains about things like this:

$string = "forgot the quote at the end,;
# so perl thinks the remaining lines are still string
print qq:$_\n: for ( 1.. 10 ) # and forgot the semicolon here too
print " and only now does it realise something bad has happened";

String found where operator expected at D:\Steve\perl\t.pl line 4, 
    at end of line (Missing semicolon on previous line?)
Can't find string terminator ' " ' anywhere before EOF at D:\Steve\perl\t.pl line 4.

Miscellaneous gotchas

Arrays and lists index from 0, not 1. Don’t forget many Perl operators will return different things in list or scalar context too: splice, localtime, each and arrays being common cases in point.
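A quick sketch of the context business, using an array and localtime. The list of epithets is just example data, and localtime(0) is used only so that the same moment is converted both times:

```perl
use strict;
use warnings;

my @pitchers = qw(alata flava minor);

my $count   = @pitchers;       # scalar context: the number of elements
my ($first) = @pitchers;       # list context: the first element
my $stamp   = localtime(0);    # scalar context: a human-readable date string
my @parts   = localtime(0);    # list context: (sec, min, hour, mday, mon, year, ...)

print "Count: $count\n";
print "First: $first\n";
print "localtime gave ", scalar(@parts), " fields in list context\n";
print "and the string '$stamp' in scalar context\n";
```

The parentheses around $first are what make the difference between getting the count and getting the first item.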

Printing to a closed filehandle is a silent error unless you use warnings; and it’s very easy to forget the > in:

open my $FH, '>', $f; # for writing

As far as modules go, bugs in your own modules are your problem: deal with them in the same way as you would with a script. Write a small script that uses a bit of functionality from the module, and make sure each bit works individually. Other people’s modules are usually quite well tested, but beware that some modules don’t work on all systems (they’ll warn you when you install them), and beware of using old scripts that may use old versions of modules, and vice versa.

Exponents are written like Fortran-esque **, not like Excel-esque ^, and bind more tightly than unary minus, hence -2**2 is -4, not 4. That matches the usual mathematical reading of −2², but it still catches people out (I have never found it intuitive myself). Precedence issues are also sometimes a problem: if in doubt, add parentheses.
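If you want to convince yourself, a three-line sketch does it (the variable names are only there for the example):

```perl
use strict;
use warnings;

my $unary_after = -2 ** 2;        # parsed as -(2 ** 2)
my $unary_first = (-2) ** 2;      # parentheses force the minus to apply first
my $right_assoc = 2 ** 3 ** 2;    # ** is right-associative: 2 ** (3 ** 2)

print "$unary_after $unary_first $right_assoc\n";    # prints "-4 4 512"
```

perldoc perlop has the full precedence table if you need to settle an argument.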

RTFM, STFW

If all else fails, you still have perldoc or the HTML that comes with the ActiveState distribution. The literature that comes with perl is extensive to a fault, so use it.

perldoc -f function_you_may_be_using_the_wrong_syntax_for
perldoc perltrap

perltrap is the ‘Traps for the unwary’ documentation: the above are the commonest problems for people such as me who came with no idea about other programming languages. If you are a C programmer or Python hacker, your gotchas (how do I take an address? What’re all these braces for?) may be different.

The PerlMonks website is lovely: it has a FAQ for common ‘how do I do this’ questions, a tutorial, and trawling through the archives is a good way of picking up tips. You can also post requests for help, which are almost always answered with grace and helpfulness, but RTFM before you waste someone’s time with a spurious question about why this:

$a = "foo"; print "TRUE" if $a = "bar";

doesn’t work.

Line-noise is a bug not a feature

Perl is a language for dirty little hacks, shell scripts and for confusing maintenance staff with all Th@t_L!Ne_\n0i$e, or so I’m told. Hopefully, all the nagging about use strict; POD, comments, the /x modifier for regexes and the importance of modules, classes and debugging means that your dirtier hacks are not something you’d want to release onto the world. Here are some tips that may help if you move from writing little scripts to programming bigger, portable applications. Note that I’ve only really developed scripts where the customer is me, so I have nothing useful to say about dealing with projects where the biggest problem is working out what the fuck the customer actually wants, rather than how to implement it.

Write the code, the tests and the documentation as you go along, rather than leaving one or other until the end. Without documentation you will end up squandering hours trying to understand the code you yourself wrote not a few months ago. Without tests, you will end up breaking old features in the same breath as adding new ones.

Keep an eye on the future extensibility and portability of your code. Don’t code in obvious portability flaws: use File::Spec rather than concatenating together file-paths with backslashes (which won’t work under Unix); don’t shell out with system to call programs that may not exist on another OS (and which you could probably emulate perfectly well using your own code, or someone else’s modules).

Be strict and well-formed in your output, but be forgiving to your input.

If your script has options, give them sensible defaults, and document them.

It’s much better to write clean, consistent, nicely formatted scripts that people like using and don’t mind maintaining, rather than writing twisty, messy guff that eventually ends up as some unmaintainable but irremovable ball of mud with a cargo-cult built up around it (“we know it works, we just don’t know how it works”).

Use $variables_that_mean_something, don’t call everything $file, $variable, $thing and @data.

Modularise your code, i.e. use loosely coupled modules and classes (do not create God objects), avoid global variables, and if there are a load of configurations, put them in a separate file, especially if they are accessed by many scripts, etc.

Time and memory

Even if you don’t have a Computing Science degree, don’t forget that computing is a form of engineering, and therefore of applied maths. Algorithms have memory and time costs, and algorithms O(>N²) do not scale: an algorithm whose time of execution or memory footprint (O) increases more rapidly than the square of the amount of input data (N) is unusable for anything but tiny, trivial cases. The thing to look out for here is embedded loops: if you have anything that looks like:

for my $first ( @all_the_data_items ) {
    for my $second ( @all_the_data_items ) {
        print "Match!\n" if $first eq $second;
    }
}

you are skirting the borders of unusability: for every ($first) item in @all_the_data_items, you make a comparison to every ($second) item in @all_the_data_items. Hence you will make N² comparisons, where N is scalar @all_the_data_items, and the script’s time of execution will increase with the square of N. Whatever you do, don’t put another loop in the inner loop that goes over all the data, or you’ll likely be dead by the time the script has dealt with more than ten thousand items (seriously: if it takes 1 ms to do the innermost loop thing, it will take (10000**3)/(1000*60*60*24*365) = 32 years). The Benchmark module may help you decide on issues of speed and optimisation, but ‘make it work, make it right, then make it fast’ is a traditional warning against the dangers of premature optimisation.
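If all you actually want from a nested loop like that is to spot repeated items, a hash will usually get you there in a single pass, trading a little memory for a lot of time. This is only a sketch: the data and the names %seen and @duplicates are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical data: any list of strings will do
my @all_the_data_items = qw( lemon lime orange lemon medlar lime lemon );

my %seen;        # item => number of times we have met it so far
my @duplicates;  # each repeated item, recorded once

for my $item ( @all_the_data_items ) {
    $seen{ $item }++;
    # record the item the second time we meet it, and only then
    push @duplicates, $item if $seen{ $item } == 2;
}

print "Repeated: @duplicates\n";
```

Each item is looked at once, so the work grows roughly with N rather than N².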

Testing

When you create a module Blah with standard tools, it will provide you with a boilerplate test script Blah.t. Inside a test script, you can probe outputs for known inputs and check that your module is working correctly. For example, if Blah exports a function called greet, you might well like to check that:

use Test::More tests => 2;
    # Note that you need to tell Test::More how many tests you intend on running
BEGIN { use_ok('Blah') };
    # Checks that the module compiles correctly
ok( greet( "Einstein" ) eq "Hello, Mr. Einstein" );
    # ok checks that the comparison made is TRUE.

The ok function from Test::More checks whether the argument it is given is true or not. If you ever have to refactor a module, it’s essential to know whether or not your new version behaves in the same way as the previous version. If you write tests that cover the code sufficiently well (checking every branch in the logic), then you will have a much better idea of whether your refactored module will be a drop-in replacement. Even better, you will be told immediately which tests are failing (and hopefully, therefore, where the code is still broken). Computer programs are largely black boxes to their users, and the main thing they are interested in is not the neatness of the box’s contents (which they’ll never look at), but the fact that no matter what’s in the box, when you give it input A, it always produces output B, and not a recipe for cheesecake or a segfault.
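Test::More has friendlier functions than plain ok, which are worth a look. The sketch below assumes a stand-in greet defined in the test file itself (rather than the imagined Blah module), so that it runs on its own; is and like report what they got versus what they expected when a test fails.

```perl
use strict;
use warnings;
use Test::More;

# Stand-in for the greet function that Blah is imagined to export
sub greet {
    my $name = shift;
    return "Hello, Mr. $name";
}

# is() compares what you got with what you expected
is( greet( "Einstein" ), "Hello, Mr. Einstein", "greets Einstein" );
# like() checks the output against a regex instead
like( greet( "Bohr" ), qr/^Hello, /, "greeting starts politely" );

done_testing();
    # lets Test::More count the tests itself
```

Using done_testing instead of tests => N saves you updating the count every time you add a test, at the cost of not noticing if the script silently stops early.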


Oct 04

Bits and bobs

By polypompholyx in Perl

This post is a bit of a rag-bag of useful stuff that doesn’t fit elsewhere.

Creating code at runtime

There’s another way of creating code on the fly, besides closures. This is eval. eval comes in two flavours: string and block. The block eval looks like:

$number = <STDIN>;
my $answer;
eval {
    $answer = 2 / $number;
};
print $@ ? "Divide by zero error or similar: $@" : $answer;

This is a useful (and, in traditional Perl, the only) way to catch exceptions, that is, divide-by-zero errors and their ilk. Note you need a ; at the end of the eval { BLOCK }; because eval is an expression rather than a loop-like construct, so the statement it sits in still needs terminating. If something goes wrong in the block, the special variable $@ is set with what went wrong. So eval-ling a block allows you to test Perl code, and make fatal things non-fatal.
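A slightly more defensive way of writing the same thing is to check the return value of the eval rather than $@ alone. This is a sketch of a common idiom, not the only way to do it; safe_divide is a made-up helper.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the quotient, or undef if the division died
sub safe_divide {
    my ( $top, $bottom ) = @_;
    my $answer;
    my $ok = eval {
        $answer = $top / $bottom;
        1;   # the block returns true only if nothing died
    };
    if ( not $ok ) {
        my $error = $@ || "unknown error";
        warn "Could not divide: $error";
        return;
    }
    return $answer;
}

my $half = safe_divide( 2, 4 );
my $oops = safe_divide( 2, 0 );   # warns, and leaves $oops undefined
print "half is $half\n";
print defined $oops ? "got an answer\n" : "no answer\n";
```

The trailing 1 means the eval returns true on success whatever the calculation itself returns, so a legitimate answer of 0 isn’t mistaken for a failure.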

If you actually want to create new code on the fly, you can use string eval:

my $name = <STDIN>;
eval "sub $name { return \"hello\" } ";

This will create a subroutine on the fly called $name that returns ‘hello’, which you can then call normally. The string is quoted, and the usual rules for double quoted string interpolation apply. eval is very powerful (i.e. dangerous), which you should be aware of before you even think of using it:

my $cmd = <STDIN>;
eval $cmd;

is going to get you in an awful lot of trouble if someone types in

system("rm -rf *")

on Unix, or

system("DELTREE c:\\windows")

if you’re on Windows. Beware.

What’s the time Mr Wolf?

$time = scalar( localtime );

print $time;

localtime returns the local time as a list: type perldoc -f localtime on a command line for details. The commonest way of using it, though, is calling it in scalar context, which returns a useful descriptive string.
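If you do want the individual bits, the list version is easy enough to pick apart. A minimal sketch: the variable names are my own, and strftime comes from the core POSIX module.

```perl
use strict;
use warnings;
use POSIX qw( strftime );

# In list context localtime returns nine fields; we name the first six
my ( $sec, $min, $hour, $mday, $mon, $year ) = localtime;
# $mon counts from 0 and $year counts from 1900, so adjust both
my $date = sprintf "%04d-%02d-%02d", $year + 1900, $mon + 1, $mday;
print "$date\n";

# strftime does the same sort of formatting for you
my $stamp = strftime( "%Y-%m-%d", localtime );
print "$stamp\n";
```

Forgetting the + 1900 and + 1 adjustments is the classic mistake here.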

Bed-time

sleep 1;

makes perl sleep for 1 second (ish: subject to some iffiness).

Time to get up

print "\a";

\a is an annoying alarm beep.

Pretty printing

Perl has two functions for this, printf and sprintf. Both use the same syntax, but printf actually prints the prettified string, whereas sprintf just returns it, so you can store the prettified version elsewhere, e.g. in a variable. The format is very complex: perldoc -f sprintf for the gory details.

printf usually takes at least two arguments. The first is a control string, the rest are the things to prettify. Control strings contain placeholders that start with %. They end with a letter that indicates the format you want: f is a fixed decimal floating point number, e is a scientific notation float, s is a vanilla string, u is an unsigned (no + or −) integer, and so on. Between the % and the letter can come some bits and bobs to specify the format you want your string in. If you put a number in, it specifies the minimum field width. If you put a - in front of that number, it means left justify, so:

my $string1 = "carrots";
my $string2 = "beans";
printf(
    "%-10s neatly lined up\n%-10s neatly lined up too",
     $string1,              $string2
);

carrots   neatly lined up
beans     neatly lined up too

See that the %-10s in the control string act like place-holders for the list of strings that follow. You can format your code nicely so that its readers will know which placeholder refers to which string.

Another useful one is:

my $string = "1.23465326362643743657563";
printf( "%.3f", $string );

1.235

A . followed by a number sets the precision, which for f means the number of decimal places. Incidentally, if you need to print a literal % character in a control string, you’ll need to escape it, like this: %%. I won’t cover them here, but another way of messing with strings is using the pack and unpack operators, which allow you to convert between strings (like “100”) and (for example) their binary equivalent (i.e. the actual 8-bit binary string 01100100). perldoc -f pack for the details.
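To see sprintf and a tiny bit of pack and unpack in action, here is a short sketch. The values are arbitrary; the %05d (zero-padding) format is one I haven’t mentioned above, but it follows the same pattern.

```perl
use strict;
use warnings;

# sprintf returns the formatted string instead of printing it
my $price   = sprintf "%.2f", 3.14159;     # two decimal places
my $percent = sprintf "%3d%%", 42;         # width 3, then a literal %
my $padded  = sprintf "%05d", 42;          # zero-padded to width 5
print "$price|$percent|$padded\n";

# pack the number 100 into a single byte, then unpack that byte as bits
my $byte = pack "C", 100;
my $bits = unpack "B8", $byte;
print "$bits\n";
```

This prints 3.14| 42%|00042 and then 01100100, which is 100 written in binary.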

Loop control

If you have a bunch of loops nested in each other:

while ( <$FILE> ) {
    for my $word ( split /\s+/, $_ ) {
        print "$word" unless $word =~ /^do_not_print_me$/i;
    }
}

You’ll often want to abort one or other of them prematurely. For this you’ll want next and last. Both act on the current innermost loop: next skips any remaining code in this iteration and restarts the loop with the next value, whilst last kills the innermost loop dead. In this case next will move onto the next $word, whilst last will ignore the rest of the $words generated by split:

while ( <$FILE> ) {
    for my $word ( split /\s+/, $_ ) {
        next if $word =~ /^#/;
            # ignore any 'word' starting with a #
        last if $word =~ /^END_OF_LINE$/;
            # ignore the rest of the words in the line
        print $word;
    }
}

The problem with this is that maybe you want to drop out of the outer loop if you find something in the inner loop. To do this, you can use labels. For example, if you were trying to parse a Perl file (not a good idea: the only thing that can parse Perl code properly is the perl interpreter), you might try something like this:

LINE: while ( <$FILE> ) {
    WORD: for my $word ( split /\s+/, $_ ) {
        next LINE if $word =~ /^#/;
            # ignore the rest of the line, it's only a comment
        last LINE if $word =~ /^__END__$/;
            # ignore the rest of the lines if you find Perl's __END__ token
        next WORD if $word =~ /rude/;
            # don't print anything rude
        print $word;
    }
}

The LABEL:s allow you to jump out of a loop from any depth. The much deprecated goto LABEL allows you to jump to anywhere in the code, but now I’ve told you about it, forget about it, there is a whole world of hurt there. There’s also redo which makes for the simplest loop possible:

LOOP: {
    print "Hello. Did you know you can kill a perl program with Control-C\n";
    redo LOOP;
}

This sort of loop seems to be looked down upon, but I’ve found it useful from time to time.
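If you want to watch the labels at work without a file to hand, the sketch below runs the same logic over a made-up list of lines, collecting the surviving words in @kept rather than printing them one at a time.

```perl
use strict;
use warnings;

# Made-up stand-in for the lines of a file
my @lines = (
    "print hello # a comment here",
    "some rude words",
    "__END__",
    "never reached",
);

my @kept;
LINE: for my $line ( @lines ) {
    WORD: for my $word ( split /\s+/, $line ) {
        next LINE if $word =~ /^#/;        # rest of the line is a comment
        last LINE if $word =~ /^__END__$/; # stop reading altogether
        next WORD if $word =~ /rude/;      # skip just this word
        push @kept, $word;
    }
}
print "@kept\n";
```

This prints print hello some words: the comment, the rude word and everything from __END__ onwards are all dropped.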

Heredocs

Heredocs are a way of including a large block of text in a Perl script without having to quote it line by line:

my $name = "Stewie Griffin";
print <<"THIS";
Hello,
This is a double quoted heredoc, as you can see from the fact that the
THIS is written with double quotes. This means you can interpolate  variables
like $name. If you use single quotes around the THIS, the heredoc follows
single-quote rules. There is one thing different: you no longer need to escape
"quotes" like in a normal quoted string. This is because the end of the string
is terminated by the THIS label, so quotes don't mean anything special.
The THIS label must be on a line on its own, up against the margin, like
THIS

$string = <<'HERE';
This also works: you can assign a heredoc to a string. The token HERE is
single-quoted so $name will not interpolate.
HERE

print $string;

I generally prefer to use large descriptive tokens like __MESSAGE_BODY__ as they stand out better.
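One annoyance with heredocs is the ‘up against the margin’ rule, which makes indented code look ugly. If you can assume perl 5.26 or later, the <<~ form lets you indent the whole heredoc, closing token included. A minimal sketch, with message_for as a made-up subroutine:

```perl
use strict;
use warnings;
use v5.26;   # indented heredocs need perl 5.26 or later

sub message_for {
    my $name = shift;
    # <<~ strips the leading indentation (as much as the closing
    # token has) from every line of the heredoc
    return <<~"__MESSAGE_BODY__";
        Hello $name,
          this line keeps two spaces of indent.
        __MESSAGE_BODY__
}

print message_for( "Stewie" );
```

On older perls you are stuck with the margin, or with stripping the indent yourself using a substitution.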

Next up…installing modules.


Oct 04

References and data structures

By polypompholyx in Perl

References

Arrays and hashes and lists and the manipulation thereof are very useful, but sometimes you want something a bit less flat; a little more multidimensional. Perl does indeed allow for multidimensional arrays (of a sort), but the syntax is a little odd. Let us enter the world of the reference.

References are like signposts to other data. To create a multidimensional array, what you actually need to do is create a normal 1-dimensional array whose elements are these signposts to other arrays. In this way, we can fake a multidimensional array. So, how do we create a reference? There are two ways. The first is to use the \ operator, which creates references to bits of data you have named and defined previously:

$scalar     = "hello";
$scalar_ref = \$scalar;
@array      = ( 'q', 'w', 'e', 'r', 't', 'y' );
$array_ref  = \@array;
%hash       = ( key => "value", lemon => "curd" );
$hash_ref   = \%hash;

Because references are just a single signpost saying “the thing I’m referring to is just over there”, they are $ingular data, hence they are always $calars. Creating arrayrefs and hashrefs is such a common thing to do, there is a shorthand, which allows you to create them directly, and without naming them:

$array_ref  = [ "q", "w", "e", "r", "t", "y" ];
$array_ref2 = [ qw{ you can use qw to save time } ];
$hash_ref   = { key => "value", lemon => "curd" };

These create references to anonymous arrays and anonymous hashes. In the first example, we created references to named arrays and hashes, like @array and %hash. In this example, there is no named array or named hash, only the [ ] brackets or { } braces, which are never named.

You can also create references to anonymous subroutines:

$code_ref = sub { return 2**$_[0] };

Which is very, very useful, as we shall see.

Dereferencing

Creating references is easy then: you just need the right mix of \, and { } [ ] braces and brackets. How do we get back to the contents? This is dereferencing. To see how this works, let’s take an example of the multidimensional array we wanted in the first place:

@unidimensional = (
    ( "this is just a one dimensional array", "even though"),
    ( "we have nested", "parentheses" )
);
    # parentheses don't nest in perl:
    # everything gets flattened to a single list
@multidimensional = (
    [ "but this one", "really is multidimensional" ],
    [ "two by", "two" ]
);
    # brackets mean create references to anonymous arrays
    # and put them in @multidimensional

The multidimensional array is really a one-dimensional array that contains pointers (references) to the location of other arrays. So Perl doesn’t really have multidimensional arrays, but you can fake them using references. Getting at the data is a little complex:

my @multidimensional = (
    [ "but this one", "really is multidimensional" ],
    [ "two by", "two" ]
);
my $element_1_1 = $multidimensional[1]->[1]; # gets 'two'
print $element_1_1;

two

The $multidimensional[1] bit is obvious: we’re just using the usual (counting from zero) syntax for pulling out the 2nd element of the array. If we actually captured this:

my @multidimensional = (
    [ "but this one", "really is multidimensional" ],
    [ "two by", "two" ]
);
$element_1 = $multidimensional[1];
print ref $element_1;
print $element_1;

the scalar $element_1 would contain the reference itself. We can use the ref operator to find out what the scalar points to:

print ref $element_1;

ARRAY

We can also find out what perl calls the reference:

print $element_1;

ARRAY(0x183f1dc)

or similar, which isn’t very informative! The 0x... is a hexadecimal code to the array’s location in memory, basically the ‘map reference’ where the array has been stored. So although:

$multidimensional[1]

returns a reference to an array ('0x183f1dc'):

->[1]

is the bit that actually ‘dereferences’ the reference, chasing the pointer to the array at 0x183f1dc. You could think of the -> as ‘follow this signpost to the real data’. The -> dereferencing arrow can also be used to pull out bits of multidimensional hashes:

my %multi_hash = (
    name => {
        fore => "Steve",
        sur  => "Cook",
    },
    age => 26 # fucking hell, this tutorial is 10 years old
);
my $forename = $multi_hash{name}->{'fore'};
my $age      = $multi_hash{'age'};
print "name $forename, age $age\n";

name Steve, age 26

HoH, AoA and other Thingies

You can see here that a multidimensional structure needn’t be a boring rectangular matrix (an array of arrays, AoA), exactly 2 by 2 or 4 by 4 by 3. Nor does it need to be a hash of hashes (HoH). It can be an unbalanced and weird Thingy: you can have anything you like:

my @complex = ( # is an array at the top-level, containing...
    "a scalar",
    { and => "an internal hashref" },
    "another scalar",
    [ # element 3 of the @complex is this big arrayref
        [
            "a two deep arrayref",
            "with two elements"
        ],
        { # element 1 of this arrayref is a little hashref
            lemon => "curd" # key lemon of this hashref is curd
        }
    ]
);
print my $wanted = $complex[3]->[1]->{'lemon'};

Hopefully the syntax should be OK now: if you wanted the word “curd”, you need:

my $wanted = $complex[3]->[1]->{'lemon'};

i.e. take element 3 of @complex (don’t forget we count from 0, so this is the big arrayref at the end), dereference it to get its element 1 (the little hashref inside the arrayref), then dereference the hashref with the key lemon.

When we do objects later, we’ll find that they are usually anonymous hashrefs:

$anonymous_hashref = { name => "Cornelia", species => "Elaphe guttata" };

to get at the name this time, we need the syntax:

$name = $anonymous_hashref->{'name'};

Note that because we are starting with a scalar (not an array as in the last example), the very first thing we need to do is dereference it with an arrow: the scalar $anonymous_hashref points to an (anonymous) hash, so you can’t just look at its keys, because a scalar doesn’t have keys. If this doesn’t make sense, compare these:

@array    = ( qw/ some elements/ );
$zeroeth_element_of_array    = $array[0];
$arrayref = [ qw/some elements/ ];
$zeroeth_element_of_arrayref = $arrayref->[0];

There is a certain shortcut you can take when dealing with multidimensional arrays and hashes (and mixtures, like arrays of arrayrefs of hashrefs). Since an array or hash can only contain a single 1 dimensional list, we know that:

@array = ( [ "two", "by" ], [ "two", "elements" ] );
$element_0_1 = $array[0][1];

can only mean:

$element_0_1 = $array[0]->[1];

So, in general, if you’re dealing with something that is a real array or hash at the highest level, you can miss off the -> arrows completely unambiguously, so our earlier example:

$wanted = $complex[3]->[1]->{'lemon'};

could be written with less line noise as:

$wanted = $complex[3][1]{'lemon'};

However, you must be careful with this syntax if you’re dealing with something that is a scalar at the highest level (i.e. an anonymous hashref or arrayref):

$arrayref    = [ [ "two", "by" ], [ "two", "elements" ] ];
    # note this is scalar (an arrayref), not an array
$element_0_1 = $arrayref->[0]->[1];
$element_0_1 = $arrayref->[0][1];

The last two are equivalent, but note you can’t dispense with the first ->, which is actually obvious, as:

$arrayref[0][1];

means element 1 of the zeroeth arrayref contained in the array @arrayref, which doesn’t exist (or if it does, you’ll get the value from that, which isn’t what you’re after).

You’ve now seen that constructing references can be done in many ways. TIMTOWTDI. Some examples:

$arrayref  = [ 0, 1, 2, 3 ];
    # create a scalar that is an anonymous arrayref
@array     = ( 0, 1, 2, 3 );
$arrayref  = \@array;
    # create a reference to a real named array
@multi     = ( \@one, \@two, \@three );
    # multidimensional array constructed from references to named arrays
@multi2    = ( [ 1, 2 ], [ 3, 4 ], [ 5, 6 ] );
    # multidimensional array constructed from anonymous arrayrefs
$arrayref2 = [ [ 1, 2 ], [ 3, 4 ], [ 5, 6 ] ];
    # multidimensional: arrayref constructed from anonymous arrayrefs

Slicing thingies

Often you won’t want to play with just a single element of a thingy (which is the not very descriptive term for a Perl data structure composed of some gungy mass of references). You’ll want a slice of a thingy, or to iterate over the whole array pointed to by an arrayref. How do we do this? Well, both need the same sort of dereferencing syntax:

$arrayref = [ 0, 1, 2, 3, 4 ];
@slice    = @{ $arrayref }[ 0, 2 ];
@anonymous_array_pointed_to_by_arrayref = @{ $arrayref };

The way to read @{ $arrayref } is ‘the array referred to by $arrayref’. If you want a hash, you use a similar syntax:

$hashref = {
    lemon      => "curd", 
    strawberry => "jam", 
    orange     => "marmalade",
};
@hashslice     = @{ $hashref }{ qw( lemon orange ) };
    # don't forget slices are plur@l data
%anonymous_hash_pointed_to_by_hashref = %{ $hashref };

This (of course?) produces yet another way of accessing the individual elements of a reference:

$hashref = {
    lemon      => "curd", 
    strawberry => "jam", 
    orange     => "marmalade",
};
$hashelement  = ${ $hashref }{ "lemon" };
    # the pairs inside the hash referred to by $hashref are $calar data
$arrayref     = [ 0, 1, 2, 3, 4 ];
$arrayelement = ${ $arrayref }[2];
    #the elements inside the array referred to by $arrayref are $calar too

So that if we have a dereferenced arrayref:

@{ $arrayref }

we access its elements with the usual Perl syntax:

${ $arrayref }[ INDEX ]

In simple cases, the { } braces can be omitted, although I personally never miss them off, as I get easily confused:

@array = @{ $arrayref };
@array = @$arrayref;

are equivalent nonetheless.
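Iterating over the whole of whatever a reference points to uses exactly the same @{ } and %{ } syntax. A small sketch, with made-up data:

```perl
use strict;
use warnings;

my $arrayref = [ 0, 1, 2, 3, 4 ];
my $hashref  = { lemon => "curd", orange => "marmalade" };

# iterate over the whole array the arrayref points to
my $total = 0;
$total += $_ for @{ $arrayref };
print "total $total\n";

# scalar @{ ... } gives the length, $#{ ... } the last index
my $length = scalar @{ $arrayref };
my $last   = $#{ $arrayref };
print "length $length, last index $last\n";

# keys works on the dereferenced hash; sort for a predictable order
my @fruits = sort keys %{ $hashref };
for my $fruit ( @fruits ) {
    print "$fruit => $hashref->{$fruit}\n";
}
```

Anywhere you would write @array or %hash, you can write @{ $arrayref } or %{ $hashref } instead.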

Creating reference structures at run-time

All the examples so far have relied on you knowing the structure of the thingy at compile time. Although there will be times when you will create huge perl thingies in your code, more often than not, you will be building them at run time from data entered by users or from other files. In fact, a great deal of programming is to do with parsing: converting data from one format to another. Data stored in flat files is called ‘serialised’ data. Data stored within a script will usually be more tree-like. Conversion from serialised input (such as XML files, text files or user input) to serialised output (such as an HTML file, or an email) via an internal parse tree of some sort is often a job for modules from CPAN, but you’ll frequently need to create a quick and dirty converter for a simple format such as the one below. Here, we use simple array and hash manipulation operators (like push, for, and hash assignment) to create a Thingy on the fly. Since references are just pointers to hash and array structures, building and manipulating them is a simple matter of getting up close and personal to these functions.

#!/usr/bin/perl
use strict;
use warnings;
my $record;
my @people;
while ( <DATA> ) {
    chomp;
    if ( /\*\*\*/ ) {  # start of record marker
        $record = {};  # create an empty hashref
    }
    elsif ( my ( $field, $data ) = / (\w+) \s* = \s* (.*) /x ) {
        if ( $field eq "pets" ) {
            $data = [ split /\s*,\s*/, $data ];
            # create an anonymous arrayref:
            # you can wrap things that return lists in []
            # and create arrayrefs as simply as this. Good, isn't it?
        }
        $record->{ $field } =  $data;
        # add key/value pair to the anonymous hashref $record
    }
    elsif ( /---/ ) { # end of record marker
        push @people, $record; # add the hashref to the @people array
    }
    else {
        next;
    }
}
# we have now created a tree in memory, looking something like:
# @people = (
#    { name=>'alice', age=>37, pets=>[] }, 
#    { name=>'bob',   age=>23, pets=>[ 'dog' ] },
#    { name=>'eve',   age=>26, pets=>[ 'millipede', 'snake' ] },
# );
for my $person ( @people ) {
    print ucfirst "$person->{'name'} is $person->{'age'} years old ";
    # ucfirst capitalises the first letter of a string
    if ( my @pets = @{ $person->{'pets'} } ) {
        # assignment is neatest here, as we use @pets in a minute
        local $" = ', ';
        # the $" is a special perl variable, containing 
        # the thing used to separate array elements in quoted strings:
        # usually it's a space, but we make it a comma and space for
        # our output here
        print "and has the following pets: @pets.\n";
    }
    else {
        print "and has no pets.\n";
    }
}
# everything after a __DATA__ token is available to a perl script and
# can be read automagically via the DATA filehandle, which is opened
# on running your script
__DATA__
***
name = alice
age = 37
pets =
---
***
name = bob
age = 23
pets = dog
---
***
name = eve
age = 26
pets = millipede, snake
---

Alice is 37 years old and has no pets.
Bob is 23 years old and has the following pets: dog.
Eve is 26 years old and has the following pets: millipede, snake.

So we take one serial format (our __DATA__ section), convert it into an internal thingy, then dump the thingy in our own chosen format.

Passing by value and passing by reference

What other practical uses are there for references? Well, if you’ve tried to return arrays from subroutines, you’ll find that it doesn’t work how you expected:

my ( @one, @two ) = fruits();
print "one @one\ntwo @two\n";
sub fruits {
    my @first  = qw/ lemon orange lime/;
    my @second = qw/ apple pear medlar/;
    return @first, @second;
}

one lemon orange lime apple pear medlar
two

The problem is that arrays in list context (such as the one return gives its arguments) interpolate their members. That is, return flattens @first and @second into one big list. The second problem is that arrays are greedy, so when we return this flattened list from the subroutine, @one slurps up all the return values, and @two gets nothing. There is a kludge to get around this:

my ( @one, @two );
( @one[ 0 .. 2 ], @two[ 0 .. 2 ] ) = fruits();

but this is obviously dependent on knowing how many elements are returned: this will break if the two arrays have variable lengths. What we really need to do is pass the arrays by reference:

my ( $one, $two ) = fruits();
print "one @{$one}\ntwo @{$two}\n";
sub fruits {
    my @first  = qw/ lemon orange lime/;
    my @second = qw/ apple pear medlar/;
    return \@first, \@second;
}

The subroutine returns a list of arrayrefs (effectively an array of arrays), which the body of the program then plays with using the @{ $ref } dereferencing syntax. You can be explicit if you want, and dereference the arrayrefs back to real arrays if you’d rather:

my ( $one, $two ) = fruits();
my @one = @{ $one };
my @two = @{ $two };

There’s something rather important to note about passing things by reference instead of by value, as we have done here:

my @spices = qw( ginger kratchai galangal );
by_value( @spices );
print "@spices\n";
sub by_value {
    my @spices = @_;
    push @spices, "turmeric";
}

ginger kratchai galangal

When you pass things by value, you are passing copies of the data, so manipulating these data in the sub does nothing interesting to the original data. To do that, you’d have to return the altered array, and capture it in @spices to change the original data. However, when you pass by reference, you are passing pointers to the original data, so if you manipulate what the references point to, you are modifying the original data:

my @spices = qw( ginger kratchai galangal );
by_reference( \@spices );
print "@spices\n";
sub by_reference {
    my $spices = shift;
    push @{ $spices }, "turmeric";
}

ginger kratchai galangal turmeric

Another source of subtle bugs is directly messing with the elements of @_:

my @spices = qw( ginger kratchai galangal );
by_value( @spices );
print "@spices\n";
sub by_value {
    $_[0] = 'cardamom';
}

cardamom kratchai galangal

Like $_ in a foreach loop, the items in @_ are aliased to the items in the list of arguments passed to the subroutine. Modifying them will modify the data in the body of the program, which is generally something you don’t want to do. It’s always best to be explicit if you want the subroutine to modify its arguments. If you don’t want it to, pass by value, and immediately copy the contents of @_ into some nice lexically scoped variables.
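Here is a sketch of the two explicit styles side by side. The subroutine names shout and shout_in_place are made up for the example.

```perl
use strict;
use warnings;

# Copies its arguments first, so the caller's data are left alone
sub shout {
    my @words = @_;                 # copy out of @_ straight away
    $_ = uc $_ for @words;          # changes only the copies
    return @words;
}

# Takes a reference, which advertises that it will change the original
sub shout_in_place {
    my $words = shift;
    $_ = uc $_ for @{ $words };
    return;
}

my @spices = qw( ginger kratchai galangal );
my @loud   = shout( @spices );
my $before = "@spices";             # still lower case at this point
print "$before / @loud\n";

shout_in_place( \@spices );
print "@spices\n";
```

The first print shows the original untouched next to its shouty copy; the second shows that the by-reference version really has changed @spices.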

Data::Dumper

References can get quite complicated, which is where the module Data::Dumper comes in very handy. Modules can be used, like you have seen with strict: they import some extra functionality to your program, in this case, a function called Dumper().

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @complex = (
    "a scalar",
    { and => "an internal hashref" },
    "another scalar",
    [
        [
            "a two deep arrayref",
            "with two elements"
        ],
        {
            lemon => "curd"
        }
    ]
);
print Dumper( \@complex );

$VAR1 =
[
  'a scalar',
  {
    'and' => 'an internal hashref'
  },
  'another scalar',
  [
    [
      'a two deep arrayref',
      'with two elements'
    ],
    {
      'lemon' => 'curd'
    }
  ]
];

Data::Dumper takes a reference to any perl data structure and spits out the contents in a) a pretty printed format, and b) in a format that you can actually print to a file:

open my $FILE, '>', $file or die "Can't write to $file: $!";
print $FILE Dumper( \@complex );
close $FILE;

to save it for later (serialisation). You can then recover the data with:

my $data = do $file;
    # $data is now an arrayref; give $file a path like ./dump.pl on modern perls

Although don’t do this if you do not trust the data!
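You can try the round trip without touching the disk at all. The sketch below assumes the Terse and Sortkeys settings of Data::Dumper (both real, but not used above) and a made-up data structure; string eval stands in for do, and the same warning about trust applies.

```perl
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Terse    = 1;  # leave off the '$VAR1 =' prefix
local $Data::Dumper::Sortkeys = 1;  # dump hash keys in sorted order

my $original = [ "a scalar", { lemon => "curd" }, [ "two", "deep" ] ];
my $dumped   = Dumper( $original );

# string eval turns the dump back into a data structure:
# only ever do this with text you produced yourself
my $copy = eval $dumped;
die "Could not rebuild the data: $@" if $@;

# the copy is a separate structure, so changing it leaves the original alone
$copy->[2][0] = "changed";
print "$copy->[1]{lemon}, $original->[2][0], $copy->[2][0]\n";
```

This is also a cheap (if slow) way of making a deep copy of a thingy.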

Anonymous functions and closures

You may remember a while ago that I said you can create anonymous subroutines:

$coderef = sub { return 2*$_[0] };

To use these, you can dereference them like normal, using () parentheses, as you might have guessed:

print $coderef->(3);

This is very powerful: it means you can actually return bits of code from subroutines:

#!/usr/bin/perl
use strict;
use warnings;
my $multiply = <STDIN>;
my $coderef  = construct( $multiply );
print $coderef->( 2 );
sub construct {
    my $number  = shift;
    my $coderef = sub { return $number * $_[0] };
    return $coderef;
}

This creates coderefs on the fly which will multiply by whatever number you decide via STDIN. The technical term for such things is closures: the weird thing about them is that the $number in sub construct is a lexically scoped my variable, and ought to disappear forever (go out of scope) when you leave the subroutine. However, when you use:

$coderef->( 2 ); # or equivalently
&{ $coderef }( 2 ); 
    # & is the sigil for subroutines, like arrays get @ and hashes get %

the value of $number you entered when you constructed the coderef is still there, deeply bound into the coderef, even though it ‘shouldn’t’ really exist outside the scope of sub construct. Magic.
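Each closure gets its own private copy of the variable it has captured, which you can see with a counter. This is a sketch with made-up names (make_counter, $from_one, $from_ten), not anything from a module.

```perl
use strict;
use warnings;

# Hypothetical example: each call to make_counter builds a fresh closure
sub make_counter {
    my $count = shift;             # private to this particular closure
    return sub { return $count++ };
}

my $from_one = make_counter( 1 );
my $from_ten = make_counter( 10 );

my @seen;
push @seen, $from_one->();
push @seen, $from_one->();
push @seen, $from_ten->();   # has its own $count, untouched by the above
push @seen, $from_one->();
print "@seen\n";
```

The two counters don’t interfere with each other, even though they were built by the same subroutine from the same my $count line.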

So far we have only talked about so called hard references. There are also such things as soft (or symbolic) references, which will barf if you try to use them when strict is on. In the interests of hygiene, I’ll not tell you about them: they’re not much used, for good reason, and you can always find out about them yourself. If you ever think you need one, it’s almost certain what you actually need is a hash.

Next up…bits and bobs.


Oct 04

Substitutions, splitting and joining

By polypompholyx in Perl

Substitution and transliteration

Matching patterns is very useful, but often we want to do something more than just match things. What if you want to replace every occurrence of a certain thing with something else? This is the domain of the s/// and tr/// operators. s/// is the substitution operator, and tr/// is the transliteration operator. tr/// is useful for simple things:

my $string =  "all lowercase with 5ome num8er5";
$string    =~ tr/a-z/A-Z/;
print $string;

ALL LOWERCASE WITH 5OME NUM8ER5

You just make a list on one side of the tr///, and a list on the other side (hyphens can be used to create natural ranges), and perl will map one lot to the other. The substitution operator is even more powerful and useful:

$_ = "old M\$ dross";
s/old/new/i; # substitute the first occurrence of old with new, case insensitively
s/M\$/Microsoft/i;
s/dross/loveliness/i;
print; # did you forget print defaults to $_ ?

new Microsoft loveliness
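Going back to tr/// for a moment: it also returns the number of characters it matched, which makes it a handy counter. A sketch with made-up strings:

```perl
use strict;
use warnings;

my $dna = "GATTACAGGC";

# with an empty right-hand side, tr/// changes nothing and just counts
my $gc_count = ( $dna =~ tr/GC// );
print "$gc_count of ", length $dna, " bases are G or C\n";

# the /d modifier deletes matched characters that have no replacement
( my $no_digits = "num8er5 and 5ome text" ) =~ tr/0-9//d;
print "$no_digits\n";
```

Note that tr/// works on lists of characters, not regexes, so there are no quantifiers or character classes to worry about.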

Interpolation in regexes

In the second one, note you have to escape the $. This is because both pattern matching and substitution can interpolate variables:

my $name   = "Cornelia";
my $string = "Cornelia is a corn-snake.";
print "Matched $name\n" if $string =~ /$name/;
$string =~ s{is}{was}; # *sniff*
print $string;

Matched Cornelia
Cornelia was a corn-snake.

Note that like m//, s/// and tr/// can use the usual ‘any quotes you fancy’, although avoid ? and ' , as they have a special significance. So:

s|A|B|;  # three the same
s(A){B}; # two pairs
s{A}|B|; # one pair, two the same

all work, although I’d only recommend the middle one.

Substitution modifiers

The s/// can take all the modifiers (/s, /x, /i) that matching m// can take, but it has another two of its own, /g and /e. /e is like a little eval (which we will discuss later) that evaluates the substitution’s right hand side, and /g means ‘globally’, i.e. do it to every match you find:

my $string =  "2 3 4 5 6";
$string    =~ s/ (\d+) / 2 * $1 /xge; # double every number you match
print $string;

4 6 8 10 12

If you hadn’t noticed, when you use a substitution with capture parentheses, the captures are in $1, etc., as usual, and you can use these on the right hand side of the s///. Of course, you can also use /g and /e separately. In fact, you can use /g on m// as well:

$_ = "2 3 4 5 6";
while ( /(\d+)/g ) { print "$1 times 2 is ", $1 * 2, "\n"; }

2 times 2 is 4
3 times 2 is 6
4 times 2 is 8
5 times 2 is 10
6 times 2 is 12

Here, the /g means ‘keep matching till you run out of string’.
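
If you don't need a loop, m//g in list context hands back all the matches in one go. A quick sketch (the variable names are mine):

```perl
use strict;
use warnings;

my $string  = "2 3 4 5 6";
# m//g in list context returns everything the parentheses captured
my @numbers = $string =~ /(\d+)/g;
# assigning to an empty list first forces list context, then counts the matches
my $how_many = () = $string =~ /\d+/g;

print "Found $how_many numbers: @numbers\n";
```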

Splitting and joining strings

There are several operators that use pattern matching of one sort or another. The first is split. The first argument is the regex you want to split the string on, the second is the string you want to split (just the one: split works on a single string at a time, not a list). You can capture the split bits in an array:

my $string   = "A : colon:delimited: file: with: some : random :spaces";
my ( @bits ) = split /\s*:\s*/, $string;
    # splits on colons surrounded by optional spaces
print "$_\n" foreach @bits;

A
colon
delimited
file
with
some
random
spaces
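
split also takes an optional third argument, a limit on the number of pieces, and it treats a single-space string ' ' as a special 'split on any whitespace' pattern. A rough sketch of both, using made-up data:

```perl
use strict;
use warnings;

my $line = "name: Cornelia: the corn-snake";
# a limit of 2 means 'split at most once', so the second colon survives
my ( $field, $value ) = split /\s*:\s*/, $line, 2;

# the special pattern ' ' splits on runs of whitespace,
# and ignores any leading whitespace
my @words = split ' ', "   mice  are   tasty ";

print "$field => $value\n";
print scalar(@words), " words: @words\n";
```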

The opposite of split is join, which has a similar syntax, only it expects not a regex as its first argument, but a string. So:

my $joined = join "|", qw/one two three four five six/;
print $joined;

one|two|three|four|five|six

How about this:

print join "|", reverse split /\s*:\s*/, 
    "A: colon: delimited  : file: with  :    spaces";

spaces|with|file|delimited|colon|A

Running list operators into each other like this a) is clever, but b) easily becomes unreadable. Caveat scriptor.

Grepping

Another useful tool for regex is grep. This operator takes a regex as its first argument too, and a list of things to ‘grep’ as the rest. What is grepping? Well, grepping means ‘returning the things that match from a list’:

my ( @names )     = qw/ Cornelia Atropos Lachetis Amber /;
my ( @match )     = grep   /^A/, @names;
my ( @not_match ) = grep ! /^A/, @names;
print "Start with A @match\nDon't @not_match\n";

Start with A Atropos Amber
Don't Cornelia Lachetis

See that you can make an anti-grep using the ! ‘not’ before a regex. The way grep actually works is by running through the list you give it, setting $_ to each item in turn. It then uses the regex to pattern match on $_, as usual. Only things that match are returned. grep is useful for finding lines in a file that match a certain pattern. It’s another of those Perl operators that returns different values in scalar and list context. In list context (previous example) it returns the list of matches, but in scalar context:

my $number = grep /^A/, @names;

it returns the number of matches. grep can be heavily abused, syntactically speaking:

grep /regex/, LIST;
grep { /regex/ } ( LIST );

Both work the same, although I always use the latter, as it makes the condition more obvious, even though it’s line noise for its own sake. This may vaguely remind you of sort.
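
The block form also earns its keep when the test isn't (just) a regex: anything that gives a TRUE or FALSE answer about $_ will do. A small sketch along those lines:

```perl
use strict;
use warnings;

my @names = qw/ Cornelia Atropos Lachetis Amber /;

# any expression that tests $_ will do, not just a regex
my @long_names   = grep { length($_) > 5 } @names;
# and conditions can be combined inside the block
my @long_a_names = grep { /^A/ && length($_) > 5 } @names;
# scalar context still gives you a count
my $ends_in_a    = grep { /a$/ } @names;

print "Long: @long_names\nLong and A: @long_a_names\nEnd in a: $ends_in_a\n";
```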

Mapping

One final operator before we leave regexes. map has nothing to do with regexes, but it has a similar syntax to grep (and to sort for that matter). I love map. There’s nothing like it for bringing out the mathematician in you. map needs a block of code that does something to $_, followed by a list, just like grep. map then runs through the list, setting $_ to each value in turn, so you can torture it with the block of code:

@mapped = map { DO_SOMETHING_TO $_ } ( LIST );

So:

@doubled = map { 2 * $_ } ( qw/ 2 4 6 8 10 / );
print "@doubled";

4 8 12 16 20

Note that there is no return in the block: a block’s value is simply the last thing it evaluated. In fact, you mustn’t write an explicit return here, because a return inside a map block tries to return from the enclosing subroutine (or dies if there isn’t one), rather than handing a value back to map.

Dull? Yes. But how about:

@selective_doubles = 
    map { /[24680]$/ ? ( 2 * $_ ) : $_ } ( qw/ 1 2 3 4 5 6 7 8 / );
print "@selective_doubles";

1 4 3 8 5 12 7 16

which returns a list of numbers that have been doubled iff (if and only if) they are even.

One word of warning for both grep and map. $_ is not a copy of the data in the list you feed to these functions, it’s an alias to the actual values of the list. That means that if you modify $_ itself, rather than just returning it, you will alter the items in the list fed to grep or map, not just the items in the returned list. This may be what you want, but probably isn’t:

my @original = qw/Abacus chocolate sprite/;
print "original: @original\n";
my @returns = map { s/A//gi; } ( @original );
print "afterward: @original\nreturned: @returns\n";

original: Abacus chocolate sprite
afterward: bcus chocolte sprite
returned: 2 1

You may be wondering what the hell has happened. Well, firstly, the actual members of @original have been altered, because s/// messes with $_ directly. Hence all the A characters have been stripped. The s/// operator returns the number of substitutions it made, hence @returns contains 2 (Abacus), 1 (chocolate) and an empty string (since sprite contains no /A/i, and s/// returns the empty string rather than undef when it fails to match). If you remember that a map is basically a foreach loop:

my @mapped = map { DO_SOMETHING_TO $_ } ( LIST );

and

my @mapped;
foreach ( LIST ) {
    my $return_value = DO_SOMETHING_TO $_;
    push @mapped, $return_value;
}

are the same thing, you’ll be fine. As long as you remember that altering the value of $_ in a foreach loop indirectly alters the original value in the LIST, that is! Go on, try writing the s/// map as a foreach loop, and you’ll see what I mean.

my @original = qw/Abacus chocolate sprite/;
print "original: @original\n";
my @returns;
foreach ( @original ) {
    my $return_value = s/A//gi;
    push @returns, $return_value;
}
print "afterward: @original\nreturned: @returns\n";

Told you so. What you probably need in this case is a temporary variable:

my @original = qw/Abacus chocolate sprite/;
print "original: @original\n";
my @returns = map { my $tmp = $_; $tmp =~ s/A//gi; $tmp; } ( @original );
print "afterward: @original\nreturned: @returns\n";

original: Abacus chocolate sprite
afterward: Abacus chocolate sprite
returned: bcus chocolte sprite
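
If your perl is 5.14 or newer, the /r modifier on s/// ought to save you the temporary variable: it leaves the original alone and returns the modified copy instead of a count. A sketch, assuming a recent enough perl:

```perl
use strict;
use warnings;
use 5.014;   # the /r modifier needs perl 5.14 or later

my @original = qw/Abacus chocolate sprite/;
# s///r leaves $_ (and so @original) untouched and returns the changed copy
my @returns  = map { s/A//gir } @original;
print "afterward: @original\nreturned: @returns\n";
```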

So, to summarise:

The s/// operator acts like the m// operator, but selectively substitutes text. The tr/// operator is quicker and easier for simple substitutions. The syntax of the new list operators is:

@splat = split /\s/, $splitee;
$junt  = join '+', @joinees;
@mup   = map { $_ * 2 } @mappees;
@grap  = grep { /\d+/ } @grepees;
@argh  = map { my $dotted = join '.', split /:/; "IP: $dotted" }
           grep { /^\d{1,3}:\d{1,3}:\d{1,3}:\d{1,3}$/ }
             ( @ip );

Next up…references and data-structures.


Oct 04

Regexes

By polypompholyx in Perl

Regular expressions

Before we looked at file manipulation, we covered how to write comparisons with conditionals:

if ( $string eq "Something we're interested in" ) {
    print "Ha, ha!";
}
else {
    print "Boring";
}

What happens if there’s more than one thing you’re interested in though? Writing a gigantic if/elsif/else structure or even a given/when switch will make your head spin, and you’ll never be sure you’ve got every possible version of the thing you’d like to match. Take for example, matching something as simple as a letter, number or underscore character:

if    ( $test eq "a" ) { print "OK" }
elsif ( $test eq "b" ) { print "OK" }
...
elsif ( $test eq "9" ) { print "OK" }
...
elsif ( $test eq "_" ) { print "OK" }
else                   { print "Not a letter, number or underscore!" }

This is a waste of time that will be over 63 eye-bending lines long, and still won’t match the correct spelling of ‘naïve’, let alone хуёво. So, from time immemorial, there have been things called ‘regular expressions’ or ‘regexes’, which are a way of explaining to a programming language the things you want to match in a neat and tidy fashion. Regex are really a language all of their own (a form of logic programming). However, despite looking like executable line-noise, they are incredibly useful and powerful.

In Perl, regex are written in quotes, of a sort. Here is such a regex:

/\w/

The / / are the ‘quotes’ for the regex: the regex itself is just the \w bit. This regex does exactly what those 63 lines of code would do very badly: it matches a single letter, number or underscore. As you know, the \ is an escaping character: anything after it has some special meaning. The w stands for ‘word’, and \w will match a single occurrence of any ‘word’ character, which is defined as a letter, underscore or number (i.e. the things valid in the names of Perl variables and subroutines). The proper name for this is a character class, which we’ll cover later. For the moment, suffice it to say \w is the same as [A-Za-z0-9_] (only it can also cope with non-ASCII letters in Unicode). So the program we really want to write is:

#!/usr/bin/perl
use strict;
use warnings;

chomp( my $test = <STDIN> );
if   ( $test =~ /\w/ ) { print "OK" }
else { print "Not a letter, number or underscore!" }

The =~ is the ‘binding operator’. It makes perl do the regex on the right to the variable on the left. So:

$test =~ /\w/;

and

$_ =~ /\w/;

will test $test and $_ for wordiness respectively. In fact, (as usual), there’s a shorthand for the second one: $_ is the default variable, and if perl finds a naked slash-delimited regex, it’ll assume you mean $_ =~ /naked_regex/:

$_ =~ /\w/;

and

/\w/;

are the same thing. If a regex matches, it returns TRUE, so:

print "Match" if /\w/;

will print “Match” only if $_ contains a word character.

Another useful way to write this is with a logical operator:

/\w/ && print "Match";

Which does the same thing: the && is a short-circuit operator, so if the first thing is FALSE (i.e. $_ is not wordy), it doesn’t bother evaluating the second (i.e. print "Match"). If you want the match to fail (return FALSE) if it matches a word character, you can use !~ :

$test !~ /\w/;

or simply negate a naked regex with the not ! operator:

! /\w/;

The ternary if/then/else operator

To make our original program even tinier, we can use this default shorthand, and a new operator, the ? : operator:

chomp( $_ = <STDIN>);
print /\w/ ? "OK" : "Not a letter, number or underscore!";

The ? : operator is like a tiny ‘if else’ statement:

print(
    /\w/                                    # if $_ matches /\w/ ...
    ? "OK"                                  # ... then return "OK"
    : "Not a letter, number or underscore!" # ... else return this
);

A ? B : C will test A to see if it is TRUE. If it is TRUE, it returns B, if it is false, it returns C. print then gets handed whatever this statement returns, i.e. “OK”, or “Not a letter…”.

One-or-more word characters

Now, what if we want to match more than one word character?

/\w+/;

will do just that: a + means ‘one or more of the preceding character’. So this pattern will match a, bbbbbb, d_99 and so on. However, it will also match 999;;;plop, because 999 matches /\w+/ (perl never bothers going as far as the ‘plop’, as the match has already succeeded with the 999).

Anchors and escapes

If we want to make sure that we match a thing made entirely out of word characters, we can use:

/^\w+$/;

The ^ means ‘beginning of the string’ and $ means ‘end of the string’, (beginning and end of the string you =~ bind to the regex). So this regex will only match strings composed purely of word characters.
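
To see what difference the anchors make, here's a rough comparison of the two patterns on a few made-up strings:

```perl
use strict;
use warnings;

my @candidates = ( "d_99", "999;;;plop", "two words" );

# unanchored: a word character anywhere in the string is enough
my @loose  = grep { /\w+/ }   @candidates;
# anchored: the whole string must be word characters
my @strict = grep { /^\w+$/ } @candidates;

print scalar(@loose), " loose matches, ", scalar(@strict), " strict match: @strict\n";
```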

Another useful escape sequence is \s, which matches a space character (including both literal spaces, and \n newlines, \r carriage returns, \f form feeds, \t tabs and a few other obscure things). To match a space only, you can just use:

/ /;

and to match a newline:

/\n/;

\d will similarly match a single digit [0-9].

Capturing parentheses

An extremely important thing you can do with a regex is to capture what perl actually matched. To do this, you use ( ) parentheses within the regex:

/^(\w+)$/;

If the regex matches $_, which it will if $_ is composed entirely of ‘word’ characters, then the thing that \w+ matched will now be squirrelled away by perl for your perusal. How do we get at these stored goodies? Well, there are two ways. The first is to use the pattern match variables, $1, $2, $3, $4 … Whatever was captured by the first set of parentheses will appear in $1, the second set in $2, and so on. So:

/(\w(\s+)(\w+))/;

If this actually matches $_, then the entire match \w\s+\w+ will be found in $1, the space characters \s+ will be found in $2, and the last word characters \w+ will be found in $3. Another way to do this is to assign the results of the regex to a list outside the regex:

my ( $wholething, $space, $word ) = $test =~ /(\w+(\s+)(\w+))/;

Here, if the regex matches, the values of $1, $2 and $3 will be dumped into $wholething, $space and $word respectively. You may have just noticed that a regex is a context sensitive thing: in list context it returns the match variables, in scalar context, it returns TRUE or FALSE.
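
A slightly more practical sketch of the list-context trick, pulling a key and value out of a made-up config line. Note that a failed match returns an empty list, so it's worth checking you actually got something:

```perl
use strict;
use warnings;

my $line = "colour = green";   # a made-up config line
my ( $key, $value ) = $line =~ /^(\w+)\s*=\s*(\w+)$/;

# a failed match returns an empty list, so nothing gets assigned
my @nothing = "no equals sign here" =~ /^(\w+)\s*=\s*(\w+)$/;

print "$key is $value\n";
print "The failed match gave ", scalar(@nothing), " captures\n";
```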

Regex modifiers

If the regex:

/(\w(\s+)(\w+))/;

makes your eyes hurt, you can use the /x extended modifier, thus:

/
    (         # capture everything in here into $1
        \w    # a word character
        (\s+) # some spaces, capture into $2
        (\w+) # some more word characters, capture into $3
    )
/x;

perl ignores whitespace (and # comments) in a /x modified regex. Another very useful modifier is /i, which makes a regex case insensitive:

/^hello, world$/i;

will match “Hello, World”, “hello, world” and indeed “HEllO, WoRLd”. Note that in regexes, unescaped letters and numbers mean just what you type: it’s only escaped alphanumeric characters (\w word character, \d digit) and punctuation (+ one or more, ^ start of string) that mean something special.

Greediness and quantification of regex atoms

Regexes are ‘greedy’ and ‘lazy’ by nature. If you have this situation:

$_ = "hello everybody";
/(\w+)/;
print $1;

hello

$1 will end up with “hello” in it. This shows that regexes are lazy (they match at the first place in the string they can, so “hello”, not “everybody”), and that they are greedy (the regex has matched the maximum possible number of letters, “hello”, not just “h” or “hell”). The quantifier + always tries to greedily slurp up as many characters as it can and still match the whole sequence. The same applies to *, which is zero or more of the preceding character:

/^\w*$/;

will match any alpha_num3ric string, and also the empty string “”. Another quantifier is the ?, which indicates you want to match zero or one of the preceding character:

/Steven?/;

Will match Steve or Steven.

The second most pointless regex in the world is this:

/.*/;

The . is a special metacharacter that means ‘any character except \n’, and since * is perfectly happy with zero of them, this regex will match any string at all, even an empty one: the only limitation is that the bit it matches can never run across a newline. The most pointless regex of all is:

/.*/s;

The /s modifier makes . match \n too (it treats a multi-line string with embedded \n as a single line). So this regex matches zero or more of anything, so it will always match regardless of what $_ is!
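
Both regexes always match, then: the difference only shows up in what they capture. A quick sketch on a made-up two-line string:

```perl
use strict;
use warnings;

my $text = "line one\nline two";
my ( $without_s ) = $text =~ /(.*)/;    # . stops at the newline
my ( $with_s )    = $text =~ /(.*)/s;   # . swallows the newline too

print "Without /s: ", length($without_s), " characters\n";
print "With /s:    ", length($with_s),    " characters\n";
```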

You can specify exactly how many of a character you want using {n,m} braces:

/\w{3}/;   # matches exactly 3 alpha_num3rics
/\w{3,8}/; # matches 3 to 8 alpha_num3rics
/\w{3,}/;  # matches 3 or more alpha_num3rics
/\w{1,}/;  # pedant's version of /\w+/;
/\w{0,}/;  # pedant's version of /\w*/;
/\w{0,1}/; # pedant's version of /\w?/;

Sometimes, greedy regexes are not what you are after. You can stop regexes being greedy using the ? modifier on any of the quantifying metacharacters, i.e. * ? {n,m} and + . So:

$_ = "hello everybody";
/(\w+?)/;
print $1;

This code returns the smallest possible match, rather than the greediest.
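
Where non-greedy matching really matters is when the same closing delimiter crops up more than once. A sketch on a made-up scrap of HTML:

```perl
use strict;
use warnings;

my $html = "<b>one</b> and <b>two</b>";
# greedy: .* runs on to the LAST </b> it can find
my ( $greedy ) = $html =~ /<b>(.*)<\/b>/;
# non-greedy: .*? stops at the FIRST </b>
my ( $frugal ) = $html =~ /<b>(.*?)<\/b>/;

print "Greedy: $greedy\nFrugal: $frugal\n";
```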

Character classes and Unicode characters

As I said earlier, \w is (as far as basic ASCII is concerned) equivalent to the ‘character class’:

[A-Za-z0-9_]

Brackets are used to surround a list of characters that comprise the class. Here are some useful(?) classes:

[aeiouAEIOU] # English vowels
[10]         # binary digits
[OIWAHMVX]   # bilaterally symmetrical capital letters

Any quantifier appearing after a character class applies to the whole character class: one or more of any of the characters in the brackets:

/[A-Z]+/

Matches one or more capital letters. You can define your own character classes using this notation, but please have a care for those who live outside the comfy world of 7 bits:

$_="El niño";
/(\x{00F1})/ and print "Yep, matched an n-tilde: $1";

The \x{00F1} (which can be abbreviated to \xF1 if this isn’t ambiguous) is the Unicode code point of the ñ character. You can also use named characters with the ‘charnames’ pragma…

use charnames ':full';
$_="á é í ü or even ñ";
/(\N{LATIN SMALL LETTER N WITH TILDE})/ and print "Yep, matched an n-tilde: $1";

To save yourself even more time, you can use utf8:

use utf8;
my $word = "λόγος";
print "It's all Greek to me\n" if $word =~ /^\w+$/;

This tells perl that your source code is written in UTF-8, so the string is treated as Greek letters rather than a jumble of bytes, and \w will happily match Greek, Arabic, hiragana, hangul, and maybe (one day) even Tengwar. If this pragma is loaded, it will also allow you to create subroutines with non-ASCII names:

use utf8;
λόγος();
sub λόγος
{
    print "You'll be lucky if 'λόγος' prints correctly in your terminal!\n";
}

Most of the punctuation metacharacters (the characters like + and . and * that mean something special in a regex) lose their meta-nature inside a character class. Usually, you have to escape these metacharacters in a regex:

/\*/;
/ \+ \? /x;

The first will match a literal * character, the second a literal string of +?. But inside a character class, you don’t need to bother:

/[*+.]+/;

will match one or more asterisks, periods or pluses: there’s no need to escape them, because only a few characters mean something special inside a character class. The characters that do mean something special inside a character class include -, which makes a natural range, as you saw in the definition of \w (hence [A-Z], [a-f], [1-6], [0-9A-Fa-f], etc.), and ^, which means ‘anything except…’ iff it’s the first item in the brackets. So:

/[^U]/;          # anything but the capital letter U
/[^A-Z0-9]/;     # anything but capital letters and numbers
/[A-Z^]/;        # capital letter or caret
/[^A-Z^]/;       # anything but a capital letter or caret
/[^A-Za-z0-9_]/; # anything but a word character.

Now, that last one could be written more easily as /[^\w]/ or even better as /\W/, the \W being Perl’s shorthand for ‘anything but an alpha_numeric’. Likewise \S is anything but whitespace, and \D is anything but a digit.

Leaning toothpick syndrome

If you do want to include a special character like - or ^ in a character class, you’ll need to escape it:

/[\\\/\-\]]/; # no padding here: /x doesn't ignore spaces inside a character class

This will match a single backslash \ (which you always need to escape in Perl, whether in plain code, regex or in a character class). It will also match a forward slash /, a ] close-bracket (this needs escaping, else it’ll be prematurely interpreted as the end of the character class) or a hyphen -. You may be wondering about why you also have to escape the /. This is for similar reasons to escaping quotes in strings. If you don’t escape the regex delimiter /, perl will think the regex finishes in the wrong place. Fortunately for matching path names under Unix, like qq() and q(), you can specify your own regex quotes with m() (for match):

m(\w+?);
m{[\\/\-\]]};

See that with the second, you no longer need to escape the /. This is very useful in situations where otherwise you’d be writing:

/C:\/perl\/bin\/perl\.exe/;

which is called leaning toothpick syndrome:

m{C:/perl/bin/perl\.exe};

is rather better. As with quoting strings, avoid clever and cute delimiters: stick to slashes, parentheses or braces unless you want the maintainer of your code to come calling with a machete.

Alternation and grouping without capturing

What else can you do with regexes? Well, you can specify alternatives:

/foo|bar/;

which will match either foo or bar, using the | or pipe-character. One problem with this is sometimes you’ll need to group things using parentheses:

/([Cc]ornelia|my ex-snake) eats (\w+)/;

but now the interesting thing you’re trying to capture (what [Cc]ornelia eats) is in $2, not $1, which may be OK, but if you’d rather not have spurious pattern match variables to ignore, you can use the grouping-but-not-capturing (?: ) regex extension:

( $food ) = /(?:[Cc]ornelia|my ex-snake) eats (\w+)/;

The (?: ) allows grouping, but doesn’t squirrel away a value into $1 or its friends, so it doesn’t interfere with assigning captures to lists. There are dozens of other regex extensions looking like (?...) in Perl regexes, which you can explore yourself (they also make Perl’s regular expressions highly irregular to computer scientists).
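
One of those extensions is worth a quick look here: named captures, (?<name> ), which arrived in perl 5.10. Instead of counting parentheses, you fish the capture out of the special %+ hash by name. A sketch (the name ‘food’ is just my choice):

```perl
use strict;
use warnings;
use 5.010;   # named captures need perl 5.10 or later

my $sentence = "Cornelia eats mice";
my $food;
if ( $sentence =~ /(?:[Cc]ornelia|my ex-snake) eats (?<food>\w+)/ ) {
    # the capture is available by name in the %+ hash
    $food = $+{food};
    print "On the menu: $food\n";
}
```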

Match variables

Perl has three special regex punctuation variables. $` $& and $' . These are the pre, actual, and post match variables:

my $string =  "Cornelia ate mice that I'd thawed on the radiator";
$string    =~ /mice|mouse/;
print "PRE $`\nMATCH $&\nPOST $'\n";

PRE Cornelia ate
MATCH mice
POST that I'd thawed on the radiator

Using these three variables will slow down your program, and they are almost unreadable, but use them if you must.
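
From perl 5.10 onwards there is a wordier alternative: add the /p modifier to the match, and use ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} instead. A sketch of the same example rewritten that way:

```perl
use strict;
use warnings;
use 5.010;   # /p and its variables need perl 5.10 or later

my $string = "Cornelia ate mice that I'd thawed on the radiator";
my ( $pre, $match, $post );
if ( $string =~ /mice|mouse/p ) {
    # only filled in for matches that asked for them with /p
    ( $pre, $match, $post ) = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
}
print "PRE $pre\nMATCH $match\nPOST $post\n";
```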

Back-references

One last thing to do is to use what you’ve already matched, i.e. back-reference within a regex. Say you want to find the first bold or italic word in an HTML document:

my $html_input_file = shift @ARGV;
local $/ = undef;
    # this sets the local 'input separator' to nothing, so that <$HTML>
    # will slurp in an entire file, rather than a line at a time
open my $HTML, "<", $html_input_file
    or die "Bugger: can't open $html_input_file for reading: $!";
$_ = <$HTML>;
m{
    <(i|b)>
        # an <i> or <b> tag, captured into $1
    (.*?)
        # minimum number of any characters captured into $2
    </\1>
        # an </i> or </b>, depending on the opening tag
}sxi;
        # . matches \n, extended, case insensitive
print "$2\n";

The \1 allows the pattern to match the same something that would end up in $1, here ‘b’ or ‘i’. This isn’t written $1 like you’d expect (there is a good but technical reason). This regex (or some variation on it) looks like it will parse HTML. However, it is actually impossible to parse nested languages like HTML or XML without a more complex sort of grammar than can be provided by regexes. Getting around this problem can wait until a later post.

Quote-regex operator

Regexes can be used both directly, and stored for later use using the qr() operator. This q(uote) r(egex) operator is a simple way of keeping regexes and passing them around like strings:

my $regex = qr/(?:milli|centi)pedes?/i;
my $text  = "Millipedes are cute. No really.";
print "Found something interesting\n" if $text =~ /$regex/;

You can use $regex wherever you’d usually use a regex (in a match, or a substitution), and you can pass it to subroutines, or use it as part of a larger regex. Note that any modifiers, like /i, are internally incorporated into the string and honoured. You can even print out the $regex as a string. How useful.
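
The ‘part of a larger regex’ bit is where qr comes into its own: you can build a complicated pattern out of small, named pieces. A rough sketch, with pieces I've made up for the purpose:

```perl
use strict;
use warnings;

my $number = qr/\d+(?:\.\d+)?/;    # 12 or 1.2
my $unit   = qr/(?:mm|cm|m)\b/i;   # units, case-insensitively
# the pieces interpolate into a bigger regex, each keeping its own modifiers
my $length = qr/($number)\s*($unit)/;

my ( $amount, $units ) = "The snake was 1.2 M long" =~ $length;
print "Amount: $amount, units: $units\n";
```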

Regex summary

Atoms of regexes: alpha_numeric characters, character class escapes (\w word, \W not-word, \s space, \S not-space, \d digit, \D not-digit), character classes [blah1-9] and negated classes [^blah1-9], escaped metacharacters (\. a literal . period), metacharacters ( . anything but \n).
Alternatives : use the | for alternatives.
Quantifiers for the atoms: * (0 or more), + (1 or more), ? (0 or 1), {n,m} (between n and m).
Greediness : can be turned off with a ? following the + ? * {n,m} quantifiers.
Capturing : use () parentheses, and grab $1, $2, etc. Use (?: ) to avoid captures if you just want to use the parentheses to group, not capture.
Backreferences : use \1, \2, inside the match instead of $1, $2, etc.
Modifiers: /x ignores whitespace and comments, /s makes . match \n, and /i makes the regex case-insensitive. These are usually called the /X modifiers, even though the / is actually part of the regex quoting mechanism. There is also a /m modifier that changes the semantics of the start and end of string markers (^ $ \A \Z \z). perldoc perlre for details.
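
Since /m only gets a passing mention above, here's a small sketch of what it does to ^ and $, pulling the comments out of a made-up multi-line string:

```perl
use strict;
use warnings;

my $script = "# first comment\nprint 1;\n# second comment\nprint 2;\n";

# with /m, ^ and $ match at the start and end of every line
my @comments  = $script =~ /^#\s*(.*)$/mg;
# without /m, ^ and $ only match at the ends of the whole string,
# and . won't cross a newline, so this finds nothing at all
my @without_m = $script =~ /^#\s*(.*)$/g;

print scalar(@comments), " comments with /m, ", scalar(@without_m), " without\n";
```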

Next up…substitutions, splitting and joining.


Oct 03

Files and directories

By polypompholyx in Perl

Reading and writing to files

The symbol table is a little esoteric, so let’s get back to practicalities. How do you mess about with files and directories in Perl? A simple example:

#!/usr/bin/perl
use strict;
use warnings;

open my $INPUT, "<", "C:/autoexec.bat"
    or die "Can't open C:/autoexec.bat for reading $!\n";

open my $OUTPUT, ">", "C:/copied.bat"
    or die "Can't open C:/copied.bat for writing $!\n";

while ( <$INPUT> ) {
    print "Writing line $_";
    print $OUTPUT "$_";
}

Here we open two files, one to read from, one to write to. The $INPUT and $OUTPUT are filehandles, just like STDIN was, only we have created these two ourselves with open. It’s a good idea to give filehandles uppercase names, as these are less likely to conflict with Perl keywords (we don’t want to try reading from a filehandle called print for example).

Note that it’s also possible to write the above in the following way:

open INPUT, "C:/autoexec.bat"
    or die "Can't open C:/autoexec.bat for reading $!\n";

open OUTPUT, ">C:/copied.bat"
    or die "Can't open C:/copied.bat for writing $!\n";

while ( <INPUT> ) {
    print "Writing line $_";
    print OUTPUT "$_";
}

You can miss off the $ sigil on the filehandles. However, modern Perl usage is to use a lexically scoped filehandle (except for the standard input, output and error handles that are opened automatically for you). You will see the old style filehandles in code, but you should avoid them if you are running under perl versions > 5.8, as they rely on global variables, and are subject to the same sort of clobbering that we saw earlier.
You can miss off the < on calls to open, and perl will assume you mean ‘to read’. However, it’s better practice to explicitly state what you mean with the three argument form.
You can combine the read/write/append token into the filename. However, both this and missing out the < on opening to read can be the cause of subtle bugs, so you’d be better to avoid them.

`die`

The open command always needs at least two arguments: a filehandle, an optional read/write/append token, and a string containing the name of a file to open. So the first line:

open my $INPUT, "<", "C:/autoexec.bat"
    or die "Can't open C:/autoexec.bat for reading $!\n";

means ‘open the file C:/autoexec.bat for reading, and attach it to filehandle $INPUT’. Now, if this works, the open function will return TRUE, and the stuff after or will never be executed. However, if something does go wrong (like the file doesn’t exist, as it won’t if you’re running on Linux or MacOS), the open function will return FALSE, and the statement after the or will be executed.

die causes a Perl program to terminate, with the message you give it (think of it as a lethal print). When something goes wrong, like problems opening files, the Perl special variable $! is set with an error message, which will tell you what went wrong. So this die tells you what you couldn’t do, followed by $!, which’ll probably contain ‘No such file or directory’ or similar.

A word of advice before we go any further. On Windows, paths are delimited using the \ backslash. On Unix and MacOSX, paths are delimited using the / forward-slash. Perl will happily accept either of these when running under Windows, but bear in mind \ is an escape, so to write it in a string, you’ll have to escape it, thusly:

$file = "C:/autoexec.bat";
$file = "C:\\autoexec.bat";

I’d go with the first one in the name of portability and legibility, although if you ever need to call an external program (using system, which we’ll cover later), you’ll probably have to convert the / to \ with a regex substitution.

The second line:

open my $OUTPUT, ">", "C:/copied.bat"
    or die "Can't open C:/copied.bat for writing $!\n";

is very similar to the first, but here we are opening a file for writing. The difference is the >:

open my $READ, "<C:/autoexec.bat";           # explicit < for reading
open my $READ, "<", "C:/autoexec.bat";       # three argument version is safer
open my $WRITE, ">C:/autoexec.bat";          # open for writing with >
open my $WRITE, ">", "C:/autoexec.bat";      # safer
open my $APPEND, ">>C:/autoexec.bat";        # open for appending with >>
open my $APPEND, ">>", "C:/autoexec.bat";    # safer
open my $READ, "C:/autoexec.bat";            # perl will assume you 'read'

The > means open the file for writing. If you do this the file will be erased and then written to. If you don’t want to wipe the file first, use >>, which opens the file for writing, but doesn’t clobber the contents first. The three argument versions are generally safer: consider whether you want this to work:

chomp( my $file_name = <STDIN> );
    # user types ">important_file"
open my $FILE, $file_name;
    # you assume for reading, but the > that the user enters overrides this. Oops.

Reading lines from a file

The next bit is easy:

while ( <$INPUT> ) {
    print "Writing line $_";
    print $OUTPUT "$_";
}

Remember the line reading angle brackets <> ? As in:

chomp ( my $name = <STDIN> );

This is the same, but here we are reading lines from our own filehandle, $INPUT. A line is defined as stuff up to and including a newline character, just as it was when you were reading things from the keyboard (and you also know this is strictly a fib, <> and chomp deal with lines delimited by whatever is in $/ currently). Conveniently:

while ( <$INPUT> )

is a shorthand for:

while ( defined ( $_ = <$INPUT> ) )

i.e. while there are lines to read, read them into $_. The defined will eventually return FALSE when it gets to the end of the file (don’t test for eof explicitly!), and then the while loop will terminate. However, while there really is stuff to read, the script will print to the command line “writing line blah…”, then print it to the $OUTPUT filehandle too using:

print $OUTPUT "$_";

Note that there is no comma between the filehandle and the thing to print. A normal print:

print "Hello\n";

is actually shorthand for:

print STDOUT "Hello\n";

where STDOUT is the standard output (i.e. the screen), like STDIN was the standard input (i.e. the keyboard). To print to a filehandle other than the default STDOUT, you need to tell print the filehandle name explicitly. If you want to make the filehandle stand out better, you can surround it with braces:

print { $OUTPUT } "$_";
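
If you'd like to try writing and reading without risking a real file, here is a self-contained sketch. It assumes the core File::Temp module is available to make a scratch file, and it uses the $. variable, which holds the number of the line just read:

```perl
use strict;
use warnings;
use File::Temp qw/ tempfile /;   # core module: makes a scratch file for us

# tempfile returns an open filehandle and the file's name
my ( $TMP, $tmp_name ) = tempfile( UNLINK => 1 );
print {$TMP} "first line\nsecond line\n";
close $TMP or die "Can't close $tmp_name: $!\n";

open my $INPUT, "<", $tmp_name
    or die "Can't open $tmp_name for reading: $!\n";
my @lines;
while ( my $line = <$INPUT> ) {
    chomp $line;
    push @lines, "$.: $line";    # $. is the current line number
}
close $INPUT;

print "$_\n" foreach @lines;
```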

Pipes and running external programs with `system`

What else can we do with filehandles? As well as opening them to read and write files, we can also open them as pipes to external programs, using the | symbol, rather than > or <.

open my $PIPE_FROM_ENV, "-|", "env" or die $!;
print "$_" while ( <$PIPE_FROM_ENV> );

This should (as long as your operating system has a program called env) print out your environment variables. The open command:

open my $PIPE_FROM_ENV, "-|", "env" or die $!;

means ‘open a filehandle called $PIPE_FROM_ENV, and attach it to the output of the command env run from the command line’. You can then read lines from the output of env using the <> as usual.

You can also pipe stuff into an external program like this:

open my $PIPE_TO_X, "|-", "some_program" or die $!;
print $PIPE_TO_X "Something that means something useful to some_program";

Note the or die $! : it’s always important to check the return value of anything that talks to the outside world, like open, to make sure something funny isn’t going on. Get into the habit early: it’s surprising how often the file that can’t possibly be missing actually is…
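
A pipe you can try on any machine that has perl is one from perl itself. This sketch assumes a Unix-like system (the list form of a piped open may not work on older Windows perls), and uses $^X, which holds the path of the perl that is currently running:

```perl
use strict;
use warnings;

# $^X is the path of the running perl; the remaining arguments are
# passed straight to the child, with no shell in between
open my $PIPE, "-|", $^X, "-e", 'print "hello\n" for 1 .. 3'
    or die "Can't start child perl: $!\n";
my @output = <$PIPE>;

# for a pipe, close waits for the program and returns FALSE if it failed
close $PIPE or warn "Child perl exited with status ", $? >> 8, "\n";

print "Read ", scalar(@output), " lines from the pipe\n";
```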

An even more common way of executing external programs is to use system. system is useful when you just want to run a program (perhaps one that does something with some data you have just created) and don’t need to read its output or write to its input:

system "DIR";

Will run the program DIR from the shell, should it exist. Given it doesn’t exist on anything but Windows, there’s no point in running it unless the OS is correct. Perl has the OS name (sort of) in a punctuation variable, $^O. Try running:

print $^O;

MSWin32

to find out what perl thinks your OS is called.

system is a weird command: it returns the exit status of the program it ran, and by convention that is 0 (i.e. FALSE) when the program worked. Hence:

if ( $^O eq "MSWin32") { system "dir" or warn "Couldn't run dir $!\n" }
else                   { print "Not a Windows machine.\n"             }

will give spurious warnings. Here we have used warn instead of die: warn does largely the same thing as die, but doesn’t actually exit: it just prints a warning. As you may guess from my ‘coding’ the word exit, if you want to kill a perl program happily (rather than unhappily, with die), use exit.

print "Message to STDOUT\n";
warn  "Message to STDERR\n";
exit 0;                   # exits program gracefully with return code 0
die "Whinge to STDERR\n"; # exits program with an error message

What you actually need for system is the bizarre:

system "dir" and warn "Couldn't run dir $!\n";

a (historically explicable, but still bizarre) wart.
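If the and trick makes your eyes water, an alternative is to compare the return value with zero explicitly. The sketch below wraps that in a little sub: run_or_warn is a made-up name for illustration, and $^X (the running perl) stands in for a real external program. system hands back the same status that lands in $?: 0 for success, -1 if the program never started (in which case $! says why), and otherwise the exit code shifted left by eight bits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# run_or_warn is a hypothetical helper, not a built-in
sub run_or_warn {
    my @command = @_;
    my $status  = system @command;    # list form: no shell involved

    if ( $status == -1 ) {
        warn "Couldn't start @command: $!\n";
        return 0;
    }
    if ( $status != 0 ) {
        # simplification: ignores the case where a signal killed the child
        warn "@command exited with code ", $? >> 8, "\n";
        return 0;
    }
    return 1;    # TRUE when the command worked, as you'd expect
}

run_or_warn( $^X, "-e", "exit 0" ) and print "First command worked.\n";
run_or_warn( $^X, "-e", "exit 3" ) or  print "Second command failed.\n";
```

The wrapper turns system’s back-to-front logic the right way round, so the usual or idiom works on it.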

perl opens three filehandles when it starts up: STDIN, STDOUT and STDERR. You’ve met the first two already. STDERR is the filehandle warnings, dyings and other whingings are printed to: it is also connected to the terminal by default, just like STDOUT, but is actually a different filehandle:

warn "bugger";

and

print STDERR "bugger";

have largely the same effect. There’s no reason why you can’t close and re-open a filehandle, even one of the three default ones:

#!/usr/bin/perl
use strict;
use warnings;
close STDERR;
open STDERR, ">>", "errors.log" or die "Can't append to errors.log: $!";
warn "You won't see this on the screen, but you'll find it in the error log";

Logical operators

You have now met two of Perl’s logical operators, or and and. Perl has several others, including not and xor. It also has a set stolen from C that look like line-noise: ||, && and !, which also mean ‘or’, ‘and’ and ‘not’, but bind more tightly to their operands. Hence:

open my $FILE, "<", "C:/file.txt" or die "oops: $!";

will work fine, because the precedence of or (and all the wordy logic operators) is very low, i.e. perl thinks this means:

open( my $FILE, "<", "C:/file.txt" ) or die "oops: $!";

because or has an even lower precedence than the comma that separates the items of the list. However, perl thinks that:

open my $FILE, "<", "C:/file.txt" || die "oops: $!";

means

open my $FILE, "<", ( "C:/file.txt" || die "oops" );

because || has a much higher precedence than the comma. Since "C:/file.txt" is TRUE (it’s defined, and is neither the empty string nor 0), perl will never see ‘die "oops"’. The logical operators like &&, or and || return whatever they last evaluated, here C:/file.txt, so perl will try to open this file, but if it doesn’t exist, there is nothing more to do and you will get no warning that something has gone wrong. The upshot: don’t use || when you should use or, or make sure you put in the brackets yourself:

open( my $FILE, "<", "C:/file.txt" ) || die "oops: $!";

Operator precedence is a little dull, but it is important. If you are worried, bung in parentheses to ensure it does what you mean. Generally perl DWIMs (particularly if you’re a C programmer), but don’t count on it.
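To see the difference between the tight and loose operators without involving any files, here’s a small sketch using assignment. The sub fallback is invented purely for the demonstration. With ||, the fallback is evaluated as part of the right-hand side of the assignment; with or, the assignment happens first, and the fallback’s return value is simply thrown away.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# fallback is a hypothetical sub for illustration
sub fallback { return "default" }

# || binds tighter than =, so this is: $tight = ( "" || fallback() )
my $tight = "" || fallback();

# or binds looser than =, so this is: ( $loose = "" ) or fallback()
my $loose = "" or fallback();

print "tight is '$tight'\n";
print "loose is '$loose'\n";
```

So || is the one you want for supplying default values, and or is the one you want for ‘do this or die’.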

Backticks

One last way of executing things from the shell is to use ` ` backticks. These work just like the quote operators, and will interpolate variables (as will system "$blah @args" for that matter), but they capture the output into a variable:

my $output = `ls`;
print $output;

Like qq() and q() and qw(), there is also a qx() (quote execute) operator, which is just like backticks, only you choose your own quotes:

my @output = qx:ls:;
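The two examples above differ in more than their quotes: in scalar context you get all the output as one string, and in list context you get one element per line. Here’s a sketch of that, again assuming $^X (the running perl) as a stand-in for ls so that it doesn’t depend on your OS. The exit status of the command ends up in $?, just as with system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "$^X" is interpolated into the command; the quotes guard against
# spaces in the path. The child perl prints 1, 2 and 3 on separate lines.
my $scalar = qx{"$^X" -le "print for 1..3"};    # one string, newlines and all
my @list   = qx{"$^X" -le "print for 1..3"};    # one element per line
die "Command failed with status $?" if $? != 0;

chomp @list;
print "Scalar context gave one string of ", length $scalar, " characters\n";
print "List context gave ", scalar @list, " lines: @list\n";
```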

Directories

Handling directories is similar to handling files:

opendir my $DIR, "." or die "Can't opendir .: $!";
while ( defined( $_ = readdir $DIR ) ) {
    print "$_\n";
}

The opendir command takes a directory handle, and a directory to open, which can be something absolute, like C:/windows, or something relative, like . the current working directory (CWD) or ../parp the directory parp in the parent directory of the CWD.

Rather than using the <> line reader, you must use the command readdir to read the contents of a directory. I’ve used the defined explicitly, as you never know what idiot is going to create a file or directory called 0 in the directory you’re reading.

When you get to the end of a directory listing using readdir, you will need to use rewinddir to get back to the beginning, should you need to read the contents in again.
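A minimal sketch of reading the same directory twice, assuming the current directory is readable and nobody is adding files to it while the program runs. It skips the special entries . and .. that readdir returns for the directory itself and its parent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

opendir my $DIR, "." or die "Can't opendir .: $!";

# first pass: count the entries one at a time
my $first_pass = 0;
while ( defined( my $entry = readdir $DIR ) ) {
    next if $entry eq "." or $entry eq "..";
    $first_pass++;
}

# readdir has now run off the end, so go back to the start
rewinddir $DIR;

# second pass: in list context, readdir returns everything that's left
my @second_pass = grep { $_ ne "." and $_ ne ".." } readdir $DIR;
closedir $DIR or die "Can't closedir .: $!";

print "First pass saw $first_pass entries\n";
print "Second pass saw ", scalar @second_pass, " entries\n";
```

Without the rewinddir, the second readdir would return an empty list.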

To change the current working directory, you use the command chdir.

Here’s a program that changes to a new directory, and spews out stuff about the contents to a file called ls.txt in the new directory.

#!/usr/bin/perl
use strict;
use warnings;

my $dir = shift @ARGV;
chdir $dir or die "Can't change to $dir: $!";
opendir my $DIR, "."
    or die "Can't opendir $dir: $!\n"; # the new CWD, to which we changed
open my $OUTPUT, ">", "ls.txt" or die "Can't open ls.txt for writing: $!";

while ( defined ( $_ = readdir $DIR ) ) {
    if    ( -d $_ ) { print $OUTPUT "directory $_\n" }
    elsif ( -f $_ ) { print $OUTPUT "file $_\n" }
}
close $OUTPUT or die "Can't close ls.txt: $!\n";
    # pedants will want to use an 'or die' here
closedir $DIR or die "Can't closedir $dir: $!";
    # perl will close things itself, but it doesn't hurt to be explicit

There are a few new things here. @ARGV you may recognise from the symbol table programs. This is another special Perl variable, like $_ and $a. It contains the arguments you passed to the program on the command line. Hence to run this program you will need to type:

perl thing.pl d:/some/directory/or/other

@ARGV will contain a list of the single value d:/some/directory/or/other, which you can get out using any array operator of your choice. In fact, pop and shift will automatically assume @ARGV in the body of the program, so you could equally well write:

my $dir = shift;

and get the same effect. This should remind you of subroutines: the only difference is that array operators default to @ARGV in the body, and @_ in a sub. The V stands for ‘vector’, if you’re interested: it’s a hangover from C.
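Here is the same bare shift doing both jobs in one program, as a sketch. The default value stuffed into @ARGV is only there so the program has something to chew on if you run it without arguments, and first_of is a made-up sub name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pretend something was typed on the command line if nothing was
# (this default is purely for demonstration).
@ARGV = ( "d:/some/directory/or/other", "second" ) unless @ARGV;

my $dir = shift;    # in the body of the program: shift @ARGV
print "First argument: $dir\n";
print "Arguments left in \@ARGV: ", scalar @ARGV, "\n";

# first_of is a hypothetical sub for illustration
sub first_of {
    my $first = shift;    # inside a sub: shift @_
    return $first;
}

print "first_of returned: ", first_of( "x", "y", "z" ), "\n";
```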

File-test operators

The rest of the program is self explanatory, except for the -f and -d. Not too surprisingly, these are ‘file test’ operators. -f tests to see if a file is a file, and -d tests to see if a file is a directory. So:

-f "C:/autoexec.bat"

will return TRUE, as will:

-d "C:/windows"

as long as they exist! Perl has a variety of other file test operators, such as -T, which tests to see if a file is a plain text file, -B, which tests for binary-ness, and -M, which returns the age of a file in days at the time the script started. The others can be found using perldoc.
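Since C:/autoexec.bat is unlikely to be lurking on your machine these days, here’s a sketch that makes its own file and directory to test. It assumes the core module File::Temp for creating (and tidying up) the temporary bits, and it uses two more operators from the same family: -e (does it exist?) and -s (its size in bytes, which is FALSE for an empty file).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw( tempfile tempdir );

# make a scratch directory containing one small file
my $dir = tempdir( CLEANUP => 1 );
my ( $FH, $file ) = tempfile( DIR => $dir );
print $FH "some text\n";
close $FH or die "Can't close $file: $!";

print "$file exists\n"          if -e $file;
print "$file is a plain file\n" if -f $file;
print "$dir is a directory\n"   if -d $dir;
print "$file holds ", -s $file, " bytes\n";

# -M counts from when the script started, so a file created after
# that will report a tiny (possibly negative) age
printf "Age according to -M: %.6f days\n", -M $file;
```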

`perldoc`

perldoc is perl’s own command line manual: if you type:

perldoc -f sort

at the command prompt, perldoc will get all the documentation for the function sort (the -f is a switch for f(unction) documentation), and display it for you. Likewise:

perldoc -f -X

will get you information on file test operators (generically called ‘-X’ functions). For really general stuff:

perldoc perl

will get you general information on perl itself, and:

perldoc MODULE_NAME

e.g.:

perldoc strict

will extract internal documentation from modules (including pragma modules like strict) to tell you how to use them. This internal documentation is written in POD (plain old documentation) format, which we’ll cover when we get onto writing modules. Lastly:

perldoc -h

or amusingly:

perldoc perldoc

will tell you how to use perldoc itself, which contains all the other information for its correct use I can’t be bothered to write out here.

Files and directories summary

A quick summary. Opening files looks like:

open my $FILEHANDLE, $RW, $file_to_open; # note the commas

If $RW is “<”, the file will be opened for reading; if “>”, for writing; if “>>”, for appending. If it is “-|”, $file_to_open is instead run as an external command and the filehandle is a pipe from its output; if “|-”, the filehandle is a pipe to its input.

You should always check the return value of open to make sure it succeeded, with or die $! or similar; die prints to the STDERR filehandle, as does warn. External commands can also be run with system (don’t forget the counterintuitive ‘and die $!’), backticks, or the qx() quotes. Read from files with the <$FILEHANDLE> angle brackets, print to them with:

print $FILEHANDLE "parp"; # note the lack of comma

and close them with close.

Use opendir, readdir, rewinddir, chdir and closedir to investigate directories (with or die as appropriate), and the file-test operators -x to investigate files and directories. And if in doubt, use the perldoc.

Next up…regexes.


Oct 03

Symbol table

By polypompholyx in Perl

Symbols

That’s pretty much everything for hashes, except for one topic usually missed out from introductory tutorials (possibly rightly!) This post will tell you a little about the innards of what you’ve been doing when you create variables. It’s not really necessary to know this stuff to be able to use Perl for day-to-day stuff, so do feel free to skip to the next post if this one becomes too esoteric.

perl maintains its own internal hash, called the symbol table, or %main:: (that’s ‘hash main double colon’), which you also have access to:

#!/usr/bin/perl
# use strict; # turn off strictures, for reasons we'll come to in a minute
use warnings;

$pibble = 2;
@foo    = ( 1, 4 );
%bits   = ( me => 'tired' );
sub my_sort { return ( $a cmp $b ) }

foreach ( sort keys %main:: ) {
    print "This perl program has a symbol called $_.\n";
}

This perl program has a symbol called STDIN.
This perl program has a symbol called pibble.
...

This program will print stuff about the ‘symbols’ perl has defined for you, and the symbols you have created. Somewhere you will find pibble, foo, bits and my_sort. You’ll also find a lot of other things, including STDIN, the name of the standard input filehandle, and a and b (as in $a and $b).

Typeglobs

The symbol table is just a hash, with the rather atypical name %main::, and that program simply printed out the keys of that hash. If you want to see the values, you’ll have to be acquainted with Perl’s final, and most esoteric data type, the typeglob, and another type of scoping besides my. Arrays have @, scalars have $, and typeglobs have *. The typeglob *foo contains the definitions of $foo, @foo, %foo, and the subroutine sub foo (which is called &foo : subs get & as their sigil) all rolled into one. Try this program out:

#!/usr/bin/perl
# use strict;
# use warnings; # turn warnings off too

# define some things
$pibble = 2;
@foo    = ( 1, 4 );
$foo    = 'bar';
%foo    = ( key => 'value' );
%bits   = ( me => 'tired' );
sub my_sort { return ( $a cmp $b ) }

print "This program contains...\n";

while ( my ( $key, $value ) = each %main:: ) {
    # iterate over the key/value pairs of the symbol table hash

    local *symbol = $value;
    # this assigns the value from the symbol table to a typeglob

    # the following lines look to see if the typeglob contains 
    # a $, %, @ or & definition

    if ( defined $symbol ) {
        print "a scalar called \$$key\n";
            # remember \$ is just an escaped $ ...
            # followed by the contents of variable $key
    }
    if ( defined *symbol{ARRAY} ) {
        # *symbol{ARRAY} is a reference to the array in the typeglob,
        # or undef if there isn't one (the older 'defined @symbol'
        # is a fatal error in perl 5.22 and later)
        print "an array called \@$key\n";
    }
    if ( defined *symbol{HASH} ) {
        print "a hash called \%$key\n";
    }
    if ( defined &symbol ) {
        print "a subroutine called $key\n";
    }
}

a hash called %ENV
a scalar called $pibble
a scalar called $_
a hash called %UNIVERSAL::
a scalar called $foo
an array called @foo
a hash called %foo
a scalar called $$
...

The values from the symbol table hash are typeglobs, looking something like *main::foo, *main::ENV, *main::_ , etc. If you create your own local typeglob, *symbol, to contain one of these values from the symbol table, you can look to see if the various sub-types (scalar, array, etc.) are defined using $symbol, *symbol{ARRAY}, *symbol{HASH} and &symbol. So, as the loop runs through the $key, $value pairs from the symbol table, $value will at some point contain *main::foo. So:

local *symbol = $value;

creates a typeglob *symbol containing the definitions of symbols called main::foo, and

if ( defined *symbol{HASH} )

will ask ‘is there a hash in the symbol table called %main::foo?’.

The main:: bit means that we’re looking at symbols from the ‘main’ symbol table. A program can use more than one symbol table: we’ll get onto this when we talk about packages and modules later: the main package and symbol table is simply the one that perl assumes your program is using if you don’t set it explicitly.

`local` variables

There is one final complication. Try sticking a my on any of the variables you’ve defined, like $foo, and run the program. You’ll find they suddenly disappear from the symbol table. What on earth is happening? Well, the dirty secret is that perl actually has two completely independent sets of variables: one set introduced with Perl 5, and a legacy set that harks back to the days of Perl 4. Those that you create without a my are Perl’s old-style global or package variables, which live in the symbol table, and are extractable with typeglobs. This always includes all subroutine definitions anywhere, as you can’t use my on these. These variables are global, and any program using your code can access them. Even if they’re defined somewhere other than main, e.g. in a different package like File::Find, all you need to mess with them is to know the package to which they belong (here File::Find) and the name of the variable ($dir), and you can modify them:

$File::Find::dir = "plopsy";

to probably fatal effect. The reason these package variables were supplemented with my variables in Perl 5 was because there was no way to make package variables truly private to a subroutine. There was no my in Perl 4, and you had to use a thing called local, which you’ve seen above with a typeglob, to create temporary dynamically scoped (as opposed to lexically scoped my) variables:

#!/usr/bin/perl
# use strict; # strictures reject undeclared package variables like $variable
use warnings;

$variable = "hello";
print "\$variable is $variable in the body.\n";
temporary();
print "\$variable is still $variable in the body.\n";

sub temporary {
    local $variable = "goodbye";
    print "\$variable is $variable in the temporary sub.\n";
}

$variable is hello in the body.
$variable is goodbye in the temporary sub.
$variable is still hello in the body.

This looks to have exactly the same effect as my would, but in fact we’re still talking about the same $variable, it just so happens that perl stashes away the original value when it hits the local, and replaces it when it returns to the body of the program. The symbol table entry is temporarily changed to its new value. In contrast, my creates a completely separate, fresh and unsullied variable with no relationship whatsoever to variables of the same name elsewhere in the program. To see the difference, if you called another subroutine from within temporary(), $variable would still be set to its temporary value of ‘goodbye’:

#!/usr/bin/perl
# use strict; # strictures reject undeclared package variables like $variable
use warnings;

$variable = "hello";
print "\$variable is $variable in the body.\n";
temporary();
print "\$variable is still $variable in the body.\n";

sub temporary {
    local $variable = "goodbye";
    print "\$variable is $variable in the temporary sub.\n";
    inner();
}

sub inner {
    print "\$variable is $variable in the inner sub.\n";
}

$variable is hello in the body.
$variable is goodbye in the temporary sub.
$variable is goodbye in the inner sub.
$variable is still hello in the body.

In contrast, ‘lexically scoped’, my variables live in only a particular part (scope) of the program, and are completely inaccessible outside of it. Each new my $variable is a completely different $variable. They do not appear in any symbol table. If you were to put my instead of local:

#!/usr/bin/perl
# use strict; # strictures reject undeclared package variables like $variable
use warnings;

$variable = "hello";
print "\$variable is $variable in the body.\n";
temporary();
print "\$variable is still $variable in the body.\n";

sub temporary {
    my $variable = "goodbye";
    print "\$variable is $variable in the temporary sub.\n";
    inner();
}

sub inner {
    print "\$variable is $variable in the inner sub.\n";
}

$variable is hello in the body.
$variable is goodbye in the temporary sub.
$variable is hello in the inner sub.
$variable is still hello in the body.

You’ll see that the $variable in temporary() is now a completely different variable, isolated from the rest of the program, unrelated to the $variable in the body of the program, and certainly not accessible from inner() any more. inner() prints out the only $variable visible in its scope, which is the one in the body of the program.

You may well never have to use typeglobs, or the symbol table, or local in anger, but it’s nice to know how stuff works, rather than merely how to use stuff, hence this digression. Normal service will now be resumed.

Next up…files and directories.
