More hashes and sorting

Hash manipulation: each, delete, exists

Before we got side-tracked with subroutines and conditionals, we covered arrays in some detail, and saw the various functions, like push and pop, that you can torture them with. The functions for torturing hashes are our next port of call. Perl stores the key => value pairs of a hash in essentially random order (well, unhelpful, if not actually random), so operations like pushing and popping don’t make any sense, as hash items are not ordered in a useful fashion. We’ve already seen how to extract single items and slices of a hash:

my %bits          = ( soy => 'sauce', sesame => 'oil', garlic => 'clove' );
my $one_item      = $bits{ 'soy' };
my @several_items = @bits{ 'sesame', 'garlic' };

However, to create a new member of the hash, you can’t use push, as it doesn’t make any sense, so you need to write:

$hash{ 'new_key' } = 'new_value';

You don’t actually need the quotes around the key when you access or create hash elements, as long as the key is a simple word made of letters, digits and underscores:

$hash{ new_key } = 'new_value';

but I prefer to use them as I am anal. If you want to find out if a particular hash entry exists, you can use the exists function:

print "yep, it's there\n" if exists $bits{ 'soy' };
if ( exists $bits{ 'soy' } ) {
    print "yep it's there\n";
}

Both of these do the same thing, the first just shows you that you can append a conditional if statement modifier in just the same way you can append foreach. The same applies to for, while, unless, and until.

If you want to remove a hash key, use delete:

delete $bits{ 'soy' };

will remove the pair ( soy => 'sauce' ) from the hash. These functions are all very useful, but the most common thing you’ll want to do with a hash is iterate over its pairs, in much the same way foreach ( @array ) iterates over the members of an array. There are no fewer than three variations on this theme.

The first is each, which will return a pair from a hash. You’ll most often see this in constructs using while, like:

my %bits = ( soy => 'sauce', sesame => 'oil', garlic => 'clove' );
while ( my ( $key, $value ) = each %bits ) {
    print "$key has value $value\n";
}
sesame has value oil
garlic has value clove
soy has value sauce

This iterates over the items in the hash, assigning the key value pairs to $key and $value in turn. Note the parentheses around $key and $value. You need these because each returns a two-item-long list. Slinging about lists is one of Perl’s strengths:

my @things = ( 1, 2, 'three' );
    # assign a list to an array

my ( $one, $two, $three ) = ( 1, 2, 'three' );
    # assign a list of values to a list of variables

($x, $y) = ($y, $x);
    # swap two scalars

Note the brackets around the ($one, $two, $three). You need these to make perl realise it’s a list, just as when you create arrays. If you miss them off, perl treats the commas as separating three independent expressions, and since = binds more tightly than a comma, only the last of them is an assignment. That assignment has a lone scalar on its left, so the right-hand side is evaluated in scalar context, where a comma-separated list comes up with the last thing in it, 'three'. Perl then goes “$three = 'three'”, and nothing else. $one and $two will never be assigned anything. You need brackets to force list context, in the same way as you sometimes need scalar to force scalar context. One important thing to note is that if you put an @array in something like this, it will be greedy:
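Here is a minimal sketch of what goes wrong without the brackets. The variable names are just for illustration, and warnings are switched off inside the block because perl would otherwise (quite rightly) complain about the pointless bits:

```perl
use strict;
use warnings;

my ( $one, $two, $three );
{
    no warnings;    # silence the "useless use ... in void context" grumbles
    # No brackets on the left, so this parses as:
    #   $one, $two, ( $three = ( 1, 2, 'three' ) );
    $one, $two, $three = ( 1, 2, 'three' );
}

print defined $one ? "\$one is $one\n" : "\$one is undefined\n";
print defined $two ? "\$two is $two\n" : "\$two is undefined\n";
print "\$three is $three\n";
```

Only $three ends up holding anything: the other two are left undefined.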

my ( @greedy, $starved ) =
    ( 'some', 'other', qw/things using the qw operator/ );
print "\@greedy : @greedy\n\$starved : $starved\n";
@greedy : some other things using the qw operator
$starved :

$starved will never get anything: arrays will slurp up everything from a list. There are various ways around this: here’s just one (if you know how many items you want to put in the array):

my( @greedy, $satiated );
( @greedy[ 0 .. 5 ], $satiated ) =
    ( 'some', 'other', qw/things using the qw operator/ );
print "\@greedy : @greedy\n\$satiated : $satiated\n";
@greedy : some other things using the qw
$satiated : operator

The use of a slice assignment:

@greedy[ 0 .. 5 ]

should be fairly self-explanatory: it is a slice of the array, using the .. range operator, so this is just shorthand for:

@greedy[ 0, 1, 2, 3, 4, 5 ]

and will work just fine: the array will only get stuff up to and including the word ‘qw’, and $satiated will get ‘operator’. Bear this in mind when you mess with @_ in subroutines:

( @gets_everything, $gets_nothing ) = @_;
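One way of coping that needs nothing fancy is to put the scalars first and let a single array mop up whatever is left at the end. A sketch, using a made-up subroutine called describe() (the name and its return string are inventions for illustration):

```perl
use strict;
use warnings;

# describe() is a hypothetical subroutine, purely for illustration
sub describe {
    my ( $first, @the_rest ) = @_;    # scalar first, greedy array last
    return "$first, then " . scalar( @the_rest ) . " more";
}

print describe( qw/soy sesame garlic/ ), "\n";
```

This only helps when there is at most one array and it comes last in the list.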

Getting round this array flattening and greediness will be covered when we get around to discussing references. Anyway, getting back to hashes:

while ( my ( $key, $value ) = each %bits ) {
    print "$key has value $value\n";
}

each generates a two item long list, which is captured into $key and $value, and this is repeated over the entire hash using a while loop. Note I’ve bunged in a my too, as I’ll always be using strict from now on, in the interests of getting you into good habits.

Hash keys and values

The other two ways of torturing a hash are to pull out just its keys or its values:

my %trees = (
    acorn      => "Quercus",
    oak        => "Quercus",
    beech      => "Fagus",
    yew        => "Taxus",
    maidenhair => "Ginkgo",
);

foreach ( keys %trees ) {
    print "\%trees contains the Latin name for $_.\n";
}
foreach ( values %trees ) {
    print "\%trees knows some English names for $_.\n";
}
%trees contains the Latin name for maidenhair.
%trees contains the Latin name for beech.
%trees contains the Latin name for yew.
%trees contains the Latin name for acorn.
%trees contains the Latin name for oak.
%trees knows some English names for Ginkgo.
%trees knows some English names for Fagus.
%trees knows some English names for Taxus.
%trees knows some English names for Quercus.
%trees knows some English names for Quercus.

I’ve escaped the % in the double quoted strings: you don’t need to do this, as unlike arrays and scalars, hashes don’t interpolate their contents in a double quoted string. However, it doesn’t hurt, and may be easier for you to remember. Note that hashes can have several values that are the same (Quercus twice): only keys have to be unique. If both your keys and your values are unique, you can make a bilingual dictionary with reverse:

my %Eng_to_Esp = (
    one   => 'unu',
    two   => 'du',
    three => 'tri',
    four  => 'kvar',
    five  => 'kvin'
);

my %Esp_to_Eng = reverse %Eng_to_Esp;
print "The Esperanto for two is $Eng_to_Esp{'two'}.\n";
print "And the English for kvar is $Esp_to_Eng{'kvar'}.\n";
The Esperanto for two is du.
And the English for kvar is four.

You can also see that although a hash itself won’t interpolate in a double quoted string, its members (and items from a normal array) will.
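If the values are not unique, reverse quietly loses entries, because two pairs cannot share a key in the reversed hash. A small sketch using a cut-down version of the %trees hash from above (which of the duplicated entries survives is not something you should rely on):

```perl
use strict;
use warnings;

my %trees = (
    acorn => "Quercus",
    oak   => "Quercus",    # same value as acorn
    beech => "Fagus",
);

my %by_latin = reverse %trees;

# Three pairs went in, but there are only two distinct values to become keys
print scalar( keys %trees ),    " keys before reversing\n";
print scalar( keys %by_latin ), " keys after reversing\n";
```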

Sorting: sort and <=>

Something you’ll often want to do is sort a list, especially with hashes: as the keys, values and each pairs are in an unhelpful order, you’ll often want to torture them into something more ordered. Perl has a function called sort for just these occasions:

my %trees = (
    oak        => "Quercus",
    beech      => "Fagus",
    yew        => "Taxus",
    maidenhair => "Ginkgo",
);

print "$_.\n" foreach ( sort keys %trees );
beech.
maidenhair.
oak.
yew.

By default, sort sorts things ‘ASCIIbetically’:

#!/usr/bin/perl
use strict;
use warnings;

my %trees = (
    Oak        => "Quercus", # capital O
    beech      => "Fagus",
    yew        => "Taxus",
    maidenhair => "Ginkgo",
);

print "$_.\n" foreach ( sort keys %trees );
Oak.
beech.
maidenhair.
yew.

It sorts strings by the ASCII values of their characters, hence O comes before b, because the ASCIIbet goes something like 0, 1, 2 .. 9, (some other things), A, B, C .. Z, (few bits), a, b, c .. z. As here:

print "The ASCII value of O is ", ord "O", "\n";
print "The ASCII value of b is ", ord "b", "\n";
The ASCII value of O is 79
The ASCII value of b is 98

This also demonstrates the use of ord, which tells you the ASCII value of a letter. chr does the opposite, converting ASCII numbers to characters.

print chr( $_ ) foreach ( 74, 117, 115, 116, 32,
97, 110, 111, 116, 104, 101, 114, 32, 112, 101,
114, 108, 32, 104, 97, 99, 107, 101, 114, 46);
Just another perl hacker.

Anyway, the point is, if you want your data sorted numerically, or properly alphabetically, rather than ASCIIbetically, you’ll need to twiddle with sort. sort can take an optional extra argument that tells it how to sort:

my @numbers            = ( 1, 2, 3, 4, 100, 101, 102, 6); # 6 is out of order
my @default_sorted     = sort @numbers;
my @numerically_sorted = sort { $a <=> $b } @numbers;

print " DEFAULT: @default_sorted\n NUMERICALLY: @numerically_sorted\n";
DEFAULT: 1 100 101 102 2 3 4 6
NUMERICALLY: 1 2 3 4 6 100 101 102

Note the default output: 100 comes before 2, because the first character of 100, ‘1’, comes before the first character of 2, ‘2’. So how does the numerical sort work? The extra bit sort needs is a block squashed between the keyword sort and the things to sort, surrounded by braces { }:

sort { $a <=> $b } @numbers;

The spaceship operator, <=>, compares two numbers and returns certain values depending on which is larger. The values it compares are $a and $b, which are sort’s default variables, and stand for pairs of things taken from @numbers. perl does the actual sorting itself: all you need to tell perl is, given a pair of numbers ($a and $b), which one is bigger, i.e. which should come later in the sorted list.

  • If $a is bigger, you need to tell perl ‘1’.
  • If $b is bigger, you need to tell perl ‘-1’.
  • If they are both equal, you should tell perl ‘0’.
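You can watch these three answers come out of the spaceship operator directly. This is a throwaway sketch with arbitrary numbers; the descending sort at the end is an extra that follows from the same rules:

```perl
use strict;
use warnings;

print 10 <=> 2,  "\n";    # left side is bigger, so 1
print 2  <=> 10, "\n";    # right side is bigger, so -1
print 2  <=> 2,  "\n";    # equal, so 0

# Swapping $a and $b reverses the sense of the comparison,
# which gives a descending numerical sort
my @descending = sort { $b <=> $a } ( 1, 100, 6, 2 );
print "@descending\n";
```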

The spaceship operator is a built-in comparison thingummy that does just this for numbers. For strings, the equivalent is cmp (remember == vs. eq), which compares strings character by character according to their ASCII values. Hence:

sort { $a cmp $b } @strings;

is the same as just plain old:

sort @strings;

To sort things properly alphabetically, you might try:

my @trees = qw/oak ash Ginkgo Quercus linden Fraxinus lychee/;
print "$_\n" foreach ( sort { lc( $a ) cmp lc( $b ) } @trees );
ash
Fraxinus
Ginkgo
linden
lychee
oak
Quercus

lc stands for ‘lower case’: it returns strings it is given in lowercase, here so they can be compared without worrying that A-Z comes before a-z in the ASCIIbet. You’ll never guess what uc does.
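The same block trick lets you sort a hash’s keys by their values rather than by the keys themselves: inside the block, $a and $b are keys, so you look up the value for each before comparing. A sketch using the %trees hash from earlier:

```perl
use strict;
use warnings;

my %trees = (
    oak        => "Quercus",
    beech      => "Fagus",
    yew        => "Taxus",
    maidenhair => "Ginkgo",
);

# Compare the Latin names, but hand back the English keys
my @by_latin = sort { lc( $trees{$a} ) cmp lc( $trees{$b} ) } keys %trees;
print "$_ ($trees{$_})\n" foreach @by_latin;
```

The keys come out in the order of their Latin names: Fagus, Ginkgo, Quercus, Taxus.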

You can define much more complicated and arbitrary sorting schemes than these, using the ‘1’, ‘-1’, ‘0’ technique. In many of these cases, it’s more convenient to define a subroutine to do the comparisons, such as in_my_arbitrary_way, then call it using:

@weird_sorted = sort in_my_arbitrary_way @things;

Say you’d prefer it if the first word in the dictionary was ‘xenon’, but then afterwards, carried on as normally:

my @strings      = qw( zebedee blob aardvark xenon shark cat dog );
my @funny_sorted = sort funny_sort @strings;
print "@funny_sorted\n";

sub funny_sort {
    if    ( $a eq 'xenon' ) {
        return -1; 
            # if $a is xenon, $a should come earlier, so -1
    }
    elsif ( $b eq 'xenon' ) {
        return 1; 
            # if $b is xenon, $a must come later, so 1
    }
    else {
        return ( lc( $a ) cmp lc ( $b ) ); 
            # otherwise sort alphabetically
    }
}
xenon aardvark blob cat dog shark zebedee

This will run under use strict; even though we’ve not ‘scoped’ the $a and $b in the subroutine using my. This is because $a and $b, as well as all the funny punctuation variables like $_, are special-cased and you don’t need to scope them.

That’s sort pretty much sorted: you can use it in any of these ways:

@sorted = sort @unsorted;
    # use the default ASCIIbetical sort

@sorted = sort { DO_SOMETHING_WITH_$a_AND_$b } @unsorted;
    # use your own sort

@sorted = sort my_sorting_subroutine @unsorted;
    # define your own sort sub elsewhere

Orcish manoeuvre and Schwartzian transform

A common problem when performing sorts is that the internal implementation of sort has to access the data on which the sort is performed several times for each item in the unsorted list. If accessing this data is time consuming (if you have to use a system call for example), the sort can take rather a long time. The canonical example is trying to sort a list of filenames by their age:

my @filenames = glob "data/bacteria/sequences/*.seq";
# glob calls a C-shell style globbing function
sub brute_force { -M $a <=> -M $b }
my @sorted = sort brute_force @filenames;

There are several ways to get around this expensive look-up: the Orcish manoeuvre and the Schwartzian transform are two. Both cache the key you are sorting on so that you only need to look it up once. The Orcish manoeuvre is the simpler of the two:

my @filenames = glob "data/bacteria/sequences/*.seq";
{
    my %cached; # lexically scoped in a bare block to keep things tidy
    sub orcish {
        ( $cached{$a} ||= -M $a )
                  <=>
        ( $cached{$b} ||= -M $b )
    }
}
my @sorted = sort orcish @filenames;

In the Orcish manoeuvre, we check to see if we have already looked up the value of -M $a or -M $b by requesting the value of $cached{$a} or $cached{$b}. If we haven’t previously cached them, we store the age in the cache, so that next time we don’t have to do the expensive system call to get the age. If we have previously cached them, we just get the value from the cache anyway. The name is derived from the ||=, which could be read “Or-Cache” if you like.
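To see the cache doing its job without needing a directory full of sequence files, here is a sketch that swaps -M for a made-up ‘expensive’ function, slow_length(), which counts how often it gets called. The function, the counter and the word list are all inventions for the purposes of illustration:

```perl
use strict;
use warnings;

my $lookups = 0;

# slow_length() is a hypothetical stand-in for an expensive look-up like -M
sub slow_length {
    my ( $string ) = @_;
    $lookups++;
    return length $string;
}

my @words = qw( maidenhair oak beech yew linden ash lychee );

my %cached;
my @sorted = sort {
    ( $cached{$a} ||= slow_length( $a ) )
              <=>
    ( $cached{$b} ||= slow_length( $b ) )
        or $a cmp $b    # break ties alphabetically
} @words;

print "@sorted\n";
print "slow_length() was called $lookups times for ", scalar( @words ), " words\n";
```

However many comparisons sort makes, each word is only looked up once. Note that ||= will redo the look-up for any key whose cached value is false (0, for instance), which is harmless here because no word has zero length.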

You can compare the Orcish and brute force techniques using the Benchmark module:

use Benchmark;
my @filenames = glob "data/bacteria/sequences/*.seq";
sub brute_force { -M $a <=> -M $b }
{
    my %cached;
    sub orcish   {
        ( $cached{$a} ||= -M $a )
               <=>
        ( $cached{$b} ||= -M $b )
    }
}
timethese(
    10000, # 10 000 iterations of each
    {
        "Brute force" => 
            sub { my @sorted = sort brute_force @filenames },
        "Orcish" => 
            sub { my @sorted = sort orcish @filenames },
    }
);
Benchmark: timing 10000 iterations of Brute force, Orcish...
Brute force: 133 wallclock secs 
  (30.02 usr + 102.50 sys = 132.52 CPU) @ 75.46/s (n=10000)
Orcish: 1 wallclock secs 
  ( 1.43 usr + 0.00 sys = 1.43 CPU) @ 6983.24/s (n=10000)

Benchmark’s function timethese() takes a number of iterations (10 000 here) and a hashref of name => coderef pairs as arguments, and times how long the coderefs take to run. The Orcish manoeuvre is much faster.

The Schwartzian transform (named after Randal L. Schwartz) is even faster, but rather more complex at first glance:

my @filenames = glob "data/bacteria/sequences/*.seq";
my @sorted =
map {
    $_->[0]
    # 4. Construct the list of names by extracting 
    # them from the sorted arrayrefs
}
(
    sort {
        $a->[1] <=> $b->[1]
        # 3. Sort this list of arrayrefs based on their file ages
    }
    (
        map {
            [$_, -M]
            # 2. Convert them into list of [ filename, age ] arrayrefs
        }
        (
            @filenames
            # 1. Take a list of filenames
        )
    )
);
print "$_\n" for @sorted;

Read it “upwards”: first we take @filenames, and create a list of [ filename, age ] arrayrefs using map. Then we sort based on the age ( $a->[1] <=> $b->[1] ), to create an ordered list of these [ filename, age ] arrayrefs. Finally, we grab the filenames $_->[0] and create a final list of sorted filenames using map again. This map-sort-map is elegant and fast as the array lookups are quicker than the hash lookups in the Orcish manoeuvre.
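The same map-sort-map shape works for any expensive key. Here is a self-contained sketch that sorts words by length instead of files by age, so it will run anywhere. The word list is made up, and length is hardly expensive: it is only standing in for -M.

```perl
use strict;
use warnings;

my @words = qw( maidenhair oak beech linden );

my @sorted =
    map  { $_->[0] }                # 3. pull the words back out
    sort { $a->[1] <=> $b->[1] }    # 2. sort the pairs by length
    map  { [ $_, length $_ ] }      # 1. build [ word, length ] pairs
    @words;

print "@sorted\n";
```

Written without the extra parentheses, the transform still reads from the bottom up.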

Hashes summary

Hashes are as simple to use as arrays too: you can use any of the following for hash torture:

my %hash = (
    telephone   => "Bell",
    television  => "Baird, no Farnsworth, no Baird",
    lightbulb   => "Insert argument here",
    JesusChrist => "Paul of Tarsus",
);

print $hash{ 'lightbulb' };                 # access
print @hash{ 'lightbulb', 'television' };   # slice
$hash{ www } = "Berners Lee";               # append
print "Yes" if exists $hash{ 'telephone' }; # exist
delete $hash{ 'JesusChrist' };              # remove

while ( my( $k, $v ) = each %hash ) {
    print "$v invented $k\n";               # iterate
} 

print keys %hash;                           # keys
print sort values %hash;                    # values


Conditionals

Conditionals: if and unless

We’re on the home-straight in understanding the code from the last post now:

#!/usr/bin/perl

use strict;
use warnings;

my @peas = qw/chick mushy split/;
while ( my $type = pop @peas ) {
    print "$type peas are ", flavour( $type ), ".\n";
}

sub flavour {
    my $query = shift @_;
    my @peas = qw/chick garbanzo/;
    foreach ( @peas ) {
        if ( $query eq $_ ) {
            return "delicious";
        }
    }
    return "disgusting";
}

The very last bit is the conditional statement in the flavour() subroutine. This part compares the type of pea the subroutine was passed with all the peas in its own @peas, and if it matches any of them, the subroutine returns ‘delicious’. The if conditional has the general form:

if     ( THIS_IS_TRUE ) { DO_SOMETHING; }

which is analogous to:

while  ( THIS_IS_TRUE ) { DO_SOMETHING; }

The equivalent of:

until  ( THIS_IS_TRUE ) { DO_SOMETHING; }

is:

unless ( THIS_IS_TRUE ) { DO_SOMETHING; }

Comparison operators: eq and ==

The actual comparison the if statement makes is:

$query eq $_

The eq tests to see if two strings are identical. Perl has two sets of comparisons: numerical and string. The ‘equal to’ test is eq for strings, and == for numbers (that’s two = signs). Perl goof number one is getting == comparison and = assignment mixed up.

In addition to ‘equal to’ comparisons, Perl also has greater than, less than, greater than or equal to, less than or equal to, and not equal to comparisons. For numbers these are >, <, >=, <=, and != respectively. The equivalents for strings are gt, lt, ge, le, and ne.

The reason Perl makes a distinction between numerical and string comparisons is because “2” and “2.0” are numerically equal, but not stringily equal : "2" == "2.0" is TRUE because 2 and 2.0 are the same numerically (mathematicians: shush). However, "2" eq "2.0" is FALSE, because they are clearly not the same string of characters. You want the maths symbols to compare things as numbers, and the language symbols to compare them as strings.
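A quick sketch of the difference, using the same two strings (the variable names are arbitrary):

```perl
use strict;
use warnings;

my $short = "2";
my $long  = "2.0";

print "numerically equal\n"   if $short == $long;    # compared as numbers
print "stringily equal\n"     if $short eq $long;    # compared as strings
print "stringily different\n" if $short ne $long;
```

Only the first and last lines print anything.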

if statements can be optionally followed by any number of elsif statements, and an optional else statement, so:

if    ( THIS_IS_TRUE ) {
    DO_THIS_THING;
}
elsif ( THIS_OTHER_THING_IS_TRUE ) {
    DO_THIS_OTHER_THING;
}
else {
    DO_THE_DEFAULT_THING;
}

Which is all very simple and obvious. You can also nest ifs inside other ifs to a gazillion degrees, which is a perfect way of making code unreadable, but will be necessary from time to time.
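For the avoidance of doubt, here is the skeleton above with some flesh on it. size_of() and its thresholds are made up, just to give the branches something to do:

```perl
use strict;
use warnings;

# size_of() is a made-up subroutine, purely for illustration
sub size_of {
    my ( $number ) = @_;
    if    ( $number < 10 ) {
        return "small";
    }
    elsif ( $number < 100 ) {
        return "medium";
    }
    else {
        return "large";
    }
}

print "$_ is ", size_of( $_ ), "\n" foreach ( 3, 42, 1000 );
```

The tests are tried in order and the first one that is TRUE wins, so 3 never gets as far as the elsif.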

Switch statements: given and when

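If you come from a C background, you may be wondering if Perl has a switch statement (if you don’t: a switch is basically shorthand for a very long if...elsif...elsif...elsif...else statement). As of Perl 5.10.0, it does, but you need to explicitly enable it, here by asking for a recent enough version of Perl. Bear in mind that given and when have been flagged as experimental since Perl 5.18, so newer perls will grumble when they see them: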

use 5.14.1;
given ( $arg ) {
    when ( $_ eq 'quit' )  { exit;               }
    when ( $_ eq 'squit' ) { say "Norovirus";    }
    default                { system "perldoc Math::Complex" }
}

given sets $_ to the argument you give it ($arg), and you can then test $_ against various values using when. If one of the when cases succeeds, control leaves the given switch structure. If none of the cases succeed, an optional default block can be called. The new keyword exit simply causes a Perl program to stop. system we’ll come across later, but for the moment, just see what happens if you enter neither ‘quit’ nor ‘squit’ when you run this script.


Bondage, discipline and subroutines

Lexical my variables and use strict;

You may have noticed a little thing I slipped in the last script: the keyword my in the chomp. my is a very important keyword, although you’ll note that it doesn’t seem to make any difference if you delete it and run the program. What my does is pin a variable to a particular part of your program, so that it can’t be seen from elsewhere. This may not seem very useful at the moment, but is exceedingly important as your programs get bigger. Such as here:

#!/usr/bin/perl

use strict;
use warnings;

my @peas = qw/chick mushy split/;
while ( my $type = pop @peas ) {
    print "$type peas are ", flavour( $type ), ".\n";
}

sub flavour {
    my $query = shift @_;
    my @peas = qw/chick garbanzo/;
    foreach ( @peas ) {
        if ( $query eq $_ ) {
            return "delicious";
        }
    }
    return "disgusting";
}

Many new things, we’ll take it a bit at a time. Most Perl tutorials I’ve read leave my until the very end, but it’s not really very difficult, and in the interests of getting you into good habits early, we’ll take it on now. The first step to writing well behaved scripts is to bung this at the top:

use strict;
use warnings;

The first line turns on Perl’s bondage and discipline mode. The second line enables safe words (sorry, warnings). In strict mode, if you do not use my (or its big brother, our) on every variable and therefore safely pin them down to particular bits of your code, your program will barf.

It’s a ridiculous question, but why should you want bondage and discipline? Why should you want to hogtie variables down to specific places in your code? Well, on little throwaway scripts, you might not, and it’s fine not to bother. But on big things, with lots of user defined functions (subroutines), it’s essential, as we shall see.

The next part of the code goes:

my @peas = qw/chick mushy split/;

i.e. create an array called @peas containing the obvious items. Note the ugly and unwise choice of quoting characters. Then:

while ( my $type = pop @peas ) {
    print "$type peas are ", flavour( $type ), ".\n";
}

while loops

Three new things here, the while loop, the pop and the flavour(). We’ll take these in turn.

while is another loop control, like for and foreach. It has the general form:

while ( THIS_IS_TRUE ) { DO_SOMETHING; }

So when is:

my $type = pop @peas

“TRUE” then? Perl considers anything apart from undefined values, empty strings, the number zero and the string "0" as TRUE. pop pulls the last member out of an array and returns it (shortening the array by one). Here the popped member is captured each time into the variable $type. Since "chick", "mushy" and "split" are not zero, and are most clearly defined as something, $type is TRUE until you try to pop a non-existent, undefined, fourth item out of the array, whereupon the loop exits. Which is all very obvious really:

while ( there are still things to pop out of the array ) { DO_SOMETHING; }
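Going back to what counts as TRUE for a moment, here is a sketch that puts a handful of arbitrarily chosen values through the same test that while and if use:

```perl
use strict;
use warnings;

my @candidates = ( 0, '', undef, '0', 1, 'chick', '0.0' );
foreach my $value ( @candidates ) {
    my $label = defined $value ? "'$value'" : 'undef';
    print "$label is ", ( $value ? "TRUE" : "FALSE" ), "\n";
}
```

The first four come out FALSE and the last three TRUE. Note '0.0': it is numerically zero, but as a string it is neither empty nor exactly "0", so it counts as TRUE. The flip side is that a loop like the one above would stop early if @peas ever contained a 0 or an empty string.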

So all this loop does is iterate over the array, just like foreach, but empties the array from the end in so doing. Perl has several other sorts of loop, in addition to while, for and foreach loops. This one should be fairly obvious too:

until ( THIS_IS_TRUE ) { DO_SOMETHING; }

Array functions: pop, push, shift, unshift, splice, reverse

Perl also has plenty of other array manipulators. pop will pull out the last member of an array. If you want to pull values out of the front end, you’ll need shift, which returns the first member of an array, shortening the array by one from the front. If you want to add things to an array, you’ll want to use push or unshift, which add things to the end or beginning of an array respectively. For example:

@peas = ( "chick", "mushy", "split" );
print "\@peas contains ( @peas )\n";

$foo = pop @peas;
# $foo contains "split", @peas now contains ("chick", "mushy")
print "$foo was popped, ( @peas ) are left in \@peas\n";

$bar = shift @peas;
# $bar contains "chick", @peas now contains just ("mushy")
print "$bar was shifted, ( @peas ) is left in \@peas\n";

push @peas, "garbanzo";
# @peas now contains ("mushy", "garbanzo")
print "garbanzo was pushed, now \@peas contains ( @peas )\n";

unshift @peas, "marrowfat";
# @peas now contains ("marrowfat", "mushy", "garbanzo")
print "marrowfat was unshifted, now \@peas contains ( @peas )\n";

push @peas, $foo, $bar;
# @peas now contains ("marrowfat", "mushy", "garbanzo", "split", "chick")
print "( $foo $bar ) were pushed, now \@peas contains ( @peas )\n";
@peas contains ( chick mushy split )
split was popped, ( chick mushy ) are left in @peas
chick was shifted, ( mushy ) is left in @peas
garbanzo was pushed, now @peas contains ( mushy garbanzo )
marrowfat was unshifted, now @peas contains ( marrowfat mushy garbanzo )
( split chick ) were pushed, now @peas contains ( marrowfat mushy garbanzo split chick )

push and unshift are list operators, and can add an entire list of things to the array. Bearing in mind an array is just a list with delusions of grandeur:

@peas  = ( "chick",  "mushy",   "split" );
@beans = ( "adzuki", "haricot", "mung"  );
push @peas, @beans, "and this too";
print "@peas\n";
chick mushy split adzuki haricot mung and this too

will shove the entire contents of @beans onto the end of @peas, followed by the string "and this too".

The least popular array operator is splice. Although splice can do everything pop, push, shift and unshift can do and more, it has a rather difficult syntax:

splice @ARRAY, START_INDEX, THIS_MANY, LIST;

will remove THIS_MANY items starting from START_INDEX, and replace them with the contents of LIST. Incidentally, splice is one of the context sensitive operators: in list context, it will return all the spliced out items, but if you call it in scalar context, it returns just the last item removed from the array, rather than the whole list of them. So:

@all_removed = splice ...;
#list context, because there's an @rray to capture what splice returns
$last_one_removed = splice ...;
#scalar context, because there's only a $calar to capture the output of splice

THIS_MANY and LIST are optional. If you leave out LIST, nothing is put in place of the removed items; if you also leave out THIS_MANY, everything from START_INDEX to the end of the array is removed.

pop @things;

and

splice( @things, -1 );

mean the same thing: both remove the last (-1) member of an array (@things), and replace it with nothing. pop is more intuitive though.
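As a sketch of splice standing in for the other four operators, plus a replacement in the middle of the array that none of them can manage (the peas are the usual suspects; nothing here is special to them):

```perl
use strict;
use warnings;

my @peas = qw/chick mushy split/;

my $popped = splice( @peas, -1 );                  # like pop @peas
print "removed $popped, left with ( @peas )\n";

splice( @peas, scalar( @peas ), 0, 'garbanzo' );   # like push @peas, 'garbanzo'
my $shifted = splice( @peas, 0, 1 );               # like shift @peas
splice( @peas, 0, 0, 'marrowfat' );                # like unshift @peas, 'marrowfat'
print "now ( @peas )\n";

# Replace one item in the middle with two others
splice( @peas, 1, 1, 'petit-pois', 'yellow-gram' );
print "finally ( @peas )\n";
```

Be careful with undef in LIST: splice( @things, -1, 1, undef ) does not shorten the array, it overwrites the last member with an undefined value.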

Another useful array operator is reverse:

@backward_peas = reverse @peas;

reverse leaves @peas itself unchanged, but returns the array in reversed order, here to be captured in @backward_peas. If you want to reverse an array in situ, use:

@array = reverse @array;

The distinction between an array and a list is similar to that between a scalar and a value: an array is something you can name, like @bits, whereas a list is just a comma-separated list of values in a script. Likewise, $that is a scalar, but 'this' is just a value.

You can slice lists in the same way as you slice arrays:

my @bits = ( 'this', 'is', 'a', 'list', 'not', 'an', 'array' )[ 0 .. 1, 5 .. 6 ];
print "@bits";

However, you cannot pop a list:

my $word = pop ( 'this', 'is', 'a', 'list', 'not', 'an', 'array' );
print $word;
Type of arg 1 to pop must be array (not list).
Execution aborted due to compilation errors.

The reason for this is that although it makes sense that you can slice, or even reverse a list:

print reverse ( qw( t s i l ) );

you cannot remove the last item from a list, because a list is not a variable: to pop a value from the list would be equivalent to taking an eraser to the text of your script, and that is nonsensical.

Subroutines (functions)

Anyway, back to the point. The only other new thing in the code we were examining above:

while ( my $type = pop @peas )
    { print "$type peas are ", flavour( $type ), ".\n"; }

is the function flavour(). Although Perl has some bizarrely named operators (like chomp, pop, getgrent and dump), flavour is not amongst them. flavour() is a user defined function, or subroutine. To create a subroutine you need to write something like:

sub NAME { DO_SOMETHING; }

And to call it, you simply need to write

NAME( ARGUMENT_LIST );

The flavour subroutine is called by the body of the program to determine how the three peas of interest taste. Subroutines frequently need to return things to the main part of the program: in this case, flavour() returns what the subroutine thinks about certain sorts of pea. So let’s look at how flavour() does this:

sub flavour {
    my $query = shift @_;
    my @peas = qw/chick garbanzo/;
    foreach ( @peas ) {
        if ( $query eq $_ ) {
            return "delicious";
        }
    }
    return "disgusting";
}

The default subroutine array @_

The first new thing here is another of the infamous punctuation variables, @_. @_ contains a list of all the arguments passed to the subroutine, in this case, whatever the value of $type was when the subroutine was called in the body of the program.

For the sake of argument, let’s say this is "chick". @_ is just an array, so shift will pull the first member out as it would with any array. So $query will end up containing "chick". Like $_, @_ is assumed by certain operators: in a subroutine, shift will assume @_ if you don’t tell it otherwise:

sub blah {   $arg   = shift @_;  }
sub blah {   $arg   = shift;     }
sub blah { ( $arg ) =        @_; }

are more-or-less equivalent, although note that the last one doesn’t actually modify @_. I almost always use the last one, since it’s easier to add extra arguments later. In the last one, we have assigned @_ to a [one item long] list (in parentheses):

( $name, $date, $error, @other_things ) = @_;
( $arg )                                = @_;

which allows you to refer to the arguments with pretty names, rather than the perfectly valid, but rather painful:

$_[0];
$_[1];
...

Note that you can’t just say:

$arg = @_;

if there’s only one argument, since the $arg forces scalar context and arrays tell you how big they are, not what’s in them in this context. The parentheses are required, unless (of course), you actually want to know how many arguments were passed, rather than what arguments were passed. Which is unlikely.
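A sketch of the two side by side, using a hypothetical subroutine called count_and_first() (the name is made up for illustration):

```perl
use strict;
use warnings;

# count_and_first() is a hypothetical subroutine for illustration only
sub count_and_first {
    my $how_many  = @_;    # scalar context: the number of arguments
    my ( $first ) = @_;    # list context: the first argument itself
    return "$how_many arguments, the first being $first";
}

print count_and_first( 'chick', 'mushy', 'split' ), "\n";
```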

Lexical scope

The subroutine flavour() defines a list of peas ("chick" and "garbanzo"), called @peas. And this is where my comes in. flavour’s @peas has exactly the same name as the @peas in the main body of the program. How is perl supposed to know the difference? What my does is prevent the @peas in the subroutine from trashing the @peas in the main body of the program.

Try this out:

@peas = qw/chick mushy/;
    # The body of the program contains an array called @peas
print "In the body of the program, \@peas contains @peas.\n";
trasher();
    # Call the subroutine, no need for arguments
print "Oh dear, it appears that \@peas in the body of the program has been trashed.\n";
print "Now it contains @peas.\n";
print "This is because \@peas in the subroutine overwrites the \@peas in main.\n";

sub trasher {
    @peas = qw/petit-pois yellow-gram/;
        # Because we haven't pinned  this @peas down with 'my',
        # it refers to the same @peas array as that in the body of the program
    print "In the subroutine trasher, \@peas contains @peas.\n";
}
In the body of the program, @peas contains chick mushy.
In the subroutine trasher, @peas contains petit-pois yellow-gram.
Oh dear, it appears that @peas in the body of the program has been trashed.
Now it contains petit-pois yellow-gram.
This is because @peas in the subroutine overwrites the @peas in main.

Without the my to pin down the two separate @peas to their proper places, subroutines can overwrite variables in the body of the program. This is usually a Bad Thing: subroutines can change the value of variables in the body of the program, but that doesn’t mean they should be allowed to!

In general, a good subroutine is a black box: you feed it values, and it feeds values back. That way, people can use your subroutines and functions without worrying what they might do to the variables in their program, or indeed, what their program might do to yours. Sometimes, you really will want a subroutine to change a ‘global’ variable, that is one in the body of a program, but more often than not, you don’t, and my is the way to stop it, thus:

@peas = qw/chick mushy/;
print "In the body of the program, \@peas contains @peas.\n";
well_behaved( );
print "Using my, we have avoided trashing \@peas in the body of the program\n";
print "\tIt still contains @peas.\n";

sub well_behaved {
    my @peas = qw/petit-pois yellow-gram/;
    print "In the subroutine well_behaved, \@peas contains its own values, @peas.\n";
}
In the body of the program, @peas contains chick mushy.
In the subroutine well_behaved, @peas contains its own values, petit-pois yellow-gram.
Using my, we have avoided trashing @peas in the body of the program
    It still contains chick mushy.

So what exactly does my do? It stops a variable being visible outside the block in which it is declared. Blocks are things enclosed in { } braces:

BODY OF PROGRAM HERE
START OF OUTER BLOCK {
    OUTER BLOCK'S SCOPE EXTENDS FROM HERE
      start of inner block {
      inner block's scope
      } end of inner block
    TO HERE AND INCLUDES THE INNER BLOCK'S SCOPE TOO
} END OF OUTER BLOCK

The ‘scope’ is basically what is enclosed in a block. If you created a my variable in the inner block, only things in the scope of the inner block could see it. The outer block would not be able to see it (or trash it) at all. If you created a my variable in the outer block, only things in the outer block’s scope could see it (but this does include the inner block!). The BODY OF PROGRAM couldn’t see either. A subroutine is just a particular case of this:

BODY OF PROGRAM HERE
START OF SUBROUTINE BLOCK {
    SUBROUTINE'S SCOPE EXTENDS FROM HERE
      start of inner block {
      inner block's scope
      } end of inner block
    TO HERE AND INCLUDES THE INNER BLOCK'S SCOPE TOO
} END OF SUBROUTINE BLOCK
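If the diagrams are a bit abstract, here is a small runnable sketch of the same nesting. The variable names ($body, $outer, $inner) and the @seen array are invented purely for illustration; each scope records what it is able to see.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $body = "body";                # visible from here to the end of the file
my @seen;                         # collects what each scope could see

{                                 # start of outer block
    my $outer = "outer";
    {                             # start of inner block
        my $inner = "inner";
        # the inner block can see all three variables
        push @seen, "$body $outer $inner";
    }                             # $inner goes out of scope here
    # the outer block can see $body and $outer, but no longer $inner
    push @seen, "$body $outer";
}                                 # $outer goes out of scope here

push @seen, $body;                # only $body is left
print "$_\n" foreach @seen;
```

Try adding a print of $inner after the inner block's closing brace: under use strict, perl should refuse to compile the script at all.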

So the @peas declared in the subroutine well_behaved() is only visible (and is the first variable of that name that is visible) within the braces that surround the subroutine:

sub well_behaved {
    my @peas = qw/petit-pois yellow-gram/;
    print "In the subroutine well_behaved, \@peas contains @peas.\n";
}

Outside the braces of this ‘scope’, my @peas is invisible, both to the body of the program and to any other subroutines you might create. A my variable is only visible from the place it is declared to the end of the innermost enclosing block.

There are a few quasi-exceptions to this:

foreach my $pea ( @peas ) { print $pea; }

DWIMs (“does what I/you mean”): the $pea is scoped to the inner block (and the rest of the program can’t see it) even though it seems to be declared in the scope of the program, not of the foreach block. This is a Good Thing.

One thing to be careful of is if you want to use a loop to stuff things into a my variable:

foreach ( @a ) { my @b; push @b, $_; } # WRONG
my @b; foreach ( @a ) { push @b, $_; } # RIGHT

The first one will create a new @b on each pass of the loop, and when the loop exits, @b goes out of scope and is destroyed! Waste of time. Use the second one. While we’re on the subject of foreach loops, you should know that the loop variable stands for the actual variable from the list you’re looping over, so mucking with it will muck with the original list:

#!/usr/bin/perl
my @bits = qw/ b c m t /;
print "@bits\n";
foreach my $bit ( @bits ) { $bit .= "ap" };
print "@bits\n";
b c m t
bap cap map tap
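If you would rather leave the original list alone, one option (a sketch, certainly not the only way) is to copy the array first and loop over the copy. The name @longer is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bits   = qw/ b c m t /;
my @longer = @bits;              # copy the list before mucking with it

foreach my $bit ( @longer ) {
    $bit .= "ap";                # $bit stands for a member of @longer, not of @bits
}

print "@bits\n";                 # the original is untouched
print "@longer\n";               # the copy has been changed
```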

To allow a program to run under use strict; we must declare every variable in the program (both the main body and the subroutines) with my. Variables declared with my in the main body of the program are still visible to subroutines (since the scope of the body includes all its subroutines), and subroutines can still change them.
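A minimal sketch of that last point, using a hypothetical subroutine called add_pea(): the @peas inside it is the one declared with my in the body of the program, because the subroutine declares no @peas of its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @peas = qw/chick mushy/;      # declared with my in the body of the program

sub add_pea {
    # No my here, so this is the @peas declared above:
    # the subroutine's block sits inside the scope of the file
    push @peas, shift;
    return;
}

add_pea( 'garbanzo' );
print "@peas\n";
```

This runs happily under use strict, which only insists that every variable is declared somewhere; it does not stop a subroutine reaching out and changing a variable it can see.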

The penultimate bit of the program:

    foreach ( @peas ) {
        if ( $query eq $_ ) {
            return "delicious";
        }
    }

simply determines whether the type of pea that flavour() gets passed matches anything in flavour()’s own @peas. If it does, it will return “delicious”, using:

return "delicious";

return sends back the list of things you give it (here the list is just one item long) to the main body of the program. So if we pass flavour() the value ‘chick’, which is in flavour()’s list of delicious peas, flavour('chick') will be ‘delicious’ and this is exactly what is printed out by the body of the program. However, if what we pass doesn’t match any of flavour()’s preferences, the foreach loop will end naturally, and we come across:

return "disgusting";

which it duly does.
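Putting those pieces back together, the subroutine might look something like the sketch below. This is a reconstruction from the fragments discussed here rather than a verbatim copy of the program: in particular, the line that fills $query from the argument list is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub flavour {
    my $query = shift;                  # assumed: the pea to test is the first argument
    my @peas  = qw/chick garbanzo/;     # flavour()'s own list of delicious peas
    foreach ( @peas ) {
        if ( $query eq $_ ) {
            return "delicious";         # leaves the subroutine straight away
        }
    }
    return "disgusting";                # only reached if nothing matched
}

print "chick peas are ", flavour( 'chick' ), "\n";
print "mushy peas are ", flavour( 'mushy' ), "\n";
```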

Subroutines summary

We’ve rather glossed over the if conditional but that is the topic of the next post. To summarise subroutines:

create (declare) them with:

sub blah { DO_SOMETHING; }

use (call) them with:

blah( LIST_OF_ARGUMENTS );
blah( $calar, @nd_an_array_too, @nd_another_array );
blah(); # if blah doesn't need telling what to do

All the arguments – including any items from arrays passed as arguments – will be flattened into a single long list, which is passed to the subroutine, and available for manipulation within the subroutine inside the default array:

@_

which you can get at using any array operator (or assigning it to a list).

my $arg1 = shift @_;
my $arg2 = pop @_;
my $arg3 = shift;
my( $arg4, @args5 ) = @_;
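To see the flattening in action, here is a small sketch with a made-up subroutine, count_and_first(), that reports how many items turned up in @_ and what the first one was.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub count_and_first {
    my $count = scalar @_;       # how many items arrived in total
    my $first = shift;           # inside a subroutine, shift works on @_ by default
    return ( $count, $first );
}

my @beans = qw/adzuki haricot/;
my @peas  = qw/chick mushy garbanzo/;

# one scalar and two arrays go in, one flat list of six items arrives
my ( $count, $first ) = count_and_first( 'lentil', @beans, @peas );
print "$count items arrived, the first was $first\n";
```

The subroutine has no way of telling where @beans stopped and @peas started: all it sees is one long list.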

Exit the subroutine with:

return ( "something\n", 'and maybe another', $thing, @or_things );
return; # or just exit without returning anything at all

Subroutines will return without an explicit return with the value they last evaluated. I always use return as I like to be explicit. You can capture what is returned in the usual way: if blah() takes a list of arguments, and returns just one thing:

$thing_returned_by_blah = blah( $argument, @other_arguments );

or if blah takes no arguments at all but returns a list:

@lot_of_things = blah();

etc., etc.
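As a rough sketch of the different ways of leaving a subroutine and catching what comes back (explicit() and implicit() are hypothetical names):

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub explicit {
    return ( 'mung', 'adzuki' );         # explicit return of a two-item list
}

sub implicit {
    my ( $x, $y ) = @_;
    $x + $y;                             # no return: the last value evaluated comes back
}

my @list      = explicit();              # capture the whole list
my ( $first ) = explicit();              # list assignment: just the first item
my $sum       = implicit( 2, 3 );

print "@list\n";
print "$first\n";
print "$sum\n";
```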

Finally, be warned that:

use strict;
if ( $you_do_not_use eq "my variables" ) {
    my @variables;
    my $pinned_down;
    print "you'll trash variables of the same name in the program body.\n";
    print "and strict will kill you";
}

Next up…conditionals.

Prettier loops and nicer code

Idiomatic Perl

The original code we were studying is very ugly, and Perl makes writing short, diabetogenic code easy. It also makes writing shoddy, abominable code easy too, but I wouldn’t recommend this as a course of action. A much more attractive way of writing the horrible code from the last page is:

#!/usr/bin/perl
print "What is your name?";
chomp( $name = <STDIN> );
@beans = qw( adzuki haricot mung );
print "\@beans contains ", scalar @beans, " members, @beans\n";
foreach ( @beans ) {
    print "$name likes $_ beans.\n";
}

This is much tidier, and probably more readable even without any further explanation. There are a few new things though. The assignment of <STDIN> (“Steve\n” or whatever) to $name actually returns the variable $name, which is exactly what chomp needs to work on. So those two lines can be combined:

$name = <STDIN>;
chomp $name;

is exactly the same as:

chomp( $name = <STDIN> );

This is a typical bit of idiomatic Perl.
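You can convince yourself that chomp really is working on the variable that has just been assigned with a sketch like this one, where a plain string ($input, a made-up name) stands in for the line typed at the keyboard:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $input = "Steve\n";           # stands in for a line read from <STDIN>
chomp( my $name = $input );      # assign first, then chomp the variable just assigned

print "[$name]\n";               # the brackets show the newline has gone from $name
print "\$input still ends in a newline\n" if $input eq "Steve\n";
```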

Quote operators

The qw( ) is a simpler way of making a list of strings. It stands for “quote words”. To save you the effort of all those:

('X', 'Y', 'Z');

quotes and commas, you can just write:

qw(X Y Z);

which is much easier on the eye, although – of course – it’s unsuitable if any of X, Y or Z contain spaces. You can mix and match quoting systems, so:

@bits = ( "hello, sailor", "1", "mung", "adzuki", "haricot", 
    "dal\n", "\t", $thing );

could also be written:

@bits = ( 'hello, sailor', 1, qw( mung adzuki haricot ), 
    "dal\n", "\t", $thing );

1 is a number, not a string, so it doesn’t need quotes, and I’ve only used double quotes for things that really need them (strings with escapes like \t in them). Written like this, it’s not really any more readable, but it gives you the idea. Incidentally, it doesn’t matter which (non-alphanumeric) character you use around a qw list:

qw(X Y Z);
qw{X Y Z};
qw/X Y Z/;
qw[X Y Z];
qw^X Y Z^;

are all equivalent. As long as you use the same character, or pairs of naturally paired characters like ( ), [ ] and { } then this’ll work. I’d personally avoid anything but parentheses and braces though, unless you’re deliberately trying to make your code unreadable.
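A quick check that the delimiters make no difference to the list you end up with (the array names are invented for the example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @round  = qw(X Y Z);
my @curly  = qw{X Y Z};
my @quoted = ( 'X', 'Y', 'Z' );  # the long-hand version

print "all the same\n" if "@round" eq "@curly" and "@curly" eq "@quoted";
print scalar @round, " items in each\n";
```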

Perl also has the quote operators qq( ) and q( ), which are equivalent to " " and ' ' respectively, but you get to choose your own quote characters, which can be useful if your string contains lots of some character you’d rather not have to keep escaping:

q(Exploding 'chocolate' cake and an awful 'lot' of 'quotes');

and:

'Exploding \'chocolate\' cake and an awful \'lot\' of \'quotes\'';

are equivalent, but the first doesn’t make your eyes bleed. Just because you can do something doesn’t mean you should: anyone using qq' ' and q" " is clearly ill. If you need to include the quote character itself in a string, whether you use conventional double quotes, or the qq( ) or q( ) operators, you’ll have to escape it, just as you did with single quotes in single-quoted strings:

"This is "wrong" ";
# because of the embedded, unescaped " characters
"This is \"OK\" ";
# properly escaped
qq!This is overexcited! and wrong! not to mention unreadably awful!;
# because the second and third !s are unescaped
qq!This is OK\! though!;
# properly escaped !

Please don’t even think of using exclamation marks as delimiters: it’s cute until the maintainer of your code has you killed.

qq{This uses a choice of "delimiter" that's (very) well chosen};
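The practical difference between q( ) and qq( ) is the same as that between single and double quotes: only the latter interpolates variables and escapes. A small sketch, with made-up variable names:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bean   = 'mung';
my $single = q{$bean costs "nothing"\n};     # like ' ': nothing is interpolated
my $double = qq{$bean costs "nothing"\n};    # like " ": $bean and \n are interpolated

print $single, "\n";     # prints a literal $bean and a literal backslash-n
print $double;           # prints mung, and ends with a real newline
```

Neither string needed its " characters escaping, which is the whole point of picking your own delimiters.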

The next bit of the shortened code:

print "\@beans contains ", scalar @beans, " members, @beans\n";

is virtually the same as before, but you’ll note that list operators like print don’t really need the ( ) parentheses, although feel free to leave them in if it prevents ambiguity. Some languages distinguish between functions (which need parentheses) and operators, which don’t. In Perl they’re by-and-large the same thing, and parentheses are only required for creating lists, or for mathematical (precedence) reasons: 2+(8*3) is different from (2+8)*3.
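The precedence point is easily checked. In this sketch (variable names invented), the third line shows what perl does if you leave the parentheses out altogether: multiplication binds more tightly than addition.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $times_first = 2 + ( 8 * 3 );     # 26
my $plus_first  = ( 2 + 8 ) * 3;     # 30
my $no_parens   = 2 + 8 * 3;         # 26: * is done before +

print "$times_first $plus_first $no_parens\n";
```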

foreach loops and the default variable $_

The C-style for loop is mercifully replaced with something much, much tidier:

foreach ( @beans ) {
    print "$name likes $_ beans.\n";
}

Which is hugely more intuitive than the for(;;){} loop, and it runs more quickly too. The only difficult thing here is the infamous $_ variable.

$_ is the ‘default’ variable. It is automagically set, and automagically assumed by many functions. A more explicit way of writing this loop is:

foreach $bean ( @beans ) { print "$name likes $bean beans.\n" }

The

foreach $bean ( @beans )

means “set $bean to the value of each member of @beans in turn”. If you don’t supply a loop variable (here, $bean), it will be conveniently assumed you wanted to put each bean into $_. We’ll be seeing a lot more of $_ as we go on.

You may be wondering what characters you can get away with in the name of a Perl variable: they should start with a letter (upper or lower case), and thereafter can contain any of A to Z, a to z, 0 to 9 and the underscore _. YMMV with alphanumerics from other scripts (Perl supports Unicode, but that’s a whole other post).

Other punctuation variables

However, Perl is also liberally sprinkled with “punctuation variables”, which you will need to learn as you go along. Rather than starting ‘$letter’, they generally start ‘$punctuation’, like $_ , $! , $@ and $/, to name four of the most useful ones. These variables are special, and are often set or assumed by certain functions. Earlier you learnt that chomp will remove newlines from variables. This is actually a fib: it will remove whatever is in $/ from the end of the string if it’s present, it just happens that $/ (the “input record separator”) is set to \n by default:

$string = "hello:";
chomp $string;
print "$string\n";
    # chomp does nothing, as $/ is \n and $string doesn't end in \n
$/ = ":";
chomp $string;
print "$string\n";
    # chomp will now remove : from the end of strings, which it duly does
hello:
hello
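Changing $/ like this affects everything that happens afterwards, including how lines are read from filehandles. One way to keep the damage contained (a sketch, using the local keyword, which hasn't been covered here) is to make the change inside a block, so the old value comes back at the closing brace:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "hello:";
{
    local $/ = ":";          # the change lasts only until the closing brace
    chomp $string;           # removes the trailing :
}

my $line = "world\n";
chomp $line;                 # $/ is back to "\n" here, so this removes the newline

print "$string $line\n";
```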

Of these punctuation variables, $_ is particularly infamous: the following do the same thing:

foreach $bean ( @beans ) { print $bean; }
foreach ( @beans ) { print; }

perl assumes $_ as the loop variable in the foreach, and also assumes $_ if no other arguments are given to print. Some criticise Perl scripts for having the invisible thread of $_ running through them, but there’s nothing to stop you being more explicit if you want.
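print is far from the only function that falls back on $_. As a sketch, uc (upper-case a string) and length also work on $_ if you give them nothing else to chew on; the @report array is just there to collect the results.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @beans = qw( adzuki haricot mung );
my @report;

foreach ( @beans ) {
    my $shout = uc;          # no argument given, so uc works on $_
    my $size  = length;      # likewise length
    push @report, "$shout has $size letters";
}

print "$_\n" foreach @report;
```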

Statement modifiers

Back to the program. There are a few more short-cuts that are instructive, if we ignore the middle bit:

print "What is your name?";
chomp( $name = <STDIN> );
@beans = qw( adzuki haricot mung );
foreach ( @beans ) {
    print "$name likes $_ beans.\n";
}

can also be written:

print "What is your name?";
chomp( $name = <STDIN> );
@beans = qw( adzuki haricot mung );
print "$name likes $_ beans.\n" foreach @beans;

For simple, single statements like the last one, you can append the loopy bit and avoid all those braces in this pleasantly readable style. Beware using this on long lines though, as the loopy bit (statement modifier) at the end can get lost. You should always have a care for the future readers of your code: well written code can read like prose if you’re careful, which makes understanding it much easier. Perl gives you ‘more than one way to do it’, but that doesn’t mean you should choose the first appalling way that comes into your head. However, if you want serious brevity:

print "What is your name?";
chomp( my $name = <STDIN> );
print "$name likes $_ beans.\n" foreach qw( adzuki haricot mung );

There is little distinction between a list and a bona fide array, so if you just put a list where a real array is expected, it’ll generally still do what you mean.

Next up…bondage, discipline and subroutines.

Simple loops

Loops

Hopefully you’re now happy with the fundamental data types in Perl: scalars, arrays and hashes, so most of the script below shouldn’t be too mysterious:

#!/usr/bin/perl
print "What is your name?\n";
$name = <STDIN>;
chomp $name;
@beans = ( "adzuki", "haricot", "mung" );
print( "\@beans contains ", scalar @beans, " members, @beans\n" );
for ( $i = 0; $i < scalar @beans; $i++ ) {
    print "$name likes $beans[$i] beans.\n";
}
@beans contains 3 members, adzuki haricot mung
Steve likes adzuki beans.
Steve likes haricot beans.
Steve likes mung beans.

Don’t get scared by the line-noise (i.e. all the punctuation): if you do, regexes will probably be fatal. The very beginning should be quite obvious now:

print "What is your name?\n";
$name = <STDIN>;
chomp $name;

means something like, “read in someone’s name from a command prompt, get rid of the trailing newline from the input, and stick the result in the scalar variable $name”, whilst:

@beans = ( "adzuki", "haricot", "mung" );

means “create a list of the strings ‘adzuki’, ‘haricot’ and ‘mung’, and store the list in the array variable @beans”. Note that @beans and a hypothetical $beans (and an even more hypothetical %beans) would be completely different variables: mucking about with one would have no effect on the other.

Context

The next line prints out some stuff about the beans:

print( "\@beans contains ", scalar @beans, " members, @beans\n" );

You may have some difficulty with exactly what is going on here, but all we’re doing is feeding print a list of three items:

print( "a string", something to do with @beans, "another string" );

Let’s consider the three items in the list separately:

"\@beans contains "

is responsible for the bit of the output that looks like

@beans contains

You need to escape (use a \ on) the @ of @beans. If you put a Perl variable into a double-quoted string, the variable name will be replaced with the actual contents of that variable. So, if you don’t escape the @, perl will interpolate the entire contents of @beans (“adzuki haricot mung”) into what it’s going to print, a fact that we’ll take advantage of in a minute. With the \, it prints a literal @ symbol instead, followed by the string ‘beans’ (no pun intended). So, \$ and \@ are two more escape characters like \" and \n.

print is a list operator, that is, it will print out any stuff you put in a list after it. In just the same way as array variables, a list is some stuff between ( ) parentheses, separated by commas, so the stuff we’re feeding print is a list (just look at the code), and the next item is:

scalar @beans

Perl is a helpfully (well, usually helpfully) context sensitive language. Unsurprisingly, the two main contexts it recognises are scalar (singular) context, and list (plural) context, which you’ve already met in passing when we discussed array and hash access and slicing. print forces list context on the things that follow it. So, if you simply put this:

@beans

you’d get a mess, because the usual behaviour of an array in list context is to interpolate all its members. perl would squadge the contents of @beans (“adzukiharicotmung”) into the output, not even padding it with spaces like it does between double quotes. That’s not what we’re after here. What we actually want is the number of items in the array. In scalar context, an array will return the number of members it contains, which is just what we want. Scalar context can be forced using the scalar operator, hence the upshot of:

scalar @beans

is to give the size of the array, rather than its contents, to print, which it duly does:

3

List/scalar context is one of Perl’s more esoteric, but more useful features, and we’ll come across more as we go along. For now, remember that if you assign stuff to an array:

@holes = ( "nostril", "meatus", "bumhole");

or use an operator that expects a list, like print:

print( "nostril", "meatus", "bumhole");

the arguments (“nostril”, and so on) will be interpreted in list context unless you explicitly use the scalar keyword. Many Perl functions, and some sorts of variable, return different values in scalar and list contexts. Most importantly for the present moment, an array will be interpreted as a list of its members in list context, but will be interpreted as the number of its members in scalar context. Hence:

@holes = ( "nostril", "meatus", "bumhole");
print @holes, "\n";
    # print forces list context on what it's given
$number = @holes;
    # $number forces scalar context so perl assigns the size of
    # array @holes to $number
print $number, "\n";
print scalar @holes, "\n";
    # scalar also forces scalar context, overriding print's
    # preference for lists
nostrilmeatusbumhole
3
3
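Assignment is where context bites most often. In the sketch below (variable names invented), the only difference between the second and third lines is a pair of parentheses on the left-hand side, which turns a scalar assignment into a list assignment:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @beans     = ( "adzuki", "haricot", "mung" );
my $count     = @beans;      # scalar assignment: @beans gives its size
my ( $first ) = @beans;      # list assignment: $first gets the first member
my @copy      = @beans;      # list assignment: all the members are copied

print "$count\n";
print "$first\n";
print scalar @copy, "\n";
```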

Finally, the last item given to print in our original program is:

" members, @beans\n"

which is responsible for the output of

members, adzuki haricot mung

Double quotes interpolate the contents of the array into the string, padding the members with spaces, hence the output of:

adzuki haricot mung

rather than:

adzukiharicotmung
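The space used for the padding isn't set in stone: it lives in one of Perl's special variables, $" (the list separator). This goes a little beyond what has been covered so far, so treat the following as a sketch; it also borrows the local keyword to keep the change confined to one block.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @beans = qw( adzuki haricot mung );
my $plain = "@beans";            # members padded with single spaces, as usual

my $fancy;
{
    local $" = ", ";             # change the padding, but only inside this block
    $fancy = "@beans";
}

print "$plain\n";
print "$fancy\n";
```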

for loops

The very last bit of the code:

for ( $i = 0; $i < scalar @beans; $i++ ) {
    print "$name likes $beans[$i] beans.\n";
}

is a for loop. for loops take a bit of explanation, and mercifully, Perl has a cute shortcut for loops which we’ll investigate in a minute. But first, let’s learn it the hard, C-style way, which is occasionally useful. A for loop, as you may know/have guessed, looks like:

for ( LOOP_VAR = START_VALUE; TEST_LOOP_VAR; INCREMENT_LOOP_VAR ) {
    DO_SOMETHING;
}

The ( ) and { } are required, but you can write this with non-K&R style bracing:

for ( LOOP_VAR = START_VALUE; TEST_LOOP_VAR; INCREMENT_LOOP_VAR )
{
    DO_SOMETHING;
}

or with hideous bracing like this:

for 
( LOOP_VAR = START_VALUE; TEST_LOOP_VAR; INCREMENT_LOOP_VAR ) 
{ DO_SOMETHING; }

if you’d rather, whatever looks best to you: Perl is largely space and newline insensitive as mentioned above. It’s fairly obvious from the output that the loop in the program prints:

Steve likes XXXXX beans.

for each member of the array @beans. How exactly does it do this? First it sets the loop variable, $i, to a starting value of 0. Then, before each pass of the loop, it tests the loop variable to see if it’s smaller than the scalar size of @beans (which is 3). At the end of each pass of the loop, it does the INCREMENT_LOOP_VAR thing: $i++ means ‘add 1 to $i’. So the upshot of all this is simply that $i is a counter that ticks 0, 1, 2. Immediately upon hitting 3 the loop will terminate (as 3 is not < 3). Hopefully the for loop should now be crystal clear. Hence:

for ( $i = 0; $i < scalar @beans; $i++ ) {
    print "$name likes $beans[$i] beans.\n";
}

means:

for
(
    set $i to 0;
    keep looping while $i is less than 3;
    increment $i by 1 on each pass of the loop
)
{
    print out Steve likes the $i'th member of @beans
}
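If you want to watch the counter ticking, this sketch records each value $i takes in a made-up array called @ticks. The name Steve is hard-wired in place of the keyboard input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @beans = ( "adzuki", "haricot", "mung" );
my @ticks;

for ( my $i = 0; $i < scalar @beans; $i++ ) {
    push @ticks, $i;         # record each value the counter takes
    print "Steve likes $beans[$i] beans.\n";
}

print "The counter took the values: @ticks\n";
```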

Operators

The increment operator, ++ , can be used in two ways: you can either write:

$i++

or

++$i

In this particular case, it doesn’t matter which you use, but in fact $i++ increments $i after returning it (post-increment), whereas ++$i increments it before returning it (pre-increment). So:

$i = 1;
print "At the start, \$i is $i.\n";

print "\$i still returns ",
    $i++, 
    " when you post-increment it, but its value is now $i.\n";
$i = 1;
print 
   "But if you pre-increment it, \$i's value will be incremented by one to ",
   ++$i,
   " before its value is actually printed.\n";
At the start, $i is 1.
$i still returns 1 when you post-increment it, but its value is now 2.
But if you pre-increment it, $i's value will be incremented by one to 2 before its value is actually printed.

Note the escaped (backslashed) $ on some of the $i’s. You already know you need to escape a @ in a double-quoted string to print a literal @ character. The exact same thing applies to $.

There’s no reason why you shouldn’t write

$i++;

as

$i = $i + 1;

but the former is much more concise. It is probably painfully obvious what  --$i and $i-- do by extrapolation, but there is more than one way to decrement a variable too (an unofficial Perl motto is TIMTOWTDI: there is more than one way to do it, although very often, one of the possible ways is markedly less disgusting):

--$i;

is shorthand for:

$i -= 1;

which itself is shorthand for:

$i = $i - 1;

The middle version is something which you may find useful. The -= subtraction assignment operator is one of a whole class of similar operators, like += (so $i++ is short for $i += 1 which is itself short for $i = $i + 1).

Perl has all the usual mathematical operators, like +, -, *, /, % (modulus), and ** (raise-to-the-power-of), and any of these may be used like -= or += too, so:

$a = $a ** 2;
# square $a and bung the result back into $a

is the same as:

$a **= 2;
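Here is the whole family at work on one variable, as a quick sketch ($n is just an illustrative name). The comments track the value after each line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $n = 10;
$n += 5;     # 15
$n -= 3;     # 12
$n *= 2;     # 24
$n /= 4;     # 6
$n %= 4;     # 2 (the remainder when 6 is divided by 4)
$n **= 3;    # 8

print "$n\n";
```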

One operator lacking in most other languages is the x operator (that’s just a little “x” character), which will ‘multiply’ strings:

$string = "hello";
$three_copies_of_string = $string x 3;
print $three_copies_of_string;
hellohellohello

You can of course also use the x in a x= construction too:

$string = "hello";
$string x= 3; # put "hellohellohello" into $string
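Two places where x tends to earn its keep, sketched with made-up variable names: drawing a rule as long as a title, and (if you wrap the left-hand side in parentheses) repeating a list rather than a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $title = "Beans I have known";
my $rule  = "-" x length( $title );   # a row of dashes exactly as long as the title
print "$title\n$rule\n";

my @zeros = ( 0 ) x 5;                # with parentheses, x repeats the list
print "@zeros\n";
```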

One final useful operator is . which concatenates two strings together:

my $world = ", world\n";
my $cat   = "This is the concatenation operator in action";
my $msg   = "hello" . $world . $cat . "\n";
print $msg;
hello, world
This is the concatenation operator in action

Next up…prettier loops and nicer code.

Hashes

Hashes

Perl actually has two sorts of array. Simple arrays like those we’ve already discussed are simply called arrays. The second sort of array is called an associative array, or hash. They’re very similar to dictionaries, if you’re a Python programmer. Hashes are created with a % for %we_ran_out_of_sensible_symbols, and contain pairs of data (ah, maybe it was for the two blobs in a % sign) called keys and values. Here are some typical hash creations:

%hash = ( key1 => 'value1', key2 => 'value2', key3 => 'value3' );
%dictionary = (
    aardvark  => "first",
    name      => "Steve",
    age       => 35, # that was 25 in the original version of this tutorial. fuck.
    zebedee   => "character in The Magic Roundabout",
);

The syntax is nearly the same as for normal arrays: the only difference is the =>. The => is just a “fat comma”, and in fact, if you replaced the => with a real comma, and surrounded each of aardvark, name, age and zebedee with quotes, it would still work. Hashes are simply created from lists, and they’re not fussy about their commas. However, => has two advantages: it makes the fact you’re creating pairs of data more explicit, and it also means you don’t have to quote the string that forms the key (this is done automagically for you): note it’s aardvark, not 'aardvark'. You might also note that I’ve neatly lined up the fat commas: pretty code is readable code.

The first member of each pair in a hash is called the key, and the second is called the value. To access the individual members of a hash, you use a similar syntax to accessing array members, but rather than [INDEX] brackets, you use {KEY} braces:

$value = $hash{ 'key' };

So:

%fruit_trees = (
    apple => "Malus", 
    pear  => "Pyrus",
    plum  => "Prunus"
);
$latin_name_of_apple = $fruit_trees{ 'apple' };
print $latin_name_of_apple;
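Being able to look things up also gives us an easy way to check the earlier claim that => is just a comma in fancy dress. In this sketch, %fat and %plain are made-up names for two hashes built from the same pairs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %fat   = ( apple => "Malus", pear => "Pyrus" );
my %plain = ( 'apple', "Malus", 'pear', "Pyrus" );   # same pairs, ordinary commas

print $fat{ 'apple' }, " and ", $plain{ 'apple' }, "\n";
print $fat{ 'pear' },  " and ", $plain{ 'pear' },  "\n";
```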

Again, like the funny syntax for normal arrays, you need a $, because you’re accessing $ingle bits of data from the hash. Hashes are stupendously useful, since it is far easier to use a hash than an array in a situation like:

$personal_details{ 'name' };

since if you used an array:

$personal_details[ 2 ];

you’d have to remember whatever arbitrary index you stored the person’s name at (2 here). It’s far easier to use a hash in these circumstances, and just ask for what you want with a key.

Hash slices

As with arrays, you can also take slices of hashes, but the syntax will make you do a double take:

%trees = (
    apple =>  "Malus",
    pear  =>  "Pyrus",
    plum  =>  "Prunus",
    oak   =>  "Quercus",
    ash   =>  "Fraxinus",
    yew   =>  "Taxus",
);
@latin_names_of_fruit_trees 
    = @trees{ 'apple', 'pear', 'plum' }; # hash slice
print "@latin_names_of_fruit_trees\n";

What’s with the @ for the slice syntax? Remember that $, @ and % aren’t really part of the name of a variable, they are ways of accessing or creating data of a specific type. When we take a slice of a hash, we are accessing a list of values, just as when we are slicing an array, so we need a @. If it’s any consolation, I think this is horrible too. Again, note the pretty arrangement of the tree hash: Perl is largely whitespace-insensitive, so you can largely get away with writing your code in whatever creative way you want.
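Hash slices work on the left of an assignment too, which is a handy way of adding several pairs at once. A short sketch, reusing a cut-down version of the tree hash:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %trees = ( apple => "Malus", pear => "Pyrus" );

# assigning to a slice adds (or overwrites) several pairs in one go
@trees{ 'oak', 'ash' } = ( "Quercus", "Fraxinus" );

# reading a slice gives the values back in the order you asked for them
my ( $oak, $apple ) = @trees{ 'oak', 'apple' };
print "$oak $apple\n";
```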

So, to summarise. Perl has three fundamental data types: scalars, arrays and hashes. When creating them, you’ll need to use $, @ and % respectively. When accessing them, you’ll need to use $ to access $ingle bits of data, and @ to access plur@l slices of data, with the appropriate brackets: [INDEX] for arrays, {KEY} for hashes.

Next up…simple loops.

Arrays and slices

Types

You should now be able to create very simple scripts that can take input from the keyboard, assign it to scalar variables, and echo it back to the screen with print. Time for some technicalities.

Perl is often (and wrongly!) termed a weakly typed language. For those of you coming from bondage and discipline languages like C or Java, this means Perl does not require you to explicitly point out what kind of variable your variables are: floats, doubles, integers, characters, booleans, strings, etc.

For those of you who are learning programming here, this means you don’t have to worry about what you put into a variable like $name, in particular, you don’t need to worry if it’s a little integral number, a big fat floating point number, a single character, a string, or anything else, and you are free to assign different values, and different kinds of value dynamically as the program runs.

Perl does have data types though. However, what perl worries about is not the distinctions a computer might like to make (in terms of how big a chunk of data is in memory), but the distinction between singular and plural, much like many human languages. In Perl, singular data is called scalar data, and comes in variables starting with a $. Plural data is called array (or list) data, and is stored in variables starting with a @.

Scalars

Scalar data is $ingular data. Scalar variables always start with a $ for $calar, and can contain one of pretty much anything. $name could contain an integer (2), a floating point number (12.045e+89), a character ("K") or a string ("or Pretty much anything else"). They can even contain pointers to other sorts of data (“references”, but we’re getting ahead of ourselves). To create a scalar variable, you just write:

$string = "Some string or other";
$number = 12453;

You don’t need to, and indeed should not, put quotes around numbers.
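A small sketch of what ‘one of pretty much anything’ means in practice: the same scalar can hold a number one moment and a string the next, and a string that looks like a number can be used as one. The variable names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $thing = 12453;               # a number to begin with
$thing    = $thing + 1;          # still a number
my $was   = $thing;              # remember it before we change our minds

$thing = "now I am a string";    # perfectly legal: no type declarations to upset

my $sum = "10" + 5;              # a string that looks like a number is treated as one

print "$was\n$thing\n$sum\n";
```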

Arrays

In contrast to scalar data, array data is plur@l data. Array variables start with a @ for @rray, and store a list of scalars. Creating an array could look like any of the following:

@numbers  = ( 1, 3.05, 4, 2e-10, 23 );
@strings  = ( "Hello", 'everybody', "I'm Dr. Nick Riviera\n" );
@allsorts = ( 12.5, 'plop', "some tabs\t\t\t\t", $chocolate, 56, "C" );

When you create an array, you need to put the (suitably quoted, scalar) values in a list between parentheses, separated by commas, i.e. arrays are created like this:

@array = ( $scalar1, $scalar2, $scalar3 );

To access the individual members of an array, imaginatively called @array, we need to use square brackets. The syntax is:

@array = ( 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22 );
$the_9th_member_of_the_array = $array[8];
print $the_9th_member_of_the_array;

And printing $the_9th_member_of_the_array will output ‘16’. There are two gotchas here.

Firstly, what has happened to the @ of @array?

And secondly, why 8, not 9?

The $ and @ ‘sigils’ aren’t really part of the name of the variable. They are more a way of telling perl whether the thing you’re talking about is a single thing or a list of things. When you create an array, you use a @ sign, because you want perl to put a list into it. However, when you want to access an individual member of an array, you’re trying to get a singular piece of data, a scalar, so this gets a $.

This syntax is a little weird, but you’re currently stuck with it: when you’re talking about a list of data, you need a @, as in @array = ( blah... ); but to access a single bit of array data, you need a $, as in $array[index_number];

Which brings me to the second point. In Perl, arrays index from the number 0, so the 9th element in an array is at index number 8: $array[8]. If you just put your brain into perl/C gear and think of the 1st element of an array as being the ‘zeroth’, 0th, member, you may find this easier. So:

#!/usr/bin/perl
@beans = ( "adzuki", "haricot", "mung" );
    # anything following a '#' is a comment
    # perl will ignore everything from the '#' to the end of the line
print "$beans[0]\n";  # this is adzuki
print "$beans[1]\n";  # this is haricot
print "$beans[2]\n";  # this is mung
print "$beans[-1]\n"; # this is also mung
adzuki
haricot
mung
mung

Negative indices are counted back from the end of the array, so the last item is the minus-1th.
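Counting back works for more than just the last item. The sketch below also uses $#beans, which is Perl's way of asking for the index of the last element of @beans (it hasn't been introduced above, so take it as a taster).

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @beans = ( "adzuki", "haricot", "mung" );

my $last        = $beans[-1];        # mung
my $second_last = $beans[-2];        # haricot
my $last_index  = $#beans;           # 2: the index of the last element
my $also_last   = $beans[$#beans];   # mung again, the long way round

print "$last $second_last $last_index $also_last\n";
```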

Slices

Another useful way to access pieces of arrays is to take slices of them:

@cake = ( "flour", "eggs", "milk", "sugar", "butter", "sultanas", "water" );
@slice_of_cake = @cake[ 3, 6 ];
print "@slice_of_cake\n";
sugar water

@slice_of_cake will contain the list ( “sugar”, “water” ). Note that the slice syntax @cake[ 3, 6 ] needs a @, because you are accessing plur@l data, not a $ingle piece of data. Likewise, @slice_of_cake also gets a @, because you’re creating a list. Slices are useful for getting at multiple bits of data simultaneously, without tediously having to write single accesses for each bit. You can also assign things to a slice:

@array = ( "hello", "everybody", "I'm", "Dr.", "Nick", "Riviera" );
print "@array\n";
@array[ 3, 4, 5 ] = ( "a", "complete", "charlatan" );
print "@array\n";

You can see from the previous few scripts that, like scalar variables, arrays also interpolate their values when put in double quotes. Furthermore, the items will be helpfully separated by spaces too.
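
Here is a sketch of what the Dr. Nick script does, with the interpolated strings captured in variables so you can see the before and after. The .. range operator in the last slice and the my declarations are additions that the text above has not covered, so take them as a preview:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @array = ( "hello", "everybody", "I'm", "Dr.", "Nick", "Riviera" );

my $before = "@array";    # interpolation joins the items with spaces
@array[ 3, 4, 5 ] = ( "a", "complete", "charlatan" );
my $after  = "@array";

print "$before\n";   # hello everybody I'm Dr. Nick Riviera
print "$after\n";    # hello everybody I'm a complete charlatan

# A slice over a run of consecutive indices can use a range
my @greeting = @array[ 0 .. 1 ];
print "@greeting\n"; # hello everybody
```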

Next up…hashes.

Input and output

Input

The “not a hello world script” script is a little dull. Let’s try something a little more useful:

#!/usr/bin/perl
print "What is your name?\n";
$name = <STDIN>;
chomp( $name );
print "$name likes beans.\n";

Don’t forget the semicolons at the end of each statement.

The first statement of this program should be obvious now, but the others may need a little explaining. The second statement contains the line-reading operator < > and the STDIN filehandle. STDIN (the “standard input filehandle”) is automatically opened whenever you run perl, and is where things inputted through pressing buttons on the keyboard live. If you surround a filehandle like STDIN with the < > line-reading angle brackets, you will read a line from that filehandle. A line to perl is some text followed by a newline character, hence:

<STDIN>

means “read a line from the keyboard”. Which it duly does, grabbing Grumblepuss\n or suchlike: whatever you type in followed by the enter key (which appends a \n newline). The = is an assignment operator, and $name is the name of a Perl variable. Hence,

$name = <STDIN>;

means, “read a line in from the keyboard and store it in the variable $name”.

If you are from a Java or C background, you may be wondering what the $ is for, although anyone sufficiently elderly to remember BBC Micro BASIC may feel slightly more at home. The $ at the front of $name means that $name is a “scalar” variable (think $ for $calar), i.e. a $ingle, flat item of data, not a list. Scalar variables are Perl’s bread and butter, and they can hold anything: an integer, a floating point number, a string, or even more esoteric things like objects and filehandles.

Cleaning up input

Next up in our script is our second Perl function (we’ve already met print). This goes by the peculiar name of chomp:

chomp( $name );

When you read data in from the keyboard, you will have to press the Enter key. In addition to ending your line of input, this will also add a newline character to whatever you just typed. Hence $name will actually contain Grumblepuss\n, with the newline at the end, not just Grumblepuss on its own. The chomp function removes the newline from the variable it is given, so after chomping, $name will just contain Grumblepuss. The parentheses aren’t really necessary here: they’re just a matter of taste. My personal taste is to leave parentheses off most Perl built-ins, so we’ll immediately change that script to…

#!/usr/bin/perl
print "What is your name?\n";
$name = <STDIN>;
chomp $name;
print "$name likes beans.\n";
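
If you want to convince yourself that chomp really does remove something, here is a small sketch. It uses a hard-coded string in place of <STDIN> (an assumption made so that it runs without any typing), and the length function, which has not been introduced above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stands in for what <STDIN> would hand back after typing Grumblepuss + Enter
my $name = "Grumblepuss\n";

print "Before chomp: ", length($name), " characters\n";   # 12

# chomp changes $name in place and returns how many characters it removed
my $removed = chomp $name;

print "After chomp: ", length($name), " characters\n";    # 11
print "chomp removed $removed character\n";                # 1
```

Chomping a string that has no trailing newline does nothing, so it is safe to chomp input even when you are not sure whether a newline is there.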

Output

Finally, the last line

print "$name likes beans.\n";

prints out

Grumblepuss likes beans.

or similar. Because we have used " " double quotes, the contents of variables like $name will be interpolated into what is output automagically. If you’d used ' ' single quotes on the print statement, you’d have got:

$name likes beans.\n

which isn’t what we want here. As we saw earlier, double quotes interpolate escape sequences like \n; now you have seen that they also interpolate the values of scalar variables too. This is extremely useful, and saves you the effort of writing something like:

print $name, " likes beans.\n";

This would also work, as print will print out a comma-separated list of values just as happily as a single item, but it’s rather more effort. Single quotes neither interpolate escapes (except \\ and \') nor interpolate variables.
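
The three behaviours can be put side by side in a short sketch. The join function is used here only to mimic what print does with a comma-separated list (it glues the items together with nothing in between by default); it is an addition for illustration, not something the text above relies on:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = "Grumblepuss";

my $interpolated = "$name likes beans.\n";                  # double quotes
my $from_list    = join '', $name, " likes beans.\n";       # list of values
my $single       = '$name likes beans.\n';                  # single quotes

print $interpolated;             # Grumblepuss likes beans.
print $name, " likes beans.\n";  # Grumblepuss likes beans.
print $single, "\n";             # $name likes beans.\n
```

Note the leading space in " likes beans.\n": print does not put anything between the items of the list, so you have to supply the space yourself.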

Versions of perl equal to or higher than 5.10.0 (I’m using 5.16.1 at the instant of writing this) have a new built-in called say, which automatically adds a newline to the end of the outputted data:

#!/usr/bin/perl
use 5.16.1;
say "We don't need no stinking newlines";
We don't need no stinking newlines

The say function is not available automatically for backwards-compatibility reasons, but if you include a use 5.10.0; line to require a version of perl greater than or equal to 5.10.0 then say will be made available.
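
Another way to switch say on, without demanding a whole version, is the feature pragma, which has been part of perl since 5.10. A minimal sketch (the bean list is just made-up example data):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';   # enables say without a "use 5.10.0;" line

say "We don't need no stinking newlines";
print "We don't need no stinking newlines\n";   # the print equivalent

# say is handy with a statement-modifier loop: one item per line
say for ( "adzuki", "haricot", "mung" );
```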

Next up…arrays and slices.

Educated guesswork

Someone recently asked me, “How many cells are there in the human body?”

I have quite a lot of correspondence of this kind, just to be clear.

I have to confess my first thought was “Not a bloody clue”. My second thought was “I bet the Internet knows”; but my third was “The Internet is full of bloody lies”.

My fourth thought went on for so long, it became part of the introduction to cell biology and genetics lecture I gave this year, and is reproduced below for your pleasure.

A good place to start is simply to divide the mass of the human body by the mass of a human cell:

$N_{\text{cells}} = M_{\text{human}} / M_{\text{cell}}$

So the first thing we’re going to need is a decent idea of the mass of a human, and an idea of the mass of an isolated human cell.

I am a very stumpy male human being, so my 60 kg of man-flesh will contain rather fewer cells than an ‘average’ human, but we have to start somewhere:

Mass of Steve [CC-BY-SA-3.0 Steve Cook]; apparently 60.8 kg

Given that I was wearing a very heavy pair of jeans, let’s call that a nice round:

$M_{\text{human}} = 60\ \text{kg}$

I would be doing several millennia of instrument makers and legislators a disservice if I didn’t point out just how bloody amazing it is that you can find out how massive a human body is, with a cheap instrument, on an arbitrary but internationally recognised scale, to three significant figures. It seems a shame we so easily overlook the everyday magic of early 21st century civilisation.

The mass of a cell is a rather trickier thing to determine, so we’re going to have to conjure up some convenient half-truths, or rather, simplifying assumptions. It’s a lot easier to measure the dimensions of a human cell than it is to measure its mass, as – thanks again to the instrument makers – we have microscopes available for this very purpose.

Conversion between mass and volume is a doddle as long as you know how dense the thing you’re measuring is. Fortunately, I went swimming last week, and know that the difference between me floating and sinking depends on whether I have taken a big gasp of air or not. So the human body must have very nearly the same density as water, which is a convenient:

$\rho_{\text{water}} = 1\ \text{kg L}^{-1}$

So my volume is about:

$V_{\text{human}} = M_{\text{human}} / \rho_{\text{water}} = 60\ \text{kg} / 1\ \text{kg L}^{-1} = 60\ \text{L}$

To work out the volume of a cell, we just need a picture and a scale. Below is just such a picture with a scale: something I scraped out of my cheek, under a microscope, with a “graticule” eye-piece ruler.

Human cheek cell at 1000 times magnification [CC-BY-SA-3.0 Steve Cook]

Each small division on the eye-piece graticule you can see across the image is 1 µm: I know this on account of having put an actual ruler (of sorts) under the microscope and worked out the conversion factor. So this cheek cell is about 40 µm across.

To convert this into a volume, we also need to know how thick this cell is, whereupon we hit trouble, as I have no idea at all. However, I do know that blood cells are round-ish (or at least Werther’s Original-ish), as they tumble about, but they turn out to be a much smaller 7 µm across:

Human red blood cell at 400 times magnification [CC-BY-SA-3.0 Steve Cook]

These are at ×400 rather than ×1000, so each graticule unit is 2.5 µm rather than 1 µm, so the cells are about 7 µm across.

Hmm. Given that looking at just two sorts of cell is already giving us trouble, it seems reasonable to make some more stuff up (sorry: make some more simplifying assumptions), and pretend that human cells are perfect cubes with sides 10 µm by 10 µm by 10 µm.

Such idealised cells have a total volume of 1000 µm$^3$, otherwise known as a picolitre:

  • 1 L = cube of side 100 mm
  • 1 mL = $10^{-3}$ L = cube of side 10 mm
  • 1 µL = $10^{-6}$ L = cube of side 1 mm
  • 1 nL = $10^{-9}$ L = cube of side 0.1 mm (100 µm)
  • 1 pL = $10^{-12}$ L = cube of side 0.01 mm (10 µm)
  • 1 fL = $10^{-15}$ L = cube of side 0.001 mm (1 µm)

So the volume of one of my cells is about:

$V_{\text{cell}} = 10^{-12}\ \text{L}$

If we pretend my body is made entirely of slightly oversized, cubical red blood cells, then it would contain:

$N_{\text{cells}} = V_{\text{human}} / V_{\text{cell}} = 60\ \text{L} / 10^{-12}\ \text{L} = 6 \times 10^{13}$ cells

i.e. about 60 trillion of them.
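
For anyone who would rather let a computer do the unit-wrangling, the calculation so far can be sketched in a few lines of Perl. The variable names are made up for illustration; the numbers are the ones used above (60 kg, 1 kg per litre, cubical cells of side 10 µm):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $mass_human_kg    = 60;   # one stumpy human
my $density_kg_per_l = 1;    # assumed to be the same as water
my $cell_side_um     = 10;   # idealised cubical cell

# Volume of the human: mass divided by density
my $volume_human_l = $mass_human_kg / $density_kg_per_l;    # 60 L

# 1 L is a cube of side 100 mm = 1e5 µm, so 1 L = 1e15 cubic µm
my $um3_per_l     = 1e15;
my $volume_cell_l = $cell_side_um ** 3 / $um3_per_l;        # 1e-12 L, i.e. 1 pL

my $n_cells = $volume_human_l / $volume_cell_l;

printf "Estimated number of cells: %.1e\n", $n_cells;       # 6.0e+13
```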

QED?

No. We have made a number of simplifying assumptions, and now is the time to worry about them, and their effects. Some are fairly trivial, others very foolhardy.

  • We have assumed that humans have the same density as water, when in fact they’re usually a bit denser, at least when they’re deflated.
  • We have assumed cells are cubes, when they (usually) demonstrably are not. A sphere is only about 52% of the volume of the cube that could contain it, although spheres can pack in a way that leaves only 26% space between them.
  • We have assumed that I am a representative human being, when I already confessed to being stumpy.

All these assumptions are iffy, but none of them would make a massive difference to our estimate of the total number of cells: at most, we’re talking halving or doubling the 60 trillion.
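
The sphere figures can be checked with a little geometry. This sketch assumes spherical cells 10 µm in diameter, packed as densely as identical spheres can be (a packing density of π / (3√2), a standard result that is not derived here), and compares the resulting count with the 60 trillion cubes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pi = 4 * atan2( 1, 1 );

# A sphere of diameter d has volume (pi/6) d^3; its bounding cube has volume d^3
my $sphere_fraction = $pi / 6;

# Densest possible packing of identical spheres fills pi / (3 * sqrt 2) of space
my $packing_density = $pi / ( 3 * sqrt(2) );
my $gap_fraction    = 1 - $packing_density;

# Swap each 10 µm cube for close-packed 10 µm spheres
my $n_cubes   = 6e13;
my $n_spheres = $n_cubes * $packing_density / $sphere_fraction;

printf "A sphere fills %.0f%% of its bounding cube\n", 100 * $sphere_fraction;   # 52%
printf "Close-packed spheres leave %.0f%% gaps\n",     100 * $gap_fraction;      # 26%
printf "Cells, if they were close-packed spheres: %.1e\n", $n_spheres;           # 8.5e+13
```

The two effects partly cancel, and the count only goes up by a factor of √2, comfortably inside the “halving or doubling” range.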

The more serious problems with our assumptions are two we didn’t really state explicitly. Firstly, human beings are not perfectly packed bags of cells; and indeed, much of what constitutes a human body is not cells at all. Secondly, the hand-waving about the size of an average cell hides a great mass of complications, only barely hinted at by the cheek cell versus red blood cell disparity.

Humans are made of lots of different kinds of stuff, almost 100% of which is things other than cubical red blood cells. The internet is indeed full of lies, but much of it can be cross-checked with common-sense. The figures below (culled from Wikipedia) seem to fit with what I’ve seen when converting non-human animals into pies. Obviously, the numbers vary greatly from person to person, but this whole calculation has been a tissue of lies (sorry: a rough-and-ready estimate), so let’s not get too hung up about them:

  • Bones = 35% = 20 L
  • Muscles = 40% = 24 L
  • Blood = 10% = 6 L
  • Faeces = 3% = 2 L

Bones do contain cells, but what they contain a lot more of is calcium phosphate and collagen, neither of which is cellular. The 35% of you that is bone doesn’t actually contribute significantly to the total number of cells in you. We should therefore knock off about 20 L from the original 60 L of my body we assumed was made of cells.

Gray's anatomy bone tissue [Out of copyright]

Only the ‘bone corpuscles’ (black blobs) in this image of a section of bone are actually cells; the rest of the bone is a non-cellular matrix of collagen and calcium phosphate

Skeletal muscles are made of very large cells, about 1 cm long (i.e. 10 000 µm by 10 µm by 10 µm). So, skeletal muscle cells are about 1000 times more voluminous than normal cells. With our original assumptions, the cells in 24 L of my body are being over-counted by a factor of 1000! Since this is such a massive difference, we can excusably pretend that muscles are non-cellular: despite being 40% of my volume, they contribute less than 1% of my cells! It’s also worth noting that skeletal muscle cells often contain more than one nucleus, so even calling them ‘cells’ should give you pause for thought.

Blood is about half plasma, which is non-cellular, so we can knock off another 3 L.

The human cells in faeces are mostly dead, so we can probably knock those off the total too. All told, 49 L out of my 60 L are non-cellular, or nearly so. We need to reduce the 60 trillion estimate by five sixths, giving a (hopefully more reliable) estimate of:

10 trillion human cells
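
The bookkeeping behind that figure, as a rough sketch in Perl. The variable names are invented for illustration, and the volumes are the rounded ones quoted above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $total_l   = 60;
my $bones_l   = 20;   # mostly collagen and calcium phosphate
my $muscles_l = 24;   # cells roughly 1000 times too big to matter
my $plasma_l  = 3;    # the non-cellular half of 6 L of blood
my $faeces_l  = 2;    # human cells mostly dead

my $non_cellular_l = $bones_l + $muscles_l + $plasma_l + $faeces_l;
my $cellular_l     = $total_l - $non_cellular_l;

# One idealised 1 pL cell per picolitre means 1e12 cells per litre
my $cells_per_litre = 1e12;
my $n_cells         = $cellular_l * $cells_per_litre;

print  "Non-cellular volume: $non_cellular_l L\n";   # 49 L
print  "Cellular volume: $cellular_l L\n";           # 11 L
printf "Human cells: %.1e\n", $n_cells;              # 1.1e+13
```

Eleven trillion, in other words, which is near enough the 10 trillion quoted, given how much rounding has already gone on.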

But.

And it’s a big butt (ba-dum-tish).

The faeces we dismissed as inhuman non-cellular muck are actually packed with cells, only they’re not human ones. Much of my faeces (and I presume yours too, but I’d rather not check) is made of bacteria; and bacterial cells are titchy:

Enterobacter W1 [CC-BY-SA-3.0 Steve Cook]; cylindrical cells about a micron long

The bacteria above are Enterobacter sp., which are found in the human gut, although this strain (‘W1’) is of interest to me because they break down a wood preservative. They’re about 2 µm long by 0.5 µm wide, so they have a volume of just 0.5 femtolitres, some 2000 times less voluminous than the human cells we modelled above.

$V_{\text{bacterium}} = 0.5 \times 10^{-15}\ \text{L}$

Assuming my faeces are made of perfectly packed bacteria, 2 L of it contains:

$N_{\text{bacteria}} = V_{\text{poo}} / V_{\text{bacterium}} = 2\ \text{L} / (0.5 \times 10^{-15}\ \text{L}) = 4 \times 10^{15}$ cells

4 quadrillion bacterial cells

By this estimate, there are about 400 times more bacterial cells in 2 L of my poo than in the other 58 L of ‘me’. Even if we’ve over-estimated the proportion of faeces that is bacteria by a factor of 100 (unlikely!), the human cells in my body are still outnumbered by bacterial cells.
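
The bacterial sums can be sketched the same way. This treats each bacterium as a little box 2 µm by 0.5 µm by 0.5 µm (an assumption that matches the 0.5 femtolitre figure above; a true cylinder would be a bit smaller), and divides the result by the 10 trillion human cells estimated earlier:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $length_um = 2;
my $width_um  = 0.5;

# Box approximation: 0.5 cubic µm, and 1 cubic µm is 1 fL = 1e-15 L
my $v_bacterium_um3 = $length_um * $width_um * $width_um;
my $v_bacterium_l   = $v_bacterium_um3 * 1e-15;

my $faeces_l   = 2;
my $n_bacteria = $faeces_l / $v_bacterium_l;

my $n_human = 1e13;    # the 10 trillion human cells from earlier
my $ratio   = $n_bacteria / $n_human;

printf "Bacterial cells: %.0e\n", $n_bacteria;             # 4e+15
printf "Bacterial cells per human cell: %.0f\n", $ratio;   # 400
```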

This leads us to two rather startling conclusions:

  1. To a good first approximation, none of the human body is made of cells.
  2. To a good second approximation, none of the cells in the human body are human.

This is why I love maths, and this is why how you answer a question is always far more interesting than what that answer is.

Not a hello world script

Rather than employing the usual tactic of showing you how to write the accursed ‘hello world’ program, so beloved of all computer programming introductions, let’s start with something entirely different:

print "This is most definitely not a hello world script.\n";

If you save this as the file thing.pl, or similar, then you can run it from the command line (i.e. DOS prompt, bash shell, terminal window) thusly:

perl thing.pl

And it will probably do what you expect.
If you didn’t misspell anything in the script, when executed it should do something along the lines of printing out:

This is most definitely not a hello world script.

If you’re on Unix or Mac OS X, it’s better if you write this instead:

#!/usr/bin/perl
print "This is most definitely not a hello world script.\n";

or

#!/bin/perl
print "This is most definitely not a hello world script.\n";

with the shebang line at the very top of your script, all on its own line. Then, once you have made the script executable (chmod +x thing.pl), you can just type:

thing.pl

at the command prompt (or ./thing.pl if the current directory isn’t in your PATH). The /usr/bin/perl (or whatever) after the shebang (#!) is the path to the perl interpreter, and it tells the Unix shell where to find it. Windows will also cope with a naked thing.pl providing you have included C:\Perl\bin in your PATH environment variable and added .PL to the PATHEXT environment variable. The ActiveState installer sorts this all out for you.

It’s generally considered polite to put a shebang at the top of scripts even if you’re not on Unix, in the name of portability: perl works just about anywhere, and it’s a nice idea to try and make your Perl scripts and modules work anywhere too. Note the subtle difference between perl (the executable program that interprets your program) and Perl (the language interpreted by the perl interpreter). This distinction, and even more so the fact that neither is written ‘PERL’ is a shibboleth: using ‘PERL’ will mark you out as a n00b, so Don’t Do That.

So far so easy. The only thing I hope you may be wondering about is the \n. This means ‘newline’, and will print whatever your operating system thinks is a newline at the end of the output. The \n character is an escape sequence because whatever comes after a \ has a special meaning to perl: \n is a newline, \t is a tab, and \\ is a literal \ character. We will come across many other sorts of escape sequence, especially when we get onto regexes later.
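
A quick sketch of those three escapes in action. The length function and the my variables are additions for illustration; the point is that each escape sequence, although typed as two characters, ends up as a single character in the string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text      = "beans:\tadzuki\n";   # contains one tab and one newline
my $backslash = "\\";                 # a single literal backslash

print $text;
print "a literal backslash: $backslash\n";

# 6 letters and colon + 1 tab + 6 letters + 1 newline = 14 characters
print "length of text: ", length($text), "\n";            # 14
print "length of backslash: ", length($backslash), "\n";  # 1
```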

Something very important to notice about the script that we have just written is the ; semicolon at the end of the line: Perl statements must (almost) always end in a semicolon, unlike those in some other languages beginning with p. Forgetting trailing semicolons is a frequent source of bugs.

Strings and quoting

One of Perl’s strengths is the multitude of ways you can quote and manipulate a string like "This is most definitely not a 'hello world' script". First, try this out:

#!/usr/bin/perl
print 'This is most definitely not a hello world script.\n';

note the ' single quote replacing the " double quote we used in the first script. If you run this, it will print out:

This is most definitely not a hello world script.\n

In Perl, double quotes will interpolate escape characters like \n, and \t, so an escape sequence like \n gets converted into a real, literal newline between double quotes. Single quotes will translate only the escape \' to a literal ' and the escape \\ to a literal \. Double quotes are therefore what you need to do clever, intuitive things; and single quotes are what you need to do simple, literal things. Since single quotes only interpolate \' and \\, this is the way to insert single quotes into a single-quoted string:

#!/usr/bin/perl
print 'This is most definitely not a \'hello world\' script.';

which prints out:

This is most definitely not a 'hello world' script.

with embedded single quotes around the ‘hello world’ bit. The reason we must escape the single quotes within our string is that if you don’t escape the two single quotes around the 'hello world' bit, perl will think you mean the string to end just before the hello, and will throw a tizzy:

#!/usr/bin/perl
print 'This is most definitely not a 'hello world' script.\n';

without suitable escaping of the quotes, looks to perl like:

#!/usr/bin/perl
print 'This is most definitely not a '
    # FIRST UNESCAPED SINGLE QUOTE ENDS THE STRING,
    # WHAT THE HELL IS THIS NEXT BIT SUPPOSED TO MEAN?
      script.\n';

The same applies to double quotes: if you want double quotes in a double-quoted string, you’ll need to escape them too \":

#!/usr/bin/perl
print "I said \"hello world\". Are you deaf?\n";
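
If all that backslashing makes your eyes water, Perl also has the q and qq operators, which behave like single and double quotes respectively but let you pick your own delimiters. They are not covered in the text above, so take this as a preview sketch rather than the recommended way of doing things:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Single-quote behaviour: escaped quotes versus q{}
my $escaped_single = 'This is most definitely not a \'hello world\' script.';
my $q_single       = q{This is most definitely not a 'hello world' script.};

# Double-quote behaviour: escaped quotes versus qq{}
my $escaped_double = "I said \"hello world\". Are you deaf?\n";
my $qq_double      = qq{I said "hello world". Are you deaf?\n};

print "$q_single\n";   # This is most definitely not a 'hello world' script.
print $qq_double;      # I said "hello world". Are you deaf?

print "same strings\n"
    if $escaped_single eq $q_single and $escaped_double eq $qq_double;
```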

In Perl, all strings (i.e. anything that isn’t a number) should be quoted somehow or other. If you don’t use quotes, perl will quite likely barf, probably saying something about ‘bare-words’. You must put quotes around strings, so:

print "Nubbins";

is fine, as is:

print 'Nubbins';

but the bare-word Nubbins in this:

print Nubbins;

will not be interpreted as a string. Perl has several other ways of quoting and formatting strings, such as heredocs and printf, which we’ll also cover later.

Next up…input and output.
