snapsvg

2011-11-24

Testing for the end of a for(each) loop

One sign of a good templating language is that you can test for the last iteration of a loop without having to count the iterations. PHP is no exception! Assuming you're quite happy with hacky syntax (you have to be if you're using PHP) then you can do this:

foreach ($array as $item) {
  if (each($array) === false) {
    # last iteration
  }
}

You iterate over the original array using each, but ignore the output if it's not boolean false. foreach derps around on the copy of the array, and when the original array finishes being iterated, each returns false! So you basically iterate over the array while you iterate over the array.



In Template::Toolkit you can also test for the last iteration of a loop with loop.last, which is why that is also a good templating language.

In conclusion, PHP is a templating language.

Oh right, leave a comment with your own one for whatever bozo templating language you like :)

2011-11-10

In Context


Natural language is contextual. A famous quotation demonstrates this in a way we all find familiar:

Time flies like an arrow; fruit flies like a banana.

Usually we interpret this as "the manner in which time flies (verb - to fly) is akin to an arrow" and "the creatures called fruit flies (noun pl. - fly) like (to find agreeable) a banana". However, we can also interpret the former as "Time (measure chronologically) flies (noun pl. - fly) the way an arrow would", except of course this is nonsense and we don't consider it.

Perl is the same. The three data structures covered in previous posts can be used in all sorts of ways, and Perl will attempt to do the right thing based on context. This is done by having rules about context, and what to do in those contexts.

Time Flies

When we say "time flies" in English we can either mean a) imperative "time", noun "flies" or b) noun "time" verb "flies". In Perl we can differentiate between nouns and verbs because nouns have sigils and verbs don't.

$scalar variables have $ and a singular name because they represent a single thing. @array variables use the @ sigil and have plural names because they refer to more than one thing. %hash variables refer to a set of named things; their names are often singular because either we refer to the set ("The option set %option") or a thing from the set ("The debug option $option{debug}").

Verbs, of course, are subroutines and so have no sigil. That means in Perl we can't misinterpret that:

time @flies;

Notwithstanding the fact that time is already a Perl function for finding the time, we can understand this because flies has to be a noun.

Flies Like A Banana

Context in Perl is not a grammatical thing in the natural-language sense explained above, but it is used to determine what we do with the nouns that we pass around.

A common example is checking whether an array has content:

if (@flies) {
    ...
}

In list context, an array represents the list of scalars it contains, as we saw demonstrated in this post, but in scalar context the array is treated as the number of items in it; i.e. its length.

The test in an if statement is in boolean context. Booleans are truey and falsey values: they are scalar. An empty array in scalar context is zero, which is false; hence an empty array is false.
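To see that rule in action, here is a minimal sketch. The array contents and variable names are made up for illustration; the only thing being shown is that an array in scalar context gives its length, and that the boolean test uses that length.

```perl
use strict;
use warnings;

my @flies = ();                          # hypothetical array, empty to begin with
my $count_empty   = @flies;              # scalar context: number of items
my $empty_is_true = @flies ? 1 : 0;      # the boolean test sees that count

push @flies, 'bluebottle', 'mayfly';
my $count_full   = @flies;
my $full_is_true = @flies ? 1 : 0;

print "empty: count=$count_empty true=$empty_is_true\n";
print "full:  count=$count_full true=$full_is_true\n";
```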

The while loop also uses boolean, and hence scalar, context. However, the for(each) loop uses list context:

for my $fly (@flies) {
    ...
}

This is sensible because if it used scalar context it would always loop over one thing, which is the number of things in the array. A scalar in list context doesn't really do anything special; it remains that scalar:

for my $fly ($one_fly) {
    ...
}

but it does help with concatenation of several things:

for my $fly ($bluebottle, $mayfly, @various_flies) {
    ...
}

Fruit Flies Like

Surprises crop up when dealing with context once operators and functions get involved. For example:

my $banana = map { $_->fruit_i_like } @flies;

You might think this will take the first fruit that a fly likes, or the last, or some data structure representing the result of the map. Well, the last guess is the closest: the value returned is in fact an integer representing the number of fruits the flies, in total, liked. The context in this statement comes from the = operator: it is determined by the thing on the left and it is enforced on the thing on the right. perldoc says that map in scalar context returns the length of the constructed list.

We can take the first fruit liked by a fly by putting the = operator in list context. In this case, that is done using parentheses:

my ($banana) = map { $_->fruit_i_like } @flies;

Parentheses only create list context on the left of =; as we learned in this post, parentheses on the right only serve to override the precedence of operators.

Of course, assigning to an aggregate type is also list context:

my @fruits = map { $_->fruit_i_like } @flies;
my %likes = map { $_->name => $_->fruit_i_like } @flies;

(In this latter example we assume fruit_i_like returns only one item; otherwise the constructed list will break the expected hash construction.)
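The four assignments can be put side by side in a small sketch. To keep it self-contained I am assuming @flies is just a list of fruit names and using uc in place of the fruit_i_like method; neither is from the examples above.

```perl
use strict;
use warnings;

my @flies = qw(banana apple pear);            # stand-in data, not real fly objects

my $count   = map { uc($_) } @flies;          # scalar assignment: length of the list
my ($first) = map { uc($_) } @flies;          # list assignment: first element
my @fruits  = map { uc($_) } @flies;          # list assignment: everything
my %likes   = map { $_ => uc($_) } @flies;    # list assignment: key/value pairs

print "count=$count first=$first fruits=@fruits\n";
```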

Another common confusion is when the operator or function itself enforces a context:

my $random_fly = $flies[rand @flies];

The rand function returns a random number; if it is provided with an integer it will return a random number between 0 and that integer, including 0 and excluding the integer. Normal behaviour for a function is to use an array as a parameter list, but rand overrides this by enforcing scalar context on its arguments.

That means that the array given to rand is evaluated in scalar context, giving its length. The output is used in [] to index the array; array indices are required to be integers so the decimal return value from rand is truncated. Since rand can never return the integer provided to it, the maximum return value from this is the last index of the array, and the minimum is zero.
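If you want to convince yourself that the index never falls off the end, a rough sketch like this will do. The fly names are invented, and int is used here only to mimic the truncation the subscript performs.

```perl
use strict;
use warnings;

my @flies = qw(bluebottle mayfly housefly);   # hypothetical list of flies

my %picked;
for (1 .. 1000) {
    my $index = int rand @flies;   # rand sees scalar(@flies), i.e. 3
    $picked{$index}++;
}

my $random_fly = $flies[rand @flies];         # the idiom itself
print "indices seen: @{[ sort keys %picked ]}\n";
```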

Enough With The Rearranging Of The Quotation You Cited

The context of any particular part of Perl can be subtle. I will describe a few common situations where context is key to the operation of the script.

if

As explained, if enforces scalar context.

if ($fly) { ... }
if (@flies) { ... }
if (%like) { ... }
if (grep { $_->likes($banana) } @flies) { ... }
if (my ($fly) = grep { $_->flies } @times) { ... }

All of these are in scalar context. What about the last one though? We know that () on the left of = creates list context; but if() is scalar context.

Context happens in stages, naturally, just as expressions are evaluated sequentially. The = operator is in list context as defined by its LHS. This allows $fly to contain the first result from the grep. The cunning thing is that the = operator itself returns a value: a list assignment evaluated in scalar context returns the number of elements on its RHS, and that count is used as the boolean value for the if. So the block runs whenever the grep found anything, even if the first thing it found is itself false.

Consider:

if (my @times_that_fly = grep { $_->flies } @times) { ... }

You can expect @times_that_fly to be set and non-empty in the if block, and also for it to contain the result of the grep.
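A list assignment in scalar context gives the number of elements on its right-hand side, and a small sketch shows why that matters. The data is invented: the grep deliberately finds values that are themselves false.

```perl
use strict;
use warnings;

my @values = (0, 7, 0);                              # made-up data

# Scalar assignment puts the inner list assignment in scalar context,
# so $count is the number of elements grep produced.
my $count = (my ($first) = grep { !$_ } @values);

my $entered = 0;
if (my ($zero) = grep { !$_ } @values) {
    $entered = 1;                                    # runs although $zero is 0
}

print "count=$count first=$first entered=$entered\n";
```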

<>

The operation of the <> "diamond" operator is context-sensitive. That means it has different behaviour in list and scalar contexts. In scalar context, it returns one line of input. In list context, it returns all lines of input.

my $line = <>;
my @lines = <>;
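The same distinction holds for any filehandle read with angle brackets. As a sketch, assuming an in-memory filehandle instead of real input so that it runs anywhere:

```perl
use strict;
use warnings;

my $input = "one\ntwo\nthree\n";                     # pretend input
open my $fh, '<', \$input or die "cannot open in-memory file: $!";

my $line = <$fh>;     # scalar context: one line
my @rest = <$fh>;     # list context: all remaining lines
close $fh;

print "first line: $line";
print "lines left: ", scalar(@rest), "\n";
```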

As we know, for is a list-context control structure; hence it will read all lines into memory before iterating them:

for my $line (<>) { ... }

The while control construct, like if, tests a boolean value, except it does it repeatedly and stops running when it is false. Hence the expression in the parentheses is evaluated in scalar context. So it is idiomatic to see while used instead of for when iterating over input:

while (<>) { ... }
while (my $line = <>) { ... }

But I hear you cry that the line of input "0" is false! But it should be treated fairly! Well for a start it will probably be "0\n", but apart from that, while(<>) is actually magic and comes out as:

while (defined($_ = <>))

and even if you set your own variable:

while (my $line = <>) { ... }            # what you write
while (defined(my $line = <>)) { ... }   # what Perl runs

but this is only true inside a while().
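Here is a rough demonstration of that magic. I am assuming an in-memory filehandle whose last line is a bare "0" with no newline, which is the one case where a plain truth test would stop early.

```perl
use strict;
use warnings;

my $input = "1\n0";                                  # last line is "0", no newline
open my $fh, '<', \$input or die "cannot open in-memory file: $!";

my @lines;
while (my $line = <$fh>) {                           # implicitly wrapped in defined()
    push @lines, $line;
}
close $fh;

print "read ", scalar(@lines), " lines\n";
```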

$array[rand @array]

We've covered this one but it's still for this list. The rand function enforces scalar context on its operand; hence, instead of the array providing a list of arguments to the function, it is itself the argument, but in scalar context, hence its length is returned. The array subscript [] then truncates the return value of rand to an integer.

=()=

Called the goatse operator by those with a sense of humour that compels them to do so, this is not really an operator at all but a trick of context. You see, not all operators return a count in scalar context.

For example, the match operator m/.../ or just /.../ merely returns true or false in scalar context. With the /g modifier it still only returns true or false, but each call carries on from where the last match ended, so you get the matches one at a time! In what universe, then, can a mere Perl hacker get m/../g to return the number of matches?

The universe with the goatse operator of course.

First we know that () on the LHS of = creates list context. That is all that is required for m/../g to return a full list of all matches. So the match is run in list context and that's what we want. The output of our goatse operator is of course going to be a scalar since we want an integer; this enforces scalar context on the leftmost =, which has the effect of putting the rightmost = (the list assignment to ()) in scalar context. A list assignment in scalar context returns the number of elements on its right-hand side, hence counting the matches.

my $count =()= /cats/g;
my $count = ( () = /cats/g );   # the same thing, spelt out
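A runnable version, as a sketch: the examples match against $_, so here I bind to a made-up string with =~ instead.

```perl
use strict;
use warnings;

my $text = 'cats chase cats that chase cats';        # invented test string

my $count =()= $text =~ /cats/g;                     # list assignment, counted
my $found = $text =~ /cats/ ? 1 : 0;                 # plain scalar match: just true/false

print "found=$found count=$count\n";
```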

@array[LIST], @hash{LIST}

A subtle use of context is the one in array and hash subscripts. The context inside { } or [ ] when accessing elements is determined by the sigil you used on the identifier in the first place. Thus:

$array[SCALAR]
@array[LIST]
$hash{SCALAR}
@hash{LIST}

This can have unexpected effects; a warning is generated if you use the @ form but only provide a single value, but in some cases the context can propagate, for example if you are calling a function or something.
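The propagation can be observed with wantarray. This is only a sketch: which_context is a hypothetical helper that records how it was called and then returns index 0.

```perl
use strict;
use warnings;

my @array = qw(a b c d);
my %hash  = (x => 1, y => 2, z => 3);

my @some = @array[0, 2];          # array slice: list of indices
my @vals = @hash{qw(x z)};        # hash slice: list of keys

my @contexts;
sub which_context {               # hypothetical helper, records its calling context
    push @contexts, wantarray ? 'list' : 'scalar';
    return 0;
}

my $element = $array[ which_context() ];   # $ sigil: scalar context inside []
my @slice   = @array[ which_context() ];   # @ sigil: list context inside []

print "contexts: @contexts\n";
```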

Context and Subroutines

A subroutine is called with the context of the place where you called it. This is often a useful tool, because a subroutine can also interrogate the calling context with the wantarray operator.

When a subroutine is called its return operator inherits the calling context. If it doesn't have a return operator, the usual rules apply about an implicit one.

The main reason this happens is that any basic function call - i.e. one with no special behaviour based on context - should act as though the return value were in the code in place of the function call. Consider:

rand @flies;

sub get_flies {
    # ... generate @flies
    return @flies;
}

rand get_flies();

It is reasonable to assume that, because the function returned an array, the behaviour of the latter rand will be the same as the former, and it is: the return of the subroutine is evaluated in scalar context because rand enforces scalar context.

Usually you want to save the return value of a function into some local variable, though; otherwise you have to run the function twice to use it twice, which is wasteful:

(get_flies())[rand get_flies()]

It is important to be aware of the context in which you are calling your function, especially if the function has different behaviour in different contexts. For example, here we have list context:

print(flies());

and here we have scalar context:

rand(flies());

and here we have a syntax error:

time(flies());

because time doesn't take any arguments. By default, all subroutines you define give list context to their arguments when called, so:

sub context_sensitive {
    return wantarray ? qw(TWO STRINGS) : 'STRING';
}

sub print_things {
    print $_, "\n" for @_;
}

print_things context_sensitive;

Here the context_sensitive sub is in list context, so it returns qw(TWO STRINGS), which is indeed two strings, as noted by the print_things function, which will print each thing with a new line after it so we can see what's going on. However, this sort of thing can lead to confusion when you don't realise the function is context sensitive:

my $string = context_sensitive;
print_things $string;

The above code exhibits different behaviour from the first, because context_sensitive was called in scalar context and the result of that was given to print_things. This can be surprising, because it means that context-sensitive subroutines are not always drop-in replacements for the variable they were saved to.

The scalar keyword can be used here to enforce scalar context on the subroutine.

print_things scalar context_sensitive;
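To compare the three calls without reading printed output, here is a sketch that counts arguments instead of printing them. count_args is a hypothetical stand-in for print_things, and context_sensitive is written so that list context gives the two strings.

```perl
use strict;
use warnings;

sub context_sensitive {
    return wantarray ? qw(TWO STRINGS) : 'STRING';
}

sub count_args { return scalar @_ }    # hypothetical stand-in for print_things

my $direct = count_args(context_sensitive());          # called in list context
my $string = context_sensitive();                      # called in scalar context
my $saved  = count_args($string);
my $forced = count_args(scalar context_sensitive());   # scalar forced in place

print "direct=$direct saved=$saved forced=$forced\n";
```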

The scalar keyword can be used in the general case when you want to either override or be explicit about the context of anything.

scalar @array;
scalar $scalar;
scalar sub_call();
scalar grep {...} @array;

There is no real list-context equivalent because in general list context is a superset of scalar context, but as with the goatse operator, there are exceptions. Hence the construct () = can be used to evaluate something in list context when normally it would be in scalar context. The reason this is not common is the return value of this assignment will still be evaluated in scalar context, which in most cases is the same as simply not doing this at all.


Void Context

Sometimes you might get a warning along the lines of "useless use of constant in void context". What does this mean?

The difference between a statement and an expression is action. An expression is anything that returns a value: 5+5, sub_call(), @array ...

A statement is one that performs an action: $foo = 5+5, sub_call(), if(@array).

Not all legal lines of code in Perl are statements:

my $foo = 5+5;   # statement
5+5;  # not a statement

It is this latter line of code that will cause the warning about void context. It should be at least slightly apparent why it is in void context: there is no operator or function call enforcing context on the expression. There is no context at this point! This is called void context; neither scalar nor list context is enforced on the simple line of code 5+5;

Void context is not always incorrect. For example, we know that the = operator returns a value. That means that whenever an assignment stands alone as a statement, the assignment as a whole is performed in void context and that value is thrown away. It all works because of the layered nature of expressions and sub-expressions. Consider:

$foo = @bar = baz()

baz() is performed in list context because of the assignment to @bar, but that assignment itself is performed in scalar context because of the assignment to $foo. But the assignment to $foo doesn't cause baz() to be run in scalar context. @bar = baz() happens first, and baz() is in list context. In scalar context a list assignment returns the number of elements on its right-hand side, which here is the same as the length of @bar. That means the next thing that happens is effectively $foo = @bar, which populates $foo with that count. This expression returns $foo, but there is no other expression in this statement. So this assignment happens in void context, which does nothing.
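You can check the layering with a sketch like the following, where baz is a made-up sub that records the context it was called in.

```perl
use strict;
use warnings;

my @contexts;
sub baz {                                   # hypothetical sub for illustration
    push @contexts, wantarray ? 'list' : 'scalar';
    return qw(x y z);
}

my ($foo, @bar);
$foo = @bar = baz();

print "baz saw: @contexts; bar=@bar; foo=$foo\n";
```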

This is also void context:

baz();

You don't get a warning for the simple reason that it is perfectly legitimate to have a function that doesn't return anything. However, you could generate your own warning if your function is useless in void context:

sub baz {
  warn "baz() called in void context" if not defined wantarray;
}

The wantarray operator returns undef in void context.
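All three answers from wantarray can be seen in one go. A minimal sketch, with report_context as an invented name:

```perl
use strict;
use warnings;

my @seen;
sub report_context {                        # hypothetical helper
    my $want = wantarray;
    push @seen, !defined $want ? 'void'
              : $want          ? 'list'
              :                  'scalar';
    return;
}

my @list   = report_context();   # list context
my $scalar = report_context();   # scalar context
report_context();                # void context

print "@seen\n";
```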

In Summary and/or Conclusion

The context in which you are working defines the way Perl's nouns are treated. You should always be aware of what context you are in at any one time, and how to recognise context.

An assignment is in the context of the left-hand side.

Tests (while, if, ?:) are scalar context, and for loops are list context.

Array and hash subscripts are in the context of the sigil you use.

Operators may enforce context, and some operators also enforce coercion, in order to make the operation possible. The scalar operator is solely intended to enforce scalar context.

The argument list to functions is in list context. Overriding this is possible with prototypes but not recommended in the majority of cases.

With nothing enforcing context, the expression is in void context. Void context will throw a warning if the expression is a constant.

Functions inherit the calling context for their return statement. The nullary operator wantarray is used to inspect that context.

2011-10-02

What do you people do all day?

Change list for last Perl release:

http://search.cpan.org/~flora/perl-5.14.2/pod/perl5140delta.pod

Change list for last PHP release:

http://docs.php.net/manual/en/migration53.new-features.php

What

2011-10-01

Understanding and Using map and grep

Edits: There is no strict reason why map should not return a shorter list, so I'm not discouraging it any more.

map and grep are useful tools in Perl. To understand them needs only the ability to follow some logic, but to use them requires an understanding of lists. You should read this post on lists to understand how lists work in Perl. The main thing you should understand from that is how lists exist in memory, and how they are used with relation to arguments to functions.

map and grep are actually operators, but they behave so much like functions that you'll find them in perlfunc. What they do is they take a list of arguments, like functions do, and convert that list into another list. map transforms the input list, and grep filters the input list.

Remember that lists with zero or one item in them are still lists.

Understanding map

To understand map you need to be able to understand the following:

y = f(x)

If you don't, let's briefly explain it. It means that if you do something (f) to a value (x) you will get another value (y). You've probably seen this before in graphs - the straight line formula is y = mx + c; we can see that all of mx + c is a function using x (m and c are constants), meaning that if you feed an x coordinate in, you will get a y coordinate back. We say that y is a function of x.

In the case of map, x will always be a scalar, because of how lists in Perl are lists of scalars. y could be more than one value, this being how your output list can be longer than your input list. f is provided to map.

map is therefore a transformation. Its purpose is to consistently transform your input list into an output list, by providing each item of your input list to your f, and collecting the ys that you get out of it.

map turns this:

map { f } x, x1, x2, x3, x4...

into this:

f(x), f(x1), f(x2), f(x3), f(x4)...

A simple example is mathematical: even numbers. The nth even number is 2n—you learned this in school. If you didn't go to school then you'd better catch up.

That means, for y to be the xth even number, our f is 2x.

f(x) = 2x
f(1) = 2×1 = 2
f(2) = 2×2 = 4
...

That means that, given the numbers 1 to 10 we can find the first 10 even numbers that are greater than 0. If we provide our f to map, and the list 1 .. 10, we should get that.

Defining f(x) for map is a simple case of using the special variable $_ in place of x. For now, we will define our f using {}.

print join ', ', map { $_ * 2 } 1 .. 10;
# 2, 4, 6, 8, 10, 12, 14, 16, 18, 20

This is a good example of what we were learning about lists in the previous post. The join operator joins a list of values using the specified string, in this case a comma and a space. There is no requirement in Perl that this list be either an array or a hard-coded, comma-separated list: it can be anything that, in list context, returns a list. That could be a function, or another operator like map or grep.

This example serves to explain in simple terms how some functionality can be routinely applied to a set of values to return a new set of values. There is no requirement that y is different from x after the execution of f(x). For example, you might wish to search a dictionary for a set of words, but the words themselves might be suffering from grammar, and hence have different letter cases. The dictionary itself would be entirely in lowercase, and hence you would want a lowercase edition of your set of words.

my @lc_words = map { lc } @words;

In this case, the f() that we're applying to all members of the array is the operator lc, which operates on $_ by default and returns a lowercase version of its operand. If the input word happened to be already lowercase, the output will equal the input. No matter.

List Return Values

If you recall, lists are lists, in Perl. Arrays and hashes are things constructed from lists, and both of these are treated as lists when used where a list is expected. You can see this principle in action in these examples: our first example used the range operator .. to construct the list of integers between 1 and 10, and fed that to map. The second example used the array @words, which we assumed to exist for the example, as the input list.

And then the output list was used in the first example as the argument list to join, and in the second example it was used as the constructor for a new array.

So far we have been returning only one value, but the f we provide to map can return any number of values because it is evaluated in list context. By this mechanism you can turn an array into a hash, which is probably the most common use of it.

my @words = qw(cat dog cow dog monkey);
my %words = map { $_ => 1 } @words;

The map block here returns the input word, $_, and 1. Remember that in this post we learned that => is semantically equivalent to , except with the quoting of the word to its left. Remember also that in the same post we learned that a hash is constructed from any even-sized list (or an odd-sized list, with a warning). So this block is returning two items for every one we put in, doubling the length of the list.

The use of the fat comma in a construct such as this is idiomatic: since it has no actual effect on the output of the map block it is really only there to hearken to the usual construction of a hash, which is a list that uses => a lot. The image conjured is of taking each element of @words and turning it into word => 1, joining it all together until we have an even-sized list thus:

my %words = (
    cat => 1,
    dog => 1,
    cow => 1,
    dog => 1,
    monkey => 1
);

Knowing the intermediate steps to get to that may help:

my %words = map { $_ => 1 } qw(cat dog cow dog monkey);
my %words = map { $_, 1 } qw(cat dog cow dog monkey);
my %words = ('cat', 1, 'dog', 1, 'cow', 1, 'dog', 1, 'monkey', 1);

Of course, in a true application, you would not set @words in this manner in the first place. @words will be computed, perhaps from a file passed in by name on the command line, which you have processed and turned into this array.

A little knowledge about hash construction is relevant here to know the actual purpose of this map block. If you understand that a repeated key in the constructing list is simply overwritten by the latest occurrence of that key, you will see that this has the effect of making the input list unique. Although the output list is exactly double the length of the input list, that list is immediately used to construct a hash. The hash itself then undertakes its usual behaviour, and in this example the repetition of "dog" in the original array is overwritten by its last appearance by the construction of that hash - which is fine in this instance because all the keys are associated with the value 1. Thus any repeated word is not counted.
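As a sketch of the uniqueness effect, using the same word list: the keys are sorted here only because hash order is not predictable.

```perl
use strict;
use warnings;

my @words = qw(cat dog cow dog monkey);
my %words = map { $_ => 1 } @words;         # ten-item list in, four keys out

my @unique = sort keys %words;              # sorted so the order is repeatable
print scalar(@words), " words, ", scalar(@unique), " unique: @unique\n";
```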

You can skip over part of the list as well. Perl makes no assumptions about the size of your returned list when running map. You can use this to omit elements from your output list while doubling the rest.

To do that, simply test $_ and return an empty list if you don't want the element:

map { test($_) ? ($_ => 1) : () } @things

You might return the empty list if the test returns true, for example if you are trying to transform elements you have not already transformed and remove those you have:

my %to_cache = map { in_cache($_) ? () : ($_ => transform($_)) } @stuff;

For some test in_cache, which returns true if its argument is in the cache, and some function transform, which returns a transformed version of its argument, you can thus construct a hash of transformed stuff ready for caching. Of course, you don't have to double the list; you can use this simply to filter it. But if you're not going to transform $_ then it is more sensible to use grep.
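To make that concrete, here is a sketch with stand-in definitions. The %cache hash and the bodies of in_cache and transform are assumptions of mine; only their names come from the example.

```perl
use strict;
use warnings;

my %cache = (apple => 'APPLE');                      # assumed existing cache
my @stuff = qw(apple banana cherry);

sub in_cache  { return exists $cache{ $_[0] } }      # assumed implementation
sub transform { return uc $_[0] }                    # assumed implementation

# Parentheses on the calls keep the ?: from being swallowed as an argument.
my %to_cache = map { in_cache($_) ? () : ($_ => transform($_)) } @stuff;

print join(', ', map { "$_ => $to_cache{$_}" } sort keys %to_cache), "\n";
```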

When Not To Use map

Now that you have a new toy you might be tempted to use it. It is only natural. Well, as with every tool, it is unwise to use it outside its purpose. There are four principles that you should follow when playing with map:

The Principle of Transformation. You can use map to test each item and conditionally return the empty list, thus shrinking the size of the list. But unless your map block transforms $_ in some other way (by returning an altered version, or a longer list), what you have actually got is grep. In some cases it may be easier to read if you use both map and grep, using the output of grep as the input to map.

The Principle of Immutability. You should never transform the input list. That seems contradictory, but by this I mean never alter $_ directly; you should always return an altered copy. It is not, in all cases, possible to change $_. Only when the input list is an array is it possible to overwrite $_ for all values you get from the list. When it is a list constructed by some other means, even a hash, the input values are often immutable, which means you can't change them so don't try. Always return a copy of $_, with transformations applied to the copy.

The Principle of Brevity. If your transformation is quite long-winded you should either a) put it in a sub or b) use a for loop and push onto another array. map is a construct of brevity; this should be honoured.

The Principle of Containment. If you need to do anything else while iterating over your list other than transforming x into y then you should use a for loop for that. Your map block should have no side effects, which means when the map has finished executing, everything should be the same as it was when it started, except now there is a new list in memory.

The general principle is that you are getting back from map a new list, and leaving the old one alone. Don't run map without using the result!
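The Principle of Immutability is easiest to appreciate by breaking it. In this sketch (all names invented) the second map assigns to $_, and because $_ is an alias to the array element, the input array changes underneath you.

```perl
use strict;
use warnings;

my @input   = (1, 2, 3);
my @doubled = map { $_ * 2 } @input;        # returns copies; @input untouched

my @victim  = (1, 2, 3);
my @oops    = map { $_ *= 2 } @victim;      # modifies @victim in place as well

print "input=@input doubled=@doubled\n";
print "victim=@victim oops=@oops\n";
```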

Understanding grep

grep is an operator that has many similarities to map. It takes an input list, and returns another list. You provide some function f, akin to that described for map earlier, and it is run with each successive item from the list.

The difference is simply that grep returns a list of equal size or shorter than the input list.

The f that you give to grep is a filter, not a transformation. It tests each $_ and returns trueness or falseness:

f(x) ∈ {0, 1}

We are not looking in this case to return a new value, y, from our function, but rather any value that is true or not true. Reasonably one could point out that if you are outputting a truth value you are running a transformation. You are: but the result of grep itself is part of the input list. The result of map is the output list.

grep can be written in terms of map:

@filtered = grep { test } @things;
@filtered = map { test ? $_ : () } @things;

A common adage is "spelt grep, pronounced filter". That's a mnemonic, of course; it's pronounced grep.
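The equivalence is easy to check. A sketch, assuming the test is "is this number even":

```perl
use strict;
use warnings;

my @things = (1 .. 10);

my @via_grep = grep { $_ % 2 == 0 } @things;
my @via_map  = map  { $_ % 2 == 0 ? $_ : () } @things;

print "grep: @via_grep\n";
print "map:  @via_map\n";
```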

A traditional use of grep is to find the defined items in an array:

grep { defined } @array

It is not uncommon to have produced an array or list with undefined elements, especially if that list came from a map block that could return undef in some cases. Another common idiom is to map and grep at the same time:

my @phone_numbers = map { $_->{phone} } grep { $_->{phone} } @people;

Here @people is assumed to be an array of hashrefs representing people. The grep keeps only those for whom the key 'phone' returns a true value, and the map then returns the actual value.

We can thus draw this distinction between the two: map collects output values, and grep collects input values. You can see this is the case: the map block is assuming that the input list is a list of hashrefs, and extracts strings from them. We collect the list of things that map outputs. For that to work, it means that the output of grep has to be hashrefs; and since the input to grep is hashrefs but the grep block only returns a value from that hashref, we see that the output of grep is just a subset of the input.
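Splitting the two steps apart shows the distinction. The people and phone numbers below are made up; note that grep hands back the very same hashrefs it was given.

```perl
use strict;
use warnings;

my @people = (                                       # invented sample data
    { name => 'Ann', phone => '555-0100' },
    { name => 'Bob' },
    { name => 'Cy',  phone => '555-0199' },
);

my @with_phone    = grep { $_->{phone} } @people;     # subset of the input
my @phone_numbers = map  { $_->{phone} } @with_phone; # new values

print "kept ", scalar(@with_phone), " people: @phone_numbers\n";
```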

When Not To Use grep

grep doesn't have the principle of transformation because you're not supposed to transform the list.

The Principle of Immutability - grep should not alter the input values; it should merely test them.

The Principle of Containment - nothing in the grep block should affect anything outside of the expression. Everything should be the same after the grep has finished running as it was beforehand, except now there is a new list in memory of (copies of) some or all of the input list.

The Principle of Brevity - if you need to do a long-winded process to find out whether or not you want to keep a particular $_, take it somewhere else, because it doesn't belong here. Make a sub and put it in that.

Similar Things

There is a simple concept here: You take a list, you apply a function to each element, and you get another list. map is the simplest example of this because the list you create is exactly the list that map returns. grep is arguably more complex because the list you create with this process (a list of true and false values) is itself used to alter the input list, so you get back a different list from the one you build.

sort

sort is another list operator. Like grep, the list it returns comprises elements of the input list. Unlike grep, the list it returns is the same length as the input list. Clearly, it sorts the input list, and returns a new one with the items in order.

An important thing to have already realised by now is that if you are sorting you have to compare two values at once, instead of one. Perl sorts, haha, this out for you by providing you $a and $b instead of $_. Your task is to tell Perl whether $a is less than, equal to, or greater than $b, by returning -1, 0 or 1 respectively from the block.

The operators <=> for numbers and cmp for strings do this easily for you, but this generalisation of the sorting process allows for you to run $a and $b through any algorithm in order to sort the list.

my @sorted = sort { length $a <=> length $b } @strings;
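For comparison, here is a sketch that spells the comparison out by hand alongside the operators. The data is invented, and the strings are chosen to have distinct lengths so the result does not depend on how ties are ordered.

```perl
use strict;
use warnings;

my @numbers = (10, 9, 100, 1);
my @strings = qw(pear fig banana kiwis);

my @as_strings = sort @numbers;                                  # default: string order
my @by_hand    = sort { $a < $b ? -1 : $a > $b ? 1 : 0 } @numbers;
my @numeric    = sort { $a <=> $b } @numbers;                    # same thing, shorter
my @by_length  = sort { length $a <=> length $b } @strings;

print "strings: @as_strings\nnumeric: @numeric\nlength:  @by_length\n";
```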


for

The for loop is much more than it is in other languages, especially when used as a statement modifier. That's when you put it at the end instead of putting it first and using braces. for is the operator you should use when you want to modify the array itself, rather than create a modified copy. Note you can only fully modify arrays and only the values of hashes.

$_++ for @array; # increment every value
$_++ for values %hash; # same
$_ .= "-old" for @filename_backups; # append "-old" to each filename

map is equivalent to doing a for loop over the old array, and pushing things onto a new array:

my @output;
push @output, $_ * 2 for @input;

Presumably map is more efficient, but part of Perl is that you have the option of being more expressive with your code, which is to say that you should write what you mean. If you mean to perform a map operation, use map; otherwise when someone else reads your code (i.e. you, next month) there will be no head-scratching wondering why you used a for loop.
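The three approaches side by side, as a sketch with made-up data: map builds a new list, the for-and-push version does the same by hand, and the statement-modifier for changes an array in place.

```perl
use strict;
use warnings;

my @input = (1, 2, 3);

my @via_map = map { $_ * 2 } @input;        # new list, input untouched

my @via_for;
push @via_for, $_ * 2 for @input;           # same result, more words

my @in_place = @input;
$_ *= 2 for @in_place;                      # modifies the array itself

print "map=@via_map for=@via_for in_place=@in_place input=@input\n";
```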

Other functions in List::Util, List::MoreUtils and List::UtilsBy serve to run a function across all items of a list in the same general way, and return either a list or a single value.

List::Util
  • first - Find the first item in a list; usually used to find any item that matches the criterion. Often therefore used to find whether any item matches the criterion.
    my $caps_key = first { uc $_ eq $_ } keys %hash;
  • reduce - Collapse a list into one value, by repeatedly applying the function. This is called reduction, as in "map-reduce". Note the use of $a and $b, like sort: we are trying to reduce a list into one value so we have to do two values at once, rather than one.
    my $sum = reduce { $a + $b } @numbers;

List::MoreUtils

This module has many more functions than List::Util so I won't list them all.
  • any, all, notall, none - Test the whole list. any is similar to first from List::Util; the difference being that it will return a true value if your search succeeds, whereas first could return a false value if you're looking for false values. These are essentially grep, except they will stop looking if they find the answer - grep will process the whole list in all cases.
    if (any { defined } @things) {}  # if any is defined
    if (all { defined } @things) {} # if all are defined
    if (notall { defined } @things) {} # if not all are defined
    if (none { defined } @things) {} # if no thing is defined
  • pairwise - This only works on arrays, but it takes one value from the first array and one value from the second, and gives them both to your function. Then it collects the outputs. It's like map, except it does two things at a time.
    my %hash = pairwise { $a => $b } @keys, @values;
    my @totals = pairwise { $a * $b } @prices, @quantities;
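The difference between first and any is easiest to see when the thing you are looking for is itself false. This sketch uses the any exported by List::Util (an assumption: that needs List::Util 1.33 or later) so that it runs on a stock Perl; the List::MoreUtils one should behave the same way here.

```perl
use strict;
use warnings;
use List::Util qw(first any);   # any needs List::Util 1.33 or later

my @things = (3, 0, 7);

# first returns the matching item itself, and here that item is 0 ...
my $found = first { $_ == 0 } @things;
# ... so testing it as a boolean wrongly says "not found".
my $first_says = $found ? 'found' : 'not found';

# any returns a plain true/false answer instead.
my $any_says = (any { $_ == 0 } @things) ? 'found' : 'not found';

print "first: $first_says\n";   # first: not found
print "any:   $any_says\n";     # any:   found
```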

List::UtilsBy

This is a convenience module for all those cases where normally you'd do the same thing to both $a and $b. I will stick to a couple of examples.
  • sort_by - Applies the procedure to both $a and $b and compares the results as strings (nsort_by for numbers). Saves typing.
    my @sorted = nsort_by { length } @strings;
    my @sorted = nsort_by { $_->mother->mother->mother->age } @people;
  • uniq_by - Makes the list unique based on the function.
    my @unique_by_colour = uniq_by { $_->colour } @fruit;
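For a feel of what these helpers save you, here is roughly what uniq_by does, written with plain grep and a hash of values already seen. This is an illustrative sketch rather than the module's actual implementation, and the fruit data (hash references instead of objects) is made up.

```perl
use strict;
use warnings;

# Hypothetical data: each fruit is a reference to a small hash.
my @fruit = (
    { name => 'apple',  colour => 'red'    },
    { name => 'banana', colour => 'yellow' },
    { name => 'cherry', colour => 'red'    },
    { name => 'lemon',  colour => 'yellow' },
);

# Keep an item only the first time its colour is seen.
my %seen;
my @unique_by_colour = grep { !$seen{ $_->{colour} }++ } @fruit;

my @names = map { $_->{name} } @unique_by_colour;
print "@names\n";   # apple banana
```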

In Conclusion

I hope this has given you, the newcomer to Perl, a good idea of what map and grep do. I also hope it has given you an insight into the general concept of applying a function to a sequence of values, and doing something with the result. It is an operation that is much more common than you'd think, even in real life.

It is the basis of mail merge, for example, where you take a template and a list of names and you put each name in the template and, with the resulting list of letters, print and send them. That's map.

It is the basis of searching, where you have a quantity of items and you're looking for all of those that match a certain criterion. You apply your criterion to each and keep those that match. That's grep.

It is the basis of sorting things by height, or alphabetically, or by the number of times they've won the World Cup: you take the list of things, and two at a time you find out how many times they've won the World Cup and you sort based on that. That's sort (or sort_by or nsort_by).

Examining or modifying a list is an awfully common operation. One thing I cannot teach you, however, is how to recognise when you should use one. Feel free to ask!

1 All right smart arse. Everything in list context is a list, even if it is only one scalar, because that's just a list with one item in it. The point is that this statement is true regardless of the length of the returned list.

2011-09-06

Einstein's Constraint: Booleans

Everything should be kept as simple as possible, but no simpler.
Albert Einstein

Perl is a language that combines ideas from many other languages. It is a language designed by a linguist, and hence it uses principles from natural language. The design of Perl is therefore a combination of two things: convention set out by its muses and Larry Wall's desires.

This meant that Perl was free to pick and choose from different conventions or invent new techniques that solved problems inherent in others'. Application of Einstein's Constraint is clear in Perl 5 (as well as a few instances of failure to apply it!), and today we will look at booleans.

True and False

The most obvious starting point when deciding how to implement booleans is to decide what will be true and what will be false. Probably the most common falseness and trueness are zero and not-zero, respectively, a convention popularised—if not invented—by C. In some dialects of Lisp, an empty list is false, and in many languages, True and False are their own types.

JavaScript, PHP, Python: many languages have explicit types for true and false—global, singleton values that always represent trueness and falseness. This allows truth to be explicitly defined. But many languages these days, including PHP, JavaScript, Python and Perl, all employ a concept called coercion to switch between different types implicitly, i.e. swapping without you having to ask for it. When this is available, we also have to consider what other values have truth or falsehood.

Is it simpler to a) have True and False as separate values and coerce into them, or b) use existing values, and a rule?

Let's ask Einstein.

To answer, we have to consider usage. Perl draws a lot on the do-what-I-mean, or DWIM, philosophy of programming, a philosophy which naturally leads to Perl automatically and transparently switching between data types where possible (and in fact it is always possible, thanks to various operators). Strings and numbers are interchangeable; objects can be converted to strings; arrays to scalars or lists. Anything can be used anywhere and a rule is applied, consistently, so that the programmer knows what to expect perl1 to do.

Observing this principle it seems like we're already tending towards b. Perl is, after all, already designed with rules in mind: a consistent (and concise) set of ground rules is the easiest way to understand what to expect in a given situation.

But if we follow the thought further we realise that once there is type coercion, the difference is moot. If we introduce a separate value for true and a separate one for false, this means we have to create a whole new set of rules for how to coerce other values into these two whenever there is a boolean test. If we already have to make that decision, it then follows that it doesn't matter what we coerce into: what matters is what values are false and what values are true.

Einstein's Constraint, then, says that since we have to implement b) anyway, and know the rules, it is tautologous to implement a) as well. Indeed, there may or may not be boolean data types internally to the Perl interpreter, but this makes no difference to Perl as a language. So, we can decide that we will simply choose values, rather than types, to be true or false. This is sensible, because Perl doesn't have types in that respect. True, false, 0 and 1 are all scalar values.

Deciding which values should be true and which should be false is the next logical step. Convention from C tells us that zero should be false; but Perl has list types, and Common Lisp suggests an empty list should be false. This nicely follows existing rules: since boolean values are scalar, and an array in scalar context is its length, then a zero-length array is naturally false because it is treated as zero. A list—however it is constructed—in scalar context returns its last item; an empty list will return no item. Nothingness, then, is falsehood.

Type coercion says that the string '0' is equal to 0, and hence false. We might consider that the strings '00' and '0E0' (0×10^0) also numify as 0. But note that 0 will not stringify as either of these. Only the string '0' is fully equivalent to 0, because the conversion between these two will never produce a different value.

Furthermore, we are coercing to booleans, not integers: that is to say, we want to know whether it is true or not, and we are not using any specific type to represent truthiness. You may spot inconsistency. In this respect, you might say that Perl does have a boolean type, and you'd be right in a sense. However, there is no true and false; there is merely a state of trueness that a value can have. The value only has a value of truth when it is used as a truth value; in other languages, true and false always represent truth values. Truth is contextual. How recondite.

Perl therefore defines '0' to be false for consistency with 0, but any other string to be true, including '00' and the common zero-but-true value '0E0'. The exception is the empty string '', which falls into the "nothingness" category, since there's nothing in it, and is also false.

Finally, the undefined value is conceptually equivalent to an empty list: it is a scalar with no value. That, too, is a false value.
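The rules above are short enough to check by hand. The sketch below runs a handful of values through a boolean test; the labels are only there for the printout, and the choice of test values is mine.

```perl
use strict;
use warnings;

# Label/value pairs; a plain list, consumed two items at a time.
my @tests = (
    '0'       => 0,
    "'0'"     => '0',
    "''"      => '',
    'undef'   => undef,
    "'00'"    => '00',
    "'0E0'"   => '0E0',
    "'false'" => 'false',
);

my %truth;
while (my ($label, $value) = splice @tests, 0, 2) {
    $truth{$label} = $value ? 'true' : 'false';
    printf "%-8s is %s\n", $label, $truth{$label};
}
# Only 0, '0', '' and undef come out false.

# An empty array in boolean context is its length, zero, hence false.
my @empty;
my $empty_is = @empty ? 'true' : 'false';
print "empty array is $empty_is\n";   # empty array is false
```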

Removing the tautology has made this as simple as possible. There is no remaining tautology here.

Comparison Operators

Einstein's Constraint removed the need for extra values from Perl to represent truth and falsehood. The same mantra in fact extends Perl's collection of operators for comparison.

To explain why, we should take a look at languages that don't. JavaScript and PHP both use == to compare all things: strings, integers and objects. They both also have === to compare without coercion.

As mentioned, coercion is the practice of treating a variable as another type by silently converting from its current type to another. Using == to compare two types in JavaScript or PHP will coerce both, one or neither operand to a different type, based on rules documented somewhere.

Using == to compare two types in Perl will coerce both operands to numbers, and compare the results. Using eq will coerce both operands to strings, and compare the results.

Why?

This becomes easy to explain when we consider the types of comparison available to us. The two obvious ones are numerical and string comparison. Two numbers are equal if they represent the same platonic value—020, 16 and 0x10 are all equal. Two strings are equal when they contain the same characters in the same order.

Then you might suggest that two arrays are equal if they are the same length, which works for Perl. Or that they are equivalent: the same keys and the same values, which works for PHP. Or that they are the same actual array, which works for PHP and JavaScript.

What about two objects? Perl's objects are necessarily references, so referential equality seems reasonable—but Perl also has operator overload, so the decision could be given to the objects themselves. PHP has true objects, so you might suggest that an operator overload would be good, but PHP doesn't think you can be trusted with that so it compares them by comparing their attributes instead. JavaScript also uses referential equality, but doesn't allow for operator overloading either.

For PHP and JavaScript, == is actually an equivalence operator, and hence numerous rules are needed to determine what is equivalent to what. === is also an equivalence operator: it is still not an equality operator. It just happens that the equivalence has fewer rules, and in many cases equality is the only satisfactory state.

Also, we've been focusing on equality. What about the other operators, < and >? Strings can be compared lexically: there are rules for what is "less than" and what is "greater than". Objects, well, who knows? PHP's manual commits the fatal flaw of calling == the "comparison operator", whereas it is in fact a comparison operator called the equality (or equivalence!) operator—a mistake which allows the manual to conveniently omit the rules for the other ones. JavaScript takes a better approach and simply decides that objects are not comparable and returns false when you try a magnitude test.

But what do you do when the strings could be integers? Do you compare lexically, or numerically? Neither is incorrect. You would be upset if you were trying to sort a list by lexical analysis only to find that your language was assuming they were numbers and treating them numerically, and likewise the reverse. Is the string "10" less than or greater than the string "011"? If we weren't type-coercing we would know instantly: it's a string, so it's greater. But we are, so we don't.

Here is a generalised table over all of ==, <, <=, > and >= operators, showing you what coercion you can expect from the languages we've mentioned, on various operands. The result column is the result of the operator <, for reference. I chose that one because it performs the most erratically. In the example, [] are used to refer to real arrays in Perl, not arrayrefs.

Operands                 Treated as                                    Result
L          R             PHP                      JS         Perl      PHP  JS  Perl
0          1             Numbers                  Numbers    Numbers   1    1   1
"0"        "1"           Numbers                  Numbers    Numbers   1    1   1
"a"        "b"           Strings                  Strings    Numbers   1    1   0*
"10"       11            Numbers                  Numbers    Numbers   1    1   1
"10"       "011"         ???                      Strings    Numbers   0    0   1
"10b"      "11a"         Strings                  Numbers    Numbers   0    1   1
[1, 2, 3]  [1, 2, 3, 4]  ???                      Objects**  Numbers   1    0   1
[1, 2, 3]  4             Array is always greater  Objects**  Numbers   0    0   1
false      true          Booleans                 Numbers    Numbers2  1    1   -

* Non-numeric strings numify as zero, and a warning is issued that you numified a non-numeric string.
** < and > are defined always to return false; otherwise, true is returned if they refer to the same thing.

This is Einstein's Constraint again. Perl has made it as simple as possible, but no simpler. In PHP's case it is not defined generally over the five operators how the arguments will be treated. In JavaScript's case, each pair of operands is consistent across the five operators, but the language is inconsistent as the operands change. There are rules, but why should you have to remember them? In Perl's case they are always treated as numbers. It cannot be simpler without being more complex elsewhere.

Perl sidesteps the whole issue simply by stating that if you use any of <, <=, ==, >=, > or the special Perl-only <=>3 then they are treated as numbers; and if you use any of lt, le, eq, ge, gt or cmp, then they are treated as strings. The mnemonic is simple: the mathematical operators are used on numbers, and the letter operators are used on letters.
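Here is the "10" versus "011" question from earlier, asked both ways. It is only a sketch, and the variable names are mine, but it shows that the operator, not the operand, picks the kind of comparison.

```perl
use strict;
use warnings;

my ($l, $r) = ('10', '011');

my $numeric_less = $l <  $r;   # 10 < 11 as numbers: true
my $string_less  = $l lt $r;   # '1' sorts after '0' as strings: false

my $numeric_equal = '1.0' == '1';   # the same number: true
my $string_equal  = '1.0' eq '1';   # different characters: false

# The same choice applies when sorting.
my @by_number = sort { $a <=> $b } (10, 9, 100);
my @by_string = sort { $a cmp $b } (10, 9, 100);

print "@by_number\n";   # 9 10 100
print "@by_string\n";   # 10 100 9
```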

Triple Equals

A hue and a cry! What audacity to not mention that PHP and JavaScript have the triple-equals operator, ===, that enforces type checking as well. With this magical operator, we solve the problem a different way. We can, in all cases, avoid the problem of type coercion by simply demanding that it not take place.

All cases? No. Since both languages have false as well as an undefined (null) value and zero, how do you test a string, read from standard input, for falsity? Or how do you compare a variable that exists, but is not defined or is false, and differentiate it from zero? And how many more rules and exceptions are there to this new operator, that can compare types as well? Are we forgetting the principle that we should be able to implicitly treat any type as any other type? Didn't we learn a lesson from the true/false thought experiment?

Perl's use of two types of operators for two types of comparison remains simpler, and the main reason is that all things are supposed to be coerced into all other things. That is a sound principle in Perl, but without these extra operators, other languages find a barrier preventing them from seamlessly implementing the philosophy.

That aside, there is not a triple-equals version of <= or >= is there? Those are the troublemakers, after all. Those are the ones that force us to sort our number-like strings the way they want to, not the way we want to. How do we prevent this behaviour on these other operators? Oh sod it all, let's just have separate comparisons for strings and numbers.

1 By convention, Perl is the language and perl is the interpreter.
2 The Constraint explained earlier shows why we got rid of the boolean type for Perl. While this row is correct for PHP's and JavaScript's two boolean values, all three languages will come a cropper if you try to compare a trueish value with a falsish one. Perl, again, simplifies it by not doing this, and therefore we can't say what Perl will treat true and false as because they don't exist. But it would be numbers.
3 The spaceship operator returns -1 if A is less than B; zero if they are equal; and 1 if A is greater than B. The same test is three lines of code, or two chained ternary conditional operators, in other languages: sort { $a <=> $b } ... is better than sort { $a == $b ? 0 : $a < $b ? -1 : 1 } ... because a) it is legible. cmp does the same, but for strings.

2011-08-10

Your System is not Gödel-Proof

Gödel tells us, paraphrased, that no system can be fully expressed in terms of itself. For example, the dictionary, which attempts to express the meaning of all English words, nevertheless uses English to do so. You have to know enough words to learn what the other words mean, and you can build up from there.

In other words, every system needs its axiomata. An axiom is essentially a fact about a system that is assumed known.

This is analogous to the design of a system. It seems somehow more elegant to design a system that works based on things that are already working than to write a new procedure to make something work: these are your axiomata. Overengineering comes in when you relentlessly try to base your system on axiomata instead of simply creating a new entry in your dictionary. When you find yourself trying to find the "most elegant" solution to your problem you might actually be trying to find the "least work" solution.

Overengineering, if you think about it, has the ultimate goal of having the entire thing just work if you prod at a particular pressure point in your towering mass of pre-existing code.

Well stop it. You can't make an entire system without writing a bit of code. Heck, you don't even have a system if it's just a collection of axiomata. You will have to write at least a bit of glue code. And don't try too hard to leave your system as a collection of axiomata for new systems. Make it work, first.

2011-08-02

Lists, and Things Made Of Lists

In a previous post, we talked about how some of Perl's data types are aggregate types, while others are not. We differentiated them as whether the type holds one scalar, or any number of scalars. The scalar data type is not aggregate (it holds but one thing), and arrays and hashes are aggregate.

This post is intended to explain how lists are used in the context of these data types.

Lists

Perl's aggregate data types are the array and the hash. Each is constructed from a list. The actual definition of a list covers quite a lot of cases—a lot of ways in which these can be constructed. However, the basic concept of "a list" is pretty simple; it's an ordered sequence of (zero or more) scalars.

When you assign a value to a scalar you usually either populate it with input data or assign it a literal value:

my $input = <>;
my $limit = 100;
my $user = 'user';

When you assign a value to an aggregate data type you populate it with a list:

my @lines = <>;
my @days  = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun');
my %colour = (
  red => '#ff0000',
  green => '#00ff00',
  blue => '#0000ff',
);

A list is a sequence of scalars. The most basic way of constructing a list is with the comma operator.

The Comma Operator

Little did you, the new Perl developer, know, but the humble comma is also an operator like all others. It has low precedence, and its job is to concatenate two lists together. Things are lists when they are in list context.

A common misconception is that parentheses form list context. After all, every time you see a list, you see parentheses! Not strictly true. The parentheses are simply there to make sure the comma operator happens first; it is where the whole expression is used that determines its context. Stay with me and I'll try and make it clearer.

To create a list we use the comma operator.

1, 2, 3, 4, 5, 6

This fragment of code makes no sense on its own and thus needs some context to make sense. However, it is an expression—it's called that because it returns a value.

Where we use the expression determines the value it returns.

my @array = (1, 2, 3, 4, 5, 6);

This is the example we're familiar with. The context of the expression is determined by the assignment operator. When we assign to an array, the expression on the right-hand-side of the assignment operator is in list context; thus the comma operator in our expression is in list context, and hence creates a list.

Why the parentheses? This looks perfectly innocuous and, indeed, perfectly legible to any newcomer to Perl:

my @array = 1, 2, 3, 4, 5, 6;

But that's because the newcomer who reads it is not as educated as you are about to become, and is not aware that the assignment operator = has higher precedence than the comma operator. That means it's evaluated first. That means you get this:

(my @array = 1), 2, 3, 4, 5, 6;

Because you are an honourable and competent Perl developer you have enabled warnings. Thanks to this, you are warned not once but five times that you have "Useless use of a constant in void context".

In the latter example, no comma operator is evaluated in list context, because the assignment operator is evaluated first. It consumes the array (or hash) and the 1, and is then done. The remaining comma operators are then evaluated in void context, which is the context anything is evaluated in when there is no operator or other syntax imposing a different context. Just saying 2 is useless in void context, so Perl tells you you have done it.

In other list contexts, there are already parentheses:

for my $i (1, 2, 3, 4, 5, 6) { ... }

And in some, there is no operator with higher precedence, so we don't need parentheses:

push @array, 1, 2, 3, 4, 5, 6;

In scalar context (remember: this is determined by what you're assigning to), the comma operator will return its right-hand operand. That is to say, if you try to build a list with commas and assign it to a scalar, you will get the last item.

my $scalar = (1, 2, 3, 4, 5, 6); # $scalar = 6

And of course if you forget the parentheses, the assignment happens first, and you get warnings.

my $scalar = 1, 2, 3, 4, 5, 6; # $scalar = 1

Generally there is no reason to do this.

Hashes

We didn't mention hashes above, to keep it simple. Hashes are also aggregate data types and are also constructed from lists. However, the most common way of seeing a hash constructed in code is like this:

my %colour = (
  red => '#ff0000',
  green => '#00ff00',
  blue => '#0000ff',
  ...  # etc
);

What is this? In some languages you will find that there is a specific syntax required to create a hash (or associative array—but that's a ) but in Perl the syntax is merely convenience. There is nothing particularly special about the syntax above; you can construct a hash from any list (but of course you will be warned if you use an odd number of elements, since hashes are paired).

my %colour = (
  'red',  '#ff0000',
  'green', '#00ff00',
  'blue', '#0000ff',
  ...  # etc
);

This operator => is known as the fat comma, because it has the same effect and precedence as the comma, but it is 2 characters and, hence, fat. Other than that, you'll notice the other difference is that in the first example I didn't have to quote the string keys. The syntactic benefit of the fat comma is that it quotes the bareword to its left for you, which covers the majority of cases, and means you only have to quote keys that don't look like identifiers.

But, ultimately, you have still created a list. You still have to use parentheses, and as we will learn further down, you can construct the hash by using anything that returns a list.

"Construct"?

Yes. We use this term when we give a variable a value. We might say we use this term when we create a new variable, but of course we can reconstruct an existing variable at any time.

When we declare a new array or hash but don't perform an assignment at the same time, we are implicitly constructing it from an empty list.

my @array;        # These two
my @array = ();   # are equivalent
my @array = 1 .. 5;            # These two are
my @array = (1, 2, 3, 4, 5);   # also equivalent

Constructing a hash does impose the requirement that the provided list be even in length, or else a warning will be generated. Otherwise, there is no special requirement for constructing a hash.

my %hash;        # These two
my %hash = ();   # are equivalent
my %hash = 1 .. 6;               # These two are
my %hash = (1, 2, 3, 4, 5, 6);   # also equivalent
my %hash = ( 'a', 1, 'b', 2 );   # And these two are 
my %hash = ( a => 1, b => 2 );   # also equivalent

List Unpacking

List unpacking is the principle of doing what you just did, but with a list on the left hand side of the assignment operator as well as the right.

Just to confuse you, parentheses on the left hand side of the assignment operator do create list context.

List unpacking takes sequential items from the source list, and assigns them in order into the scalar or aggregate values in the destination. This example involves just scalars:

my ($first, $second) = @days;

In this example, the rest of @days is ignored if it is more than 2 items long. $first and $second get undef if @days is not long enough to populate them. Using our example from earlier we will expect $first and $second to have 'Mon' and 'Tue' in them, respectively.
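A quick sketch of the difference those left-hand parentheses make, and of what happens when the list runs out. The three-day @days here is cut short on purpose so that there is something left over to be undef.

```perl
use strict;
use warnings;

my @days = ('Mon', 'Tue', 'Wed');   # deliberately short

my $count   = @days;    # scalar context: the number of items, 3
my ($first) = @days;    # list context: the first item, 'Mon'

# More variables than items: the leftover variable gets undef.
my ($x, $y, $z, $extra) = @days;

print "$count $first\n";                               # 3 Mon
print defined($extra) ? "defined\n" : "undefined\n";   # undefined
```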

This next example uses one scalar and one aggregate. If any of the items on the left is an array, it gobbles up all the rest of the list on the right.

my ($mon, @tue_to_sun) = @days;

That means this doesn't work:

my ($mon, @tue_to_sat, $sun) = @days;

While the Perl hackers could feasibly make this work, there are logical problems that are essentially unsolvable. Since Perl uses the concept of DWIM as much as possible, it is better to avoid trying to make this work than to make it not do what you meant.
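If you do want the last item on its own, one way round it (a sketch, and not the only option) is to unpack the front of the list and then pop the end off the remainder. The variable names below are mine.

```perl
use strict;
use warnings;

my @days = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun');

# The array in the middle takes everything that is left ...
my ($mon, @tue_to_sat, $sun) = @days;
# ... so $sun is undef and @tue_to_sat actually runs to 'Sun'.

# Unpack the front, then pop the last item off the rest.
my ($first, @middle) = @days;
my $last = pop @middle;

print "$first | @middle | $last\n";   # Mon | Tue Wed Thu Fri Sat | Sun
```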

Logically, this brings us back to the copying of an array that we've seen before, simply by not using that scalar:

my (@days_copy) = @days;

Because the @days_copy puts the assignment operator in list context anyway, we can lose the parentheses, and we're back to square one.

You can also swap existing variables around using the same syntax. Here's an example that makes sure $x is always greater than (or equal to) $y:

if ( $y > $x ) {
  ($x, $y) = ($y, $x);
}

This list unpacking idea is usually used to fetch the parameters to a function out of the special array @_. We'll see that later.

Interchangeability

The fact that most newcomers to Perl don't immediately grasp is that whenever a list is required, either an array or hash can be used in its place. Both an array and a hash, used as a list, will yield their contents as such a list. Being unordered, the list you get out of a hash may not be in the same order as the list you put into the hash, but it'll have the same contents, and the pairs will maintain their association.

In that vein, all of the following are valid, albeit of debatable usefulness.

my @dirs = ('.', '..', '/', '/home');
my %pointless_hash = @dirs;
my @dirs_copy = @dirs;
my @hash_pairs = %pointless_hash;
my @useless_variable = (@dirs, @hash_pairs, @dirs);
my $count = @dirs;   # You know about this of course

push @dirs, @dirs;
push @dirs, %pointless_hash;

for my $item (@dirs) { ... }
for my $key_or_value (%pointless_hash) { ... }

for my $item ('/opt', @dirs, 1, 2, 3, 
@hash_pairs, %pointless_hash, $count) {
  ...
}

Both the aggregate data types simply become a list again when you use them as lists. Of course, a scalar becomes a list as well when you use it as a list:

my $cur_dir = '.';
my @dirs_to_scan = $cur_dir;

In the previous example you can see the comma operator being used with scalars, literals (also scalar, of course), arrays and hashes, all at once. Although a confusing and contrived example, it endeavours to show that the aggregate data types can be used in any list situation and will behave consistently; i.e., as a list of the scalars they contain.
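The same holds for a round trip. This sketch flattens a hash into an array and builds a second hash from that array; the order of the pairs may change, but each key keeps its value.

```perl
use strict;
use warnings;

my %colour = (red => '#ff0000', green => '#00ff00', blue => '#0000ff');

# A hash in list context flattens to key, value, key, value, ...
my @flat = %colour;          # six items, pair order not guaranteed

# ... and any such list builds an equivalent hash again.
my %copy = @flat;

print scalar(@flat), "\n";   # 6
print "$copy{green}\n";      # #00ff00
```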

The Compound Data Structure Confusion

All this helps to explain the confusion newcomers to Perl run into when they try to create complex data structures (hashes or arrays of hashes or arrays) without using references.

With this new-found knowledge, it should be clear what is wrong with the following code:

my @dirs = ('.', '..', '/', '/home');
my %options = (
  dirs => @dirs
);

Of course the hash constructor is a list. The fat comma => is just a normal comma with style, and the array is just an array! It's in list context, so it behaves consistently—i.e. just as we've seen it behave so far.

The above hash assignment is exactly equivalent to this:

my %options = (
  'dirs', '.', '..', '/', '/home'
);

... which is a 5-element list—which is a warning, as we already know. This problem is solved by the use of references, which would turn, in this example, @dirs into a single scalar, essentially wrapping up the whole array as a single value in the list.
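As a minimal sketch of that fix: the backslash takes a reference to the array, and @{ ... } turns the reference back into the list when you need it.

```perl
use strict;
use warnings;

my @dirs = ('.', '..', '/', '/home');

# \@dirs is a reference: one scalar that points at the whole array.
my %options = (
    dirs => \@dirs,
);

my $n_keys = keys %options;           # 1: the list had two elements
my @back   = @{ $options{dirs} };     # dereference to get the list back

print "$n_keys key, ", scalar(@back), " dirs\n";   # 1 key, 4 dirs
```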

Other List Constructors

The comma operator is not the only way of constructing a list. The range operator .. constructs a list of all numbers between two integers, or all alphabetically sequential strings between two strings of a particular length.

my @array = 1 .. 6;
my %hash = 1 .. 6;
my @letters = 'a' .. 'z';

The qw operator makes a list of strings by splitting on whitespace:

my @animals = qw/cat mouse dog rat monkey/;
my %genus = qw/
  cat felis
  dog canis
  mouse mus
/;
use Module qw/this is a list as well/;

Note that none of these list constructors requires parentheses—because there isn't a comma in the syntax. You can use parentheses—qw()—but that is the syntax of the qw operator, and not treated as actual parentheses at all.

keys and values

A hash is an aggregate data structure that is paired. Half of its scalars are keys, and the other half are the values associated with those keys.

You can query the hash for either list separately from the other. Both keys and values return a list.

my %colour = (
  red   => '#ff0000',
  green => '#00ff00',
  blue  => '#0000ff',
);
my @colour_names = keys %colour;
my @colour_hexes = values %colour;

for my $colour_name ( keys %colour ) {
  my $hex = $colour{$colour_name};
  ...
}

As long as you don't change the hash, both keys and values will return the list in the same order—that is to say, if you were to interleave them again, the pairs would match up.
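You can check that claim by pairing the two lists back up by position. The sketch below does the interleaving with a hash slice, @rebuilt{...}, which assigns a list of values to a list of keys in one go.

```perl
use strict;
use warnings;

my %colour = (red => '#ff0000', green => '#00ff00', blue => '#0000ff');

my @names = keys %colour;
my @hexes = values %colour;

# Pair the two lists back up by position, using a hash slice.
my %rebuilt;
@rebuilt{@names} = @hexes;

# Count the keys whose value differs; none should.
my $same = !grep { $rebuilt{$_} ne $colour{$_} } keys %colour;
print $same ? "pairs match\n" : "pairs differ\n";   # pairs match
```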

map, grep and sort

These three operators act on lists and return another list. Everything you have seen up to now applies to both the list you input, and the list you get back.

That is to say, wherever you use a list, you can use map, grep or sort on that list instead.

my $dir = '.';
opendir my $dirh, $dir or die "Cannot open $dir: $!";
my @files = readdir $dirh; #all files

# loop all files
for my $file ( @files ) {...} 

# loop some files
for my $file ( grep { $_ !~ /^\.\.?$/ } @files ) {...} 

# loop files in alphabetical order
for my $file ( sort @files ) {...} 

# loop files without their extensions4
for my $file ( map { s/\..+$//r } @files ) {...} 

We can use @files as a list directly; or we can perform a sort, map or grep on it to return a different list. sort alters the order of the elements; map alters the elements themselves; and grep reduces the number of elements.

Since everything at this point is a list, you can chain them together.

for my $file ( sort map { s/\..+$//r } grep { $_ !~ /^\.\.?$/ } @files ) { ... }

The input list for sort is the output list of map; the input list to map is the output list from grep; and the input list to grep is the list you get by using an array in list context.
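Reading the chain from right to left is easier when it is laid out one step per line. This sketch uses a made-up list of filenames standing in for readdir, and needs Perl 5.14 or later for the /r flag.

```perl
use strict;
use warnings;
use 5.014;   # for the /r flag on s///

# Made-up directory listing, standing in for readdir.
my @files = ('.', '..', 'notes.txt', 'zebra.png', 'apple.tar.gz');

my @names =
    sort                        # 3. order what is left
    map  { s/\..+$//r }         # 2. strip everything from the first dot on
    grep { $_ !~ /^\.\.?$/ }    # 1. drop only . and .. (note the anchors)
    @files;

print "@names\n";   # apple notes zebra
```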

Functions

Now that we've seen lots of different uses of lists, arrays and hashes in list context, and we've seen a few different ways of constructing them, we can tackle the final confusion of newcomers to Perl: function arguments.

When you pass arguments to a function they appear in the special array @_ inside the function. Let's look at how we call a function.

sub add {
  my ($x, $y) = @_;
  return $x + $y;
}

add 1, 2;  # returns 3

The parameter list to a function is in list context. It is a parameter list, after all. The parameters to the add function above are 1 and 2. Look familiar? It's the comma operator in list context, creating a list out of the scalars 1 and 2. There are no parentheses because they are optional for function calls in Perl; there is no other operator on this line, so we don't need to override the precedence of the comma operator like we did at the start of the post when constructing aggregates.

Since the parameter list is Just A List this means everything we've talked about so far also applies.

sub add {   
  my ($x, $y) = @_;
  return $x + $y; 
} 

my @numbers = (1, 2); 
add @numbers;  # returns 3

The array @numbers is used as a list because it is in list context, and hence its values are sent into the function and appear, as usual, in @_.
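A quick way to convince yourself of this is to count what arrives in @_. The count_args function below is a hypothetical helper, not something from earlier in the post.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how many scalars arrived in @_.
sub count_args {
  return scalar @_;
}

my @numbers = ( 1, 2 );
my @empty;

# The array is flattened into the parameter list alongside the extra scalar.
my $three = count_args( @numbers, 3 );

# An empty array contributes nothing at all to the list.
my $zero = count_args( @empty );

print "$three $zero\n";    # 3 0
```

The function has no way of knowing that two of its three arguments came from an array; all it ever sees is one flat list.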

This, therefore, explains how you can do things like this:

sub cat_noise {
  my %options = @_;

  if ($options{meow}) {
    say $options{meow};
  }
  else {
    say "Meow.";
  }
} 

my %opts = qw/ meow purr /; 
cat_noise( %opts );

I put parentheses in here for clarity, but let's reduce this hideously contrived example using the rules we've already mapped out so far.

First, we know that the traditional way of constructing a hash, with =>, is just a tidy way of constructing a list. So a hash is just constructed from a list.

We also learned that qw is an operator that creates a list by splitting on whitespace, and can use any character to delimit its argument. This time we chose /. This, therefore, is what Perl sees:

my %opts = ('meow', 'purr');

We then send %opts into cat_noise. Again, we've seen that if you use a hash where a list is expected, a list is what you get. So Perl unpacks the hash again and sends the resulting list to cat_noise:

cat_noise( 'meow', 'purr' );

Inside cat_noise, the first thing we do is unpack the list provided by @_ into an aggregate data type: a hash called %options. Then %options is the basis for the body of the function, wherein we check whether the meow key has a true value, and say that value if it does, and "Meow." if not.

We can see therefore that the way we pass a hash into a function is to use it as a list, and then convert it back into a hash by using @_ as a list. Some people advocate passing this as a hash ref so that you avoid constructing a new hash, which is theoretically slightly faster.
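For comparison, here is a rough sketch of the hash ref approach. References are not covered in this post, so treat this as a preview; cat_noise_ref is a hypothetical name, and it returns the noise rather than saying it so the result is easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical variant of cat_noise: takes one scalar, a reference to a hash.
sub cat_noise_ref {
  my ($options) = @_;    # still unpacking @_ as a list, but it has one item

  return $options->{meow} ? $options->{meow} : "Meow.";
}

my %opts = ( meow => 'purr' );

my $noise   = cat_noise_ref( \%opts );    # pass a reference to %opts
my $default = cat_noise_ref( {} );        # pass a reference to an empty hash

print "$noise $default\n";    # purr Meow.
```

The trade-off is that the function now works on the caller's hash rather than a copy, so any change it makes to the options is visible outside.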

More Common Examples

A hash from a map

Sometimes you may see a construct like this:

my %uniq = map { ($_ => 1) } @array;
my @array_uniq = keys %uniq;

What is happening here? As we know, map returns a list and you construct a hash from a list. map also accepts a list, and you can use an array as a list too. In the block we give to map, we actually also return a list—a 2-item list. That means that the list we get out of map will have 2 items for every 1 item we put into it. That one item is represented by the $_, and the second item is simply 1.

So if @array were a list of colours:

my @array = qw( red green blue yellow red );

Then Perl would create a 2-item list for each of these, and our output would be:

( red => 1, green => 1, blue => 1, yellow => 1, red => 1 );

And so we create the hash:

my %uniq = ( red => 1, green => 1, blue => 1, yellow => 1, red => 1 );

Since the key 'red' is repeated, the latter is accepted as the de facto pairing—not that it matters because both values are 1—but 'red' still only appears once in the hash (because keys are unique).

Now if we run keys on it, we get back a list that contains the unique elements of the original @array, in no particular order:

my @array_uniq = keys %uniq; # red, green, blue, yellow (in some order)
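Put together as one runnable piece, the whole idiom looks like this. The sort is an addition of mine, there only to make the printed order predictable.

```perl
use strict;
use warnings;

my @array = qw( red green blue yellow red );

# Each element becomes a key/value pair; the repeated 'red' collapses.
my %uniq = map { ( $_ => 1 ) } @array;

# keys returns the names in no particular order, so sort them for display.
my @array_uniq = sort keys %uniq;

print "@array_uniq\n";    # blue green red yellow
```

Note that this loses the original order of @array, which may or may not matter to you.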

Default options

That leads us onto this:

my %opts = (%defaults, %options);

This ought to now be clear. Both hashes are expanded to their representative lists, and the contents of %options come after the contents of %defaults. Because a later pair overrides an earlier pair with the same key, the values from %options take precedence, and any keys missing from %options keep their values from %defaults.
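A small sketch with made-up option names shows the effect:

```perl
use strict;
use warnings;

# Hypothetical defaults and user-supplied options.
my %defaults = ( colour => 'black', size => 'medium' );
my %options  = ( colour => 'red' );

# Both hashes flatten into one list; later pairs win on duplicate keys.
my %opts = ( %defaults, %options );

print "$opts{colour} $opts{size}\n";    # red medium
```

Swap the order to ( %options, %defaults ) and the defaults would clobber whatever the caller asked for, so the order matters.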

Further Considerations

Left as an exercise to the reader are the ideas of building an array bit-by-bit and using that as a function parameter list, and of returning a list from a function and using that as another function's parameter list.
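If you want a head start on the exercise, here is one possible sketch. The pair function is hypothetical, and add is the same function used earlier in the post.

```perl
use strict;
use warnings;

sub add {
  my ( $x, $y ) = @_;
  return $x + $y;
}

# Hypothetical function that returns a two-item list.
sub pair {
  return ( 3, 4 );
}

# Build the parameter list bit by bit, then use the array as a list.
my @args;
push @args, 1;
push @args, 2;
my $from_array = add(@args);

# The list returned by pair() becomes add's parameter list directly.
my $from_function = add( pair() );

print "$from_array $from_function\n";    # 3 7
```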

Having seen what happens when you try to put an array or a hash into another array or hash (the list-flattening effect) you should now read about references. These are the mechanism by which the entire array or hash can be stored as a single scalar, thus providing the logical boundaries between the list that is in the array, and the list that is in the sub-array. Or hash.

The technically-minded may wish to now read about prototypes, being a way of changing the way Perl understands the parameter list you provide. The curious reader should be aware that prototypes are not a general tool, and can cause much confusion and inconsistency in the way you and others expect things to work if they are misused.

1 It may confuse you to see that 1; is often used to return from functions and, indeed, from modules. Note that functions are evaluated in the context of where they are called, which means this could be evaluated in a non-void context. Therefore, you do not get a warning about that. This is true of modules too, which is why you can use any true value as the module's return value.
2 In fact you don't get a warning about 1; because 0 and 1 are exempt from this warning (see perldiag). However, the warning does apply to all other constants, including strings.
3 If you don't use the parentheses you get scalar context when assigning to a scalar, and the comma on the left suffers the same problems as it did before, i.e. the precedence is wrong. If the item immediately before the equals sign is a scalar, you get scalar context, which is the last element when you use the comma operator: my $x = (1, 2, 3, 4, 5, 6); # x = 6
4 The /r in the substitution here (s///r) is introduced in Perl 5.14, and is used to return the altered string instead of altering the actual string. Prior to 5.14, you can do this by applying the regex to a copy of the string: map { (my $x = $_) =~ s/\..+$//; $x } LIST
5 Function prototypes are out of the scope of this post.

2011-07-01

Introducing Protip


A while ago I had a long, protracted conversation with my manager trying to convince him that our company should have a github account for select open-source projects we, as a company, want to release into the wild, on the basis that it would be good PR et cetera. That conversation went like this:

Me: I think we should put some open-source projects on github

Him: Good idea. People can download this stuff anyway when it's on the web so we might as well put it out there on purpose.

It is honestly quite a pleasure to work for a manager savvy enough to hold this opinion, rather than the sort of manager you hear about who, in spite of all observational evidence, maintains a world view that the company's code is its own and the correct answer lies in various obfuscation and encryption techniques that entirely defeat the point of the code being secured in the first place.

So without further ado I present the project that spawned this highly modern thinking, Protip.

This is a jQuery plugin intended to make a tooltip that is actually useful. Having tried many other tooltips I found that most suffered from the same basic problem: The method of deciding what should be in the tooltip (and what the tooltip should look like) was highly arbitrary, or at least difficult to shoehorn into your average document, to the extent that the majority of your tooltip logic was creating the tooltips in the right place so that the plugin, which is meant to save you work, can see them. By which time you might as well have written your own tooltip anyway. So I did that.

Protip can take a function as the tooltip specifier, and the function returns a jQuery object. Simple as that. There are a few1 predefined such functions but generally you tell the plugin what and where your tooltip is.

It is currently a bit hastily written and hence there is a certain quantity of Javascripty scope unsureties going on, but nothing a bit of a refactor won't solve.

Here it is again. Go nuts. Feedback appreciated in the form of patches or pull requests. https://github.com/propcom/Protip

1 1

2011-06-21

It's as if they thought it through.


I wonder why we have separate arrays and hashes in Perl. Other languages don't do it. After all, the principal difference between an array and a hash is that an array references its items by ordinal position and hashes use strings to name them. A hash could surely be conflated with an array simply by using integers as the string keys - especially since Perl can use strings and integers interchangeably.

We would have to make changes, but all it would really need is a way of detecting when the user intended to use an ordinal array and when the user intended to use an associative array. This should be easy enough: all we need to do is check whether all the keys are sequential and start at zero, and we know it's an ordinal array. To accommodate the fact this may be a coincidence, we can create a second set of functions so that the user can specify that even though the array appears to be ordinal it is actually just that the keys happen to be numeric and happen to be in order starting from zero. We'd also have to change the way sort works, in fact creating two functions: one function that orders an ordinal array and re-creates the keys when the values are in their new positions, and one function that, having sorted the array by value, makes sure the keys still refer to the same value. Of course, sorting integers as strings returns a different order from sorting integers as integers ('10' is alphabetically between '1' and '2'), so we would need a keys function that knew whether to return strings or integers so that we know, when sorting the list of keys, whether to sort them as strings or integers.

Splicing would also require two functions, of course. It doesn't really make sense to splice a nominal array because there is no inherent order to it; but since a fundamental tenet of structural programming is that if you make two things the same, you must treat them the same, then we have to make it make sense. Since splicing is all about removing things by their position (it's very easy to remove a key from a nominal array: just remove it), we need to give associative arrays an internal order. Or possibly just whinge when we use a thing that doesn't look like an ordinal array in splice, thereby affirming a difference between ordinal and associative arrays that we are desperately trying to pretend doesn't exist.

We'd also have to determine what to do when, for example, someone creates an entry in an array by giving it an ordinal position that doesn't exist. Do we create an array of suitable length and fill it with bits of emptiness in order to maintain the illusion that this array is ordinal? Or do we create it as an associative array with a single numerical key? What happens if someone creates key [2], then key [0], then key [1]? Do we sneakily go back and pretend we knew they meant this to be an ordinal array from the beginning, or do we treat this as an associative array and annoy the hell out of the user, who expected an ordinal array with three entries?

And then finally an extra function is needed so that we can refer to elements by their ordinal position even if it's not a real ordinal position: after all, -1 is a valid associative array key but in an ordinal array it means "the last element" like it does in common functions like substr, so we'd have to create a way of referencing the array backwards without accidentally confusing a negative index with a string key.

Oh yes. That's why.

Further reading

Here's a Wikipedia link: http://en.wikipedia.org/wiki/Waterbed_theory — if anyone can find TimToady's paper on this on the interwebs, I'd like to link to that from here too, so I'd be grateful for that.

2011-06-11

The Anatomy of Types

A chief confusion of people new to Perl is the apparently disconnected syntax used to refer to variables. Of particular consternation is the syntax used for accessing arrays and hashes: especially slices thereof. This seems to be because the creation of and use of arrays and hashes is taught at a simpler level than the level of understanding required to actually see how they work.

Here's a table that shows some variables, as they are used, and how they divide up. It also shows the number of items each expression will return.

Expression                                 Number of items
Sigil  Identifier  Subscript
$      scalar                              1
@      array                               Many
%      hash                                Many pairs
$      array       [0]                     1
$      hash        {key}                   1
@      array       [0,1,2]                 Many
@      hash        {'key1', 'key2'}        Many
%      array       [0,1,2]                 Many pairs
%      hash        {'key1', 'key2'}        Many pairs

1. The Sigil

$

$ refers to a scalar. A scalar is a single, atomic item. Its contents cannot be divided without applying further processing to the scalar itself. Whenever an expression begins with a $, it is a scalar, and there is one item.

@

@ refers to more than one scalar, in some order. Without a subscript, it refers to an array; otherwise it simply refers to a list. Saying it is "in order" means that we can identify any item within the list by its numerical position; it also means that there is a first, second, nth and last element in it.

%

% refers to a hash. A hash is also a collection of scalars, but there is no order to them. Rather than each scalar being in a known position in a list, instead half of the scalars are referred to by the other half. The "other half" are all strings and are called keys. If the % is used you know that you are referring to a set of items that alternate between keys and values. Having no order, it is therefore meaningless to talk about the first, second, nth, or last element of the hash.

Apply these rules to the table above. See that every expression whose sigil is a $ gives us 1 item; every expression whose sigil is @ gives us many (zero or more) items; and every expression whose sigil is % gives us many paired items.

2. The identifier

The identifier is the name of the variable. Without its sigil it is fairly meaningless because it could refer to anything1. With its sigil, suddenly we know what form of variable we are talking about - scalar, array or hash. And with a sigil and a subscript, we know yet again that we are talking about one or many scalars, and which type of variable the identifier refers to.

Here's the tricky part. Each identifier can refer to all types. It is perfectly legitimate (albeit often quite a bad idea) to have all three of $var, @var and %var in the same scope at the same time.

This is allowable because it is impossible for there to be ambiguity. There is no crossover in either of the tables below, either within themselves or between them. A combination of sigil and subscript can tell us exactly which type of variable the identifier refers to, and therefore Perl simply allows for all types to be under a single name. Thus:

Expression Looks for
$var $var
@var @var
%var %var
$var[0] @var
$var{key} %var
@var[0,1] @var
@var{'key1', 'key2'} %var
%var[0,1] @var
%var{'key1', 'key2'} %var
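
The first few rows of the table can be checked with a short sketch. The contents of the three variables are made up; the point is only that each expression finds the right one.

```perl
use strict;
use warnings;

# Three different variables sharing one identifier (legal, if unwise).
my $var = 'the scalar';
my @var = ( 'array item 0', 'array item 1' );
my %var = ( key => 'hash value' );

my $from_scalar = $var;         # no subscript: looks for $var
my $from_array  = $var[0];      # brackets: looks for @var
my $from_hash   = $var{key};    # braces: looks for %var

print "$from_scalar / $from_array / $from_hash\n";
```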

3. The Subscript

When you have an aggregate data structure (array or hash) you know that you are talking about possibly multiple scalars at once. Arrays are accessed by selecting an item by its position, and hashes are accessed by using the string key we associated with the scalar.

Armed with the knowledge about what the sigil means we can consult the table above to pull apart the familiar way of accessing arrays and hashes to get an item out:

my $first_item = $things[0];

We know $first_item is a scalar because it has a $. We know $things[0] is a scalar because it has a $.

my $first_name = $person{first_name};

We know $first_name is a scalar because it has a $. We know $person{first_name} is a scalar because it has a $.

Assigning a scalar to a scalar makes perfect sense. Although it appears that the sigil has changed on the array and hash, what we actually see is that the identifier of the array is 'array'; the identifier of the hash is 'hash'; and the choice of sigil is affected by how much of the data structure we want.

Array and Hash Slices

Arrays and hashes are aggregate data types, which means they contain multiple scalars. It is reasonable therefore to expect we can request more than one item from them at the same time.

Since one item is referred to with the $ sigil, and we used a $ to access a single item from the aggregate, then we can simply use @ to refer to multiple items from the same aggregate.

my @both_names = @person{'first_name', 'last_name'};

Observe that we can access two values from the hash by supplying both keys as a list in the subscript and using @ instead of $. This of course applies to any quantity of keys, and also applies to arrays:

my @relevant_things = @things[0,3,5];

This action of taking several selected elements from an aggregate is called slicing.
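Here are both slices in a form you can run. The data in %person and @things is invented for the sake of the example.

```perl
use strict;
use warnings;

# Hypothetical data.
my %person = (
  first_name => 'Ada',
  last_name  => 'Lovelace',
  title      => 'Countess',
);
my @things = qw( a b c d e f );

# Hash slice: @ sigil, braces, a list of keys.
my @both_names = @person{ 'first_name', 'last_name' };

# Array slice: @ sigil, brackets, a list of indices.
my @relevant_things = @things[ 0, 3, 5 ];

print "@both_names\n";         # Ada Lovelace
print "@relevant_things\n";    # a d f
```

The values come back in the order you asked for them, not the order they are stored in.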

A warning about hash slices

Remember to use the @ instead of the $ when taking a hash slice. The syntax of putting a list in the subscript to get a scalar refers to a long-deprecated feature that you never want to use intentionally.
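To see what goes wrong, here is a sketch of the mistake. With a $ sigil, Perl joins the keys into one string using the special variable $; (by default the character "\034") and looks up that single key. The %person data is made up.

```perl
use strict;
use warnings;

# Hypothetical data.
my %person = ( first_name => 'Ada', last_name => 'Lovelace' );

# Mistake: $ sigil with a list of keys. This looks up ONE joined key.
my $oops = $person{ 'first_name', 'last_name' };

# The key Perl actually looked for.
my $joined_key = join $;, 'first_name', 'last_name';

# What was intended: a slice, with the @ sigil.
my @names = @person{ 'first_name', 'last_name' };

print defined $oops ? "found\n" : "nothing under the joined key\n";
```

No such key exists, so the mistaken lookup quietly gives you undef instead of either name.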

Key-Value/Index-Value Slices

We've seen how you can use $ and a subscript to get a single scalar, and how you can use @ and a subscript to get a list of values. You can also (as of Perl 5.20) use % and a subscript to get index-value or key-value pairs.

my %part = %whole{'relevant', 'parts', 'only'};
my %index_value = %things[0,3,5];

This kind of slice returns a pair for each thing you're slicing; both the key or index as well as the value.
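A runnable sketch of both kinds, assuming Perl 5.20 or later and some invented contents for %whole and @things:

```perl
use 5.020;    # key-value slices need 5.20; this also turns on strict
use warnings;

# Hypothetical data.
my %whole  = ( relevant => 1, parts => 2, only => 3, ignored => 4 );
my @things = qw( a b c d e f );

# Key-value slice of a hash: a smaller hash with just the named keys.
my %part = %whole{ 'relevant', 'parts', 'only' };

# Index-value slice of an array: the indices become the keys.
my %index_value = %things[ 0, 3, 5 ];

say join ', ', map { "$_ => $part{$_}" } sort keys %part;
say join ', ', map { "$_ => $index_value{$_}" } sort keys %index_value;
```

The second line printed is "0 => a, 3 => d, 5 => f", which shows the index travelling along with each value.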

Working Backwards

We can work backwards from a line of code to know what we are talking about. Perl has to do this, because we change the sigil depending on how many things we're talking about.

To determine where a scalar comes from, we need to look at the subscript. Arrays and hashes don't tend to have names that immediately make it obvious that they are arrays or hashes. But subscripts have syntax that resolves this cleanly.

An identifier followed by brackets - [ ] - refers to an array. An identifier followed by braces - { } - refers to a hash. An identifier followed by no subscript refers to the exact type the sigil refers to. The sigil refers to the type of the returned value. The identifier, coupled with the subscript, tells us what type of data structure the value comes from.

Given the identifier 'var', the following table helps explain where the data comes from in various situations:

Sigil  Subscript  Looks for
$      (none)     $var
@      (none)     @var
%      (none)     %var
$      [ ]        @var
$      { }        %var
@      [ ]        @var
@      { }        %var
%      [ ]        @var
%      { }        %var

This confirms our rule: that without a subscript, the sigil determines the variable we seek; otherwise, the subscript does.

This can be rationalised simply. If we use a subscript, we are requesting only a part of the aggregate variable in question; i.e. a selection of one or several of the scalar values it contains. This means that, if a subscript is present, we can use it to determine where the data should come from. If we don't use a subscript, it is therefore reasonable we actually intended to refer to the aggregate itself - and this is indeed the case. But in all cases, the sigil still determines the type of data we get back, be it a scalar or a list or a paired list.

Scalars are not aggregate, so there is never a subscript that will translate into a scalar. That's why '$var' appears only once in the table.

Further reading

So far we have talked about lexical variables (think "braces"). There are two other types of variable: package and global. Package variables are accessed by their fully-qualified name ($Package::var) from other packages, or the same as above from within the package. Global variables - other than the built-in set - should be avoided.

Read Symbol Tables in perlmod for information on package variables. And you could do worse than read about typeglobs, a special internal data type for referring to the entire set of types available in the symbol table.

1 Actually, it can't refer to anything at all. An identifier without a sigil is usually interpreted as a subroutine call, but can result in ambiguity that causes strictures to complain about barewords. Nevertheless, a (named) subroutine is actually a package variable, and we are talking about lexicals here.