Pod::Cats: Understanding and Using map and grep

Edits: There is no strict reason why map should not return a shorter list, so I'm not discouraging it any more.

map and grep are useful tools in Perl. To understand them needs only the ability to follow some logic, but to use them requires an understanding of lists. You should read this post on lists to understand how lists work in Perl. The main thing you should understand from that is how lists exist in memory, and how they are used with relation to arguments to functions.

map and grep are actually operators, but they behave so much like functions that you'll find them in perlfunc. What they do is they take a list of arguments, like functions do, and convert that list into another list. map transforms the input list, and grep filters the input list.

Remember that lists with zero or one item in them are still lists.

Understanding map

To understand map you need to be able to understand the following:

y = f(x)

If you don't, let's briefly explain it. It means that if you do something (f) to a value (x) you will get another value (y). You've probably seen this before in graphs - the straight line formula is y = mx + c; we can see that all of mx + c is a function using x (m and c are constants), meaning that if you feed an x coordinate in, you will get a y coordinate back. We say that y is a function of x.

In the case of map, x will always be a scalar, because of how lists in Perl are lists of scalars. y could be more than one value, this being how your output list can be longer than your input list. f is provided to map.

map is therefore a transformation. Its purpose is to consistently transform your input list into an output list, by providing each item of your input list to your f, and collecting the ys that you get out of it.

map turns this:

map { f } x, x₁, x₂, x₃, x₄...

into this:

f(x), f(x₁), f(x₂), f(x₃), f(x₄)...

A simple example is mathematical: even numbers. The nth even number is 2n—you learned this in school. If you didn't go to school then you'd better catch up.

That means, for y to be the xth even number, our f is 2x.

f(x) = 2x
f(1) = 2×1 = 2
f(2) = 2×2 = 4
...

That means that, given the numbers 1 to 10 we can find the first 10 even numbers that are greater than 0. If we provide our f to map, and the list 1 .. 10, we should get that.

Defining f(x) for map is a simple case of using the special variable $_ in place of x. For now, we will define our f using {}.

print join ', ', map { $_ * 2 } 1 .. 10
2, 4, 6, 8, 10, 12, 14, 16, 18, 20

This is a good example of what we were learning about lists in the previous post. The join operator joins a list of values using the specified string, in this case a comma and a space. There is no requirement in Perl that this list be either an array or a hard-coded, comma-separated list: it can be anything that, in list context, returns a list¹. That could be a function, or another operator like map or grep.

This example serves to explain in simple terms how some functionality can be routinely applied to a set of values to return a new set of values. There is no requirement that y is different from x after the execution of f(x). For example, you might wish to search a dictionary for a set of words, but the words themselves might be suffering from grammar, and hence have different letter cases. The dictionary itself would be entirely in lowercase, and hence you would want a lowercase edition of your set of words.

my @lc_words = map { lc } @words

In this case, the f() that we're applying to all members of the array is the operator lc, which operates on $_ by default and returns a lowercase version of its operand. If the input word happened to be already lowercase, the output will equal the input. No matter.

List Return Values

If you recall, lists are lists, in Perl. Arrays and hashes are things constructed from lists, and both of these are treated as lists when used where a list is expected. You can see this principle in action in these examples: our first example used the range operator .. to construct the list of integers between 1 and 10, and fed that to map. The second example used the array @words, which we assumed to exist for the example, as the input list.

And then the output list was used in the first example as the argument list to join, and in the second example it was used as the constructor for a new array.

So far we have been returning only one value, but the f we provide to map can return any number values because it is evaluated in list context. By this mechanism you can turn an array into a hash, which is probably the most common use of that.

my @words = qw(cat dog cow dog monkey);
my %words = map { $_ => 1 } @words;

The map block here returns the input word, $_, and 1. Remember that in this post we learned that => is semantically equivalent to , except with the quoting of the word to its left. Remember also that in the same post we learned that a hash is constructed from any even-sized list (or an odd-sized list, with a warning). So this block is returning two items for every one we put in, doubling the length of the list.

The use of the fat comma in a construct such as this is idiomatic: since it has no actual effect on the output of the map block it is really only there to hearken to the usual construction of a hash, which is a list that uses => a lot. The image conjured is of taking each element of @words and turning it into word => 1, joining it all together until we have an even-sized list thus:

my %words = (
    cat => 1,
    dog => 1,
    cow => 1,
    dog => 1,
    monkey => 1
);

Knowing the intermediate steps to get to that may help:

my %words = map { $_ => 1 } qw(cat dog cow dog monkey);
my %words = map { $_, 1 } qw(cat dog cow dog monkey);
my %words = ('cat', 1, 'dog', 1, 'cow', 1, 'dog', 1, 'monkey', 1);

Of course, in a true application, you would not set @words in this manner in the first place. @words will be computed, perhaps from a file passed in by name on the command line, which you have processed and turned into this array.

A little knowledge about hash construction is relevant here to know the actual purpose of this map block. If you understand that a repeated key in the constructing list is simply overwritten by the latest occurrence of that key, you will see that this has the effect of making the input list unique. Although the output list is exactly double the length of the input list, that list is immediately used to construct a hash. The hash itself then undertakes its usual behaviour, and in this example the repetition of "dog" in the original array is overwritten by its last appearance by the construction of that hash - which is fine in this instance because all the keys are associated with the value 1. Thus any repeated word is not counted.

You can also skip over part of the list as well. Perl makes no assumptions on the size of your returned list when running map. You can use this to omit elements from your output list while doubling the rest.

To do that, simply test $_ and return an empty list if you don't want the element:

map { test $_ ? ($_ => 1) : () } @things

You might return the empty list if the test returns true, for example if you are trying to transform elements you have not already transformed and remove those you have:

my %to_cache = map { in_cache $_ ? () : ($_ => transform $_) } @stuff;

For some test in_cache, which returns true if its argument is in the cache, and some function transform, which returns a transformed version of its argument, you can thus construct a hash of transformed stuff ready for caching. Of course, you don't have to double the list; you can use this simply to filter it. But if you're not going to transform $_ then it is more sensible to use grep

When Not To Use map

Now that you have a new toy you might be tempted to use it. It is simple nature. Well as with every tool, it is unwise to use it outside its purpose. There are four principles that you should follow when playing with map:

The Principle of Transformation. You can use map to test each item and conditionally return the empty list, thus shrinking the size of the list. But unless your map block transforms $_ in some other way (by returning an altered version, or a longer list), what you have actually got is grep. In some cases it may be easier to read if you use both map and grep, using the output of grep as the input to map.

The Principle of Immutability. You should never transform the input list. That seems contradictory, but by this I mean never alter $_ directly; you should always return an altered copy. It is not, in all cases, possible to change $_. Only when the input list is an array is it possible to overwrite $_ for all values you get from the list. When it is a list constructed by some other means, even a hash, the input values are often immutable, which means you can't change them so don't try. Always return a copy of $_, with transformations applied to the copy.

The Principle of Brevity. If your transformation is quite long-winded you should either a) put it in a sub or b) use a for loop and push onto another array. map is a construct of brevity; this should be honoured.

The Principle of Containment. If you need to do anything else while iterating over your list other than transforming x into y then you should use a for loop for that. Your map block should have no side effects, which means when the map has finished executing, everything should be the same as it was when it started, except now there is a new list in memory.

The general principle is that you are getting back from map a new list, and leaving the old one alone. Don't run map without using the result!

Understanding grep

grep is an operator that has many similarities to map. It takes an input list, and returns another list. You provide some function f, akin to that described for map earlier, and it is run with each successive item from the list.

The difference is simply that grep returns a list of equal size or shorter than the input list.

The f that you give to grep is a filter, not a transformation. It tests each $_ and returns trueness or falseness:

f(x) ∈ (1,0)

We are not looking in this case to return a new value, y, from our function, but rather any value that is true or not true. Reasonably one could point out that if you are outputting a truth value you are running a transformation. You are: but the result of grep itself is part of the input list. The result of map is the output list.

grep can be written in terms of map:

@filtered = grep { test } @things;
@filtered = map { test ? $_ : () } @things;

A common adage is "spelt grep, pronounced filter". That's a mnemonic, of course; it's pronounced grep.

A traditional use of grep is to find the defined items in an array:

grep { defined } @array

It is not uncommon to have produced an array or list with undefined elements, especially if that list came from a map block that could return undef in some cases. Another common idiom is to map and grep at the same time:

my @phone_numbers = map { $_->{phone} } grep { $_->{phone} } @people;

Here @people is assumed to be an array of hashrefs representing people. The grep filters out those for whom the key 'phone' returns a true value, and the map then returns the actual value.

We can thus draw this distinction between the two: map collects output values, and grep collects input values. You can see this is the case: the map block is assuming that the input list is a list of hashrefs, and extracts strings from them. We collect the list of things that map outputs. For that to work, it means that the output of grep has to be hashrefs; and since the input to grep is hashrefs but the grep block only returns a value from that hashref, we see that the output of grep is just a subset of the input.

When Not To Use grep

grep doesn't have the principle of transformation because you're not supposed to transform the list.

The Principle of Immutability - grep should not alter the input values; it should merely test them.

The Principle of Containment - nothing in the grep block should affect anything outside of the expression. Everything should be the same after the grep has finished running as it was beforehand, except now there is a new list in memory of (copies of) some or all of the input list.

The Principle of Brevity - if you need to do a long-winded process to find out whether or not you want to keep a particular $_, take it somewhere else, because it doesn't belong here. Make a sub and put it in that.

Similar Things

There is a simple concept here: You take a list, you apply a function to each element, and you get another list. map is the simplest example of this because the list you create is exactly the list that map returns. grep is arguably more complex because the list you create with this process (a list of true and false values) is itself used to alter the input list, so you get back a different list from the one you build.

sort

sort is another list operator. Like grep, the list it returns comprises elements of the input list. Unlike grep, the list it returns is the same length as the input list. Clearly, it sorts the input list, and returns a new one with the items in order.

An important thing to have already realised by now is that if you are sorting you have to do two variables at once, instead of one. Perl sorts, haha, this out for you by providing you $a and $b instead of $_. Your task is to tell Perl whether $a is greater than, less than, or equal to $b, by returning -1, 0 or 1 respectively from the block.

The operators <=> for numbers and cmp for strings do this easily for you, but this generalisation of the sorting process allows for you to run $a and $b through any algorithm in order to sort the list.

my @sorted = sort { length $a <=> length $b } @strings;

for

The for loop is much more than it is in other languages, especially when used as a statement modifier. That's when you put it at the end instead of putting it first and using braces. for is the operator you should use when you want to modify the array itself, rather than create a modified copy. Note you can only fully modify arrays and only the values of hashes.

$_++ for @array; # increment every value
$_++ for values %hash; # same
$_ .= "-old" for @filename_backups; # append "-old" to each filename

map is equivalent to doing a for loop over the old array, and pushing things onto a new array:

my @output;
push @output, $_ * 2 for @input;

Presumably map is more efficient, but part of Perl is that you have the option of being more expressive with your code, which is to say that you should write what you mean. If you mean to perform a map operation, use map; otherwise when someone else reads your code (i.e. you, next month) there will be no head-scratching wondering why you used a for loop.

Other functions in List::Util, List::MoreUtils and List::UtilsBy serve to run a function across all items of a list in the same general way, and return either a list or a single value.

List::Util

first - Find the first item in a list; usually used to find any item that matches the criterion. Often therefore used to find whether any item matches the criterion.
my $caps_key = first { uc $_ eq $_ } keys %hash;
reduce - Collapse a list into one value, by repeatedly applying the function. This is called reduction, as in "map-reduce". Note the use of $a and $b, like sort: we are trying to reduce a list into one value so we have to do two values at once, rather than one.
my $sum = reduce { $a + $b } @numbers;

List::MoreUtils

This module has many more than List::Util so I won't list them all.

any, all, notall, none - Test the whole list. any is similar to first from List::Util; the difference being that it will return a true value if your search succeeds, whereas first could return a false value if you're looking for false values. These are essentially grep, except they will stop looking if they find the answer - grep will process the whole list in all cases.
if (any { defined } @things) {} # if any is defined
if (all { defined } @things) {} # if all are defined;
if (notall { defined } @things) {} # if not all are defined
if (none { defined } @things) {} # if no thing is defined
pairwise - This only works on arrays, but it takes one value from the first array and one value from the second, and gives them both to your function. Then it collects the outputs. It's like map, except it does two things at a time.
my %hash = pairwise { $a => $b } @keys, @values;
my @totals = pairwise { $a * $b } @prices, @quantities

List::UtilsBy

This is a convenience module for all those cases where normally you'd do the same thing to both $a and $b. I will stick to a couple of examples.

sort_by - Applies the procedure to both $a and $b and compares the results as strings (nsort_by for numbers). Saves typing.
my @sorted = nsort_by { length } @strings;
my @sorted = nsort_by { $_->mother->mother->mother->age } @people;
uniq_by - Makes the list unique based on the function.
my @unique_by_colour = uniq_by { $_->colour } @fruit;

In Conclusion

I hope this has given you, the newcomer to Perl, a good idea of what map and grep do. I also hope it has given you an insight into the general concept of applying a function to a sequence of values, and doing something with the result. It is an operation that is much more common than you'd think, even in real life.

It is the basis of mail merge, for example, where you take a template and a list of names and you put each name in the template and, with the resulting list of letters, print and send them. That's map.

It is the basis of searching, where you have a quantity of items and you're looking for all of those that match a certain criterion. You apply your criterion to each and keep those that match. That's grep.

It is the basis of sorting things by height, or alphabetically, or by the number of times they've won the World Cup: you take the list of things, and two at a time you find out how many times they've won the World Cup and you sort based on that. That's sort (or sort_by or nsort_by).

Examining or modifying a list is an awfully common operation. One thing I cannot teach you, however, is how to recognise when you should use one. Feel free to ask!

¹ All right smart arse. Everything in list context is a list, even if it is only one scalar, because that's just a list with one item in it. The point is that this statement is true regardless of the length of the returned list.

Pod::Cats

snapsvg

2011-10-01

Understanding and Using map and grep

No comments:

Post a Comment