snapsvg

2011-06-21

It's as if they thought it through.


I wonder why we have separate arrays and hashes in Perl. Other languages don't do it. After all, the principal difference between an array and a hash is that an array references its items by ordinal position and hashes use strings to name them. A hash could surely be conflated with an array simply by using integers as the string keys - especially since Perl can use strings and integers interchangeably.

We would have to make changes, but all it would really need is a way of detecting when the user intended to use an ordinal array and when the user intended to use an associative array. This should be easy enough: all we need to do is check whether all the keys are sequential and start at zero, and we know it's an ordinal array. To accommodate the fact this may be a coincidence, we can create a second set of functions so that the user can specify that even though the array appears to be ordinal it is actually just that the keys happen to be numeric and happen to be in order starting from zero. We'd also have to change the way sort works, in fact creating two functions: one function that orders an ordinal array and re-creates the keys when the values are in their new positions, and one function that, having sorted the array by value, makes sure the keys still refer to the same value. Of course, sorting integers as strings returns a different order from sorting integers as integers ('10' is alphabetically between '1' and '2'), so we would need a keys function that knew whether to return strings or integers so that we know, when sorting the list of keys, whether to sort them as strings or integers.

Splicing would also require two functions, of course. It doesn't really make sense to splice a nominal array because there is no inherent order to it; but since a fundamental tenet of structural programming is that if you make two things the same, you must treat them the same, then we have to make it make sense. Since splicing is all about removing things by their position (it's very easy to remove a key from a nominal array: just remove it), we need to give associative arrays an internal order. Or possibly just whinge when we use a thing that doesn't look like an ordinal array in splice, thereby affirming a difference between ordinal and associative arrays that we are desperately trying to pretend doesn't exist.

We'd also have to determine what to do when, for example, someone creates an entry in an array by giving it an ordinal position that doesn't exist. Do we create an array of suitable length and fill it with bits of emptiness in order to maintain the illusion that this array is ordinal? Or do we create it as an associative array with a single numerical key? What happens if someone creates key [2], then key [0], then key [1]? Do we sneakily go back and pretend we knew they meant this to be an ordinal array from the beginning, or do we treat this as an associative array and annoy the hell out of the user, who expected an ordinal array with three entries?

And then finally an extra function is needed so that we can refer to elements by their ordinal position even if it's not a real ordinal position: after all, -1 is a valid associative array key but in an ordinal array it means "the last element" like it does in common C functions like substr, so we'd have to create a way of referencing the array backwards without accidentally confusing a negative index with a string key.

Oh yes. That's why.

Further reading

Here's a Wikipedia link: http://en.wikipedia.org/wiki/Waterbed_theory — if anyone can find TimToady's paper on this on the interwebs, I'd like to link to that from here too, so I'd be grateful for that.

2011-06-11

The Anatomy of Types

A chief confusion of people new to Perl is the apparently disconnected syntax used to refer to variables. Of particular consternation is the syntax used for accessing arrays and hashes: especially slices thereof. This seems to be because the creation of and use of arrays and hashes is taught at a simpler level than the level of understanding required to actually see how they work.

Here's a table that shows some variables, as they are used, and how they divide up. It also shows the number of items each expression will return.

Expression
Sigil Identifier Subscript Number of items
$ scalar 1
@ array Many
% hash Many pairs
$ array [0] 1
$ hash {key} 1
@ array [0,1,2] Many
@ hash {'key1', 'key2'} Many
% array [0,1,2] Many pairs
% hash {'key1', 'key2'} Many pairs

1. The Sigil

$

$ refers to a scalar. A scalar is a single, atomic item. Its contents cannot be divided without applying further processing to the scalar itself. Whenever an expression begins with a $, it is a scalar, and there is one item.

@

@ refers to more than one scalar, in some order. Without a subscript, it refers to an array; otherwise it simply refers to a list. Saying it is "in order" means that we can identify any item within the list by its numerical position; it also means that there is a first, second, nth and last element in it.

%

% refers to a hash. A hash is also a collection of scalars, but there is no order to them. Rather than each scalar being in a known position in a list, instead half of the scalars are referred to by the other half. The "other half" are all strings and are called keys. If the % is used you know that you are referring to a set of items that alternate between keys and values. Having no order, it is therefore meaningless to talk about the first, second, nth, or last element of the hash.

Apply these rules to the table above. See that every expression whose sigil is a $ gives us 1 item; every expression whose sigil is @ gives us many (zero or more) items; and every expression whose sigil is % gives us many paired items.

2. The identifier

The identifier is the name of the variable. Without its sigil it is fairly meaningless because it could refer to anything1. With its sigil, suddenly we know what form of variable we are talking about - scalar, array or hash. And with a sigil and a subscript, we know yet again that we are talking about one or many scalars, and which type of variable the identifier refers to.

Here's the tricky part. Each identifier can refer to all types. It is perfectly legitimate (albeit often quite a bad idea) to have all three of $var, @var and %var in the same scope at the same time.

This is allowable because it is impossible for there to be ambiguity. There is no crossover in either of the tables below, either within themselves or between them. A combination of sigil and subscript can tell us exactly which type of variable the identifier refers to, and therefore Perl simply allows for all types to be under a single name. Thus:

Expression Looks for
$var $var
@var @var
%var %var
$var[0] @var
$var{key} %var
@var[0,1] @var
@var{'key1', 'key2'} %var
%var[0,1] @var
%var{'key1, 'key2} %var

3. The Subscript

When you have an aggregate data structure (array or hash) you know that you are talking about possibly multiple scalars at once. Arrays are accessed by selecting an item by its position, and hashes are accessed by using the string key we associated with the scalar.

Armed with the knowledge about what the sigil means we can consult the table above to pull apart the familiar way of accessing arrays and hashes to get an item out:

my $first_item = $things[0];

We know $first_item is a scalar because it has a $. We know $things[0] is a scalar because it has a $.

my $first_name = $person{first_name};

We know $first_name is a scalar because it has a $. We know $person{first_name} is a scalar because it has a $.

Assigning a scalar to a scalar makes perfect sense. Although it appears that the sigil has changed on the array and hash, what we actually see is that the identifier of the array is 'array'; the identifier of the hash is 'hash'; and the choice of sigil is effected by how much of the data structure we want.

Array and Hash Slices

Arrays and hashes are aggregate data types, which means they contain multiple scalars. It is reasonable therefore to expect we can request more than one item from them at the same time.

Since one item is referred to with the $ sigil, and we used a $ to access a single item from the aggregate, then we can simply use @ to refer to multiple items from the same aggregate.

my @both_names = @person{'first_name', 'last_name'};

Observe that we can access two values from the hash by supplying both keys as a list in the subscript and using @ instead of $. This of course applies to any quantity of keys, and also applies to arrays

my @relevant_things = @things[0,3,5];

This action of taking several selected elements from an aggregate is called slicing.

A warning about hash slices

Remember to use the @ instead of the $ when taking a hash slice. The syntax of putting a list in the subscript to get a scalar refers to a long-deprecated feature that you never want to use intentionally.

Key-Value/Index-Value Slices

We've seen how you can use $ and a subscript to get a single scalar, we've seen how you can use @ and a subscript to get a list of values. You can also (as of perl 5.20) use % and a subscript to get an index-value or key-value pair.

my %part = %whole{'relevant', 'parts', 'only'};
my %index_value = %things[0,3,5];

This kind of slice returns a pair for each thing you're slicing; both the key or index as well as the value.

Working Backwards

We can work backwards from a line of code to know what we are talking about. Perl has to do this, because we change the sigil depending on how many things we're talking about.

To determine where a scalar comes from, we need to look at the subscript. Arrays and hashes don't tend to have names that immediately make it obvious that they are arrays or hashes. But subscripts have syntax that resolves this cleanly.

An identifier followed by brackets - [ ] - refers to an array. An identifier followed by braces - { } - refers to a hash. An identifier followed by no subscript refers to the exact type the sigil refers to. The sigil refers to the type of the returned value. The identifier, coupled with the subscript, tells us what type of data structure the value comes from.

Given the identifier 'var', the following table helps explain where the data comes from in various situations:

Sigil Subscript Looks for
$ $var
@ @var
% %var
$ [ ] @var
$ { } %var
@ [ ] @var
@ { } %var
% [ ] @var
% { } %var

This confirms our rule: that without a subscript, the sigil determines the variable we seek; otherwise, the subscript does.

This can be rationalised simply. If we use a subscript, we are requesting only a part of the aggregate variable in question; i.e. a selection of one or several of the scalar values it contains. This means that, if a subscript is present, we can use it to determine where the data should come from. If we don't use a subscript, it is therefore reasonable we actually intended to refer to the aggregate itself - and this is indeed the case. But in all cases, the sigil still determines the type of data we get back, be it a scalar or a list or a paired list.

Scalars are not aggregate, so there is never a subscript that will translate into a scalar. That's why '$var' appears only once in the table.

Further reading

So far we have talked about lexical variables (think "braces"). There are two other types of variable: package and global. Package variables are accessed by their fully-qualified name ($Package::var) from other packages, or the same as above from within the package. Global variables - other than the built-in set - should be avoided.

Read Symbol Tables in perlmod for information on package variables. And you could do worse than read about typeglobs, a special internal data type for referring to the entire set of types available in the symbol table.

1 Actually, it can't refer to anything at all. An identifier without a sigil is usually interpreted as subroutine call, but can result in ambiguity that causes strictures to complain about barewords. Nevertheless, a (named) subroutine is actually a package variable, and we are talking about lexicals here.