snapsvg

2011-06-21

It's as if they thought it through.


I wonder why we have separate arrays and hashes in Perl. Other languages don't do it. After all, the principal difference between an array and a hash is that an array references its items by ordinal position and hashes use strings to name them. A hash could surely be conflated with an array simply by using integers as the string keys - especially since Perl can use strings and integers interchangeably.

We would have to make changes, but all it would really need is a way of detecting when the user intended to use an ordinal array and when the user intended to use an associative array. This should be easy enough: all we need to do is check whether all the keys are sequential and start at zero, and we know it's an ordinal array. To accommodate the fact this may be a coincidence, we can create a second set of functions so that the user can specify that even though the array appears to be ordinal it is actually just that the keys happen to be numeric and happen to be in order starting from zero. We'd also have to change the way sort works, in fact creating two functions: one function that orders an ordinal array and re-creates the keys when the values are in their new positions, and one function that, having sorted the array by value, makes sure the keys still refer to the same value. Of course, sorting integers as strings returns a different order from sorting integers as integers ('10' is alphabetically between '1' and '2'), so we would need a keys function that knew whether to return strings or integers so that we know, when sorting the list of keys, whether to sort them as strings or integers.

Splicing would also require two functions, of course. It doesn't really make sense to splice a nominal array because there is no inherent order to it; but since a fundamental tenet of structural programming is that if you make two things the same, you must treat them the same, then we have to make it make sense. Since splicing is all about removing things by their position (it's very easy to remove a key from a nominal array: just remove it), we need to give associative arrays an internal order. Or possibly just whinge when we use a thing that doesn't look like an ordinal array in splice, thereby affirming a difference between ordinal and associative arrays that we are desperately trying to pretend doesn't exist.

We'd also have to determine what to do when, for example, someone creates an entry in an array by giving it an ordinal position that doesn't exist. Do we create an array of suitable length and fill it with bits of emptiness in order to maintain the illusion that this array is ordinal? Or do we create it as an associative array with a single numerical key? What happens if someone creates key [2], then key [0], then key [1]? Do we sneakily go back and pretend we knew they meant this to be an ordinal array from the beginning, or do we treat this as an associative array and annoy the hell out of the user, who expected an ordinal array with three entries?

And then finally an extra function is needed so that we can refer to elements by their ordinal position even if it's not a real ordinal position: after all, -1 is a valid associative array key but in an ordinal array it means "the last element" like it does in common C functions like substr, so we'd have to create a way of referencing the array backwards without accidentally confusing a negative index with a string key.

Oh yes. That's why.

Further reading

Here's a Wikipedia link: http://en.wikipedia.org/wiki/Waterbed_theory — if anyone can find TimToady's paper on this on the interwebs, I'd like to link to that from here too, so I'd be grateful for that.

15 comments:

  1. The reason that there are both arrays and hashes is that arrays are for ordered data, and hashes are for un-ordered data.

    The last element of an ordered array makes sense. The last element for an associative array doesn't.

    ReplyDelete
  2. I kind of took this as a bit of a swipe at PHP which doesn't distinguish between arrays and hashes syntactically (and it turns out doesn't actually have arrays, as a consequence).

    ReplyDelete
  3. Perl differentiates between arrays and hashes because AWK did. That thinking you described was done by Aho, Weinberger, and Kernighan.

    ReplyDelete
  4. Yes, PHP conflates the two to its detriment (or at least to my detriment).

    Are there any other languages that have made this choice?

    ReplyDelete
  5. > Other languages don't do it.
    Python does it

    ReplyDelete
  6. What other languages besides PHP? Off the top of my head Python, Ruby, Java, C++, C# don't do it this way.

    ReplyDelete
  7. tcl does this. I'm not aware of it having caused us serious problems in (heavy) production use. I'd be interested to know what problems it could cause us. Presumably performance issues on arrays expected to be ordinal with serial access.

    ReplyDelete
  8. >> Other languages don't do it.
    > Python does it

    This I did not know. But I intended that to mean "there are languages that don't do it". I could have worded that better :)

    ReplyDelete
  9. Lua also does this; tables are the only (built-in) data structure and handle arrays, hashes, and records. For example, employee.name is just syntactic sugar for employee["name"].

    ReplyDelete
  10. I had never come across hashes before I learned Perl in the 90s. Before then I'd used many languages that had arrays but no hashes - BASIC (including Visual Basic which added dicts later), ALGOL, FORTRAN, Pascal, C and probably many others.

    ReplyDelete
  11. Lua conflates the two. A table in Lua has a hash part and an array part. Values may be stored in either and Lua may switch representations based on a density calculation.

    It works, but as a programmer I'd prefer to handle the data structure myself.

    ReplyDelete
  12. Actually most of the languages makes a (big) difference, because although they are named similarly, arrays with numeric indices and associative arrays are completely different data structures and have different performance and usage patterns.

    ReplyDelete
  13. Ruby certainly has separate Hash and Array classes. It's one of the things I loved about it and quickly grew to hate even more in PHP.

    ReplyDelete
  14. Well, as I'm sure a lot of people pointed out, Arrays and Hashes are completely different data structures. Perl, being a language of it's era, needed to have both for a simple reason: arrays were always fast, but hashes were very CPU intensive back in the day when computers ran in the low MHz range, not GHz. The performance difference between looping/accessing an array with 10000 entries and a hash with 10000 entries could be *extremely* big.

    Nowadays we can ignore all that with GBs of memory and multi-core GHz processors. But when looking for the rationale behind a specific language structure, specially in a language as old as Perl, you have to look way back.

    ReplyDelete
  15. By other languages, you mean just PHP right? Of course PHP would fuck that up

    ReplyDelete