Introduction to Perl

Variable types in Perl - scalars, arrays, hashes


In case I haven't mentioned it before, read Learning Perl [SCHW97].

In Perl, fundamentally, there are scalars, arrays, hashes (associative arrays), and code. These are accessed through the following symbols:
$foo scalar named foo
@foo array of scalars named foo
%foo hash of scalars named foo
&foo subroutine named foo

Okay, technically, there are a couple more, typeglobs and IO handles, but we can discuss those later:
*foo type glob for foo - contains all of the above and IO references

One of the things that I really like in Perl is that you do not need to predeclare your variables. We'll really take advantage of this later with complex data structures. But for now, just know that you can use a variable without any declaration, and Perl will not complain unless you ask it to (with use strict or the -w switch). (Unlike C and C++, where you have to declare your variables ahead of time.)

SCALARS

Scalars are basically integers, floats, strings, or references (pointers). Well, actually in Perl, there aren't any integers, except through one of the standard modules, but don't worry about that. I'll try to discuss references later. So for now, I'm just going to mention floats and strings.

Quite simply, a float is just a number, such as

	$answerLUE = 42;
	$pi        = 3.14159265;
	$light_v   = 3e10;
(That last one is scientific notation for 30,000,000,000 - don't use commas for perl numbers though.)
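Since commas are out, Perl lets you group digits with underscores instead. Here is a minimal sketch; it declares its variables with my under use strict, which is optional but catches typos, and the name $light_v2 is only for illustration:

```perl
use strict;
use warnings;

my $light_v  = 3e10;             # scientific notation
my $light_v2 = 30_000_000_000;   # same value; underscores in numeric literals are ignored

print "same value\n" if $light_v == $light_v2;
```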

A string is a series of characters, often input or output to/from a program, or a word or something. You can even read in an entire file into a single string. One of the nice things in Perl is that you don't need to predeclare the size of your string. It can be arbitrarily huge, and you are really only limited by your hardware and operating system. You can designate strings in several ways:

	$director     = "Cameron Crowe";
	$help         = "Try executing:\n\t$script $filename\n";
	$codeForLater = '$help = "Try executing:\n\t$script $filename\n";';
The first is the easiest to understand. I'm simply assigning the string Cameron Crowe to the scalar variable $director. To designate the string, you enclose it in quotes (' or "). The second line is a little more complex, and demonstrates what I think are the two most common escape sequences: \n (newline) and \t (tab). These are really simple though. Also, note that I have 2 variables, $script and $filename, inside the double quotes ("). One of the really convenient things in Perl is that you can expand variables inside double quotes. This makes it much easier to write out reports. Note that C/C++ does not have this capability. You either need to printf/sprintf or separate everything with a << operator for cout. (csh, sh, and probably python all have this capability too - it's pretty common in languages where compilation is not an explicit step.)

Finally, in the third example, note that I use single quotes ('). This explicitly means "do not expand variables or escape sequences." So the string itself has the $, \n, and \t inside it rather than the variable expansions, newlines, or tabs.

To me, the single quotes mean literal, and what you see is what you get. In general, I prefer to use single quotes when I can, only using double quotes if I need to expand something. So in the above example, I would usually have used single quotes for $director.
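To see the difference for yourself, here is a small sketch that builds the same text both ways and prints the results. The values given to $script and $filename are made up for the example:

```perl
use strict;
use warnings;

# Hypothetical values, only for illustration.
my $script   = 'render.pl';
my $filename = 'shot01.rib';

my $expanded = "Try executing:\n\t$script $filename\n";   # variables and escapes are expanded
my $literal  = 'Try executing:\n\t$script $filename\n';   # kept exactly as typed

print $expanded;
print $literal, "\n";
```

The first print spreads over two lines with the real file names filled in; the second prints one line that still contains the $ signs and backslashes.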

In perl though, there is another way to express quotes. You may see this a fair amount from advanced perl programmers. Sometimes, it makes it a little hard to read, especially for beginners. I know it confused me for a while. You can use q() to indicate single quotes and qq() to indicate double quotes. This is useful if you expect to be using the " or ' symbol a lot inside your string. Though if you have a long string, I would suggest learning about HERE documents.

	$director     = qq(Cameron Crowe);
	$help         = qq(Try executing:\n\t$script $filename\n);
	$codeForLater = q($help = "Try executing:\n\t$script $filename\n";);
Note that though you use the () symbols, you probably shouldn't think of this as a subroutine call. That's an exception to the otherwise golden rules. Actually, even those () characters are up for debate, since you can pick other delimiters. Read Programming Perl [WALL00] for more details on that. I think using the qq/q notation for strings generally makes things harder to read, without too much payoff.
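For the long strings mentioned above, a HERE document might look something like the following sketch. The terminator names END_HELP and END_CODE are arbitrary, and the values of $script and $filename are made up:

```perl
use strict;
use warnings;

my $script   = 'render.pl';    # hypothetical values
my $filename = 'shot01.rib';

# A double-quoted terminator expands variables and escapes, like "" or qq().
my $help = <<"END_HELP";
Usage for "$script":
Run: $script $filename
END_HELP

# A single-quoted terminator leaves everything literal, like '' or q().
my $codeForLater = <<'END_CODE';
$help = "Try executing:\n\t$script $filename\n";
END_CODE

print $help, $codeForLater;
```

Note that quote characters inside the text need no escaping, which is most of the appeal.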

I would like to introduce the concept of concatenation now. It's pretty simple. This is how you build strings with other strings:

	$firstName = 'Cameron';
	$lastName = 'Crowe';
	$wholeName = $firstName . ' ' . $lastName;
	# $wholeName now holds 'Cameron Crowe'
That is, the . operator concatenates 2 strings together. It is worth noting that concatenating a string to a number will automatically convert the number to a string, so it is easy to say:
	$tmpFileCounter = 83292;
	$fileName = '/tmp/melScript';
	$fileName .= '.' . $tmpFileCounter;
	# $fileName now holds '/tmp/melScript.83292'
That is, I just concatenated the counter to the end of the fileName. It is convenient to do this in a simple operator. Note that I also slipped in another version of the . operator: .= - this just means concatenate to the end of the string on the left.

Incidentally, I really like the writer/director Cameron Crowe. Every movie he writes is golden. Go and watch Fast Times at Ridgemont High, Say Anything, Singles, Jerry Maguire, and Almost Famous - all great movies.

If you used C, you probably would have had to use the sprintf routine. If you used C++, you have a string class, and it probably overloaded an operator +, so you get similar functionality to the above. The trouble in C++ is that everyone seems to think they are a genius who can bring something new to the idea of strings, so everyone seems to have their own string class. So wherever you go, there's probably another implementation of strings in some proprietary library and a different set of string operators.

Incidentally, it also impresses me how many sets of software think they have something new to offer in representing a color (a set of 3 numbers) or a point (a set of 3 or 4 numbers, depending on if you are into the homogeneous thing or not).

I'll wait until the discussion about arbitrary structures before I talk about references/pointers.

ARRAYS/LISTS - Intro

An array or list is a collection of scalars in a certain order. It is worth noting that the order is preserved and can be accessed. However, it can only be accessed numerically. For example, to define an array, we can say:

	# One scalar first
	$specialFrame = 5000;

	# arrays
	@operations = ('generateRIB', 'renderRIB', 'cleanupRIB');
	@renderFramesA = ( 1, 2, 3, 10, 15, 33, 105);
	@renderFramesB = (108, 110, 111, 112);
	@renderFramesAll = (@renderFramesA, @renderFramesB, $specialFrame);
	@nextFrames = ( (5001, 5002), (5003, 5004), (5005, 5006));
	@renderFramesCopy = @renderFramesAll;
All of the above lines will define an array/list of scalars. The first couple are pretty obvious. But the fourth one may confuse people. Some people new to Perl read it as making @renderFramesAll an array with 3 elements: two arrays and a scalar. This would be wrong. It actually concatenates the two arrays together, tacks the scalar onto the end, and puts the result into @renderFramesAll. @renderFramesAll will actually contain the values: (1, 2, 3, 10, 15, 33, 105, 108, 110, 111, 112, 5000). It is actually pretty easy to tack on more values. Similarly, in the next line, many beginners think this is the way to make an array of arrays. And again, this is wrong. @nextFrames actually gets the values (5001, 5002, 5003, 5004, 5005, 5006) as a flat list/array.

Also, it is simple to assign/copy one array to the other. Note that when you do this (like the last line above), it actually copies the elements from one array to the other, so you should be careful if you have really big lists. Then again, I sling around lists of 13,000 names without much thought, so it's not that big a deal.
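If you want to convince yourself of both points (the flattening and the copying), a quick sketch along these lines should do it. It reuses the arrays from the example above, declared with my:

```perl
use strict;
use warnings;

my $specialFrame  = 5000;
my @renderFramesA = (1, 2, 3, 10, 15, 33, 105);
my @renderFramesB = (108, 110, 111, 112);

# The inner arrays are flattened into one list of 12 scalars.
my @renderFramesAll = (@renderFramesA, @renderFramesB, $specialFrame);
print scalar(@renderFramesAll), " frames: @renderFramesAll\n";

# Assignment copies the elements, so changing the copy leaves the original alone.
my @renderFramesCopy = @renderFramesAll;
$renderFramesCopy[0] = 999;
print "original first frame is still $renderFramesAll[0]\n";
```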

So if the above examples do not make an array of arrays, how do you make an array of arrays? Well, I'll get to that in a later page. In the meantime, just be aware that that isn't the way to do it. In Perl 4, there actually wasn't a way to get an array of arrays, so it was very natural to know that the above were just flat lists, and that this was a convenient way to concatenate lists together. When Perl 5 came up with a way to get arbitrary data structures (like arrays of arrays), the above syntax was already taken so they had to make another way. I'm just used to it now. I'm not going to defend the way they implemented it, but the quicker that you get over it, the quicker you can get to doing something useful with it.

Now in the scalars section, I talked about q and qq as alternate ways of defining a string. But I didn't like them much for that. This was really to transition to the array version with qw(quoted word). Now, with arrays, we can actually define a list of values separated by whitespace, and we can even omit the commas and quotes. For instance, the equivalent to @operations above could be:

	@operations = qw(generateRIB renderRIB cleanupRIB);
Sometimes, that will just be more convenient. Now again, the qw notation isn't usually used that much by beginners, so it looks a little confusing, and you'll probably see it a lot in obfuscated Perl. I actually do like using this notation though just because it saves me the trouble of quotes and commas. For long lists, I think it actually tends to make code easier to read. You can use any mixture of whitespace (spaces, tabs, newlines) to separate tokens. Personal preference.
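As a quick check that the two notations really give the same list, here is a small sketch. The names @quoted and @words are just for this example:

```perl
use strict;
use warnings;

my @quoted = ('generateRIB', 'renderRIB', 'cleanupRIB');

# Any mix of spaces, tabs, and newlines separates the words.
my @words = qw(
    generateRIB
    renderRIB   cleanupRIB
);

print "same list\n" if "@quoted" eq "@words";
```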

Have I mentioned Learning Perl [SCHW97] yet? You should really read that book. It'll tell you more about arrays and slices and stuff. I don't use slices much, so I'm not going to discuss them.

ARRAYS/LISTS - Accessing Elements

Now, how do you access the elements in the array? Oh, there are so many ways, and that is one of the many wonderful things about Perl. First off, if we want to grab one of the scalars in our array, we can say:

	$firstOperation = $operations[0];
	$secondOperation = $operations[1];
	$thirdOperation = $operations[2];
That's pretty simple, isn't it? On the left, we see scalar values. On the right, we see we also use a $ to access @operations even though @operations is an array. This is because we are accessing a scalar element of the array, accessing in a scalar context. The indices of @operations start with 0, unless you do some really bizarre stuff that you shouldn't be doing in the first place. Also, something else to note is the square brackets []. This is how we access elements of arrays/lists. Square brackets are a pretty good indicator that we're dealing with something related to an array.
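A few related ways of poking at an array are worth knowing alongside plain indexing. This is just a sketch; the variable names $count, $lastIndex, and $lastOp are mine:

```perl
use strict;
use warnings;

my @operations = ('generateRIB', 'renderRIB', 'cleanupRIB');

my $count     = scalar(@operations);   # number of elements
my $lastIndex = $#operations;          # index of the last element
my $lastOp    = $operations[-1];       # negative indices count from the end

print "$count operations, last is $lastOp (index $lastIndex)\n";
```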

But you could do that in any language. Even C++. Heck, even in MEL. But in C/C++, you would often use arrays to create stacks, FIFOs, buffers, and stuff. Realizing that these are common uses of arrays, Perl offers more array access methods. The important ones are really push, pop, shift, unshift. For instance, if the operations were implemented as a FIFO buffer (that is First In, First Out), where you want one part of your program to add operations to a list and another part of the program to read out the operations in the same order they were added, you could use:

	@operations = ();	# empty list
	# Section to create operations list
	# Use push to add a new element to the end of the list
	push @operations, 'generateRIB';
	push @operations, 'renderRIB';
	push @operations, 'cleanupRIB';

	# Section to read an operation
	# Use shift to take an element off the beginning of the list.
	$currOp = shift @operations;

Or you could make a stack. Think of a stack of trays in a cafeteria. You can push them on one at a time, then remove them one at a time. Note that with this, the last tray you added is the first that you remove. When would you encounter this in production? Well, this is actually fundamental to most rendering, especially RenderMan.
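Side by side, the two patterns might look like the sketch below. The tray names and the @fifoOrder/@lifoOrder arrays are only there to record the order things come back out:

```perl
use strict;
use warnings;

# FIFO: add at the end with push, remove from the front with shift.
my @queue;
push @queue, 'generateRIB', 'renderRIB', 'cleanupRIB';
my @fifoOrder;
while (@queue) {
    push @fifoOrder, shift @queue;
}
print "FIFO: @fifoOrder\n";

# LIFO (stack): add at the end with push, remove from the same end with pop.
my @trays;
push @trays, 'tray1', 'tray2', 'tray3';
my @lifoOrder;
while (@trays) {
    push @lifoOrder, pop @trays;
}
print "LIFO: @lifoOrder\n";
```

The queue hands items back in the order they went in; the stack hands them back reversed.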

EX 2.4.1: A diversion - stacks, scene graphs, and RIBs

This is apparent if you read a RIB file. The RIB describes your scene. At some point, you'll probably see a lot of Rotates, Translates, Scales, and NuPatches. This describes the hierarchy of how the scene is put together. In Maya, you know this as parenting. Each of the operations like Rotate can be thought of as a "Transform." You can have layers of transforms, and eventually, deep down, there is a NuPatch. Now how do you describe this? Well, you need to indicate which transforms affect which NuPatches. But transforms could affect other transforms, so you will need to indicate a block of Begin and End for each transform. So in your RIB, you will see something like:

TransformBegin
  Scale 2 2 2
  TransformBegin
    Translate 1 2 3
    # Patch A
    NuPatch blah blah blah
  TransformEnd
  TransformBegin
    Translate 4 5 6
    # Patch B
    NuPatch blah blah blah
  TransformEnd
TransformEnd
This means that the Scale is applied to both patches, but the Translates are applied to only 1 patch each because they are inside the TransformBegin and TransformEnd blocks. But try to conceptualize how you would implement this. As you read this file, you want to create a scene graph, creating children as you go along. This is actually similar to the lunch trays. Each time you see a TransformBegin, you will want to push a transform onto your stack. Each time you see the NuPatch, you create a child (The NURBS Patch). Each time you see a TransformEnd, you pop a transform off the stack.

Suppose now you are writing a parser for RIB files. This is actually pretty common. I would guess that most people don't write a real formal parser, but just do quickie regular expression matching, and that's good enough in general. As you parse the RIB, you want to keep track of the transforms. I won't get into how you get the name attribute, but you may have code like:

	# Found a TransformBegin
	push @transformStack, $currentTransformName;
	# Operate on new transform

	# ...

	# Found a Translate, Rotate, or Scale
	# incorporate the transform into the current transform

	# ...

	# Found a NuPatch
	# Output the current transformation and the patch to the renderer

	# ...

	# Found a TransformEnd
	$currentTransformName = pop @transformStack;
	# Now, I'm back up at parent.
Now, you may be thinking "If I was going to build a renderer, it has heavy matrix crunching, so I would use C/C++." Well, first off, I would never choose C++ anymore, except for glue logic to other APIs where I'm forced to use it. But I'd also say it's not as far-fetched as you may think. For instance, my friends at Steamboat Software actually do use Perl to parse RIB files to pass information on to jig in their program, hpartojig (or is that hpar2jig?). Incidentally, you can apply the same principles to Inventor files, mostly with the { and } indicating the transform begin and end. That's probably a pretty long diversion just to describe the principle of Last-In, First-Out (LIFO, or stack).
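To make the stack idea concrete, here is a toy sketch of the quickie-regular-expression approach. Everything about it is an assumption for illustration: the input is a handful of made-up, heavily simplified RIB lines, the "transform" is just a string of the operations seen so far, and it borrows a hash (covered further down) to record the results. It is nothing like a real RIB parser.

```perl
use strict;
use warnings;

# Hypothetical, heavily simplified RIB lines; real files carry far more detail.
my @ribLines = (
    'TransformBegin',
    'Scale 2 2 2',
    'TransformBegin',
    'Translate 1 2 3',
    'NuPatch A',
    'TransformEnd',
    'TransformBegin',
    'Translate 4 5 6',
    'NuPatch B',
    'TransformEnd',
    'TransformEnd',
);

my @transformStack;       # saved transforms of the enclosing blocks
my $current = '';         # transforms in effect right now, as one string
my %transformsFor;        # patch name => transforms that apply to it

foreach my $line (@ribLines) {
    if ($line eq 'TransformBegin') {
        push @transformStack, $current;           # remember the parent
    }
    elsif ($line =~ /^(?:Translate|Rotate|Scale)\b/) {
        $current = $current eq '' ? $line : "$current; $line";
    }
    elsif ($line =~ /^NuPatch\s+(\S+)/) {
        $transformsFor{$1} = $current;            # stand-in for output to the renderer
    }
    elsif ($line eq 'TransformEnd') {
        $current = pop @transformStack;           # back up at the parent
    }
}

foreach my $patch (sort keys %transformsFor) {
    print "Patch $patch: $transformsFor{$patch}\n";
}
```

Each patch ends up with the Scale plus its own Translate, and the stack is empty again once the last TransformEnd has been read.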

The point is that in Perl, you can implement these data structures extremely easily with commands intrinsic to the language. Note however that all of these operations, push, pop, shift, unshift actually alter the array, either adding or removing elements, where the $operations[0] just reads them. Of course, there are other wonderful array operations like grep, map, foreach, and I'll probably get to them on a later page.

ARRAYS vs LISTS - fine points

You might have noticed that I make a distinction between arrays and lists, and you may wonder, "What's the difference?" Well, they are very very similar. And to this day, I still have trouble telling them apart. This is one of the unfortunate things in Perl. Now, if you're disciplined about receiving scalar return values of subroutines into scalars and array/list return values of subroutines into arrays, you will stay out of trouble. The trouble occurs when you try to implicitly cast array and list return values to scalars.

This is discussed in Programming Perl [WALL00] in the 3rd edition, pp 72-74. But here's where it gets confusing:

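	# LIST assigned to a scalar - the commas here are the comma operator,
	# so $operation gets the last value.
	$operation = ('generateRIB', 'renderRIB', 'cleanupRIB');
	# got 'cleanupRIB'

	# ARRAY in scalar context - $operation will be the scalar value of the array.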
	# This will actually return the number of elements in the array
	# instead of giving you any value in the array.  So $operation
	# will be 3 ($operation didn't get an operation at all)
	@operations = ('generateRIB', 'renderRIB', 'cleanupRIB');
	$operation = @operations;
	# got 3

	# List assignment - As if this wasn't confusing enough, if the
	# left hand side of the assignment is a list, the values will be
	# assigned in the order listed.  So...
	($operation) = ('generateRIB', 'renderRIB', 'cleanupRIB');
	# got 'generateRIB'

	@operations = ('generateRIB', 'renderRIB', 'cleanupRIB');
	($operation) = @operations;
	# also got 'generateRIB'
You can probably understand why I find this confusing. But as we see in the last example, if we receive the array or list into a list context, then we are pretty safe, and we know that we are receiving the first value into the scalar. This is actually pretty useful in receiving parameters in a subroutine. If you return values from a subroutine, you should receive the same type you return. That is, receive scalars into scalars and arrays/lists into lists.

There's actually one more convenience. On the left hand side, if you include an array as the last element in the list, it will absorb all the rest of the elements:

	@operations = ('generateRIB', 'renderRIB', 'cleanupRIB');
	($operation, @remainingOperations) = @operations;
	# $operation got 'generateRIB'
	# @remainingOperations got ('renderRIB', 'cleanupRIB')
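The same list assignment is the usual way to pick up subroutine parameters, which arrive in the special array @_. A minimal sketch, with a made-up subroutine describeJob:

```perl
use strict;
use warnings;

# Hypothetical subroutine: the first argument is an operation name,
# everything after it is a frame number.
sub describeJob {
    my ($operation, @frames) = @_;     # list assignment: first value, then the rest
    my $frameCount = @frames;          # array in scalar context: element count
    return "$operation on $frameCount frames";
}

my $summary = describeJob('renderRIB', 1, 2, 3, 10);
print "$summary\n";
```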

I guess the only other thing I would add is: Live with it. I honestly don't get stung by this one much myself. But I do try to avoid it by not doing any implicit casting from an array/list to a scalar.

Now that I've gone over that, I'm going to refer to arrays and lists interchangeably, usually as "arrays."

Another diversion - storing data structures

Once again, I'm going to take a diversion and talk about the underlying data structures. Granted you don't need to think about this stuff to use Perl, but this helps me think about using structures.

An array is a very simple structure. If you have elements 0-11, then your array is 12 elements long. You access your first one as $ARGV[0] and your 12th as $ARGV[11]. And the storage is also very simple. Somewhere in memory, you are just allocating enough space for 12 elements, and storing them there. Now, suppose you have 1,000,000 elements. Well, that's also pretty simple. You just allocate enough space for 1,000,000 elements and access them with indices 0-999,999. And that's all fine and good.

Now, suppose we want to keep some notes about jobs we sent to our render queue. Now, supposing we're a major effects house or something and we've been in business a long time and have sent a lot of jobs to the queue. Suppose each job on the queue has an ID number ranging from 0 to 65535 or something. In a day, we might submit 100 jobs to the queue (probably more, but stick with me for a minute here), so we might see all of our jobs in a day have numbers like 9782, 9783, 9784, ... Now, we think back to how we store this data. Well, we could just stick this into an array of information keyed on the job ID. However, we could wind up allocating a block of 9785 elements (or worst case 65536 in this example), when we really only have 3 elements with high numbers. That means we've just wasted over 9500 entries! It is easy to conceive of a data structure that would hold 1k of data (output logs, usage statistics), so these 9500 entries could easily translate into 9,500,000 bytes (over 9 Meg of memory!), when the used portion is really only about 3 of those or 3,000 bytes. That could be a waste.

We want to have a more efficient way of storing data. The one that immediately comes to mind would be to keep an array of pointers to our information data structure. So we initially allocate a block of 65536 pointers (pointers are only about 4 bytes), so this would be about 256k. Then we only allocate the information data structure as we need it, so we will only use our 3,000 bytes for the information. So in this case, we only use a little over 256k. But that 256k is pure overhead, compared to the 3k that we are really using. Again, this has a lot of waste.

Supposing though that we had an algorithm that could make a 2 digit number out of all the keys. Think of 9782, 9783, 9784, as keys into this structure. And supposing we could think of a function that could magically transform 9782 into 82, 9783 into 83, and 9784 into 84. Then we could just allocate 100 pointers (400 bytes) and then as we need an info block, we allocate the 1000 bytes for it. For the 3 jobs I keep talking about, they will still take 3000 bytes. So now we're talking about a total of 3400 bytes, 3.4k to hold our data. That's a lot less overhead. But now, we think about this "algorithm" and realize that it's not perfect. After all, our job IDs could range from 0 to 65535, and we're only mapping this into 100 locations, so there's bound to be some duplication if we have a huge amount of jobs. Well, under the hood, clever programmers just keep a list at each location of all the jobs that wound up in that bin. I won't address how they do that, but let's just assume that it is very low overhead, and really not that difficult.
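Just to show that the "list at each location" idea really is not much code, here is a rough sketch of the 100-bin scheme. It is only an illustration of the idea, not how any real library does it, and it leans on references (which I cover later) to hang a list off each slot. The names storeJob, fetchJob, and @bins are made up:

```perl
use strict;
use warnings;

my $BINS = 100;
my @bins;    # each used slot holds a reference to a list of [jobId, info] pairs

sub storeJob {
    my ($jobId, $info) = @_;
    my $slot = $jobId % $BINS;                 # 9782 -> 82
    push @{ $bins[$slot] }, [$jobId, $info];   # duplicates simply share the slot
}

sub fetchJob {
    my ($jobId) = @_;
    my $slot = $jobId % $BINS;
    foreach my $pair (@{ $bins[$slot] || [] }) {
        return $pair->[1] if $pair->[0] == $jobId;
    }
    return;    # not found
}

storeJob(9782, 'log for 9782');
storeJob(9783, 'log for 9783');
storeJob(9784, 'log for 9784');
storeJob(82,   'log for 82');     # lands in the same slot as 9782

print fetchJob(9782), "\n";
print fetchJob(82), "\n";
```

Jobs 9782 and 82 both land in slot 82, and the short scan inside fetchJob is what sorts out which one you asked for.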

Now, let's think about an alternative to this. We could just keep a list of the keys and a list of the data structures. Instead of using 9782, 9783, 9784, ... as the keys directly, I will instead keep this as just another piece of data. So as I get this info, I will allocate 2 arrays of 3 elements. One array will just hold the numbers 9782, 9783, 9784, and the other will hold the information data structures. Now we're looking at about 1006 bytes per element, or a total of 3018 bytes. Our data structures are shrinking! When we want to access the data for job 9782, we read through our first list until we find the key 9782. Then we look over at the other array and get our information block. That's nice and efficient for storage, but realize that we might not be adding job IDs in numerical order, so our search for the key 9782 might not be very efficient, especially for large lists.

Let's amend this idea. Supposing that in our first array, instead of just storing the job ID itself, we also store a pointer to the elements in the other array. This means that the first array will now take up 6 bytes per element - 2 for the number 0-65535, and 4 for the pointer to the other array's information elements. The second array will still just hold the data blocks about the job. Actually, it may hold an array of pointers to these blocks, so let's say it takes 1004 bytes per element, so between our 2 arrays, we are looking at 1010 bytes per element. And to store information about 3 jobs, we're looking at 3030 bytes. Okay, our data structure just got a little bit bigger. But now, suppose we sort the keys in our first array. Now, this means that we can run a binary search through the keys (a very fast way to search an ordered list) when we're looking for the information about job 9782. Then, once we find it, we get a pointer to the data in the other array. (Actually, the other array is not necessary at this level, but it may make garbage cleanup easier).
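A binary search over the sorted keys could be sketched like this. As a simplification, I assume the two arrays are kept in step, so the position found in the key array doubles as the "pointer" into the information array. The subroutine name findJob and the data are made up:

```perl
use strict;
use warnings;

# Two parallel arrays, kept sorted by job ID (hypothetical data).
my @jobIds  = (9782, 9783, 9784);
my @jobInfo = ('log for 9782', 'log for 9783', 'log for 9784');

# Returns the position of $wanted in @jobIds, or -1 if it is not there.
sub findJob {
    my ($wanted) = @_;
    my ($low, $high) = (0, $#jobIds);
    while ($low <= $high) {
        my $mid = int(($low + $high) / 2);
        if    ($jobIds[$mid] < $wanted) { $low  = $mid + 1; }
        elsif ($jobIds[$mid] > $wanted) { $high = $mid - 1; }
        else                            { return $mid; }
    }
    return -1;
}

my $pos = findJob(9783);
print "job 9783: $jobInfo[$pos]\n" if $pos >= 0;
```

Each pass through the loop throws away half of the remaining keys, which is why the sorted order is worth maintaining.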

Yes, I know I should have some diagrams. Maybe I'll add some later. But don't hold your breath.

Anyway, this last idea has some reasonable compromises. We store our data with very little overhead. We know we have 3000 bytes of data regardless of method. But this last scheme has only 30 bytes of excess, rather than the first scheme which had a whopping 9.5 million bytes of waste. One of the areas it still needs help with is in resizing the array, as you add elements. It actually makes sense to allocate both arrays of pointers a little bigger than you need (say 100 elements). And maybe use one of the concepts from an earlier scheme to store everything in this smaller array, and have some low overhead coding under the hood that will keep track of duplicate entries.

That earlier scheme has another benefit. Given an index, we had an algorithm that transformed it into a key directly. This means that we immediately know where to look for the data related to this key. The disadvantage of the idea with the two arrays (sorted key list with pointers to the data) is that as you add keys, you have to re-sort the list. As your list grows, it will take longer and longer to figure out where to put the new key, and rearrange the other keys accordingly. If we hash out the indexes directly into keys, we save that time. (But we need to count on someone to manage the duplicate entries.)

In that earlier scheme, I described a rather trivial algorithm that would translate 9782 into 82. That's just a simplification. Sometimes, what we want is to use a string as a key. That is, instead of indexing off a job number, perhaps we have some information that we would like to key off something like 'animate', 'render', or 'composite'. Now, think about sorting the list of keys and doing a search. To determine if the string matches for something like 'composite', there are 9 characters, so there have to be at least 9 comparisons to decide if this is the key. But we have to do these comparisons for each key (though we'll be able to reject most of them after the first character). Back when we were dealing with job IDs, all we had to do was one comparison for each key. This would give us incentive to find a way to translate 'animate' to something like 82. There are some simplistic ways to do this. For example, each character can be thought of as a number from 0-255 (I'm not even going to think about unicode right now). Perhaps we can take the first two characters and come up with a number like:

	(first character code) * 256 + (second character code)
This will give us a number from 0-65535. Just as before, you can imagine we will have a lot of duplication (like 'animate' and 'antitrust'), but again, all I can say is: have faith that a clever programmer can come up with a way to maintain a list of the duplicates, with a way to access it, without a lot of overhead.

The trick here is that though we expect to be able to get low overhead ways of handling 2 words that wind up as a duplicate key, we still want to avoid winding up in the same bucket as much as we can. We might try to be more clever about this. For instance, there probably will be a high correlation in the first 2 letters (like 'th' or 'qu'). And we might even speculate that there will be a high frequency of vowels (a, e, i, o, u) in the third letter. If we think that most keys will be at least 4 characters long, we might want to modify our key to use the first and fourth characters, using 0 if there are fewer than 4 characters. Again, that is not a perfect algorithm, and you'll get some duplicates. Ideally, we want to evenly fill our buckets with the fewest collisions. Now, there's a whole field of Electrical Engineering/Mathematics/Computer Science called "Information Theory" that discusses this a lot. If you are interested in following up on this, try reading up on Encryption or Compression and look for phrases like "high entropy". I'm not being facetious. This really is an interesting topic. I took a few quarters of it myself in college. But it is pretty hardcore math theory. You might want to look into basic statistics first going into it.
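Both of the toy key functions described above (first two characters, and first plus fourth character) fit in a few lines. This sketch assumes single-byte characters, and the subroutine names are made up:

```perl
use strict;
use warnings;

# Bin number from the first two characters.
sub binFromFirstTwo {
    my ($key) = @_;
    my $first  = length($key) > 0 ? ord(substr($key, 0, 1)) : 0;
    my $second = length($key) > 1 ? ord(substr($key, 1, 1)) : 0;
    return $first * 256 + $second;
}

# Bin number from the first and fourth characters, 0 if the key is too short.
sub binFromFirstAndFourth {
    my ($key) = @_;
    my $first  = length($key) > 0 ? ord(substr($key, 0, 1)) : 0;
    my $fourth = length($key) > 3 ? ord(substr($key, 3, 1)) : 0;
    return $first * 256 + $fourth;
}

foreach my $key ('animate', 'antitrust', 'render', 'composite') {
    printf "%-10s %6d %6d\n", $key, binFromFirstTwo($key), binFromFirstAndFourth($key);
}
```

With the first function, 'animate' and 'antitrust' collide because both start with 'an'. With the second they land in different bins, since their fourth characters ('m' and 'i') differ.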

The points I'm getting to are:

Okay, that was a long winded discussion about data structures, but I was intending to provide a motivation for hashes and associative arrays. There's a lot of consideration that goes into making a hash table, and I've tried to explain how and why you may want to do this. Fortunately, Perl has a mechanism that does all of this work for you: Hashes/Associative Arrays.


One of the things Perl has that many of the other languages I've used don't is hashes intrinsic to the language. These are also known as associative arrays. I cannot do justice to how useful these things are. One could argue that in C++ there are maps in the STL (Standard Template Library). That may be true, but for one thing, you need to include the STL, you probably have to consult a reference manual every time you want to figure out how to access them, and of course, it's been my observation that any C++ code that I use templates in seems to take twice as long to compile, and my compile logs are twice as big. (You probably notice a strong theme here that I no longer like C much, and I loathe and despise C++).

So now what are these wonderful things? The simple way on the surface to think of them is that they are arrays. But instead of having to access something with an index, you can access them with a string. That is,

	%programTable = (
		'animate'	, 'maya.bin',
		'render'	, 'render',
		'composite'	, 'shake',
	);
	# Equivalent to:
	$programTable{'animate'}   = 'maya.bin';
	$programTable{'render'}    = 'render';
	$programTable{'composite'} = 'shake';
Now, admittedly, I have not read the Perl source code, so I don't really know how hashes work. But if I had to guess, I would think they would use a lot of the ideas discussed above in the diversion about data structures. I don't think anyone would seriously use the hash scheme I outlined. That was just to illustrate what you have to think about when making a hash. But they probably did use some hash scheme, and they probably do have some low-overhead scheme to index all the keys. In other words, the prior discussion doesn't describe the tactics Perl designers took, but it probably is raising similar design concerns.

Before I go any further, I want to make a few notes about the assignment schemes above.

There is an alternate way to do the first scheme. Well, kind of. There is the => which acts like a comma (,). But people (including me) use it a lot in hash assignments because it makes it look like we're making a correlation between the keys and values. Really, the , and => are interchangeable, but the => is actually used for clarity:
	%programTable = (
		'animate'	=> 'maya.bin',
		'render'	=> 'render',
		'composite'	=> 'shake',
	);
Actually, there is one difference between => and ,. If you use the => operator, Perl knows that this is usually used with hash assignments, so the value to the left of it is probably the key to a hash. In this special case, if it sees a bareword, then it assumes it is a string. So we can take the shortcut:
	%programTable = (
		animate		=> 'maya.bin',
		render		=> 'render',
		composite	=> 'shake',
	);

The question now is: when do we use a hash and when do we use an array? Hashes are really wonderful things. Generally, if you want to access the values of an array by string keys, you almost always want to use a hash. If you are using numbers to access, you probably want an array. But you might consider using a hash if you have a sparse array.
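The sparse array case ties back to the job ID example from the data structure diversion. A small sketch (with made-up log strings) shows the difference in how many slots each approach ends up with:

```perl
use strict;
use warnings;

# Array keyed on job ID: storing one high index extends the whole array.
my @jobArray;
$jobArray[9784] = 'log for 9784';
print scalar(@jobArray), " array slots\n";

# Hash keyed on job ID: only the entries actually stored exist.
my %jobHash = (
    9782 => 'log for 9782',
    9783 => 'log for 9783',
    9784 => 'log for 9784',
);
print scalar(keys %jobHash), " hash entries\n";
```

The array reports 9785 slots after a single assignment, while the hash holds just the 3 entries that were put into it.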

As much as I love hashes, there is one thing that they are not good at: maintaining order. If you care about the order of the elements in the array (like in a stack or a list of operations), then you probably want to maintain an array. Now, there is something up in CPAN called Tie::IxHash that actually does maintain the order. But I'm not going to get into that here. I still have not used the module yet. Usually with hashes, you just access the list of keys using keys %programTable, and don't count on the order being anything that makes any sense.
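When the order does matter for output, the usual trick is to sort the keys as you read them back. A minimal sketch using the same table (the 'edit' key is just an example of something that is not in it):

```perl
use strict;
use warnings;

my %programTable = (
    animate   => 'maya.bin',
    render    => 'render',
    composite => 'shake',
);

# keys returns the keys in no useful order, so sort them if order matters.
foreach my $task (sort keys %programTable) {
    print "$task runs $programTable{$task}\n";
}

# exists checks for a key without creating it.
print "no 'edit' entry\n" unless exists $programTable{'edit'};
```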

Actually, I take that back. Let's consider the idea that under the hood, they have a hash scheme that converts the strings into some random looking number to fit into a small number of bins. The more random the number the better. So internally, it is probably keeping track of the elements in the array based on this hash key. Perl won't tell you what that hash key is, but if you're looking for some rationale in the order keys come back to you, well, that's it.

One of the other things to be aware of under the hood is that though it is low overhead, there is overhead nonetheless when you use a hash vs. an array. In a hash, it is probably maintaining 2 arrays internally, one for the key and one for the value. There is also some additional storage space for the hash keys, but not much. I would never actually make the storage overhead a consideration when deciding whether or not to use a hash.

SUBROUTINES [back to top]

You can write a subroutine and access it like:

	sub helloWorld {
		print "hello world\n";
	}

	&helloWorld();

This created a subroutine called helloWorld, and I accessed it by using the &. That's pretty simple. I give my subroutines a list of arguments using (). A subroutine is really just a code block that you can access by name. Sometimes they return values. Sometimes they don't. In Perl, a subroutine is a subroutine. A single routine may sometimes exit without returning a value and other times return a value. You don't need to pre-declare anything. I find that convenient. I also like being able to pass in a variable number of arguments and put the intelligence into the subroutine to figure out what to do.
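Here is a small sketch of a subroutine that takes a variable number of arguments and only sometimes returns a value. The name describeFrames and its behavior are invented for illustration.

```perl
use strict;
use warnings;

# Arguments arrive in @_; the routine decides what to do based on how many it got.
sub describeFrames {
    my @frames = @_;
    return unless @frames;                          # no arguments: return nothing
    return 'frame ' . $frames[0] if @frames == 1;   # one argument: a single frame
    return 'frames ' . $frames[0] . '-' . $frames[-1];  # several: first through last
}

my $single = describeFrames(7);
my $range  = describeFrames(1, 2, 3, 24);
my $none   = describeFrames();

print "$single\n";
print "$range\n";
print defined $none ? "got a value\n" : "got nothing back\n";
```

The caller never declares how many arguments it will pass or whether it expects a value back, so it is up to the calling code to check for undef when that matters.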

Anyway, I don't have a lot to say about subroutines. Learn about them. Read Learning Perl [SCHW97] and Programming Perl [WALL00]

Golden Rules or What the %{$_}=($&=>[@_,@{$_},'friggin']);#heck do all these symbols mean? [back to top]

I have a few guidelines for interpreting Perl expressions. Maybe I'm exaggerating by calling them "golden rules." There are a couple more guidelines for other common symbols, but these have some ambiguities. Perl has a bunch of special variables. Sometimes (inside map or grep) you do not have a choice and you have to use cryptic symbols like $_, though there are many other places where you can use these symbols. Generally, I recommend against it because it makes code harder to read. But a few of the symbols you might encounter are:

So now, some of you might be curious what that expression meant at the top of this section:

%{$_}=($&=>[@_,@{$_},'friggin']);#heck do all these symbols mean?
I consider this obfuscated perl and I'm actually using it as an example of bad coding. But let's break it down anyway. First off, anything after the pound sign (#) is a comment, so we can ignore it. Also, let's put in some whitespace to clarify things a little.
%{$_} = (
	$& => [@_, @{$_}, 'friggin']
);
Okay, this is already less intimidating. I'm cheating a little bit here. I'm using symbolic dereferencing here. Pretend that I'm foreaching through a list of variable names. In Perl, I can access this through dereferencing:
	$procName = 'prman';
	%{$procName.'Configs'} = ( 'ShadingRate' => .25 );

	# Equivalent to:
	%prmanConfigs = ( 'ShadingRate' => .25 );
Already, we can see benefits over strictly compiled languages (C, C++). I can reference variables without knowing what they are. This means, for instance, that I can let a user tell me what variable name they want to change without having a huge if/case block. Actually, you can't have character arrays as case arguments, so you're really stuck with a huge if/else-if block in C/C++.
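One caveat if you try this yourself: symbolic references only reach package (global) variables, and they are refused under use strict unless you switch off the 'refs' check. A minimal sketch of the prman example under those assumptions:

```perl
use strict;
use warnings;

# Symbolic references look names up in the package symbol table,
# so the target has to be a package variable ("our"), not a "my" variable.
our %prmanConfigs;

my $procName = 'prman';
{
    no strict 'refs';    # allow a string to be used as a variable name in this block
    %{ $procName . 'Configs' } = ( 'ShadingRate' => .25 );
}

print "ShadingRate = $prmanConfigs{ShadingRate}\n";
```

This prints ShadingRate = 0.25, showing that the assignment through the built-up name really landed in %prmanConfigs.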

Now, with dereferencing in mind, we see that %{$_} and @{$_} are simply symbolic dereferences to whatever variable name is held in $_, the current arg, most likely from a surrounding foreach. Note that the @{} and %{} notations are also used for normal references, so $_ could actually hold a pointer. However, I am accessing it with both @{$_} and %{$_}, and it would be invalid to hash dereference an array ref or to array dereference a hash ref, so I'm going to say that's not what's happening above. Instead, let's just say that it's a symbolic dereference, and I'm just providing the name of a variable. So suppose that $_ held the string 'prmanConfig'. Then the big expression would be equivalent to:

%prmanConfig = (
	$& => [@_, @prmanConfig, 'friggin']
);
Referring back to my notes on special variables, realize that @_ is just the current argument array and $& is just the text that the last pattern matched. So we see that all we are really doing here is initializing the hash %prmanConfig with a single key/value pair.

We take whatever was last matched in a regular expression and use it as a key, and the value it points to is actually a ref to an array. Recall that one of the golden rules is that [] refers to array activity. In this case, we are creating an anonymous (no variable name) array.
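If you want to watch the expression run, here is one way to set up the surroundings it needs. Everything around the one-liner is invented for the demonstration: the rebuild subroutine, the 'shading' match, and the starting contents of @prmanConfig.

```perl
use strict;
use warnings;

# Package variables, so the symbolic dereferences can find them by name.
our @prmanConfig = ('old1', 'old2');
our %prmanConfig;

sub rebuild {
    # A successful match sets $& to the matched text, here 'shad'.
    return unless 'shading' =~ /shad/;

    # foreach puts the variable name into $_.
    foreach ('prmanConfig') {
        no strict 'refs';    # needed for the symbolic dereferences
        %{$_} = ( $& => [ @_, @{$_}, 'friggin' ] );
    }
}

# @_ inside rebuild will be ('arg1', 'arg2').
rebuild('arg1', 'arg2');

for my $key (keys %prmanConfig) {
    print "$key => [", join(', ', @{ $prmanConfig{$key} }), "]\n";
}
```

This prints shad => [arg1, arg2, old1, old2, friggin]: one key taken from the match, pointing at an anonymous array built from the subroutine arguments, the old contents of @prmanConfig, and the literal string.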

Well, that was pretty contrived, and really a useless block of code. But I thought some people may wonder what that really expanded to.

© 2001 Steve Hwan, hostname: @pacbell.net, username: svhwan
You should probably use the word "PERL" in the subject line to get my attention.
Last Modified: Sun Dec 2 14:48:57 2001