DISCLAIMER: THESE PAGES ARE STILL UNDER CONSTRUCTION. NO CODE EXAMPLE HAS BEEN TESTED YET.

Introduction to Perl

Regular Expressions - renaming/renumbering files



You're probably getting sick of seeing this note, but I do feel compelled to say: I'm only going to glance over this topic with some notes on my observations and examples. If you want a more thorough treatment, see Learning Perl [SCHW97] and Programming Perl [WALL00]. That's where I learned a lot about regular expressions. These books, as well as the Perl Cookbook [CHRI98], seem to recommend one of the O'Reilly books, Mastering Regular Expressions. People in the Perl community seem to like this book, though I've never read it myself. I don't think regular expressions are really that difficult, but they can account for some of the more cryptic Perl code.

Regular expressions are a very powerful set of tools for matching and extracting data.

One way of testing for matching data is simple equality. That is, by saying something like:

	if ($filename eq 'myElement.0001.tif') {
		# do something with filename
	}
but this is kind of limiting. In UNIX or DOS, you may be used to the concept of globbing, where the characters * and ? had special meanings. For instance, you might do something like this in UNIX:
	% ls myElement.*.tif
Here, * has the special meaning of: "match any number of characters, or none at all." ? has the special meaning of: "match any single character."

Now, in Perl, we actually have access to this matching mechanism in file globbing. In a later note, I'll discuss alternatives to file globbing, and why I don't like file globbing. But it is a quick way to get introduced to some of Perl's features. I think it works something like this: You can put a globbing expression inside the <> operators. Unfortunately, this looks a lot like a file handle, even though it isn't. I think it's actually a little sloppy, but the idea is that you can put a globbing expression into the operator, and get a list of files back. For instance,

	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	foreach $file (<*>) {
		system "render $file";
	}

Listing 2.8.1 for code_untested/renderRibs-1.pl
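
The angle-bracket form above can also be written with Perl's built-in glob function, which at least doesn't look like a file handle. Here is a minimal sketch of that. The scratch directory and the file names are made up for illustration (they are not from the listing above), and the sketch assumes the scratch path has no spaces in it:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory holding a few empty, hypothetical files.
my $dir = tempdir(CLEANUP => 1);
foreach my $name (qw(myElement.0001.tif myElement.0002.tif notes.txt)) {
    open(my $fh, '>', "$dir/$name") or die "cannot create $name: $!";
    close($fh);
}

# glob() takes the same kind of pattern that <...> does.
# It splits its argument on whitespace, which is why this
# sketch assumes a path without spaces.
my @tifs = sort glob("$dir/myElement.*.tif");
print "$_\n" foreach @tifs;
```

Only the two myElement frames should come back; notes.txt does not fit the pattern.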

Regular expressions expand on this idea and give you much much more.

Some simple regular expressions

I think the easiest way to explain these is if I just run through a few sample expressions. Let's suppose I want to get a list of all the files in my directory, then take a subset of them.

	@filenames = <*>;
	@myElementList = grep(/myElement/, @filenames);
You may recognize grep as being a UNIX command. In Perl, grep looks for a condition (1st arg) in the list (remaining args), and returns the subset of the list satisfying this condition. Usually, this condition is a regular expression (though I've done bizarre things like looking for light nodes in a scene graph). By the way, the origins of grep go back to ed, a very primitive line editor. The name is shorthand for Global Regular Expression Print. That is, it would print out all lines in the file (global) that matched a certain regular expression. With all the 1 letter commands and modifiers that were prevalent at the time, this was written in ed as: g/re/p, where re is a regular expression. No doubt someone saw that and realized it's a useful function, but didn't think much of confused users, and wrote a useful UNIX command, grep. Since it is a command that is frequently used by UNIX hackers for searching through files, it was used in Perl too.

The code fragment above is a very very simple regular expression. It just looks like a literal, and it will match any string that contains myElement. That is, it will match 'myElement.0001.tif' as well as 'BOGUSmyElement'.
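
To see that behavior without touching a real directory, here is a small sketch that runs the same grep over a hard-coded list. The file names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical file names standing in for a directory listing.
my @filenames = ('myElement.0001.tif', 'BOGUSmyElement', 'otherElement.0001.tif');

# Keep every name that contains the literal text "myElement" anywhere.
my @myElementList = grep(/myElement/, @filenames);

print "$_\n" foreach @myElementList;
```

The first two names are kept, because each contains myElement somewhere. The third is dropped.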

We might want to get pickier about our expression. For instance, in the ls example, we narrowed our search to things ending with 'tif' by using the glob character *. In regular expressions, we have something similar, but the functionality is split between two special characters:

	.	match any single character
	*	match 0 or more occurrences of the previous character

So if we want to match 0 or more occurrences of any character in a regular expression, we actually need to say .* and if we want to match the dot character itself, we need to escape it with a backslash: \. So we can expand our previous expression like this:
	@filenames = <*>;
	@myElementList = grep(/myElement\..*\.tif/, @filenames);
This would almost be equivalent to the globbing expression that we gave ls above. Almost? What's missing? Well, our current expression will match something like: myElement.0001.tif, but it will also match a name like: BOGUSmyElement.0001.tifBOGUS.

That is, we're not limiting anything on the head or tail. But we can and should. Now would be a good time to introduce 2 more special regexp symbols:

	^	match at the beginning of the string
	$	match at the end of the string

So we could limit our expression with:
	@filenames = <*>;
	@myElementList = grep(/^myElement\..*\.tif$/, @filenames);
A subtle difference, but this will protect us from the extra characters at the beginning and the end. Now we can think about how to improve the expression. The expression above, and the globbing expression in UNIX will still catch patterns like: myElement.whatever.tif, though we may know ahead of time that we just want numbers. Actually, both globbing and regular expressions allow you to specify limited possible matches. Supposing now we want to detect a 4-padded number. That is, we know our frame is 4 digits long:
	@filenames = <*>;
	@myElementList = grep(/^myElement\.[0123456789][0123456789][0123456789][0123456789]\.tif$/, @filenames);
I chose this way to express it because it is similar to the way you would do it with globbing. Globbing lets you abbreviate ranges such as [0-9], and so do regular expressions:
	@filenames = <*>;
	@myElementList = grep(/^myElement\.[0-9][0-9][0-9][0-9]\.tif$/, @filenames);
which makes it a little more readable. However, in Perl, we can actually do a little better. There is a shortcut in the case of digits: we can use \d, which matches any single digit:
	@filenames = <*>;
	@myElementList = grep(/^myElement\.\d\d\d\d\.tif$/, @filenames);
In fact, suppose we want to relax the "4-padded" business and just say we want to match "one or more" digits. There is another special character, +, which means "match 1 or more occurrences of the previous character," so we can say:
	@filenames = <*>;
	@myElementList = grep(/^myElement\.\d+\.tif$/, @filenames);
Actually, we could make another improvement. We might have a negative frame. This means that we could have an optional minus sign. This brings up another special character: ? Note that this is distinctly different from what ? means in the globbing expression. In globbing, it would mean to match any single character. In regular expressions, it depends on the previous character, and it basically means that the previously listed character is optional. So finally, we could make our expression:
	@filenames = <*>;
	@myElementList = grep(/^myElement\.-?\d+\.tif$/, @filenames);
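
Here is a quick sketch that runs several of the expressions from this section against the same hard-coded list, so you can see what each refinement throws away. The file names and the array names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical file names standing in for a directory listing.
my @filenames = (
    'myElement.0001.tif',
    'myElement.-012.tif',
    'myElement.whatever.tif',
    'BOGUSmyElement.0001.tifBOGUS',
);

my @loose    = grep(/myElement\..*\.tif/,       @filenames);  # no anchors
my @anchored = grep(/^myElement\..*\.tif$/,     @filenames);  # head and tail pinned
my @digits   = grep(/^myElement\.\d+\.tif$/,    @filenames);  # digits only
my @signed   = grep(/^myElement\.-?\d+\.tif$/,  @filenames);  # optional minus sign

print "loose:    @loose\n";
print "anchored: @anchored\n";
print "digits:   @digits\n";
print "signed:   @signed\n";
```

The unanchored expression keeps all four names, the anchors drop the BOGUS one, \d+ keeps only the 0001 frame, and -?\d+ lets the negative frame back in.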

There is also another operator, s, which can operate on regular expressions. This is the substitute operator, and it pretty much came from sed, another old scripting language. It will take 2 sections of expression between slashes. The first part is the pattern to match and the second part is what to replace it with. Unlike sed (which really didn't have variables), in Perl, you can stick variables in either part.
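
As a minimal sketch of s/// with a variable in each part (the variable names and file name here are made up for illustration):

```perl
use strict;
use warnings;

my $old = 'element_v0001';    # hypothetical text to look for
my $new = 'element_v0002';    # hypothetical replacement text

my $name = 'element_v0001.0001.tif';

# Copy $name into $renamed, then substitute on the copy so the
# original string is left alone.
(my $renamed = $name) =~ s/^$old/$new/;

print "$name -> $renamed\n";
```

Note that $old is treated as a regular expression here, not as plain text, so any special characters in it keep their special meanings.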

I'll use s/// in my next example. But I'll say again that there are more special characters in regular expressions, and there's more to them than what I've gone over here. Hopefully, this will give a reasonable introduction, but you really should buy a real Perl book. A few are listed at the top of this page. Also, there is an online tutorial (assuming it's still online). See the section about "a little s/// and m//" in [SHEP00].

EX 2.8.1: Changing the basename of a file

Let's try to apply what we have so far. A really common program that everyone seems to write is:

Rename a whole lot of images by changing the basename.
I think a lot of people use a script like this, so facilities sometimes believe they need a general "rename" script, and start adding all sorts of convoluted features which no one can remember. As it turns out, this is an exceedingly simple script to write. Let's just jump into this.
	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	foreach my $filename (<oldbasename.*.tif>) {
		my $newfilename = $filename;
		# swap the basename at the front of the name
		$newfilename =~ s/^oldbasename/newbasename/;
		rename($filename, $newfilename);
	}

Listing 2.8.2 for code_untested/changeBasename-0.pl
Here, we change the files with names like oldbasename.0001.tif into newbasename.0001.tif. We might want to do this to change the names of our renders from "element_v0001.0001.tif" to "element_final.0001.tif." Actually, I strongly discourage this practice. Keeping version numbers helps trace the workflow of how the element came to be. And typically, when someone says something is "final," they never really mean it. However, I would encourage softlinking the final version to something called "final".

The script itself is actually pretty sloppy on a lot of counts, but usually it's enough to get the job done. When you're running a script like this, usually you will be in the image directory already, and all the files will have this basename. The main things I don't like are:

But again, I stress that most times, this is really all you need. Honestly, what I usually do is from UNIX, ls -1 the directory into a temporary file like deleteme or something, then use vi to make a mini script to mv files and just source it. But I'm trying to discuss regular expressions in Perl here, so I'll continue with this example.

Also, note that this typically is not a speed critical application, so even the regular expression recompilation isn't a big deal.

I do want to emphasize, though, how useless a show-wide renaming script is, because you can write a script on a per-use basis like the one above in about 2 minutes...

But that all being said, let's pretend that we wanted to write a file basename renaming script. We can generalize this a little bit by making the old and new basenames arguments that we pass in to our script:

	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	($oldBasename, $newBasename) = @ARGV;
	foreach my $filename (<*>) {
		my $newfilename = $filename;
		$newfilename =~ s/^$oldBasename/$newBasename/;
		# only rename the files whose names actually changed
		if ($newfilename ne $filename) {
			rename($filename, $newfilename);
		}
	}

Listing 2.8.3 for code_untested/changeBasename-1.pl

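We can make this marginally better by using the /o option for the regular expression, which means "compile it Once." (Perl 5.6 and later already skip the recompilation when the interpolated variable hasn't changed, so /o makes little difference there.) In this case, we want to do this because the variable doesn't change - we are replacing the same basename the whole time: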

	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	($oldBasename, $newBasename) = @ARGV;
	foreach my $filename (<*>) {
		my $newfilename = $filename;
		$newfilename =~ s/^$oldBasename/$newBasename/o;
		# only rename the files whose names actually changed
		if ($newfilename ne $filename) {
			rename($filename, $newfilename);
		}
	}

Listing 2.8.4 for code_untested/changeBasename-2.pl
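
One thing to watch for when the old basename comes from the command line: it is interpolated as a regular expression, so a dot in it will match any character. Here is a sketch of one way around that, using \Q...\E to quote the special characters and qr// to build the expression once up front. Both are standard Perl features, but they are not what the listings above use. The helper name new_name and the sample basenames are made up for illustration, and nothing here touches the disk:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the new file name, or undef if the
# name does not start with the old basename.
sub new_name {
    my ($filename, $old_re, $newBasename) = @_;
    my $newfilename = $filename;
    return undef unless $newfilename =~ s/$old_re/$newBasename/;
    return $newfilename;
}

my $oldBasename = 'shot.v1';    # note the dot
my $newBasename = 'shot.v2';

# \Q...\E makes the dot literal; qr// compiles the expression once.
my $old_re = qr/^\Q$oldBasename\E/;

foreach my $filename ('shot.v1.0001.tif', 'shotXv1.0001.tif') {
    my $newfilename = new_name($filename, $old_re, $newBasename);
    print "$filename -> ",
        (defined $newfilename ? $newfilename : '(no change)'), "\n";
}
```

Without the \Q...\E, the second name would be renamed too, because the unquoted dot would happily match the X.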

EX 2.8.2: Binding materials to patches

I can't remember if I already went over this example, though I know I'll go over it again with some more tricks.

Bind materials to patches. You will have an array of patches, and an array of material expressions, and a lookup table.
I'll try to explain the motivation for this strategy. Typically, you will have more patches than you do materials. For instance, you might have a model of a house. Probably a lot of your patches will have the same shader/material/appearance, such as "wood" or "paintedStucco." We might, for instance, have individually modeled the shingles separately and named them shingle_001, shingle_002, shingle_003... So we can easily define a regular expression for them like ^shingle_\d+. We would then want to have a mapping from the expressions to the materials, which we can easily do with a hash lookup:
	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	foreach my $material (@materials) {
		my $materialExp = $materialExpHash{$material};
		foreach my $patch (@patches) {
			if ($patch =~ /$materialExp$/) {
				&bindMaterial($patch, $material);
			}
		}
	}

Listing 2.8.5 for code_untested/bindMaterial-2.pl

I will address this again in a later article in the "Doing it Better" section, but for now, I'll just mention that this is not the most efficient way to do things.
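
The listing above leaves out the data and the bindMaterial routine, so here is a self-contained sketch of the same loop that can actually be run. The patch names, the expressions in the hash, and the stand-in bindMaterial (which just records and prints the binding) are all made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical scene data.
my @patches   = qw(shingle_001 shingle_002 wall_north door_front);
my @materials = qw(wood paintedStucco);

# Material name => regular expression describing its patches.
# Single quotes keep the backslashes intact.
my %materialExpHash = (
    wood          => '^(shingle_\d+|door_\w+)',
    paintedStucco => '^wall_\w+',
);

# Stand-in for the real binding routine: remember and report.
my %boundTo;
sub bindMaterial {
    my ($patch, $material) = @_;
    $boundTo{$patch} = $material;
    print "bind $material to $patch\n";
}

foreach my $material (@materials) {
    my $materialExp = $materialExpHash{$material};
    foreach my $patch (@patches) {
        if ($patch =~ /$materialExp$/) {
            bindMaterial($patch, $material);
        }
    }
}
```

Every patch gets tested against every material expression here, which is the inefficiency mentioned above.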

Extracting data with regular expressions

So we realize that regular expressions can be good for determining if a string matches a certain pattern. But that's just the tip of the iceberg. Another use of regular expressions is to pull data out of a string. This brings up yet another set of special characters in regular expressions:

	( )	remember (capture) whatever matched the part of the expression inside the parentheses

Using a regular expression with parentheses in a list context will return a list of the values that matched the parenthesized sections. Note that if you were to evaluate the match in a scalar context, you would just get a true/false result (1 if it matched). This is actually a really common gotcha. That is,
	# BAD!  $frame is in a scalar context, so you will get a 1 if anything
	# matches.  Actually, sometimes, you may want just that yes/no answer,
	# but usually that's not what you're after.
	$frame = $filename =~ /^myElement\.(-?\d+)\.tif$/;

	# good! assigning to a list on the left hand side
	($frame) = $filename =~ /^myElement\.(-?\d+)\.tif$/;
The new thing here is the =~ operator. This means to bind a string ($filename) to a regular expression or regular expression operation. But hopefully, you guessed that from the context.
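
Here is a small sketch that puts the two forms side by side so you can see what each one hands back. The file name and the variable name $matched are made up for illustration:

```perl
use strict;
use warnings;

my $filename = 'myElement.0042.tif';

# Scalar context: just tells us whether the expression matched.
my $matched = $filename =~ /^myElement\.(-?\d+)\.tif$/;

# List context: hands back what the parentheses captured.
my ($frame) = $filename =~ /^myElement\.(-?\d+)\.tif$/;

print "scalar context: $matched\n";
print "list context:   $frame\n";
```

The first print shows 1 and the second shows 0042. The only difference between the two assignments is the parentheses around the variable on the left hand side.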

So far, I've just used // to indicate a regular expression. If I wanted to, I could include an m for "match." There are some advantages and additional options available with using the m, but I don't generally use them, so I won't go into them. But what I mean is, you could say:

	# good! assigning to a list on the left hand side
	($frame) = $filename =~ m/^myElement\.(-?\d+)\.tif$/;

EX 2.8.3: Changing the frame number of a file

This seems to be another common task. Let's say you animate your shot from frames 1-100. Then, as you're editing your piece together, you find that you want to trim out 23 frames from the head. So the animation now goes from 24-100. But then your production (unwisely) mandates that everything you turn in must start at frame 1 because some archaic piece of software can't handle starting on a frame other than 1. (I think that's a serious flaw in the software, because renumbering the renders separates the renders from the animation. You may find a flaw in a renumbered rendered frame, but when you go back to trace it in the animation, the frames no longer match and you have to just hope that someone remembered to log a note on the scene somewhere where you can easily access it...)

But typically, as a production TD, you do not have a choice about using the archaic piece of software, and in the end, you're just following orders from your manager and/or producer. So let's say one of them is making you renumber your files...

	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	$offset = $ARGV[0];

	@filenames = <basename.*.tif>;
	foreach my $filename (@filenames) {
		rename( $filename, 'TMP'.$filename);
	}
	foreach my $filename (@filenames) {
		my($basename, $frame, $ext) = $filename =~ /(.*)\.(-?\d+)\.(.*)/;
		my $newframe = $frame+$offset;
		my $newfilename = $basename . '.'
					. sprintf("%04d", $newframe) . '.' . $ext;
		rename( 'TMP'.$filename, $newfilename);
	}

Listing 2.8.6 for code_untested/renumberFiles-1.pl

The basic idea here is that when you renumber the files, you don't want to overwrite files prematurely. Say there are frames 1-100 and you need to add 10 to each frame number. If you start with frame 1 and immediately renumber it to frame 11, then you've just lost the original frame 11. So I am moving everything to TMP files first. Let's try and revise this regular expression though... We'll take advantage of \w:

	\w	match any single alphanumeric character or underscore

And I used the same trick for negative numbers that I used in the beginning of this section.
	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	$offset = $ARGV[0];

	@filenames = <basename.*.tif>;
	foreach my $filename (@filenames) {
		rename( $filename, 'TMP'.$filename);
	}
	foreach my $filename (@filenames) {
		my($basename, $frame, $ext) = $filename =~ /(.*)\.(-?\d+)\.(\w+)$/;
		my $newframe = $frame+$offset;
		my $newfilename = $basename . '.'
					. sprintf("%04d", $newframe) . '.' . $ext;
		rename( 'TMP'.$filename, $newfilename);
	}

Listing 2.8.7 for code_untested/renumberFiles-2.pl
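
If you want to convince yourself that the move-to-TMP-first idea really protects the frames, here is a sketch that tries it in a scratch directory instead of on real renders. The scratch directory, the 20 stand-in frames, and the hard-coded offset of 10 are all made up for illustration. It also uses readdir rather than a glob to collect the names, so that they come back without the directory in front:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $offset = 10;                     # hypothetical offset for this sketch
my $dir    = tempdir(CLEANUP => 1);  # scratch directory, removed at exit

# Make 20 small stand-in frames whose contents record their
# original frame number.
foreach my $frame (1 .. 20) {
    my $name = sprintf("basename.%04d.tif", $frame);
    open(my $fh, '>', "$dir/$name") or die "cannot create $name: $!";
    print $fh "frame $frame\n";
    close($fh);
}

# Collect the bare file names.
opendir(my $dh, $dir) or die "cannot open $dir: $!";
my @filenames = grep(/^basename\.-?\d+\.tif$/, readdir($dh));
closedir($dh);

# Pass 1: move everything out of the way.
foreach my $filename (@filenames) {
    rename("$dir/$filename", "$dir/TMP$filename")
        or die "cannot rename $filename: $!";
}

# Pass 2: move each TMP file to its renumbered name.
foreach my $filename (@filenames) {
    my ($basename, $frame, $ext) = $filename =~ /(.*)\.(-?\d+)\.(\w+)$/;
    my $newfilename = sprintf("%s.%04d.%s", $basename, $frame + $offset, $ext);
    rename("$dir/TMP$filename", "$dir/$newfilename")
        or die "cannot rename TMP$filename: $!";
}

# The old frame 1 should now be sitting in frame 11.
open(my $in, '<', "$dir/basename.0011.tif") or die "cannot read frame 11: $!";
my $first_line = <$in>;
close($in);
print "basename.0011.tif holds: $first_line";
```

After the two passes, the frames run from 0011 to 0030 and each file still holds the contents it started with, so nothing was overwritten along the way.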

There are a couple things left. First off, let's have a little more flexibility in the separators. Instead of just period(.), let's allow either period or dash(-):

	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	$offset = $ARGV[0];

	@filenames = <basename.*.tif>;
	foreach my $filename (@filenames) {
		rename( $filename, 'TMP'.$filename);
	}
	foreach my $filename (@filenames) {
		my($basename, $sep1, $frame, $sep2, $ext)
				= $filename =~ /(.*)([\.\-])(-?\d+)([\.\-])(\w+)$/;
		my $newframe = $frame+$offset;
		my $newfilename = $basename . $sep1
					. sprintf("%04d", $newframe) . $sep2 . $ext;
		rename( 'TMP'.$filename, $newfilename);
	}

Listing 2.8.8 for code_untested/renumberFiles-3.pl

Here, we see one more special symbol in regular expressions: []. Actually, we've seen it before in the simple examples section on this page. In this case, we want to match period or dash, and each of them can have a special meaning in regular expressions: the period matches any character outside of [], and the dash marks a range (as in 0-9) inside of []. So just to be safe, I protect them by backslash escaping them.
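
As far as I know, a period inside [] is always taken literally, and so is a dash that comes first or last inside [], so the backslashes are a safety measure rather than a requirement. Here is a quick sketch comparing the two spellings on a few made-up names:

```perl
use strict;
use warnings;

# Hypothetical names using dot, dash, and underscore separators.
my @names = ('shot.0001.tif', 'shot-0001-tif', 'shot_0001_tif');

my @escaped   = grep(/[\.\-]\d+[\.\-]/, @names);  # backslash-protected
my @unescaped = grep(/[.-]\d+[.-]/,     @names);  # dash last, so no range

print "escaped:   @escaped\n";
print "unescaped: @unescaped\n";
```

Both spellings keep the dot and dash names and drop the underscore one.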

Now, sometimes, some programs like to output their frames without extensions. I think this is from the olden days or something, because these days, there are so many file formats, it's kind of stupid not to give an extension. But, supposing some programs do and some don't. Then we do want to consider that last bit to be optional.

	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	$offset = $ARGV[0];

	@filenames = <basename.*.tif>;
	foreach my $filename (@filenames) {
		rename( $filename, 'TMP'.$filename);
	}
	foreach my $filename (@filenames) {
		my($basename, $sep1, $frame, $lastPart, $sep2, $ext)
			= $filename =~ /(.*)([\.\-])(-?\d+)(([\.\-])(\w+))?$/;
		my $newframe = $frame+$offset;
		my $newfilename = $basename . $sep1
					. sprintf("%04d", $newframe) . $lastPart;
		rename( 'TMP'.$filename, $newfilename);
	}

Listing 2.8.9 for code_untested/renumberFiles-4.pl
There's nothing really new here. I've gone over everything that's used. But there are a couple things to clarify:

Finally, /(.*)([\.\-])(-?\d+)(([\.\-])(\w+))?$/ is pretty darned messy looking, and probably a little bit intimidating looking to beginners. Perl actually allows you to comment your regular expressions with the special /x directive:
	#!/bin/sh
	#! -*- perl -*-
	eval 'exec $PERLLOCATION/bin/perl -x $0 ${1+"$@"} ;'
	 if 0;

	$offset = $ARGV[0];

	@filenames = <basename.*.tif>;
	foreach my $filename (@filenames) {
		rename( $filename, 'TMP'.$filename);
	}
	foreach my $filename (@filenames) {
		my($basename, $sep1, $frame, $ending, $sep2, $ext)
				= $filename =~ /(.*)		# use anything as base
						([\.\-])	# dot or dash
						(-?\d+)		# optional neg, digits
						(		# optional extension
							([\.\-])	# dot or dash
							(\w+)	# alphanumeric or _
						)?
						$/x;		# end of expression
		my $newframe = $frame+$offset;
		my $newfilename = $basename . $sep1
					. sprintf("%04d", $newframe) . $ending;
		rename( 'TMP'.$filename, $newfilename);
	}

Listing 2.8.10 for code_untested/renumberFiles-5.pl

The first thing to notice is this whole /x thing at the end of our expression. This lets us use a special expanded notation of regular expressions that lets us split the expression between lines. It will ignore whitespace, so if you want to match whitespace, you need to use \s, or escape your whitespace with a backslash (\), or use \t. I haven't checked this out though, because the situation hasn't come up for me yet. But the thing to keep in mind is that generally whitespace is ignored. Also, we are allowed to put comments in. I recommend limiting your regular expression comments to alphanumeric characters (no punctuation) though. I had a special character evaluate on me once, and I don't remember which one. (A / in a comment, for one, would end the expression early.)

There's actually a good discussion about this (/x, with more examples) in The Perl Cookbook [CHRI98].
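
To poke at the commented expression without renaming anything, here is a sketch that stores it with qr//x and prints the pieces it finds. The helper name split_name, the sample file names, and the hard-coded offset of 10 are made up for illustration, and qr// is a standard Perl feature rather than something the listing above uses:

```perl
use strict;
use warnings;

my $frame_re = qr/^(.*)          # anything as the base
                  ([.\-])        # dot or dash
                  (-?\d+)        # optional minus sign then digits
                  (              # optional ending
                      ([.\-])    # dot or dash
                      (\w+)      # letters digits or underscore
                  )?
                  $/x;

# Hypothetical helper: returns (basename, separator, frame, ending),
# or an empty list if the name does not fit the expression.
sub split_name {
    my ($filename) = @_;
    my ($basename, $sep1, $frame, $ending, $sep2, $ext)
        = $filename =~ $frame_re;
    return () unless defined $frame;
    return ($basename, $sep1, $frame, $ending);
}

foreach my $filename ('basename.0001.tif', 'basename-0001') {
    my ($basename, $sep1, $frame, $ending) = split_name($filename);
    $ending = '' unless defined $ending;    # no extension was found
    my $newfilename = $basename . $sep1
                    . sprintf("%04d", $frame + 10) . $ending;
    print "$filename -> $newfilename\n";
}
```

When there is no extension, the ending comes back undefined, which is why the sketch swaps in an empty string before gluing the name back together. One more thing I noticed while trying this: because (.*) grabs as much as it can and a dash is also allowed as a separator, a name like shot.-0012.tif comes apart with the dash as the separator and 0012 as the frame, so negative frames and dash separators don't mix well with this expression.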

A good exercise for the reader would be to account for the possibility of the frame not having 4-padding. That is, have the script analyze the frames to decide if it needs to do the sprintf or not.

Summary

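I'm just going to repeat myself and summarize the regular expression symbols I've gone over:

	.	match any single character
	*	match 0 or more of the previous character
	+	match 1 or more of the previous character
	?	the previous character is optional
	^	match at the beginning of the string
	$	match at the end of the string
	[ ]	match any one of the listed characters (ranges like 0-9 are allowed)
	\d	match any single digit
	\w	match any single alphanumeric character or underscore
	\s	match any single whitespace character
	\.	match a literal dot (the backslash escapes a special character)
	( )	capture the enclosed part of the match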

But if you want to know more, get a real Perl book. There are a lot more options, but the above should be enough to get you started.
© 2001 Steve Hwan, hostname: @pacbell.net, username: svhwan
You should probably use the word "PERL" in the subject line to get my attention.
Last Modified: Sun Dec 2 14:47:32 2001