Work: The Gentle Art of Software Testing

Jeffreys Copeland & Haemer

(Server/Workstation Expert, August 2000)



test (n.) 1. Real users bashing on a prototype long enough to get thoroughly acquainted with it, with careful monitoring and followup of the results. 2. Some bored random user trying a couple of the simpler features with a developer looking over his or her shoulder, ready to pounce on mistakes. Judging by the quality of most software, the second definition is far more prevalent. See also demo.
                        --- The Jargon File, 4.1.0


Zzzz ... Huh? What? Oh. Sorry. We thought someone said we were going to talk about testing, and we fell asleep for a minute.



Fashion

We have tried to stay out of the hustle-and-bustle of high-tech high fashion, but we occasionally find that fashion won't stay out of us. Boring or eccentric areas we were working in that became fashionable before we could get out have included Unix, PC's, portability, internationalization, POSIX, and Linux. Most of our friends and relatives know that we have no taste, and that our early work in these areas just shows that sooner or later, everything becomes fashionable. PC Week is the fulfillment of Andy Warhol's prophecy: ``In the future, everyone will be famous for 15 minutes.'' (Most, but not all. One COBOL programmer has been following us around ever since the last fringe thing we evangelized him with came into vogue: the Internet.)

Unfortunately, it's happening again. For the last couple of months, our friends have been abuzz with talk of an alternative to standard development methodologies called ``eXtreme Programming'' (XP), and last weekend Dave Taenzer gave us a copy of extreme Programming explained [ISBN 0-201-61641-6], by Kent Beck. (Since some object-oriented styles promote MixedCaseIdentifiers, and programming books now look like they're in German, we suppose one way to make book titles stand out is not to capitalize them.)

We read it. For better or worse, it turns out we already are eXtreme Programmers. We suspect this is because many core XP beliefs come down to one of our core methodologies: eXtreme Laziness (XL). A couple of examples:


Beck doesn't mention build tools or configuration management, but we're patient: their 15 minutes are coming. The book does take several other stands guaranteed to make an academic Software Engineering professor's fur stand up. Implementation documentation? Useless. Code ownership? Bad idea. Software architects? Testers? Analysts? Don't want 'em. We won't discuss these because we're not writing a column about XP. Besides, the book is mercifully tiny (190 pp.), and does a better job promoting its religion than we would. You can also find good information at web sites like http://www.extremeprogramming.org.

In this column, the XP fad we want to stand up and applaud is eXtreme Testing. This is nothing less than having developers write and run tests from the beginning (mind you, not just test plans -- actual tests). Even before there's working code.


We Become Testy

In our experience, continuous integration mostly verifies that everyone's code builds together. This by itself is nothing to sneeze at, but it also is a good stepping-stone to higher goals. Once you can build frequently, you can test frequently. Once you can build at a moment's notice, you can test at a moment's notice.

This isn't wishful thinking. Our makefiles have productions that update our code with the latest versions from CVS, do an incremental build to bring our executables up-to-date, and then run an automated test suite.

Once the gear-grinding failure noise subsides to a dull scraping sound, we use cron to run the tests nightly, and configure the tools to send us email about failures. When we're notified of failures, we use CVS to figure out who has changed code since the last successful build, and forward the problems to the guilty parties, who can usually fix them immediately, because the code is fresh in their minds.

Here again, having CVS (or something like it) makes a world of difference. Even when we get a new test that reveals a long-standing bug -- and we do -- if we have an old version that doesn't have the bug, we can use CVS to do a binary search for the checkin that created the problem. Building and testing a version from two months ago is this one-liner:

cvs update -D'2 months ago'; make test
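The binary search itself is easy to automate. What follows is only a minimal sketch in Python, not the tooling we actually use: the names first_bad and fails_tests are made up for illustration, and we assume that ``make test'' exits with a non-zero status when the suite fails and that the bug, once checked in, stays in.

```python
import subprocess
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is true.

    Assumes versions are in chronological order, the first is known
    good, the last is known bad, and the bug never disappears again.
    """
    lo, hi = 0, len(versions) - 1      # versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid                   # bug already present: look earlier
        else:
            lo = mid                   # still clean: look later
    return versions[hi]


def fails_tests(date: str) -> bool:
    """Hypothetical predicate: the one-liner above, run for a given date."""
    subprocess.run(["cvs", "update", "-D", date], check=True)
    # Assumption: a failing suite makes `make test` exit non-zero.
    return subprocess.run(["make", "test"]).returncode != 0
```

Given a list of candidate dates, first_bad(dates, fails_tests) needs only about log2(n) build-and-test cycles to bracket the offending checkin, so two months of daily snapshots take six builds rather than sixty.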

Developers actually love continuous testing, since it gives them a safety net. They can develop with relatively reckless abandon, secure in the knowledge that the test suite will tell them what they've broken -- not in a week or a month, but right away. Finding two concurrent bugs is at least five times harder than finding a single bug.

Dave Taenzer sums it up neatly: ``I like to add my bugs one at a time.''

Notice that we're not testers in this picture -- just mail routers. All that's keeping us in the picture is that we haven't yet figured out how to get the test tools to look in CVS by themselves.

We're often ready to run relatively full suites as soon as we can first integrate, but the suites continue to grow steadily, with new tests added each time we want to add new features, and each time we find new bugs.

Incremental integration and testing also subtly changes those conversations with your management that begin, ``So you've finished writing the code; that means we can ship tomorrow, right?''

Your answers become, ``Yes.''


Regression Testing

Right now, we work on projects that run thousands of tests every night.

A testing atmosphere like this produces a quality ratchet. We've all seen new code -- even simple bug fixes -- break old features. Brooks, in his still-wonderful The Mythical Man-Month (Addison-Wesley, ISBN 0-201-83595-9), cites studies showing that a bug fix introduced into a large system has about a 50% chance of breaking something else that used to work: two steps forward and one step back. Attempts to improve the quality of the product in one area can cause the quality to slip backwards in another. Continuous testing lets you find and fix these cases as soon as they happen: two steps forward and no steps back.

Test suites that compare what software does today with what it did yesterday are called regression tests. It's important (though sometimes confusing to managers) to emphasize that regression testing doesn't test whether something works -- only whether it has changed. Each time we fix a bug, we ``break'' our regression tests. Surprisingly often, a bug fix in one area even fixes other tests that we hadn't guessed it would affect. When it does, we've moved the ratchet forward even more notches.
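To make the ``changed, not correct'' point concrete, here is a small Python sketch of the comparison at the heart of a regression run. The function name, the dictionary layout, and the test names are our own inventions for illustration; real suites store their results in whatever form their tools produce.

```python
def changed_results(yesterday: dict, today: dict) -> dict:
    """Map each test whose outcome changed to an (old, new) pair.

    A test that appears on only one of the two days is reported
    with None for the missing outcome.
    """
    names = sorted(set(yesterday) | set(today))
    return {
        name: (yesterday.get(name), today.get(name))
        for name in names
        if yesterday.get(name) != today.get(name)
    }


# Hypothetical results from two consecutive nightly runs.
yesterday = {"parse": "PASS", "print": "FAIL", "sort": "PASS"}
today = {"parse": "PASS", "print": "PASS", "sort": "FAIL"}

for name, (old, new) in changed_results(yesterday, today).items():
    print(f"{name}: {old} -> {new}")
```

Note that the newly fixed test and the newly broken one are both reported; the comparison has no opinion about which change is an improvement.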

Regression tests aren't the only kinds of tests we run. Let's look at some other useful test types.


Conformance Tests

Conformance tests ask, ``How close are we to where we're trying to go?'' (These are the tests that your manager probably confuses with the regression tests. That conformance tests often serve as the core of a good regression test suite doesn't help unconfuse things.)

For example, if you want to ask whether your operating system is POSIX.1-conforming (or where it's not), you can run the POSIX Conformance Test Suite (PCTS), from the U.S. Department of Commerce's National Institute of Standards and Technology (NIST). The PCTS tests a series of assertions about POSIX.1, listed in ANSI/IEEE 2003.1, and reports which assertions pass and which fail on your system. (Actually, the possible outcomes are PASS, FAIL, UNRESOLVED, UNTESTED, and UNSUPPORTED, each of which the standard defines precisely.)

NIST also supplies conformance test suites for a number of other standards (http://www.itl.nist.gov/div897/ctg/software.htm).

All these are ``free,'' which means that we all pay for the COBOL85 conformance test suite: Your tax dollars at work.

At the other end of the spectrum, capitalism brings us vendors that develop test suites for commercially important de facto and de jure standards. For example, QualityLogic (http://www.qualitylogic.com) makes a good living producing useful and extensive conformance tests for printer languages, like PostScript and PCL. (For those of you who've been involved with printers, QualityLogic is Genoa Technology's new name.)

What if you can't get third-party conformance tests? Sometimes you can generate them automatically. A personal example will illustrate this nicely.

Minolta-QMS recently added a PDF interpreter to their printers. During this project, Haemer needed a PDF-conformance test suite, but none was commercially available. At first, he thought about generating an array of PDF files by hand, but he worried that one person working with one set of tools might not provide enough PDF variety. The solution? A 43-line shell script that used Perl and its WWW::Search and LWP::Simple modules to find and retrieve random PDF files from the web.

But what if automatic generation won't work in your case, and you still lack the tests you need?

You ask your fellow developers to create them.

Wait. The developers?

Correct. For reasons we alluded to above, developers quickly get addicted to continuous testing. Once they're hooked, we find it's easy to get them to generate useful tests.


Performance Tests

We don't know who first said, ``First, make it work, then make it fast,'' but we can now say it quickly.

Thinking carefully about this aphorism leads to an epiphany: even if the latest changes haven't broken anything, your product may not be working very fast any more.

Performance tests ask ``Just how slow are we?'' As soon as you are doing regular regression testing, you can time how long it takes to run each test.

You could use a stopwatch for performance testing, but this, too, can be automated. POSIX's times() lets you track the amount of user and system time that a command and its children take. These times are largely independent of the system load. You can often collect these times while you're collecting the regression data.
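As a rough illustration, Python exposes the same call as os.times(). The sketch below is one way to wrap it, not a description of our own harness; the helper name cpu_time and the sample command are made up. It measures the user and system time consumed by one command and its children.

```python
import os
import subprocess
import sys


def cpu_time(command):
    """Run command; return (user, system) CPU seconds it and its children used."""
    before = os.times()
    subprocess.run(command, check=True)   # waits, so child times are accounted
    after = os.times()
    return (after.children_user - before.children_user,
            after.children_system - before.children_system)


if __name__ == "__main__":
    # Hypothetical workload: a short CPU-bound child process.
    user, system = cpu_time([sys.executable, "-c", "sum(range(10**6))"])
    print(f"user {user:.2f}s  system {system:.2f}s")
```

Because the figures are CPU time rather than elapsed time, they can be logged alongside each night's regression results and compared from run to run. On platforms that don't report child times, both numbers simply come back as zero.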

Those of you working in Perl may also be able to make good use of the Benchmark module.


Progression Tests

We confess. We just made this word up.

We wanted a word to describe a form of testing that we do and never see mentioned in the testing literature. Here's the problem it solves:

Imagine you have a suite of conformance/performance tests that announces the presence of 1000 failures. Your code has bugs.

Your boss says, ``Bugs? BUGS? Fix them!''

After a year's steady work, you drop the number to 500; after a second, you drop it to 100; after a third, you drop it to 0. You also fix the failures in a handful of new regression tests that someone's added in along the way.

``At last!'' your boss tells the CEO, ``Our code is defect-free.''

``At last!'' marketing tells the PR firm, ``Our code is defect-free.''

``At last!'' other programmers who read your company's ads say, ``Proof that all those other companies are run by idiots, too.''

Cutting the number of known failures from 1000 to 500 suggests you've cut the number of bugs substantially. But once you've removed most test suite failures, the suite is only a regression test; fixes to the remaining bugs no longer measure much of anything. You need a measure for bugginess that isn't tainted by people just working to make the measurements look good.

We solve the problem with an old statistics trick: we sequester a random collection of tests that we never fix bugs in. Well, at least, not on purpose. We use these progression tests to measure our progress over time.

Here's how it works: Suppose you take the 1000-failure test suite and set aside half the tests, chosen at random, as progression tests. The rest are conformance tests. Two years later, you've fixed all 500 of the conformance-test bugs. And, without ever addressing them directly, 2/3 of the 500 former failures in the progression suite now pass, too.

``At last!'' your boss tells the CEO, ``Our code is defect-free.''

``Nope,'' you interject, ``but we have fixed all the problems that our conformance suite found. If you want a reasonable guess, our progression tests say we only have about one-third as many bugs in our code as we did two years ago. At this rate, I'd expect that to drop to one-ninth in another two years.''
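The bookkeeping behind that answer is small enough to sketch in Python. The helper names, the seed, and the count of 333 newly passing tests (our stand-in for ``2/3 of 500'') are assumptions made for illustration, and the two-year projection assumes the fix rate holds steady.

```python
import random
from fractions import Fraction


def sequester(tests, seed=None):
    """Randomly split tests into (conformance, progression) halves."""
    shuffled = list(tests)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def fraction_remaining(initial_failures, now_passing):
    """Fraction of the sequestered failures that still fail."""
    return Fraction(initial_failures - now_passing, initial_failures)


# The 1000-failure suite, split at random into two sets of 500.
conformance, progression = sequester(range(1000), seed=2000)

# Two years on, about 2/3 of the 500 progression failures now pass.
remaining = fraction_remaining(500, 333)
print(float(remaining))        # roughly one-third
print(float(remaining ** 2))   # roughly one-ninth, if the rate holds
```

The random split is the important part: because nobody chose which failures went into the progression set, the fraction still failing there is a fair sample of the bugs nobody has been chasing.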

Students of testing will see a similarity to ``error seeding,'' which starts by having a third party put random errors into the code before testing starts. The speed of disappearance of introduced errors can be used to estimate many things, including the efficiency of testers in finding bugs and the distribution and volume of naturally occurring errors.

Suppose we introduce 100 ``random'' bugs and, after a year, the testers find 10%. Suppose also, in that same period, they find 100 bugs we didn't know about. If the introduced bugs are typical of all bugs in the code, then the other 100 bugs represent about 10% of all the bugs that we didn't know about, and there are roughly 900 others that we didn't put in left for us to find.
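Spelled out as a Python sketch (the function name is ours, and the estimate leans entirely on the assumption that seeded bugs are as easy or as hard to find as natural ones), the arithmetic looks like this:

```python
def estimate_unfound(seeded, seeded_found, natural_found):
    """Estimate how many natural bugs remain unfound.

    Assumes testers find natural bugs at the same rate as seeded ones.
    """
    if seeded_found == 0:
        raise ValueError("no seeded bugs found yet; cannot estimate")
    # If seeded_found/seeded of all bugs get found, then the natural
    # bugs found so far are that same share of all natural bugs.
    total_natural = natural_found * seeded / seeded_found
    return round(total_natural - natural_found)


# The numbers from the example above: 100 seeded, 10 of them found,
# and 100 previously unknown bugs found in the same period.
print(estimate_unfound(100, 10, 100))   # 900
```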

The trick with this is the phrase, ``If the introduced bugs are typical of all bugs in the code.'' Error seeding requires a way to introduce representative ``random errors.'' Progression testing sidesteps this by choosing pre-existing natural errors.

Also, because the progression suite includes tests that initially pass, we can measure the frequency of newly-introduced bugs. (Our regression tests are completely incapable of letting us measure this because we fix any newly-introduced bugs in the regression tests immediately.)

In our experience, the hardest thing about maintaining a progression-test suite is convincing your fellow developers not to fix reproducible bugs for which you have perfectly good test cases.

There are many ways to label tests -- application tests, unit tests, system tests, stress tests, acceptance tests ... -- but these are our four favorites: conformance tests, performance tests, regression tests, progression tests.

Let's put this vocabulary to use and write some eXtremely Fashionable (XF) test code with Perl's Test and Test::Harness modules. But not this month.

We usually put lots of code in our columns. This time, just setting the stage brought us over our word limit.

We would be negligent, while talking about testing, not to mention Don Libes' Expect -- a Tcl application that lets you write scripts to drive interactive programs. If you have an interactive program that you need to test, Expect is the tool of choice. Other languages now provide Expect-like extensions, but Expect takes best-of-show. (Though Kevin Cohen, at MapQuest, tells us that Perl's Expect.pm is ``The Realization Of The Full Potential Of Perl.'') Don, not coincidentally, works for NIST, where they do lots of testing. Like the COBOL85 conformance test suite, Expect is free. You can find more information at http://expect.nist.gov.

Finally, before we go, the winner of our June contest was Paul Livesey of Boa FP Systems, Ltd, in Berkshire, England. Paul was the first reader -- despite trans-Atlantic postal latency -- to tell us that the late Robert Coveyou, a mathematician at Oak Ridge National Laboratory and eight-time Tennessee state chess champion, was the source of the quote ``The generation of random numbers is too important to be left to chance.'' Keep those cards and letters coming, folks.

Until next month, Happy Trails.