Test Smells

The book has now been published and the content of this chapter has likely changed substantially.

About This Chapter

In the A Brief Tour narrative I gave a very brief overview of the core patterns and smells in this book. This chapter provides a more detailed introduction to the "test smells" we are likely to encounter on our projects. I introduce the basic concept of test smells before introducing the smells in three broad categories: test code smells, automated test behavior smells and project smells related to automated testing.

An Introduction to Test Smells

In the book "Refactoring: Improving the design of existing code", Martin Fowler documented a number of ways that the design of code can be changed without actually changing what the code does. The motivation for this refactoring was the identification of "bad smells" that frequently occur in object-oriented code. These code smells were described in a chapter coauthored by Kent Beck that started with the famous quote from Grandma Beck: "If it stinks, change it." The context of this quote was the question "How do you know you need to change a baby's diaper?" And so a new term was added to the programmer's lexicon.

The code smells described in "Refactoring" focused on problems commonly found in production code. Many of us had long suspected that there were smells unique to automated test scripts. At XP2001, the paper "Refactoring Test Code" [RTC] confirmed these suspicions by introducing a number of "bad smells" that occur specifically in test code. The authors also recommended a set of refactorings that can be applied to the tests to remove the smells.

This chapter is devoted to providing an overview of these test smells. More detailed examples of each test smell can be found in the reference section.

What's a Test Smell?

A smell is a symptom of a problem. A smell doesn't necessarily tell us what is wrong because there may be several possible causes for a particular smell. Most of the smells in this book have several different named causes; some causes even appear under several smells. That's because a root cause may reveal itself via several different symptoms (or smells).

Not all problems are considered smells and some problems may even be the root cause of several smells. The "Occam's Razor" for deciding whether something really is a smell (vs. just a problem) is the "sniffability test". That is, it must come and grab us by the nose and say "something is wrong here". As we will see in the next section, I have classified the smells based on the kinds of symptoms they exhibit (how they "grab us by the nose").

Based on the "sniffability" criterion, I have chosen to demote some of the test smells listed in prior papers and articles to "cause" status. I have mostly left their names unchanged so we can still refer to them when talking about a particular side-effect of applying a pattern. In this case, it is more appropriate to refer directly to the cause rather than the more general but sniffable smell.

Kinds of Test Smells

Over the years we have discovered that there are at least two different kinds of smells: code smells that must be recognized when looking at code, and behavior smells that affect the outcome of tests as they execute.

Code smells are coding level anti-patterns that a developer, tester or coach may notice while reading or writing test code. The code just doesn't look quite right or doesn't communicate its intent very clearly. Code smells must first be recognized before we can take any action and the need for action may not be equally obvious to everybody. Code smells apply to all kinds of tests including both Scripted Tests (page X) and Recorded Tests (page X). They become particularly relevant for Recorded Tests when we have to maintain the recorded code. Unfortunately, most Recorded Tests suffer from Obscure Test (page X) specifically because they are recorded by a tool that doesn't know what is relevant to the human reader.

Behavior smells, on the other hand, are much harder to ignore because they cause tests to fail (or not compile at all) at the most inopportune times such as when we are trying to integrate our code into a crucial build; we are forced to unearth the problems before we can "make the bar green". Behavior smells, too, are relevant to both Scripted Tests and Recorded Tests.

Code and behavior smells are typically noticed by developers as they automate, maintain and run tests. More recently, we have identified a third kind of smell: one typically noticed by the project manager or customer, who does not look at the test code or run the tests. We call these project smells because they are indicators of the overall health of a project.

What to Do About Smells?

We will always have some smells on our project because some smells will just take too much effort to get rid of. The important thing is that we be aware of the smells and know what causes them so that we can make a conscious decision about which ones we need to pay attention to in order to keep our project running efficiently.

It all comes down to cost vs. benefit. Some smells are harder to stamp out than others; some smells cause more grief than others. We need to eradicate the smells that cause us the most grief because these are the ones that will keep us from being successful. That being said, many of the smells can be avoided by having a sound test automation strategy and by following good test automation coding standards.

While we carefully classified the various kinds of smells, it is important to note that very often we will observe symptoms of each of the kinds of smells at the same time. This is because project smells are the project-level symptoms of some underlying cause. That cause may show up as a behavior smell but ultimately there is probably an underlying code smell that is the root cause of the problem. So the good news is that we have three different ways to identify a problem. The bad news is that it is easy to focus on the symptom at one level and to try to solve that problem directly without understanding the root cause.

A very effective technique for identifying the root cause is "Five Why's" [TPS]. First, we ask why something is occurring. Once we have identified the factors that led to it, we ask why each of those factors occurred. We repeat this process until no new information is forthcoming. In practice, asking "why" five times is usually enough, hence the name "Five Why's". (This practice is also called "root cause analysis" or "peeling the onion" in some circles.)

In the rest of this chapter, we will look at a set of common test-related smells that we are likely to encounter on our project. I will start with the project smells and work down to the behavior smells and code smells that cause them.

A Catalog of Smells

Now that we have a better understanding of test smells and their role in projects that use automated testing, let's look at some smells. Based on the "sniffability" criterion I outlined earlier, I will focus on introducing the smells and will leave the discussions about the various causes to the individual smell descriptions in the reference section of this book.

The Project Smells

Project smells are symptoms of something wrong on the project. Their underlying root cause is likely to be one or more of the code or behavior smells, but since project managers rarely run or write tests, these smells are likely the first hint they get that something may be less than perfect in test automation land.

Project managers focus most on functionality, quality, resources and cost. Hence, the project level smells tend to cluster around these ideas. The most obvious metric a project manager is likely to encounter as a smell is the quality of the software measured in defects found in formal testing or by users/customers. If the number of Production Bugs (page X) is higher than expected, the project manager must ask "Why are all these bugs getting through our safety net of automated tests?"

The project manager may be monitoring the number of times the daily integration build fails as a way of getting an early indication of software quality and adherence to the team's development process. They may get worried if the build fails often and especially if it takes more than a few minutes to get the build fixed. Root cause analysis of the failures may indicate that many of the test failures are not due to buggy software but rather to Buggy Tests (page X). This is an example of the tests crying "Wolf!" and consuming a lot of resources to fix but not actually contributing to increasing the quality of the production code.

Buggy Tests is just one contributing factor to a more general problem of High Test Maintenance Cost (page X) which can severely affect the productivity of the team if not addressed quickly. If the tests need to be modified too often (e.g. every time the system under test (SUT) is modified) or if the cost of modifying tests is too high due to Obscure Tests, the project manager may decide that all the effort and expense being directed toward writing the automated tests would be better spent writing more production code or doing manual testing. At this point, they are likely to tell the developers to stop writing tests. (It can be hard enough to get project managers to buy in to letting developers write automated tests. It is crucial that we don't squander this opportunity by being sloppy or inefficient. This, in a nutshell, is why I started writing this book: to help developers succeed and avoid giving the pessimistic project manager an excuse for putting a halt on automated testing.)

On the other hand, they may decide that the Production Bugs are being caused by Developers Not Writing Tests (page X). This is likely to come out at a process retrospective or a root cause analysis session. Developers Not Writing Tests may be caused by too aggressive a development schedule, developers being told not to "waste time writing tests" by their supervisors or not having the skills to write tests. Other causes could be an imposed design that is not conducive to testing or a test environment that is leading to Fragile Tests (page X). Finally, it could be caused by Lost Tests (see Production Bugs): tests that exist but are not included in the AllTests Suite (see Named Test Suite on page X) used by developers during check-in or by the automated build tool.

The Behavior Smells

Behavior smells are smells we encounter while compiling or running tests. We don't have to be particularly observant to notice them as they will come to us via compile errors or test failures.

The most common behavior smell is Fragile Test. This is when tests that used to pass start failing for some reason or another. The Fragile Test problem is what has given test automation a bad name in many circles, especially when using commercial "Record and Playback" test tools that promise easy test automation. Once recorded, the tests are very susceptible to breakage and often the only remedy is to rerecord them because the test recordings are hard to understand or modify by hand.

The root causes of Fragile Tests can be grouped into four broad categories:

Interface Sensitivity (see Fragile Test) is when the tests are broken by changes to the test programming API or the user interface used to automate the tests. Commercial Record and Playback Test (see Recorded Test) tools typically interact with the system via the user interface. Even minor changes to the interface can cause tests to fail even though a human user would say the test should still pass. This is partly what gave test automation tools a bad name in the past decade.
Behavior Sensitivity (see Fragile Test) is when the tests are broken by changes to the behavior of the SUT. This may seem like a "no-brainer" (of course the tests should break if we change the SUT!) but the issue is that only a few tests should be broken by any one change. If many/most of the tests break, we have a problem.
Data Sensitivity (see Fragile Test) is when the tests are broken by changes to the data already in the SUT. This is particularly a problem for applications that use databases. Data Sensitivity is a special case of Context Sensitivity (see Fragile Test) where the context in question is the database.
Context Sensitivity is when the tests are broken by differences in the environment surrounding the SUT. The most common example is when tests depend on the time or date but it can include the state of devices such as servers, printers or monitors on which the SUT depends.
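
As a minimal sketch of the date-and-time case of Context Sensitivity, assume a hypothetical function `greeting` (invented here for illustration, not taken from the reference section) whose result depends on the hour of the day. A check that relies on the real clock passes or fails depending on when it is run; letting the test supply the time is one possible way to remove that dependency.

```python
import datetime


def greeting(now=None):
    """Hypothetical SUT: the greeting depends on the hour of the day.

    `now` is an assumed injection point; when omitted, the real clock is used.
    """
    if now is None:
        now = datetime.datetime.now()
    return "Good morning" if now.hour < 12 else "Good afternoon"


def context_sensitive_check():
    # Depends on the real clock: passes before noon and fails after it,
    # even though neither the test nor the SUT has changed.
    assert greeting() == "Good morning"


def test_morning_greeting():
    # The test supplies the time, so the surrounding environment no longer matters.
    nine_thirty = datetime.datetime(2011, 2, 9, 9, 30)
    assert greeting(nine_thirty) == "Good morning"
```

The optional parameter is only one design choice; any mechanism that lets the test control the time serves the same purpose.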

Data Sensitivity and Context Sensitivity are examples of a special kind of Fragile Test, a Fragile Fixture, where changes to a commonly used test fixture cause a bunch of existing tests to fail. This increases the cost of extending the Standard Fixture (page X) to support new tests and that discourages good test coverage. Fragile Fixture's root cause is poor test design but it shows up when the fixture is changed rather than when the SUT is changed.

Most agile projects use some form of daily or continuous integration that includes compiling the latest version of the code and running all the automated tests against it. Assertion Roulette (page X) can make it hard to determine how and why tests failed during the integration build because there is insufficient information included in the failure log to clearly identify which assertion failed. This can lead to slow troubleshooting of the build failures since the failure needs to be reproduced in the development environment before we can even start speculating on the cause of the failure.
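
To make Assertion Roulette concrete, here is a small sketch under stated assumptions: the invoice dictionary and both `verify_...` functions are hypothetical, and `failure_text` merely imitates a build log that records only the exception text. The first function uses bare assertions, so a failure says nothing about which check was violated; the second attaches a short message to each assertion.

```python
def verify_without_messages(invoice):
    # Three bare assertions: a failure carries no description of which
    # expectation was violated.
    assert invoice["quantity"] == 5
    assert invoice["unit_price"] == 10
    assert invoice["total"] == 50


def verify_with_messages(invoice):
    # The same checks, each labeled with the value it verifies.
    assert invoice["quantity"] == 5, "quantity"
    assert invoice["unit_price"] == 10, "unit price"
    assert invoice["total"] == 50, "invoice total"


def failure_text(verify, invoice):
    """Return the exception text a log might record, or None if all checks pass."""
    try:
        verify(invoice)
    except AssertionError as error:
        return str(error)
    return None
```

Given an invoice with a wrong total, only the second variant reports text that mentions "invoice total", which is the kind of information that can save reproducing the failure locally.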

A common cause of grief is when tests start failing for no apparent reason. Neither the tests nor the production code has been modified yet the tests are suddenly failing. When we try to reproduce them in the development environment, they may or may not fail. These Erratic Tests (page X) are both very annoying and time consuming. Because there are so many possible causes I will list just a few of them to give you a general idea:

Interacting Tests occur when several tests use a Shared Fixture (page X). They make it hard to run tests individually or to run several test suites as part of a larger Suite of Suites (see Test Suite Object on page X). They can also cause cascading failures (where a single test failure leaves the Shared Fixture in a state that causes many other tests to fail.)
Test Run Wars occur when several Test Runners (page X) are running tests against a Shared Fixture at the same time. They invariably happen at the worst possible times such as when trying to fix the last few bugs before a release.
Unrepeatable Tests provide a different result between the first and subsequent test runs. They may force us to do Manual Intervention (page X) between test runs.
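
The following sketch illustrates how a Shared Fixture can produce both Interacting Tests and Unrepeatable Tests. Everything in it is hypothetical: `shared_cart` stands in for the shared fixture, and `run_in_order` is a tiny substitute for a Test Runner that reports which tests failed.

```python
shared_cart = []  # hypothetical Shared Fixture reused by every test


def shared_test_add_item():
    shared_cart.append("book")
    # Passes only on the first run against an empty cart.
    assert len(shared_cart) == 1


def shared_test_cart_is_empty():
    # Fails whenever shared_test_add_item has already run.
    assert shared_cart == []


def make_cart():
    """Build a fresh fixture per test so no state leaks between tests."""
    return []


def fresh_test_add_item():
    cart = make_cart()
    cart.append("book")
    assert len(cart) == 1


def fresh_test_cart_is_empty():
    assert make_cart() == []


def run_in_order(tests):
    """Minimal stand-in for a Test Runner: returns the names of failed tests."""
    failed = []
    for test in tests:
        try:
            test()
        except AssertionError:
            failed.append(test.__name__)
    return failed
```

Running the two shared tests in the order add, then empty, fails the second test; running the same pair again fails both, because the first run left an item behind. The fresh-fixture variants pass in any order and on every run.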

Another productivity-sapping smell is Frequent Debugging (page X). Automated unit tests should obviate the need to use a debugger in all but rare cases since the set of tests that are failing should make it obvious why the failure is occurring. Frequent Debugging is a sign that the unit tests are lacking in coverage or are trying to test too much functionality at once.

The real value of having Fully Automated Tests (see Goals of Test Automation on page X) is being able to run them frequently. Agile developers doing test-driven development often run (at least a subset of) the tests every few minutes. This behavior is to be encouraged because it makes the feedback loop shorter, thereby reducing the cost of any defects introduced into the code. When the tests require Manual Intervention each time they are run, developers run them less frequently. More changes then accumulate between runs, which increases the cost of finding the defects introduced since the tests were last run.

Another smell that results in the same net impact on productivity is Slow Tests (page X). When the tests take more than roughly 30 seconds to run, developers stop running them after every individual code change and wait for a "logical time" to run them such as before a coffee break, lunch or meeting. This results in a loss of "flow" and once again increases the time between when a defect was introduced and when it was identified by a test. The most frequently used solution to Slow Tests is also the most problematic: a Shared Fixture can result in many behavior smells and should be the solution of last resort.

The Code Smells

Code smells are the "classic" bad smells as first described by Martin Fowler in "Refactoring" [Ref]. These smells must be recognized by the test automater as they maintain test code. Most of the smells introduced by Fowler are code smells. Code smells typically affect maintenance cost of tests but they may also be early warning signs of behavior smells to follow.

When reading tests, a fairly obvious though often overlooked smell is Obscure Test. It can take many forms but all have the same impact: it is hard to tell what the test is trying to do because it does not Communicate Intent (see Principles of Test Automation on page X). This makes the cost of test maintenance much higher and can lead to Buggy Tests when a test maintainer makes the wrong change to the test. A related smell is Conditional Test Logic (page X). Tests should be simple, linear sequences of statements. When tests have multiple execution paths, we cannot be sure exactly how the test will execute in a specific case.
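
A short sketch of why multiple execution paths are a problem, assuming a hypothetical `find_discount` lookup as the SUT. The lookup is passed into each verification function only so that the sketch can also be exercised against a broken SUT; a real test would normally call the SUT directly.

```python
def find_discount(customer_type):
    """Hypothetical SUT: returns a discount rate, or None for unknown types."""
    rates = {"gold": 0.2, "silver": 0.1}
    return rates.get(customer_type)


def verify_gold_discount_conditional(lookup):
    # Conditional Test Logic: when the lookup returns None, the assertion
    # is skipped and the test passes without verifying anything.
    discount = lookup("gold")
    if discount is not None:
        assert discount == 0.2


def verify_gold_discount_linear(lookup):
    # One execution path: a missing discount fails the test outright.
    discount = lookup("gold")
    assert discount is not None, "gold customers should have a discount"
    assert discount == 0.2
```

Against a lookup that always returns None, the conditional version passes silently while the linear version fails, which is the behavior we want from a test.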

Hard-Coded Test Data (see Obscure Test) can be insidious for several reasons. First, it makes tests harder to understand because we need to look at each value and guess whether it is related to any of the other values to understand how we are expecting the SUT to behave. The second problem is when testing a SUT that includes a database. The Hard-Coded Test Data can lead to Erratic Tests if tests happen to use the same database key or Fragile Fixtures if the values refer to records in the database that have been changed.
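
The database case can be sketched as follows. The dictionary `database` is a hypothetical stand-in for a database shared between tests, `insert_customer` imitates a table with a unique key, and `unique_customer_id` is one assumed way of generating a distinct value per test instead of hard-coding it.

```python
import itertools

database = {}  # hypothetical stand-in for a database shared between tests
_sequence = itertools.count(1)


def insert_customer(customer_id, name):
    """Hypothetical SUT call: rejects duplicate keys as a unique index would."""
    if customer_id in database:
        raise KeyError(f"duplicate customer id {customer_id}")
    database[customer_id] = name


def unique_customer_id():
    """Generate a distinct key on every call so tests never share a record."""
    return f"test-customer-{next(_sequence)}"


def create_customer_hard_coded():
    # Hard-Coded Test Data: the literal key collides if another test, or an
    # earlier run of this one, has already inserted it.
    insert_customer(1234, "Alice")
    assert database[1234] == "Alice"


def create_customer_distinct():
    # Each run uses its own key, so repeated or concurrent runs do not collide.
    customer_id = unique_customer_id()
    insert_customer(customer_id, "Alice")
    assert database[customer_id] == "Alice"
```

Calling `create_customer_hard_coded` twice against the same database raises a duplicate-key error on the second call, whereas `create_customer_distinct` can be repeated freely.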

Hard-to-Test Code (page X) may be a contributing factor to a number of other code and behavior smells. The problem is most evident to the person writing the test who cannot find a way to set up the fixture, exercise the SUT or verify the expected outcome. They may be forced to test more software (a larger SUT consisting of many classes) than they would like. When reading a test, Hard-to-Test Code tends to show up as an Obscure Test because of the hoops the test automater had to jump through to interact with the SUT.

Test Code Duplication (page X) is bad because it increases the cost of maintaining tests. We have more test code to maintain and that code is harder to maintain because it often coincides with Obscure Test. It is often caused by cloning tests and not putting enough thought into how to reuse test logic intelligently. (Note that I said "reuse test logic" and not "reuse Test Methods (page X).") As our testing needs emerge, it is important that we factor out commonly used sequences of statements into Test Utility Methods (page X) that can be reused by various Test Methods. (It is equally important that we do not reuse Test Methods as that results in Flexible Tests (see Conditional Test Logic).) Factoring out the shared logic reduces the maintenance cost of tests in several ways.
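
As an illustration of factoring shared statements into a Test Utility Method, consider the sketch below. The SUT `total_price` and the helper `assert_total` are hypothetical names chosen for this example; the point is only that the exercise-and-verify sequence lives in one place while each Test Method stays a short, linear statement of one scenario.

```python
def total_price(items):
    """Hypothetical SUT: sums quantity * unit price over (quantity, price) pairs."""
    return sum(quantity * price for quantity, price in items)


def assert_total(items, expected):
    """Test Utility Method: exercises the SUT and verifies the outcome."""
    actual = total_price(items)
    assert actual == expected, f"expected total {expected}, got {actual}"


def test_single_item():
    assert_total([(1, 10)], 10)


def test_multiple_items():
    assert_total([(2, 10), (3, 5)], 35)


def test_empty_order():
    assert_total([], 0)
```

If the way the SUT is called changes, only `assert_total` needs to be updated rather than every cloned test.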

Test Logic in Production (page X) is undesirable because there is no way to ensure that it won't accidentally be run when it shouldn't. (See the sidebar Ariane (page X) for a cautionary tale.) It also makes the production code larger and more complicated. It may also cause other software components or libraries to be included in the executable.
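
A rough sketch of the smell, with every name invented for illustration: `TESTING` is an assumed global flag, and `RealGateway` and `StubGateway` are hypothetical collaborators. The first function ships a test-only branch with the product; the second shows one possible alternative in which the test supplies a substitute from the outside.

```python
class RealGateway:
    """Hypothetical production collaborator that would contact a payment service."""

    def charge(self, amount):
        raise RuntimeError("no payment service available in this sketch")


class StubGateway:
    """Test-only substitute; it lives with the test code, not in production."""

    def charge(self, amount):
        return "approved"


TESTING = False  # assumed flag name, used only to illustrate the smell


def checkout_with_test_hook(amount):
    # Test Logic in Production: this branch ships with the product and runs
    # whenever the flag happens to be set, intentionally or not.
    if TESTING:
        return "approved"
    return RealGateway().charge(amount)


def checkout(amount, gateway):
    # No test branch in production code; the caller decides which gateway to use.
    return gateway.charge(amount)
```

In a test we would call `checkout(25, StubGateway())`, while production code would pass in the real gateway, so nothing test-specific remains in the executable.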

What's Next?

In this chapter we have seen a plethora of things that can go wrong when automating tests. In the Goals of Test Automation narrative I describe the goals we need to keep in mind as we automate our tests so that we may have an effective test automation experience. That will set us up to start looking at the principles that will help steer us clear of many of these problems.
