In recent years developers have come to understand the importance of testing as a part of the development cycle. People no longer treat tests as a side business that disappears once you get your code out the door, and that's a good thing.
You have probably heard of "black box" testing vs. "white box" testing, which often translates to unit tests vs. integration or "product tests". While unit testing (with all of its variants) has been receiving lots of attention and has many tools and examples to learn from, product testing receives much less coverage.
When faced with the task of building a product test suite, many turn to the popular set of tools for testing in Python: unittest, py.test and nose. In this article I'll try to demonstrate some of the challenges that need addressing when building such frameworks, and where the popular tools fall short of providing reasonable comfort.
Frameworks vs. Loaders and Runners
The first thing we should note is the key difference between py.test, nose and similar tools on the one hand, and unittest on the other. py.test and nose were originally created to make up for the lack of a proper test runner in the first versions of unittest. That is, their goal is to discover and run tests.
unittest, however, is a framework. It provides a skeleton and a basic structure for your tests and testing logic.
This means that when someone says they built their framework using nose, for instance, it usually means that they used a mixture of nose and unittest: the former covering the running phase, the latter taking care of the code structure. In most cases today, then,
unittest is the "weapon of choice" for testing-related activities in Python.
Product Testing != Unit Tests
unittest, as its name suggests, was created for unit tests. Most examples demonstrating it look more or less like this one:
```python
import unittest

class MyFirstTest(unittest.TestCase):
    def test_fibonacci(self):
        self.assertEqual(fibo(6), 8)
```
Before we go on I want to make one thing clear -
unittest is great. It is great to have a built-in framework for such use cases, and a lot of effort has been put into it over the years. My only claim is that while it is great for unit testing, it is not ideal for product testing.
The real world of complete or near-complete products is not about testing math libraries. Real world test code gets long and complex really fast, and requires additional facilities to make it maintainable. I will try to demonstrate several such facilities in the following sections.
Test Method Parameterization
Let's say you want to test a smarter version of
mkdir, which works whether or not a path already exists. You want to make sure it works in both cases, right? Here's the
unittest way of doing this:
```python
class MkdirTest(unittest.TestCase):
    def test_mkdir_path_exists(self):
        os.mkdir(path)
        smart_mkdir(path)

    def test_mkdir_path_does_not_exist(self):
        smart_mkdir(path)

    def tearDown(self):
        self.assertTrue(os.path.isdir(path))
        super(MkdirTest, self).tearDown()
```
To deal with the code duplication above, we can extract the common code into a separate method (even more important if the sequence being tested is more than one statement):
```python
class MkdirTest(unittest.TestCase):
    def test_mkdir_path_exists(self):
        self._test_mkdir(path_exists=True)

    def test_mkdir_path_does_not_exist(self):
        self._test_mkdir(path_exists=False)

    def _test_mkdir(self, path_exists):
        if path_exists:
            os.mkdir(path)
        smart_mkdir(path)

    def tearDown(self):
        self.assertTrue(os.path.isdir(path))
        super(MkdirTest, self).tearDown()
```
This works, of course, but is still far from ideal. For starters, we still have duplicated code (although in another form), and the danger of code bloat still exists (to demonstrate this, imagine you want to test a Cartesian product of 3 parameters)[^for_loops].
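To make this concrete, the for-loop alternative mentioned in the footnote might look like the sketch below. Since smart_mkdir is a hypothetical function, a trivial stand-in is defined here just so the example runs; the point is the structure of the test, not the implementation:

```python
import os
import shutil
import tempfile
import unittest

def smart_mkdir(path):
    # stand-in for the hypothetical function under test:
    # a mkdir that tolerates an already-existing directory
    if not os.path.isdir(path):
        os.mkdir(path)

class MkdirLoopTest(unittest.TestCase):
    def test_mkdir(self):
        # one method looping over all parameter combinations; a failure
        # in the first iteration masks the rest, and the failure report
        # won't say which combination broke
        for path_exists in (True, False):
            workdir = tempfile.mkdtemp()
            self.addCleanup(shutil.rmtree, workdir)
            path = os.path.join(workdir, "subdir")
            if path_exists:
                os.mkdir(path)
            smart_mkdir(path)
            self.assertTrue(os.path.isdir(path))
```

All combinations also share a single setUp/tearDown cycle, which is exactly what the footnote warns about.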
Another example of parameterization is what I call setUp parameterization. Let's assume we're testing a key/value store that can work either in-memory or on a file. In this case we want a set of tests that look the same but run with different setups. The way to do this with pure unittest is by using inheritance:
```python
class _KeyValueTestBase(unittest.TestCase):
    def test_set_value(self):
        self.store.set("a", "b")
        self.assertEquals(self.store["a"], "b")

    def test_delete_value(self):
        self.store.set("a", "b")
        self.store.unset("a")
        self.assertNotIn("a", self.store)

class InMemoryKeyValueTest(_KeyValueTestBase):
    def setUp(self):
        super(InMemoryKeyValueTest, self).setUp()
        self.store = InMemoryKeyValueStore()

class FileKeyValueStoreTest(_KeyValueTestBase):
    def setUp(self):
        super(FileKeyValueStoreTest, self).setUp()
        self.store = FileKeyValueStore("/path/to/store.db")
```
Once again, this is far from ideal. You have to skip forward to read the setup code before reading the test case code, and the use of inheritance tends to make maintainers' lives harder down the road. There is also a potential issue with switching runners here[^underscores].
It also turns out that supporting parameterization on top of unittest is much trickier than it initially seems[^infi.unittest.parameters].
Hooks and Outside Intervention
The need to interfere with critical test lifetime events arises a lot during product tests. It is very reasonable to want to enforce certain things to happen at certain stages of all tests, say during the test setup phase. In unittest, the only reasonable way to do this is to have all your tests derive from a single base class that performs the critical actions in its setUp method. Unfortunately, doing this means that single base tends to get bloated with setup code pretty quickly, and it becomes very hard to decouple the responsibilities into multiple classes/files. Introducing a constraint over inheritance may also hinder future attempts to refactor your class hierarchy.
We would be much better off if we had a hook mechanism through which we could register for events such as test starts and ends.
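For illustration, here is a minimal sketch of such a hook mechanism. All names here are hypothetical; this is not an existing unittest API, just the shape of the thing being wished for:

```python
class Hooks(object):
    """A tiny registry for test lifetime events (hypothetical API)."""

    def __init__(self):
        self._callbacks = {"test_start": [], "test_end": []}

    def register(self, event, callback):
        # subscribe a callback to a named lifetime event
        self._callbacks[event].append(callback)

    def fire(self, event, **kwargs):
        # called by the framework at the appropriate lifetime points
        for callback in self._callbacks[event]:
            callback(**kwargs)

hooks = Hooks()

# any module can now subscribe to events without forcing every
# test to inherit from a special base class:
started = []
hooks.register("test_start", lambda test_name: started.append(test_name))

# the framework's runner, not user code, would normally fire this:
hooks.fire("test_start", test_name="test_can_reset_router")
```

The important property is that registration is decoupled from the test classes themselves; the runner fires the events, and any number of independent modules can listen.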
Plugins for nose and py.test implement such hooks to some degree (though they're not cross-compatible), but since they cannot assume much about the tests they're running, those plugins remain "outsiders" and are logically a part of the loader/runner.
unittest is not very inviting when it comes to implementing such hooks on your own. For example, this is the implementation of
TestCase.run (at least in Python 2.7):
```python
def run(self, result=None):
    orig_result = result
    if result is None:
        result = self.defaultTestResult()
        startTestRun = getattr(result, 'startTestRun', None)
        if startTestRun is not None:
            startTestRun()

    self._resultForDoCleanups = result
    result.startTest(self)

    testMethod = getattr(self, self._testMethodName)
    if (getattr(self.__class__, "__unittest_skip__", False) or
        getattr(testMethod, "__unittest_skip__", False)):
        # ... [9 lines of skipping logic]

    try:
        success = False
        try:
            self.setUp()
        except SkipTest as e:
            self._addSkip(result, str(e))
        except KeyboardInterrupt:
            raise
        except:
            result.addError(self, sys.exc_info())
        else:
            try:
                testMethod()
            except KeyboardInterrupt:
                raise
            # ... [21 lines of exception handling nuances]
            except:
                result.addError(self, sys.exc_info())
            else:
                success = True

            try:
                self.tearDown()
            except KeyboardInterrupt:
                raise
            except:
                result.addError(self, sys.exc_info())
                success = False

        cleanUpSuccess = self.doCleanups()
        success = success and cleanUpSuccess
        if success:
            result.addSuccess(self)
    finally:
        result.stopTest(self)
        if orig_result is None:
            stopTestRun = getattr(result, 'stopTestRun', None)
            if stopTestRun is not None:
                stopTestRun()
```
Aside from the fact that this is an 80-line-long function, it contains the entire state implementation of the unittest runner. It is also not easy to augment through inheritance (I think we can all agree that the template method pattern would be very much appreciated here). Adding hooks for critical lifetime events requires you to hack around this implementation or duplicate it completely[^hooks_impl].
Some people may argue that unittest intends for run to be a polymorphic method, and that you don't have to use the default run implementation of TestCase. How can this make sense? It's the features of TestCase that bring people to use unittest in the first place. Without the default run logic, TestCase is just a bunch of assertion methods awkwardly bound in a class.
Helper Libraries and Shared State
As the amount of code repeated across your tests increases, you will want to extract it into helper libraries.
Let's say we're testing a home router (through a wrapper library, of course), and we're faced with this code to begin with:
```python
class RouterResetTest(TestCase):
    def setUp(self):
        super(RouterResetTest, self).setUp()
        self.router = RouterAPI(ROUTER_IP_ADDR)

    def test_can_reset_router(self):
        self.router.soft_reset()
        self.router.wait_until_reset_over()
        self.assertTrue(self.router.is_on())
        self.assertIn("PWR", self.router.get_blinking_led_names())
        self.assertEquals(self.router.get_ip_addr(), ROUTER_IP_ADDR)

    def test_can_upgrade_router(self):
        self.router.download_and_install_upgrade()
        self.assertTrue(self.router.is_on())
        self.assertIn("PWR", self.router.get_blinking_led_names())
        self.assertEquals(self.router.get_ip_addr(), ROUTER_IP_ADDR)
```
This is typical for complex products. We want to check a bunch of things every now and then, sometimes in the middle of our test cases. Even for the novice programmer this should scream DRY. For a single case this is easily solved by extracting into a method. When the pattern spans multiple cases and files, though, we are left with either inheritance or decomposition. Since we want to avoid inheriting from a single base class, we would like to have this code extracted like so:
```python
from router_test_utils import assert_router_operational

class RouterResetTest(TestCase):
    def test_can_reset_router(self):
        self.router.soft_reset()
        self.router.wait_until_reset_over()
        assert_router_operational(self.router)

    def test_can_upgrade_router(self):
        self.router.download_and_install_upgrade()
        assert_router_operational(self.router)
```
When we try to implement assert_router_operational, though, we run into a problem:

```python
def assert_router_operational(router):
    ...  # ???
```
Oops, we no longer have self. There are some workarounds (pass self around to non-method functions, or skip self.assert* and just raise AssertionError(...) on your own), but they are just that: workarounds. The self.assert* assertions are actually quite useful (providing verbose diffs, formatting and more). The problem worsens when we want to use other
TestCase features like cleanups:
```python
def shut_down_router(router):
    """Shuts down the router, and causes it to be
    powered on again before the next test"""
    router.power_off()
    addCleanup(router.power_on)  # can't use this. addCleanup is a method of TestCase
```
These examples underscore what I think is the underlying problem here: unittest has poor handling of global state (which makes sense; why would you need global state when unit testing?), and thus black box testing becomes unnatural and expensive. A few more examples:
- You may want to initialize your product wrapper once for the entire suite, because initialization is expensive (VMWare vSphere, for instance, pulls several megabytes of schema XMLs upon initialization)
- You may want to preserve some kind of history or statistics over what your tests actually do
- You may want to register hooks from previous tests to be checked throughout the suite
You may also want to get hold of the currently running test from outside utilities, for cases where they must query various attributes relating to the current test (like the protocol version needed, etc.). This, too, is not easily achieved without hacks[^current_tests].
In the real world tests need to interface with other parts of your workflow. Results need to be reported to some aggregation service, identifying the user initiating the tests requires authentication services, and much, much more. You may also be interested in interfacing with build systems, continuous integration solutions, configuration management, inventory tracking, issue reporting/bug tracking -- the only limit is the complexity of your organization.
unittest itself contains no convenient way to add these things - it lacks plugins, hooks, convenience helpers or anything of the sort. Some features are really difficult to support, like saving the logs of each executed test to a separate log file to make investigation easier.
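For example, here is roughly what per-test log files look like when rolled by hand - boilerplate every test class must inherit or repeat, because there is no plugin hook to put it in (the class and the log location are hypothetical):

```python
import logging
import os
import tempfile
import unittest

created_logs = []  # for demonstration, so the result can be inspected

class LoggedTest(unittest.TestCase):
    def setUp(self):
        # give every test its own log file, named after the test id
        log_dir = tempfile.mkdtemp()
        self.log_path = os.path.join(log_dir, self.id() + ".log")
        created_logs.append(self.log_path)
        self._handler = logging.FileHandler(self.log_path)
        logging.getLogger().addHandler(self._handler)
        # cleanups run in reverse order: detach first, then close
        self.addCleanup(self._handler.close)
        self.addCleanup(logging.getLogger().removeHandler, self._handler)

    def test_something(self):
        logging.getLogger().warning("investigating...")
```

Every line of this must be repeated (or inherited from yet another base class) in every test module that wants the behavior.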
A framework for product testing should make these additions very easy to achieve, with as little coding or hacking as possible.
Fighting Implicitness and Gotchas
Production test suites must not fail silently, hide tricky behavior or contain lots of "you should have known about that"s. Unfortunately, both nose and py.test are susceptible to these pitfalls, since they try to achieve too much in too many possible scenarios[^nose_flexibility].
One example of such tricky cases is the way nose loads test classes imported across modules. Given the following two test files:
```python
##### module1.py
import unittest

class MyTest(unittest.TestCase):
    def test(self):
        self.assertIs(type(self), MyTest)

##### module2.py
from module1 import MyTest
```
Trying to run
module2.py will result in a failure:
```
F
======================================================================
FAIL: test (module2.MyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  ....
    self.assertIs(type(self), MyTest)
AssertionError: <class 'module2.MyTest'> is not <class 'module1.MyTest'>
```
In this case
nose dynamically replaces the
MyTest class in
module2.py with a subclass of the same name[^transplant_class].
Another case is the questionable default in
nose to silently skip executable files. This means that if a coworker commits this file to your codebase:
```python
import unittest

class MyTest(unittest.TestCase):
    def test(self):
        assert False

if __name__ == "__main__":
    unittest.main()
```
And forgets that it is executable, it will be silently ignored by all nosetests executions, leading everyone to believe everything is ok. Even the implicit, heuristic nature of the loading process employed by runners is tricky - we can accidentally make a typo and name a test function tesst_something instead of test_something, and it will be silently skipped. This is very easy to miss if you have a large number of tests (which you should).
Not all of the issues mentioned disturb all developers with the same intensity. It should also be obvious that the scale of infrastructure developed in each shop is different, and the exact constraints differ as well.
For me and the teams in which I participated, these issues kept piling up. At some point I started feeling like we were maintaining a pile of patches rather than engineering a proper software solution.
Eventually I decided to create a new project to provide a decent foundation for product tests, one which departs from unittest and related tools. In the next article I will present it and demonstrate some of the project's goals.
Many special thanks go to @aknin who proofread this article and provided many great tips and suggestions.
[^unittest2]: Whenever I say unittest I actually mean unittest2, which was merged into the standard library unittest in Python 2.7.
[^for_loops]: Some argue that the solution to this is just a for loop (or nested for loops). The problem is that you often want a single method per case - having setUps and tearDowns run properly for each combination, etc. You also prefer to see exactly which case fails when it does, along with the exact parameter combination that triggered the failure.
[^underscores]: The underscore in
_KeyValueTestBase is added so that
nose won't try to load and run the base class. This would fail since the class is incomplete. Unfortunately other runners ignore this rule and attempt to run
_KeyValueTestBase, making it harder for you to switch between runners painlessly.
[^infi.unittest.parameters]: Not too long ago I tried to solve this issue in a small wrapper I wrote. It introduced this syntax:
    ```python
    class SomeTest(TestCase):
        @parameters.iterate("a", [1, 2, 3])
        @parameters.iterate("b", [4, 5, 6])
        def test(self, a, b):
            pass  # 9 variations will run here...
    ```

    The problem is that the entire foundation of `unittest`, namely the `TestCase` class, is constructed using the test method name alone, so creating multiple cases from a single method requires you to patch or replace the `TestLoader` instance used to load tests. Even if you implement one correctly, you have to tie all loose ends (namely the different loaders used by `nose`, `py.test` and others), and you end up breaking third-party tools (like rerunning failed tests on nose, or running specific tests via the command line).
[^hooks_impl]: Actually you can modify the
TestResult object to fire hooks/events whenever its methods are called, but then you're faced with forcing a specific result type regardless of the runner/loader you're using. It's also quite tricky to do correctly.
[^current_tests]: Holding the "current test" in a global variable seems easy, but is actually tricky given the various states the test can be in. For unittest this is especially tricky given the state in which cleanups and tearDowns are run.
[^nose_flexibility]: nose and py.test take pride in the fact that they can run almost any kind of test - test functions, unittest cases, objects that look like tests and more. This introduces a very demanding set of scenarios to be supported. You can read the nose codebase if you want to get a feel for how many corner cases it tries to deal with.
[^transplant_class]: I originally encountered this while trying to implement an abstract base class in infi.unittest, and ended up hacking around it. I reported the issue here. At the time of writing there is still no response about this...