3 February 2018
Over the last couple of years I've spent a lot of time thinking about different aspects of programming. Even though my main occupation (being an iOS engineer) inevitably limits my perspective, I've tried to focus on things that are applicable to a broader range of platforms and languages.
I wanted to start the series with something that influenced me the most in recent years and led me to a deeper understanding of programming as a whole — testing. I'll mostly be talking about unit testing (which I have the most experience with) and will briefly touch on other kinds of testing (which I'm not very fond of, perhaps because I lack said experience).
What Are Tests?
If we want to come up with a definition for something, it had better be constructive and actionable, that is, have a real, tangible impact on what you're doing every day. It's also useful for the definition to be more or less objective: something that has a higher chance of being perceived similarly by multiple people instead of ultimately boiling down to a matter of personal preference.
The definition of unit tests that I found particularly useful is this:
Unit tests are a functional specification of code that is always synchronised with the code and is checked regularly and in automatic fashion.
In other words, I see tests mostly as a means of communication between developers, and here's what a good test tells me:
- this is the expected behaviour of the system under test (SUT), and
- this behaviour is intentional
The second point may be less obvious than the first one, but for me it's extremely important. Another way to phrase this would be that if I see a particular aspect of behaviour that doesn't have a corresponding test case (mind you, this is not the same thing as coverage), it's a strong signal that, for some reason, the author just decided to spend their time writing code that wasn't really necessary. Ideally, every line of code should have some justification for being committed to a project, and unit tests provide you with a way of communicating this justification to others: this is why I had to write this line, this is why I decided to check this condition, and so on. The fastest code and the code with the fewest bugs is the code that was never written in the first place, you know.
In other words, I see unit tests as a thought process of a developer, their understanding and expectations of the system, formalised in the form of (test) code, which can be reproduced by the computer or another developer. And, like any other communication, it can be more or less clear, and more or less valuable.
Aside on type systems
There have been multiple discussions on the Internets about whether we still need a sophisticated type system in a language if we have an extensive test suite, and vice versa. I believe this is a false dichotomy: in reality both tests and the type system serve the same purpose — communication.
The main difference is that tests exercise behaviour, while a type system makes proofs. The more advanced the type system, the more can be expressed with it, and the fewer tests we need to write. The main advantage of a sophisticated type system is that it allows us to make invalid states in the system not just harder to reach, but unrepresentable. This means that if a program compiles, it is already free of many kinds of errors: you don't have to write the code that handles such cases, and you don't have to write the corresponding tests.
I'd like to specifically single out the following type system properties:
Discriminated unions (or sum types, or enumerations), which represent a mutually exclusive set of cases (each with its own associated value). This can range from using an enumeration to model HTTP methods instead of raw strings (bringing the number of possible states down from practically infinite to 8 or 9), to representing the states of an app screen, so that it's impossible to show a loading indicator on top of an error message, because you don't even have access to the data of one state while you are in another (see the sketch after these two properties). Finally, there is a special case of sum type that is very important — optional values. It's hard to overstate how many issues in applications stem from rogue nil values. This includes the usual null pointer dereferences in C++ or NPEs in Java, and the much sneakier nil messaging behaviour in Objective-C. Making this notion explicit in the type system ensures you never miss that crucial null check, or at least points you to the exact place in the code where the unexpected null value originated, not several stack frames deep (or even several run loop iterations later), where you actually tried to use it.
"Unforgeable" objects (or inability to create objects "out of thin air" by means of reflection or cast between unrelated types), which means you can't create and pass around objects that haven't been explicitly handed to your code by the caller or improperly / incompletely initialised. In other words, the only way to construct an instance is to call a type constructor, which gives it the ability to establish all necessary invariants, so all other code doesn't have to check that they hold across the lifetime of an application.
So, what's the catch with type systems? I'd say that usually it's the complexity of the compiler and longer compilation times as a result.
When to Write Tests?
There are two main approaches to writing tests: before the code or after it. I believe that writing tests after the code severely diminishes their value, for two reasons:
If you don't start with a failing unit test, you can't be sure that the test is really testing anything. I have seen many cases where a test would not actually fail even if you changed the corresponding behaviour of the SUT. Sometimes this happens because a necessary aspect of the behaviour is already present in the system (which is good — you don't need to write any more code). But, unfortunately, most of the time it indicates a mistake in the test itself, for example a particular set of test data that fails to trigger the necessary behaviour.
Writing tests after the code usually leads to writing more code than strictly necessary, which is a bad thing, since it makes the code base harder to support, especially when no one remembers why that code was needed in the first place.
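As a rough sketch of what starting with a failing test looks like (the `PriceFormatter` below is a made-up example, not from any real project):

```swift
import XCTest

// Step 1: specify the desired behaviour before the code exists. At this
// point the test does not even compile, which is the strongest possible
// "red".
final class PriceFormatterTests: XCTestCase {
    func test_ZeroAmount_FormatsAsFree() {
        let formatter = PriceFormatter()

        let result = formatter.string(for: 0)

        XCTAssertEqual(result, "Free")
    }
}

// Step 2: write just enough code to make the test pass, and nothing more.
struct PriceFormatter {
    func string(for amountInCents: Int) -> String {
        return amountInCents == 0 ? "Free" : "\(amountInCents)"
    }
}
```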
What (Not) to Test?
Theoretically, all the code that you control can be tested; however, it may require some transformations first (more on that later) to make testing easier and more reliable. Things get worse when the code you're trying to test touches other code that you don't control directly. This is usually fine for "inert" data structures like arrays, maps and strings, but most of the time that other code is not ready to be tested. Here are some common ways in which system or third-party frameworks make testing harder:
- shared global mutable state
- asynchronous APIs
- requiring system objects that cannot be easily or cheaply created
All this adds up to tests that are harder to write, slower to run and, perhaps most importantly, harder to maintain. And the worst thing is that you can't really fix it, since it's code you don't control. Therefore, I usually try to keep platform and third-party frameworks out of the code I'm planning to test (and, correspondingly, don't test the code that heavily depends on these frameworks). 100% coverage is a non-goal, especially when it comes at the cost of maintainability.
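One common way to achieve this isolation is to hide the uncontrolled framework behind a small protocol that you own. A minimal sketch, with illustrative names (`HTTPClient`, `ProfileLoader`):

```swift
import Foundation

// The SUT depends on this tiny protocol instead of talking to
// URLSession directly (shared state, asynchrony, objects that are
// expensive to create).
protocol HTTPClient {
    func get(_ url: URL) throws -> Data
}

final class ProfileLoader {
    private let client: HTTPClient

    init(client: HTTPClient) {
        self.client = client
    }

    func loadName(from url: URL) throws -> String {
        let data = try client.get(url)
        return String(decoding: data, as: UTF8.self)
    }
}

// In tests, a trivial stub stands in for the real framework, so the
// test is fast, synchronous and deterministic.
struct StubHTTPClient: HTTPClient {
    let response: Data

    func get(_ url: URL) throws -> Data {
        return response
    }
}
```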
In an ideal world we would just write tests, and they would do their job of being a communication mechanism and catching regressions. In reality, however, as your test suite grows, you need to spend a non-trivial amount of time supporting it. This obviously includes changing tests when requirements change, but the biggest problem, I believe, is dealing with test failures. Here I'm not talking about a failing test as the first step of the TDD process, but about tests that fail after you've made your changes. This includes:
- figuring out why a test has failed and how this failure is connected to your changes
- actually fixing the test
The difficulty of doing this is often exacerbated by running the test suite asynchronously (because it is too slow to run synchronously and more often), with developers switching to other tasks and losing the relevant context.
Imagine doing all that just to find out that it was "just a flaky test", with the usual solution being to simply re-run the suite. How many man-hours have been wasted this way? What is even the point of a test that fails for random reasons, even when the code hasn't changed? I believe this is a very important problem that significantly reduces the value of a test suite, and it needs to be accounted for from the very beginning if you plan to employ testing in your project.
Aside on asynchronous tests
I've always found asynchronous code to be a major source of reliability problems in tests. This stems from the very nature of asynchronicity:
- race conditions can make these tests even more flaky than usual
- whatever timeout you choose for waiting on expectations is never enough (it may be enough in the simulator on your local machine, but not in the simulator on CI, or on a device, and so on)
What I find most troublesome is that all of these problems can be avoided quite easily, while at the same time giving you more opportunities to test all aspects of the SUT's behaviour. I will talk more about this in the next notes.
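As a rough sketch of one such technique (the names are illustrative): make the SUT's notion of "where work runs" injectable, so that tests can substitute a synchronous implementation and avoid waiting on timeouts altogether.

```swift
import Dispatch

// The SUT schedules work through this protocol instead of calling
// DispatchQueue directly.
protocol Executing {
    func execute(_ work: @escaping () -> Void)
}

// Production implementation: hop to a real queue.
struct QueueExecutor: Executing {
    let queue: DispatchQueue

    func execute(_ work: @escaping () -> Void) {
        queue.async(execute: work)
    }
}

// Test implementation: run the work immediately, so there is nothing
// to wait for and no timeout to guess.
struct ImmediateExecutor: Executing {
    func execute(_ work: @escaping () -> Void) {
        work()
    }
}
```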
Higher-level tests, such as integration or UI tests, are very promising when you want to get maximum coverage with the least effort. However, even though I like the idea in general, I have yet to see a working process for these kinds of tests. I see the following set of (all too familiar) problems:
- even higher flakiness due to things that cannot be formalised easily (like "something appeared on the screen" or "the web page has loaded") and the sheer amount of code that is executed during a test
- since these tests are usually much slower to run, there is a longer feedback cycle between making a breaking change, the tests catching the regression, and the developer getting round to fixing the breakage
- still depend on implementation details (accessibility identifiers, "page objects" etc.) that need to be kept in sync
- can only cover a limited number of cases since more exhaustive testing is prohibitively expensive due to exponential explosion in the number of test cases
To summarise, I believe that higher-level tests can be useful in certain cases, but they provide too little additional value over a solid suite of unit tests to justify the expensive maintenance process.
Practical Tips
I also wanted to share several practical tips on writing good unit tests. Some of them are more important than others, but in no particular order:
Naming test methods is important. Ideally, each test name should clearly state 1) its preconditions and 2) its expected outcome. This is a bit easier in RSpec-based frameworks, where you don't have to invent your own naming scheme, since the framework already provides this separation in the form of `it` blocks. In XUnit-based frameworks, however, you are usually forced to lump everything into the name of the test method. To make things a little easier to read, I use the following convention: `test_<Precondition>_<ExpectedOutcome>`, separating the two parts with an underscore. A couple of examples:
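(The scenarios below are hypothetical; they are only meant to illustrate the convention.)

```swift
import XCTest

final class ResponseParserTests: XCTestCase {
    // Precondition: the response body is empty.
    // Expected outcome: parsing yields an empty list of items.
    func test_EmptyResponse_ReturnsEmptyItemList() { }

    // Precondition: the response body is not valid JSON.
    // Expected outcome: parsing reports a parsing error.
    func test_MalformedResponse_ReportsParsingError() { }
}
```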
Another common problem with test names is a missing precondition and/or expected outcome, as in `testResponseParsing` or (equally useless) `testResponseParsingWorksCorrectly`. Apart from failing to communicate which scenario exactly is being tested, tests with names like this tend to accumulate a lot of unrelated checks and assertions (since almost anything suits the name!). This also makes failures harder to investigate, because you need to look into the body of the test to figure out what went wrong.
Even more alarming is having no assertions in a test at all. Usually the intention in this case is not to specify behaviour, but to see if anything goes wrong while running the test, such as exceptions or asserts firing. I'm not keen on this kind of testing, since it's all too easy to forget what the original purpose of the test was and how to interpret its failures, because not even the author knows!
I've found that it's a good idea to give each test the following structure: 1) all the set-up necessary to reproduce the preconditions, 2) triggering the behaviour that is being tested, and finally 3) asserting the expected outcomes. These three parts are often called Arrange, Act and Assert, and I usually separate them with a blank line as a sort of visual cue. Sometimes it may be a good idea to "inline" one part into another, especially if the test is quite short. Also, both RSpec- and XUnit-based test frameworks give you the ability to factor out common set-up code from several tests, so you don't have to repeat yourself in each individual case.
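Here is what this structure might look like in an XCTest-based test; the `Cart` example is made up purely for illustration:

```swift
import XCTest

struct Product: Equatable {
    let id: Int
    let name: String
}

struct Cart {
    struct Line {
        let product: Product
        var quantity: Int
    }

    private(set) var lines: [Line] = []

    mutating func add(_ product: Product) {
        if let index = lines.firstIndex(where: { $0.product == product }) {
            lines[index].quantity += 1
        } else {
            lines.append(Line(product: product, quantity: 1))
        }
    }
}

final class CartTests: XCTestCase {
    func test_AddingSameProductTwice_IncreasesQuantityInsteadOfDuplicatingLine() {
        // Arrange: reproduce the preconditions.
        var cart = Cart()
        let tea = Product(id: 1, name: "Tea")

        // Act: trigger the behaviour under test.
        cart.add(tea)
        cart.add(tea)

        // Assert: check the expected outcomes.
        XCTAssertEqual(cart.lines.count, 1)
        XCTAssertEqual(cart.lines.first?.quantity, 2)
    }
}
```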
Finally, test code is still code, so it's also "eligible" for de-duplication, refactoring and other treatments (this is even formalised as the third step of the TDD process).
Common Questions
"It's easy to test that 2 + 2 = 4, but real projects are not that simple."
Well, you're absolutely right. However, I think the important question here is why 2 + 2 = 4 is easy to test, and whether we can achieve the same degree of testability for real projects. I believe that we can; more on that in upcoming notes.
"We need 100% coverage."
Test coverage has the nice advantage of being easy to measure, but other than that I don't find high coverage numbers particularly useful. That being said, lines or branches that are not covered are definite signs of missing tests or of code that is not actually used.
"How do I test that a private method of the SUT was called, or that a private field has changed its value?"
I'm sorry, but you don't. Everyone seems to agree that you shouldn't test the implementation details of your SUT, so that you don't have to change your tests when those details change. However, this question keeps popping up, mostly because:
- multiple tests share the same expected outcome, triggered from the same private method
- the private method in question triggers a side effect that is too expensive or irreversible
I don't have a good solution for the first problem (though it can be argued that it's not a given that all those preconditions will always share the same expected outcomes). The second one, however, is a signal of an inability to properly isolate the SUT during a test.
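A sketch of one way out of the second problem (all names here are illustrative): lift the expensive side effect behind a protocol that the SUT depends on, and observe the interaction from the outside with a recording test double.

```swift
// The side effect we cannot afford to trigger for real in a test.
protocol PaymentGateway {
    func charge(amountInCents: Int)
}

final class Checkout {
    private let gateway: PaymentGateway

    init(gateway: PaymentGateway) {
        self.gateway = gateway
    }

    func placeOrder(totalInCents: Int) {
        // Previously a hidden private call; now an observable
        // interaction with a dependency.
        gateway.charge(amountInCents: totalInCents)
    }
}

// A test double that records calls instead of charging real money.
final class PaymentGatewaySpy: PaymentGateway {
    private(set) var chargedAmounts: [Int] = []

    func charge(amountInCents: Int) {
        chargedAmounts.append(amountInCents)
    }
}
```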
"We need a framework for tests."
Not really. Everything these frameworks do can be recreated manually, and this will be neither the hardest nor the longest part of the whole testing process.
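As a rough illustration of how little is actually required, here is a toy hand-rolled harness in a few lines of Swift (a sketch, not a replacement for XCTest):

```swift
// A global list of failure messages collected while tests run.
var failures: [String] = []

func expect(_ condition: Bool, _ message: String) {
    if !condition {
        failures.append(message)
    }
}

func runTests(_ tests: [(name: String, body: () -> Void)]) {
    for test in tests {
        let failuresBefore = failures.count
        test.body()
        let status = failures.count == failuresBefore ? "PASSED" : "FAILED"
        print("\(test.name): \(status)")
    }
}

runTests([
    ("addition works", { expect(2 + 2 == 4, "2 + 2 should equal 4") }),
])
```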
Thank you for reading and stay tuned for Part II!