Automatically Testing Software
Overview
Teaching: 30 min
Exercises: 20 min
Questions
Does the code we develop work the way it should do?
Can we (and others) verify these assertions for themselves?
To what extent are we confident of the accuracy of results that appear in publications?
Objectives
Explain the reasons why testing is important
Describe the three main types of tests and what each is used for
Implement and run unit tests to verify the correct behaviour of program functions
Introduction
Being able to demonstrate that a process generates the right results is important in any field of research, whether it’s software generating those results or not. So when writing software we need to ask ourselves some key questions:
- Does the code we develop work the way it should do?
- Can we (and others) verify these assertions for themselves?
- Perhaps most importantly, to what extent are we confident of the accuracy of results that software produces?
If we are unable to demonstrate that our software fulfills these criteria, why would anyone use it? Having well-defined tests for our software is useful for this, but manually testing software can prove an expensive process.
Automation can help, and automation where possible is a good thing - it enables us to define a potentially complex process in a repeatable way that is far less prone to error than manual approaches. Once defined, automation can also save us a lot of effort, particularly in the long run. In this episode we’ll look into techniques of automated testing to improve the predictability of a software change, make development more productive, and help us produce code that works as expected and produces desired results.
What Is Software Testing?
For the sake of argument, if each line we write has a 99% chance of being right, then a 70-line program will be wrong more than half the time. We need to do better than that, which means we need to test our software to catch these mistakes.
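To see where that number comes from: if each of 70 lines is independently right with probability 0.99, the chance that the whole program is right is 0.99 raised to the power of 70. A quick check of the arithmetic:

p_correct = 0.99 ** 70
print(p_correct)  # ~0.495, i.e. the program is wrong more than half the time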
We can and should extensively test our software manually, and manual testing is well-suited to testing aspects such as graphical user interfaces and reconciling visual outputs against inputs. However, even with a good test plan, manual testing is very time consuming and prone to error. Another style of testing is automated testing, where we write code that tests the functions of our software. Since computers are very good and efficient at automating repetitive tasks, we should take advantage of this wherever possible.
There are three main types of automated tests:
- Unit tests are tests for fairly small and specific units of functionality, e.g. determining that a particular function returns output as expected given specific inputs.
- Functional or integration tests work at a higher level, and test functional paths through your code, e.g. given some specific inputs, a set of interconnected functions across a number of modules (or the entire code) produce the expected result. These are particularly useful for exposing faults in how functional units interact.
- Regression tests make sure that your program’s output hasn’t changed, for example after making changes to your code to add new functionality or fix a bug.
For the purposes of this course, we’ll focus on unit tests. But the principles and practices we’ll talk about can be built on and applied to the other types of tests too.
Set Up a New Feature Branch for Writing Tests
We’re going to look at how to run some existing tests and also write some new ones, so let’s ensure we’re initially on the develop branch we created earlier. Then we’ll create a new feature branch off develop called test-suite - a common term for a set of tests - which we’ll use for our test writing work:
$ git checkout develop
$ git branch test-suite
$ git checkout test-suite
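As an aside, creating a branch off develop and switching to it can also be done in a single command, equivalent to the three commands above:

$ git checkout -b test-suite develop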
Good practice is to write our tests around the same time we write our code on a feature branch. But since the code already exists, we’re creating a feature branch for just these extra tests. Git branches are designed to be lightweight, and where necessary, transient, and use of branches for even small bits of work is encouraged.
Later on, once we’ve finished writing these tests and are convinced they work properly, we’ll merge our test-suite branch back into develop.
Don’t forget to activate our venv environment, launch Jupyter Notebook, and let’s see how we can test our software for light curve analysis.
Lightcurve Data Analysis
Let’s go back to our lightcurve analysis software project.
Recall that it contains a data directory, where we have observations of presumably variable stars, namely RR Lyrae candidates, coming from two sources: the Kepler Space Telescope and LSST Data Preview 0.
Don’t forget about the best practices
Following the best practices from the corresponding section, let’s start with creating a new notebook and drafting its structure. Using headers in the markdown cells, determine the sections of your notebook.
Solution
We can start with the following sections:
- Imports
- Params
- Data loading
- Data inspection
- Selecting light curves for a single object
- Trying the models.py functions
- Test development
If we need something else, we can always add it later.
Now let’s open our data and have a look at it. For this we will use the pandas package. Import it, open the lsst_RRLyr.pkl catalogue and have a look at the format of this table.
Don’t forget to put your code in the sections where it belongs!
import pandas as pd
lc_datasets = {}
lc_datasets['lsst'] = pd.read_pickle('data/lsst_RRLyr.pkl')
lc_datasets['lsst'].info()
lc_datasets['lsst'].head()
We can see that the dataset contains 11177 rows (‘entries’) and 12 columns.
The lc_datasets['lsst'].info() method also informs us about the types of the data in the columns, as well as about the number of non-null values in each column. Having a look at the top 5 rows (lc_datasets['lsst'].head()) gives us an impression of what kind of values we have in each column.
For now there are four columns that we’ll need:
- ‘objectId’, which contains identifiers of the observed objects;
- ‘band’, which informs us about the band in which the observation is made;
- ‘expMidptMJD’, which contains the time stamp of the observation;
- ‘psfMag’, which contains the measured magnitudes.
Let’s assume that we want to know the maximum measured magnitudes of
the light curves in each band for a single
object. Our dataset contains observations in all bands for a number of sources, so we have
to a) pick only one source, and b) separate the observations in different bands from each other.
There are many ways to do this, but for the purposes of this episode we will store the single-source observational data for each band in a dictionary and then apply the max_mag function defined in our models.py file.
First, pick an id of the object that we will investigate.
### Pick an object
obj_id = lc_datasets['lsst']['objectId'].unique()[4]
And then store its observations in each band as items of a dictionary lc.
### Get all the observations for this obj_id for each band
# Create an empty dict
lc = {}
# Define the band names
bands = 'ugrizy'
# For each band create a bool array that indicates that
# this observation belongs to a certain object and is made in a
# certain band
for b in bands:
    filt_band_obj = (lc_datasets['lsst']['objectId'] == obj_id) & (
        lc_datasets['lsst']['band'] == b
    )
    # Select the observations and store in the dict 'lc'
    lc[b] = lc_datasets['lsst'][filt_band_obj]
Don’t forget about the best practices
Do you see any variables defined in the code above that you should move to some other section?
Solution
It seems very likely that we will need the variable bands many times in the future. Let’s move it to the ‘Params’ section of the notebook.
Have a look at the resulting dictionary: you will find that each element has a key corresponding to the band name, and its value contains a Pandas DataFrame with observations in this band.
Now we need to import the functions from the models.py file. We should do it in the ‘Imports’ section.
import lcanalyzer.models as models
Pick a function from this module, for example max_mag, and apply it to one of the light curves.
models.max_mag(lc['g'],'psfMag')
19.183367224358136
Don’t forget about the best practices
Do you see anything in the code we just typed that could go in the ‘Params’ section?
Solution
It is better to put the magnitude column name, ‘psfMag’, in a variable (let’s call it colname_mag) and declare it in the ‘Params’ section. Why? Because chances are, in the future we will want to apply our code to another dataset with different column names, and if we continue using ‘psfMag’ across the notebook, later on we’ll have to replace it either manually or using ‘Search’ functionality. In a large notebook both approaches are likely to produce errors.
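After these two moves, the ‘Params’ section cell might look like this (a minimal sketch; the variable names here are our own choice):

### Params
bands = 'ugrizy'
colname_mag = 'psfMag'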
How would you check if our max_mag function works correctly?
The answer that just came to your head, in all likelihood, sounds similar to this: “I would pass a simple DataFrame to this function and check manually that the returned maximum value is correct”. It makes perfect sense, and may well work with a function as simple as ours:
test_input = pd.DataFrame(data=[[1, 5, 3], [7, 8, 9], [3, 4, 1]], columns=list("abc"))
test_output = 7
models.max_mag(test_input, "a") == test_output
True
But now let’s make the task more realistic and recall our original objective: to get maximum values of the light curves in all bands. We can write a function for this as well:
### Get maximum values for all bands
def calc_stat(lc, bands, mag_col):
    # Define an empty dictionary where we will store the results
    stat = {}
    # For each band get the maximum value and store it in the dictionary
    for b in bands:
        stat[b + "_max"] = models.max_mag(lc[b], mag_col)
    return stat
And then construct the test data:
df1 = pd.DataFrame(data=[[1, 5, 3], [7, 8, 9], [3, 4, 1]], columns=list("abc"))
df2 = pd.DataFrame(data=[[7, 3, 2], [8, 4, 2], [5, 6, 4]], columns=list("abc"))
df3 = pd.DataFrame(data=[[2, 6, 3], [1, 3, 6], [8, 9, 1]], columns=list("abc"))
test_input = {"df1": df1, "df2": df2, "df3": df3}
test_output = {"df1_max": 8, "df12_max": 6, "df3_max": 8}
test_output == calc_stat(test_input, ["df1", "df2", "df3"], "b")
See what kind of output this code produces.
What went wrong?
If you just copied the code above, you got False. Try to find out what is wrong with our calc_stat function.
Solution
Our calc_stat function is fine. Our test_output contains two errors. This example highlights an important point: as well as making sure our code is returning correct answers, we also need to ensure the tests themselves are correct. Otherwise, we may go on to fix our code only to have it return an incorrect result that merely appears to be correct. So a good rule is to make tests simple enough to understand, so we can reason about the correctness of our tests as well as of our code. Otherwise, our tests hold little value.
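For reference, the corrected expected output is below: the per-band maxima of column ‘b’ are 8, 6 and 9, and with these values the comparison returns True.

test_output = {"df1_max": 8, "df2_max": 6, "df3_max": 9}
test_output == calc_stat(test_input, ["df1", "df2", "df3"], "b")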
Our crude test failed and didn’t even inform us about the reason it failed. Surely there must be a better way to do this.
Testing Frameworks
The example above shows that manually constructing even a simple test for a fairly simple function can be tedious, and may introduce new errors instead of fixing old ones. Besides, we would like to test many functions in various scenarios, and for a complex function or a library, a test suite - a set of tests - can include dozens of tests. Obviously, running them one by one in a notebook is not a good idea, so we need a tool to automate this process and to obtain a comprehensive report on which tests passed and which failed. We’d also prefer something that tells us what exactly went wrong.
The solution to these problems is a unit testing framework. In such a framework we define the tests we want to run as functions, and the framework automatically runs each of these functions in turn, summarising the outputs. Since most people don’t enjoy writing tests, unit testing frameworks aim to make it simple to:
- Add or change tests,
- Understand the tests that have already been written,
- Run those tests, and
- Understand those tests’ results.
Test results must also be reliable. If a testing tool says that code is working when it’s not, or reports problems when there actually aren’t any, people will lose faith in it and stop using it.
We will use a testing framework called pytest. It is a Python package that can be installed, as usual, using pip:
$ python -m pip install pytest
Why Use pytest over unittest?
We could alternatively use another Python unit test framework, unittest, which has the advantage of being installed by default as part of Python. This is certainly a solid and established option; however, pytest has many distinct advantages, particularly for learning, including:
- unittest requires additional knowledge of object-oriented frameworks (covered later in the course) to write unit tests, whereas in pytest tests are written as simpler functions, so it is easier to learn
- Being written using simpler functions, pytest’s scripts are more concise and contain less boilerplate, and thus are easier to read
- pytest output, particularly in regard to test failure output, is generally considered more helpful and readable
- pytest has a vast ecosystem of plugins available if ever you need additional testing functionality
- unittest-style unit tests can be run from pytest out of the box!
A common challenge, particularly at the intermediate level, is the selection of a suitable tool from many alternatives for a given task. Once you’ve become accustomed to object-oriented programming you may find unittest a better fit for a particular project or team, so you may want to revisit it at a later date!
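To make the contrast concrete, here is the same trivial check written for each framework (a generic illustration of ours, not part of the project code):

# pytest style: a plain function with a bare assert
def test_addition():
    assert 1 + 1 == 2

# unittest style: a method on a TestCase subclass
import unittest

class TestAddition(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)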
pytest requires that we put our tests into a separate .py file. We already have some tests in tests/test_models.py:
"""Tests for statistics functions within the Model layer."""
import pandas as pd
def test_max_mag_integers():
    # Test that max_mag function works for integers
    from lcanalyzer.models import max_mag
    test_input_df = pd.DataFrame(data=[[1, 5, 3], [7, 8, 9], [3, 4, 1]], columns=list("abc"))
    test_input_colname = "a"
    test_output = 7
    assert max_mag(test_input_df, test_input_colname) == test_output
...
The first function represents the same test case as the one we tried first in our notebook. However, it has a different format:
- we import the function we test right inside the test function, for clarity of the testing environment;
- then we specify our test input and output;
- and then we run our check using the assert keyword.
We haven’t met the assert keyword before; however, it is essential for developing, debugging and testing robust and reliable code. The assert keyword is responsible for checking whether some condition is true. If it is true, nothing happens and the execution of the code continues. However, if the condition is not fulfilled, an AssertionError occurs. When you write your own assert checks, you can use the following syntax:
assert condition, message
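For example (a toy check of our own, not taken from the project):

x = 5
assert x > 0, "x must be positive"  # condition holds, nothing happens
assert x > 10, "x must be greater than 10"  # raises AssertionError: x must be greater than 10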
Testing frameworks already have their own implementations of various assertions, for example those that can check if two dictionaries are the same (and then inform us where exactly they differ), if two variables are of the same type and so on. Apart from that, some other packages, including numpy and pandas, have testing modules that allow you to compare numpy arrays, DataFrames, Series and so on.
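For instance, here is a minimal sketch using two assertion helpers from those testing modules; both pass silently here:

import numpy as np
import pandas as pd

# Compare two numpy arrays element-wise, within a floating-point tolerance
np.testing.assert_allclose(np.array([0.1 + 0.2]), np.array([0.3]))

# Compare two DataFrames, reporting exactly where they differ if they do
pd.testing.assert_frame_equal(
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"a": [1, 2]}),
)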
Running Tests
Now we can run these tests in the command line using pytest:
$ python -m pytest tests/test_models.py
Here, we use -m to invoke the installed pytest module, and specify the tests/test_models.py file to run the tests in that file explicitly.
Why Run Pytest Using python -m and Not pytest?
Another way to run pytest is via its own command, so we could try to use pytest tests/test_models.py on the command line instead, but this would lead to a ModuleNotFoundError: No module named 'lcanalyzer'. This is because using the python -m pytest method adds the current directory to its list of directories to search for modules, whilst using pytest does not - the lcanalyzer subdirectory’s contents are not ‘seen’, hence the ModuleNotFoundError. There are various ways to get around this, but we’ve used python -m for simplicity.
============================= test session starts ==============================
platform linux -- Python 3.11.5, pytest-8.0.0, pluggy-1.4.0
rootdir: /home/alex/InterPython_Workshop_Example
plugins: anyio-4.2.0
collected 2 items
tests/test_models.py .. [100%]
============================== 2 passed in 0.44s ===============================
Pytest looks for functions whose names start with the letters ‘test_’ and runs each one.
Notice the .. after our test script:
- If the function completes without an assertion being triggered, we count the test as a success (indicated as .).
- If an assertion fails, or we encounter an error, we count the test as a failure (indicated as F). The error is included in the output so we can see what went wrong.
So if we have many tests, we essentially get a report indicating which tests succeeded or failed.
Exercise: Write Some Unit Tests
We already have a couple of test cases in tests/test_models.py that test the max_mag() function. Looking at lcanalyzer/models.py, write at least two new test cases that test the mean_mag() and min_mag() functions, adding them to tests/test_models.py. Here are some hints:
- You could choose to format your functions very similarly to max_mag(), defining test input and expected result arrays followed by the equality assertion.
- Try to choose cases that are suitably different, and remember that these functions take a DataFrame and return a float corresponding to a chosen column.
- Experiment with the functions in a notebook cell in test-development.ipynb to make sure your test result is what you expect the function to return for a given input. Don’t forget to put your new test in tests/test_models.py once you think it’s ready!
Once added, run all the tests again with python -m pytest tests/test_models.py, and you should also see your new tests pass.
Solution
def test_min_mag_negatives():
    # Test that min_mag function works for negatives
    from lcanalyzer.models import min_mag
    test_input_df = pd.DataFrame(data=[[-7, -7, -3], [-4, -3, -1], [-1, -5, -3]], columns=list("abc"))
    test_input_colname = "b"
    test_output = -7
    assert min_mag(test_input_df, test_input_colname) == test_output

def test_mean_mag_integers():
    # Test that mean_mag function works for negative integers
    from lcanalyzer.models import mean_mag
    test_input_df = pd.DataFrame(data=[[-7, -7, -3], [-4, -3, -1], [-1, -5, -3]], columns=list("abc"))
    test_input_colname = "a"
    test_output = -4.
    assert mean_mag(test_input_df, test_input_colname) == test_output
Optional Exercise: Write a Unit Test for the calc_stat Function
If you have some time left, extract our calc_stat function into the models.py file and write a test for this function, using the (correct) test input and output from our experiments earlier.
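One possible solution, as a sketch: this assumes calc_stat has been moved into lcanalyzer/models.py, reuses the corrected expected output from earlier, and relies on pandas already being imported as pd at the top of the test file, as in the existing tests.

def test_calc_stat():
    # Test that calc_stat returns the per-band maxima of the chosen column
    from lcanalyzer.models import calc_stat
    df1 = pd.DataFrame(data=[[1, 5, 3], [7, 8, 9], [3, 4, 1]], columns=list("abc"))
    df2 = pd.DataFrame(data=[[7, 3, 2], [8, 4, 2], [5, 6, 4]], columns=list("abc"))
    df3 = pd.DataFrame(data=[[2, 6, 3], [1, 3, 6], [8, 9, 1]], columns=list("abc"))
    test_input = {"df1": df1, "df2": df2, "df3": df3}
    test_output = {"df1_max": 8, "df2_max": 6, "df3_max": 9}
    assert calc_stat(test_input, ["df1", "df2", "df3"], "b") == test_output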
The big advantage is that as our code develops we can update our test cases and commit them back, ensuring that we (and others) always have a set of tests to verify our code at each step of development. This way, when we implement a new feature, we can check a) that the feature works, using a test we write for it, and b) that the development of the new feature doesn’t break any existing functionality.
What About Testing for Errors?
There are some cases where seeing an error is actually the correct behaviour, and Python allows us to test for exceptions. Add this test in tests/test_models.py:
import pytest

def test_max_mag_strings():
    # Test for TypeError when passing a string
    from lcanalyzer.models import max_mag
    test_input_colname = "b"
    with pytest.raises(TypeError):
        error_expected = max_mag('string', test_input_colname)
Note that you need to import the pytest library at the top of our test_models.py file with import pytest so that we can use pytest’s raises() function.
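As an aside, pytest.raises also accepts a match argument, a regular expression that the error message must match, so a test can pin down not only the type of an exception but also its content. A generic illustration of ours, not part of the project tests:

with pytest.raises(ValueError, match="invalid literal"):
    int("not a number")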
Run all your tests as before.
Since we’ve installed pytest to our environment, we should also regenerate our requirements.txt:
$ pip3 freeze > requirements.txt
Finally, let’s commit our new test_models.py file, requirements.txt file, and test cases to our test-suite branch, and push this new branch and all its commits to GitHub:
$ git add requirements.txt tests/test_models.py
$ git commit -m "Add initial test cases for mean_mag() and min_mag()"
$ git push -u origin test-suite
Why Should We Test Invalid Input Data?
Testing the behaviour of inputs, both valid and invalid, is a really good idea and is known as data validation. Even if you are developing command line software that cannot be exploited by malicious data entry, testing behaviour against invalid inputs prevents generation of erroneous results that could lead to serious misinterpretation (as well as saving time and compute cycles which may be expensive for longer-running applications). It is generally best not to assume your user’s inputs will always be rational.
What About Unit Testing in Other Languages?
Other unit testing frameworks exist for Python, including Nose2 and unittest, and the approach to unit testing can be translated to other languages as well, e.g. pFUnit for Fortran, JUnit for Java (the original unit testing framework), Catch or gtest for C++, etc.
Key Points
The three main types of automated tests are unit tests, functional tests and regression tests.
We can write unit tests to verify that functions generate expected output given a set of specific inputs.
It should be easy to add or change tests, understand and run them, and understand their results.
We can use a unit testing framework like Pytest to structure and simplify the writing of tests in Python.
We should test for expected errors in our code.
Testing program behaviour against both valid and invalid inputs is important and is known as data validation.