Recently there has been plenty of talk about testing for infections, for what can only be described as very obvious reasons. However, outside specialist sources, the performance characteristics of these tests are not really discussed. If they are, the discussion can be somewhat vague, with some mentions about their accuracy.
Anyone working with data knows that, while accuracy has an exact definition, the colloquial use of the term is much more fluid. In this piece, I want to explore why one number is not really enough and why a somewhat more thorough understanding is called for.
So, let’s assume we are working with a test for an infection. Our test is very good: it has a 95% chance of identifying the infection correctly. I’m not referring to any particular Covid-19 test, but I did choose a number that is representative of real tests. Anyway, case closed, test is good, nothing more to see here? Well, not quite. The answer actually depends on some other factors. Let’s draw some illustrative squares to explain.
We will consider a population of 10 000 people, neatly represented by this square on the right. To get further, we need to know how large a portion of the population is infected. To keep things simple, we ignore the effect of time here and just assume that everyone who is infected has the infection at the same time. While this is certainly a simplification compared to real life, it does not affect the point we’re making here.
We shall assume that the infection spreads very effectively and reaches 60% of the population. Thus our population is neatly split in two: 6000 infected and 4000 non-infected. This is the real situation, but it is not something we actually know a priori. That is why we have the test in the first place: to try and understand what the situation is.
So, we test our whole population. (We could also test a sample of the said population in such a way that we would be comfortable extrapolating the results from the test set to the whole population, but this piece is not about sampling, so everyone it is.) Now we need to understand what the expected test results are, and for that we need to understand how good our test actually is.
So, how do we understand our test outcomes? We shall give the test to infected and non-infected alike, so we need to know what the expected outcome is in each case. We assumed a good test that gives the correct answer in 95% of the cases, and we take this to hold both ways. This means that 95% of the infected get a positive result, but so do 5% of the non-infected. This gives us the following view of the situation: of the 6000 infected, 5700 test positive and 300 test negative, while of the 4000 non-infected, 200 test positive and 3800 test negative.
From these numbers we can calculate various indicators. For example, there are 5900 positive test results in total, out of which 5700 are correct, meaning that if you get a positive result there is a 97% chance that it is correct. Correspondingly, 3800 of the 4100 negative results are correct, so there is a 93% chance of a negative result being correct.
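The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration under the assumptions stated in the text (10 000 people, 60% infected, 95% correct answers both ways). The function name `expected_outcomes` and the dictionary keys are made up for this example, and the terms sensitivity and specificity are the standard names for the two "95%" figures rather than words used in the text.

```python
def expected_outcomes(population, prevalence, sensitivity, specificity):
    """Return expected test counts and predictive values.

    sensitivity: share of infected people who test positive.
    specificity: share of non-infected people who test negative.
    All names here are illustrative, not taken from any library.
    """
    infected = population * prevalence
    non_infected = population - infected

    true_positives = infected * sensitivity
    false_negatives = infected - true_positives
    true_negatives = non_infected * specificity
    false_positives = non_infected - true_negatives

    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "true_negatives": true_negatives,
        # Chance that a positive result is correct.
        "ppv": true_positives / (true_positives + false_positives),
        # Chance that a negative result is correct.
        "npv": true_negatives / (true_negatives + false_negatives),
    }


if __name__ == "__main__":
    common = expected_outcomes(10_000, 0.60, 0.95, 0.95)
    for name, value in common.items():
        print(f"{name}: {value:.4g}")
```

With these inputs the sketch gives 5700 true positives and 200 false positives, which together make up the 5900 positive results mentioned above.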
Interestingly, neither of those numbers was 95%, which was the chance that the test identifies the infection correctly. But they are not that far off either, so should we care? Definitely maybe! Let us repeat the exercise using the exact same assumptions about the test, but with a different infection, and see what comes out.
What we have considered so far is an infection that is very common: most of our population was infected. Let’s see how the situation changes if we have something that is very rare. Everything else stays the same, but now only a tiny fraction, 0.4%, of the population is infected. This time around our colored square looks different.
Again, this is the real situation, which we do not know and need to deploy testing to figure out. We already know the performance of our test, because we stipulated that nothing changes. Drawing the same square as we did before gives us quite a different outcome.
As above, we have all the numbers we need to dig a bit deeper into the results: of the 40 infected, 38 test positive and 2 test negative, while of the 9960 non-infected, 498 test positive and 9462 test negative.
As you can see, it is the latter group that throws everything off. Now there are 536 positive results, out of which only 38, or 7%, are correct. However, of the 9464 negative results, 99.98% are correct. Again, neither is 95%, but this time we are way off.
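The 7% figure can also be written out as a short worked equation. The symbols are assumptions introduced here for illustration: $p$ is the share of the population that is infected, $Se$ is the share of infected people who test positive, and $Sp$ is the share of non-infected people who test negative.

```latex
\[
P(\text{infected} \mid \text{positive})
  = \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)}
  = \frac{0.95 \times 0.004}{0.95 \times 0.004 + 0.05 \times 0.996}
  = \frac{0.0038}{0.0536}
  \approx 0.07
\]
```

The second term in the denominator, the false positives from the non-infected majority, is about thirteen times larger than the numerator, which is why so few positive results are correct.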
In a nutshell, we have two very different results with the exact same test. The only difference between these cases was the prevalence of the infection. No test is perfect and there will always be errors, one way or the other. In the widespread, Covid-19-like case the error rate did not have a very large effect on the outcome. In the rare case, however, the real cases are completely overshadowed by the false positives from the non-infected majority.
We used the same rate, 5%, for both kinds of error, false positives and false negatives. These of course need not be, and indeed rarely are, the same. It is quite easy to see that in the first case (small) changes in either error rate do not have a major effect on the results. This is certainly not true in the latter case, which is especially sensitive to the false positive rate because the number of actual cases is so small.
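To see this sensitivity in numbers, here is a small sketch that varies the false positive rate while keeping everything else fixed. The function name `positive_predictive_value` and the particular rates tried (1% to 7%) are choices made for this illustration, not values from any real test.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Chance that a positive result is correct (illustrative helper)."""
    true_positive_share = prevalence * sensitivity
    false_positive_share = (1 - prevalence) * false_positive_rate
    return true_positive_share / (true_positive_share + false_positive_share)


if __name__ == "__main__":
    scenarios = [("common, 60% infected", 0.60), ("rare, 0.4% infected", 0.004)]
    for label, prevalence in scenarios:
        print(label)
        # Hypothetical false positive rates around the 5% used in the text.
        for fpr in (0.01, 0.03, 0.05, 0.07):
            ppv = positive_predictive_value(prevalence, 0.95, fpr)
            print(f"  false positive rate {fpr:.0%}: "
                  f"positive result correct {ppv:.1%}")
```

In the common case the share of correct positive results moves by only a few percentage points across this range, whereas in the rare case it changes severalfold.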
The particular error in interpreting test outcomes that we have described above is common enough to have a name: the base-rate fallacy.
As a last remark, we should note that, aside from some nomenclature, there is nothing here that is specific to medical sciences. Everything applies as is to any binary classification problem, and is good to keep in mind the next time you’re training a classifier and thinking about the accuracy of your model’s predictions.
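For the classifier case, the same point can be sketched with the usual confusion-matrix metrics. The helper `classification_metrics` is a hypothetical name for this example, and the counts are the expected outcomes of the two scenarios discussed in this piece (10 000 people, 95% correct answers both ways, 60% and 0.4% infected).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all answers that are correct
        "precision": tp / (tp + fp),     # share of positive answers that are correct
        "recall": tp / (tp + fn),        # share of real positives that are found
    }


if __name__ == "__main__":
    # Expected counts for the common (60%) and rare (0.4%) scenarios.
    common = classification_metrics(tp=5700, fp=200, fn=300, tn=3800)
    rare = classification_metrics(tp=38, fp=498, fn=2, tn=9462)
    for label, metrics in (("common", common), ("rare", rare)):
        summary = ", ".join(f"{k} {v:.1%}" for k, v in metrics.items())
        print(f"{label}: {summary}")
```

Accuracy and recall come out the same in both scenarios, while precision collapses in the rare one, which is why a single accuracy number says little about a classifier on imbalanced data.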