Assessing Accuracy

It is March 2015. Canine cancer-detectors are back in the news. The issue that I would like to address does not concern such studies in particular (i.e., I do not have a dog in the fight with regard to research investigating dogs identifying cancer). The reporting of these studies does, however, provide a convenient example to illustrate a common shortcoming in the general media coverage of medical tests (which includes both news stories and recommendations by advocacy groups). There are, of course, many different aspects of reporting that one can hone a critical eye for (especially concerning methodology), but I would like to focus on a more general aspect, which should be just as valuable when informing oneself about proven (and routinely administered) screening methods.

The most recent study to make the news involved a dog sniffing out a diagnosis for or against thyroid cancer with a success rate of 88%. I did state that I do not plan to dwell on methodology, but a sample size of 34 subjects really is objectionably small. So to make my point clearer, I thought that it would be fairer to take on a more robust study. And Gianluigi Taverna and colleagues (2014) ran just such a study, published in the Journal of Urology, featuring 902 participants (362 patients with prostate cancer; 540 healthy controls). That is a large sample size, and you would expect any findings to paint a fairly representative picture. The key issue is how the reported findings are interpreted.

Reuters tells me that “Overall, the dogs had 16 false positives and four false negatives.” But Reuters does not tell me – the possibly naive reader – how to interpret this information.

Let us assume that it is reasonable to extrapolate based on the numbers in this study. This would give us:
16 false positives out of 540 cases = a rate of about 3% (of healthy individuals wrongly diagnosed with cancer)
4 false negatives out of 362 cases = a rate of about 1% (of cancer patients wrongly diagnosed as cancer-free)

I googled “dog sniff prostate cancer”, and the first hit was Medical News Today reporting 98% accuracy for this study. 98%!
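For readers who would like to check the arithmetic, here is a minimal sketch that recomputes the two error rates and the headline figure from the reported counts. It assumes that "accuracy" simply means the share of all 902 calls that were correct, which is my reading of the number and not something the news reports spell out. The variable names are my own.

```python
# Counts as reported for the Taverna et al. (2014) study.
cancer_patients = 362
healthy_controls = 540
false_positives = 16   # healthy controls flagged as having cancer
false_negatives = 4    # cancer patients flagged as cancer-free

false_positive_rate = false_positives / healthy_controls
false_negative_rate = false_negatives / cancer_patients

# Assumption: "accuracy" = correct calls divided by all calls.
total = cancer_patients + healthy_controls
accuracy = (total - false_positives - false_negatives) / total

print(f"False-positive rate: {false_positive_rate:.1%}")  # 3.0%
print(f"False-negative rate: {false_negative_rate:.1%}")  # 1.1%
print(f"Overall accuracy:    {accuracy:.1%}")             # 97.8%
```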

Do these numbers impress you, too?

Math Time

Let us assume a 5-year prevalence of 100 cases of prostate cancer per 100,000 people (0.1% of the population).
10,000 individuals go to Dr. Dog for their annual screening.
With 0.1% prevalence, 10 of them would genuinely have prostate cancer.
With a false-negative rate of only 1%, we would expect to miss just 0.1 of these 10 individuals, so let us say that all 10 are diagnosed correctly. So far so good.
But that leaves us with 9,990 individuals with healthy prostates, and at a false-positive rate of 3%, we would have about 300 additional hits among them (9,990 × 0.03 ≈ 300).

That result means that a total of 310 individuals would receive the news that they tested positive for prostate cancer, and for 300 of them (about 97% of all positive results) the test would have been wrong. Despite “98% accuracy”. Again, I am not trying to single out these studies or news articles in particular. I do not doubt that the research is necessary and holds promise for other paths of enquiry (or I could be wrong, and the focus on dogs already counts as old news). But I am sure that you can now see why the latest reports of “88% accuracy” instantly triggered alarm bells in my head, especially when news sites started celebrating this number in their headlines (“almost 90%”!).
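The screening scenario can be sketched in a few lines of code. This is only an illustration under the assumptions stated in the text (10,000 people screened, 0.1% prevalence, rounded error rates of 1% and 3%); the variable names are my own. Unlike the text, the sketch does not round the detected cases up to 10, which changes the totals only marginally.

```python
screened = 10_000
prevalence = 0.001           # assumed: 100 cases per 100,000 people
false_negative_rate = 0.01   # rounded from 4/362
false_positive_rate = 0.03   # rounded from 16/540

with_cancer = screened * prevalence
healthy = screened - with_cancer

# Expected numbers of correct and incorrect positive results.
true_positives = with_cancer * (1 - false_negative_rate)
false_alarms = healthy * false_positive_rate

all_positives = true_positives + false_alarms
share_wrong = false_alarms / all_positives

print(f"Positive results:        {all_positives:.0f}")  # 310
print(f"  of which false alarms: {false_alarms:.0f}")   # 300
print(f"Share of wrong positives: {share_wrong:.1%}")   # 96.8%
```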

I think that it is important to be armed with a critical eye for such details. Even when dealing with proven tests (as opposed to preliminary research as reported by laypeople, as in the examples above), it is important to know the exact implications of any diagnoses that one may get. This knowledge is just as relevant for doctors to have in the back of their minds when relaying test results as it is for anyone who has just received such a result or may be planning to get some tests done. Assessing accuracy rates is, of course, only one tool in the toolbox of critical analysis. (And there is a host of further considerations too, such as cases where early treatment does not reduce mortality, so that even a correct diagnosis merely reduces quality of life earlier than necessary.) If you are interested in exploring the analyses of such issues in more depth, I recommend Gerd Gigerenzer’s “Risk Savvy” for further reading.

Recap

Ostensibly high “accuracy” rates can be misleading if considered in isolation. Two additional pieces of information that can be used to get a better picture are: (1) the rate of false positives (What is the percentage of healthy individuals being wrongly diagnosed?), and (2) the prevalence of whatever is being identified. The lower the prevalence, the larger the share of unaffected individuals among whom false positives can occur, which stands in stark contrast to experiments featuring large proportions of affected individuals.

It is like looking for needles in a haystack: if you mix 362 needles with 540 other hay-like substances, the number of needles that your needle-sniffing detector manages to identify will look really impressive next to the number of wrongly identified objects (e.g., identifying 358 actual needles and mistaking only 16 hay-like objects for needles). But in a more realistic scenario, you could very well be looking for just 10 needles among 9,990 other objects. Yes, you may be able to detect most of your needles, but having 999 times as many foreign objects as needles means that the number of wrongly identified objects will grow proportionally, so that a seemingly negligible false-positive rate yields enough false alarms to dwarf the number of genuine hits that the test may yield. And in the real world, this can have negative consequences.
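The haystack comparison can be put into numbers with a small helper. This is a sketch of my own (the function name and parameters are not taken from any study); it keeps the detector's two rates fixed at the unrounded study values and changes only the mix of needles and other objects.

```python
def share_of_hits_that_are_real(needles, others, detection_rate, false_alarm_rate):
    """Fraction of the detector's hits that are actual needles."""
    true_hits = needles * detection_rate
    false_hits = others * false_alarm_rate
    return true_hits / (true_hits + false_hits)

# The detector's rates, taken from the study counts (358 of 362, 16 of 540).
detection_rate = 358 / 362
false_alarm_rate = 16 / 540

study_mix = share_of_hits_that_are_real(362, 540, detection_rate, false_alarm_rate)
realistic_mix = share_of_hits_that_are_real(10, 9_990, detection_rate, false_alarm_rate)

print(f"Study mix (362 needles, 540 others):    {study_mix:.1%}")      # 95.7%
print(f"Realistic mix (10 needles, 9,990 others): {realistic_mix:.1%}")  # 3.2%
```

Same detector, same error rates: only the proportion of needles has changed.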

Postscript

Believe it or not, the calculations that I presented were skewed in favour of the test:

(1) If I read these WHO numbers from 2012 correctly, they tell me that the 5-year prevalence of cancer is around 625 individuals per 100,000 people, with prostate cancer accounting for 12% of cases, which amounts to 75 individuals. For the sake of clarity in the calculations, I rounded up fairly generously to a hundred cases (per 100,000).

(2) The Reuters article states that “During the training, 200 urine samples from the prostate cancer group and 230 samples from the control group were analysed.” If the error rates provided are based on the number of samples, my calculations above are more favourable to the test than was actually the case (16/230 ≈ 7% and 4/200 = 2% instead of 16/540 ≈ 3% and 4/362 ≈ 1%).

(3) Also consider what would happen to the share of false positives among all positive results when higher numbers of, say, younger individuals (where prevalence is lower) are getting tested.
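To see how much these three points matter, here is a rough sketch that recomputes the share of wrong positive results under each variation. The function name is my own, and the prevalence used for point (3), 10 cases per 100,000, is a purely hypothetical value chosen for illustration and not a figure from any source.

```python
def share_of_positives_that_are_wrong(prevalence, false_positive_rate, false_negative_rate):
    """Expected fraction of positive test results that are false alarms."""
    true_positives = prevalence * (1 - false_negative_rate)
    false_positives = (1 - prevalence) * false_positive_rate
    return false_positives / (true_positives + false_positives)

# Scenario from the main text: 100 per 100,000, rounded rates.
baseline = share_of_positives_that_are_wrong(0.001, 0.03, 0.01)

# (1) Prevalence of 75 per 100,000 instead of the rounded-up 100.
lower_prevalence = share_of_positives_that_are_wrong(0.00075, 0.03, 0.01)

# (2) Additionally, error rates based on sample counts (16/230 and 4/200).
sample_based = share_of_positives_that_are_wrong(0.00075, 16 / 230, 4 / 200)

# (3) Hypothetical younger group: 10 per 100,000 (illustrative assumption only).
younger_group = share_of_positives_that_are_wrong(0.0001, 0.03, 0.01)

for label, value in [
    ("Main text", baseline),
    ("(1) 75 per 100,000", lower_prevalence),
    ("(2) plus sample-based rates", sample_based),
    ("(3) hypothetical 10 per 100,000", younger_group),
]:
    print(f"{label}: {value:.1%} of positive results would be wrong")
```

In every variation the share of wrong positives comes out higher than in the main text.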

Konrad Senf
