Poll: Accuracy of Test Interpretation

Hmmm, accuracy determined as the percentage of all tests performed which give the correct result, irrespective of whether they are positive or negative.

This depends absolutely on the population you choose to test.

If you are testing a population which has a low overall disease incidence, you will find that virtually all your positives are wrong and virtually all your negatives are right. Thus so long as the test has good specificity - that is, it isn't spewing out too many false positives (99% is bloody brilliant) - it will seem to have great "accuracy" no matter how bad the sensitivity.

If almost all the patients you test are unaffected, almost all your negative results will be right even if the test is actually missing quite a high proportion of affected individuals. You could have a sensitivity of only 50%, missing half of the genuinely affected individuals, but still claim 99% "accuracy" in this way. And it has been done.
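Here's a quick back-of-the-envelope sketch of that claim (Python, with numbers I've simply picked for illustration: 50% sensitivity, 99.5% specificity, 1% of the tested population affected):

```python
# Illustrative only: how a test that misses half the affected individuals
# can still boast "99% accuracy" in a mostly-unaffected population.

def overall_accuracy(prevalence, sensitivity, specificity):
    """Fraction of all results, positive and negative, that are correct."""
    correct_positives = prevalence * sensitivity
    correct_negatives = (1 - prevalence) * specificity
    return correct_positives + correct_negatives

acc = overall_accuracy(prevalence=0.01, sensitivity=0.50, specificity=0.995)
print(f"Overall 'accuracy': {acc:.1%}")  # -> 99.0%, despite 50% sensitivity
```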

However, such a test will be useless to you if you are testing sick individuals you suspect of having the disease. It will miss half of the real cases.

This is why the term "accuracy" is meaningless. First, it conflates sensitivity and specificity, which will almost certainly be different, and second, if you're looking at overall numbers of "correct" results, you can get any answer you want just by choosing how you use the test.

Lousy sensitivity - display the figures for how it performs as a well-animal screen. You can't lose.

Lousy specificity - assume that the user will only be using it where the condition is strongly suspected on clinical grounds. It may still not look wonderful, but you can make it look a lot rosier than it really is.

I've seen both ploys used to make things look better than they are. I'm wise to it. That's one of the reasons they ask me to referee papers submitted to a number of professional journals.

Oh, hold still for one of the only two jokes I ever invented all by myself.

________________________________________

New! Cutting-edge technology! Statistically proven!

THE NEG-TEST™

Over 99.5% of all negative results guaranteed correct!*

NEVER produces a false positive!

Simple and inexpensive!

Method: Simply take the Neg-Test™ ballpoint pen provided, find the cat's clinical record, and write the words "FeLV negative". That's it. No need to take a blood sample, no messy reagents, no fiddly timing, no laboratory skill required.

Change to the Neg-Test™ in your practice today!

* Statistics only valid when the prevalence of infection in the population being tested is less than 0.5%.

____________________________________________

OK, you can quit with the hysterical laughter now. :D

Rolfe.
 
Paul C. Anagnostopoulos said:

You mean there wasn't a hidden agenda here? Wow, fooled me, too.
The "hidden agenda" was to permit me to begin a discussion of why people in general (and sometimes physicians) have problems with the question. Also to demonstrate that certain people aren't nearly as knowledgeable as they think they are.

For the record: I approve of most of modern medicine, and disapprove of most of alternative "medicine". It's the stuff I don't approve of in modern medicine that bothers me - and the unwillingness of some to admit that it could be made better.
 
ceptimus said:
Hmmm. Seeing as the cat is already out of the bag :) let's work this out using a population of 1,000,000 people. As the disease affects one out of every 1,000 people, we know that 1,000 people will be infected.

Of the one thousand people who are infected, 990 will be told they have the disease and 10 will test negative.

Of the remaining 999,000 people who don't have the disease, 1% (i.e. 9,990) will be told they tested positive and the remaining 989,010 will be told they tested clear.

So a total of 990 + 9,990 = 10,980 people will be told they tested positive and of those only 990 people will really be ill.

So if you are told you tested positive for the disease, the chances that you actually have it are:

990 / 10,980 = 9.016393443 %

Read what Rolfe has written. This will be the case if the test is administered to everyone. However, if you administer the same test only to those who show other clinical signs, then the incidence of the _tested population_ is far higher than 1/1000.

Let's use a pregnancy test as an example. Suppose the pregnancy test is 99% accurate (in both directions) and that 1/1000 women are pregnant. If every woman is given a preg test, then about 90% of the positive results will be false positives.

OTOH, suppose that the only women who get tested are those who have missed a period. Now, there are lots of reasons to miss a period, but the main one is pregnancy. Let's say that 80% of the time when a woman misses a period, it is because of pregnancy. Thus, if only the women who have missed a period are given an exam, then 800/1000 would be pregnant. For a 99% test, 792 of the pregnant women would test positive, but 2 non-pregnant women would also test positive. Thus, the probability of being pregnant, given a missed period and a positive preg test, is 99.75%.

This is the same point that Rolfe has been making. Tests are not carried out in a vacuum.

Now throw in a woman who has not only missed a period but is also suffering morning sickness. At that point, the positive test is even more solid.
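For anyone who wants to check the arithmetic, here's a minimal sketch (Python, using the same assumed figures as above - nothing here is real clinical data):

```python
# Positive predictive value: P(affected | positive result).

def ppv(prevalence, sensitivity, specificity):
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# ceptimus's example: 1/1000 prevalence, 99% sensitivity and specificity.
print(f"{ppv(0.001, 0.99, 0.99):.2%}")  # -> 9.02%

# Missed-period scenario: 80% of the women actually tested are pregnant.
print(f"{ppv(0.80, 0.99, 0.99):.2%}")   # -> 99.75%
```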
 
Wrath of the Swarm said:
I said the test is 99% accurate. That sets both values.
Please be more clear, Wrath.

When you say that the test is 99% accurate, do you mean that 99% is the arithmetical mean of the sensitivity and the specificity, or do you mean that 99% of the results you get when you actually do the test are correct?

If the former, then I submit that 99% accurate could describe a test with 98% sensitivity and 100% specificity. In which case the doctor would be right anyway.

If the latter, then it would depend entirely on the percentage of the tested population which is actually affected, and on the (possibly differing) values of sensitivity and specificity. (In the usual scenario, the majority of the tested population is assumed to be unaffected. This means that a test with great specificity will always look very good, no matter how lousy the sensitivity, while a test with great sensitivity may look diabolical if the specificity is poor.)

Thus one can be misled into thinking that good specificity is what matters and to hell with the sensitivity - especially for screening well patients.

In fact the opposite is true. You need almost perfect sensitivity quite desperately. Because you don't want to have to keep doubting and re-checking all your negative results, which will after all be in the large majority. If you can trust a negative result to be highly unlikely to miss an affected individual, then double-checking all your positives, within reason, isn't too much of a chore.

A test with 99.5% sensitivity and only 95% specificity is much more use to me in a screening situation than one with 99.5% specificity and only 95% sensitivity. That's because with the former I can virtually rely on the negatives, and only have to recheck 5% (or a bit more) of my results - the positives. With the latter, I can't rely on either the positives or the negatives.

But it's the latter that shows the better "accuracy" according to the second definition, in a mostly-unaffected population.
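If anyone wants the figures behind that, here's a rough sketch (Python, illustrative numbers only, at an assumed 1% prevalence):

```python
# Compare the two hypothetical screening tests at 1% prevalence.

def screen(prevalence, sensitivity, specificity):
    accuracy = prevalence * sensitivity + (1 - prevalence) * specificity
    missed = 1 - sensitivity  # fraction of affected individuals the test misses
    return accuracy, missed

for label, sens, spec in [("99.5% sens / 95% spec", 0.995, 0.95),
                          ("95% sens / 99.5% spec", 0.95, 0.995)]:
    accuracy, missed = screen(0.01, sens, spec)
    print(f"{label}: 'accuracy' {accuracy:.1%}, misses {missed:.1%} of affected")

# 99.5% sens / 95% spec: 'accuracy' ~95%, misses 0.5% of the affected
# 95% sens / 99.5% spec: 'accuracy' ~99.5%, misses 5.0% of the affected
```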

However, I'd settle for knowing which definition of "accuracy" you were using, for a start.

Rolfe.
 
pgwenthold said:
This will be the case if the test is administered to everyone. However, if you administer the same test only to those who show other clinical signs, then the incidence of the _tested population_ is far higher than 1/1000.

Let's use a pregnancy test as an example. Suppose the pregnancy test is 99% accurate (in both directions) and that 1/1000 women are pregnant. If every woman is given a preg test, then about 90% of the positive results will be false positives.

OTOH, suppose that the only women who get tested are those who have missed a period. ....
pgwenthold, I think I'm in love with you.

You know, I have to explain this concept to two groups of people: those who haven't first heard it Wrath's way, for whom the light bulb comes on almost at once, and those who have heard the "predictive value" spiel without really thinking about what "representative of the population in question" actually means. The latter usually have a great deal of trouble.

Rolfe.
 
Wrath of the Swarm said:

Anyway, it has been shown that a very large number of medical students have problems with this question - and even doctors interpreting the results of things like mammograms, PSAs, and HIV tests. A lot of research has gone into ways of presenting test data that are less likely to cause people to reach the wrong conclusions. When results are returned in terms of population frequency, people are much less likely to misunderstand what the tests mean.
It seems to me you have been asked for your sources for this more than a couple of times in this thread.

There is, of course, a long line of research on cognitive heuristic use (your problem is one example of the "base-rate fallacy" within this literature). I don't know which sources you are referring to, but I am guessing it is probably Kahneman & Tversky, one of several different publication dates...

Anyway, this paper sums up quite a bit of the research - I don't see your particular claim in it, but I only did a quick once-over of the paper. If you have another source or sources in mind... please cite them.
 
Wrath of the Swarm said:
It's a good thing you can look up the answers on a chart, because you sure as hell can't handle the concepts involved.
Sorry, I just realised what that implied.

Wrath, I wrote the spreadsheet. From scratch. I did it to produce that graph I posted earlier, to demonstrate the absolute importance of assuming the correct value for the x-axis when deciding whether a result can be relied on or not.

Before anyone gets twitchy, yes, the graph was scanned in from a book. But as I am the author of the book, I think this is allowed, yes?

Rolfe.
 
Wrath of the Swarm said:
I did say that. I said the test is 99% accurate. That sets both values. If I said that the test would correctly identify a person with the condition 99% of the time, then there wouldn't be enough information for anyone to answer the question - you'd know the false negative rate, but not the false positive. But that isn't what I said.

It's a good thing you can look up the answers on a chart, because you sure as hell can't handle the concepts involved. [/B]

The situation isn't as clear-cut as you seem to think, Wrath. It's simply sloppy writing to cite one number and to assume that it applies equally to both the alpha and beta error rates. Another equally legitimate interpretation is that the test has a 99% accuracy rate in practice, but that figures aren't available to support breaking them out into false-positive and false-negative rates.

The standard terminology exists for a reason. Use it.

Your central point, however, is well-taken. This is a classic med-student error. I believe, though, that most experienced physicians have seen enough to know about this error. Have you any relevant evidence on medical error rates? The JREF forum is hardly typical of medical practitioners in either mathematical sophistication or medical training...
 
Rolfe said:
Hmmm, accuracy determined as the percentage of overall tests performed which are correct, irrespective of whether they are positive or negative.

This depends absolutely on the population you choose to test.
No. Think about what you're saying. If all that matters is whether the result is correct, the distribution of the condition in the population is irrelevant unless there are different error rates for positives and negatives.

In this hypothetical 99% accurate test, it doesn't matter one bit whether everyone tested doesn't have the condition, everyone has the condition, or there's some intermediate state. 99% of the results are accurate, and 1% are not.

The strength of conclusions drawn from the results will depend on the population - but that's not what we're talking about. The power of the test is not the same as its accuracy.
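A trivial check with made-up numbers bears this out - pick whatever prevalence you like:

```python
# With one 99% figure applying to every result, positive or negative,
# the fraction of correct results doesn't depend on prevalence at all.
for prevalence in (0.0, 0.001, 0.5, 1.0):
    accuracy = prevalence * 0.99 + (1 - prevalence) * 0.99
    print(f"prevalence {prevalence}: accuracy {accuracy:.2f}")  # always 0.99
```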

Thank the beneficent powers you don't deal with people.
 
Mercutio said:
It seems to me you have been asked for your sources for this more than a couple of times in this thread.
Yes, I know.

The basic problem is a classic one. I'm trying to find the sources in which I read about the implications for screening tests several years ago.

If I recall correctly, doctors got the right answer more frequently than the general population, but they still tended to reach grossly wrong conclusions about whether a particular patient had a disease. I believe they overestimated the power of the tests significantly.

If I find some good sources on the subject, I'll get back to you.
 
drkitten said:
It's simply sloppy writing to cite one number and to assume that it applies equally to both the alpha and beta error rates. Another equally legitimate interpretation is that the test has a 99% accuracy rate in practice, but that figures aren't available to support breaking them out into false-positive and false-negative rates.
Overall accuracy includes both forms of error. If it's not stated that the probabilities can be further broken down, then there's no reason to presume that they can.

It's not sloppy. I avoided unnecessary complexities (which people are now trying to hide behind, I see).
 
Well, I found this at PubMed. I've found several other references to the research finding that doctors frequently have problems with Bayesian inferences, but not the research itself.

It's common knowledge within the profession, though. Let me keep looking.
 
Wrath of the Swarm said:
Overall accuracy includes both forms of error. If it's not stated that the probabilities can be further broken down, then there's no reason to presume that they can.

Context is important. In the JREF forums, making that kind of assumption with this kind of question probably means you are getting the wrong answer (check the puzzles section if you don't believe me).
 
Wrath of the Swarm said:
Yes, I know.

The basic problem is a classic one. I'm trying to find the sources in which I read about the implications for screening tests several years ago.

If I recall correctly, doctors got the right answer more frequently than the general population, but they still tended to reach grossly wrong conclusions about whether a particular patient had a disease. I believe they overestimated the power of the tests significantly.

If I find some good sources on the subject, I'll get back to you.

Daniel Kahneman won the Nobel Prize for this some years ago. The classic work on cognitive errors in the general public is Judgment Under Uncertainty: Heuristics and Biases, but I assume you have something more specific for medical professionals?
 
9. For the base-rate neglect question, the important finding from these studies (see also Hogarth and Einhorn, 1992, and Robinson and Hastie, 1985) is that the order in which people get the information makes a difference. Although it shouldn't make any difference what order they get information in, subjects usually put greater weight on the most recently received information (Adelman, Tolcott, and Bresnick, 1993, with military intelligence experts dealing with realistic military intelligence problems; Tubbs, Gaeth, Levin, and Van Osdol, 1993, with college students on everyday problems such as troubleshooting a stereo; Chapman, Bergus, Gjerde, and Elstein, 1993, with medical doctors on a realistic diagnosis problem). In more ambiguous situations the first impression had a lasting effect (Tolcott, Marvin, and Lehner, 1989).
11. Does it matter that people cannot accurately revise numerical probabilities (Christensen-Szalanski, 1986)? The deeper study of what people actually do, as called for by Koehler, can provide perspective. What do doctors do, for example, when ideally they should be forming hypotheses and revising hypothesis probabilities as they gather evidence?

12. It is not that they do a numerical integration more complex than Bayes' Theorem to revise probabilities (Gregson, 1993), as Hamm's (1987) explorations show. Doctors thinking aloud about cases don't even speak explicitly of probabilities (Kuipers, Moskowitz, and Kassirer, 1988), though when they are induced to do so it improves their decisions (Pozen, D'Agostino, Selker, Sytkowski, and Hood, 1984; Carter, Butler, Rogers, and Holloway, 1993).

13. Nor do doctors rely exclusively on learning probabilities from experience, like rats learning the contingencies on a lever (Spellman, 1993). While some of their knowledge is based on this kind of experience (Christensen-Szalanski and Beach, 1982; Christensen-Szalanski and Bushyhead, 1981), doctors have to know what to do with both the common diagnoses (8 out of 10) and the rare ones (1 in 10,000). Though in some situations, where people experience an event repeatedly, they can implicitly learn a base rate, in other situations, where people do not experience an event repeatedly but rather learn about it abstractly, they may also be able to take account of a base rate - but if they cannot, the consequences may be important.

14. How, then, do doctors usually handle diagnostic problems? Experts generally organize their extensive knowledge into mental scripts (Schmidt, Norman, and Boshuizen, 1990), complex rules that function with the speed of recognition to provide responses for familiar and unfamiliar situations. Explicit calculation of Bayesian probabilities is not a strength of this type of rule (cf. Hamm, 1993). Instead, experts' accuracy may be a function of the recognition processes, which can bring ideas to mind optimally (Anderson and Milson, 1989). Or accuracy may be due to well-tuned judgment processes governing response choice (Chapter 8 of Abernathy and Hamm, 1994).

15. If doctors' scripts are used accurately, producing results similar to those that wise use of Bayes' theorem would produce, this is due not only to the feedback of experience but also to reflection and to others' criticism (Chapter 11 of Abernathy and Hamm, 1994). Any form of argument can be applied toward justifying a change in a script, including arguments based on probabilistic analysis.

16. For example, when the screening tests for HIV first came out, Meyer and Pauker (1987) warned against ignoring the base rate, i.e., against assuming that someone with no risk factors has AIDS if their screen is positive for AIDS. Guided by such explicit discussion of the probabilities, and by individual cases of people devastated by false positive HIV screens, doctors' shared scripts were adjusted until now they don't recommend that patients be screened unless there are risk factors. The "1993 script" produces behavior that is, for the most part, consistent with a Bayesian analysis. Individual doctors using the script need neither think about probabilities nor understand the Bayesian principles. They just think of the rules, or of cases in which the script is implicit (Riesbeck and Schank, 1989). Note, of course, that this scenario depends on there being someone who understands the probabilistic principles and can shape the script that everyone else will use.

From this site
 
Rolfe said:
pgwenthold, I think I'm in love with you.

"First you have to move that damn cat."

Oh, sorry.

Hey, I have a high affinity for vets (my wife begins fourth-year rotations in two weeks). However, I'm not a cat person. We wouldn't get along. Besides, the aforementioned wife wouldn't go for it.



You know, I have to explain this concept to two groups of people: those who haven't first heard it Wrath's way, for whom the light bulb comes on almost at once, and those who have heard the "predictive value" spiel without really thinking about what "representative of the population in question" actually means. The latter usually have a great deal of trouble.


Actually, I am one of the latter, and am very familiar with the John Allen Paulos take on the matter. However, you made a good point about testing populations. Since I don't know much about the tests for feline leukemia, I figured I'd put it in terms that most people would recognize.
 
This site has a nice discussion of the issue in simple terms. More importantly, it references research studies and asserts that the problem has been replicated many times.

Okay, so it's not a stellar reference... but I think it proves my point. My problem is that medical resources don't discuss the issue much - you'll find a lot more if you do a general Google search on "do doctors have problems with Bayesian reasoning?"
 
Rolfe said:
Please be more clear, Wrath.

When you say that the test is 99% accurate, do you mean that 99% is the arithmetical mean of the sensitivity and the specificity, or do you mean that 99% of the results you get when you actually do the test are correct?

If the former, then I submit that 99% accurate could describe a test with 98% sensitivity and 100% specificity. In which case the doctor would be right anyway.

If the latter, then it (gets considerably more complicated....)
This seems to have been missed. Please address. (Unless you did while I was writing this post, sorry, carried away again.)

I seem to have posted once assuming Wrath meant the former, then a second time assuming he meant the latter, then I see from yet another post that maybe he means the former after all.

Not hiding behind anything, Wrath.

Here is a reference to the simplistic form of the explanation that Wrath is peddling - at least, the one that has caused me the most grief over the years.

JACOBSON, R. H. (1991) How well do serodiagnostic tests predict the infection or disease status of cats? J. Am. Vet. Med. Assoc. 199 (10), 1343-1347.

My pet hate quote from this pile of misinformation:
A negative test result .... is reliable in predicting that a cat does not have the infection/disease.

.... negative test results are good prognosticators of non-infected cats even if the sensitivity .... of the test is not good.
The example he used was that a sensitivity of 90% was just peachy, because in his scenario (only 1% of the "population" infected), 99.9% of the negative results would still be correct. He even remarked that if the sensitivity was only 20% (!), this was OK because >99% of the negative results would still be correct.

It was at this point I was driven to grasp him metaphorically by the throat and point out that 90% sensitivity was still missing 10% of all infected cats, and I really didn't want that. 20% sensitivity is missing an incredible 80% of the infected cats, and there's no way this can be acceptable except by his crazy logic (which Wrath never extrapolated to, but it's where it goes if you don't rein it in).

Of course, this is where my "NEG-TEST™" was born. The reductio ad absurdum of his premise is that if the sensitivity is zero, nevertheless, 99% of the negative results are still correct. The Neg-Test has a sensitivity of zero. I just tweaked it a little by reducing the hypothetical incidence of infection to 0.5% (not unreasonable in a healthy population, and in fact if you are talking closed and tested pedigree breeding establishments even 0.5% is a gross libel).
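For anyone who wants the arithmetic behind the reductio, here's a minimal sketch (Python; I've assumed perfect specificity to keep it simple, with the prevalence figures discussed above):

```python
# Negative predictive value: the fraction of negative results that are correct.

def npv(prevalence, sensitivity, specificity=1.0):
    true_negatives = (1 - prevalence) * specificity
    false_negatives = prevalence * (1 - sensitivity)
    return true_negatives / (true_negatives + false_negatives)

print(f"{npv(0.01, 0.90):.1%}")   # Jacobson's 90% sensitivity -> 99.9% of negatives "correct"
print(f"{npv(0.01, 0.20):.1%}")   # 20% sensitivity -> still 99.2% of negatives "correct"
print(f"{npv(0.005, 0.00):.1%}")  # the NEG-TEST(TM): zero sensitivity, 99.5% "correct"
```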

I don't know why the reductio ad absurdum wasn't spotted when the paper was published, but it wasn't.

This is the reason I have this explanation honed - it seems to imply that sensitivity doesn't matter, and so long as specificity is good (not too many false positives), you're laughing.

Of course, as I said above, the opposite is the case. For a viable screening test, you must be able to trust your negatives, not just trust to luck that the cats you're testing are in fact negative! If you can trust the negatives, you only need to get the (relatively few) positives double-checked. No problem. If you know that the bloody test is missing 10% or more of the infected cats, why do it at all?

In fact Jacobson did say something perfectly sensible in his paper.
When evaluating a serodiagnostic test result, the veterinarian should first consider whether the cat is at high risk (from a high prevalence group) or low risk (from a low prevalence group) for the condition under consideration.
The problem is that he didn't understand that "population" doesn't mean "the village where the cat lives", it means "cats like this one". Including the clinical presentation.

He spent the entire five pages only looking at the left-hand side of the graph, because he couldn't imagine a (geographical) population with more than about 10% incidence of infection. Of course, he isn't a veterinarian. He simply didn't think about the selection-to-test based on clinical presenting signs and the "population" you will be testing if you choose (as many vets do) to test only cats presenting with clinical signs suggestive of infection by the virus.

Once you think about that scenario, you realise you are way up to the right-hand-side of the graph, and positive results become relatively reliable while negative results are untrustworthy as hell.

And of course a "population" in this sense can be one cat, with all its features which put it closer to one side or the other of the graph. Indeed, the bottom line is you don't think about what other cats you tested or didin't choose to test that day, or week, or year, you assess that cat as an individual with probability/risk of infection of x.

I know it's not easy to get your head round, especially if you've got the sloppy version strongly pre-conceived. But it would be nice if Wrath at least read my posts.

Rolfe.
 
Sorry, this is unworthy, but I'm getting a bit narked. (I only just saw the word I suspect led to the reporting of the thread, and yes, I'm not terribly flattered.)

Is it relevant that Wrath had to search PubMed to find something to back him up, after he'd been called on it? Whereas I reached for journals already in my bookcase, and was able to illustrate my point with a graph copied from a book of which I am in fact the author?

Rolfe.
 
