
Statistical scandals

amhartley

Medical research relies heavily on statistical results to substantiate its findings. Yet, stat results are almost always misunderstood and misinterpreted. It's really scandalous.

This research, like any science, seeks to make statements about general states of affairs, associations, effects, etc., given empirical data. That is, this research seeks to be inductive: to say something about general theories given particular observations.

Yet, the common statistical results (p-values, hypothesis test results & confidence intervals) are not inductive, but deductive; they assume this or that theory or hypothesis, & then make statements about data. They move from the general to the specific.

The most ubiquitous statistical result, for instance, is the p-value, p. It is calculated as
p = Pr(Y >= x given Ho), where
Ho = the tested hypothesis,
x = the observed experimental data,
Y = data from a second, hypothetical experiment that is never conducted.
"Y >= x" means "Y constitutes evidence against Ho at least as strong as x does."
In words: "Having run an experiment and obtained x, p is the probability, in a repetition of the experiment, of obtaining evidence against Ho at least as strong as x." So, p is a statement about the data x and Y, assuming Ho. It's not a statement about Ho. It's not inductive.

BTW, hypothesis test results & confidence intervals are also statements about data given hypotheses. They are not inductive either. But they are almost universally interpreted as such.
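The same point can be made about confidence intervals with a small sketch of my own: the 95% figure below is a property of the interval-producing procedure over repeated data, given a fixed true value, not a probability statement about the parameter in any single interval.

Code:
# Illustrative sketch: the "95%" of a 95% confidence interval describes
# how often the procedure covers a fixed true mean over repeated samples.
import random, statistics

def coverage(true_mean=10.0, sigma=2.0, n=30, reps=20_000, seed=2):
    random.seed(seed)
    covered = 0
    for _ in range(reps):
        sample = [random.gauss(true_mean, sigma) for _ in range(n)]
        m = statistics.fmean(sample)
        half_width = 1.96 * sigma / n ** 0.5   # known-sigma interval, for simplicity
        if m - half_width <= true_mean <= m + half_width:
            covered += 1
    return covered / reps

# Prints a number close to 0.95 -- a statement about data given the
# hypothesis (the fixed true mean), not about the hypothesis given the data.
print(coverage())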

Yet, one can hardly blame researchers for these misinterpretations; the misinterpretations even populate introductory statistics textbooks, where p-values & the like are presented as "inferential" measures and where "inference" is defined as "extending sample results to general populations."

For decades, Michael Oakes and others have been studying how people interpret common stat results; they conclude that experts as well as students, applied scientists as well as statisticians, all misinterpret these results. Just put "+michael +oakes +p-value +medical" into google or yahoo searches & read the results! None of what I've said so far is news.

It is high time for us to move beyond recognizing and living with these misunderstandings. We need to start asking questions such as
1. Why do people almost universally misunderstand p-values, hypothesis test results & confidence intervals?
2. Why are these tools still used, despite these shortcomings?
We need to explore the root causes of these problems. I have some ideas about these causes, and would be interested in considering any you may propose as well. I expect that would generate some excellent discussion.
Andrew
 
Many lay people, and some physicians, fail to appreciate the difference between "statistical significance" and "clinical relevance". The two are not the same. Ideally, both are achieved in a robust trial. But, pharmaceutical companies often rely heavily on people's ignorance of this distinction to promote their products, both to the public and the prescribing physician.

-Dr. Imago
 
stat vs clinical significance

Dr Imago,
good point; but then why is statistical significance even relevant? The standard answer is that one wants to know whether the results are "compatible with chance." However, this just begs the question, for,
1. does chance exist?
2. even if chance exists, the results are always compatible with it, no matter how small the p-value.
Of what relevance is statistical significance?
-Andrew
 
1. Yes. At least in the sense that you can't be sure you picked a representative sample.
2. An example problem with made-up numbers: examine whether there is a difference between how tall people are in New York compared to people in Los Angeles. You do this by measuring the height of, let's say, 100 people from each city and then comparing the results. I can almost guarantee that there will be a difference, but the question is whether the difference you get is significant. If you get a difference of 3 mm in mean height, is that relevant? This is where the statistical analysis enters the picture.

To analyze your results, you make the fairly standard assumption that there is no difference. This is called the null hypothesis, IIRC. Then you calculate the probability of getting the results you got if the null hypothesis is correct. If that probability is lower than a previously selected number (usually 0.05; this is the p-value), then you conclude that the null hypothesis is wrong and there is a difference. Using a p-value of 0.05 is to accept that you are wrong 1 time in 20. This is why it is important to be able to repeat experiments.

To summarize: if you don't do any statistical analysis, then you can't be sure your results are relevant.

/Hans
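(For concreteness, Hans's made-up example could be run along the following lines. This is only an illustrative sketch: the numbers are invented, and the two-sample t-test is my choice of procedure, since his post does not name one.)

Code:
# Sketch of the New York vs Los Angeles height comparison, with invented data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
new_york = rng.normal(loc=1750, scale=70, size=100)      # heights in mm
los_angeles = rng.normal(loc=1753, scale=70, size=100)    # true means differ by only 3 mm

t_stat, p = stats.ttest_ind(new_york, los_angeles)
print(f"difference in sample means: {new_york.mean() - los_angeles.mean():.1f} mm, p = {p:.3f}")

# The sample means are never exactly equal, but with a true difference of
# only 3 mm and this much spread, p will usually land well above 0.05, so
# the observed difference is judged "compatible with chance."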
 
Statistical irrelevance

Hans,
Thx for your thoughts. Your post raises several questions:
1. What does picking a representative sample have to do with my question as to whether chance exists?
2. I can guarantee, even before collecting our 200 people, that the 2 cities will have a different mean (or are you referring to the median?) height. The null hypothesis of no difference is, in this case, almost never correct. So what's the purpose of the experiment? The null hypothesis is false a priori.
3. Where did this value of 0.05 come from? It seems quite arbitrary.
4. We would run an experiment to be able to say something about an important scientific hypothesis. However, the p-value (as well as the alpha you are actually talking about) is a statement about data assuming the tested hypothesis. It's not a statement about any hypothesis itself. So, why is alpha, or the p-value, relevant?
5. Being new to JREF, I would have thought that the people here take great pride in avoiding “standard assumptions.”

-Andrew
 
Where did this value of 0.05 come from? It seems quite arbitrary.

Can't remember who started the .05 thing, maybe Pearson or Fisher, but yes, it is quite arbitrary; IIRC, 0.0225 was a much better choice...
 
Jorghnassen,
it was Fisher. But your 0.025 (I think that's what you mean?) just begs the question. In JREF, aren't we interested in removing arbitrariness?
 
How do you remove arbitrariness from the selection of a standard of proof? As far as I can tell, you don't want your test to be so sensitive that it gives false positives, but also not so insensitive that you miss real phenomena. The choice of alpha is a practical, but still somewhat arbitrary, choice.
 
JamesM,
sure. No 2 things in the universe are exactly the same in any respect. Even identical twins just after their original cell splits are different.

And even if they were the same at one instant, in the next instant they'd be different.

Of course, with our imperfect measuring equipment, we might get, say, the same (rounded) height from 2 people. But that's just because we can only measure things to a limited exactitude.
 
statistical arbitrariness

How do you remove arbitrariness from the selection of a standard of proof? As far as I can tell, you don't want your test to be so sensitive that it gives false positives, but also not so insensitive that you miss real phenomena. The choice of alpha is a practical, but still somewhat arbitrary, choice.

WW,
excellent question. The problem here is that any choice of alpha, beyond a WAG (wild-a**ed guess) like Fisher's, is going to have to involve the costs of making a type I or type II error (say, c1 & c2, respectively). If c1/c2 is big, then you want a small alpha. If c2/c1 is big, you want a large alpha (for that will shrink beta, the prob of a type II error assuming an alternative). But neither Fisher nor Neyman-Pearson (who at least had the guts to deal with type II errors) specified a way to bring costs into their testing paradigms. To do so would have required making statements about hypotheses (and that, as I said in my first post here, is something frequentism doesn't do).

So, the arbitrariness is indeed inescapable, as long as one is committed to Fisher's or Neyman-Pearson's systems of stat testing.
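One way to make that cost trade-off concrete (a sketch of my own, not a procedure Fisher or Neyman-Pearson actually offered): fix a one-sided z-test and a specific alternative, compute beta as a function of alpha, and pick the alpha minimizing a weighted sum of the two error costs. In any real application the weights would also have to involve probabilities for the hypotheses themselves, which is exactly the step the frequentist frameworks leave out.

Code:
# Sketch: choose alpha by minimizing c1*alpha + c2*beta for a one-sided
# z-test with a fixed alternative. All numbers below are illustrative.
from scipy.stats import norm

def best_alpha(effect_size, n, c1, c2, grid_points=999):
    """effect_size = (mu1 - mu0)/sigma under the alternative; n = sample size;
    c1, c2 = costs of a type I and a type II error, respectively."""
    best = None
    for i in range(1, grid_points + 1):
        alpha = i / (grid_points + 1)
        z_crit = norm.ppf(1 - alpha)                       # rejection cutoff under Ho
        beta = norm.cdf(z_crit - effect_size * n ** 0.5)   # miss probability under the alternative
        cost = c1 * alpha + c2 * beta
        if best is None or cost < best[0]:
            best = (cost, alpha, beta)
    return best

# If type I errors cost 10x as much as type II errors, the "best" alpha is
# far smaller than when the costs are reversed -- the choice is driven by
# c1/c2, not by any magic 0.05.
print(best_alpha(effect_size=0.5, n=25, c1=10, c2=1))
print(best_alpha(effect_size=0.5, n=25, c1=1, c2=10))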
 
I really like quoting Clemens, so I will: "lies, damned lies and statistics."
 
flippant?

Fuelair, since you're a scholar, I assume you can be more specific about what you mean by that. I have heard that quote (sometimes attributed to Mark Twain or Disraeli, too), and have impressions of why it's true, but can you pls elaborate? Thx for contributing.
 
Clemens *was* Twain.
 
Medical research relies heavily on statistical results to substantiate its findings. Yet, stat results are almost always misunderstood and misinterpreted. It's really scandalous.

No, not really.



It is high time for us to move beyond recognizing and living with these misunderstandings. We need to start asking questions such as
1. Why do people almost universally misunderstand p-values, hypothesis test results & confidence intervals?
2. Why are these tools still used, despite these shortcomings?
We need to explore the root causes of these problems. I have some ideas about these causes, and would be interested in considering any you may propose as well. I expect that would generate some excellent discussion.
Andrew


1. Because it's an advanced mathematical concept not taught at most levels of education, and when it is, it's taught to specific people in certain fields.

2. Because the shortcomings aren't very serious.


Statistical significance is only a piece of the puzzle for hypothesis testing. To be frank: I think there's too much emphasis on it already. When I critique papers, I'm looking for design flaws, misrepresentation, and so on.

A crappy protocol with suspicious sampling gets tossed into the rubbish bin, even if it gets p<.0001, whereas a solid test with good controls and a p<.1 tells me the hypothesis has promise and deserves a closer look.
 
Jorghnassen,
it was Fisher. But your 0.025 (I think that's what you mean?) just begs the question. In JREF, aren't we interested in removing arbitrariness?


p<=.05 is arbitrary, but it isn't pulled out of our collective asses. It's a good step toward increasing confidence, based on the knowledge that no matter how many times you repeat a medical experiment, short of doing it to every human alive, you'll never get 100% certainty. Eventually, we have to put a stake in the ground.

So, no, I don't think the JREF has a policy on 'arbitrariness'.

Science is applied common sense and pragmatism, not philosophical perfection.


In fact, my impression over the years is that what draws people to woo is the allure of certainty, rigidity, &c. Science is for eternal questions. Religion is for eternal answers.
 
No, not really.
1. Because it's an advanced mathematical concept not taught at most levels of education, and when it is, it's taught to specific people in certain fields.

2. Because the shortcomings aren't very serious.
Statistical significance is only a piece of the puzzle for hypothesis testing. To be frank: I think there's too much emphasis on it already. When I critique papers, I'm looking for design flaws, misrepresentation, and so on.

A crappy protocol with suspicious sampling gets tossed into the rubbish bin, even if it gets p<.0001, whereas a solid test with good controls and a p<.1 tells me the hypothesis has promise and deserves a closer look.

It's not scandalous? I encourage you to do the google search I mentioned. In particular, look up the Oakes study where, given a simple written test, only 3 of 70 respondents correctly understood the p-value. Oakes's study has also been repeated many times since 1986 in different settings, with basically the same results: people think p-values are statements about hypotheses. They're not.

1. Thx for your suggestion about teaching. However, if education could have solved these problems, it would have done so long ago. Anybody who has taught statistics will tell you it's easy to get students to go thru the mechanics of calculating critical regions, p-values, conf intervals, etc., but difficult (I would say, nearly impossible) to get them to limit their interpretations to what is mathematically justifiable (i.e., to refrain from saying anything about hypotheses based on testing results). Plus, if education could do the job, you'd think that statisticians, at least, would avoid the misinterpretations. They don't.
2. I’m trying to understand, in your post, the relation between “Statistical significance is only a piece of the puzzle for hypothesis testing. To be frank: I think there's too much emphasis on it already.” and your statement that “the shortcomings aren’t that serious.” Are you saying stat significance isn’t all that important? I would agree entirely with that. But if you’re saying stat signif is an important part of a larger context, I would reply that defining its role necessarily involves an oracular process of speculation. That process depends more on tradition, personal preference & prejudice than on scientific considerations.
 
Statistical analyses, unfortunately, look more impressive to the layman than they really are. They are useful in attributing a level of confidence to a hypothesis, but only in relation to those 'arbitrary' values within the protocol. You cannot separate them and claim that your results conclude anything valid.

I can do a statistical test and make the claim 'I have evidence that x = y'. It means little until you know how I did this, and how much confidence I have in my results. When you then learn that I have a one in five chance of being wrong, it is up to you to determine whether my results are useful. That depends on what you're going to use my results for: buying a new type of spaghetti and meatballs, you might decide it's OK to trust my study. Choosing a new drug for treating your cancer... you'd want something a little more rigorous.

Statistics is merely a tool. And unfortunately, people have a hard enough time understanding probability, let alone understanding its value in everyday life.

Athon
 
p<=.05 is arbitrary, but it isn't pulled out of our collective asses. It's a good step toward increasing confidence, based on the knowledge that no matter how many times you repeat a medical experiment, short of doing it to every human alive, you'll never get 100% certainty. Eventually, we have to put a stake in the ground.

So, no, I don't think the JREF has a policy on 'arbitrariness'.

Science is applied common sense and pragmatism, not philosophical perfection.

In fact, my impression over the years is that what draws people to woo is the allure of certainty, rigidity, &c. Science is for eternal questions. Religion is for eternal answers.

You haven't explained how p<=0.05 is any better than a WAG. Food for thought: e.g., in the point-null testing situation, the probability of the tested hypothesis H could be >50% even when p<0.05. Plus, data producing p<0.05 may actually constitute evidence in favor of H. It may "increase our confidence," but it may not, and it may increase our confidence for or against H, depending on such things as power. WAGs often get people into trouble like this.
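To illustrate the point-null claim, a rough sketch under assumptions of my own choosing (equal prior weight on Ho and on a diffuse alternative for the mean): with a large sample, data that just reach p < 0.05 can still leave Ho more probable than not.

Code:
# Sketch of the point-null situation: Ho says mu = 0; the alternative puts a
# diffuse N(0, tau^2) prior on mu; Ho and the alternative get equal prior weight.
# These priors are assumptions made for illustration only.
from math import sqrt
from scipy.stats import norm

def posterior_prob_H0(z=1.96, n=10_000, sigma=1.0, tau=1.0, prior_H0=0.5):
    """P(Ho given xbar), where xbar = z * sigma / sqrt(n) just reaches p = 0.05."""
    se = sigma / sqrt(n)
    xbar = z * se
    m0 = norm.pdf(xbar, loc=0.0, scale=se)                    # likelihood of xbar under Ho
    m1 = norm.pdf(xbar, loc=0.0, scale=sqrt(tau**2 + se**2))  # marginal likelihood under the alternative
    return prior_H0 * m0 / (prior_H0 * m0 + (1 - prior_H0) * m1)

# With n = 10,000 and z = 1.96 (two-sided p = 0.05), the posterior probability
# of Ho here comes out above 0.9: "p < 0.05" and "Ho is probably false" are
# not the same statement.
print(posterior_prob_H0())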

Putting a stake in the ground: I think most people in JREF would agree that we should expunge science of as much arbitrariness as possible, and ground our findings on solid principles whenever we can. If we don't do that, can we really claim to be scientific? Otherwise, science becomes a tool for the powerful & influential: WHOSE "common sense" will we rely on? WHOSE "pragmatism" is authoritative? Are you saying that the average Randi member, obviously interested in questioning the assumptions of the paranormal etc., would not be willing to examine their own assumptions?

Sorry, what is “woo?”

Eternal questions & answers: Thanks; I never heard it put that way. However, a call for pragmatism over philosophy is often a cop-out: a refusal to entertain the tough (and often important) questions. Surely that's not your position?
 
