
Statistical scandals

Statistical analyses . . . are useful in attributing a level of confidence to a hypothesis . . .

. . .When you then learn that I have a one in five chance of being wrong, it is up to you to determine whether my results are useful. That depends on what you're going to use my results for: buying a new type of spaghetti and meatballs, you might decide it's OK to trust my study. Choosing a new drug for treating your cancer... you'd want something a little more rigorous.

Statistics is merely a tool. And unfortunately, people have a hard enough time understanding probability, let alone understanding its value in everyday life.

Athon

Athon,
Some good ideas here. As I mentioned to Blutoski, however, reading p-values etc. as if they can “attribute a level of confidence to a hypothesis” will at times get the researcher in trouble. E.g., a p-value, however small, can actually represent evidence FOR the hypothesis H, even though the majority of people think it’s evidence AGAINST H.

“a one in five chance of being wrong:” Can you explain this more? Are you saying that, if I conclude H is false, my chances of being wrong are 20%?

“it is up to you to determine whether my results are useful:” I don’t think you mean to say that any person has complete freedom to ignore statistical results. That’s moving towards the “oracular speculation” I mentioned in a post above. But if you don’t mean that, then what do you mean?

If, as you say, "Statistics is merely a tool," then would you conclude stats is not a science? I.e., it's useful for behavior & decisionmaking, but not increasing knowledge? That's what Neyman & Pearson said in 1933.
 
Putting a stake in the ground: I think most people in JREF would agree that we should expunge science of as much arbitrariness as possible, and found our findings on solid principles whenever we can. If we don’t do that, can we really claim to be scientific? Otherwise, science becomes a tool for the powerful & influential, i.e., WHOSE “commonsense” will we rely on? WHOSE “pragmatism” is authoritative? Are you saying that the average Randi member, obviously interested in questioning the assumptions of the paranormal etc, would not be willing to examine their own assumptions?

What's wrong with arbitrary figures? They are significant when all parties know their value. They are useless when no value is attributed. Easy!

In a situation where the goal is 'we will meet together', I can pick an arbitrary place. The place where we meet is significant only insofar as it lets us end up in the same spot. Same in statistics: arbitrarily picking a number at which we can claim a level of confidence is significant only for that confidence. If all parties understand the value, its merit lies within that. Giving it additional meaning (which many people erroneously do) is where problems arise.

Athon
 
Athon,
Some good ideas here. As I mentioned to Blutoski, however, reading p-values etc. as if they can “attribute a level of confidence to a hypothesis” will at times get the researcher in trouble. E.g., a p-value, however small, can actually represent evidence FOR the hypothesis H, even though the majority of people think it’s evidence AGAINST H.

True. Which is a problem in the interpretation of the significance of the results. People have problems understanding probability.

“a one in five chance of being wrong:” Can you explain this more? Are you saying that, if I conclude H is false, my chances of being wrong are 20%?

Yes.

“it is up to you to determine whether my results are useful:” I don’t think you mean to say that any person has complete freedom to ignore statistical results. That’s moving towards the “oracular speculation” I mentioned in a post above. But if you don’t mean that, then what do you mean?

We all construct meaning out of our interpretations. If I do a simple statistical test... for instance, I want to determine what my chances are of randomly picking a person with black hair off the street... I might go out and select five people from the population and find one of them has black hair. I will conclude from that we have a one in five chance of getting a black-haired individual from a random draw. We could work out how significant this number is; in other words, the chances my probability would be wrong in a larger population. If I consider this to be an acceptable risk, the information is useful to me (I will use it). You, on the other hand, might feel you want better information, as the risk you are wrong is too great. For you, the study has less use.

The end result is how we use the statistics.
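
To make the point concrete, here is a minimal Python sketch. The 30% "true" proportion, the five-person sample and the number of repetitions are all hypothetical, chosen only to show how noisy an estimate from five people is:

[code]
import random

random.seed(1)
TRUE_PROPORTION = 0.3   # hypothetical "real" share of black-haired people
SAMPLE_SIZE = 5
TRIALS = 10_000

estimates = []
for _ in range(TRIALS):
    # stop five passers-by; each has black hair with probability TRUE_PROPORTION
    sample = [random.random() < TRUE_PROPORTION for _ in range(SAMPLE_SIZE)]
    estimates.append(sum(sample) / SAMPLE_SIZE)

mean_estimate = sum(estimates) / TRIALS
far_off = sum(abs(e - TRUE_PROPORTION) >= 0.2 for e in estimates) / TRIALS
print(f"Average estimate over many 5-person samples: {mean_estimate:.2f}")
print(f"Share of 5-person samples off by 0.2 or more: {far_off:.2f}")
[/code]

Whether that much scatter is an acceptable risk is exactly the judgement call described above.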

If, as you say, "Statistics is merely a tool," then would you conclude stats is not a science? I.e., it's useful for behavior & decisionmaking, but not increasing knowledge? That's what Neyman & Pearson said in 1933.

We can only have confidence in the validity of any knowledge we gather. Hence all tools we have to increase knowledge - statistical analyses being one of them - can only contribute to our level of confidence.

Athon
 
What's wrong with arbitrary figures? They are significant when all parties know their value. They are useless when no value is attributed. Easy!

In a situation where the goal is 'we will meet together', I can pick an arbitrary place. The place where we meet is significant only insofar as it lets us end up in the same spot. Same in statistics: arbitrarily picking a number at which we can claim a level of confidence is significant only for that confidence. If all parties understand the value, its merit lies within that. Giving it additional meaning (which many people erroneously do) is where problems arise.

Athon

You raise a couple of issues here.
1. As I have said 2 times above, the “number” we pick (and I think you are referring to alpha?) has no correspondence to the “level of confidence” we can place in this or that hypothesis. It’s therefore impossible for “all parties” to “understand the value” if, by “value,” you mean such “level of confidence.”
2. Despite heroic efforts (and I mentioned this in my first post above), stats educators have failed to prevent even experts from (as you say) “giving it additional meaning.” And the reason for this is that, when you take away the “additional meanings,” you’re left with nothing! Stat significance “tastes great and is less filling.”
 
You raise a couple of issues here.
1. As I have said 2 times above, the “number” we pick (and I think you are referring to alpha?) has no correspondence to the “level of confidence” we can place in this or that hypothesis. It’s therefore impossible for “all parties” to “understand the value” if, by “value,” you mean such “level of confidence.”

Either your explanation or my understanding is at fault here (not sure which), but I don't understand what you mean by this. The choice of alpha might be arbitrary; however, as far as I can tell, its value correlates with how much confidence we should have in the results.

Explain how it doesn't.

2. Despite heroic efforts (and I mentioned this in my first post above), stats educators have failed to prevent even experts from (as you say) “giving it additional meaning.” And the reason for this is that, when you take away the “additional meanings,” you’re left with nothing! Stat significance “tastes great and is less filling.”

Again, I don't understand what you mean by 'left with nothing'. Additional meaning is added purely because statistics is only useful in conjunction with speculation and conjecture, and the human mind is essentially a risk evaluator. Drawing the line between conjecture and statistics is difficult to do when creating a big picture.

Athon
 
True. Which is a problem in the interpretation of the significance of the results. People have problems understanding probability.
. . .
We all construct meaning out of our interpretations. If I do a simple statistical test... for instance, I want to determine what my chances are of randomly picking a person with black hair off the street... I might go out and select five people from the population and find one of them has black hair. I will conclude from that we have a one in five chance of getting a black-haired individual from a random draw. We could work out how significant this number is; in other words, the chances my probability would be wrong in a larger population. If I consider this to be an acceptable risk, the information is useful to me (I will use it). You, on the other hand, might feel you want better information, as the risk you are wrong is too great. For you, the study has less use.
. . .
We can only have confidence in the validity of any knowledge we gather. Hence all tools we have to increase knowledge - statistical analyses being one of them - can only contribute to our level of confidence.

Athon
Athon, I’m trying to keep up with all the ideas! First, you say alpha is the chance of being wrong, given one has concluded H is false. Thus, we have alpha = Pr(H given the data). Alpha would then be a bayesian probability. You have demonstrated my point that, even with substantial training (as it seems you have had), people routinely misinterpret the standard results. If experts can’t get it right, what hope is there for the rest of us? And, as I asked in my first post, why do we continue to report alpha? When we call it “inferential,” we are just giving people enough rope to hang themselves with.

On your “black hair” experiment: By equating “how significant this number is” with “the chances my probability would be wrong in a larger population,” you are again erroneously calling alpha a bayesian probability. Alpha is pretty tricky, isn’t it? But my point is that these errors are inescapable. As is your erroneous reference to “risk.” Measurement of risk necessitates bayesian reasoning.

“confidence in the validity of any knowledge we gather:” In statistics, do we truly “gather knowledge”? Rather, we gather data. What does that have to do with “increasing knowledge”? And since you continue to speak about levels of confidence and increasing our knowledge, I'm unclear about what you meant earlier by saying "statistics is only a tool."
 
Either your explanation or my understanding is at fault here (not sure which), but I don't understand what you mean by this. The choice of alpha might be arbitrary; however, as far as I can tell, its value correlates with how much confidence we should have in the results.

Explain how it doesn't.
. . .
Again, I don't understand what you mean by 'left with nothing'. Additional meaning is added purely because statistics is only useful in conjunction with speculation and conjecture, and the human mind is essentially a risk evaluator. Drawing the line between conjecture and statistics is difficult to do when creating a big picture.

Athon

Athon, as I mentioned to Blutoski, p<0.05 can constitute evidence FOR the tested hypothesis H, not just evidence against it. Plus, it is possible that Prob(H given data)>50% even though p<0.05. Therefore, p has nothing to do with levels of confidence in hypotheses. Maybe my post referring to bayesian probabilities will clear this up for you.

On your “statistics is only useful in conjunction with speculation and conjecture:” what do you mean by that? Sounds again like oracular speculation; hardly scientific.
 
Athon, I’m trying to keep up with all the ideas! First, you say alpha is the chance of being wrong, given one has concluded H is false. Thus, we have alpha = Pr(H given the data).

Firstly, my apologies for oversimplifying. I guess sometimes I slip back into 'teaching year 10 probability' mode where the details are occasionally dropped in order to get kids to understand the concept of 'probable' versus 'possible'.

Alpha is simply the chance that your null hypothesis is correct. The lower it is, the less likely the null is to be the explanation for the phenomena you are observing, and the greater your confidence can be that your hypothesis is a correct explanation for the observation.

Alpha would then be a bayesian probability.

I must admit, I do side more with Bayesian philosophy than with the frequentists, but that's another discussion. How is alpha = Pr(H given the data) strictly Bayesian? Perhaps there's something about that philosophy I'm missing, or something I've missed in your argument.

You have demonstrated my point that, even with substantial training (as it seems you have had), people routinely misinterpret the standard results. If experts can’t get it right, what hope is there for the rest of us? And, as I asked in my first post, why do we continue to report alpha? When we call it “inferential,” we are just giving people enough rope to hang themselves with.

I'd hardly qualify as an expert. I've used statistics professionally and have taught it at a low high-school level, but I must admit many of the finer points still lose me. Which is why I'm not arguing this vehemently; rather, I'm asking because I honestly wonder if I've lost something in your explanations.

You still haven't explained why alpha is arbitrary (or at least, where I can find your explanation).

On your “black hair” experiment: By equating “how significant this number is” with “the chances my probability would be wrong in a larger population,” you are again erroneously calling alpha a bayesian probability. Alpha is pretty tricky, isn’t it? But my point is that these errors are inescapable. As is your erroneous reference to “risk.” Measurement of risk necessitates bayesian reasoning.

Bayesian reasoning takes into account the chance that variables outside of our scope are at work influencing the probability. It's a real world application, even though it is a little vague, as it accepts that we cannot see all of the variables in an experiment. Alpha is still useful, even though it is not applied as strictly as frequency might dictate.

“confidence in the validity of any knowledge we gather:” In statistics, do we truly “gather knowledge”? Rather, we gather data. What does that have to do with “increasing knowledge”? And since you continue to speak about levels of confidence and increasing our knowledge, I'm unclear about what you meant earlier by saying "statistics is only a tool."

Ok, I was wondering when we would get into 'definitions' territory.

Data is raw observable information we interpret from our surroundings. Knowledge is the interpretation of that data in reference to the context it's taken from. Statistics is a way of interpreting the relevance of data in context with its environment. Therefore, we can only construct knowledge personally, although data is something objective that exists outside of our observation of it. Statistics is a tool for classifying and attributing values to information we gather from the environment.

Athon
 
Athon, as I mentioned to Blutoski, p<0.05 can constitute evidence FOR the tested hypothesis H, not just evidence against it. Plus, it is possible that Prob(H given data)>50% even though p<0.05. Therefore, p has nothing to do with levels of confidence in hypotheses. Maybe my post referring to bayesian probabilities will clear this up for you.

I read it before, and it's still not clear. I'm not being deliberately vague, but I need you to explain what you mean by that. An example would be good. I admit I'm lost as to what you mean by alpha sometimes constituting evidence for the tested hypothesis rather than against it.

On your “statistics is only useful in conjunction with speculation and conjecture:” what do you mean by that? Sounds again like oracular speculation; hardly scientific.

Oracular speculation? I'm not familiar with the term. It sounds like you're meaning 'talking speculatively' about an observation. It certainly has a place in science as far as I can tell, which is why I wonder if you mean something else by it.

Any statistical figure is virtually meaningless outside of the context it's provided in. I've agreed that it is practically Bayesian; however, I honestly can't see how else statistics can be useful if not in a context.

Athon
 
It’s not scandalous? I encourage you to do the google search I mentioned. In particular, look up the Oakes study where, given a simple written test, only 3 of 70 respondents correctly understood the p-value. Also Oakes’s study has been repeated many times since 1986 in different settings. Basically they get the same results: People think p-values are statements about hypotheses. They’re not.

1. Thx for your suggestion about teaching. However, if education could have solved these problems, it would have done so long ago. Anybody having taught statistics will tell you it’s easy to get students to go thru the mechanics of calculating critical regions, p-values, conf intervals, etc., but difficult (I would say, nearly impossible) to get them to limit their interpretations to what is mathematically justifiable (i.e., to refrain from saying anything about hypotheses based on testing results). Plus, if education could do the job, you’d think that statisticians, at least, would avoid the misinterpretations. They don’t.

The ones who do it for a living understand. They're the ones who matter. That a baseball statistician doesn't understand medical publications is not a crisis.



2. I’m trying to understand, in your post, the relation between “Statistical significance is only a piece of the puzzle for hypothesis testing. To be frank: I think there's too much emphasis on it already.” and your statement that “the shortcomings aren’t that serious.” Are you saying stat significance isn’t all that important? I would agree entirely with that. But if you’re saying stat signif is an important part of a larger context, I would reply that defining its role necessarily involves an oracular process of speculation. That process depends more on tradition, personal preference & prejudice than on scientific considerations.

I'm saying that statistical significance is only part of what makes a paper's claim defensible.

Research is a human endeavour. It was not 'discovered'; it was created to suit our purposes. We are allowed to use our judgement when choosing criteria, because there are no scientific facts about what confidence intervals are 'appropriate'.
 
You haven’t explained how p<=0.05 is any better than a WAG. Food for thot: e.g., in the point-null testing situation, the probability of the tested hypothesis H could be >50% even when p<0.05. Plus, data producing p<0.05 may actually constitute evidence in favor of H. It may “increase our confidence” but it may not, and it may increase our confidence for or against H, depending on such things as power. WAGs often get people into trouble such as this.

I found this confusing. Maybe an example or two would help.



Putting a stake in the ground: I think most people in JREF would agree that we should expunge science of as much arbitrariness as possible, and found our findings on solid principles whenever we can. If we don’t do that, can we really claim to be scientific? Otherwise, science becomes a tool for the powerful & influential, i.e., WHOSE “commonsense” will we rely on? WHOSE “pragmatism” is authoritative? Are you saying that the average Randi member, obviously interested in questioning the assumptions of the paranormal etc, would not be willing to examine their own assumptions?

Well, I can't speak for all members, but I think that's my point: p values aren't the obsession that Oakes seems to make them out to be. They're a piece of the puzzle, but my time and effort in study design goes into formulating controls that can't be confounded.



Sorry, what is “woo?”
It's a term skeptics use to describe non-skeptical stuff. Like psychic powers, ghosts. I guess 'woo' is short for 'woo-woo'.



Eternal questions & answers: Thanks; I never heard it put that way. However, a call for pragmatism over philosophy is often a cop-out; a refusal to entertain the tough (and often important) questions. Surely that's not your situation?

In this case, a call for pragmatism over philosophy is based on experience. Two more expressions:

"Best is the worst enemy of better"

"Analysis paralysis"

Medical science isn't so totally committed to p<.05. I do studies with .01 when I can afford it. Science is done with budgets, and p values and Ockham's razor are ways to direct resources toward the most promising projects. Ockham's razor is often wrong, too, but scientists are quite fond of it as a guideline.
 
Jorghnassen,
it was Fisher. But your 0.025 (I think that's what you mean?) just begs the question. In JREF, aren't we interested in removing arbitrariness?

The thing is, and you can test this with simulation (or possibly figure it out analytically), that if you do an experiment and get a p-value below but close to .05, repeating the experiment will likely yield p>.05, that is, a non-significant result. If you get a p-value below .025 (or something like that), subsequent experiments will much more consistently get p<.025 (hence it is a less arbitrary cut-off than .05). There was a whole debate between Berger & Sellke and Casella & Berger on p-values and Bayesian statistics back in 1990-1991 if I recall, and the articles they published were really enlightening on the issue.
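
For anyone who wants to try that simulation, here is a rough Python sketch (scipy required). The prior over true effects, the group size, and the "just under .05" and "just under .025" windows are arbitrary illustrations, not anything taken from the papers mentioned:

[code]
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 25                 # hypothetical per-group sample size
PAIRS = 40_000

def p_value(effect):
    """p-value from one simulated two-group experiment with a given true effect."""
    control = rng.normal(0.0, 1.0, N)
    treated = rng.normal(effect, 1.0, N)
    return stats.ttest_ind(treated, control).pvalue

near_05, near_025 = [], []
for _ in range(PAIRS):
    effect = rng.uniform(0.0, 1.0)       # hypothetical prior over true effect sizes
    first = p_value(effect)
    second = p_value(effect)             # an exact replication of the same study
    if 0.04 <= first < 0.05:
        near_05.append(second >= 0.05)   # did the replication come out non-significant?
    elif 0.015 <= first < 0.025:
        near_025.append(second >= 0.05)

print(f"First p just under .05 : {np.mean(near_05):.2f} of replications non-significant")
print(f"First p just under .025: {np.mean(near_025):.2f} of replications non-significant")
[/code]

Under these made-up inputs, replications of a "barely significant" first result miss the .05 cut-off far more often than replications of a result that came in well under .025, which is the instability being described.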
 
Medical research relies heavily on statistical results to substantiate its findings. Yet, stat results are almost always misunderstood and misinterpreted. It's really scandalous.

The simple cause of all this is that statistics is something that seems essential to the biological sciences, so is used heavily there, yet is mathematically well beyond the vast majority of biologists.

Until everyone is well enough versed in maths to really understand stats, there will be problems of all sorts. As an example, how many biologists appeal to the normal (gaussian) distribution like zombies? Many. How many actually know where it comes from and therefore when it applies? Hardly any.
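
One common route by which the Gaussian shows up, the central limit theorem for averages, can be seen in a few lines of Python; the exponential "raw data" and the sample size of 50 below are arbitrary illustrations:

[code]
import numpy as np

rng = np.random.default_rng(11)

def skew(x):
    """Crude sample skewness: near 0 for a symmetric, roughly normal shape."""
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# Raw measurements drawn from a decidedly non-normal (exponential) distribution
raw = rng.exponential(scale=2.0, size=100_000)

# Means of repeated samples of 50: this is where the Gaussian actually arises
sample_means = rng.choice(raw, size=(20_000, 50)).mean(axis=1)

print(f"Skewness of the raw data:     {skew(raw):.2f}")
print(f"Skewness of the sample means: {skew(sample_means):.2f}")
[/code]

Averages of many independent measurements tend toward normal; a single raw measurement need not, which is exactly the "when it applies" question.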
 
Hans,
Thx for your thoughts. Your post raises several questions:
1. What does picking a representative sample have to do with my question as to whether chance exists?
2. I can guarantee, even before collecting our 200 people, that the 2 cities will have a different mean (or are you referring to median?) height. The null hypothesis of no difference (in this case) is almost never correct. So what’s the purpose of the experiment? The null is false a priori.
3. Where did this value of 0.05 come from? It seems quite arbitrary.
4. We would run an experiment to be able to say something about an important scientific hypothesis. However, the p-value (as well as the alpha you are actually talking about) is a statement about data assuming the null hypothesis is true. It’s not a statement about any hypothesis itself. So, why are alpha or the p-value relevant?
5. Being new to JREF, I would have thought that the people here take great pride in avoiding “standard assumptions.”

-Andrew
1. What I meant is that if you do a new measurement you would not get exactly the same results, and therefore chance affects your data. A quick analogy: imagine that you throw a die (6-sided) 1000 times and record all the results. If you want to check what the average result is and don't want to add up all the numbers, you could randomly pick 30 of them and calculate a mean; let's say you get 3.43. If you again pick 30 numbers at random you might get 3.56, and if you do this enough times you will eventually get values like 2.1 or 4.9. You won't get them often, but sometimes you will, and that's where chance comes in (see the sketch after point 5).

2. I meant mean, and you're right that it would be different for the two cities. But if I do the experiment twice, and the first time I find that New Yorkers are 1 cm taller and the second time that they are 1 cm shorter, then I can't claim that there is a difference in mean height, because the error in my measurement is greater than the difference, even if I can measure height with an accuracy of 0.1 mm.

3. Yes, it is, but it is the one normally used. If you want or need to be more stringent you can go down to 0.005 or lower, but then you increase the risk of making a Type II error: not seeing a difference when it is there.

4. A lot of this is discussed by others above. The alpha value is what you decide beforehand to be the acceptable risk of making a Type I error: finding a difference when there is none. The p-value is what you actually get. A medical company wants to show that its product is better than the competition's or better than placebo. If you don't have the p-value, how are you going to know whether the difference between the treated group and the control group is significant or just random noise?

5. There are lots of standard assumptions in math, and as long as you give your alpha and p-values, others can check your work. That possibility to check is a lot of what goes on here at JREF, if I've got it right.
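
Here is a quick Python sketch of the die-rolling analogy from point 1 (the seed and counts are arbitrary):

[code]
import random

random.seed(42)
rolls = [random.randint(1, 6) for _ in range(1000)]     # the 1000 recorded rolls

# Re-estimate the average from many different random picks of 30 rolls
subsample_means = [sum(random.sample(rolls, 30)) / 30 for _ in range(10_000)]

print(f"Mean of all 1000 rolls:     {sum(rolls) / len(rolls):.2f}")
print(f"Smallest 30-roll mean seen: {min(subsample_means):.2f}")
print(f"Largest 30-roll mean seen:  {max(subsample_means):.2f}")
[/code]

Most 30-roll means cluster near the overall mean, but every so often one lands far out; that scatter is the "chance" in the data.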

There is, however, a lot of messing around with statistics in the press. For example, if 2 out of 100 people in the control group get a certain disease and 4 out of 100 in the group using a certain medication/product get the same disease, is that a risk increase of 100% or of 2 percentage points? Guess which sells more newspapers. :)
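
The newspaper example in numbers (the hypothetical 2-in-100 versus 4-in-100 comparison above):

[code]
control_cases, treated_cases, group_size = 2, 4, 100

risk_control = control_cases / group_size                 # 0.02
risk_treated = treated_cases / group_size                 # 0.04

relative_increase = (risk_treated - risk_control) / risk_control   # "100% more risk!"
absolute_increase = risk_treated - risk_control                    # a 2-point increase

print(f"Relative risk increase: {relative_increase:.0%}")
print(f"Absolute risk increase: {100 * absolute_increase:.0f} percentage points")
[/code]

Same data, two headlines.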

/Hans
 
Firstly, my apologies for oversimplifying. I guess sometimes I slip back into 'teaching year 10 probability' mode where the details are occasionally dropped in order to get kids to understand the concept of 'probable' versus 'possible'.

Alpha is simply the chance that your null hypothesis is correct. The lower it is, the less likely the null is to be the explanation for the phenomena you are observing, and the greater your confidence can be that your hypothesis is a correct explanation for the observation.

I must admit, I do side more with Bayesian philosophy than with the frequentists, but that's another discussion. How is alpha = Pr(H given the data) strictly Bayesian? Perhaps there's something about that philosophy I'm missing, or something I've missed in your argument.

. . .Athon

Athon,
Yes, we all tend to drop some of the notation; most of the time that’s okay but it can get us into trouble. It even flummoxed RA Fisher, who late in his life had to recant some of his earlier convictions about confidence intervals. Alpha is obtained by assuming the tested hypothesis H, and then calculating the probability of rejecting it: Alpha = Prob(Reject H, given H). It is a probability about the procedure, not about H. Bayesian probabilities, on the other hand, are probabilities about hypotheses given data, e.g., Prob(H given Reject H).
I can’t address everything you’ve said (this is spawning too many separate threads, with you & others), but pls feel free to ask for specific coverage of anything I’m skipping. I think this will clear up a lot of the confusion: You have admitted to believing that alpha = the chance one is wrong, given one has rejected H. Now, once one has rejected H, one is wrong if and only if H is true. So, you are saying (and I’m repeating myself here) that alpha = Prob(H given Reject H). As I mentioned in the last paragraph, Prob(H given Reject H) is a bayesian probability. Does that help?
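
If it helps, the difference can be simulated. In the Python sketch below, the proportion of true nulls (80%), the effect size when the null is false, and the sample size are all made-up numbers; the only point is that the two probabilities need not be close:

[code]
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
ALPHA = 0.05
N = 20                    # hypothetical sample size per experiment
PRIOR_H_TRUE = 0.8        # hypothetical: 80% of tested nulls really are true
EFFECT_IF_FALSE = 0.5     # hypothetical effect size when the null is false
EXPERIMENTS = 50_000

rejected, rejected_and_h_true = 0, 0
for _ in range(EXPERIMENTS):
    h_is_true = rng.random() < PRIOR_H_TRUE
    mean = 0.0 if h_is_true else EFFECT_IF_FALSE
    data = rng.normal(mean, 1.0, N)
    if stats.ttest_1samp(data, 0.0).pvalue < ALPHA:
        rejected += 1
        rejected_and_h_true += h_is_true

print(f"alpha = Prob(Reject H | H true):   {ALPHA}")
print(f"Simulated Prob(H true | Reject H): {rejected_and_h_true / rejected:.2f}")
[/code]

The first number is fixed by the procedure; the second depends on how often H is true to begin with and on power, which is why it cannot be read off alpha.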
 
oracular speculation

Oracular speculation? I'm not familiar with the term. It sounds like you're meaning 'talking speculatively' about an observation. It certainly has a place in science as far as I can tell, which is why I wonder if you mean something else by it.
Athon

An oracle is "The response given through such a medium, often in the form of an enigmatic statement or allegory." Oracular speculation is mysterious pronouncements that seem to come out of thin air, with no discernable justification, but which, because of who is saying them, seem to be authoritative. That is exactly the case with alpha. Mathematically, alpha is just a statement about how often we would reject H if H were true. But, mysteriously, people who should know better (yes, even statisticians, as Oakes has shown) turn alpha around to convey meaning about H itself. Hence the title of my post refers to "scandals."
 
The ones who do it for a living understand. They're the ones who matter. That a baseball statistician doesn't understand medical publications is not a crisis.

I work with professional statisticians all the time who think p-values are statements about the tested hypotheses. I've even repeated Oakes's experiment myself, and found little difference in my samples of people compared with his. Most often, statisticians follow RA Fisher's pattern of interpretation, claiming that p measures the evidence against H, or they think (as does Athon) alpha or p is the post-experimental prob of H.

Plus, if the "ones who do it for a living" understood, they wouldn't call p-values, alpha, etc. "inferential." Statistical inference, as I said in my very first post here, is "extending sample results to general populations." But p-values & alpha don't do that.
 
Of course, with our imperfect measuring equipment, we might get, say, the same (rounded) height from 2 people. But that's just because we can only measure things to a limited exactitude.
And you might, by dumb luck, pick people who were all exactly six feet tall within the tolerance of your measuring device. So you can't guarantee the means will be different.
 
amhartley wrote: <<You haven’t explained how p<=0.05 is any better than a WAG. Food for thot: e.g., in the point-null testing situation, the probability of the tested hypothesis H could be >50% even when p<0.05. Plus, data producing p<0.05 may actually constitute evidence in favor of H. It may “increase our confidence” but it may not, and it may increase our confidence for or against H, depending on such things as power. WAGs often get people into trouble such as this.>>

I found this confusing. Maybe an example or two would help.

Blutoski,
There are many papers & books you could read about this. Berger & Sellke had a paper in 1987 (in The American Statistician) showing the disparity, in the point-null testing situation, between p-values & the post-experimental prob of the tested hypothesis H. Richard Royall’s 1997 book “Statistical Evidence: A Likelihood Paradigm” (I may not have the title exactly right, but it’s published by Chapman & Hall) showed how a p<0.05 should in some cases increase confidence in favor of H. This invalidates the standard guidance, followed by statisticians as well as medical types, to consider p<0.05 as “moderate evidence against H.” Royall also had a paper in 1986 showing how the strength of evidence associated with any p-value is a function of precision. We could, conceivably, combine precision with p, to measure evidence; however, his point was that a much easier, clearer & more direct measure of evidence is the likelihood ratio.
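
For a worked illustration of that kind of disparity, here is a short Python calculation. Everything in it is hedged as an assumption for the sake of the example, not a reconstruction of those papers: a point null mu = 0 against a Normal(0, 1) prior on mu under the alternative, known sigma = 1, 50/50 prior odds, and a sample mean sitting exactly at the two-sided p = 0.05 boundary for n = 10,000:

[code]
import numpy as np
from scipy import stats

n, sigma, tau = 10_000, 1.0, 1.0
z = 1.96                              # sample mean right at the two-sided p = .05 line
se = sigma / np.sqrt(n)
xbar = z * se

like_h0 = stats.norm.pdf(xbar, loc=0.0, scale=se)                       # density under H0: mu = 0
like_h1 = stats.norm.pdf(xbar, loc=0.0, scale=np.sqrt(tau**2 + se**2))  # marginal density under H1

bf_01 = like_h0 / like_h1
posterior_h0 = bf_01 / (1.0 + bf_01)          # posterior of H0 with 50/50 prior odds

print(f"Two-sided p-value:                 {2 * (1 - stats.norm.cdf(z)):.3f}")
print(f"Posterior probability of the null: {posterior_h0:.2f}")
[/code]

With these (admittedly arbitrary) inputs the data are "significant at the 5% level" yet the point null comes out more probable than not, which is the sort of divergence those papers work through in general.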

The root problem is that p is a statement about data given H, not an inferential statement. To consider it inferential, as do most stat textbooks, statisticians and consumers of statistics, is a mistake.

Imagine if I tried to measure barometric pressure with a thermometer. I might come up with statements like “if the temp is above 40 degrees C, we can consider pressure to be high. Otherwise, it’s low.” That would be a WAG. But a WAG is all one can develop for measuring pressure using a thermometer, because a thermometer is a tool made for measuring temperature, not pressure. Similarly, WAGs are all one can come up with for measuring evidence (as do statisticians) using p-values.
 
