
Statistical significance

Gr8wight

Can someone point me at a website that has a good explanation, in layman's terms, of why statistical significance is important in an experiment? I'd like to read up a bit on what makes a scientific result statistically significant, and what expressions of error mean (the term escapes me right now, but I'm talking about what is expressed as a +/- range of accuracy in experimental results).
 

Will you settle for me just trying to explain? It's easier than finding a well-written web site.

Every experiment is really a comparison between two hypotheses -- an "experimental" hypothesis and a "null" hypothesis. For example, my experimental hypothesis might be that I can predict coin flips with better than 50% accuracy, while my "null" hypothesis would then be that I can't.

So if I flip a coin eight times, the null hypothesis would say that the most likely number of correct predictions would be four (which has about a 22% chance of happening). [ETA: make that 27% chance, 'cause I can't read statistical tables. "A mind is a terrible thing to waste." Illiteracy bites. Don't be a fool, stay in school. There once was a lass from Nantucket....]

However, the actual number of correct predictions that I get might be anywhere in the range. I might get all eight right by chance (with a probability of about 0.4%), I might get all eight wrong (with the same chance), or I might get seven right (with a probability of about 3%) or six right (probability about 10%, if I did the math right). So if you graph the number right on the x-axis and the probability of getting exactly that many right on the y-axis, you see a standard bell curve.
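If you want to check those numbers, here's a quick Python sketch (my own illustration, standard library only) that computes the whole distribution under the null:

[code]
from math import comb

# Chance of exactly k correct guesses in 8 flips of a fair coin,
# under the null hypothesis (pure guessing, p = 0.5).
n = 8
for k in range(n + 1):
    prob = comb(n, k) / 2**n
    print(f"{k} right: {prob:.1%}")

# 4 right comes out to about 27.3%, 6 right to 10.9%,
# 7 right to 3.1%, and 8 right to 0.4% -- the bell shape described above.
[/code]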

Now, remember the two hypotheses? There are correspondingly two types of errors. A type I error is where you believe the experimental hypothesis is true when it isn't (incorrectly rejecting the null hypothesis). A type II error is where you incorrectly accept the null hypothesis. We can't say anything about the probability that the experimental hypothesis is correct -- but we can say something about the probability that you would have gotten at least as good results as you did if the null hypothesis were true.

If the null hypothesis were true, the chance of my getting eight of eight is less than 1%, which is pretty unlikely -- therefore, my chance of a type I error is less than 1%. If the null hypothesis were true, my chance of getting at least six out of eight is about 15%.

"Statistical significance" represents the arbitrary cutoff where we consider the chance of a type I error to be acceptably small. Usually that's set at a 5% level, but it can sometimes be set higher or lower depending upon the circumstances.
 

A basic and utterly misleading intro.

Significance levels show you how likely a result is due to chance.

No, they don't.

The most common level, used to mean something is good enough to be believed, is .95. This means that the finding has a 95% chance of being true.

No, it doesn't.

No statistical package will show you "95%" or ".95" to indicate this level. Instead it will show you ".05," meaning that the finding has a five percent (.05) chance of not being true,

No, it doesn't. It means that a result at least as extreme as the finding has a five percent chance of being observed, if the null hypothesis happens to be true.

How the hell can a "finding" be true or false in the first place? It's the number that came out of my observation equipment. If I flipped eight coins and got five right, are you suggesting that I'm lying and I didn't actually get five right?
 
So if I flip a coin eight times, the null hypothesis would say that the most likely number of correct predictions would be four. (which has about a 22% chance of happening).
Isn't that a 27% chance of happening?
Getting either 5 or 3 right would be about 22% each, right?
 
A basic and utterly misleading intro.
[snip]

oh you've replied to it - i just edited it out :)
 
Very nice, drk.

Would you care to say a few words about the logic behind the assumption of the null? I would (and will, if you don't), but dang, you got a way with words.

(...and such an explanation would be very nice to be able to point to when people say things like "well, why can't we just assume X until proven otherwise?")
 
Would you care to say a few words about the logic behind the assumption of the null? I would (and will, if you don't), but dang, you got a way with words.

(...and such an explanation would be very nice to be able to point to when people say things like "well, why can't we just assume X until proven otherwise?")

Well, thank you. I don't know if my "few words" will address what you consider the important issues, but you're welcome to explain the areas I miss here.

The basic problem is that of proving a negative. I can't prove that leprechauns don't exist until I've looked everywhere in the universe, including in orbit around distant stars, for them. Similarly, I can't even say that they don't exist in my garden -- since they might just be invisible, or too small for me to see, or something like that. Leprechauns might exist but not be detectable by the instrument I'm using.

So say we're testing a new drug. There are two possibilities -- either it does something, or it doesn't. I can't prove that it doesn't do anything, since it might do something that I'm not equipped to detect. But I do have a pretty good idea of what "not doing something" would look like.

So the idea is that I set up two hypotheses. My "experimental" hypothesis is that it does something, and my "null" hypothesis is that it doesn't. (It more or less has to be set up this way, because of this detection issue.) From a standpoint of drug performance -- and my future career as a research pharmacologist -- the best I can hope for is clear and convincing evidence that people treated with this drug differ from people left untreated. That will be strong evidence against the null hypothesis, and therefore for my experimental hypothesis. I can "reject the null hypothesis."

But let's say I'm unlucky and I don't notice any difference. Does that mean that my drug doesn't do anything? Of course not. It just means that I didn't notice any difference. So I don't have any evidence for my experimental hypothesis, and I have "failed to reject the null hypothesis." But I haven't proven it. Someone else might come along with more sensitive detectors or more people and find that my drug actually works.....

You can see this in the coin flips example. If all I claim is to be "psychic," that just means I can predict with higher than 50% accuracy. But what counts as "higher"? 100%? Obviously. Eight out of eight is definitely significant. 75%? Six out of eight isn't really "significant," but it's noticeable. 55%? That's probably not going to show up as significant on a test as small as eight flips. 50.0001%? That will get lost in the noise on any reasonable-sized experiment. So I might "fail" the test of "psychic," not because I'm not psychic, but because I'm not strongly enough psychic for anyone to care.
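Here's a rough Python sketch of that last point (again my own illustration): with only eight flips, even a genuinely 55%-accurate "psychic" almost never produces a significant result.

[code]
from math import comb

def p_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# With 8 flips, 7-or-more right is the first cutoff that falls below 5%:
print(p_at_least(7, 8, 0.50))   # false-alarm rate under the null: ~0.035
print(p_at_least(7, 8, 0.55))   # chance a 55% psychic passes: only ~0.063
[/code]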
 
Very nice, drk.

Would you care to say a few words about the logic behind the assumption of the null? I would (and will, if you don't), but dang, you got a way with words.

(...and such an explanation would be very nice to be able to point to when people say things like "well, why can't we just assume X until proven otherwise?")

Well I'll say some things, but they won't be what you're expecting. ;)

The problem is that people always want to solve an impossible problem. So statisticians have come up with a problem they can solve that sounds enough like the impossible one that people are happy. And then statisticians get annoyed when people mistake the one for the other.

Allow me to expand.

The problem that people want to answer is, "What is the probability that this is true?" Well anyone who is familiar with Bayes' Theorem can tell you that this doesn't have a well-defined answer - it depends on how likely you thought it was before you did your experiment. But people don't like that answer much.
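To see how strongly the answer depends on the prior, here's a toy Python sketch using the eight-coin-flips example from earlier in the thread (the priors are made up; only the 1/256 likelihood is real, and I'm assuming a true psychic always succeeds just to keep the arithmetic simple):

[code]
# Toy Bayes example: someone predicts 8 of 8 coin flips correctly.
# P(8/8 | guessing) = 1/256; assume P(8/8 | psychic) = 1 for simplicity.
def posterior(prior):
    odds = (prior / (1 - prior)) * 256   # prior odds times the likelihood ratio
    return odds / (1 + odds)

for prior in (0.5, 0.01, 1e-6):
    print(f"prior {prior}: posterior {posterior(prior):.6f}")

# prior 0.5 -> ~0.996, prior 0.01 -> ~0.721, prior 1e-6 -> ~0.000256.
# Same data, wildly different "probability that it's true".
[/code]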

So we say, "Well let's figure out the probability of getting a result this weird under the null hypothesis. If that is low enough, then we'll reject the null hypothesis." Now this question is something we can answer, it is well-defined, and people are happy to use the answer.

The problem is that people insist on mistaking the second question for the first. And that is where statisticians get annoyed. Because they aren't the same thing at all. Or worse yet, people want a concrete answer of the form, "I see that this is better than that, with 95% probability, how much better is it?" Which again is the impossible question. No matter how much you explain it to them, people want the simple answer, and want to state it in the simple way.

So to keep the statisticians happy, all we need to do is just get it straight and keep it straight? Well, that depends on the statistician. You see, there is a debate among statisticians about whether or not the standard procedure makes much sense at all. Bayesians like to bring up cases like the following one.

Suppose we know that a couple planned to have children until they had both a son and a daughter. They have 7 sons in a row, then a daughter. At a 95% confidence level, should we reject the hypothesis that they are equally likely to have sons or daughters? (*) Well the null hypothesis is equal probabilities, under which a result this strange or stranger requires a string of 7 boys or 7 girls in the first 7 kids, which will happen 1 time in 2^6, or 1.5625% of the time. So at the 95% confidence level (even at a 98% confidence level) we'd reject the null hypothesis.

Now let's change the problem. Suppose they just were going to have 8 kids, no matter what. What then? Well the odds of a result that odd are the odds of 7 boys and 1 girl (happens 8 ways), or 7 girls and 1 boy (happens 8 ways), or 8 boys (1 way), or 8 girls (1 way). So 18 of the 2^8 equally likely outcomes are at least that odd, which is a probability of 7.03125%. So at the 95% confidence level we should not reject the hypothesis.
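For anyone who wants to check that arithmetic, here's a short Python sketch of both calculations:

[code]
# p-value of "7 sons, then a daughter" under two different stopping rules,
# with the null hypothesis that each child is equally likely to be either sex.

# Rule 1: keep having kids until you have one of each.
# "This strange or stranger" = the first 7 kids are all the same sex.
p_rule1 = 2 / 2**7                  # 1/64 = 0.015625 -> reject at the 95% level

# Rule 2: have exactly 8 kids, no matter what.
# "This odd or odder" = a 7-1 split (8 + 8 ways) or an 8-0 split (1 + 1 ways).
p_rule2 = (8 + 8 + 1 + 1) / 2**8    # 18/256 = 0.0703125 -> fail to reject

print(p_rule1, p_rule2)
[/code]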

But according to Bayes' theorem, no matter what prior probabilities you assign, your posterior probabilities will not depend on the knowledge that they were going for 8 kids or for both a boy and a girl. In any valid system of inference, that piece of data is a red herring that should not make any difference. Therefore standard statistical methods lead to nonsensical results.

In the real world the Bayesians lose for two reasons. First, everyone is used to the standard solution. And second, Bayesian alternatives to the standard methods are far more complex to understand and explain.

Cheers,
Ben

* In fact this hypothesis is generally wrong. Population statistics demonstrate that there is a small but significant bias towards having sons rather than daughters.
 
Let me try too!

Suppose you claim you know the difference between pepsi and coke, just by taste.

I'm skeptical, so we set up an experiment to see if you can do it.

But what does "know the difference between" really mean?

For example, would you have to have 100% accuracy for me to believe that you know the difference?

Likely not.

What about 75% accuracy?

That probably still would support your claim (you can tell the difference with better than chance accuracy).

What about 55% accuracy-- less impressive, but still might be better than chance guessing.

So, what hypothesis should I test: That you know the difference with 100% or 75% or 55% accuracy?

We dunno, so we don't actually test any of these "experimental hypotheses" precisely because we likely don't know what their true value is (and if we did know the true value, well, then we wouldn't need to do the experiment).

Instead, we always test the null hypothesis, which is typically the exact opposite of what we're interested in testing.

The null here would be: You can't tell pepsi from coke.

This is the default; we're going to assume it's true unless we have evidence suggesting it's not.

Why bother testing the null? Because we know exactly what your performance should be if the null were true. If you DON'T know the difference between pepsi and coke, then you should perform at 50% accuracy in any taste test.

Even if your taste buds were dead, if I gave you an unlabeled glass and had you pick pepsi or coke, you would be right 50% of the time by chance alone.

What we're looking for is statistical evidence that lets us reject the null (thereby accepting the alternate, that you do know the difference between pepsi and coke).

So, we expect to get 50% accuracy, given the null. We do the taste test and record your actual accuracy rate (suppose it is 67%).

The question then becomes: Is 67% actual performance different enough from 50% expected performance for us to reject the null (i.e., to conclude that you can indeed tell the difference)?

To answer the question, we have to draw a line in the sand. This is called our alpha value, and the convention in science is to set it at .05.

Assuming the null is true, the probability of observing 67% accuracy when you were expected to get 50% has to be less than .05 for us to reject the null (your performance has to be so improbably better than 50% accuracy that the only reasonable conclusion is that the null is false here and you indeed can tell the difference between pepsi and coke).

In this simple scenario, whether we reject the null or not depends on how many trials we gave you.

Assume it was just 3 taste trials-- you got 2 right and one wrong.

The probability of getting exactly 2 right just by guessing (as would be the case if the null were indeed true) is 3/8, or about .38.

Since .38 (observed) is greater than .05 (our alpha value) we cannot reject the null. We don't have enough statistical evidence to rule out chance guessing, and so we haven't proved that you can tell the difference.

If you achieved 67% accuracy over 50 trials, however, the actual p value would be some number much lower than .05. So, 67% accuracy here would lead us to reject the null because it was based on N=50 (whereas the exact same accuracy with N=3 led us to not reject the null).

So, rejecting the null is reaching the conclusion that the difference between actual and expected values is too big to be due to chance (i.e., the null being true) and is therefore hopefully due to the experimental manipulation (to the extent the experiment possesses internal validity).

Note that in the N=3 trials case, we didn't really offer a fair test. We need more trials than 3 to fairly test you. Not rejecting the null here was likely due to our lack of statistical power.

If you achieved 50% accuracy over 50 trials, then our not rejecting the null here would be more convincing (the test was fair and reasonably powerful, yet you failed).
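Here's a quick Python sketch of those two cases (my own illustration). One caveat: the p-value proper counts the chance of doing at least that well by guessing, so for 2 of 3 it is actually .5, which is even further from .05; the N=50 conclusion is unchanged.

[code]
from math import comb

def p_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-tailed p-value."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(p_at_least(2, 3))    # 2 of 3 right:          0.5   -> cannot reject
print(p_at_least(34, 50))  # 34 of 50 right (68%): ~0.008 -> reject at .05
[/code]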
 
Let me try too!
[snip]
So, rejecting the null is reaching the conclusion that the difference between actual and expected values is too big to be due to chance (i.e., the null being true) and is therefore hopefully due to the experimental manipulation (to the extent the experiment possesses internal validity).

Lemme add a bit more, and put my own neck on the chopping block for the other statisticians to take a whack at.

With the null and alternative hypotheses, we have the option "***** happens" (which is the null hypothesis--nothing but random) and the option "something happened in addition to *****" (the alternative hypothesis). Sadly, when we reject the null (which is either a hit, or type I error, and we can never know which), all we are left with is "something in addition..." We don't know, to borrow Pest's example, whether our subject has 55%, 75%, 62.377639103% psychic abilities, or what. And Pest's parenthetical warning is key--"something in addition..." can mean cheating instead of psychic ability.

ETA--how strange that we get 5 asterisks for a 4-letter word...
 
With all of that said, it simply means you can't substitute statistics for logic. If you don't understand at least the foundations of what you are studying, no amount of statistical treatment will help you.

If I had one suggestion, though: anyone who's going to plan a large set of studies should talk to a statistician. They really can save you some time and effort. Experimental design is a very important step, but you'd be surprised how often it's not done.
 
Thank you all for your informative replies. I think I know where I am going now. What I am doing is writing about an article I found about "guided imagery" being used as a pain relief method. The article states:
On a 0-to-10 scale, children in the guided-imagery group had an average post-pain intervention score of 4.3, a point lower than children in the control group. While the difference was not statistically significant, Schmidt believes it is "clinically" significant.

"If it works for you, and it reduced your pain by one point or two points, isn't it worth it?" she asked.
I understand why that is bull(four asterisks), but I was having trouble putting it into words I thought others would understand. You guys have definitely helped me put my thoughts in order. Any further commentary would still be greatly appreciated.
 
I think this stuff finally clicked for me visually.

You've got a number from your experiment, but you don't know if it is from the distribution curve described by the null hypothesis or not. Generally, if it falls way out in the skinny little tail (one tail or both, depending on the experimental model), we say it probably was not from that distribution.

We suppose then that the measurement is from another distribution curve that is described by the effect we "wanted" (predicted by the hypothesis).

If it falls in the fat part (of the distribution predicted by the null hypothesis)--well, then we can't say that the measurement probably doesn't belong to the null hypothesis distribution curve.

Really--it's MUCH clearer with pictures!
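For anyone who wants the picture, here's a rough Python sketch that draws it (this assumes numpy, scipy, and matplotlib are installed, and the "your measurement" value of 2.5 is just an example):

[code]
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# The distribution curve described by the null hypothesis...
x = np.linspace(-4, 4, 400)
y = norm.pdf(x)
cut = norm.ppf(0.975)   # two-tailed 5% cutoff (about 1.96)

plt.plot(x, y)
# ...with the skinny tails (the rejection region) shaded:
plt.fill_between(x, y, where=(np.abs(x) >= cut), alpha=0.5,
                 label="rejection region (alpha = .05)")
plt.axvline(2.5, linestyle="--", label="your measurement")
plt.legend()
plt.show()
[/code]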
 
I don't want to start another thread for this question so I'll ask it here...

Can someone give me the mathematical reasons why this argument is flawed?


If I get into my car and drive to the store, the chances of me having a wreck are 50%, because either I have a wreck or I don't, so it's 50/50. The chances of me having a wreck on the way home are also 50%: either I have a wreck or I don't.

I know there are two main problems with this logic. Firstly, probability isn't calculated that way, and secondly, it isn't added up that way (on the way to and from the store). But I don't remember the exact mathematics behind how it's actually done and why this is fallacious.

Can anyone refresh my memory?
 
Thank you all for your informative replies. [snip]
Yes, of course, if the guided imagery will reduce my pain by one or two points, that could certainly be worth it, depending on how much pain one or two points is. (Calling the reduction in pain "clinically significant" means that it's quite a noticeable amount.)

But if the difference in the experiment wasn't statistically significant, that means that the results of the experiment don't give me very much reason to believe that guided imagery will in fact reduce my pain. Even if it were totally ineffective in reducing pain, some children would presumably end up with somewhat less pain than others, due to unknown factors unrelated to the treatment: in this experiment, it turned out to be the ones in the guided-imagery group; in the next, it might turn out to be the ones in the control group.

While it's true that statistical significance isn't very important without clinical significance, clinical significance is meaningless without statistical significance, because without statistical significance, there's not much reason to believe that the clinical significance will continue to be present in the future.
 
Thank you all for your informative replies. [snip]

It would help if you could provide more details or a link to the article. Were the subjects randomly assigned to the treatment vs. no treatment groups? Was the pain rating on the typical 0 to 10 scale? How many subjects?
But, in any case, no competent scientist would ever state that statistically insignificant results, even at the less rigorous level of .05, were clinically significant.
Quite often, the opposite is true. With a large number of subjects, statistical significance can be obtained with trivial effects.
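A quick Python sketch of that last point (the sample sizes are made up; it uses a normal approximation to the binomial): the same trivial 51%-vs-50% effect is nowhere near significance with 100 subjects, but slips under .05 with 10,000.

[code]
from math import sqrt, erf

def one_tailed_p(successes, n, p0=0.5):
    """Normal-approximation p-value for seeing this many successes or more."""
    z = (successes - n * p0) / sqrt(n * p0 * (1 - p0))
    return 0.5 * (1 - erf(z / sqrt(2)))

print(one_tailed_p(51, 100))       # 51% of 100:    p ~ 0.42
print(one_tailed_p(5100, 10000))   # 51% of 10,000: p ~ 0.023
[/code]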
 
I don't want to start another thread for this question so I'll ask it here...
[snip]
Can anyone refresh my memory?

Just because there are two possibilities doesn't make them equally likely.
"If I am walking out to my car, the chances of me being struck by lightning are 50% because either I get struck by lightning or I don't."
 
I don't want to start another thread for this question so I'll ask it here...
[snip]
Can anyone refresh my memory?

Just because there are only two possibilities doesn't imply that the two possibilities are equally likely i.e. 50/50.

Probabilities are only additive when the events are mutually exclusive. If they are not, then you have to subtract the probability of the intersection of the two events.

eta: I see Jeff beat me to the response.
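To make the intersection point concrete, here's a toy Python sketch (the 1% figures are made up):

[code]
# If each leg of the trip independently has a 1% chance of a wreck, the
# round-trip probability is NOT simply 1% + 1%: the two events are not
# mutually exclusive, so you subtract the chance of wrecking on both legs.
p_to = 0.01
p_from = 0.01
p_round_trip = p_to + p_from - p_to * p_from   # P(A or B) for independent A, B
print(p_round_trip)   # 0.0199
[/code]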
 
