
Statistical significance

Can someone point me at a website that has a good explanation, in layman's terms, of why statistical significance is important in an experiment? I'd like to read up a bit on what makes a scientific result statistically significant, and what expressions of error mean (the term escapes me right now, but I'm talking about what is expressed as a +/- range of accuracy in experimental results).
On public television they run some classroom-type shows. There was a great one (made in the 1970s, like the best of them) with a guy who did a series that explained statistics more clearly than anything I have ever seen. Unfortunately, I can't find the series or any website that even comes close.

You asked about the “+/- range of accuracy”. This is a margin of error. You often hear about this in polls. Our poll for the U.S. presidential election shows Aaron has 60% of the vote and Barry has 40%, with a margin of error of +- 3%. What does that mean?

This means the poll resulted in 60% of the people saying they would vote for Aaron and 40% saying they would vote for Barry. But how accurate is this poll? And how confident are we that the numbers are realistic?

We have to look at sample size. If the poll was based on just 10 people, then we don’t have much confidence in the results. If the poll was based on 10 million people, we would be very confident in the result. The larger the sample size, the larger the confidence that our poll numbers are accurate.

Of course even with a very large sample size, our poll isn’t going to be exact. Even if we poll 10 million people and the results are 60% for Aaron, this doesn’t mean that Aaron will get EXACTLY 60% of the vote.

We have to use a combination of margin of error and confidence level. We can say with 100% confidence that Aaron will get 60% of the vote with a margin of error of +- 60%. Or we could say that Aaron will get 60% of the vote with a margin of error of +- 0% at a 0% confidence level. Neither statement means much; you could be equally right or wrong.

You can’t be 100% confident that your projected numbers are correct. You have to allow for some margin of error. So the goal is to calculate a margin of error that has a reasonable level of confidence, like 95%. For a given sample size, the lower your margin of error, the lower the confidence level.

If your sample size for a U.S. election was only 10 people, you probably couldn’t get a 95% confidence level without having a margin of error somewhere around +- 100%. Which means it would be meaningless. If your sample size were 10 million, you would have a low margin of error and a 95% confidence level would be no problem.
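To put rough numbers on that, here is a small Python sketch using the usual normal-approximation formula for a poll proportion. The 1.96 factor corresponds to roughly 95% confidence and the 50/50 split is just the worst case; neither comes from the poll above.

[code]
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Normal-approximation margin of error for a proportion at ~95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (10, 100, 1000, 10000, 10000000):
    print(f"sample of {n:>10,}: +/- {100 * margin_of_error(n):.2f}%")
[/code]

With 10 people the margin is around +/- 31%; with 10 million it is a few hundredths of a percent, which is the sample-size effect described above.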

I wish I could explain this better. :(
 
You asked about the “+/- range of accuracy”. This is a margin of error. You often hear about this in polls. Our poll for the U.S. presidential election shows Aaron has 60% of the vote and Barry has 40%, with a margin of error of +- 3%. What does that mean?

Note that it is not just the size of the sample that matters, but whether or not those selected for the sample reflect the distribution of the population. You could have a huge sample of people who were all "odd in some way" and get a rather poor "forecast". This fact is often NOT included in the "margin of error" reported in newspapers (and such systematic errors are often hard to avoid a priori).
 
Can someone point me at a website that has a good explanation, in layman's terms, of why statistical significance is important in an experiment? I'd like to read up a bit on what makes a scientific result statistically significant, and what expressions of error mean (the term escapes me right now, but I'm talking about what is expressed as a +/- range of accuracy in experimental results).

The other posters have done a good job of describing statistical significance. I'd like to address your example above, and distinguish it from what they were talking about.

Statistical significance is different from error bars. Error bars are the +/- you were talking about. You get them in two ways: instrument limitations, or sampling error.

Instrument limitations make the most sense: your thermometer has lines every 1.0 degrees, so you can only measure temperature to +/- 0.5 degrees. Instrument error has special rules for when you add or manipulate data: it frequently grows larger when you combine data.
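As a small illustration of that last point, independent instrument errors are usually combined "in quadrature" when readings are added or subtracted, so the combined error is bigger than either one alone. A rough Python sketch (the two readings are invented; only the +/- 0.5 degree resolution comes from the thermometer example):

[code]
import math

# Two readings from a thermometer with lines every 1.0 degree,
# so each is only known to +/- 0.5 degrees.
t1, err1 = 20.5, 0.5
t2, err2 = 23.0, 0.5

# For independent errors, the usual rule when adding or subtracting
# quantities is to add the errors in quadrature.
diff = t2 - t1
diff_err = math.sqrt(err1 ** 2 + err2 ** 2)
print(f"difference = {diff:.1f} +/- {diff_err:.2f} degrees")  # +/- 0.71, larger than 0.5
[/code]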

Sampling error has to do with assumptions about the sampling and the population being sampled. It usually assumes a random sample from an approximately normal distribution, though not always. Typically, the error is on the order of 1/root(N); for a 50/50 proportion the standard error works out to about 0.5/root(N). So, for example (again, this is just ballpark), if you sample 100 people, your error could be about +/- 5%.

This is why you hear reports of surveys that have two error disclaimers: "Fifty percent of people surveyed prefer Brand X, plus or minus five percent, nineteen times out of twenty."

The first statement is the sampling error; the second is a claim that this survey achieves statistical significance ("nineteen times out of twenty" = "95% confidence interval" = "p<=.05").
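Reading that survey statement backwards gives a feel for the numbers. Assuming the usual normal approximation and a worst-case 50/50 split (both assumptions of mine, not anything stated in the survey), a +/- 5% margin at 95% confidence implies a sample of roughly 385 people:

[code]
import math

z = 1.96    # "nineteen times out of twenty" ~ 95% confidence
moe = 0.05  # "plus or minus five percent"
p = 0.5     # worst-case split; it maximizes the required sample

n = (z / moe) ** 2 * p * (1 - p)
print(f"implied sample size: about {math.ceil(n)} people")  # about 385
[/code]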
 
A statistical test consists of the following steps:
1. Choose a parameter that you want to test.
2. Choose a null hypothesis regarding the parameter.
3. Choose some random variable to serve as the test statistic (presumably, one that has something to do with the parameter).
4. Choose a rejection region for the associated statistic.
5. Calculate the probability, under the null hypothesis, of the statistic falling in the rejection region.
6. Perform an experiment that creates an instantiation of the statistic.
7. Evaluate whether the resulting statistic falls in the rejection region; if so, declare the null to be rejected.

The statistical significance is the probability calculated in step 5. Notice that it is a statement about the experiment, not the data (and should be calculated before you even know what the data is). The statistical significance has a quantitative value: a number between zero and one. The data, however, is purely binary: either it is in the rejection region, or it's not. Within the context of a statistical test, there is no such thing as data that is "very significant" or "low significance" or "almost statistically significant".
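Here is a minimal sketch of those seven steps in Python, using a made-up coin-tossing experiment (the coin, the 100 tosses, and the cutoffs are mine, purely for illustration):

[code]
from math import comb
import random

# Steps 1-2: the parameter is p = P(heads); the null hypothesis is p = 0.5.
# Step 3: the statistic is the number of heads in n = 100 tosses.
n = 100

# Step 4: the rejection region, fixed BEFORE any data: 39 or fewer heads, or 61 or more.
rejection = set(range(0, 40)) | set(range(61, n + 1))

# Step 5: the significance = P(statistic lands in the rejection region | null).
alpha = sum(comb(n, k) * 0.5 ** n for k in rejection)
print(f"significance: {alpha:.4f}")  # about 0.035, computed before any data exists

# Step 6: run the experiment (simulated here with a fair coin).
heads = sum(random.random() < 0.5 for _ in range(n))

# Step 7: the data gives a purely binary answer.
print("reject the null" if heads in rejection else "fail to reject the null")
[/code]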

Now, on to confidence intervals. Sometimes, the value of a parameter is estimated with a statistic. Since the statistic involves randomness, it isn't the exact value. So statisticians come up with an interval where it might be. They can then calculate the probability, given the true value of the parameter, of getting an interval that includes it. Notice that this is often misinterpreted as the probability, given an interval, of the parameter being in that interval, when in fact it's the opposite.
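A quick simulation makes the correct reading concrete: the 95% refers to how often the interval-building procedure captures the true parameter over many repetitions, not to any one interval. (The true proportion of 0.3 and the sample size of 200 are arbitrary choices for illustration.)

[code]
import math
import random

true_p, n, trials = 0.3, 200, 20000
covered = 0
for _ in range(trials):
    x = sum(random.random() < true_p for _ in range(n))
    p_hat = x / n
    half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)  # textbook 95% interval
    covered += (p_hat - half <= true_p <= p_hat + half)
print(f"fraction of intervals containing the true value: {covered / trials:.3f}")  # ~0.95
[/code]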

Suppose we know that a couple planned to have children until they had both a son and a daughter. They have 7 sons in a row, then a daughter. At a 95% confidence level, should we reject the hypothesis that they are equally likely to have sons or daughters?
"95% confidence level"? What does that mean? You're not asking a valid statistical question.

This is data mining, since the data comes before the calculation of alpha. It seems to me that your example is simply an example of misdirection. What the couple was planning to do has nothing to do with it; what matters is what statistic we use. Basically, what you're doing is deciding what statistic to use after the data has been collected, finding that the conclusion depends on which statistic is used, then declaring the results "nonsensical". To cover up your malfeasance, you bring in the red herring of what the couple was planning on doing, to make the choice of statistic seem nonarbitrary.

Here's your example made a bit more transparent. Suppose there's a class of 30 students, and I've labeled them from 1 to 30. If I tell you that students 1,4,5,8,10,11, and 12 are all boys, you would, according to your above logic, conclude that more than half the class is boys. If I tell you that students 2,3,6,7,9,13,14, and 17 are girls, then you would conclude that more than half of the class is girls. And you are saying that there is something nonsensical about this, because two different sets of data resulted in two different conclusions.

But according to Bayes' theorem, no matter what prior probabilities you assign, your posterior probabilities will not depend on the knowledge that they were going for 8 kids or both a boy and a girl.
They depend no less in the Bayesian system than they do in the standard system.

Therefore standard statistical methods lead to nonsensical results.
That is a complete non sequitur. You didn't present an example of standard statistical methods; you presented an example of ignoring basic statistical rules.
 
Really? Every experiment? How does that work if I am looking for a number, say if I am measuring the speed of light (before it was set equal to 1, of course)?

Well, let's call the currently accepted speed of light (in vacuum) c (eg if we find out it is wrong, we don't change c).

So we make an experiment, and come up with a measurement of 1.1 c for the speed of light. However, depending on how the experiment is set up, there might be alternative explanations for the measurement. Let's say our clock is not really accurate enough. In this case, we could set up a null hypothesis - the measurement is caused by an inaccuracy of the clock. At least in theory, we might even know the distribution of the clock error, and we could assign a probability for the measurement occurring, given that the speed of light is actually c.
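A toy version of that in Python, with invented numbers (a measurement of 1.10 c and a clock error modelled as normal with a standard deviation of 0.04 c; neither figure comes from any real experiment):

[code]
import math

measured, sigma = 1.10, 0.04  # hypothetical measurement and clock-error spread, in units of c

# Null hypothesis: the true speed is c, and the excess is just clock error.
z = (measured - 1.0) / sigma
p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided normal tail probability
print(f"z = {z:.1f}, P(measuring 1.10 c or more | true speed is c) = {p_value:.4f}")  # ~0.006
[/code]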
 
Therefore standard statistical methods lead to nonsensical results.

That is a complete non sequitur.
It's not a non-sequitur if one wants the technical notion of a statistical significance test rejecting a null hypothesis to correspond to the intuitive notion of us having reason to believe that the hypothesis is false, and in particular, if one wants the level of significance of the rejecting test to correspond to the amount of evidence it provides against the truth of the hypothesis rejected.

Fisher certainly wanted this, even if Neyman and Pearson didn't. See chapter 4, "Some Misapprehensions about Tests of Significance," of his book Statistical Methods and Scientific Inference, where he rails against them about it.
 
It's a non sequitur because it doesn't follow from the preceding. Ben Tilly didn't present an example of standard statistical methods.

As for what else you say,
"statistical significance test rejecting a null hypothesis to correspond to the intuitive notion of us having reason to believe that the hypothesis is false"
I guess that as long as P(reject Ho|Ho)<P(reject Ho|Ha), there is such a correspondence. Of course, P(reject Ho|Ha) is only definable if we have a particular Ha in mind.

"one wants the level of significance of the rejecting test to correspond to the amount of evidence it provides against the truth of the hypothesis rejected"
Well, there are clearly more factors than just alpha. I don't think that there is anything "nonsensical" about this failing to hold. Would it be "nonsensical" for one car to get worse gas mileage than another, even though it is lighter? Perhaps indicative of inefficiencies, but hardly "nonsensical".
 
Yes, by definition, if the null hypothesis is true, and the significance is 5%, then there is a 5% chance of wrongly rejecting it.

There was a study once that purported to show that prayer helps people heal, and they had split it up into a bunch of subexperiments, testing different diseases to see whether prayer helps them. Well, if you test 20 diseases, you should expect one of them to be "helped" just by chance. 5% is often cited as the "standard" number, but it's rather weak. The idea of, for instance, having it as the significance level for the JREF challenge is rather ridiculous; if a hundred people applied, we'd expect 5 of them to walk away with a million dollars. Someone determined enough and rich enough can "prove" pretty much anything at 5%, by simply running a bunch of experiments with a bunch of different statistics. That's why when evaluating an experiment, you should look at whether the procedures, statistics, and rejection region are well documented prior to the beginning of the experiment, and whether the experimenter releases the results of all of the experiments, or just some.
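The arithmetic behind that is worth seeing once (a quick sketch; the 20 sub-experiments and 100 applicants are just the figures used above, and independence is assumed):

[code]
alpha, tests = 0.05, 20

# If none of the 20 effects is real, each test still "succeeds" 5% of the time.
print(f"expected false positives among {tests} tests: {alpha * tests:.0f}")      # 1
print(f"chance of at least one false positive: {1 - (1 - alpha) ** tests:.2f}")  # 0.64

# Same logic for 100 challenge applicants, each tested at the 5% level:
print(f"expected undeserved 'winners' among 100 applicants: {0.05 * 100:.0f}")   # 5
[/code]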
 
It's a non sequitur because it doesn't follow from the preceding. Ben Tilly didn't present an example of standard statistical methods.

Actually I did, but you may not have understood that.

Standard statistical methods say that you do the following:
  1. Set up an experiment, get a result.
  2. Produce a null hypothesis.
  3. Figure out the odds of getting the result you got from the experiment, or anything less likely. That is your confidence level. (This is the step you likely did not recognize because the experiments that I set up did not follow a distribution that you're used to using hypothesis testing on. However take the description to a statistician and they'll confirm that I followed the appropriate method.)
  4. Make some decision based on the confidence level of your experiment.
So let's set up two different experiments. In experiment A, a couple decides to have children until they have both a son and a daughter. They have 7 sons in a row, and then one daughter. In experiment B, a couple decides to have 8 children. The first 7 are sons and the last is a daughter.

The null hypothesis in both cases is that sons and daughters are equally likely. However the different design of the experiments means that you calculate different probabilities. (They are different because in experiment A getting a daughter on the second try ends the experiment, while in experiment B getting a daughter on the second try and having the other 7 be sons is as unlikely as the observed outcome. So there are more combinations that are as unlikely as what was observed in experiment B than experiment A.) Therefore you make different choices under hypothesis testing.

This result is problematic because Bayes' Theorem shows that no reasonable method of drawing inferences would give a different conclusion from experiment A than experiment B. Hypothesis testing does, therefore it is an unreasonable method of drawing inferences.

Cheers,
Ben
 
Standard statistical methods say that you do the following:
  1. Set up an experiment, get a result.
  2. Produce a null hypothesis.
  3. Figure out the odds of getting the result you got from the experiment, or anything less likely. That is your confidence level. (This is the step you likely did not recognize because the experiments that I set up did not follow a distribution that you're used to using hypothesis testing on. However take the description to a statistician and they'll confirm that I followed the appropriate method.)
  4. Make some decision based on the confidence level of your experiment.
That's not the standard statistical method, as I said in my post. I consider myself a statistician, and I say that number three is wrong. And, at the risk of sounding conceited, I would consider anyone who disagrees to not be a statistician. "the odds of getting the result you got from the experiment, or anything less likely" is not a meaningful phrase. In the example that you gave, every result is equally likely: each particular sequence of eight births has probability (1/2)^8. Seven boys, then a girl, is just as likely as three boys, then two girls, then two more boys, then a girl.

Proper statistical method requires that you decide on a rejection region before any data is collected.

So let's set up two different experiments. In experiment A, a couple decides to have children until they have both a son and a daughter. They have 7 sons in a row, and then one daughter. In experiment B, a couple decides to have 8 children. The first 7 are sons and the last is a daughter.
As I said, what the couple decides is a red herring. All that matters is the statistic used.

This result is problematic because Bayes' Theorem shows that no reasonable method of drawing inferences would give a different conclusion from experiment A than experiment B.
The only way that statement can be defended is by a "no true Scotsman" type argument, as we already have a method that gives different conclusions. What is unreasonable about it? Mathematical theorems make no statements about anything but mathematical concepts, therefore Bayes' Theorem cannot say anything about "reasonable" methods except insofar as you are redefining "reasonable" to be a mathematical concept.

Hypothesis testing does, therefore it is an unreasonable method of drawing inferences.
Ah. The reason that it is unreasonable is it gives different results, and you've decided that everything that gives different results is unreasonable.
 
An implicit assumption has been made in all the answers in this thread that warrants being stated explicitly.

All of this testing and statistical significance relates to the behaviour of the average (mean or median depending on circumstances) value of a parameter for some group of objects. The point of testing is to determine whether there is truly a difference between the average values of two groups.

This is fine, provided the question you are asking can be properly answered by reference to the behaviour of group average values. But that is a very narrow view of the behaviour of data. The fact that it is useful in so many circumstances is because that narrow view often suffices for the situation.

Here is an example where the mere asking of a question that is answerable by reference to the behaviour of group averages means that you have analysed the situation wrongly.

I have been looking at the behaviour of our business bank balance to see whether there are identifiable patterns across the month that we could exploit to manage our account better. I pooled data for 36 months and sure enough, there is an obvious cycle through the month. Let's use notional figures for illustration: we start the month with a mean bank balance of £50,000 and there is a mid-month peak at £80,000. The s.e.m. around these values is quite tight, about £5,000, so we can confirm, at high probability, that this monthly cycle is real and not just a fluke. But I want to know when I can safely write big cheques to pay big bills. The problem is that the standard deviation is about £30,000, i.e. about 95% of the time the actual account balance on any given day is within +/- £60,000 of that day's mean value. That's great if the day I write a £20,000 cheque the account is at £80,000 + £60,000, but if it is actually at £50,000 - £60,000 I am likely to receive an embarrassing phone call. So, in this instance, I have shown a statistically real behaviour, but if I relied on it for my intended action I would find I had answered the wrong question.
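The two very different +/- figures come straight from the usual relationship between the standard deviation and the standard error of the mean. A sketch with the notional numbers above (treating the 36 monthly values as independent samples, which is itself an assumption):

[code]
import math

sd, months = 30000, 36   # day-to-day spread of the balance, and months of pooled data

sem = sd / math.sqrt(months)
print(f"standard error of the monthly mean: about {sem:,.0f}")   # ~5,000: the cycle is real

print(f"rough 95% range for any single day: +/- {2 * sd:,.0f}")  # ~60,000: any one day is not
[/code]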

A related problem in medicine is similar to the above. There is an important distinction between statistical significance and biological/clinical significance. If I pool data from 1,000,000 patients, I might find that a certain drug really, genuinely, honestly does lower blood pressure, by an average of 0.1 mmHg. This is not very likely to be clinically useful. This has been alluded to on the previous page when the idea of statistical power was introduced. It is important to decide what size of effect would matter clinically or biologically, then design the test to look for an effect of that size. Veterinary medicine is plagued by low-powered studies because of the difficulty of recruiting enough subjects to look at real medical conditions for useful lengths of time.
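That is where power calculations come in: you pick the smallest effect that would matter clinically and work out how many subjects you need to detect it reliably. A rough sketch using the standard normal-approximation formula for comparing two means (the 10 mmHg patient-to-patient spread and the 80% power target are my assumptions, purely for illustration):

[code]
import math

z_alpha = 1.96   # 5% two-sided significance
z_beta = 0.84    # 80% power
sigma = 10.0     # assumed between-patient SD of blood pressure, mmHg

# Approximate patients needed per group: n = 2 * ((z_alpha + z_beta) * sigma / delta)**2
for delta in (0.1, 1.0, 5.0):   # candidate clinically meaningful effects, mmHg
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    print(f"to detect a {delta} mmHg difference: about {math.ceil(n):,} patients per group")
[/code]

A 5 mmHg effect needs a few dozen patients per group; a 0.1 mmHg effect needs hundreds of thousands, which is why such a tiny difference only shows up in enormous pooled datasets.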
 
Proper statistical method requires that you decide on a rejection region before any data is collected.
And what is the reason for this requirement?

There's no way to look at the results of an experiment directly, and see what they tell us about a hypothesis?

What an experiment tells us about a hypothesis depends not only on the actual results of the experiment, but also on some arbitrary decision we made beforehand about rejection regions?

The only way that statement can be defended is by a "no true Scotsman" type argument, as we already have a method that gives different conclusions. What is unreasonable about it?
Before we can decide whether a method is reasonable or not, we need to decide what goal we want it to accomplish. Then we can say that it's reasonable if it accomplishes that goal, and unreasonable if it doesn't.

So what's the goal of a statistical significance test?

I think it's to help us decide whether a hypothesis is true or not. The decision to "reject the null hypothesis" should depend on, and only on, how much evidence there is that it is false.

So if two different experiments give us the same amount of evidence against the truth of a hypothesis, it makes no sense to reject the hypothesis in one case but not in the other.

Do you think that the results of Ben Tilly's experiments A and B give different amounts of evidence against the hypothesis of equal boy/girl probabilities? How could they? They're the same results!

The problem with significance tests based on p-values is that they take into account all sorts of experimental results that didn't happen (namely, all those in the predetermined rejection region). Where's the sense in that?

As Sir Harold Jeffreys wrote in Theory of Probability (third edition, pp. 384--385, emphasis in original):
[some discussion of the χ2 statistic and p-values based on it, then...]

If P was less than some standard value, say 0.05 or 0.01, the law was considered rejected. Now it is with regard to this use of P that I differ from all the present statistical schools, and detailed attention to what it means is needed. The fundamental idea, and one that I should naturally accept, is that a law should not be accepted on data that themselves show large departures from its predictions. But this requires a quantitative criterion of what is to be considered a large departure. The probability of getting the whole of an actual set of observations, given the law, is ridiculously small. Thus for frequencies 2.74 (6) shows that the probability of getting the observed numbers, in any order, decreases with the number of observations like [latex]$(2\pi N)^{-\frac{1}{2}(p-1)}$[/latex] for χ2 = 0 and like [latex]$(2\pi N e)^{-\frac{1}{2}(p-1)}$[/latex] for χ2 = p - 1, the latter being near the expected value of χ2. The probability of getting them in their actual order requires division by N!. If mere improbability of the observations, given the hypothesis, was the criterion, any hypothesis whatever would be rejected. Everybody rejects the conclusion, but that can only mean that improbability of the observations, given the hypothesis, is not the criterion, and some other must be provided. The principle of inverse probability does this at once, because it contains an adjustable factor common to all hypotheses, and the small factors in the likelihood simply combine with this and cancel when hypotheses are compared. But without it some other criterion is still necessary, or any alternative hypothesis would be immediately rejected also. Now the P integral does provide one. The constant small factor is rejected, for no apparent reason when inverse probability is not used, and the probability of the observations is replaced by that of χ2 alone, one particular function of them. Then the probability of getting the same or a larger value of χ2 by accident, given the hypothesis, is computed by integration to give P. If χ2 is equal to its expectation supposing the hypothesis true, P is about 0.5. If χ2 exceeds its expectation substantially, we can say that the value would have been unlikely to occur had the law been true, and shall naturally suspect that the law is false. So much is clear enough. If P is small, that means that there have been unexpectedly large departures from prediction. But why should these be stated in terms of P? The latter gives the probability of departures, measured in a particular way, equal to or greater than the observed set, and the contribution from the actual value is nearly always negligible. What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure. On the face of it the fact that such results have not occurred might more reasonably be taken as evidence for the law, not against it. The same applies to all the current significance tests based on P integrals. [footnote: On the other hand, Yates (J.R. Stat. Soc., Suppl. 1, 1934, 217--35) recommends, in testing whether a small frequency nr is consistent with expectation, that χ2 should be calculated as if this frequency was nr + 1/2 instead of nr, and thereby makes the actual value contribute largely to P. This is also recommended by Fisher (Statistical Methods, p. 98). It only remains for them to agree that nothing but the actual value is relevant.]
 
As for what else you say,
"statistical significance test rejecting a null hypothesis to correspond to the intuitive notion of us having reason to believe that the hypothesis is false"
I guess that as long as P(reject Ho|Ho)<P(reject Ho|Ha), there is such a correspondence. Of course, P(reject Ho|Ha) is only definable if we have a particular Ha in mind.
An experiment has three possible outcomes: A, B, and C. On hypothesis H0, their probabilities are 0.02, 0.02, 0.96. On hypothesis Ha, their probabilities are 0.01, 0.04, 0.95.

I choose a rejection region of {A, B}, whose probability on H0 is 0.04, which is less than 0.05, its probability on Ha.

I run the experiment and the outcome is A, which is in the rejection region. Does this result therefore constitute evidence against H0 and in favor of Ha? Or the opposite?

The opposite, obviously.

Why should I care about the probability of possible outcomes that happen to be in the rejection region, if they didn't actually occur? And if I don't care about them, why bother picking a rejection region to begin with?
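One way to make the "obviously" concrete is to compare the likelihood of the outcome that actually occurred under the two hypotheses (a tiny sketch using just the numbers above):

[code]
# Outcome probabilities from the example above.
p_h0 = {"A": 0.02, "B": 0.02, "C": 0.96}
p_ha = {"A": 0.01, "B": 0.04, "C": 0.95}

outcome = "A"   # observed, and inside the rejection region {A, B}
print(f"P({outcome}|H0) / P({outcome}|Ha) = {p_h0[outcome] / p_ha[outcome]}")  # 2.0
# The outcome that actually happened is twice as probable under H0 as under Ha,
# even though the rejection-region rule says to reject H0.
[/code]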
 
An experiment has three possible outcomes: A, B, and C. On hypothesis H0, their probabilities are 0.02, 0.02, 0.96. On hypothesis Ha, their probabilities are 0.01, 0.04, 0.95.

I choose a rejection region of {A, B}, whose probability on H0 is 0.04, which is less than 0.05, its probability on Ha.

I run the experiment and the outcome is A, which is in the rejection region. Does this result therefore constitute evidence against H0 and in favor of Ha? Or the opposite?

The opposite, obviously.

I'm sorry, I'm perhaps not following this properly. But it seems that your experiment as proposed offers next door to no information at all -- and to the extent that it offers information, offers information in favor of H0. So, basically, you ran the wrong experiment. Why should your poor choice of experiments be an argument for or against a statistical theory?
 
That's not the standard statistical method, as I said in my post. I consider myself a statistician, and I say that number three is wrong. And, at the risk of sounding conceited, I would consider anyone who disagrees to not be a statistician. "the odds of getting the result you got from the experiment, or anything less likely" is not a meaningful phrase. In the example that you gave, every result is equally likely: each particular sequence of eight births has probability (1/2)^8. Seven boys, then a girl, is just as likely as three boys, then two girls, then two more boys, then a girl.

Let's get the ad hominems out of the way first, shall we?

This specific problem is one I first heard about from Dr. Laurie Snell. http://www.dartmouth.edu/~chance/jlsnell.html I have discussed it since with a number of people, including several statisticians who were tenured professors at different universities. I have no idea what your bona fides are to back up your self-identification as a statistician, but if you claim that anyone who disagrees is not a statistician, then you've made a claim that is very much on the outrageous side.

Now let's turn to actual matters of substance.

Proper statistical method requires that you decide on a rejection region before any data is collected.

The rejection region will depend on the set of results one might possibly observe, which in turn depends on the experimental design.

In experiment A the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons then a girl, 7 girls then a son, 8 sons then a girl, 8 girls then a son, 9 sons then a girl, and so on. The cumulative probability of being in this set is readily calculated to be 1/64.

In experiment B the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons and a girl (in any order), 7 girls and a son (in any order), 8 sons, or 8 girls. The cumulative probability of being in this set is readily calculated to be 9/128.

At a 95% confidence level the observed outcome is in the rejection set for experiment A but not for experiment B.
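Both figures are easy to check numerically; a quick sketch, just enumerating the descriptions above:

[code]
from math import comb, isclose

# Experiment A: children until both sexes appear; observed 7 sons then one daughter.
# As-or-more-extreme outcomes: k children of one sex, then one of the other, for k >= 7.
p_a = sum(2 * 0.5 ** (k + 1) for k in range(7, 60))   # geometric tail, = 2 * (1/2)**7
print(p_a, isclose(p_a, 1 / 64))                      # ~0.0156  True

# Experiment B: exactly 8 children; as-or-more-extreme = any 7-1 split or 8-0 split.
p_b = (2 * comb(8, 7) + 2 * comb(8, 8)) / 2 ** 8
print(p_b, p_b == 9 / 128)                            # 0.0703  True

# Same observed family, different stopping rule, different decision at the 5% level.
print(p_a < 0.05, p_b < 0.05)                         # True  False
[/code]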

As I said, what the couple decides is a red herring. All that matters is the statistic used.

What the couple decides affects what the set of possible outcomes is, and therefore affects the odds that one might have gotten an outcome that would be taken as evidence at least as strong against the null hypothesis as what was observed.

Which therefore affects the results of hypothesis testing.

The only way that statement can be defended is by a "no true Scotsman" type argument, as we already have a method that gives different conclusions. What is unreasonable about it? Mathematical theorems make no statements about anything but mathematical concepts, therefore Bayes' Theorem cannot say anything about "reasonable" methods except insofar as you are redefining "reasonable" to be a mathematical concept.

Ah. The reason that it is unreasonable is it gives different results, and you've decided that everything that gives different results is unreasonable.

This is true. Now let me defend the view that anything that gives different results is unreasonable.

According to Bayes' Theorem, under no prior set of beliefs should the difference in design of the experiments make any difference in your conclusions. If one takes the view that reasonable people start with a set of prior beliefs which they then continuously modify in the light of experience, then no reasonable person can ever draw the distinction between these two cases that hypothesis testing does.

Of course if you do not believe that reasonable people should have beliefs and modify those beliefs in the face of experience in a logical fashion, then you may not think that the results of hypothesis testing are unreasonable.
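For anyone who wants to see the cancellation explicitly, here is a small numerical sketch. The flat prior is just an example of mine; any prior gives the same result, because the stopping rule only contributes a constant factor to the likelihood.

[code]
from math import comb

# Posterior for p = P(son) on a grid, starting from a flat prior.
grid = [i / 1000 for i in range(1, 1000)]

def posterior(likelihood):
    w = [likelihood(p) for p in grid]      # flat prior, so weights = likelihood
    total = sum(w)
    return [x / total for x in w]

# Experiment A (stop once both sexes appear), recording the exact sequence: p**7 * (1 - p)
# Experiment B (exactly 8 children), recording only the counts: comb(8, 7) * p**7 * (1 - p)
post_a = posterior(lambda p: p ** 7 * (1 - p))
post_b = posterior(lambda p: comb(8, 7) * p ** 7 * (1 - p))

# The constant comb(8, 7) disappears in the normalisation: the posteriors are identical.
print(max(abs(x - y) for x, y in zip(post_a, post_b)) < 1e-12)   # True
[/code]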

Cheers,
Ben
 
The rejection region will depend on the set of results one might possibly observe, which in turn depends on the experimental design.

In experiment A the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons then a girl, 7 girls then a son, 8 sons then a girl, 8 girls then a son, 9 sons then a girl, and so on. The cumulative probability of being in this set is readily calculated to be 1/64.

In experiment B the evidence that is as strong or stronger against the null hypothesis than the observed outcome will occur if the couple has 7 sons and a girl (in any order), 7 girls and a son (in any order), 8 sons, or 8 girls. The cumulative probability of being in this set is readily calculated to be 9/128.

In particular, if I understand the example right, the outcome probability space is different for different experiments.


For example, the probability of the couple having a single boy and a single girl is 0.5 for experiment A, the case where the parents just want one of each (boy then girl, or girl then boy). The corresponding probability is zero for experiment B, where they just want eight kids, irrespective of sex. Similarly, the probability of three boys and five girls is zero for experiment A, non-zero (I'm too lazy to figure it exactly) for experiment B.

Given that the underlying probability mass is different, the fact that the probability mass in the rejection region defined by the same words differs can hardly be considered to be a fault of the statistics.
 
[...]

So what's the goal of a statistical significance test?

I think it's to help us decide whether a hypothesis is true or not. The decision to "reject the null hypothesis" should depend on, and only on, how much evidence there is that it is false.

So if two different experiments give us the same amount of evidence against the truth of a hypothesis, it makes no sense to reject the hypothesis in one case but not in the other.

Do you think that the results of Ben Tilly's experiments A and B give different amounts of evidence against the hypothesis of equal boy/girl probabilities? How could they? They're the same results!

The problem with significance tests based on p-values is that they take into account all sorts of experimental results that didn't happen (namely, all those in the predetermined rejection region). Where's the sense in that?

This is the key point. Experiments A and B differ only in what didn't happen but could have. Hypothesis testing takes those possibilities into account, so you come to different conclusions. However, it seems absurd that your conclusion about what is true is based on what didn't happen. Bayes' Theorem allows us to quantify the reason why our intuition says that this is absurd. Therefore hypothesis testing leads to absurd distinctions being made.

Allow me to add more variations.

Experiment C is like experiment A except that the couple agreed to have children until they had a girl. Now the p-value drops to 1/128.

Experiment D is like experiment A except that the couple decided to flip a coin after each child to decide whether to stop the experiment. Now the p-value drops to 1/8192! (They would have been at the old p-value of 1/64 after 3 sons and a daughter!) This is a drastic change in the strength of our conclusion, yet the extra coin flips gave us absolutely no information about the likelihood of sons versus daughters!

And so it goes. Things that should be irrelevant matter greatly in hypothesis testing. That they do is integral to the procedure.
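Experiment C's figure can be checked the same way (a quick sketch; I have not tried to reproduce D's 1/8192, since that depends on exactly how the stopping coin is modelled):

[code]
from math import isclose

# Experiment C: children until the first daughter; observed 7 sons, then her.
# The tail is P(7 or more sons before the first daughter), a geometric tail.
p_c = sum(0.5 ** (k + 1) for k in range(7, 60))
print(p_c, isclose(p_c, 1 / 128))   # ~0.0078  True
[/code]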

Cheers,
Ben
 
All of this testing and statistical significance relates to the behaviour of the average (mean or median depending on circumstances) value of a parameter for some group of objects.
You're alluding to an important point, but you don't have it quite correct. The average of the population is a parameter. It makes no sense to speak of the "average value" of a parameter; a parameter has only one value. Statistical tests compare one parameter to another. Usually, that parameter is the average of the population, but sometimes other parameters, such as the standard deviation, are considered. And, of course, the parameter is a simplified measure.

In the example you gave, you talked about a difference of 0.1 mmHg, and said that it might not be clinically significant. Well, that's not quite the point. More to the point, the average blood pressure and the average utility may not be the same. For instance, suppose that a bp of 200 means a 50% chance of dying in the next year, while a bp of 180 means a 30% chance of dying in the next year, and a bp of 150 means a 20% chance of dying in the next year. And suppose, magically, everyone's bp is exactly equal to one of those three values. Now suppose that for drug A, the distribution is as follows: 50% at 200, 10% at 180, 40% at 150. Average = 178. For drug B, it's 10% at 200, 80% at 180, 10% at 150. Average = 179. Since drug A reduces the average bp, it's slightly better, right? But if you calculate the death rates, drug A has a death rate of 36%, while drug B has a death rate of 31%.

Death rates are more important than bp, but it's a lot easier to test bp. And even if we did try to measure death rates, there are more factors to consider, such as whether a 10% chance of death and a 90% chance of perfect health is better than a 1% chance of death and a 99% chance of very poor health.
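The arithmetic in that example, for anyone who wants to check it (just the numbers above, nothing new):

[code]
# Fractions of patients at each blood pressure, and the assumed one-year death risk.
death_risk = {200: 0.50, 180: 0.30, 150: 0.20}
drug_a = {200: 0.50, 180: 0.10, 150: 0.40}
drug_b = {200: 0.10, 180: 0.80, 150: 0.10}

for name, dist in (("A", drug_a), ("B", drug_b)):
    mean_bp = sum(bp * frac for bp, frac in dist.items())
    death_rate = sum(death_risk[bp] * frac for bp, frac in dist.items())
    print(f"drug {name}: mean bp = {mean_bp:.0f}, death rate = {death_rate:.0%}")
# drug A: mean bp = 178, death rate = 36%
# drug B: mean bp = 179, death rate = 31%
[/code]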

Here is an example where the mere asking of a question that is answerable by reference to the behaviour of group averages means that you have analysed the situation wrongly.
More precisely, it's an example where the mean value isn't as important as some other measure, such as the percentage of balances above £20,000.
 
However, it seems absurd that your conclusion about what is true is based on what didn't happen.

Huh? That makes no sense to me whatsoever.

"I lit the fuse, but the firecracker didn't explode. Therefore it must have been a dud."

"That's absurd!"

"What do you mean, that's absurd?"

"Well, how do you know that a leprechaun didn't come out and pee on the fuse while your back was turned?"

".... um,.... what?"
 
