> In particular, if I understand the example right, the outcome probability space is different for different experiments.
It sounds like you understand the example right.
> For example, the probability of the couple having a single boy and a single girl is 0.25 for experiment A, the case where the parents just want one of each. The corresponding probability is zero for experiment B, where they just want eight kids, irrespective of sex. Similarly, the probability of three boys and five girls is zero for experiment A, non-zero (I'm too lazy to figure it exactly) for experiment B.
The probability you were too lazy to figure out is 7/32.
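(To spell it out: there are C(8,3) = 56 ways to order three boys and five girls among eight births, each with probability (1/2)^8 = 1/256, and 56/256 = 7/32.)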
> Given that the underlying probability mass is different, the fact that the probability mass in the rejection region defined by the same words differs can hardly be considered to be a fault of the statistics.
The statistics calculate exactly what they claim to calculate. I did not mean to imply any fault in the calculation of the statistics.
The problem lies in how people interpret and act on those statistics. We decide whether or not to reject a null hypothesis, and we will therefore make different decisions and take different actions after these two experiments, even though the observed family is the same. Is that reasonable?
Well, let's take the most reasonable of all possible procedures for drawing an inference, which is to use Bayes' Theorem. Suppose, for instance, that the experimenter starts with the following prior expectations:
- A 50% chance that the couple's odds of boys vs girls are 50-50.
- A 20% chance that the odds are 55-45.
- A 20% chance that the odds are 45-55.
- A 5% chance that the odds are 100-0.
- A 5% chance that the odds are 0-100.
So let's crank it through Bayes' formula. According to the experimenter's expectations, the probability of the observed outcome (the sum, over the five options, of prior times likelihood) was 0.003734353109375, so our revised expectations are 52.3% for option 1, 36.7% for option 2, 11% for option 3, and 0% for options 4 and 5.
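In case anyone wants to check that arithmetic, here is a quick sketch of the update in Python. It assumes the observed family from the example was seven boys and one girl (that is the outcome that reproduces the numbers above), and it uses the probability of the particular sequence of births; any constant factor the design contributes would cancel when we normalize anyway.

```python
# Sketch of the Bayesian update above.  Assumption (mine): the observed
# family was seven boys and one girl, which reproduces the quoted numbers.
# Likelihoods are for the particular sequence of births; a design-dependent
# constant factor would cancel on normalization.

priors = {
    0.50: 0.50,  # boys vs girls 50-50
    0.55: 0.20,  # 55-45
    0.45: 0.20,  # 45-55
    1.00: 0.05,  # 100-0
    0.00: 0.05,  # 0-100
}

boys, girls = 7, 1
likelihood = {p: p**boys * (1 - p)**girls for p in priors}

# Probability of the outcome under the experimenter's prior expectations.
evidence = sum(priors[p] * likelihood[p] for p in priors)
print(evidence)  # ~0.003734353109375

# Revised expectations.
posterior = {p: priors[p] * likelihood[p] / evidence for p in priors}
for p, post in sorted(posterior.items(), reverse=True):
    print(f"P(boy) = {p:.2f}: {post:.1%}")
# P(boy) = 1.00: 0.0%
# P(boy) = 0.55: 36.7%
# P(boy) = 0.50: 52.3%
# P(boy) = 0.45: 11.0%
# P(boy) = 0.00: 0.0%
```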
This change in expectations is the same whether the experiment that was run was version A or version B.
In short, by the most reasonable method we can find for adjusting our expectations in the light of further evidence, the differences in experimental design are absolutely and completely irrelevant. In fact it isn't hard to prove that, no matter what set of prior expectations the experimenter had, the design difference will be irrelevant.
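For what it's worth, here is a sketch of that proof. Whatever family the couple ends up with, say b boys and g girls, either design assigns that observation a probability of the form c * p^b * (1-p)^g under a hypothesis that the chance of a boy is p, where the constant c (which counts the sequences and stopping points the design allows) does not depend on p. Bayes' formula divides prior(p) * likelihood(p) by the sum of that same quantity over all the hypotheses, so c appears in both the numerator and the denominator and cancels, whatever the priors were.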
So when we take the step of using the results of hypothesis testing to draw an inference and make a decision, we are making our decisions in a way that is not consistent with any set of possible prior expectations. And we are doing so because (as 69dodge pointed out) we are explicitly taking into account in our decision the likelihood of things that didn't happen. (Note that Bayes' formula completely ignores the might-have-beens that didn't happen; they can't matter to it.)
Cheers,
Ben
PS Note that I am not arguing for throwing out hypothesis testing. As I said before, it gives results that are simple to interpret in situations where the alternatives either produce nothing or give very complex answers. While acting on what hypothesis testing tells you can lead to some absurd choices, most of the time it leads to fairly reasonable decisions.