
(Another) random / coin tossing thread

At least I can understand the simple null hypothesis, and p-values.

If you have actual knowledge of the domain, pick the prior that best reflects your knowledge.

If you have no such knowledge, pick the "least informative" prior as it will be the least likely to bias your findings in an unwanted direction.

If you can't determine such a least-informative prior, you are out of your depth and should not be doing statistics in the first place.
 
Sure. In fact, a frequentist could do more or less exactly the same thing, since coin flips are something that can be sampled: you could simply get a coin that comes down heads with probability p for each of the values of p you were interested in, take large enough samples, and run the appropriate numbers.

Well, I'm only going to allow you one coin, with one (unknown to you) value of p. You get to flip it N times and record H-T (heads-tails).

There are basically two ways to interpret "confidence intervals" in this sense -- and frequentists would have no problem with either. One is the confidence that we have that a coin with known probability p will have flipped between x1 and x2 heads in 1000 trials. The other is the confidence we have that a coin that flipped x heads has a "true" probability between p1 and p2. Both of those are legitimate frequentist calculations.

So in this case the second is appropriate. Can you explain to me how you calculate the confidence that the coin's probability is between p1 and p2?

It seems to me that all you're allowed to do by your rules is choose a cutoff probability, which given the datum (heads-tails) will then define a single contour for you (i.e. it will determine two specific values p1 and p2, which will depend on the value of H-T and the cutoff, such that any p1<p<p2 would not be rejected, and any p outside that would be). But I don't understand how it allows you to input p1 and p2 and get back a confidence.

And if your rules do allow you to do that, I'm really having trouble seeing how this is any different from Bayes with a flat prior on p.
 
Jimbob,

here's my noddy-but-concrete example of Bayesian statistics at work:

First Bayes’ rule:

P(A|B) = P(B|A).P(A) / P(B)

the denominator of which can be rewritten using the total probability rule:

P(B) = P(B|A).P(A) + P(B|A').P(A')

where A' is the negation of A. The way I like to understand the above expression for P(B) is to draw out a tree diagram with the probabilities on the rhs marked on each branch.

Now, as for the concrete example: imagine you have a test for a rare disease (A = having the disease, B = a positive test result). The test has 99% sensitivity (=P(B|A)) and 98% specificity (=1-P(B|A')). The result comes back positive. Using Bayes’ rule:

P(Have the disease given a positive test result) = P(A|B) = 0.99 * P(A) / [0.99 * P(A) + 0.02 * P(A')]

The prior information is P(A), which in this example would be the prevalence of the disease. This could be guessed, or sample information could be used to estimate the prevalence. Let's say it's 3%:

P(A|B) = 0.99 * 0.03 / [0.99 * 0.03 + 0.02 * 0.97] = 60.5%
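Here's a minimal sketch of that calculation in Python (the function name and structure are mine, just to make the arithmetic explicit):

```python
def posterior_positive(sensitivity, specificity, prevalence):
    """P(A|B): probability of having the disease given a positive test."""
    p_pos_given_disease = sensitivity            # P(B|A)
    p_pos_given_healthy = 1.0 - specificity      # P(B|A')
    p_pos = (p_pos_given_disease * prevalence
             + p_pos_given_healthy * (1.0 - prevalence))   # P(B), total probability rule
    return p_pos_given_disease * prevalence / p_pos        # Bayes' rule

print(posterior_positive(0.99, 0.98, 0.03))  # ~0.605, i.e. 60.5%
```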

There's a different perspective on the same thing here:

http://www.bmj.com/cgi/content/full/327/7417/716

What I see as problematic with the frequentist approach is that it thresholds the answer too soon, making it difficult to incorporate additional information that could improve the estimate.
 
Actually, on second thought I'm starting to get the point... following your procedure is going to allow you to draw one of those contours. But what will you say about the points inside it? They are all acceptable null hypotheses, I suppose, and you can't speak about their relative probabilities? So that means you can only ever draw one contour? Or are you allowed to change your cut-off probability and plot several contours associated with different values for it? If so, that seems dangerously close to Bayes...
One of the key differences between hypothesis testing (there are other tricks in the frequentist bag to work round this) and Bayesian methods is that there is no attempt to navigate the hypothesis space.
It's a very Popperian view of science: we take a single hypothesis, throw data at it, and if it survives, keep it.

There's no idea implicit in hypothesis testing of exploring the space for optimal theories. Eventually, if we have a wrong theory, so much unfavourable data will build up that (when we do meta-studies) we'll be forced to discard any false hypotheses.

So, suppose we have three well-defined models, A, B, and C, and some data D to test them with.

A Bayesian will calculate P(A|D),P(B|D),P(C|D) and pick the highest one.

A frequentist will use a different trick entirely, and calculate the maximum likelihoods P(D|A),P(D|B),P(D|C) and pick the highest one.

If they couldn't do this, and had to use hypothesis testing, they would test P(D|A) and keep hypothesis A if it was above an arbitrary threshold. If it didn't work, they would move on to hypothesis B and test P(D|B). If that didn't work, they would move on to hypothesis C.

It's obvious that hypothesis testing doesn't work well in such a situation. It's a greedy algorithm that gets stuck at local optima, and isn't robust with respect to the ordering of the hypotheses.
This is why frequentists must use different tools to discriminate between multiple unordered hypotheses, while the Bayesian can keep chugging along, directly computing probabilities.
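A minimal sketch of that contrast, with made-up likelihoods and priors (the numbers are purely illustrative):

```python
likelihood = {"A": 0.02, "B": 0.05, "C": 0.01}   # P(D|model)
prior      = {"A": 0.70, "B": 0.20, "C": 0.10}   # P(model)

# Frequentist-style pick: the model with the highest likelihood P(D|model).
ml_pick = max(likelihood, key=likelihood.get)

# Bayesian pick: the model with the highest posterior P(model|D), proportional
# to P(D|model) * P(model).
unnorm = {m: likelihood[m] * prior[m] for m in likelihood}
total = sum(unnorm.values())
posterior = {m: v / total for m, v in unnorm.items()}
bayes_pick = max(posterior, key=posterior.get)

print(ml_pick, bayes_pick)  # here B wins on likelihood but A wins on posterior
```

With a flat prior the two picks always coincide; they can only differ when the prior is informative.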
 
The short answer is: in statistics, you usually want to answer a question that cannot be answered. The Bayesian approach is to make an answer up; the frequentist approach is to answer a different question.

For example, suppose I toss a coin 10 times and get HHHHHHHHHH. What is the probability that it is biased? If it's a coin I just found in my pocket, I'd be pretty confident it is unbiased, while if it just came out of a Christmas cracker, I'd assume it was double headed.

Assuming for simplicity that the tosses are definitely independent, so the only question is what is the probability p that each one is heads, a Bayesian would of course assign different priors in the two situations, but this prior would still be made up - what is the chance that a coin from my pocket is double headed? Worse, what is the chance that p is between .700 and .701?

In simple cases, hypothesis testing basically amounts to simply asking `what is the probability of an unbiased coin giving a result this extreme?' That's not what you want to know, but the advantage is that you can answer it. Of course, the problem now (apart from getting an answer to the wrong question) is that you need to choose sensible hypotheses to test: why did I decide in advance to test p=0.5, rather than p=0.7? But in some cases, such as this one, there's a single obvious choice, so at least you know what to do, unlike the Bayesian. (Should we assume p is uniform on [0,1]?)
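For the ten-heads example, that test is a one-liner; here's a sketch (two-sided, counting all-heads and all-tails as "this extreme"):

```python
from math import comb

n, k = 10, 10
p_value = sum(comb(n, i) for i in (0, n)) * 0.5**n   # P(result this extreme | fair coin)
print(p_value)  # 2/1024 ≈ 0.002
```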

Of course, there are situations where the Bayesian approach is sensible, and in which any statistician would use it. Many medical examples are like this, because the question you want to know the answer to is close to one that can be answered: `if a person A is chosen at random from a population in which the incidence of a certain disease is x%, and A is tested with a test with known rates of false positives/negatives, given a positive result, what is the chance that A has the disease?' Of course, in the real question A is not a random person but me; still, especially if you modify things by taking additional data into account, the random-person approximation is reasonable. For the coin question, there's no corresponding reasonable approximation by an answerable question.

Confidence intervals are confusing at first: again they don't answer the question you want, but something different. Formally, we have a family of probability distributions: here, for each p in [0,1], there's a distribution Pr_p, corresponding to H having probability p. (Assuming independence still.) If X denotes a sequence of coin tosses, then a 99% confidence interval for p is a pair of functions a(X), b(X) such that, for any p, if X is in fact chosen randomly according to the distribution Pr_p, then the probability that a(X)<p<b(X) is at least 99%. The right way to think of this is that p is fixed and unknown, and the probability that the random interval produced surrounds p is at least 99%. This is not what you want, but it's what you get. Of course, the point is that a(X) and b(X) can be calculated from X without knowing p.

Part of the point is that there is often a single best way to calculate them: in this case (for a symmetric interval) it boils down to choosing b(X) as small as possible so that max_p Pr_p(b(X)<p) <= 0.005, and similarly for a(X). So you get a definite answer, just to the wrong question. The answer is still useful, because one can say in advance that 99 of 100 times you do this, the true value will lie in the interval. But think of it as: 99 of 100 times the interval will surround the true value.
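Here's a rough sketch of that construction for the coin (a Clopper-Pearson-style interval found by a crude grid search; the helper names are mine):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def confidence_interval(heads, n, alpha=0.01):
    """Keep every p that neither one-sided test rejects at level alpha/2."""
    grid = [i / 1000 for i in range(1001)]
    keep = [p for p in grid
            if binom_cdf(heads, n, p) > alpha / 2              # p not implausibly large
            and 1 - binom_cdf(heads - 1, n, p) > alpha / 2]    # p not implausibly small
    return min(keep), max(keep)

print(confidence_interval(8, 10))  # roughly (0.35, 0.99) at the 99% level
```

Whatever the true p is, an interval built this way surrounds it at least 99% of the time, which is exactly the guarantee described above.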
 
Thanks for the response.

A Bayesian will calculate P(A|D),P(B|D),P(C|D) and pick the highest one.

A frequentist will use a different trick entirely, and calculate the maximum likelihoods P(D|A),P(D|B),P(D|C) and pick the highest one.

I don't really understand the difference. According to Bayes, P(A|D) = P(D|A)P(A)/P(D). But P(D) is just a normalization which doesn't depend on A,B, or C and so is irrelevant here (since we care about relative probabilities). If we take a flat prior then P(A)=P(B)=P(C), so in the end P(A|D)=P(D|A) up to an overall ABC-independent factor.

So the two quantities are the same (if we take a flat prior) - which is what I've been confused about all along! I still don't see how frequentists are doing anything other than take Bayes with flat priors (plus mouth a bunch of slogans).
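A quick numerical check of that point, with made-up likelihoods (illustrative numbers only):

```python
likelihood = {"A": 0.02, "B": 0.05, "C": 0.01}   # P(D|model)
flat_prior = 1.0 / 3                             # P(A) = P(B) = P(C)

evidence = sum(l * flat_prior for l in likelihood.values())               # P(D)
posterior = {m: l * flat_prior / evidence for m, l in likelihood.items()}

for m in likelihood:
    print(m, posterior[m] / likelihood[m])   # the same constant for every model
```

So with a flat prior, ranking by posterior and ranking by likelihood give the same answer.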
 
The short answer is: in statistics, you usually want to answer a question that cannot be answered. The Bayesian approach is to make an answer up; the frequentist approach is to answer a different question.

I like that. But if that's all there is to it, why do statisticians argue about it so much?

Worse, what is the chance that p is between .700 and .701?

About the same as the chance that it's between .701 and .702 - and that's generally all you need, isn't it? It doesn't matter much what the prior is so long as it's reasonably smooth - you just collect enough data to render it irrelevant.

Sometimes that can be problematic, like when the amount of data you can ever collect is finite... but then you're in trouble no matter what.

The other side to this is that at least to me, it's obvious that everybody is a closet Bayesian. People always assign prior probabilities to things - if our ancestors had been frequentists they would have been snacked on by a saber-toothed tiger long ago. Personally I suspect frequentists are just ashamed to admit they use priors because they think it's somehow non-scientific. But my opinion doesn't mean much, since I still don't understand the difference fully....
 
I like that. But if that's all there is to it, why do statisticians argue about it so much?

I don't know, I'm a mathematician, not a statistician! Some of the arguing is from people who don't understand that you can't answer the question you want to. The rest is about different approaches to dealing with the situation: something has to give, so it makes sense to argue about what. What to do isn't a mathematical question, which is exactly why you can argue about it.

About the same as the chance that it's between .701 and .702 - and that's generally all you need, isn't it? It doesn't matter much what the prior is so long as it's reasonably smooth - you just collect enough data to render it irrelevant.

Sometimes that can be problematic, like when the amount of data you can ever collect is finite... but then you're in trouble no matter what.

But in this case a smooth prior is not sensible: there should be a strong peak near p=0.5. In general (in nice cases where priors make sense!), of course with enough data it doesn't really matter: the answers converge. But then you can also calculate a 99.999999999% confidence interval (which will be very narrow) and it will contain (better, surround!) the true value.
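One way to make that peaked prior concrete (my own construction, not from the post): mix a point mass at p = 0.5 with a uniform slab on [0,1], and ask how the belief that the coin is exactly fair changes with the data.

```python
from math import comb

def prob_fair(k, n, prior_fair=0.9):
    """Posterior probability the coin is exactly fair after k heads in n tosses."""
    like_fair = comb(n, k) * 0.5**n   # P(data | p = 0.5)
    like_slab = 1.0 / (n + 1)         # integral of comb(n,k) p^k (1-p)^(n-k) dp over [0,1]
    num = prior_fair * like_fair
    return num / (num + (1 - prior_fair) * like_slab)

print(prob_fair(10, 10))   # ten heads out of ten: belief in fairness collapses to ~0.09
print(prob_fair(53, 100))  # 53 heads out of 100: still ~0.98, the coin looks fair
```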
 
The other side to this is that at least to me, it's obvious that everybody is a closet Bayesian. People always assign prior probabilities to things - if our ancestors had been frequentists they would have been snacked on by a saber-toothed tiger long ago. Personally I suspect frequentists are just ashamed to admit they use priors because they think it's somehow non-scientific. But my opinion doesn't mean much, since I still don't understand the difference fully....

No - only people who haven't fully come to terms with the unanswerability of the relevant question. (I admit this might be almost everyone!) Of course, in many cases if you knew enough you could come up with a reasonable prior (for example, if you knew the actual fraction of double headed coins in circulation), but sometimes you simply can't. Suppose, for example, that two people come up with different Theories of Everything that have a totally different theoretical structure, both fit all observations pretty well, but differ slightly in the predictions for something statistical (clustering of galaxies, say). How do you assign prior probabilities to them? What additional information could possibly let you do this? If they are similar in complexity, then maybe 50/50 is reasonable, but if one is quite a bit more complex, should it be 90/10 or 99/1 or 99.9/0.1?

The frequentist says `I can't answer that question'. If future observations rule out one theory at (say, since it's very important) the 99.99999% confidence level, we'll abandon that theory. There really isn't anything more sensible to do. Of course, you may be unlucky and rule out the true theory, but of course nothing prevents you from starting again when you have more data. Again, in the long run it doesn't matter. With enough data, you'll finally settle on the right theory (if one is right).

Actually, here's an example where the Bayesian gets it wrong: suppose one theory, A, has a free parameter while the other, B, doesn't. We should choose some smooth prior on the free parameter. Now maybe the only theories that come at all close to the data are A with parameter 1.3461893234(1) and B. Then, if A and B had comparable prior weight in total, the Bayesian answer would be that B is almost certain, even if observations rule out B at 99.99999% confidence (due to the extremely small prior weight of A with parameter in the required window). In this situation it's obviously more sensible just to accept that A is correct, and that there's no (as far as we know) reason the parameter is what it is.

ETA: when I say sensible, that's a vague opinion, not a mathematical statement: trying to justify this conclusion properly is again trying to answer the unanswerable question.

ETA: maybe you aren't as Bayesian as you think you are. This post, forums.randi.org/showthread.php?postid=3495175#post3495175 (sorry - can't link even within the forum, it seems), is saying more or less what I was trying to say above!
 
Suppose, for example, that two people come up with different Theories of Everything that have a totally different theoretical structure, both fit all observations pretty well, but differ slightly in the predictions for something statistical (clustering of galaxies, say). How do you assign prior probabilities to them? What additional information could possibly let you do this? If they are similar in complexity, then maybe 50/50 is reasonable, but if one is quite a bit more complex, should it be 90/10 or 99/1 or 99.9/0.1?

I don't think it really matters in practice. It's not like anyone actually decides which theory to believe by calculating something using Bayes' theorem... in such a case I don't think anyone would be able to choose between them - not until some observation made a sharp distinction, or some internal inconsistency was discovered.

So maybe I'm being a bit frequentist... but since I still deny there is any difference, it doesn't bother me very much!

Actually, here's an example where the Bayesian gets it wrong: suppose one theory, A, has a free parameter while the other, B, doesn't. We should choose some smooth prior on the free parameter. Now maybe the only theories that come at all close to the data are A with parameter 1.3461893234(1) and B. Then, if A and B had comparable prior weight in total, the Bayesian answer would be that B is almost certain, even if observations rule out B at 99.99999% confidence (due to the extremely small prior weight of A with parameter in the required window). In this situation it's obviously more sensible just to accept that A is correct, and that there's no (as far as we know) reason the parameter is what it is.

There must be a good way to assign priors to deal with that problem gracefully. I know there are ad hoc formulas people sometimes apply which penalize you for every parameter you add (typically exponentially) but favor you for good fits to data (power law in chi squared or something). I've seen one talk in which a cosmologist actually used that to argue for one theory over another.

It wasn't very convincing.
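For the record, the usual formulas of that kind (which may or may not be what the cosmologist used) are the information criteria AIC and BIC: a chi-squared goodness-of-fit term plus a penalty per free parameter, with the lower score preferred.

```python
from math import log

def aic(chi2, n_params):
    return chi2 + 2 * n_params

def bic(chi2, n_params, n_data):
    return chi2 + n_params * log(n_data)

# Hypothetical numbers: model A (one free parameter) fits a bit better than B (none).
print(aic(12.0, 1), aic(15.5, 0))            # 14.0 vs 15.5: AIC prefers A
print(bic(12.0, 1, 50), bic(15.5, 0, 50))    # ~15.9 vs 15.5: BIC prefers B
```

The fact that the two criteria can disagree on the same data is a fair summary of why such arguments aren't always convincing.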

ETA: maybe you aren't as Bayesian as you think you are. This post, forums.randi.org/showthread.php?postid=3495175#post3495175 (sorry - can't link even within the forum, it seems), is saying more or less what I was trying to say above!

"A foolish consistency is the hobgoblin of little minds." :)
 
Actually, here's an example where the Bayesian gets it wrong: suppose one theory, A, has a free parameter while the other, B, doesn't. We should choose some smooth prior on the free parameter. Now maybe the only theories that come at all close to the data are A with parameter 1.3461893234(1) and B. Then, if A and B had comparable prior weight in total, the Bayesian answer would be that B is almost certain, even if observations rule out B at 99.99999% confidence (due to the extremely small prior weight of A with parameter in the required window). In this situation it's obviously more sensible just to accept that A is correct, and that there's no (as far as we know) reason the parameter is what it is.

What would a frequentist do that allows him to get the correct answer, that we can't translate to a Bayesian method? Remember, if the Bayesian thinks that it's at all likely A only fits the data with a fairly narrow range for its free parameter, that's part of the prior too, and suddenly he gets it right.
 
What would a frequentist do that allows him to get the correct answer, that we can't translate to a Bayesian method? Remember, if the Bayesian thinks that it's at all likely A only fits the data with a fairly narrow range for its free parameter, that's part of the prior too, and suddenly he gets it right.

Sorry - let me tweak the example slightly. Suppose that A with any parameter and B are all consistent with current observations, but all give different predictions for the galaxy clustering distribution. We are about to perform some very precise observations of this, so precise that if A is true we'll afterwards know the parameter within .00000000000001, but at the moment we have no idea what the parameter might be. We can't work with a prior that just gives a probability to A, as A without a parameter doesn't determine the distribution, so at this point we are supposed to produce a prior on all candidate theories, namely B, A(1), A(1.0000000000001) etc. I don't believe there is a sensible way to do this.

A frequentist would answer a different question: you want to know afterwards how likely it is that B is correct, but there is simply no way to answer this. Instead, you ask `are the observations (statistically) consistent with B?' And you can also ask `is there an x such that the observations are consistent with A(x)?' You need to be careful actually designing the tests, of course, but if you get back a result in which something measured is 6 standard deviations off from its value under B (in a way you decided in advance to test for), it's reasonable to conclude that B is wrong.
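For scale, here's what a 6-standard-deviation discrepancy means under the hypothesis being tested, assuming Gaussian errors (a side calculation, not from the post):

```python
from math import erfc, sqrt

def two_sided_p(n_sigma):
    return erfc(n_sigma / sqrt(2))   # P(|Z| >= n_sigma) for a standard normal Z

print(two_sided_p(6))  # about 2e-9
```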
 
<snip>

A frequentist would answer a different question: you want to know afterwards how likely it is that B is correct, but there is simply no way to answer this. Instead, you ask `are the observations (statistically) consistent with B?' And you can also ask `is there an x such that the observations are consistent with A(x)?' You need to be careful actually designing the tests, of course, but if you get back a result in which something measured is 6 standard deviations off from its value under B (in a way you decided in advance to test for), it's reasonable to conclude that B is wrong.

If you get something 6 standard deviations from the value predicted by B, it is highly likely B was a strawman in the first place!

ETA: Amazing how science progressed up to the middle of the 20th century without performing statistical tests at every opportunity.
 
If you get something 6 standard deviations from the value predicted by B, it is highly likely B was a strawman in the first place!

Not really - it's quite plausible that two theories make slightly different predictions for something that new technology lets you measure very precisely. For example, maybe there was some point in history (I imagine much longer ago than I think) where the observational evidence of Mercury's orbit was consistent with Newtonian Mechanics, because there wasn't much evidence. Newtonian Mechanics is not a straw man theory, yet its predictions are now known to be absolutely inconsistent with observations. It's still `very nearly correct' on suitable scales, but nevertheless has been absolutely ruled out as the `ultimate truth'. A physicist could doubtless think of better examples, where the statistical nature of the observations is clearer.

ETA: your ETA suggests you won't like my example, which is fair enough, as it's not very good! Of course there are lots of things in science that aren't statistical in nature, and no-one is saying you should apply any kind of statistics in all situations. However, there are also lots of things that are statistical, especially near the limits of what we can measure. The most obvious reason is measurement errors not much smaller than the effect you are looking for, but there are others. For example, a theory might predict some physical quantity to be the value of a certain integral that cannot be computed exactly. Perhaps with Monte Carlo methods you can compute the integral to a certain accuracy, i.e., in this case with a known standard deviation.
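A toy illustration of that last point (my own example): a Monte Carlo estimate of an integral comes with a standard error you can quote, just like a measurement.

```python
import math
import random

N = 100_000
samples = [math.exp(-random.random()**2) for _ in range(N)]   # integrand at uniform points on [0,1]
mean = sum(samples) / N
var = sum((s - mean)**2 for s in samples) / (N - 1)
std_err = math.sqrt(var / N)

print(mean, "+/-", std_err)   # true value of the integral is about 0.7468
```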
 
Meridian said:
Not really - it's quite plausible that two theories make slightly different predictions for something that new technology lets you measure very precisely.

True. I'm an engineer though, so in my field we tend to be pragmatic about these things. There's always some error in a measurement, so use negative feedback to compensate for it.

Welcome to the forum, BTW.:)
 
Sorry - let me tweak the example slightly. Suppose that A with any parameter and B are all consistent with current observations, but all give different predictions for the galaxy clustering distribution. We are about to perform some very precise observations of this, so precise that if A is true we'll afterwards know the parameter within .00000000000001, but at the moment we have no idea what the parameter might be. We can't work with a prior that just gives a probability to A, as A without a parameter doesn't determine the distribution, so at this point we are supposed to produce a prior on all candidate theories, namely B, A(1), A(1.0000000000001) etc. I don't believe there is a sensible way to do this.

I don't think this is a problem at all - it seems like you're missing something rather important here:

Your prior probability that A's parameter is in some range scales as a power of the size of the range (the power is simply 1 if we have a single parameter). That is, if some confidence interval has width w in that parameter, the prior probability that A's parameter is in that interval will be proportional to w (assuming only that the prior on the parameter was smooth and w is small compared to the variations in it).

On the other hand, the evidence against B grows very fast as the measurement sharpens (more precisely, the likelihood of B scales as exp(-d^2/2σ^2), where d is the deviation of the measurement from the prediction of B and σ is the measurement error, which is also what sets w). So for any fixed central value not equal to the one that B predicts, increasing the precision of the measurement will very rapidly rule out B and favor A, precisely as you would want.

On the other hand if the data is within a sigma or so of B, B will be pretty strongly favored (by the ratio of the overall width of the parameter in A to sigma), which is just as it should be - B is a much better theory at that point.
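Here's a numerical sketch of that trade-off (my own construction, with made-up numbers): theory B with no free parameter versus theory A with one free parameter given a flat prior of total width W, and a Gaussian measurement with error sigma. Averaging A's likelihood over its prior is what produces the "Occam factor" of order sigma/W.

```python
import math

def gauss_like(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def odds_B_over_A(x, mu_B, sigma, W, n_grid=100_000):
    """Posterior odds of B vs A, giving the two theories equal (50/50) prior weight."""
    like_B = gauss_like(x, mu_B, sigma)
    # Marginal likelihood of A: average its likelihood over the parameter prior on [0, W],
    # assuming (hypothetically) that A's prediction equals its parameter value.
    like_A = sum(gauss_like(x, i * W / n_grid, sigma) for i in range(n_grid)) / n_grid
    return like_B / like_A

# Data 6 sigma away from B's prediction: B is crushed despite the Occam factor.
print(odds_B_over_A(x=1.346, mu_B=1.352, sigma=0.001, W=10.0))   # ~6e-5
# Data 1 sigma away from B's prediction: B is favoured by roughly W/sigma.
print(odds_B_over_A(x=1.352, mu_B=1.353, sigma=0.001, W=10.0))   # ~2400
```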
 
I like that. But if that's all there is to it, why do statisticians argue about it so much?

In my experience, statisticians don't, any more than carpenters argue over the relative merits of hammers and screwdrivers. Stats students do.
 
I wandered in here because I saw the title, and was imagining some silly discussion of probability prediction and I'm Gonna Win Vegas and all that sort of rot. Instead, I find an argument between two sides that I didn't even know existed until now.

Very interesting!
 
I don't think this is a problem at all - it seems like you're missing something rather important here:
Not missing: just not reiterating! Of course (at least in this kind of example), whatever you do you'll reach the right conclusion eventually given enough data. But it's still true that the Bayesian answer for any given amount of data cannot be justified.

To make it clearer, suppose that there is no more data coming. Perhaps the statistical uncertainty is in the predictions of the theory, not the measurements, and we've already measured the clustering of all galaxies in the observable universe well enough. If the data is consistent with A only in some window of width w = 10^-10 and is 5σ off for B, what do you conclude?

I feel pretty strongly that you should consider B ruled out. Change the numbers down to 1 or 2σ, and it's clear A should be favoured. Where is the changeover? That's a question that simply can't be answered. But `can B be ruled out with 99.9999% confidence' (in the technical sense), while not what you want to know, can be answered.
 
