
(Another) random / coin tossing thread

The short answer is: in statistics, you usually want to answer a question that cannot be answered. The Bayesian approach is to make an answer up; the frequentist approach is to answer a different question.
snip
In simple cases, hypothesis testing amounts to asking `what is the probability of an unbiased coin giving a result this extreme?' That's not what you want to know, but the advantage is that you can answer it. Of course, the problem now (apart from getting an answer to the wrong question) is that you need to choose sensible hypotheses to test: why did I decide in advance to test p=0.5, rather than p=0.7? But in some cases, such as this one, there's a single obvious choice, so at least you know what to do, unlike the Bayesian. (Should we assume p is uniform on [0,1]?)
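
(As a minimal sketch of that computation, with invented counts, say 60 heads in 100 tosses, tested against the null p=0.5:)

```python
# Sketch: probability, under the null p = 0.5, of a result at least as
# extreme as the one observed. The counts (60 heads in 100 tosses) are
# invented for illustration.
from scipy.stats import binomtest

result = binomtest(k=60, n=100, p=0.5, alternative="two-sided")
print(result.pvalue)  # ~0.057: not quite significant at the usual 5% level
```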

Of course, there are situations where the Bayesian approach is sensible, and in which any statistician would use it. Many medical examples are like this, because the question you want to know the answer to is close to one that can be answered: `if a person A is chosen at random from a population in which the incidence of a certain disease is x%, and A is tested with a test with known rates of false positives/negatives, given a positive result, what is the chance that A has the disease?' Of course, in the real question, A is not a random person, but me, but (especially if you modify things by taking additional data into account), the random person approximation is reasonable. For the coin question, there's no corresponding reasonable approximation by an answerable question.
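
(A sketch of that calculation, with invented figures for the incidence and the test's error rates:)

```python
# Sketch of the disease-testing calculation above. All numbers are invented:
# 1% incidence, 95% sensitivity, 5% false-positive rate.
incidence = 0.01         # P(disease) for a randomly chosen person
sensitivity = 0.95       # P(positive | disease)
false_positive = 0.05    # P(positive | no disease)

p_positive = sensitivity * incidence + false_positive * (1 - incidence)
p_disease = sensitivity * incidence / p_positive   # Bayes' theorem
print(p_disease)  # ~0.16: even after a positive test, disease is unlikely
```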


Thank you, especially for the opening paragraph, which helped crystallise what I was thinking.

The medical example brought to mind similar situations in my work where I would use the same approach.

My knowledge of statistics (and most other fields) seems to be eccentrically clustered: quite broad, but with huge areas missing. I might have trained as an applied physicist, but I am an engineer by inclination and profession, and as Ivor has said, that includes a lot of pragmatism, such as using tools that are useful... I have usually developed my analytical toolset specifically to answer particular questions.
 
To make it clearer, suppose that there is no more data coming. Perhaps the statistics is in the predictions of the theory, not the measurements, and we've already measured the clustering of all galaxies in the observable universe well enough. If the data is consistent with A only in some window of width w=10^-10 and is 5\sigma off for B, what do you conclude?

I feel pretty strongly that you should consider B ruled out. Change the numbers down to 1 or 2\sigma, and it's clear A should be favoured. Where is the changeover? That's a question that simply can't be answered. But `can B be ruled out with 99.9999% confidence' (in the technical sense), while not what you want to know, can be answered.
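
(For reference, assuming Gaussian errors, the confidence figures above come from two-sided tail probabilities, e.g.:)

```python
# Sketch: two-sided Gaussian tail probabilities for the sigma levels above.
from scipy.stats import norm

for sigma in (1, 2, 5):
    p = 2 * norm.sf(sigma)  # probability of a deviation at least this large
    print(f"{sigma} sigma: p = {p:.1e}, confidence = {100 * (1 - p):.5f}%")
# 5 sigma gives p ~ 5.7e-07, i.e. roughly the 99.9999% quoted above
```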

I think in that second paragraph you meant "B should be favored" where you said "A should be favoured".

But I don't agree with you that in the first case B should be ruled out. You have to think about what this situation would mean. If B is 5 sigma off but w is 10^-10, it means the data are very, very close (in absolute terms) to the prediction given by B. So now you have two theories, one with no parameter that comes within about 10^-10 of getting the answer on the nose, and one with a whole continuous parameter which of course can be made to fit the data. B coming that close but barely missing would be too much of a "coincidence" for anyone to swallow easily.

At that point you have to think hard about what to do, and to rule out B you'd have to be extremely confident that there are no corrections (arising either from systematic error in the data or from theoretical uncertainties in B) which are larger than 10^-10. Realistically, I think a lot of people would work on B in this situation, to try to get it to fit. I very much doubt anyone would say with any confidence that B is ruled out, which fits nicely with the Bayesian result.

To make that concrete, suppose someone finds a solution in string theory that comes within 10^-10 of matching all the parameters of the standard model, but is off by 5 sigma on the fine-structure constant (which is one of the only things in nature that has actually been measured to that accuracy). There is no way people would just say, "oh well, it's off by 5 sigma so it's ruled out". No way - everyone would work on it.
 
The nature of what you are testing, of course. If you are interested in a question about similarity of means, then any two samples that have the same mean are equivalent. That's hardly arbitrary.

Because we don't "happen" to call them equivalent -- the experimental hypothesis we wish to test does. If I wish to know if college students have higher mean self-esteem than soldiers, then I'm interested by definition in means. If I'm interested in whether college students have greater variance in self-esteem, then I'm interested in variance. If I'm interested in whether the sets differ at all, then I'm still not interested in differences in subject ordering that are an artefact of my sampling procedure. (I.e. any two data sets that are permutations of each other are equivalent, because sets are unordered).

But it's explicitly about what sort of variance from randomness is irrelevant to the question at hand.

That's not what I meant.

There is some null hypothesis. The question at hand is whether to reject it; the question at hand is not, supposedly, what might replace it. A significance test rejects the null hypothesis if there was a sufficiently low probability, on that hypothesis, of having gotten the observed result or any other result that is "more extreme". What exactly does "more extreme" mean? Is there a unique way to define it, based only on the question at hand?

Specific example: I want to check whether some data came from a particular distribution. Should I use a chi-square test or a Kolmogorov-Smirnov test? (I have enough data, the distribution is continuous, everything's great.)

Can you answer that without reference to the prior probabilities of various alternative hypotheses---that is, without considering how you expect the data to differ from the given distribution, should they in fact differ from it?
 
....
ETA: Amazing how science progressed up to the middle of the 20th century without performing statistical tests at every opportunity.

Ah, but don't forget that Student's t-test was created by an employee of Guinness Beer, William Gosset (1876-1937): http://www.columbia.edu/ccnmtl/projects/qmss/t_about.html ... maybe science did not progress, but beer did! (by the way, science probably did progress with statistics; don't forget Kinsey used lots of statistics, and his first report was published in 1948 :D)

I wandered in here cause I saw the title, and was imagining some silly discussion of probability prediction and I'm Gonna Win Vegas and all that sort of rot. Instead, I find an argument between two sides that I didn't even know existed until now.

Very interesting!

I agree. I have been gainfully unemployed for almost 20 years, and am now getting back into the swing of thinking by taking a statistics course. I took one in college 30 years ago, but there are vast differences between engineering statistics and the statistics used with human population studies. I am learning lots. Now back to my homework (the downside is that I am now doing 7th grade slope/intercept linear equation solving; I've had two days of class with the instructor explaining how to get the equation of a straight line from two points... sigh... Though today during class I played around with matrix algebra from the first chapter of a book I'm reading on Euler's Identity).
 
I think in that second paragraph you meant "B should be favored" where you said "A should be favoured".

Sorry - yes of course.

But I don't agree with you that in the first case B should be ruled out. You have to think about what this situation would mean. If B is 5 sigma off but w is 10^-10, it means the data are very, very close (in absolute terms) to the prediction given by B. So now you have two theories, one with no parameter that comes within about 10^-10 of getting the answer on the nose, and one with a whole continuous parameter which of course can be made to fit the data. B coming that close but barely missing would be too much of a "coincidence" for anyone to swallow easily.

By ruled out, I simply meant ruled out as it is. It's asking a bit much of statistics to tell you whether a modified version of the theory might fit the data! I agree that in practice you'd look for modifications. To refine my example further (I meant to say this, but forgot), suppose that this clustering we are measuring is not one number, but 5 different numbers, so the predictions of A are a curve in this 5-dimensional space. If this curve goes right through the observations, and B is 5\sigma off, I think you'd accept A. (And then in practice start thinking about possible underlying theories that explain the value of the parameter.)

I'm a mathematician: my example isn't going to be realistic! It's just meant to show that there are certainly (conceivable) situations where assigning prior probabilities isn't sensible. Ideally I'd like you to suppose that you somehow know that either A or B is in fact the ultimate truth as to how the universe operates, which I agree is rather unrealistic. But its being unrealistic is even more of an argument against trying to assign prior probabilities.

There are many more down-to-earth examples where (in my mind) hypothesis testing is the only sensible approach, at least at first - suppose that I notice that light bulbs in my house seem to fail in clusters, rather than according to a Poisson process as I would expect. This might be due to variations in the voltage supply, with more failing when it's too high. Am I really going to start thinking about the probability that there's something wrong with the power supply, and if so, the distributions on how wrong and in what pattern it's wrong, along with how much this affects light bulbs? No - I'll first test the null hypothesis of Poisson failure, without needing to worry about the alternatives. If it's rejected, then I'll investigate further (probably by digging out a voltmeter, rather than by running additional tests). If it's accepted, I'll put it down to selective memory (with the possibility of testing again if it seems to keep happening). But if I'd started with this example, you'd claim that I'm secretly Bayesian about it, and have some idea of the probability that there is a fault in the supply. Maybe that's true, but I certainly have no idea of the distribution on types of faults etc., so I just accept that I don't have the data to answer the question I'd like, and answer a different one.
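
(For concreteness, a rough sketch of how such a test might look, with invented failure times and one possible clustering statistic, the variance-to-mean ratio of binned counts, calibrated by simulating the Poisson null:)

```python
# Sketch of the light-bulb test: how often does a homogeneous Poisson
# process produce as much clustering as observed? Failure times invented.
import numpy as np

rng = np.random.default_rng(0)
failures = np.array([2.0, 3.0, 4.0, 40.0, 41.0, 90.0, 91.0, 92.0, 150.0, 151.0])
T = 180.0  # days observed

def dispersion(times, n_bins=18):
    """Variance-to-mean ratio of counts per bin: ~1 for Poisson, >1 if clustered."""
    counts, _ = np.histogram(times, bins=n_bins, range=(0.0, T))
    return counts.var() / counts.mean()

observed = dispersion(failures)
# Conditioned on the number of events, Poisson arrival times are uniform on [0, T].
null_sims = [dispersion(rng.uniform(0.0, T, failures.size)) for _ in range(10_000)]
p_value = np.mean([s >= observed for s in null_sims])
print(observed, p_value)  # a small p-value rejects the Poisson null
```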
 
By the way, to those who are surprised to find an argument, that's because (as far as I know) there isn't one. Having looked up what a frequentist is, there are three possible arguments that one can confuse:

(a) given some real example, do we assign prior probabilities to various models and use a Bayesian approach, or do we pick a null hypothesis and test it/construct confidence intervals etc. This is not a real argument - any sensible person will use both approaches in appropriate situations (e.g., the medical and lightbulb examples).

(b) in an ideal world, would we always use the Bayesian approach? I think this is what sol is suggesting, and I disagree. But it's totally irrelevant in practice. My disagreement boils down to saying: if you are allowed to make the absurd hypothetical assumption that we have enough data to in fact know the prior distribution in all cases, then I can make some absurd assumption that we know one of two theories of everything is absolutely correct, and it's clear we cannot assign priors to these (unless the Simulator tells us what kind of coin he tossed to choose between them...)

(c) what does Probability mean? This is a philosophical question and therefore (to me) meaningless. I only bring it up because of the word Frequentist: according to www.statisticalengineering.com/frequentists_and_bayesians.htm, for example, Frequentists believe that probabilities refer to long term averages, and Bayesians that they refer to degrees of knowledge. To a mathematician, of course, probability theory is a certain formal system with rules as to how probabilities can be manipulated. However, we all have a good idea of when it can be sensibly applied to the real world, and no-one would restrict this to long run averages. For example, I'm sure any mathematician/statistician/physicist would be happy with the idea that the universe evolves according to probabilistic rules, and that one can sensibly talk of the probability of some event that could happen at most once.

I can imagine that you might see mathematicians and statisticians arguing about (c), but only for fun, in the same way that mathematicians might argue about Platonism.
 
By ruled out, I simply meant ruled out as it is. It's asking a bit much of statistics to tell you whether a modified version of the theory might fit the data! I agree that in practice you'd look for modifications. To refine my example further (I meant to say this, but forgot), suppose that this clustering we are measuring is not one number, but 5 different numbers, so the predictions of A are a curve in this 5-dimensional space. If this curve goes right through the observations, and B is 5\sigma off, I think you'd accept A. (And then in practice start thinking about possible underlying theories that explain the value of the parameter.)

I don't see how that makes any difference, either to the answer from Bayes or to my feeling about how we should weight the theories. In your example A successfully predicts 4 parameters without any input, and then has a parameter for the 5th (we can always rotate coordinates in the space so that's true). But we can do the same for B - we can rotate so that B successfully predicts 4 numbers and is off on a 5th. Then we're back to the original problem.

I'm a mathematician: my example isn't going to be realistic! It's just meant to show that there are certainly (conceivable) situations where assigning prior probabilities isn't sensible.

I don't think you've shown that - in fact I think it's a good example of a case where hypothesis testing gives the wrong answer, and Bayes comes close to the right one. Going back to the first case, a frequentist would simply abandon B - and that would be a very bad idea (in my professional opinion as a scientist).

Science progresses because sometimes scientists have good intuition. There are an infinite number of possibilities, and we have to choose which to spend our time investigating. So what really separates good scientists from bad is intuition - and intuition has no place in frequentist thinking. But for a Bayesian it's simply the assignment of priors.

But if I'd started with this example, you'd claim that I'm secretly Bayesian about it, and have some idea of the probability that there is a fault in the supply.

Suppose you'd just read in the newspaper that morning that recent spikes in power had been causing failures. A true frequentist would completely ignore that information, no? But of course no human - and no scientist - would. That's why I think we're all Bayesians, and that's a good thing.
 
<snip>

Suppose you'd just read in the newspaper that morning that recent spikes in power had been causing failures. A true frequentist would completely ignore that information, no? But of course no human - and no scientist - would. That's why I think we're all Bayesians, and that's a good thing.

Isn't a human brain a biological Bayesian network?

For example, we don't see reality directly; we have prior models (beliefs) about what we will see and check (update) them with new data from our retinas.
 
I'll first test the null hypothesis of Poisson failure, without needing to worry about the alternatives.

Here's a significance test that doesn't worry about alternatives: Pick a random number in [0, 20). If it's less than 1, reject the null hypothesis.

Obviously, this is a silly test. But what's wrong with it, exactly? If the null hypothesis is true, the test rejects it with probability 1/20, just as it should.

The problem, of course, is that if the null hypothesis is false, it also rejects it with probability 1/20. For a test to be useful, it needs to be more likely to reject the null hypothesis if it's false than if it's true. So, if we want to devise a useful test, there's no way to avoid thinking about alternative hypotheses.
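
(A toy simulation of this point, with invented numbers: both tests reject a true null about 5% of the time, but only a test that actually looks at the data gains power when the null is false.)

```python
# Toy comparison of the "random number" test with a real binomial test.
# Both hold the ~5% false-rejection rate under the null (p = 0.5); only the
# real test rejects more often when the null is false (here p = 0.7).
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

def silly_test(flips):
    return rng.uniform(0, 20) < 1  # ignores the data entirely

def real_test(flips):
    return binomtest(int(flips.sum()), len(flips), 0.5).pvalue < 0.05

for p_true in (0.5, 0.7):
    trials = [rng.random(100) < p_true for _ in range(2000)]
    print(p_true,
          np.mean([silly_test(t) for t in trials]),  # ~0.05 in both cases
          np.mean([real_test(t) for t in trials]))   # ~0.04, then ~0.98
```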

But if I'd started with this example, you'd claim that I'm secretly Bayesian about it, and have some idea of the probability that there is a fault in the supply. Maybe that's true, but I certainly have no idea of the distribution on types of faults etc, so I just accept that I don't have the data to answer the question I'd like, and answer a different one.

From a Bayesian point of view, the distribution on types of faults is not an objective property of the world, of which you might have a good or bad idea. What's objective is simply whether or not a fault happened. If you don't have a clear idea about that, you use a probability distribution to characterize the degree of vagueness of your idea.

A prior probability distribution is not a kind of data that you might have or not have; it is simply a description of the extent to which you have data of other kinds. You might not have any data about faults, but presumably you at least know whether you do or don't.
 
Isn't a human brain a biological Bayesian network?

For example, we don't see reality directly; we have prior models (beliefs) about what we will see and check (update) them with new data from our retinas.

I agree, more or less. But I guess the difference is a rather philosophical one in any case - does it make sense to discuss the probability that a theory is correct, given that there is only one theory which describes the world? A frequentist would say no.

Here's a significance test that doesn't worry about alternatives: Pick a random number in [0, 20). If it's less than 1, reject the null hypothesis.

Sorry - I don't understand your example. What's the null hypothesis here?
 
Here's a question for anyone who understands frequentist reasoning:

Frequentists (as far as I understand) won't allow themselves to ask what the probability is of a theory given some data. No, they say, there is only one true theory, so a given theory under test is either correct or not correct - it's meaningless to ask about its probability.

But suppose we are considering two different theories of physics: A and B. I'm going to make a new theory, C, in which the laws of physics vary as you move around in such a way that some parts of the universe are described by A and some by B. Now A and B are both correct (somewhere) - so it's simply not true that one or the other is correct. And it seems perfectly reasonable - even to a frequentist - to define a probability distribution or prior on A and B (you might use the fraction of the C-universe that's in A versus B).

Doesn't the fact that theory C might be correct nullify the frequentist argument entirely? That is, some version of C can never be ruled out (assuming one of A or B holds where we are), so it's always a valid hypothesis according to a frequentist. But if C is a valid hypothesis, why can't I use it to compute a probability distribution on A and B?
 
I don't see how that makes any difference, either to the answer from Bayes or to my feeling about how we should weight the theories. In your example A successfully predicts 4 parameters without any input, and then has a parameter for the 5th (we can always rotate coordinates in the space so that's true). But we can do the same for B - we can rotate so that B successfully predicts 4 numbers and is off on a 5th. Then we're back to the original problem.
Hmm. Well, rotating doesn't change anything for B - it's simply off in total either way. But I agree that (if rotating is reasonable, which I don't think it always will be, but never mind) you could view A as simply predicting 4 parameters and saying nothing about the last.

I don't think you've shown that - in fact I think it's a good example of a case where hypothesis testing gives the wrong answer, and Bayes comes close to the right one. Going back to the first case, a frequentist would simply abandon B - and that would be a very bad idea (in my professional opinion as a scientist).
Firstly, you still seem to be expecting way too much of statistics: no statistician would tell you that B is wrong and that you should therefore abandon developing it further. In fact, it's more the other way around. Without statistics, you already know that the observations are close to the predictions. The main thing you want to know is whether there is any need to look for refinements - if in fact they are within 1 \sigma, then it would be silly to try to explain the `error' by some new effect. If off by 5 \sigma, you should - hypothesis testing would simply tell you that B is wrong as it stands.

Secondly, I think with your setup, Bayes gives the answer you don't want: viewing A as a single theory that only predicts 4 parameters, what prior probabilities are you assigning to A and B? If they are roughly equal, Bayes will tell you that given the observations, A is 99.9999% certain.
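
(A sketch of that arithmetic, with made-up relative likelihoods: the data sit exactly on A's fitted prediction, B is off by 5\sigma with Gaussian errors, and both theories get prior 1/2.)

```python
# Sketch: posterior for A under equal priors, with invented likelihoods.
import math

like_A = 1.0                    # data sit right on A's (fitted) prediction
like_B = math.exp(-0.5 * 5**2)  # Gaussian likelihood ratio at 5 sigma
prior_A = prior_B = 0.5

post_A = like_A * prior_A / (like_A * prior_A + like_B * prior_B)
print(post_A)  # ~0.9999963: essentially certain, as claimed above
```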

Science progresses because sometimes scientists have good intuition. There are an infinite number of possibilities, and we have to choose which to spend our time investigating. So what really separates good scientists from bad is intuition - and intuition has no place in frequentist thinking. But for a Bayesian it's simply the assignment of priors.
Of course intuition has a place in any application of statistics: you have to decide what hypothesis to test, and also what's a sensible significance level to look for.


Suppose you'd just read in the newspaper that morning that recent spikes in power had been causing failures. A true frequentist would completely ignore that information, no? But of course no human - and no scientist - would. That's why I think we're all Bayesians, and that's a good thing.

That would certainly affect the significance level that I'd use in the test. But I don't at all see how to use it to construct a prior distribution on the degree of clustering of failures.

Doesn't the fact that theory C might be correct nullify the frequentist argument entirely? That is, some version of C can never be ruled out (assuming one of A or B holds where we are), so it's always a valid hypothesis according to a frequentist. But if C is a valid hypothesis, why can't I use it to compute a probability distribution on A and B?
No - in fact I'd say your example supports it! If you allow that C might be correct, you presumably also allow variations of C in which the fractions of A and B vary. So this argument says that any prior distribution is ok. To obtain a conclusion by Bayesian reasoning you need to decide on one, and it still seems to me totally clear that there is no reasonable way of doing this for two Theories of Everything.
 
The problem, of course, is that if the null hypothesis is false, it also rejects it with probability 1/20. For a test to be useful, it needs to be more likely to reject the null hypothesis if it's false than if it's true. So, if we want to devise a useful test, there's no way to avoid thinking about alternative hypotheses.

In my example, I've already done that, and any statistician always would. If I've noticed that the bulbs seem to fail in clusters, I'll use a test based on a measure of clustering, and reject the null hypothesis if there is more clustering than would plausibly be seen under the Poisson distribution.



From a Bayesian point of view, the distribution on types of faults is not an objective property of the world, of which you might have a good or bad idea. What's objective is simply whether or not a fault happened. If you don't have a clear idea about that, you use a probability distribution to characterize the degree of vagueness of your idea.

A prior probability distribution is not a kind of data that you might have or not have; it is simply a description of the extent to which you have data of other kinds. You might not have any data about faults, but presumably you at least know whether you do or don't.

I don't really understand this: I'm taking the Bayesian point of view to mean choosing a prior somehow and applying Bayes Theorem. I'm not interested in what probabilities mean. To apply Bayes Theorem, you need a distribution on the degree of faultiness of the power supply. I agree it's reasonable to guess a probability that the supply is faulty, but to apply Bayes Theorem, your prior needs to be specific enough to calculate the probability of seeing a certain degree of clustering in the failures given that there is a fault in the supply. I can't imagine how you would come up with such a prior. The whole point of hypothesis testing is that you don't need to - it's good to have an idea how likely a fault is, because this affects what significance level you should choose, but there's no need to think about modelling the degree of faultiness until you've established that you have any evidence at all that anything is wrong.
 
Only one theory correctly describes the Necker cube. That theory being: it is a two-dimensional drawing.

There are of course two additional common theories that the human visual system is primed to see. They are both wrong.

I would say it depends on prior belief/information. The Necker cube is a valid 2-D projection of a feasible 3-D object with two different orientations in space. If you could estimate the likely orientation of the original 3-D object there would be a 'correct' way to interpret it. However, without this information (or an estimate of it) the only sensible thing to do is assign a prior of probability 0.5 to each and say both hypotheses are equally likely to be correct.

Having said that, I think it’s a nice example of Bayesian estimation which most people can appreciate.
 
I would say it depends on prior belief/information. The Necker cube is a valid 2-D projection of a feasible 3-D object with two different orientations in space. If you could estimate the likely orientation of the original 3-D object there would be a 'correct' way to interpret it. However, without this information (or an estimate of it) the only sensible thing to do is assign a prior of probability 0.5 to each and say both hypotheses are equally likely to be correct.

Well, that is a good summary of the Bayesian position.

A frequentist would agree with everything except the italicized sentence. He would then point out that if, by hypothesis, you have assumed that you cannot estimate the likely orientation, why are you trying to do what you have already stipulated is impossible-by-hypothesis?

The italicized sentence is simply meaningless from a frequentist point of view.

A frequentist parody of a Bayesian would read something like: "What would happen if an irresistible force met an immovable object? Now, we know that by definition, this event cannot happen. In a universe where one of those can exist, the other -- by definition -- cannot. We know that no one with the sense God gave an onion would consider that to be a meaningful question or would attempt to quantify the inherently impossible. Any sensible person with the education of a retarded kumquat would dismiss the entire question as a waste of money, brains, and time. Any moderately sophisticated philosopher would think substantially less of any person who even attempted to give a numeric answer, and the entire audience who heard that answer would leave the room more stupid than they had entered.

Therefore, I say that the immovable object has a 50/50 chance of moving."
 
A more 'answerable' question:

There is a bag containing 1,025 pennies, one of which is a 'fake' penny with both sides being 'heads'. You pick a coin at random from the bag without looking at it, and flip it 10 times. Each time you look at the face-up side of the coin, and it is 'heads' all 10 times. If you pick up the coin and turn it over after the 10th flip, what is the probability that the other side will be 'heads'?
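
(For anyone who wants to check their answer, a sketch of the Bayesian arithmetic, using exact fractions:)

```python
# Sketch of the penny puzzle: posterior probability that the drawn coin is
# the two-headed fake, given ten heads. The hidden side is heads exactly
# when the coin is the fake, so this posterior is also the answer.
from fractions import Fraction

prior_fake = Fraction(1, 1025)
prior_fair = Fraction(1024, 1025)
like_fake = Fraction(1, 1)        # a two-headed coin always shows heads
like_fair = Fraction(1, 2) ** 10  # ten heads in a row from a fair coin

post_fake = (prior_fake * like_fake
             / (prior_fake * like_fake + prior_fair * like_fair))
print(post_fake)  # 1/2
```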
 
I wandered in here cause I saw the title, and was imagining some silly discussion of probability prediction and I'm Gonna Win Vegas and all that sort of rot. Instead, I find an argument between two sides that I didn't even know existed until now.

Very interesting!

Hmmm, I read the OP on this thread and was expecting a discussion of the Wald-Wolfowitz runs test. That said, I've been around stats long enough to be aware of the frequentist / Bayesian bunfight. Just don't mention Dempster-Shafer and I think we might get out alive :boxedin:
 
