
What's the required p-value to beat?

FluffyPersian

From an earlier thread:

3. Some folks have argued that the Challenge is unfair because the p-values and effect sizes called for are too extreme. Have any applicants claimed to have paranormal abilities with only a small effect size?

What is the required p-value? I tried searching around, but no number came up.

More importantly, are there any negotiations in which there were debates over this (even if they don't include the word "p-value")? Another poster mentioned that there were some negotiations over this with Ziborov, but I found little to that effect.

I ask because I teach a subject that involves applied statistics. I'd love to use an attempted demonstration of the supernatural as an example, because the meaning of "happened by chance" really stands out in this context.
 
I think the answer is 0.001 for the first test, though sometimes a p-value cannot be calculated. For example, if I claimed I could defy gravity by rising from the ground, chance plays no part: the feat is simply impossible, so if I could demonstrate it, I would win.
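To make the 0.001 figure concrete, here's a minimal sketch (a made-up guessing test, not an actual MDC protocol): a claimant calls the colour of a hidden card, so chance alone gives 50% per trial, and we ask how many hits out of 25 would push the chance explanation below 0.001.

```python
# Hypothetical guessing test, not an actual MDC protocol: how surprising
# is a given hit count if the claimant is guessing at 50%?
from scipy.stats import binom

n = 25                                # trials
for hits in (18, 20, 21):
    p = binom.sf(hits - 1, n, 0.5)    # P(at least this many hits by chance)
    print(hits, p)
# ~0.022 for 18 hits, ~0.0020 for 20, ~0.00046 for 21: roughly 21 or more
# hits out of 25 are needed to get under the 0.001 bar.
```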
 
Thanks, xterra and rjh. RJ, the example you gave inspired me to come up with a hybrid example.

.001 is indeed a demanding threshold. In most research, the significance level is 0.05, which amounts to 0.025 in each tail of a two-tailed test (can't be different from...). My guess is that this is to safeguard the one million in case someone without legitimate supernatural powers (er, everyone) kept coming back for repeated challenges.
 
1:1000 seems to be a general rule of thumb for the preliminary test, but because claims (and therefore test protocols) vary so wildly JREF seem to be reluctant to state that officially.

Most people assume that the success criteria would be higher for the final test, though a simple repetition of the preliminary test would produce combined odds of 1:1,000,000 which seems adequate to me. But until and unless someone passes the preliminary test, that question is obviously moot.
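Spelling out the arithmetic behind that combined figure (assuming the two tests are independent, which is the natural reading):

$$P(\text{pass both by luck}) = \frac{1}{1000} \times \frac{1}{1000} = \frac{1}{1{,}000{,}000}$$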
 
Thanks, xterra and rjh. RJ, the example you gave inspired me to come up with a hybrid example.

.001 is indeed a demanding threshold. In most research, the significance level is 0.05, which amounts to 0.025 in each tail of a two-tailed test (can't be different from...). My guess is that this is to safeguard the one million in case someone without legitimate supernatural powers (er, everyone) kept coming back for repeated challenges.

The difference is that it does not really matter if a piece of research is wrong. The research would be repeated and found to be wrong. In fact a lot of it is incorrect. That is why you have meta-analyses in research. In the MDC it does matter if the result is incorrect. JREF could lose $1m.
 
The difference is that it does not really matter if a piece of research is wrong. The research would be repeated and found to be wrong. In fact a lot of it is incorrect. That is why you have meta-analyses in research. In the MDC it does matter if the result is incorrect. JREF could lose $1m.

Yep, precisely what I said above. Given a high enough number of claimants, and enough repeated trials for individual claimants, sheer chance would allow someone to claim the prize if the required p-value were high enough. But I think the existing initial obstacles (the need for a recommendation letter from a professor) vastly reduce the number of preliminary trials, and a limit on the number of attempts (if it doesn't exist already) would take care of the problem altogether.

In a field like medicine, it CAN matter if a piece of research is wrong. In practice, since it often takes years for meta-analyses to appear, treatment decisions are often based on the newest research.
 
As I said in the last thread, this is not intended as a lottery. Ideally, what is being aimed for is certainty. In practice, that's not always possible, but discussing the p-value certainly seems like a red flag. If you believe you have a power, then you should be confident enough not to care how small the p-value is, because you should believe you'll win handily no matter how small it is.
 
As I said in the last thread, this is not intended as a lottery. Ideally, what is being aimed for is certainty. If you believe you have a power, then you should be confident enough not to care how small the p-value is, because you should believe you'll win handily no matter how small it is.



I think the question of what counts as a supernatural power of prophecy is a perfectly fair one. The answer could be anything from beating chance to 100% accuracy in any given trial. Given my lack of supernatural powers, I have little stake in the answer, but I don't see why bringing it up is a red flag.

And as you've implied above, if the trial involves any element of chance (e.g. sensing what integer within a range is on a hidden sheet of paper), then it's always a lottery of sorts.
 
I think the question of what counts as a supernatural power of prophecy is a perfectly fair one. The answer could be anything from beating chance to 100% accuracy in any given trial. Given my lack of supernatural powers, I have little stake in the answer, but I don't see why bringing it up is a red flag.

I think it's a red flag because it starts off by asking what the odds are. In other words, it's treating the million dollars as a lottery instead of a prize for a successful demonstration.

The only reasonable answer is: small enough to be convincing. Otherwise you're inviting people to try to game the system.
 
1:1000 seems to be a general rule of thumb for the preliminary test, but because claims (and therefore test protocols) vary so wildly JREF seem to be reluctant to state that officially.

Most people assume that the success criteria would be higher for the final test, though a simple repetition of the preliminary test would produce combined odds of 1:1,000,000 which seems adequate to me. But until and unless someone passes the preliminary test, that question is obviously moot.

The problem with that argument is that it assumes there isn't a fatal flaw in the design.

I submit the real reason for a double-layer test is so that, should someone pass the preliminary by some means, it will allow experts to double-check where fraud could have crept in undetected and tighten their observation for the second round.

Relying on stats is the error most scientists make in studying the paranormal. The real issue is magician sleight-of-hand. There are no real odds going on (and if there is something real, a million tests in a row will succeed).
 
I think the question of what counts as a supernatural power of prophecy is a perfectly fair one. The answer could be anything from beating chance to 100% accuracy in any given trial. Given my lack of supernatural powers, I have little stake in the answer, but I don't see why bringing it up is a red flag.

And as you've implied above, if the trial involves any element of chance (e.g. sensing what integer within a range is on a hidden sheet of paper), then it's always a lottery of sorts.

FluffyPersian, Take a look at my post from the thread entitled "How are MDC protocols designed and carried out?"

http://www.internationalskeptics.com/forums/showpost.php?p=8391238&postcount=58

Post #67 is the answer to what I asked in #58; post #77 is my response to #67.

Does this help explain why people here are not concerned with p-values?

-----------

xtifr, here is the last sentence in the original post in this thread:

"I ask because I teach a subject that involves applied statistics. I'd love to use an attempted demonstration of the supernatural as an example, because the meaning of "happened by chance" really stands out in this context."

I take this to mean that FluffyPersian is not going to become a claimant, and thus he/she* does not think there is a red flag.

As usual, if I have misconstrued, misinterpreted, or misunderstood either of you, I ask for correction so we can continue the discussion.

*FluffyPersian, for clarity, please tell us which pronoun to use.
 
FluffyPersian,

My error. I have no idea what went awry in the link I posted previously. My post in that thread was number 58, but the link showed it incorrectly.

Try this:

http://www.internationalskeptics.com/forums/showthread.php?t=238290

Then go to page 2, and look for my username -- the easiest way is to use the find feature on your browser. From there, follow down as indicated in my previous post.

I think this will work....
 
FluffyPersian,

My error. I have no idea what went awry in the link I posted previously. My post in that thread was number 58, but the link showed it incorrectly.

...

Xterra, your post is still number 58, but you might instead wish to link via the little "link" button at the bottom right of the posts you wish to cite.

58: http://www.internationalskeptics.com/forums/showthread.php?postid=8389976#post8389976

67: http://www.internationalskeptics.com/forums/showthread.php?postid=8391270#post8391270

77: http://www.internationalskeptics.com/forums/showthread.php?postid=8392067#post8392067

Hope that helps.
 
Most people assume that the success criteria would be higher for the final test, though a simple repetition of the preliminary test would produce combined odds of 1:1,000,000 which seems adequate to me. But until and unless someone passes the preliminary test, that question is obviously moot.

The probability of getting a p-value of 0.001, assuming the claim is true, depends on the statistical power of the test. That power depends on the sample size, the alpha level, and the effect size of the claim. Unfortunately, small effect sizes are generally hard (though not impossible) to detect at the 0.001 significance level when the sample size is small; with a large enough sample, there's a good chance the test will detect the effect. Since I seriously doubt that claimants know how strong or weak their paranormal abilities are, chances are they are being tested under inappropriate conditions.

I agree that an alpha of 0.001 is the standard in the preliminary test. However, if what Pixel said is true, that a single replication of the preliminary produces a combined p-value of 0.000001, then the claimant had better be good at whatever he claims. If they don't combine them, then I guess the claimant may well have to beat odds of a billion to one in order to pass the formal test.
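To illustrate the power point with numbers I've made up (none of this is an actual JREF protocol): a claimant whose true hit rate is 60% against a 50% chance baseline, tested one-tailed at alpha = 0.001.

```python
# Hypothetical numbers, not an actual JREF protocol: power of a one-tailed
# binomial test at alpha = 0.001 for a modest effect (true hit rate 60%
# where chance alone gives 50%).
from scipy.stats import binom

def power(n, p0=0.5, p1=0.6, alpha=0.001):
    # Smallest hit count whose tail probability under chance is <= alpha.
    k_crit = int(binom.isf(alpha, n, p0)) + 1
    # Probability a genuine 60% hitter reaches that critical count.
    return binom.sf(k_crit - 1, n, p1)

for n in (50, 200, 1000):
    print(n, power(n))
# Power climbs from a few percent at n=50 to roughly a third at n=200 and
# near-certainty at n=1000: small effects need large samples at this alpha.
```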
 
There are no real odds going on (and if there is something real, a million tests in a row will succeed).

A million tests will succeed in a row? That is wildly unrealistic in practical terms, even for conventional research. So, if a study found a p-value of 0.001, what is the probability of getting five 0.001 p-values in a row? Simple: 0.001^5 = 1 x 10^-15.

Not even conventional research has reached those kinds of odds.
 
Yep, precisely what I said above. Given a high enough number of claimants, and enough repeated trials for individual claimants, sheer chance would allow someone to claim the prize if the required p-value were high enough. But I think the existing initial obstacles (the need for a recommendation letter from a professor) vastly reduce the number of preliminary trials, and a limit on the number of attempts (if it doesn't exist already) would take care of the problem altogether.

Obviously, the question is how many trials you expect to run in total. To safeguard the million, you'd want the chance that the million is paid out to be low, even after all of them are done. And by low I mean a fraction of a percent.
If we expect a thousand trials, then a million to one is the least that will do.

Given the rather large population of professional psychics (i.e. potential claimants at whom the challenge is actually aimed), expecting thousands of applicants seems reasonable.
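A back-of-the-envelope check of that claim (my arithmetic, assuming independent attempts):

```python
# If each of 1,000 independent claimants has a one-in-a-million chance of
# passing both stages by luck, the chance of at least one payout is:
p_pass, n_claimants = 1e-6, 1000
print(1 - (1 - p_pass) ** n_claimants)   # ~0.001, i.e. about 0.1%
```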
 
the claimant had better be good at whatever he claims.
If the claimant is any good at all then they will do consistently better than chance, and their ability will become more and more obvious with each test as the probability of their success being due solely to chance steadily decreases.
 
if there is something real, a million tests in a row will succeed

Ah, so that's why every baseball player bats 1.000. And every chess grand master has never lost a game. And every astrophysicist has never made a math error. Oh, wait a minute.....
 
If the claimant is any good at all then they will do consistently better than chance, and their ability will become more and more obvious with each test as the probability of their success being due solely to chance steadily decreases.

True, but the question here is the sample size and the power of the study. Is the sample size/power sufficient for the test to detect the claimant's claim?
 
Ah, so that's why every baseball player bats 1.000. And every chess grand master has never lost a game. And every astrophysicist has never made a math error. Oh, wait a minute.....
But in all those cases it would become clear very quickly that those individuals were doing considerably better than they would be expected to do if they were just swinging the bat/making moves/writing down figures at random.
 
True, but the question here is the sample size and the power of the study. Is the sample size/power sufficient for the test to detect the claimant's claim?
That depends on what the claimant's claim is. Most claimants claim a considerably higher success rate than they need to achieve to reach the sort of success criteria JREF usually set. For example dowsers usually expect to be able to tell the difference between a buried barrel of water and a buried barrel of sand every time, so the 70% or 80% success rate that's actually needed should be a doddle.

What needs to be remembered is that the applicants never actually do any better than chance. It's not that they do a little bit better, but not well enough to meet the JREF success criteria - their results are always well within that which would be expected by chance alone.
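To put a rough number on the dowsing example (a hypothetical protocol of my own devising, not one JREF has actually used): with two choices per trial, chance alone gives 50%, so an 80% hit rate over 30 trials is very hard to reach by guessing.

```python
# Hypothetical barrel test: 30 trials, two choices each, pass at 24+ hits (80%).
from scipy.stats import binom
print(binom.sf(23, 30, 0.5))   # ~0.0007: the chance a pure guesser passes
```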
 
Even if someone only claims a minimal success rate above chance, sufficient repetition could make achieving the required p-level not difficult at all...
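For instance (my numbers, using the standard normal approximation rather than any actual protocol): a claimant who is right only 52% of the time against a 50% baseline could still pass an alpha = 0.001 test with 90% power, but it takes on the order of twelve thousand trials.

```python
# Normal-approximation sample size for a one-tailed binomial test:
# chance rate 50%, claimed rate 52%, alpha = 0.001, power = 90%.
from math import ceil, sqrt
p0, p1 = 0.50, 0.52
z_alpha, z_beta = 3.09, 1.28    # z for one-tailed 0.001, and for a 10% miss rate
n = ((z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
print(ceil(n))                  # ~12,000 trials
```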
 
That depends on what the claimant's claim is. Most claimants claim a considerably higher success rate than they need to achieve to reach the sort of success criteria JREF usually set. For example dowsers usually expect to be able to tell the difference between a buried barrel of water and a buried barrel of sand every time, so the 70% or 80% success rate that's actually needed should be a doddle.

What needs to be remembered is that the applicants never actually do any better than chance. It's not that they do a little bit better, but not well enough to meet the JREF success criteria - their results are always well within that which would be expected by chance alone.

I disapprove of p-values, particularly when applied to hypothesis testing for deeply implausible situations such as the JREF tests.

A p-value usually gives an estimate of the probability of the result occurring by chance. This isn't what we're interested in - we want to know the chance the person has paranormal abilities. A p-value of 0.001 is not useful if someone is claiming an ability that you a priori consider much less likely than that.

I'd therefore naturally argue that you want to do a Bayesian model comparison. In practice I'd be prepared to admit that sufficiently strong tests are going to reach the same conclusion whichever approach you take.
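As a minimal sketch of what I mean, with every number below assumed purely for illustration:

```python
# Assumed numbers, purely illustrative: a prior belief in the ability, the
# test's false-positive rate (alpha), and its power if the ability is real.
prior = 1e-9      # prior probability the claimed ability exists
alpha = 0.001     # probability of passing by luck alone
power = 0.99      # probability of passing if the ability is real

bayes_factor = power / alpha                      # ~990-to-1 in favour of "real"
posterior_odds = (prior / (1 - prior)) * bayes_factor
posterior = posterior_odds / (1 + posterior_odds)
print(posterior)  # ~1e-6: even after a pass, luck remains by far the better bet
```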

However, I think that there's also some educational value in the fact that this approach should encourage applicants to make strong claims about their ability. If a dowser thinks they can perform right 70-80% of the time they should be encouraged to go for that and be tested on that, and if they don't want to then they can broaden their claim at the expense of having to work harder to demonstrate it by needing a larger sample size.

(It's also the sort of approach that is more likely to lead you to a correct conclusion when yet another homeopath claims p < 0.01 results or something, so I think it's considerably more useful when you're at risk of seeing publishing biases)
 
I disapprove of p-values, particularly when applied to hypothesis testing for deeply implausible situations such as the JREF tests.

A p-value usually gives an estimate of the probability of the result occurring by chance. This isn't what we're interested in - we want to know the chance the person has paranormal abilities. A p-value of 0.001 is not useful if someone is claiming an ability that you a priori consider much less likely than that.

I'd therefore naturally argue that you want to do a Bayesian model comparison. In practice I'd be prepared to admit that sufficiently strong tests are going to reach the same conclusion whichever approach you take.

snip...

While I agree in general principle, in this case a Bayesian model comparison is problematic precisely because JREF and challengers disagree on the model priors.

More to the point probably, JREF is pretty clear that this is not a scientific investigation to uncover the truth. It's

a) a chance for a challenger to prove JREF wrong (in which case a classical test is probably reasonable).

b) a publicity stunt...so the statistical stuff is just a safeguard against something going wrong accidentally.

In my one experience trying to help an applicant negotiate a protocol with JREF there was indeed an issue of a small effect size requiring a somewhat lengthy test. Basically, JREF was unwilling/unable to deal with it. This makes me suspect that item (b) is what governs. (Which I don't have a problem with.)
 
That depends on what the claimant's claim is. Most claimants claim a considerably higher success rate than they need to achieve to reach the sort of success criteria JREF usually set. For example dowsers usually expect to be able to tell the difference between a buried barrel of water and a buried barrel of sand every time, so the 70% or 80% success rate that's actually needed should be a doddle.

What needs to be remembered is that the applicants never actually do any better than chance. It's not that they do a little bit better, but not well enough to meet the JREF success criteria - their results are always well within that which would be expected by chance alone.

Hmm, I can't argue with that for higher hit rates, since what you're saying seems reasonable. The only problem I have with the challenge is marginal hit rates, since those would require a larger sample size than higher hit rates do.
 
A p-value usually gives an estimate of the probability of the result occurring by chance. This isn't what we're interested in - we want to know the chance the person has paranormal abilities. A p-value of 0.001 is not useful if someone is claiming an ability that you a priori consider much less likely than that.

P-values are actually quite useful. The p-value basically tells you how likely it is to get an observation as extreme or more extreme if the null-hypothesis is true. The p-value basically measures the evidence against the null-hypothesis. If the p-value is greater than the standard 0.05, then it can't be argued that the null-hypothesis should be rejected. If, on the other hand, it is less than 0.05, then it can be said that the null should be rejected. Keep in mind that the p-value tells you the probability of the result occurring by chance, not the probability of the alternative hypothesis. If P=0.05, then there is a 0.95 chance that the alternative is correct.

I'd therefore naturally argue that you want to do a Bayesian model comparison. In practice I'd be prepared to admit that sufficiently strong tests are going to reach the same conclusion whichever approach you take.

Bayesian Statistics is generally quite controversial in the statistical community. Stick with point estimates and confidence intervals.

However, I think that there's also some educational value in the fact that this approach should encourage applicants to make strong claims about their ability. If a dowser thinks they can perform right 70-80% of the time they should be encouraged to go for that and be tested on that, and if they don't want to then they can broaden their claim at the expense of having to work harder to demonstrate it by needing a larger sample size.

Agree.

(It's also the sort of approach that is more likely to lead you to a correct conclusion when yet another homeopath claims p < 0.01 results or something, so I think it's considerably more useful when you're at risk of seeing publishing biases)

Publication bias is one thing; multiple analyses are another.
 
P-values are actually quite useful. The p-value basically tells you how likely it is to get an observation as extreme or more extreme if the null-hypothesis is true.

... If P=0.05, then there is a 0.95 chance that the alternative is correct.

That last quoted sentence is not what a P-value means, which I imagine is why edd was suggesting a Bayesian analysis.
 
That last quoted sentence is not what a P-value means, which I imagine is why edd was suggesting a Bayesian analysis.

Why not? Aren't p-values and confidence intervals connected? P=0.05, hence you can be 95% confident that the observed result is due to the alternative hypothesis whereas there's a 5% chance that the observed result is a Type I Error.

Also, I don't agree with his Bayesian approach. Bayesian Statistics is quite controversial and problematic in the statistical community. That's why I said stick with point estimates and confidence intervals.
 
Why not? Aren't p-values and confidence intervals connected? P=0.05, hence you can be 95% confident that the observed result is due to the alternative hypothesis whereas there's a 5% chance that the observed result is a Type I Error.

Also, I don't agree with his Bayesian approach. Bayesian Statistics is quite controversial and problematic in the statistical community. That's why I said stick with point estimates and confidence intervals.

Sure, p-values and confidence intervals are connected. The right statement is that 95% of the time the confidence interval includes the true value of the parameter. The confidence interval is not a posterior distribution for the true value, although it may be approximately so... if you're a Bayesian.

Loosely speaking, the problem is that Bayes' law (nothing to do with being a Bayesian) requires paying attention to Type II error as well as Type I error.
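Writing that out explicitly (with alpha the Type I error rate and 1 − beta the power):

$$P(\text{real}\mid\text{pass}) = \frac{(1-\beta)\,P(\text{real})}{(1-\beta)\,P(\text{real}) + \alpha\,\bigl(1-P(\text{real})\bigr)}$$

With any remotely skeptical prior for P(real), the alpha term dominates the denominator, which is exactly why both error rates matter.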

And my reading is that Bayesian statistics is much less controversial than it once was, although there remain skeptics on both sides.

[Note to mods: I assume if this drifts too far you'll move it.]
 
While I agree in general principle, in this case a Bayesian model comparison is problematic precisely because JREF and challengers disagree on the model priors.
Absolutely agree (and with the stuff I've trimmed).
 
But in all those cases it would become clear very quickly that those individuals were doing considerably better than they would be expected to do if they were just swinging the bat/making moves/writing down figures at random.

I don't understand what your comment has to do with mine. I was responding to Beerina's claim that a million tests in a row need to succeed. Why should psychic abilities require 100% accuracy? If they exist, they likely operate the same way other human abilities do, subject to constraints, good days/bad days, and external stressors. The very best batters only hit about 10% of the pitches thrown their way. Why does Beerina think psychics could successfully perform a million tests in a row when no other human endeavor can?
 
I don't understand what your comment has to do with mine. I was responding to Beerina's claim that a million tests in a row need to succeed. Why should psychic abilities require 100% accuracy? If they exist, they likely operate the same way other human abilities do, subject to constraints, good days/bad days, and external stressors. The very best batters only hit about 10% of the pitches thrown their way. Why does Beerina think psychics could successfully perform a million tests in a row when no other human endeavor can?

First, I think Beerina was speaking metaphorically.

Second, without sufficient data I would try to refrain from speculating about what psychic abilities - should they exist - can and cannot do, how they are influenced, etc.

Third, picking baseball hitters is a clever ploy, because in baseball success for a hitter is (roughly) defined by a .300 batting average. One could as easily have chosen baseball pitchers, or better yet relievers, and see the success rate jump significantly. But that would have weakened one's argument, would it not?

Conclusion: What people like Beerina, Pixel42 and myself are trying to convey is that, e.g., a spoonbender sitting in a comfortable kitchen should have a blow-us-all-away success rate, making it obvious that something "paranormal" or "supernatural" is going on.
Under controlled conditions absolutely eliminating manipulation from both sides, such a success rate would be a one-in-a-million event by chance.

Furthermore, that would be a noodle-scratcher for both sides, would it not?
 
I don't understand what your comment has to do with mine.
I was just pointing out that even if we concede your point that we shouldn't expect these abilities to be any more consistent than those of talented batsmen, chess players etc, we would still expect that they would (as with such abilities) produce results that are significantly better than random chance. And they don't.
 
I was just pointing out that even if we concede your point that we shouldn't expect these abilities to be any more consistent than those of talented batsmen, chess players etc, we would still expect that they would (as with such abilities) produce results that are significantly better than random chance. And they don't.

That's why in statistics we fix the Type I error probability before doing a one- or two-tailed test. Since the Type I error rate for the preliminary is 0.001, we would expect on average one in a thousand applicants to pass by dumb luck. If the observed pass rate were significantly better than one in a thousand, we could take those results as evidence for the paranormal. This can be determined by calculating the p-value of the observed number of passes against the expected chance rate.
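A sketch of that last step, with hypothetical pass counts and assuming scipy's binomtest:

```python
# Hypothetical counts: did significantly more applicants pass the
# preliminary than the 1-in-1000 luck rate predicts?
from scipy.stats import binomtest
result = binomtest(k=3, n=1000, p=0.001, alternative='greater')
print(result.pvalue)   # ~0.08: even 3 passes in 1,000 attempts is weak evidence
```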

Unless the JREF decided to combine the p-values, the overall Type I error probability of the claimant passing both tests is a billion to one (0.001 for the preliminary times 0.000001 for the formal).

Expecting an exact 100% or near-100% replication is ridiculous and extremely conservative. Telling a psychic to pass 100 tests in a row is like telling famous basketball player Brian to never miss a basket.
 
Second, without sufficient data I would try to refrain from speculating about what psychic abilities - should they exist - can and cannot do, how they are influenced, etc.

Why? Any human ability should fall within normal parameters compared to other human abilities. Anyone can play piano after a few lessons, but only some people will reach virtuoso level after many years of study and practice.

Third, picking baseball hitters is a clever ploy, because in baseball success for a hitter is (roughly) defined by a .300 batting average. One could as easily have chosen baseball pitchers, or better yet relievers, and see the success rate jump significantly. But that would have weakened one's argument, would it not?

An exceptional baseball pitcher may be defined as one who pitches a no-hitter. There have only been 236 no-hitters in the past 111 years, so the success rate does not exactly jump significantly.

And no, despite your bizarre claim about my presumed motive, choosing pitchers or any other skilled human would not weaken my argument. Education and practice are the keys to acquiring skill in any field. If psychic skills exist, why should they be any different? Because you say so?
 