
JREF Challenge Statistics

Indeed. But if you request something that will cost JREF money, the very least you could do is to support JREF with money.

If not, you are insisting that others pay for what you want.

I'm merely raising the idea.

Speculating about possible costs is rather moot. You have no idea how much it would cost, and neither do I.
 

I have about 20 stat books that say that one doesn't calculate alpha, one sets it before the experiment. One calculates a p-value, but not alpha.

Case in point, the experiment with Mike cited earlier. If Mike were required to get 20/20 correct to pass the preliminary test, the alpha cutoff would not be 0.001, but 0.0000000001, and we have no knowledge about whether Mike is "typical."

If we had all the numerical details of this preliminary test that is statistical in nature, we wouldn't have to speculate.
 
I think it's an interesting idea and I'd support the notion of publishing the actual results from every test, but it certainly would require a lot of work to go back through years of tests - work and money.
I've noticed a few times where Randi has referred to past tests and how the records have not been kept, so I suspect it would not be easy to compile a representative sample. You might find that only the really strange results (obviously not strange enough to pass...) were kept and not the ordinary, chance-level results.
I'd suggest that it would be less work to just start recording the results now for all future tests. If someone is keen and wants to go through the past 2-3 years' worth of tests and take that as a starting set of data, it might also be worthwhile. That, of course, also takes money or a suitably qualified volunteer.

How difficult would it be to start recording this type of information from now on?
Analysis will either show something interesting or it will confirm that there is nothing strange about the test results. Either way, it seems useful.
 
Combining many short runs is not the same as taking one long run--Rhine's lab made that mistake, and (at least in his casual talks) so does Sheldrake.

Dumb question, but: why is it not the same?
 
I have about 20 stat books that say that one doesn't calculate alpha, one sets it before the experiment. One calculates a p-value, but not alpha.
If you are designing an experiment specifically for statistical analysis, you should base it on a pre-determined alpha.

If you have already performed an experiment without bothering to think about this ahead of time, you can calculate exactly what the alpha was for that experiment. It doesn't depend on the results, only on the structure of the experiment - the probability of individual success, the number of repetitions, and the pass criterion.
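
A minimal sketch of that calculation in Python, assuming a simple binomial-style test; the numbers (10 yes/no trials, pass at 8 or more correct) are purely illustrative, not from any actual protocol:

```python
# Minimal sketch: alpha for a binomial-style test, determined entirely by
# the structure of the experiment (n trials, chance probability p per
# trial, pass threshold k). The example numbers are hypothetical.
from math import comb

def alpha(n: int, p: float, k: int) -> float:
    """P(at least k successes in n trials by pure chance)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# e.g. 10 yes/no trials, pass at 8 or more correct:
print(alpha(10, 0.5, 8))   # ~0.055 -- far more lenient than 0.001
```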
 
I have about 20 stat books that say that one doesn't calculate alpha, one sets it before the experiment. One calculates a p-value, but not alpha.

Then you didn't read the textbooks properly. I just calculated the alpha cutoff for Mike's experiment above, to the degree that it was possible with the information given.
 
Dumb question, but: why is it not the same?
One reason: you can think of it as the cumulative errors of each small test adding up.
It's a bit like measuring something really big with a small ruler. If your small ruler is off by a little (say one mm), the repeated measurements will add up to a larger total error than a single long measuring tape would give, even if the tape's own error is slightly larger - especially once you include the fact that you will slip up a little every time you take a measurement.

There are some other reasons why it is not the same, but they are sort of similar in concept.
 
I have about 20 stat books that say that one doesn't calculate alpha, one sets it before the experiment. One calculates a p-value, but not alpha.



If we had all the numerical details of this preliminary test that is statistical in nature, we wouldn't have to speculate.

In practice with these oddball tests I think it's a combination of both. With the CSICOP test of Natasha Demkina, for example, they started by finding six volunteers who had missing and/or extra body parts, then worked out the alpha for 4 of 7 and 5 of 7, and set the bar at 5 of 7.
 
How difficult would it be to start recording this type of information from now on?

Already being done -- check the "Challenge Applications" thread.

Analysis will either show something interesting or it will confirm that there is nothing strange about the test results. Either way, it seems useful.[...]

To our great-great-great-grandchildren, perhaps.

The current rate of testing is fewer than five preliminary tests per year. Assuming first that all tests actually achieve the maximally permissive 0.001 nominal alpha cutoff, second, that this rate continues, and third that the data storage lasts long enough, we would expect "by chance" to see one success at the preliminary test sometime in the late 22nd or early 23rd century, although it wouldn't be surprising if we didn't see a success "by chance" for several centuries after that. If the JREF lasts that long, it will a) be extremely surprising, and b) provide excellent material for a senior thesis for a statistics major at Starfleet Academy.
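
For what it's worth, here is the back-of-the-envelope arithmetic behind that estimate, taking the post's own figures at face value (roughly 5 preliminary tests per year, each with at most a 0.001 chance of a fluke pass):

```python
# Back-of-the-envelope arithmetic for the estimate above, using the
# post's own figures: ~5 preliminary tests per year, each with (at most)
# a 0.001 chance of a fluke pass. These are assumptions, not JREF data.
alpha, tests_per_year = 0.001, 5

def p_at_least_one_fluke(years: float) -> float:
    """Chance of one or more lucky passes within the given horizon."""
    return 1 - (1 - alpha) ** (tests_per_year * years)

print(1 / (alpha * tests_per_year))        # ~200 years until one fluke is "expected"
for years in (50, 100, 200, 500, 1000):
    print(years, round(p_at_least_one_fluke(years), 3))
```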
 
In practice with these oddball tests I think it's a combination of both. With the CSICOP test of Natasha Demkina, for example, they started by finding six volunteers who had missing and/or extra body parts, then worked out the alpha for 4 of 7 and 5 of 7, and set the bar at 5 of 7.
Actually, it was based on a decision that to get acceptable levels of alpha-risk (probability of a false positive), you needed 5 correct out of 7. Whatever the probability of that is. I can't be bothered to calculate, but I assume the probability of having a false positive is over 1 in 1000 for this.
 
Already being done -- check the "Challenge Applications" thread.



To our great-great-great-grandchildren, perhaps.

Actually, I don't care so much about someone winning by chance. I believe it will never happen - unless they also cheat.

However, useful data could be obtained with something like 20 tests. This would tell us if the results overall are statistically significant - which would indicate either cheating on the part of the applicants, or the Evil Randi-rays if the results are abnormally low. Either one would be nice to know.
 
Actually, it was based on a decision that to get acceptable levels of alpha-risk (probability of a false positive), you needed 5 correct out of 7. Whatever the probability of that is. I can't be bothered to calculate, but I assume the probability of having a false positive is over 1 in 1000 for this.

Actually (my tables aren't that fine-grained, and I can't be bothered to do the calculations), it looks like getting 5/7 correct with an a priori probability of getting an individual trial right of 1/7 is about right on the money for an alpha cutoff of 0.001.
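
A quick check of that figure, assuming each of the 7 matches is an independent 1-in-7 guess (the simplest reading of the protocol described above):

```python
# Quick check, assuming each of the 7 matches is an independent 1-in-7
# guess. This is the simplest reading of the protocol, not a claim about
# the exact CSICOP design.
from math import comb

def p_at_least(k: int, n: int, p: float) -> float:
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(p_at_least(5, 7, 1/7))   # ~0.00097 -- just under the 0.001 cutoff
print(p_at_least(4, 7, 1/7))   # ~0.010   -- why 4 of 7 would have been too lenient
```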
 
The fact remains that claimants have an opportunity to learn from other claimants' tests, or even their own previous tests. The tests are independent only if the only factor in the outcomes is chance. Clearly chance is a big part of most tests, but the fact is that most claimants are NOT using random choice as their strategy.

How do you know this?

One obvious example would be a person who looks at a test by a previous claimant and sees a way to cheat not anticipated by the JREF. This might inspire the new claimant to practice and then apply with the same protocol as the previous claimant.

Can you give an example?




I'm merely raising the idea.

No, you are not merely raising the idea. You have put enough effort into this to set up a webpage about it.

Speculating about possible costs is rather moot. You have no idea how much it would cost, and neither do I.

It would definitely be outside the current budget. You want something, but refuse to pay for it. When you don't get what you want, you can continue to criticize JREF.

Which, I believe, is the sole reason for this thread and your page.
 
However, useful data could be obtained with something like 20 tests.

Huh? The expected number of successes in 20 tests with alpha cutoff of 0.001 is 0.020 -- one fiftieth of a success.

In practical terms, this means no successes are expected in the next twenty tests.

How would you detect the "Evil Randi-rays" lowering the number of expected successes below zero?
 
Can you give an example?

Be serious. Nadia what's her name got caught peering around a blindfold. Now I know that when I apply to take the test, I can't use the standard peering around a blindfold test, so I'll have to use something else like the concealed earbug trick.

Uri got caught using sleight of hand on cameras. Therefore I need to make sure that whatever I do will not show up on film.
 
Huh? The expected number of successes in 20 tests with alpha cutoff of 0.001 is 0.020 -- one fiftieth of a success.

In practical terms, this means no successes are expected in the next twenty tests.

How would you detect the "Evil Randi-rays" lowering the number of expected successes below zero?

You need to actually use the results of the tests. If you know that one applicant got 6 correct out of 10, it tells you a lot more than just knowing that they failed to get 8 out of 10 (or whatever the agreed performance should have been).

Finding out that, in total, applicants are significantly more successful than chance would be its own form of test. Of course it could just indicate that there is a lot of cheating in these tests...

ETA: this boils down to the difference between using discrete data vs. continuous data for analysis. Just counting pass/fail means you need a huge amount of trials to get useful results.
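
A rough simulation of that difference, with made-up numbers: 20 applicants, 10 yes/no trials each, and a true hit rate of 0.6 standing in for a modest real effect (or low-level cheating). None of these figures come from actual challenge tests.

```python
# Made-up numbers: 20 applicants, 10 yes/no trials each, true hit rate 0.6
# instead of the chance rate 0.5. Counting only passes at 8-of-10 detects
# this much less often than pooling the raw hit counts does.
import random
from math import comb

def p_at_least(k: int, n: int, p: float) -> float:
    """P(at least k successes in n trials with per-trial probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

random.seed(1)
applicants, trials, hit_rate, pass_mark, runs = 20, 10, 0.6, 8, 2000
chance_pass = p_at_least(pass_mark, trials, 0.5)   # ~0.055 per applicant

flagged_by_passes = flagged_by_pooling = 0
for _ in range(runs):
    scores = [sum(random.random() < hit_rate for _ in range(trials))
              for _ in range(applicants)]
    n_passes = sum(s >= pass_mark for s in scores)
    if p_at_least(n_passes, applicants, chance_pass) < 0.05:
        flagged_by_passes += 1
    if p_at_least(sum(scores), applicants * trials, 0.5) < 0.05:
        flagged_by_pooling += 1

# Pooling the raw scores flags the effect much more often than pass/fail counting.
print(flagged_by_passes / runs, flagged_by_pooling / runs)
```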
 
How do I know what? As for learning, a great deal of information about the tests is made public. I think we can assume that many claimants have privately practiced tests that have been done in the past. I also assume that this has discouraged many claims. Seems like there used to be more.

As for not using random chance as a strategy, there are basically two sorts of claimants: those who are trying to cheat, and those who think they have paranormal power. Clearly cheaters are not using guessing as a strategy.

Those who think they have powers are using a variety of strategies, some wholly internal (I got a feeling...) and some external. The JREF strives to eliminate any correlation between external clues and the outcomes, but it's clear that the guessing itself is NOT random in most cases. It's based on psychological factors that are hard to pin down.

If I could give an example of an existing protocol that I knew how to cheat, I might decide to earn a million dollars BEFORE I made the flaw public. That was a hypothetical case to show that claimants do in fact vary their strategies based on the experiences of previous claimants. Apparently some of the recent dowsing tests involved someone trying to cheat, but they were caught. Over and over again we hear claimants say "I know now why I failed, I'm going to practice so I can get it right next time."
 
I personally would like to know how much cheating actually goes on in these tests. If we find that there is no statistically significant deviation from chance in the overall results of all applicants, then we can probably say we're doing a good job of stopping the cheaters.

Results like Natasha's show us that cheating is still slipping through (I suspect), so I'd like to know how bad the problem is.

This kind of analysis is a good way to tell.
 
Dumb question, but: why is it not the same?
Very good question, actually, given that it has led to problems in the parapsychology research field.

Short answer for now, because I have a class to get to:

Flipping 10 coins, you expect 5 H and 5 T. But with only 10 coins, 6, 7, or 8 H is not rare at all, and 9 or 10, although rare, is certainly something you might find if you spent just one day flipping coins.

Flipping 100 coins, you expect 50 H... but it is much more difficult to get the same percentage of H as in the smaller sample. 60 H you might find, but 70 is already very rare, and 80 you probably won't find in several days' attempts at flipping 100 coins in a row. 90 or 100 you will, for all practical purposes, never see. (Basically, with just 10 flips, you only need to be off by 4 from the expected count to reach 90%; if you flip 100, you need to be off by 40 flips to reach the same percentage. A much more difficult task.)
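
For scale, here are the exact tail probabilities for those two sample sizes (fair coins, counting k or more heads):

```python
# Exact chance of at least k heads in n fair coin flips, for the two
# sample sizes discussed above.
from math import comb

def p_at_least(k: int, n: int) -> float:
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

for k in (6, 7, 8, 9, 10):
    print(f"P(>= {k:3d} heads of 10)  = {p_at_least(k, 10):.4g}")
for k in (60, 70, 80, 90):
    print(f"P(>= {k:3d} heads of 100) = {p_at_least(k, 100):.3g}")
```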

The same discrepancy exists between any two sample sizes. If we are trying to demonstrate our ability based on flipping 100 coins, we cannot compare our result simply to .5, or even to the distribution arrived at by flipping 1000 coins a sufficient number of times to generate an empirical sampling distribution (or by mathematically deriving the same sampling distribution, using s/√N). Rhine's lab originally allowed subjects to end trials when they chose; by always ending after a run of successes, the accumulated data could (if compared to the large-N null probability of the accumulated scores) achieve statistical significance.

Bottom line--compare small runs to small runs, large runs to large runs. A bunch of small runs put together must be compared to the more varied distribution that such a collection actually produces, not to the null distribution for one long fixed-length run.
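
A small simulation of that pitfall, with the stopping rule simplified (my assumption, not the historical protocol) to "each subject quits as soon as their accumulated record clears the nominal one-sided 0.05 line":

```python
# 1000 simulated subjects guess at pure chance (p = 0.5) for up to 1000
# trials each, and each one stops the moment their accumulated record
# looks "significant" against the fixed-N null. The stopping rule is a
# simplification chosen to make the bias easy to see.
import random

random.seed(3)

def looks_significant(hits: int, n: int, z_crit: float = 1.645) -> bool:
    """One-sided normal-approximation test of the running record vs. chance."""
    return n >= 20 and (hits - 0.5 * n) / (0.25 * n) ** 0.5 > z_crit

subjects, max_trials, lucky = 1000, 1000, 0
for _ in range(subjects):
    hits = 0
    for n in range(1, max_trials + 1):
        hits += random.random() < 0.5
        if looks_significant(hits, n):    # quit while ahead
            lucky += 1
            break

print(lucky / subjects)   # far above the nominal 0.05 false-positive rate
```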
 
How do I know what?

That most claimants are NOT using random choice as their strategy.

As for learning, a great deal of information about the tests is made public. I think we can assume that many claimants have privately practiced tests that have been done in the past. I also assume that this has discouraged many claims. Seems like there used to be more.

As for not using random chance as a strategy, there are basically two sorts of claimants: those who are trying to cheat, and those who think they have paranormal power. Clearly cheaters are not using guessing as a strategy.

Those who think they have powers are using a variety of strategies, some wholly internal (I got a feeling...) and some external. The JREF strives to eliminate any correlation between external clues and the outcomes, but it's clear that the guessing itself is NOT random in most cases. It's based on psychological factors that are hard to pin down.

Neither group is using guessing as a strategy.

If I could give an example of an existing protocol that I knew how to cheat, I might decide to earn a million dollars BEFORE I made the flaw public. That was a hypothetical case to show that claimants do in fact vary their strategies based on the experiences of previous claimants. Apparently some of the recent dowsing tests involved someone trying to cheat, but they were caught. Over and over again we hear claimants say "I know now why I failed, I'm going to practice so I can get it right next time."

No examples, then.
 
