So, are people in favor of seeing actual data, in a summarized format, from the skeptical organizations that do such tests?
No, because the inevitable (over)simplifications necessary to put the test data into summarized format will result in the summaries being useless and actively misleading.
DrKitten is quite right: the attempt to combine such disparate studies is likely to be worse than useless.
Why though?
If you have 20 dowsing experiments done similarly (choosing which cup the gold is under, etc.), it seems very reasonable to combine the results. It seems to work in every other field; why not with skeptical organizations?
Because we don't have 20 dowsing experiments done similarly. We have one dowsing experiment finding gold under a cup, one dowsing experiment finding addresses with a pendulum, one telepathy experiment sending thoughts to another person, one martial arts experiment attempting to stop an attacker without touching him... How do you combine those results?

In addition, we may have one tested against a claim of 100% accuracy, another tested against a claim of 90%, another against a claim of 60%... Each of these may (depending on the deal agreed to by both parties) result in a different cutoff level, and those cutoffs cannot be combined in a meaningful manner.
By the way, can you tell me what element all of those tests had in common?
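To make the comparability problem concrete, here is a toy Python sketch (all the scores, trial counts, and chance baselines below are invented for illustration): raw hit rates from differently designed tests can't simply be pooled, because each test has its own chance baseline.

    # One-sided binomial p-value: P(X >= hits) under each test's own null.
    from math import comb

    def p_value(hits, trials, chance):
        return sum(comb(trials, j) * chance**j * (1 - chance)**(trials - j)
                   for j in range(hits, trials + 1))

    tests = [
        # (description,                      hits, trials, chance baseline)
        ("gold under 1 of 10 cups",             4, 10, 0.10),
        ("pendulum pick of 1 in 5 addresses",   3, 10, 0.20),
        ("telepathy, binary send/receive",      7, 10, 0.50),
    ]

    for name, hits, n, chance in tests:
        print(f"{name}: {hits}/{n} hits ({hits / n:.0%}), "
              f"p = {p_value(hits, n, chance):.3f}")

Here the 40% scorer (p = 0.013) shows far stronger evidence than the 70% scorer (p = 0.172), so averaging hit rates across the tests would be meaningless, and that is before the different negotiated cutoffs even enter into it.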
What a claimant believes about their performance doesn't interest me, but how they actually perform.

What the claimant claims has a direct bearing on the test; it may mean that one test ends as a failure with results that would amount to only a small fraction of the required attempts in another. Thus what a claimant believes about their performance has a direct bearing on how they will be allowed to perform in the test. (That is, a person claiming 90% accuracy, who agrees to a 20-trial test, will fail a preliminary even if they score slightly above chance. Suppose they score at a rate that would be statistically significant if maintained over 100 trials; the problem is, their test ended after 20 trials. It is impossible to know whether they would have continued at that rate, or regressed to the mean.)
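Here is the arithmetic behind that parenthetical, as a small Python sketch (the 13/20 score is my own illustrative stand-in for "slightly above chance"; the 90% claim and the 20- and 100-trial figures are from the example above):

    # Exact one-sided binomial p-values for the 90%-claim example.
    from math import comb

    def p_at_least(k, n, p):
        """P(X >= k) for X ~ Binomial(n, p)."""
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    def p_at_most(k, n, p):
        return 1 - p_at_least(k + 1, n, p)

    # 13/20 (65%) is "slightly above chance" but nowhere near significant...
    print(f"P(>=13/20  | chance):    {p_at_least(13, 20, 0.5):.3f}")   # ~0.132
    # ...yet the same 65% rate sustained over 100 trials would give:
    print(f"P(>=65/100 | chance):    {p_at_least(65, 100, 0.5):.4f}")  # ~0.0018
    # And measured against the agreed 90% claim, 13/20 is a decisive failure:
    print(f"P(<=13/20  | 90% claim): {p_at_most(13, 20, 0.9):.4f}")    # ~0.0024

So the same 20-trial score is simultaneously a clear failure of the negotiated claim and too short a run to say anything against chance.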
I'm not interested in what a person believes about their performance (they could be mistaken) but in how they actually perform, much like I'm not interested in what a doctor thinks about a drug, but in how the drug actually performs.

You missed my point.
They served their purpose. They were not designed to serve yours.
Glad it was never claimed they were designed to serve my purposes...

Oh, heavens, let's not ever make claims... glad it was never claimed that they were designed to serve your purposes. Rather, you asked about combining data, and I did my best to help you understand why it can't be done meaningfully. That's all. No "claims" were made, so you can be safe.
In any case, in these tests one compares what one expects to what the claimant actually does, and then measures the difference numerically to see whether it is significantly far away.

No. We compare what the actual claim is to what the claimant actually does. We do not compare it to what we would expect to see by chance. There is a world of difference. We could, very easily, do the latter; the former has simply turned out to be considerably easier and quicker to do.
Again, the issue of combining experiments aside, wouldn't it still be nice to see a list of such statistical results from the preliminary experiments, all in one place, from the various skeptical organizations, available to all interested parties over the internet, without having to fly to each organization to read through papers? Even something absurdly simple: how many of the preliminary tests were on dowsers? Of those, how many scored higher than one would expect? Etc. Basic information interested parties would hope to find.

Perhaps. It would certainly be helpful to classes like mine. Even more helpful would be access of this sort to the raw data from the parapsychologists' labs. Doesn't Schwartz have some? (Maybe my memory is playing tricks.) I know the most recent Bem precognition data would be a great set to run a time-series analysis on, to see whether inadequate randomization predicts performance through a classical-conditioning mechanism. If I am not mistaken, that database would be significantly larger and better controlled, since (in theory) those tests are run not against claims but against chance. Of course, I would also like to see such experiments video-recorded (no, I am not holding them to a higher standard than, say, psych experiments; I would like a video archive for psych experiments as well) so that experimental methodology can be examined in a bit more detail than an article's methods section can manage.
Let us take the extreme example in which claimants say they have complete control and will always be able to determine the coin's face. We can test this very easily: just start flipping. There is a .5 probability that, by chance alone, any given person will fail after one toss. But that person can stop then; the trial is over. If, on the other hand, the person got the first one right, there is again a .5 probability of failure on the next toss (again, by chance alone). With enough claimants, we may have some people who get 5, or 10, or more coins called correctly before making a mistake, all by chance alone. (Of course, if they *can* influence the outcome perfectly, they will never make the mistake. And yes, I recall that I am taking the extreme 100% position here, but it extrapolates to lesser claims.)
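A quick Monte Carlo sketch of that scenario in Python (the pool of 1,000 claimants and the seed are arbitrary assumptions for illustration):

    # Each "claimant" calls coin flips until the first miss, with no
    # ability at all: every call is correct with probability 1/2.
    import random

    random.seed(1)
    CLAIMANTS = 1000

    def streak():
        n = 0
        while random.random() < 0.5:  # correct call, by chance alone
            n += 1
        return n

    runs = [streak() for _ in range(CLAIMANTS)]
    for k in (5, 10):
        hit = sum(r >= k for r in runs)
        print(f"{hit} of {CLAIMANTS} called at least {k} in a row "
              f"(expected ~{CLAIMANTS / 2**k:.0f})")
    print("longest streak by pure chance:", max(runs))

With 1,000 chance-level claimants you expect about 31 to call five in a row and about one to call ten in a row, which is exactly why a few impressive streaks in a large pool prove nothing.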
In this scenario, in order to do the test one assumes that the person making the claim is truthful about their claimed abilities. Unfortunately, that is the very thing one is trying to ascertain by the test in the first place. Perhaps they really only perform at the 90% level, or at the 70% level, or some other level.
It doesn't make much sense to say 'OK, since you're saying you perform at the K% level, we'll test you at that level, and if you don't perform at it, you're wrong.' It makes sense to say 'We know that with regular coins we'd expect you to perform at the 50% level, and if you don't perform significantly away from this, you're wrong.'
Doing so makes the test longer, harder, and more expensive by requiring more trials.
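To put rough numbers on "longer, harder, and more expensive", here is a minimal Python sketch using the standard normal-approximation sample-size formula for a one-sided binomial test (the alpha = 0.001 and power = 0.99 levels are my own illustrative assumptions, not any organization's actual protocol):

    # Approximate trials needed to separate a claimed hit rate p1 from
    # the chance rate p0, using the two-proportion normal approximation.
    from math import sqrt, ceil
    from statistics import NormalDist

    def trials_needed(p1, p0=0.5, alpha=0.001, power=0.99):
        z_a = NormalDist().inv_cdf(1 - alpha)   # false-positive control
        z_b = NormalDist().inv_cdf(power)       # chance of passing if genuine
        n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1)))
             / (p1 - p0)) ** 2
        return ceil(n)

    for claim in (0.99, 0.90, 0.60, 0.55):
        print(f"claimed rate {claim:.0%}: ~{trials_needed(claim)} trials")

Under those assumptions a 99% claim settles in about 14 trials and a 90% claim in about 32, while a bare 55%-versus-chance deviation takes roughly 3,000; that is why testing against the stated claim, rather than against chance alone, is the quicker design.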