
How to manipulate data, get results as good as PEAR, and be $1,000,000 richer

Baron Samedi

Since the news of PEAR shutting down came out, I've been reading over some of the news reports and quotes given by the key players. One thing that struck me was their pure dogmatic belief that they have proof beyond a shadow of a doubt.

For example, in their test with a 50/50 outcome, they claim that they have statistically significant results showing that ESP exists. The overall effect is only 0.5003 vs. 0.5000, but they have the statistics to show that this difference is not due to random chance alone. The first question that came to mind was exactly how many trials would need to be run to get "statistically significant results" at all. Time to break out my second-year stats course notes:

To test P_observed vs. 0.5, use the formula
(P_observed - 0.5)/sqrt(0.5*0.5/n) > Z
where Z is the critical value for alpha (Type I error) on a one-sided test. Knowing this, we can solve for the minimum value of n such that the inequality holds, assuming that P_observed = 0.5003 as they stated. Since we don't know what alpha level they used, I'll pick some common ones:

alpha = 0.1; z=1.28; n=45,622
alpha = 0.05; z=1.645; n=75,154
alpha = 0.01; z=2.326; n=150,330
alpha = 0.001; z=3.09; n=265,268
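
Here's a rough Python sketch of that calculation, for anyone who wants to check it (my own quick check, not anything from PEAR). Note that a difference of 0.003 from chance is what reproduces the numbers in the table, give or take rounding of the z values; with the 0.0003 difference quoted above, each n comes out roughly 100 times larger.

```python
# A quick sketch of the minimum-n calculation (my own check, in Python).
# Using a difference of 0.003 from chance, which roughly reproduces the
# table above; with a difference of 0.0003, every n is about 100x larger.
from math import ceil, sqrt

diff = 0.003                                  # assumed P_observed - 0.5
critical = {0.1: 1.2816, 0.05: 1.6449, 0.01: 2.3263, 0.001: 3.0902}

for alpha, z in critical.items():
    # Solve diff / sqrt(0.5 * 0.5 / n) > z  for n
    n = (z * sqrt(0.5 * 0.5) / diff) ** 2
    print(f"alpha = {alpha}: n > {ceil(n):,}")
```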

So if I ever want to replicate and test their results myself, I'm going to have to run roughly 50,000 trials before I start to see some kind of statistical evidence above chance? Imagine trying for the million dollar prize with numbers like that. It's impossible. This is why the people at PEAR will not touch Randi's prize. If their results are indeed true, and Randi runs a tight test, then the preliminary test alone would need about 250,000 trials of guessing "Good" or "Bad". I don't think we could find anyone to volunteer a year's worth of their life full time sitting in on this exam.

Someone raised an interesting point. What if the claimant says that their psychic ability lets them go from a 50% chance of being right to a 60% chance of being right? How many tests do you need to run to prove to Randi that the ability exists? 6 out of 10 right is laughable. 30 out of 50? Still not enough. 60 out of 100? We're getting there. 150 out of 250? You may start to make believers out of us. But as you can see, people complain that Randi keeps raising the bar: if you pass 50, he'll ask for 100; if you pass 100, he'll ask for 250. They accuse Randi of cheating. This is wrong.

The answer to this lies in the power calculation. This is a statistical calculation that asks: if the person truly has the ability to be right 60% of the time, how many trials must they do so that they pass our test 80% of the time? People have good days and bad days, so we want to make sure that even on a bad day, when the person is only running at 58%, they will still pass.

So for the power calculation, we have:
Random: 50%
Claimant: 60%
Randi's alpha level: 0.001
Now we solve for n. After struggling for a bit and running some simulations, I get a figure of 385 trials needed. With that calculation done, the scenario is set. If the claimant says their ability is 60%, then 385 trials will be done, no more, no less. This may sound like overkill to the claimant ("Ah, but I tried it 50 times at home! Why do I need to do it 385 times?"), but that's the way it needs to be done.
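
For anyone who wants to verify the 385, here's a rough sketch of the standard normal-approximation sample-size formula (my own reconstruction; not the simulation I actually ran):

```python
# A rough check of the power calculation using the usual
# normal-approximation sample-size formula for a one-sided binomial test.
from math import ceil, sqrt

p0, p1 = 0.50, 0.60        # chance vs. the claimed 60% ability
z_alpha = 3.0902            # one-sided critical value for alpha = 0.001
z_beta = 0.8416             # z for 80% power

n = ((z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1)))
     / (p1 - p0)) ** 2
print(ceil(n))              # roughly 384-385 trials
```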

Here's the key thing to notice, and where jury-rigging the numbers comes in. I said that 385 tests, no more, no less, have to be done. In practice, how often is this really done? Let's say that in one test case, the person is successful 30 out of 40 times. The p-value in this case is indeed less than 0.001. Should you stop, or should you go on? According to my rules, you have to continue. Most people, though, will say that the sitter has proven themselves, so we might as well quit. Why go on and waste everybody's time if we've "proven the ability" after only 40 tests? This reasoning, while it looks harmless and merely seems to simplify the process, is the fatal flaw in these studies.

An alpha level is the chance of a Type I error: the probability of saying that some effect exists when truly there is nothing and the result was due to randomness alone. Therefore, if I run 10,000 trials using an alpha level of 0.001, I should see roughly 10 of them give positive results. Running off of randomly simulated data, this matches. Sometimes I get 8 successes, sometimes 16, but usually it's around 10.

Now I tried running simulations to show what usually happens due to human intervention. For each person sitting to test for psychic abilities, I started off with 30 guesses. If the current p-value < 0.001, I stop and call the person potentially gifted. Otherwise, I'll try one more coin flip. Is the person's hit rate high enough now? If not, try one more flip. If I reach 385 flips and the person still hasn't shown powers beyond a shadow of a doubt, I finally call it a day.
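
For the curious, here's a rough sketch of the kind of simulation I mean (a reconstruction in Python, not the exact code I ran):

```python
# A sketch of the optional-stopping simulation: peek at the z-score after
# every guess from 30 onward and stop as soon as it crosses 3.09.
import random
from math import sqrt

Z_CUT = 3.09                 # one-sided critical value for alpha = 0.001
MIN_N, MAX_N = 30, 385

def one_sitter(rng):
    hits = 0
    for n in range(1, MAX_N + 1):
        hits += rng.random() < 0.5      # a pure 50/50 guess
        if n >= MIN_N:
            z = (hits / n - 0.5) / sqrt(0.25 / n)
            if z > Z_CUT:
                return True             # stop early: "potentially gifted"
    return False                        # all 385 guesses, nothing special

rng = random.Random(2007)
sims = 10_000
psychics = sum(one_sitter(rng) for _ in range(sims))
print(f"{psychics} 'psychics' out of {sims}")   # ~100+, not the ~10 expected
```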

In each individual case, I have a p-value < 0.001 for all potentially psychic people. If this effect is due to random noise, again we should see only about 10 potential hits. In my simulated data, I am now getting, on average, 116 potential psychics. Remember, this data is pure random garbage. Now it's time to start spinning the results and throwing in statistical jargon:

10,000 trials were tried at an alpha level of 0.001. Assuming a typical binomial distribution for the trials, one would expect a mean of 10 trial successes. Of all trials, 116 showed a success. A typical hypothesis test for these levels shows a z-score of 47.46, which corresponds to an infinitesimally small probability (p << 0.0001). This is clear proof that an overall effect is harboured in human potential. Overall, the hit rate in the entire population is 0.5014 which, albeit only slightly above chance, can be explained by performance fatigue, negative ESP skills (abnormally incorrect reading ability), and general skepticism in the total population.

Using this same kind of faulty reasoning, I'm fairly confident that I can produce just as significant results with only 200 sittings. With 200 people, and this lousy procedure, I should be able to get 3 "psychic" people and show that 3/200 is absolute proof of ESP.

So now that I've blown my chance for the $1,000,000, do any of you know of a good "Woo" journal in case I want to pull off another Sokal scam? :D
 
PEAR got good results? ;)


Actually, this is very good data! Thanks.
 
I don't think we could find anyone to volunteer a year's worth of their life full time sitting in on this exam.

Yes, we could.

Think about it. These people have spent years - decades - of their lives doing these tests. They claim that they do have positive evidence.

So, yes, we could actually find someone to volunteer a year's worth of their life full time sitting in on this exam.

Provided they really believed that they would get a positive result.

But they don't volunteer. Because they know that they don't have this evidence.

These tests cannot have taken all that much time. Maybe a couple of hours each day? Even if 8 hours was spent on nothing, the rest of the time would be spent on...whatever you want.

It's the paranormal gravy train.
 

Ah, I should have stated more clearly... When I asked if we could find someone to sit in for a year, I should have said, "I don't think the JREF could find someone to waste a year of their life volunteering to oversee some woo perform this kind of a test." If I'm the one who's going to win (or try to scam) the $1,000,000, sure I'll do it. But the JREF people have lives to lead. A year for a preliminary test? And no one pays the referees for any of their time? And to sit through 250,000 dice throws and be totally aware and looking out for any kind of cheating? The JREF volunteer would snap for certain. :D
 
AWESOME. You have made my day.

Kage

Glad you liked it. I still would love to be able to try to pass this off to a woo journal and see if they buy it as real and concrete proof. I'm in a spiteful and vengeful mood today. :D

PEAR got good results? ;)

Actually, this is very good data! Thanks.


Well, sure! If you look hard enough, you'll always get good results. Or as I was taught in school, "If you torture the data enough, it will confess." And, of course, all good statisticians hang out at S&M bars. It's true!
 
I don't know enough about the RNG experiments to say whether this optional stopping actually occurred.

Another way figures could be inflated is if a subject is expected to do a set number of trials. If they start badly and get disheartened, they are much more likely to drop out than those who happen to start well. Since they don't complete their pre-set number of trials, their data may be removed from the results, skewing the results in a positive direction.
 
Very nice write-up. This stuff should be obvious, yet it's ridiculous how many qualified individuals and professionals overlook it, whether by choice, ignorance, or stupidity.
 
I don't know enough about the RNG experiments to say whether this optional stopping actually occurred.

Another way figures could be inflated is if a subject is expected to do a set number of trials. If they start badly and get disheartened, they are much more likely to drop out than those who happen to start well. Since they don't complete their pre-set number of trials, their data may be removed from the results, skewing the results in a positive direction.

That was the whole reason I wanted to try the simulations. In reading up on how these studies are flawed, people always mention the floating starting point issue: "Oh no, 5 misses right off the bat? Is the equipment working? Do you need more time to warm up? Let's start again." Then there are people who lie, who claim that the person being tested started to tire, so their last batch of results was tossed out. But this one is easier to catch, since all data must be reported, and you can simply ask why those observations weren't used. I wanted to know exactly what happens if we're free to stop whenever we choose to, and how much that can skew results. It's bigger than I thought. And in my case I am using each and every observation; I didn't delete a single one.

I did work out a nice catch to see if this jury-rigging is going on. All of my positive tests were between 30 to 385 observations. All of my negative tests had exactly 385. We can look to see if the successes have, on average, a much lower number of trials than the misses in the study. If so, we've caught them. This should work with your example of inflation as well. :D
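
Something like this, in rough Python form (the data shape here is hypothetical, just to show the check):

```python
# A rough sketch of the catch: if the "hits" stopped after far fewer
# trials on average than the "misses", optional stopping is the suspect.
# sittings is a hypothetical list of (n_trials, declared_success) pairs.
def average_trial_counts(sittings):
    hit_lengths = [n for n, success in sittings if success]
    miss_lengths = [n for n, success in sittings if not success]
    return (sum(hit_lengths) / len(hit_lengths),
            sum(miss_lengths) / len(miss_lengths))

# Made-up example: hits stopped early, misses ran the full 385.
example = [(42, True), (61, True), (385, False), (385, False), (385, False)]
print(average_trial_counts(example))    # (51.5, 385.0)
```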
 
Here's the key thing to notice, and where jury-rigging the numbers comes in. I said that 385 tests, no more, no less, have to be done. In practice, how often is this really done? Let's say that in one test case, the person is successful 30 out of 40 times. The p-value in this case is indeed less than 0.001. Should you stop, or should you go on? According to my rules, you have to continue. Most people, though, will say that the sitter has proven themselves, so we might as well quit. Why go on and waste everybody's time if we've "proven the ability" after only 40 tests? This reasoning, while it looks harmless and merely seems to simplify the process, is the fatal flaw in these studies.

That was a nice demonstration of how a systematic bias can create pattern from randomness. If you have a collection of random data and use randomness to look for a pattern, no pattern emerges. If you use a pattern to look for a pattern, then a pattern emerges.

Do you know whether or not this particular flaw was present in the PEAR data - i.e. did individual operators stop early? I thought they were blinded as to the results.

ETA: Oops, already asked and answered, I see.
Linda
 
We can look to see if the successes have, on average, a much lower number of trials than the misses in the study. If so, we've caught them.

That's interesting. With regards to the ganzfeld experiments and the RNG work the most successful experiments were the shortest.
 
Ah, I should have stated more clearly... When I asked if we could find someone to sit in for a year, I should have said, "I don't think the JREF could find someone to waste a year of their life volunteering to oversee some woo perform this kind of a test." If I'm the one who's going to win (or try to scam) the $1,000,000, sure I'll do it. But the JREF people have lives to lead. A year for a preliminary test? And no one pays the referees for any of their time? And to sit through 250,000 dice throws and be totally aware and looking out for any kind of cheating? The JREF volunteer would snap for certain. :D

Absolutely agree.
 
Do you know whether or not this particular flaw was present in the PEAR data - i.e. did individual operators stop early? I thought they were blinded as to the results.

ETA: Oops, already asked and answered, I see.
Linda

Linda,

I have absolutely no clue. I just thought that people looking into PEAR's methodology were blinded to PEAR's results. ;) But seriously, I thought that PEAR was running the trials where a computer would generate a number, create the Good/Bad image, and the person in the other room had to guess the image. Both teams of people could be blind to each other, but they are both being recorded by the computer and the results tied together. If there is a third person who sees these live results, they can stop the test early if they deem it successful and yet still claim to have performed a double-blind experiment.

But this is just me guessing and trying to find ways of cheating and lying and yet coming up with "proof beyond a shadow of a doubt." I have no proof that PEAR did just this, just like Randi has no proof that Uri bends spoons with his hands. I do have a sneaky method which creates very similar results to PEAR's.

You've seen more medical/psychological stats tests than I have. Do you know if this type of test on the alpha levels is ever done to test H0?

That's interesting. With regards to the ganzfeld experiments and the RNG work the most successful experiments were the shortest.

?! Really?! How short is shortest???
 
?! Really?! How short is shortest???

Well, the best ganzfeld results tend to be in series of twenty trials or less.

As for the RNG stuff, I don't know. I read it a few times, most recently in an abstract from a recent meta-analysis into the subject:

Examining Psychokinesis: The Interaction of Human Intention With Random Number Generators--A Meta-Analysis
Bosch, Steinkamp, Boller
Psychological Bulletin, Vol. 132, No. 4. (July 2006), pp. 497-523
"Seance-room and other large-scale psychokinetic phenomena have fascinated humankind for decades. Experimental research has reduced these phenomena to attempts to influence (a) the fall of dice and, later, (b) the output of random number generators (RNGs). The meta-analysis combined 380 studies that assessed whether RNG output correlated with human intention and found a significant but very small overall effect size. The study effect sizes were strongly and inversely related to sample size and were extremely heterogeneous. A Monte Carlo simulation revealed that the small effect size, the relation between sample size and effect size, and the extreme effect size heterogeneity found could in principle be a result of publication bias."


(my emphasis)
 

So people should be aware of this, even though this covers a meta-analysis. If a person well versed in psychology experiments falls for my false argument, then they haven't done their due diligence?
 
You've seen more medical/psychological stats tests than I have. Do you know if this type of test on the alpha levels is ever done to test H0?

You know, that's an interesting question considering the thread on HIV/circumcision.

It does happen sometimes in medical trials. If there is a good reason to think the intervention will be effective, in order to mitigate the questionable ethics of withholding an effective treatment in a placebo-controlled trial, often a separate review board is set up to review the data at fixed intervals while the trial is still ongoing. If the strength of the data exceeds a certain threshold, they will recommend the trial be stopped so that the intervention can be offered to all.

Hmmm....

The difference is that these situations are generally pre-selected, rather than post-hoc (e.g. it would be like pre-selecting only the prime-numbered trials out of your 10,000 to end early if the p value is less than 0.001).

Linda
 

Here we go. I tried your idea and ran a new simulation. I test at 30, at 386, and at every prime in between, and only at these checkpoints, if my observed Z > 3.1, do I stop and call it a success with no need to continue. I also bumped it up to 1,000,000 simulated reps, just to make sure I get a good estimate.

Unrestricted (re-evaluate after each observation)
Success: 11,587
Failed: 988,413

Restricted (re-evaluate at primes)
Success: 9,216
Failed: 990,784

Doing that prime number restriction just dropped the result down to 0.92%. So with 400 subjects, each one making between 30 and 386 guesses, I have a great chance of getting 3 "psychics", maybe more.
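
Here's a rough sketch of how I set up the restricted version (a reconstruction, scaled down to 100,000 reps, not the exact code):

```python
# A sketch of the prime-restricted stopping rule: only peek at the
# checkpoints 30, 386, and every prime in between.
import random
from math import isqrt, sqrt

def is_prime(k):
    return k >= 2 and all(k % d for d in range(2, isqrt(k) + 1))

CHECKPOINTS = {30, 386} | {k for k in range(30, 387) if is_prime(k)}
Z_CUT = 3.1

def one_sitter(rng):
    hits = 0
    for n in range(1, 387):
        hits += rng.random() < 0.5      # a pure 50/50 guess
        if n in CHECKPOINTS:
            z = (hits / n - 0.5) / sqrt(0.25 / n)
            if z > Z_CUT:
                return True              # success, stop here
    return False                         # went the full distance

rng = random.Random(0)
sims = 100_000                           # 1,000,000 in the run above
wins = sum(one_sitter(rng) for _ in range(sims))
print(f"{wins} successes out of {sims}")  # roughly 0.9%
```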
 

I'm sorry. Now I feel bad that you did all that work. I didn't give an adequate description of what I meant. That result is not at all surprising (except that I'm surprised it dropped as much as it did).

I meant that out of your 10,000 (or 1,000,000) trials, the only trials where you start testing for significance after 30 guesses would be trial numbers 2, 3, 5, 7, 11, 13, 17, etc. And I chose prime numbers as an example, but I don't know whether the proportion of numbers that are prime is equal to the proportion of trials subject to early review (I suspect using prime numbers is way too high - something like squares of whole numbers may be closer).

Linda
 

Sadly, it's actually fun for me to do and try these sims. I need a life.

Perhaps I'm missing something here. I think I follow you that we only have the option to stop on the trial level, and not the tester level. In real life, then, I don't see how this can work. If we look at the AIDS in Africa case, we may only have one trial. One doctor sets up the study, patients come in, are given either drug or placebo, and are tested at time=t. The patient is my coin flip. The doctor, if p << 0.001, will stop the trial early and publish results.

In the coin flip/psychic case, I'm going to have 10,000 people come in to be tested, each with 385 guesses to make. Your suggestion states that most of these people have to have the full 385. I'm suggesting that even though the rule may be in place, it may not be observed. In fact, looking at my data, 99% of the people did go the full 385 tests. So someone may innocently believe that stopping early should have no major impact. Time is money, money is time, people have lives, and why keep trying more and more tests when a person is clearly showing better than random results. How many testers would be honest enough to continue?
 
Perhaps I'm missing something here. I think I follow you that we only have the option to stop on the trial level, and not the tester level. In real life, then, I don't see how this can work. If we look at the AIDS in Africa case, we may only have one trial. One doctor sets up the study, patients come in, are given either drug or placebo, and are tested at time=t. The patient is my coin flip. The doctor, if p << 0.001, will stop the trial early and publish results.

I was thinking of the body of medical research as representing a large number of trials, some of which (like the HIV/circumcision trial) have the p value calculated before the trial is finished and will stop early if p is less than their cut-off.

So one could attempt to argue that the benefit of conventional medical treatment (taken as a whole) has been exaggerated by the presence of a systematic bias, just like you argued that the actual number of psychics is exaggerated by the presence of a systematic bias (i.e. 3 psychics out of 200 when you'd expect none).

In the coin flip/psychic case, I'm going to have 10,000 people come in to be tested, each with 385 guesses to make. Your suggestion states that most of these people have to have the full 385. I'm suggesting that even though the rule may be in place, it may not be observed. In fact, looking at my data, 99% of the people did go the full 385 tests. So someone may innocently believe that stopping early should have no major impact. Time is money, money is time, people have lives, and why keep trying more and more tests when a person is clearly showing better than random results. How many testers would be honest enough to continue?

I don't think it's a matter of honesty, but rather a matter of whether the effect of the bias is recognized - something that's easy to miss when you tend to focus only on your own trial. It's a more complicated concept than a single test.

Linda
 
As an interesting thought that just popped into my head: there are often reports of medical trials that have been stopped early because of obvious negative effects. Is it possible that at least some of these could be due to this effect and not actually due to the treatment? Of course, if this does happen it would produce false negatives rather than false positives, and with experimental drugs it is always better safe than sorry, but it would be interesting to know if we could be losing potential treatments due to this effect.
 

That's an interesting idea. I'll have to think about this some more, but I don't think it would be as big of an issue. In the first case, you're looking for a high effect and you stop on a high effect. In this case, you're looking for a high effect but you stop on a low one. I think it's roughly a wash.

For example, the tests are usually for no difference between drug A, already on the market, and drug B, which you just developed. You're thinking that we stop too early and say B is worse when in fact B is better. So if observed - expected is too low, we stop. This assumes that the expected difference is 0. If B is in fact better, then the expected difference should be +5 or +10. So:

Z obs: if (observed - 0)/se < -3.1 we stop, and p inflates from 0.001 to 0.02.
We stop in error when (observed - d)/se < -3.1, where d is the true positive difference. However, the chances of making this mistake are much smaller than for Z obs.

Or let's think of 100 coin tosses. I'll stop if the person gets 40% correct or worse. Therefore, the chance of me stopping the test assuming no ability is: Z = (.4 - .5)/sqrt(.5*.5/100) = -2, therefore p is about 2.3%
However, if the person really and truly had ESP and was up to 60% success rate, the probability that I would stop and say no evidence is now
Z = (.4 - .6)/sqrt(.6*.4/100) = -4, therefore p << 0.1%
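
A quick numerical check of those two tail areas (my own arithmetic, using the normal approximation):

```python
# Rough check of the two stopping probabilities using the normal CDF.
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n = 100
# Chance of stopping (40% hits or worse) if the person is guessing at 50%:
z_null = (0.40 - 0.50) / sqrt(0.5 * 0.5 / n)
# Chance of stopping if the person truly runs at 60%:
z_alt = (0.40 - 0.60) / sqrt(0.6 * 0.4 / n)

print(normal_cdf(z_null))   # about 0.023
print(normal_cdf(z_alt))    # about 2e-05, i.e. well under 0.1%
```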

I made my own head hurt now. Ouch.
 