Not as easy as I thought, but I've certainly made a dent. None of the choices I made about what to include or exclude are particularly controversial, and I'm sure a more skeptical person than me could find other reasons to filter the data.
First, with regard to the pre-1983 data, I reintroduced the 50% MCE experiments, as per Radin's criterion of using only experiments with a hit/miss method of scoring. Then, following Utts' lead, I excluded Sargent's data due to issues with the protocol. I also took out Terry's 1976 work, after criticisms by Parker and Kennedy (more on which later), as well as York's, since that paper reported its results by a secondary scoring method rather than the primary one.
I then reintroduced the post-1983 experiments that were not carried out by the five laboratories covered by Radin's work, although I removed Bierman's 1987 experiment, since the experimenter in the room during the judging knew what the target was, raising the possibility of subliminal cueing.
I also excluded any paper that did not give numerical results or a description from which a reasonable estimate could be made.
To remove any problems with optional stopping, I cut out the experiments that did not have a pre-set number of trials, or that did not complete their pre-set number of trials.
To address the problem of informal experiments being published only if they happen to get good results, I also removed any experiments that were explicitly labelled as pilots, or were media or classroom demonstrations, or student work, or that ran twenty trials or fewer.
Then I took out experiments that did not use white or pink noise (i.e., static) or silence as the auditory stimulus, or that did not select targets randomly. Finally, I took out experiments that used audio targets.
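For anyone who wants to try the same exercise, the cuts above are easy to script. A minimal sketch in Python, assuming a hypothetical spreadsheet of the database with one row per experiment and made-up column names (lab, n_trials, and so on), not the actual database format:

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration.
df = pd.read_csv("ganzfeld_experiments.csv")

mask = (
    ~df["lab"].isin(["Sargent"])          # protocol issues (Utts)
    & ~df["study_id"].isin(["Terry1976", "York", "Bierman1987"])
    & df["reports_numerical_results"]     # usable hit/miss counts
    & df["has_preset_trial_count"]        # guards against optional stopping
    & df["completed_preset_trials"]
    & ~df["pilot_demo_or_student_work"]   # informal studies
    & (df["n_trials"] > 20)
    & df["noise_or_silence_stimulus"]     # white/pink noise or silence only
    & df["random_target_selection"]
    & ~df["audio_targets"]
)
filtered = df[mask]
```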
The result: 79 experiments, 3960 trials, 1073 hits, a 27% average hit rate, and a Stouffer z of 2.71, or odds of 1 in 297.
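For reference, the combined figure here is a Stouffer z (presumably the unweighted version: the sum of the per-experiment z-scores divided by the square root of the number of experiments), and the odds are just the reciprocal of the one-tailed p-value. A quick sketch of that arithmetic, assuming the per-experiment z-scores are already in a list:

```python
import math
from scipy.stats import norm

def stouffer(zs):
    # Unweighted Stouffer combination: every experiment counts equally,
    # no matter how many trials it ran.
    return sum(zs) / math.sqrt(len(zs))

def one_in_n(z):
    # One-tailed p-value expressed as "1 in N" odds.
    return 1.0 / norm.sf(z)

# one_in_n(2.71) ≈ 297; one_in_n(1.97) ≈ 41
```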
I could've done more, but I don't have much time to check the effect of outliers, or to go back through each paper to make sure I wasn't missing any experiments that should have been excluded. Nevertheless, I'm confident I haven't made any major gaffes.
The point of the exercise is to show how easy it is to produce a perfectly sensible-looking meta-analysis that agrees with your pre-existing hypothesis. This took me about an hour, and was largely a matter of sorting the database by z-score and looking at the positive experiments to see what flaws they shared. However, it wouldn't take a genius to write this up as if these criteria had been decided upon before any analysis was carried out, and these were simply the results we got.
For the record, what really made a difference was noticing how many of the highest z-scores came from experiments with very few trials. Removing the shortest experiments really took a chunk off the results.
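That pattern is what you would expect from an unweighted Stouffer combination: every experiment counts equally, and with only a handful of trials a couple of lucky hits produce a very large per-experiment z. A rough illustration using the normal approximation to the binomial, taking a 25% chance hit rate for simplicity (some of the included studies have a 50% chance level):

```python
import math

def experiment_z(hits, trials, p0=0.25):
    # Normal approximation to the binomial: z-score for an observed hit count.
    return (hits - trials * p0) / math.sqrt(trials * p0 * (1 - p0))

print(experiment_z(7, 10))    # ≈ 3.29: 7 hits out of 10 trials
print(experiment_z(39, 100))  # ≈ 3.23: roughly what 100 trials need for the same z
# Both feed into the Stouffer sum with exactly the same weight.
```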
ETA: mucked about for a bit more: by excluding all experiments of thirty trials or fewer and then removing Dalton's 1997 work as an outlier, the Stouffer z falls to 1.97, or about 1 in 41.
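Dropping a study from the pool just removes its z from the sum and shrinks the denominator, which is why a single high-scoring outlier can prop up the whole combination. A small sketch of that recalculation, reusing the hypothetical helpers above:

```python
import math

def stouffer_without(zs, drop_index):
    # Recompute the combined z with one experiment removed.
    kept = [z for i, z in enumerate(zs) if i != drop_index]
    return sum(kept) / math.sqrt(len(kept))

# With the trimmed set, one_in_n(1.97) ≈ 41, the "about 1 in 41" above.
```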