Yes, that's true, but it misses the mark of the criticism. While the goal may have been to investigate aggregated behavior, that doesn't preclude inter-subject tests to ensure that the data are homogeneous and therefore that the aggregation has meaning. You seem to be trying to say that Dr. Palmer found an anomaly in something PEAR wasn't trying to study, so it doesn't matter. That's not what happened. Dr. Palmer found an anomaly while testing the data for integrity and internal consistency. That kind of test is always appropriate, and very important when conclusions are to be drawn from broad aggregations.
Let's say we have a random sample of twenty sixth-graders (12 years old) who are being sent to basketball camp. We test their free-throw shooting by giving each of them 10 trials. Each player's score is the number of free throws they hit, from zero to 10. Certainly we can compute descriptive statistics on the scores -- the mean and the standard deviation. I have no idea what that distribution would look like, but let's say the mean score is 3.1 hits out of ten. Make up a standard deviation; it's not important.
Since it's a random sample of kids, we would expect them to vary in their ability. Some 12-year-olds have more practice and skill than others. If we look at the histogram of scores -- how many kids got each score -- it should look something like a normal distribution. You'd have nerds like me who would score very few hits, and athletes like my 12-year-old nephew who would score a lot. Most kids, we figure, would score somewhere around that mean, in the 2-4 hit range. Few if any would score higher than two standard deviations above the mean.
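If it helps to see that baseline concretely, here's a quick simulation sketch. Everything in it is made up to match the story: 20 kids, each given a personal hit probability drawn around 0.3 so the mean lands near 3.1 out of ten.

```python
import random
from collections import Counter
from statistics import mean, stdev

random.seed(1)

# Hypothetical baseline: 20 kids, 10 free throws each. Skill varies
# across kids, so each kid gets his own hit probability, drawn around
# 0.3 to land near the made-up mean of 3.1 hits out of ten.
skills = [min(max(random.gauss(0.3, 0.1), 0.02), 0.9) for _ in range(20)]
scores = [sum(random.random() < p for _ in range(10)) for p in skills]

print("mean score:", round(mean(scores), 2))
print("std dev:   ", round(stdev(scores), 2))

# Crude text histogram of scores 0..10 -- it should bunch around the
# mean with thin tails: the expected inter-subject distribution.
counts = Counter(scores)
for s in range(11):
    print(f"{s:2d} | {'#' * counts.get(s, 0)}")
```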
That's our empirically determined baseline, although as I continue you will probably see there's a slight problem with method. Ignore it for the purposes of the example; I know it's there.
Now send all the kids to basketball camp for two weeks and draw another test sample. A reasonable test of the effectiveness of the basketball camp would be to see if the mean score rose. Let's say the post-camp mean score was 3.4. Eureka! The camp works! Except there's a hitch: the second sample included LeBron James, and the analysts don't know that. They don't know the identities of the subjects, only their scores.
A quick surf over to the NBA stats page says LeBron's free-throw percentage is around 75%, far better than anyone else in the group. When we look at the histogram -- which, for a properly distributed sample, should still look like a normal distribution -- we see a suspicious-looking spike out there in the 7-8 score range. It doesn't fit what we expect the inter-subject data to look like, whether the camp works or not. He's dragging the average artificially upward, so the aggregation is not as meaningful as it otherwise would be. Say he hit 8 of his 10; without him the mean of the remaining nineteen is only 3.16, and we might conclude the camp is not effective.
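To make the mechanics concrete, here's a sketch with invented individual scores that reproduce the means above: one ringer moves the aggregate by a quarter point, and removing him moves it right back.

```python
from statistics import mean

# Invented post-camp scores for 20 kids: most cluster around 3-4 hits,
# one ringer hits 8 of 10.
scores = [2, 3, 3, 4, 3, 4, 2, 3, 4, 3, 3, 4, 3, 2, 4, 3, 4, 3, 3, 8]

print(f"mean with ringer:    {mean(scores):.2f}")       # 3.40
print(f"mean without ringer: {mean(scores[:-1]):.2f}")  # 3.16
```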
This is a mess. You didn't give us enough detail in your story to determine what exactly you think the mistake was that you made or why exactly you think Palmer made the same mistake. But let's first discuss what's obviously wrong about your comparison. Then we'll work through the rest as best we can and hope to cover all the bases.
The last statement is simply wrong. "Operator 010 is psychokinetic" is not the "inevitable" conclusion of conscientious scientists looking at outliers. Outliers are not presumed to vary according to the variable proffered in the hypothesis. In fact, if there is any presumption at all it's most often that the anomalous variance comes from a sporadically confounding variable, in my example the considerable outside training and expertise of a professional athlete. It sometimes becomes a subsidiary exercise to discover -- and later control for -- that variable. Dr. Palmer didn't go any further, nor likely would he have been able to.
In your anecdote you proposed to disregard a subject because of questions about what caused the very low score, and because that unknown cause might confound the test's intent to measure drug effectiveness. If I'm reading your reasoning correctly, you propose that Palmer wants to disregard Operator 010's score similarly because of what may possibly have caused it. You dance around the concept that it's because of assumed PK ability, but it's not clear that's what you mean. In my example above, that would be like disregarding the anomalously high score because you suspected it came from a professional basketball player.
This is simply inapt. Palmer gives no reason for disregarding Operator 010 beyond the inability of the data to fit reasonably within the expected inter-subject distribution. It would have been appropriate, for example, to disregard Operator 009 if his score had been two orders of magnitude below everyone else's. That could indicate some pervasive difficulty that subject had operating the machinery, an effect that would mask any intended measurement. We don't have to speculate why scores are anomalous in either direction, although it is often attractive to do so. It is sufficient to reject the score based on its incongruence in context, not on what it may conceivably represent. It would also have been appropriate to remove Operator 010 if the remaining distribution were rendered coherent and fit the pro-PK distribution. Yes, the means would have been slightly lower, but they may still have been significant compared to baseline. And that significance would have statistical validity because the inter-subject distribution would have been as expected.
It's your standard straw-man argument. You're ascribing to Palmer motives you may once have naively had, when there is no evidence for any such motive on Palmer's part and considerable evidence for an entirely different -- and completely necessary -- motive altogether, one you seem blithely unaware of. Just because Dr. Palmer's actions superficially resemble ones that, in a different context, would be wrong doesn't make them wrong in this context. From the very start you accused him of trying to make the data fit his wishes, which you assumed wrongly to be anti-PK. In fact it's quite obvious he's trying to make the data fit any of the expected distributions so that the descriptive statistics and correlations have the intended meaning. This is common practice, and you know it is. And your protests that Dr. Palmer somehow doesn't know how to do that properly in this field, and that you somehow do, are comically self-serving.
By way of background, just so everyone is up to speed:
Drug trials typically follow the double-blind, placebo-based model. A sample is drawn, and various categorical variables are consulted to determine how well the sample represents the relevant population. That sample is then divided (typically randomly) into two groups: one that will receive the drug and one that will receive a placebo. The subjects don't know which they're getting. The experimenters who interact with the patients, in turn, don't know which they're administering. The placebo group serves as the baseline control against which the variable group is measured. In order for that to be valid, all those pre-trial categorical variables have to match up fairly evenly between the groups. They're usually demographic in nature -- age, sex, ethnicity, prior medical conditions, etc. But they're really proxies for effects known, suspected, or speculated to introduce confounding influences in the outcome. Dr. Philip Zimbardo, of Stanford prison experiment infamy, includes a layman-accessible description in his book The Lucifer Effect of how he homogenized his sample between the Guards and Prisoners groups.

If all those potential confounders are balanced in both groups, you can confidently attribute variance in outcome (in this case, the disposition or occurrence of cancer) to the only category you varied -- which pill the patient actually got. If your placebo group were mostly men, and the drug seemed to have worked well on the mostly-women group who took the actual drug, you may not be able to separate the desired effect of the drug from the baseline fact that cancer rates are higher among men.
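A schematic of that assignment-and-balance step, if it helps. The sample size and the single covariate are invented; a real trial balances many more variables than this.

```python
import random
from collections import Counter

random.seed(7)

# Hypothetical sample: 200 patients, each tagged with one pre-trial
# categorical variable (sex) known to affect cancer rates.
patients = [{"id": i, "sex": random.choice("MF")} for i in range(200)]

# Randomize into drug and placebo arms.
random.shuffle(patients)
drug, placebo = patients[:100], patients[100:]

# Check covariate balance: each category should come out roughly even
# on both sides of the placebo line before the trial proceeds.
for name, arm in (("drug", drug), ("placebo", placebo)):
    print(name, Counter(p["sex"] for p in arm))
```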
In your anecdote, the observation you wished to remove was one in which no cancer was observed. From context, I glean that this was anomalous -- the patient had previously had cancer, and the expected effect of the drug was merely to put cancer into remission, not make it vanish. The causal category you proposed to eliminate (spontaneous disappearance) still has to be represented on both sides of the placebo line, and -- within both groups -- along its entire categorical spectrum, in order for the correlations to the placebo/drug category to remain valid. This has nothing whatsoever to do with disregarding data that is self-evidently out of place, irrespective of known or speculated cause.
Let's talk more about categorical variables. Imposed controls are often categorical variables, in which case strong correlations to them suggest the presence of the confounding condition that motivated the control. The example I gave above relevant to your anecdote was controlling the sample for sex, because sex is known to affect cancer rates. For a PK example, if a subject demonstrates the ability to move a paper cup around on the table "using only his mind," and that ability completely disappears when the cup is covered by a bell jar, then one of several confounding phenomena is indicated. The control is applied to prevent the subject from physically manipulating the cup in a way the experimenters wouldn't otherwise detect. In past trials this has been accomplished with invisible loops of monofilament held between the subject's hands, or tricks as prosaic as blowing on the cup. It isn't necessary for the experimenters to think of every possible way of surreptitiously moving the cup by ordinary physical means. Isolating it physically from the subject eliminates most if not all such methods.
If we have a number of such subjects -- some who can move the cup to varying degrees and some who can't, some who can move it with the bell jar in place and some who can't -- then we have the basis to test for significant relationships among categorical variables, where in this case the category might be "jar" vs. "no jar." This is how categories would more properly be considered in an experiment, and we would switch to something more akin to the chi-square test for independence.
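For the curious, here's roughly what that test looks like in practice. The counts are invented for illustration; scipy's chi2_contingency does the arithmetic.

```python
from scipy.stats import chi2_contingency

# Invented counts. Rows are the imposed control ("jar" vs "no jar");
# columns are the outcome ("cup moved" vs "cup didn't move").
table = [[ 2, 48],   # jar in place
         [31, 19]]   # no jar

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}")
# A small p rejects independence: the "ability" depends on the jar,
# which points at ordinary physical manipulation, not PK.
```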
In the PEAR study the volitional variable was imposed as a control to preclude any effect that depends on the subject knowing, before the trial, how the outcomes would appear. It doesn't matter how such knowledge could be acquired or such preparation accomplished. It doesn't matter that you, I, Palmer, or anyone else fails to imagine how it could be done; empirical control doesn't typically work by enumerating mechanisms. What matters is that the data were collected in a way that recorded whether the subject was able, at the time of testing, to select the method of trial that would occur. That's a category in the study.
In a double-blind placebo study, we would want the measured cancer rate at the end of the study to be demonstrably independent of all variables except the placebo-vs-drug variable. Toward that end we compare measured cancer rates against those other variables, regardless of whether the patients got the drug or the placebo. If the rate correlates in your sample more strongly with, say, whether a patient exercises regularly than with whether he got the drug, you can't say the placebo-vs-drug category is sufficiently independent to be significant.
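Sketched out, with invented counts. Cramér's V is just one convenient way to compare the strength of two categorical associations side by side.

```python
import math
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Rough effect size for a 2x2 contingency table (Cramer's V)."""
    chi2, _, _, _ = chi2_contingency(table)
    n = sum(sum(row) for row in table)
    return math.sqrt(chi2 / n)  # min(rows, cols) - 1 == 1 for a 2x2

# Invented counts: the cancer outcome tabulated against two categories.
#                     cancer  no cancer
drug_vs_placebo = [[40, 60],   # got the drug
                   [48, 52]]   # got the placebo
exercise        = [[25, 75],   # exercises regularly
                   [63, 37]]   # does not

print("drug category:     V =", round(cramers_v(drug_vs_placebo), 3))
print("exercise category: V =", round(cramers_v(exercise), 3))
# If the exercise association dwarfs the drug association, the outcome
# is not sufficiently independent of the confounder to credit the drug.
```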
If the PK effect hypothesized by PEAR is real, the measurement of it in terms of variance from the baseline was expected to be independent of the volition category. It wasn't. It was quite strongly dependent on it. Now you can read all sorts of nefarious intent into the notion that all the variance in the one out of three studies that showed any variance at all depended on whether one subject knew in advance how the day's experiment was going to be done. But what's more important is that PEAR had no answer for this. They didn't propose any sort of PK-compatible explanation (e.g., "For PK to work, the subject had to be in a certain mindset that was defeated by the volition variable"). They did no further testing to isolate and characterize this clearly predictive variable that wasn't supposed to be predictive.
You can't just leave a failed test for independence alone and claim victory nonetheless. Palmer didn't explicitly perform the independence test, but he didn't have to; the errant correlation is trivially apparent. This lengthy exposition is meant to reach this one point: you accuse Palmer of eliminating or disregarding Operator 010, and you claim -- based on irrelevant and wrong comparisons to control-system design -- that this is a no-no. A better way of looking at Palmer's review is that he considered the problem categorically. The categories of volition-vs-not and significance-vs-nonsignificance are not independent in the way they would need to be for Jahn's hypothesis to have been properly supported by his first study. It's not that Operator 010 wasn't taken into account. Palmer took Operator 010 into account according to the way the categories she fell into could be reasoned about intelligently and correctly.
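For completeness, here's the shape of the independence test Palmer could have run explicitly. The counts below are schematic inventions, not PEAR's data; with cell counts this small, Fisher's exact test is the usual choice over chi-square.

```python
from scipy.stats import fisher_exact

# Schematic only -- these counts are invented, not PEAR's data.
# Rows: subject chose the trial mode (volitional) vs had it assigned.
# Columns: session showed a significant deviation vs did not.
table = [[9,  3],   # volitional sessions
         [1, 11]]   # non-volitional sessions

odds, p = fisher_exact(table)
print(f"odds ratio = {odds:.1f}, p = {p:.4g}")
# For the hypothesis to stand, significance should be independent of
# volition; a small p here says it is not.
```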