tl;dr -- Buddha doesn't understand the statistics PEAR used to analyze their findings. He has conflated the incidental means of validating the baseline with the actual tests for significance against the baseline. Based on that misunderstanding, he accuses Palmer of erring when he recomputed the significance. I explain how the t-test for significance works and provide an example to illustrate what Dr. Jeffers found suspicious about PEAR's use of it.
This is not a baseline, contrary of what Palmer thinks...
Yes it is, contrary to what you think.
[H]e doesn’t have a clear idea of what the baseline is.
Yes he does, in the context of PEAR's research, which intended -- correctly so -- to use the t-test for significance. Dr. Palmer explicitly notes that PEAR's decision to use the t-test instead of the Z-test is an improvement over the approach of Jahn's predecessor Schmidt in studying the PK effect. I've mentioned this several times, but you never commented on it. I'm going to explain why it's an improvement, why PEAR was right to use it, why Palmer was right to endorse its use, and why you don't know what you're talking about.
A baseline is determined before the start of a project, not during it.
No.
If your intent is to use the t-test for significance, then baseline data must be collected empirically. It can't be inferred from theory. It can only be collected after the project design, apparatus, and protocols are in place. Where the calibration factors are all static given the above, all the empirical baseline data may be collected prior to any experimental trials -- or even afterwards, as long as the baseline collection is independent of the experimental trials. But if instead the calibration factors include environmental factors that cannot be controlled for except at the moment of trial, then it would be a mistake to compare experimental data collected in one environment to baseline data collected in a different environment at some other point in the project.
It is up to the judgment of the experimenter to know which factors apply. In this case Dr. Jahn, an experienced engineer, properly understood that the REG apparatus was sensitive to several environmental factors, only some of which he could control for explicitly. Hence the protocol properly required calibration runs at the time of trial, so that the collected data sets could reasonably be assumed to differ in only one variable.
When it is not possible, a baseline is based on the data available before beginning of a project.
No.
There is no magical rule that says that all baseline data must be collected prior to any experimental data, and absolutely no rule that says calibration runs may not interleave with experimental runs. You're imagining rules for the experimental sciences that simply aren't true. We know from your prior threads that you have no expertise or experience in the experimental sciences, so you are not a very good authority on how experiments are actually carried out. We further know that you will pretend to have expertise you don't have, and that your prior arguments depart from "rules" you invent from that pretended expertise and then try to hold the real world to.
Fluctuations of electrons from the surface of a metal form a Poisson distribution, as the theory shows.
Yes, in theory. The REG design is based on a physical phenomenon known to be governed principally by a Poisson distribution. That does not mean the underlying expectation transfers unaffected through the apparatus from theoretical basis to observable outcome. In the ideal configuration of the apparatus, and under ideal conditions, the outcome is intended to conform to a Poisson distribution to an acceptable amount of error.
Except for the electron emission part of the equipment, it is possible that other equipment parts introduce the bias...
Not just possible, known to confound. We'll come back to this.
To rule out this possibility, the researchers run the device without the subjects being tested, collect the results and use certain statistical methods to determine if results form a Poisson distribution.
The results will never "form" a Poisson distribution. The results will only ever approximate a Poisson distribution to within a certain error.
Let’s say the results do not form a Poisson distribution.
Specifically, if the machine is operating properly, a Z-test applied to the calibration run -- the observed counts tested against the theoretical Poisson expectation -- will produce a p-value greater than 0.05. If the p-value falls below that threshold, it means some confound in the REG is producing a statistically significant departure from the expected Poisson behavior.
But you misunderstand why this is of concern. You wrongly think it's because the goal of the experimenters is to compare the experimental results to the Poisson distribution. Instead, an errant result in an apparatus carefully designed and adjusted to approximate a Poisson distribution as closely as possible indicates an apparatus that is clearly out of order. That in turn indicates an unplanned condition within the experiment, one that cannot later be assumed to have left the experimental data unaffected in some unknown qualitative way. The Z-test for conformance to the Poisson distribution merely confirms that the machine is working as intended, not that the machine is working so well that the Poisson distribution can be substituted as a suitable baseline.
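To make that concrete, here is a minimal sketch in Python of the kind of calibration check I'm describing -- a Z-test of one calibration run's mean count against the theoretical Poisson expectation. The numbers (the theoretical mean, the run length, the observed mean) are hypothetical illustrations of mine, not PEAR's actual figures.

from math import sqrt
from scipy import stats

lam = 100.0            # hypothetical theoretical Poisson mean count per sample
n = 1000               # hypothetical number of samples in one calibration run
observed_mean = 100.6  # hypothetical mean count actually measured in that run

# Under the Poisson model the variance equals the mean, so the standard
# error of the run mean is sqrt(lam / n).
z = (observed_mean - lam) / sqrt(lam / n)
p = 2 * stats.norm.sf(abs(z))   # two-sided p-value

# p above 0.05: the departure from Poisson is not significant; the machine
# is judged to be within tolerance. p below 0.05: some confound is producing
# a statistically significant departure.
print(f"z = {z:.3f}, p = {p:.3f}")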
In this case the team follow well-known guidelines...
Well-known, but known not to be exhaustive. This is the part you're missing. Yes, the operator of an REG (or any other apparatus) will have a predetermined checklist to regulate the known confounds with the hope of reducing measured calibration error to below significance. That doesn't guarantee he will succeed in removing all error such that he can set aside measurement in favor of theory.
If a theory is correct, these measures guarantee that the scientists are dealing with a Poisson process.
No.
Certainly a conscientious team will look for sources of error. But they know they cannot do so exhaustively, and that they will never drive the machine's residual departure from ideal Poisson behavior to zero. Nor is it possible to. They can only reduce it to the point where the calibration Z-test no longer flags it as significant. "Not significant" does not mean nonexistent. It merely means they have confidence that the machine is working as expected. They know from the start that they are dealing with a Poisson process. The calibration merely ensures that the Poisson effect dominates the machine's operation.
If they were to use the Z-test to compare the experimental data to the idealized Poisson distribution, the residual error revealed by the calibration would still be a factor -- and that method simply sets it aside.
Let's say the calibration runs produce Z-test p-values of around 0.055 -- just above the 0.05 threshold. That's enough to conclude that the machine is operating within tolerance. But the underlying error still exists as a non-zero quantity. Using the Z-test on the experimental data combines that error with the effect you're trying to measure such that they cannot be separated. When the expected effect is very small, this becomes a serious concern.
Hence the t-test for significance, which relaxes the constraint that the expected data conform to any theoretical formulation of central tendency. The data may incidentally conform, but that's not a factor in the significance test.
There are non-Poisson processes as well with their own rules of choosing a baseline.
Yes, which is why the other tests for significance besides the Z-test exist. The rule for choosing a baseline in the t-test is that the baseline is determined empirically, to a precision governed by the number of calibration runs in which the test variable is not varied. The protocol for running the calibration is determined by the known factors of the test and by how the expected or known confounds are thought to vary. Jahn et al. expected the confounds to vary mostly by factors that would exhibit themselves only at the time of trial and could be only partially controlled for by machine adjustment. Hence calibration runs interleaved with trial runs.
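Here is what "baseline determined empirically" cashes out to in practice -- a minimal sketch, using per-run mean counts that are entirely hypothetical (and far fewer runs than a real series would use). The calibration data themselves are the baseline; no theoretical distribution is consulted anywhere.

from scipy import stats

# Hypothetical per-run mean counts; calibration runs interleaved with trial runs.
calibration_means = [100.58, 100.62, 100.61, 100.57, 100.60, 100.63, 100.59, 100.60]
trial_means       = [100.71, 100.76, 100.74, 100.69, 100.75, 100.73, 100.77, 100.72]

# Welch's two-sample t-test: trial runs against the empirical calibration baseline.
t, p = stats.ttest_ind(trial_means, calibration_means, equal_var=False)
print(f"t = {t:.2f}, p = {p:.2g}")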
"Better than chance" in these contexts doesn't mean varying from the Poisson distribution. It means varying from the behavior that would be expected were not some influence applied. Whatever that behavior would have been is completely up for grabs. You don't have to be able to fit it to a classic distribution. You only have to be able to measure it reliably.
Without knowing which one of them is present in a particular case, you won’t be able to choose a baseline.
Which is why a test was developed that determines a baseline empirically, without the presumption that it conform to some theoretical distribution. If you aren't sure which theoretical distribution is supposed to fit, or you know that no theoretical distribution will fit because of the nature of the process, then statistics still provides a method for determining whether some variable in the process has a significant effect: compare it against an empirically determined baseline. The limitations of an empirically determined baseline translate into degrees of freedom in the comparison; they do not invalidate it. This is Stats 101, Buddha. The fact that you can't grasp this simple, well-known fact in the field says volumes about your pretense to expertise.
At another extreme, your empirical data might fit more than one distribution, which would make the choice impossible.
The t-test requires no choice -- it always uses the t-distribution. You fundamentally don't understand what it is, why it's used, or how it achieves its results.
It would require an infinite number of runs to determine the nature of a process.
No.
This is comically naive, Buddha. You're basically arguing that the t-test itself is invalid, when it is in fact one of the best-known standard tests of significance.
No, the t-test does not require an "infinite number of runs" to establish a usable baseline. The baseline is characterized by the distribution of means across the calibration runs. The number of runs fixes the degrees of freedom -- the major parameter of the t-distribution -- and the spread of those means fixes the standard error against which any experimental deviation is judged. The heavier tails of the t-distribution at low degrees of freedom exist precisely to compensate for the uncertainty in a standard deviation estimated from a small number of runs.
You don't compare the calibration runs to some idealized distribution. You compare them to each other. The spread of that distribution of run means measures the consistency of the calibration from trial to trial. If the calibration runs are very consistent, only a few of them are needed. If they are not consistent -- i.e., the standard deviation of the distribution of means is large -- then many more runs will be required to pin down the central tendency.
But once you know the degrees of freedom and the standard error -- together they govern how far a result has to stray before the t-test flags it -- you know whether you have a suitably tight baseline.
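Here is what "suitably tight" cashes out to, again with hypothetical numbers of my own choosing: the calibration run means alone give you a standard error and, from the number of runs, the degrees of freedom, and those two quantities tell you how precisely the baseline is pinned down.

import numpy as np
from scipy import stats

calibration_means = np.array([100.58, 100.62, 100.61, 100.57,
                              100.60, 100.63, 100.59, 100.60])  # hypothetical

n = len(calibration_means)
m = calibration_means.mean()
sd = calibration_means.std(ddof=1)   # sample standard deviation of the run means
se = sd / np.sqrt(n)                 # standard error of the baseline mean
df = n - 1                           # degrees of freedom
tcrit = stats.t.ppf(0.975, df)       # two-sided 5% critical value

# The tighter this interval, the smaller the shift the t-test can resolve.
print(f"95% confidence interval for the baseline mean: {m:.3f} +/- {tcrit * se:.3f}")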
Let's say you do twenty calibration runs, and for all of them the Z-test against the Poisson distribution produces a p-value in the range 0.054-0.056. That's approaching significance, but the p < 0.05 threshold may be sacrosanct in your field, and you stay just above it. So you're good to go. Your confounds sit just below the level of significance when measured against Poisson.
At the same time, we might find that the distribution of means in the calibration runs is extremely narrow. That is, the machine might be on the hairy edge of approximating the Poisson distribution, yet solidly in the business of repeating its own performance accurately every time. This is why the t-test is suitable for small data sets (in Jahn's case, N = 23), where such behavior can be revealed in only a small number of calibration runs.
A small standard deviation in the baseline means translates to a small standard error, which leaves the t-distribution built from the baseline very little room to "stretch" to accommodate values in the comparison distribution of means. That means any data that stands too far outside the properly parameterized t-distribution will be flagged as significantly variant. That is, it is the consistency among the baseline runs, not their conformance to one of the classic distributions, that makes the comparison work.
But what's more important is that any lingering concern over the calibration Z-test p-values becomes irrelevant. Whatever was causing the machine to only-just-barely produce suitably random numbers was shown, in the t-test baseline computation, not to vary much from run to run. Whatever the confounds are, they're well-behaved and can be confidently counted on not to suddenly become a spurious independent variable. If the subject then comes in and produces a trial that differs at p < 0.05 in the t-test from the t-distribution parameterized by those very consistent prior runs, that's statistically significant. If that subject's performance had instead been measured against the Poisson distribution, the effect hoped to be statistically significant would still be confounded with whatever lingering influence was producing those near-threshold p-values in the calibration.
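To put rough numbers on that -- all of them hypothetical, mine and not PEAR's -- suppose the theoretical Poisson mean is 100 counts, each run averages 1000 samples, the calibration runs sit consistently around 100.60, and the subject's runs sit around 100.75:

from math import sqrt
from scipy import stats

lam, n_samples = 100.0, 1000                    # hypothetical theory and run length
cal_mean,  cal_sd,  n_cal  = 100.60, 0.05, 23   # hypothetical calibration summary
expt_mean, expt_sd, n_expt = 100.75, 0.05, 23   # hypothetical experimental summary

# Z-test of the experimental mean against the theoretical Poisson expectation:
# whatever it finds is the machine's residual bias and the subject's effect
# mixed together, inseparably.
z = (expt_mean - lam) / sqrt(lam / n_samples)
p_z = 2 * stats.norm.sf(abs(z))

# Welch t-test of the experimental runs against the empirical calibration
# baseline: the machine's bias is common to both sets, so only the shift tied
# to the test variable remains.
t, p_t = stats.ttest_ind_from_stats(expt_mean, expt_sd, n_expt,
                                    cal_mean,  cal_sd,  n_cal,
                                    equal_var=False)

print(f"Z vs. Poisson:            z = {z:.2f}, p = {p_z:.4f}")
print(f"t vs. empirical baseline: t = {t:.1f}, p = {p_t:.2g}")

The Z-test result can't tell you how much of the departure is the subject and how much is the machine's pre-existing quirk; the t-test against the empirical baseline isolates exactly the quantity of interest.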
In your rush to play teacher, you've really shot yourself in the foot today.
First, I covered all this previously. It was in one of those lengthy posts you quoted only to append a single line of dismissive rebuttal. You constantly attempt to handwave away my posts as "irrelevant" or somehow misinformed, but here you are again trying to say what I've already said, as if you're now the one teaching the class. The way the Poisoning-the-Well technique works is that you're not supposed to drink from the same well yourself. I explained how the t-test and its parameters work, but apparently it only becomes relevant when you decide to do it...
...and get it wrong. That's our second point. You fundamentally don't understand how tests for significance work. It's clear you've only ever worked with the basic, classic distributions and -- in your particular mode -- think that's all there could ever be. As I wrote yesterday, you're trying to make the problem fit your limited understanding instead of expanding your understanding to fit the problem. And in your typically arrogant way, you have assumed that your little knowledge of the problem, gleaned from wherever, "must" be correct, and that someone with a demonstrably better understanding of the subject than you -- the eminent psi researcher John Palmer -- "must" have conceived the problem wrong.
These are questions intended entirely seriously: Do you ever consider that there are things about a subject you do not know? Do you ever consider that others may have a better grasp of the subject than you? Have you ever admitted a consequential error?
Third, now it's abundantly clear why you're so terrified to address Dr. Steven Jeffers. Your ignorance of how the t-test for significance works and achieves its results reveals that you don't have the faintest clue what Jeffers actually did. You're ignoring him because you don't have any idea how to even begin. It's so far over your head.
So the t-test for significance compares two data sets that are categorically independent according to some variable of interest (in PEAR's case, whether PK influence was consciously applied). All the potential confounds are expected to be homogeneous across the two sets. One data set is the calibration runs, represented by its mean and standard deviation. The other set is the experimental runs, similarly represented. The N-value (23, for PEAR) and the standard deviations of the two sets determine the degrees of freedom, and the degrees of freedom determine how far the t-distribution can "stretch" or "bend" to accommodate the other distribution before a difference counts as significant.
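For the unequal-variance form of the test, the degrees of freedom fall straight out of the two sample sizes and standard deviations via the Welch-Satterthwaite formula. A minimal sketch, with hypothetical per-run standard deviations:

def welch_df(sd1, n1, sd2, n2):
    """Welch-Satterthwaite degrees of freedom for a two-sample t-test."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    return (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Hypothetical spreads for 23 calibration runs and 23 trial runs.
print(welch_df(0.05, 23, 0.05, 23))  # equal spreads and sizes give df = 44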
What Jeffers discovered was that PEAR's t-distribution for the calibration runs was too tightly constrained. Working backwards, this translates into an implausibly small spread in the calibration means for a sample size of 23 -- in fact, an absurdly small amount of variance, too small to be possible under PEAR's protocol. Why? Because while the process underlying the REG's operation is theoretically Poisson, the process variable gets discretized along the way. Discretizing a variable changes the amounts by which it can vary, and consequently the ways in which statistical descriptions of that variance can appear.
Let's say you ask 10 people each to name a number between 1 and 10, and we take the mean. Can that mean have a value of 3.14? No. Why not? Because the sum of ten integers divided by 10 can never produce more than one digit past the decimal point. It could be 3.1 or 3.2, but not 3.14. Do that 20 times, for a total of 20 means computed from groups of ten. If we aggregate the means, they can't vary from group to group by anything finer than 0.1. Data points will be either coincident or some multiple of 0.1 apart. If we look at the distribution of those means, there is a limit to how closely it can approximate a classic distribution, because the means are constrained in where they can fall in the histogram: only on 0.1-unit boundaries, regardless of how close to or far from the idealized distribution that is. All our descriptive statistics are hobbled in this case by the coarse discretization of the data.
All that occurs because the customary response to "pick a number between 1 and 10" is an integer. If we re-run the test and let people pick decimal numbers to arbitrary precision, then the group means can take on any real value, the aggregate of means can take on any value, and the distribution of those means across all groups has more flexibility to get close to a classical distribution. More importantly, the standard deviation of that distribution has more places to go.
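You can see the constraint directly. This is a toy sketch of my own, nothing to do with PEAR's actual data: enumerate every mean that a group of ten integer picks between 1 and 10 can actually produce.

from itertools import combinations_with_replacement

# Every mean that a group of ten integer picks (1..10) can produce.
possible_means = sorted({sum(g) / 10 for g in
                         combinations_with_replacement(range(1, 11), 10)})

print(3.14 in possible_means)  # False: 3.14 is unreachable
print(possible_means[:5])      # 1.0, 1.1, 1.2, ... -- a lattice of 0.1-unit steps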
What Jeffers found was that the purported distribution of means in the calibration runs is not likely to have actually been produced by the REGs, because it exhibited a standard deviation not achievable from the discrete outputs the REG could produce -- just as there exists no set of integers whose sum divided by 10 equals 3.14.
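Here is a toy version of that kind of plausibility check -- my own illustration of the principle, not Jeffers' actual computation. If twenty run means can only fall on a 0.1-unit lattice, there is a hard floor on the smallest non-zero standard deviation they can exhibit, and a reported value below that floor simply could not have come from the data.

import numpy as np

# Twenty run means confined to a 0.1-unit lattice. The tightest non-degenerate
# arrangement is nineteen identical means plus one a single lattice step away.
means = np.array([3.1] * 19 + [3.2])
floor_sd = means.std(ddof=1)

reported_sd = 0.01  # hypothetical reported value, below the achievable floor
print(f"smallest achievable non-zero sd: {floor_sd:.4f}")
print(f"reported sd of {reported_sd} clears that floor? {reported_sd >= floor_sd}")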
I would like you to address Jeffers, the critic of PEAR you've been avoiding for weeks. I would like to see you demonstrate enough correct knowledge of the t-test for significance to be able to discuss his results intelligently, and at the same time realize that John Palmer is not misinformed as you claim. At this point you seriously don't know what you're talking about.