
Telephone telepathy data: Statisticians needed

Here are the ANOVA results:

Analysis of Variance
MAIN EFFECTS
A: LOCAL SIDEREAL TIME    p-value = 0.1021
B: Subject                p-value = 0.7999
INTERACTION               p-value = 0.3377

This analysis includes all 216 cases because I can't easily determine which two they dropped from the analysis. Since those were coded as misses, dropping them would make the p-values slightly lower. At any rate, it's fairly consistent that the only finding of possible significance is the peak versus non-peak comparison. The outcomes during the peak times do show an effect.

The problem with this ANOVA analysis is that it only checks for differences between the session times and the subjects. It doesn't take into account what the expected values are. The binomial computations do that, though. I don't know why they didn't use that approach. It's more accurate, very straightforward and easy to compute, and it shows a significant difference overall at the 90% confidence level and a significant difference for the peak time periods at the 98% confidence level.
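For anyone who wants to check that, here is a minimal Python sketch of the exact binomial computation (using scipy). The totals are my own reading of the per-subject numbers quoted later in this thread (63 hits in 214 valid trials), so treat them as an assumption and substitute the paper's actual counts:

from scipy.stats import binomtest

# Exact one-tailed binomial test at a 25% chance hit rate.
# 63 hits in 214 valid trials is inferred from per-subject numbers
# quoted later in the thread; substitute the paper's actual counts.
hits = 63
trials = 214
result = binomtest(hits, n=trials, p=0.25, alternative='greater')
print(f"exact one-tailed p-value: {result.pvalue:.4f}")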

Beth, thanks!

You're welcome. It's always best to have more than one set of eyes look over an analysis. Some details are easy to miss or get wrong. In addition, it's helpful to hash out the pros and cons of some analysis decisions that only statisticians can appreciate - such as one-tailed versus two-tailed :) .

I wonder whether the one-tailed test is appropriate. In my world, sometimes psychic powers work against you. For example, I imagine a raw/untrained psychic who hasn't yet harnessed his/her true powers might get things backwards, sorta like a young Harry Potter. Misinterpreting the signal and concluding it is Joe calling when it isn't seems about as likely as being able to tell who's calling anyway. Given that the one-tailed test makes it easier to find something, I think they should justify why they use it.

Doing significantly worse than chance and replicating that would be just as supernatural, I think, as doing better than chance.

Your point is reasonable, but since the authors of the study stated that they were looking to confirm Sheldrake's results, I think the use of the one-tailed test is appropriate here.
 
This is great stuff, folks, thanks! Not being a statistician, I often wonder about the statistics but can't do the analysis.

Carry on ...

~~ Paul
 
Pesta said:
It paired the actual hits of the 6 psychics with their expected values:

Actual  Expected
12      9
9       9
8       9
10      8.75
13      8.75
11      9

This produced a t (11) of 2.012 with a p of .10 (one tailed) or .05 two tailed.
Did you mean to say .10 (two tailed) or .05 (one tailed)?

aggregating data without mentioning it.
Not sure what you're referring to here.

~~ Paul
 
Did you mean to say .10 (two tailed) or .05 (one tailed)?
I'm sure he did. I get the same values for the two-sample t-test - i.e. 0.10 for the two-tailed and 0.05 for the one-tailed test. But as I said before, this test is inappropriate and the results are meaningless regardless of whether it's a one-tailed or two-tailed test because the sample doesn't meet the requirements needed for this statistical test.
 
Sorry, I did invert the 1 versus 2 tail thingy.

Interesting stuff!

Paul: Aggregating data makes it seem like an effect is stronger than it might be (although it's a levels of analysis issue).

For example, the correlation between IQ and grades for individuals is .50. 25% (r squared) of the variance in grades is explained just by knowing your IQ (75% though is not explained!).

If we wanted to make that seem more impressive, we can aggregate the data. Let's look at 100 people with IQs of 80, 100 with IQs of 90, 100 with IQs of 100, etc.

Looking at mean GPAs for each group of 100, you would now get a perfect correlation between IQ and grades (100% of the variance in Mean GPA is explained just by IQ).

So, it depends on whether we're trying to predict a single person's GPA (far less accurate) or the mean GPA for a group of 100 people with identical IQs (much more accurate).

The study here did this, but far less dramatically than with my example above.
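If a concrete demonstration helps, here's a rough simulation sketch in Python (numpy only). The IQ/GPA numbers are made up purely to illustrate the aggregation effect, not taken from any real dataset:

import numpy as np

rng = np.random.default_rng(0)

# Individual-level data: groups of 100 people at IQs of 80, 90, 100, 110, 120,
# with noisy GPAs so the person-level correlation is moderate.
iqs = np.repeat([80, 90, 100, 110, 120], 100)
gpas = 1.0 + 0.02 * iqs + rng.normal(0.0, 0.6, size=iqs.size)
r_individual = np.corrcoef(iqs, gpas)[0, 1]

# Aggregated data: mean GPA per IQ group, correlated at the group level.
groups = np.unique(iqs)
mean_gpas = np.array([gpas[iqs == g].mean() for g in groups])
r_aggregated = np.corrcoef(groups, mean_gpas)[0, 1]

print(f"individual-level r: {r_individual:.2f}")
print(f"group-mean r:       {r_aggregated:.2f}")

With these made-up numbers, the person-level r lands somewhere in the .4-.5 range while the group-mean r comes out close to 1, which is exactly the inflation being described.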

B
 
Pesta said:
Paul: Aggregating data makes it seem like an effect is stronger than it might be (although it's a levels of analysis issue).
I'm not sure what specific aggregation you're referring to in the paper: the one that was done without mentioning it?

~~ Paul
 
I'm not sure what specific aggregation you're referring to in the paper: the one that was done without mentioning it?

~~ Paul

Oh, sorry, it was doing the t-test on the "total" column, versus using the peak and non peak data.

Doing it with all 12 values resulted in non-significance; doing it on the total column made it significant.
 
Pesta said:
Oh, sorry, it was doing the t-test on the "total" column, versus using the peak and non peak data.
This was mentioned, though, wasn't it? On page 94 it says "Pooling peak and non-peak conditions together, ...".

Edited to add: Oh, well, not really. That sentence is talking about the overall scoring rate.

Doing it with all 12 values resulted in non-significance; doing it on the total column made it significant.
What is the calculation when using all 12 values?

~~ Paul
 
I mentioned the issue with the paired t-test to one of the authors. She said they used it because the expected number of hits is affected by the number of invalid trials; it is not always 25%. Is this reasonable?

~~ Paul
 
I mentioned the issue with the paired t-test to one of the authors. She said they used it because the expected number of hits is affected by the number of invalid trials; it is not always 25%. Is this reasonable?

~~ Paul


Yes, the expected number of hits is affected by that, but the best way to deal with that would be to use the binomial distribution to compute the p-values, which can be adjusted to compensate for that problem. The paired sample t-test should not be used to compare sample data to expected values based on the hypothesized distribution. I think it might be reasonable to go with a single sample t-test, but I'd have to check into it. A chi-squared goodness-of-fit test would be a better choice than the t-test, but as I said previously, I don't know why they didn't just use the binomial distribution and make an exact computation of the p-value.
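If the concern is only that invalid trials shift the expected counts, the binomial handles it by keeping p = 0.25 and using each subject's number of valid trials as n. Here is a sketch in Python with scipy; the trial counts of 36 and 35 are inferred from the expected values (9 and 8.75) quoted earlier, so check them against the paper:

from scipy.stats import binomtest

# Per-subject exact one-tailed binomial p-values at chance (p = 0.25),
# with n reduced to the number of valid trials for each subject.
hits = [12, 9, 8, 10, 13, 11]            # per-subject hits (quoted earlier)
valid_trials = [36, 36, 36, 35, 35, 36]  # inferred from expected values of 9 and 8.75

for h, n in zip(hits, valid_trials):
    p = binomtest(h, n=n, p=0.25, alternative='greater').pvalue
    print(f"{h:2d} hits in {n} trials: one-tailed p = {p:.3f}")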
 
Wait... They used (observation, "expected") as pairs for a paired t-test?????

Anyway, suppose we include the two excluded trials (one for subject 4 in the peak hours, and one for subject 5 in non-peaks), can't we use the Wilcoxon signed rank test to detect a difference between peak and non-peak? In which case, the most significant possible scenario according to my quick computation, is p=.1, if that excluded peak hour observation is a hit.

So there is actually no discernible difference between peak and non-peak hours, and the total hits (peak + non-peak) aren't significantly different from chance either.

I'd still rather have tables of dice roll vs guesses. And a better experimental protocol in the first place...
 
Wait... They used (observation, "expected") as pairs for a paired t-test?????
Yes. That's what they did. At least, that's the test that gives the results they're claiming for that dataset. I felt the same way when I finally figured out what they'd apparently done.
Anyway, suppose we include the two excluded trials (one for subject 4 in the peak hours, and one for subject 5 in non-peaks), can't we use the Wilcoxon signed rank test to detect a difference between peak and non-peak? In which case, the most significant possible scenario according to my quick computation, is p=.1, if that excluded peak hour observation is a hit.
Yes, you could use the Wilcoxon signed rank test, but you don't need to. A paired t-test works fine for that situation. It has a p-value of 0.09.
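For anyone who wants to run either test themselves, the calls look like this in Python with scipy. The per-subject peak and non-peak splits below are made up just to show the mechanics (I haven't transcribed the paper's table), so the printed p-values won't match the 0.09 above:

from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-subject hit counts; replace with the paper's actual table.
peak_hits     = [8, 6, 2, 7, 9, 12]
non_peak_hits = [7, 4, 5, 3, 3, 5]

t_res = ttest_rel(peak_hits, non_peak_hits)
print(f"paired t-test:        t = {t_res.statistic:.3f}, p = {t_res.pvalue:.3f}")

w_res = wilcoxon(peak_hits, non_peak_hits)
print(f"Wilcoxon signed-rank: W = {w_res.statistic:.1f}, p = {w_res.pvalue:.3f}")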
So there is actually no discernible difference between peak and non-peak hours, and the total hits (peak + non-peak) aren't significantly different from chance either.

I'd still rather have tables of dice roll vs guesses. And a better experimental protocol in the first place...

I'm okay with dice rolls. It's a fine way to randomize. You have to realize you have people running the tests who are not only not statisticians, they aren't even engineers. It's a lot easier to instruct people to roll a dice and do x y z for values 1 2 3 than just about any other randomization scheme. Especially one you want to generate on the spot, not create in advance. I think the protocol was fine too. There is some possibility of cheating, but they acknowledge that and the data doesn't show any sign of it. If there had been cheating, they would have had better outcomes than the results they reported.
 
I'm okay with dice rolls. It's a fine way to randomize. You have to realize you have people running the tests who are not only not statisticians, they aren't even engineers. It's a lot easier to instruct people to roll a dice and do x y z for values 1 2 3 than just about any other randomization scheme. Especially one you want to generate on the spot, not create in advance. I think the protocol was fine too. There is some possibility of cheating, but they acknowledge that and the data doesn't show any sign of it. If there had been cheating, they would have had better outcomes than the results they reported.

You misunderstood me. I have no quarrel with randomizing through dice rolls, though I do feel the observer should be the one picking up the phone and verifying who the caller is.

My other issue was in part with the analysis, using hits vs. misses (rather than hits vs. number of trials). The data can be seen as pairs of (die roll = caller, guess). So one could make a 4 x 4 table with the pair counts. Column and row totals are not fixed, so you do a chi-squared test for independence. That's sort of the natural approach I would have taken. No assumption on whether the dice and the guesses are discrete uniforms is required.
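That approach is easy to run as well. Here is a sketch in Python with scipy; the 4 x 4 table of (actual caller, guessed caller) counts below is invented for illustration, since the paper's tallies aren't reproduced here:

import numpy as np
from scipy.stats import chi2_contingency

# Rows: actual caller (die roll) 1-4; columns: guessed caller 1-4.
# These counts are hypothetical; use the paper's real pair counts.
table = np.array([
    [15,  9, 14, 12],
    [11, 16, 13, 14],
    [13, 12, 17, 11],
    [12, 13, 12, 16],
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, df = {dof}, p = {p:.3f}")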
 
I mentioned the issue with the paired t-test to one of the authors. She said they used it because the expected number of hits is affected by the number of invalid trials; it is not always 25%. Is this reasonable?

~~ Paul

I'm still trying to figure out the implications of it.

Pure speculation, but it seems odd to me that a person would have the "sophistication" to know that having the expected values be slightly different across subjects would create a problem, but then to solve the problem with a paired t-test.

The paired t-test looks at whether the mean of the difference scores (each subject's actual hits minus his/her expected value) is significantly greater than zero.

At some level, this seems appropriate. If significant, it would indicate that actual responses were better than chance responses. That's indeed what they want to test.

But, the conditions for using the paired t-test are not met here. Each person has to contribute data to both cells. Here, the subject contributes data to one cell (her hit rate) and that's paired off with the expected value (i.e., not real subject-generated data) for that subject.

So, it's inappropriate, but does it lead to the wrong conclusion? (perhaps Beth can help here, but all stats are based on the general linear model, and many stats are equivalent-- for example, a regular t-test is identical to a correlation between the groups and the outcome variable). I'm wondering why the paired samples t-test gave a different result than the binomial test.

The proper, binomial test asks the question: Given an expected value of .25 x the number of trials, what's the probability of getting N_actual number of hits, assuming nothing but chance operates?

I think the paired t-test asks the question:
After subtracting out chance from each person's actual performance, is there anything left (i.e., greater than zero)?

The questions seem equivalent, but the results were not-- the binomial test was NS, whereas the t-test was significant. I'm wondering why that's the case (I guess though it proves-- within rounding error?-- that a paired t-test using expected values as one pair is not equivalent to a binomial test on the same data)?
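One easy way to see the gap is to run both calculations on the numbers quoted upthread. Here is a Python sketch with scipy; the 214-trial total is inferred from the expected values (9 implies 36 trials, 8.75 implies 35), so treat that as an assumption:

from scipy.stats import ttest_rel, binomtest

hits     = [12, 9, 8, 10, 13, 11]     # per-subject hits, as quoted upthread
expected = [9, 9, 9, 8.75, 8.75, 9]   # per-subject expected hits at 25%

# Reproduce the paper's pairing of actual hits with expected values.
t_res = ttest_rel(hits, expected)
print(f"paired t-test:  t = {t_res.statistic:.3f}, two-tailed p = {t_res.pvalue:.3f}")

# Exact binomial test on the pooled counts (214 valid trials is inferred).
b_res = binomtest(sum(hits), n=214, p=0.25, alternative='greater')
print(f"exact binomial: one-tailed p = {b_res.pvalue:.4f}")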

Not sure this post is interesting enough to respond to, but there it is!
 
You misunderstood me. I have no quarrel with randomizing through dice rolls, though I do feel the observer should be the one picking up the phone and verifying who the caller is.

My other issue was in part with the analysis, using hits vs. misses (rather than hits vs. number of trials). The data can be seen as pairs of (die roll = caller, guess). So one could make a 4 x 4 table with the pair counts. Column and row totals are not fixed, so you do a chi-squared test for independence. That's sort of the natural approach I would have taken. No assumption on whether the dice and the guesses are discrete uniforms is required.

Okay. Sorry I misunderstood you. Actually, they did do an analysis of that sort. It wasn't of much interest to me, so I didn't examine it closely. But if you are interested, why not take a look at the original paper? I'd be interested to hear your opinion of it.


I'm still trying to figure out the implications of it.

Pure speculation, but it seems odd to me that a person would have the "sophistication" to know that having the expected values be slightly different across subjects would create a problem, but then to solve the problem with a paired t-test.
Not to me. You get into the social sciences, and even the graduate students rarely take more than a semester of statistics. They often seem to end up with a vague idea of what's going on but not enough real expertise to pick the right test. I frequently saw that sort of error in student projects when I was teaching. But I'm surprised that it passed peer review.

So, it's inappropriate, but does it lead to the wrong conclusion? (perhaps Beth can help here, but all stats are based on the general linear model, and many stats are equivalent-- for example, a regular t-test is identical to a correlation between the groups and the outcome variable). I'm wondering why the paired samples t-test gave a different result than the binomial test.
It's not really that different. I think it was a p-value of ~ 0.08 compared to 0.05. The problem is that the paired sample t-test is assuming a t-distribution which has a mean of zero and a standard deviation of 1. The test statistic, properly computed, will have that distribution. The test statistic they computed does not.

That's why the chi-squared test is the best choice to check a sample distribution against a theoretical one. That statistic, properly computed, will have a chi-squared distribution. In many cases, the results won't differ that much, but there's no guarantee. That's why the values computed are meaningless.
The proper, binomial test asks the question: Given an expected value of .25 x the number of trials, what's the probability of getting N_actual number of hits, assuming nothing but chance operates?

Basically, that's right. Technically, in this case we compute the probability of getting >= N hits. It's an exact computation, so it's the best one to use.
Not sure this post is interesting enough to respond to, but there it is!

You bet! I love discussing the finer points of test selection. I miss the captive audience I had when I was teaching this stuff. Actually, I just miss teaching. :( Anyway, I enjoyed the chance to explain about the different tests and why you use different tests for different situations.

I clipped a couple of questions I thought were answered earlier, but if you still have any questions I'd be happy to expound on it at length.
 
Beth said:
It's not really that different. I think it was a p-value of ~ 0.08 compared to 0.05. The problem is that the paired sample t-test is assuming a t-distribution which has a mean of zero and a standard deviation of 1. The test statistic, properly computed, will have that distribution. The test statistic they computed does not.
Expound, please! Tell me more about a t-distribution and why this data does not fit it. Assume I'm pretty dumb about it, cuz, like, that would be the correct assumption.

That's why the chi-squared test is the best choice to check a sample distribution against a theoretical one. That statistic, properly computed, will have a chi-squared distribution. In many cases, the results won't differ that much, but there's no guarantee. That's why the values computed are meaningless.
And now tell me about the chi-squared test. Do your calculations agree with Dakota's in post #16?

~~ Paul
 
Expound, please! Tell me more about a t-distribution and why this data does not fit it. Assume I'm pretty dumb about it, cuz, like, that would be the correct assumption.

Thanks for asking. Student's t-distribution is used when making estimates based on the sample statistics. The test statistic is computed by taking a normally distributed variable (such as the sample mean), subtracting the expected mean (recentering it to zero), and dividing by the square root of a chi-squared distributed variable (such as the sample standard deviation). The distribution of this statistic has been exhaustively studied and is used extensively in hypothesis testing. It's similar to a normal distribution with a mean of 0 and standard deviation of 1, but the tails are a bit thicker.

The required assumption is that the sample mean has a normal distribution. This is actually a pretty easy assumption to meet as the sample mean is known to follow a normal distribution when the sample size is large enough or if the underlying population the sample is drawn from has a normal distribution.
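In symbols, the one-sample version of the statistic Beth is describing takes the standard textbook form (nothing specific to this paper):

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

where x̄ is the sample mean, μ₀ the hypothesized mean, s the sample standard deviation, and n the sample size; the statistic has n - 1 degrees of freedom.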

You can find out a bit more about it here:
http://mathworld.wolfram.com/Studentst-Distribution.html

The problem is that the dataset in the paper we're discussing doesn't have a normal distribution, it has a binomial distribution. There is a normal approximation to the binomial distribution, but it requires the application of a correction formula to the dataset. That wasn't done, and it's generally advised to only use the approximation when np>5 (or 10, depending on the textbook used :) ) - a criterion that isn't met in this case.
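For reference, the usual continuity-corrected approximation is the standard textbook one (again, not something taken from the paper):

P(X \ge k) \approx 1 - \Phi\!\left( \frac{k - 0.5 - np}{\sqrt{np(1-p)}} \right)

where Φ is the standard normal CDF, n the number of trials, p the chance hit rate, and k the observed number of hits. The np > 5 (or 10) rule of thumb, together with n(1-p) > 5, is what guards against leaning on this approximation when the binomial is too skewed.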

To compound the problem, they used a paired t-test which makes even more assumptions about the dataset - i.e. that there are two measurements on each experimental unit in the sample and that the difference between them will follow a normal distribution. But the difference between the actual result of a sample from binomial distribution and the expected mean of that distribution - let's just say it's not going to be a normal distribution, particularly given the small sample they had (6 subjects).

Anyway, a statistic computed when the required assumptions are not met is not going to give the proper p-value. If I wanted to go to a great deal of work, I could analyze the statistic that was computed and figure out what the expected variance would be, what direction that would push the p-value in, etc., but it's not an easy analysis for me to do and I don't feel it's worth my time to bother with it.

And now tell me about the chi-squared test. Do your calculations agree with Dakota's in post #16?

I set up and ran a slightly different chi-squared test, so they don't agree but that doesn't mean his numbers are wrong. He did a 2x2 contingency table which is fine. I broke it down a bit differently and the result was the same - not statistically significant.

The chi-squared test was developed specifically to compare a sample against a theorized distribution. You take the actual value, subtract the expected value, square that difference, divide by the expected value, and then sum those terms over the categories. I know that sounds complex (it is!) but it's well established theoretically what the distribution of the resulting statistic is: chi-squared!
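In code, that computation looks like this (a sketch in Python; scipy's chisquare does the squaring, dividing, and summing over the categories, and the 63 hits / 151 misses totals are inferred from numbers quoted earlier in the thread):

from scipy.stats import chisquare

# Goodness-of-fit on the pooled hit/miss counts against the 25% chance rate.
# The observed totals are inferred from the thread; check them against the paper.
observed = [63, 151]                  # hits, misses (214 valid trials)
expected = [0.25 * 214, 0.75 * 214]   # 53.5 expected hits, 160.5 expected misses

result = chisquare(observed, f_exp=expected)
print(f"chi-squared = {result.statistic:.2f}, df = 1, p = {result.pvalue:.3f}")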

The detailed proofs that the t and chi-squared test statistics have the distributions I'm claiming take several sessions in a graduate-level statistics course to cover, so I'm not going to get into it here. But that's what a lot of higher-level statistics is all about - figuring out the distribution of a test statistic that allows us to make the inference we want with a specified level of confidence.

The chi-squared test also makes assumptions and has requirements for the data, somewhat different than the requirements for the t-test. I really feel that computing the exact p-value and/or a confidence interval for the proportion of hits using the binomial distribution is the best choice in this situation, not the chi-squared test.


Well, that's as much of a statistics lecture as I have time to type this morning. I hope you found it interesting.
 
Thanks Beth, I'm significantly enlightened. Just to bother you with another question:
Beth said:
The problem is that the dataset in the paper we're discussing doesn't have a normal distribution, it has a binomial distribution.
How do you look at the data to decide between these two distributions?

~~ Paul
 
