
Telephone telepathy data: Statisticians needed

Paul C. Anagnostopoulos

I have received the raw data for a telephone telepathy experiment conducted in 2003. I'm interested in testing a few ideas I have about the experiment, but I need statisticians to help me out, because I am no statistician.

The first thing I want to do is verify the statistics given by the authors in their paper, which can be found here:

http://www.sheldrake.org/articlesnew/pdf/Lobach.pdf

There is not much data, so it shouldn't be a difficult job. I'm perfectly happy if you point me to the appropriate statistical tests and a calculator to perform them; I'll do the work myself.

It's rare that we get the raw data for a psi experiment, so this should be interesting.

~~ Paul
 
I have received the raw data for a telephone telepathy experiment conducted in 2003. I'm interested in testing a few ideas I have about the experiment, but I need statisticians to help me out, because I am no statistician.
Well, I'm not sure if I qualify as a statistician per se, but I do have some experience with statistical analysis*

http://www.sheldrake.org/articlesnew/pdf/Lobach.pdf
From the paper:
Testing individual scoring rates of the participants against expected scoring rates (25%), a paired-samples t-test shows that scoring rates are significantly above chance, t(5)=2.01, p=0.05 (one-tailed)

I'm not sure why they would use a paired-samples t-test here, when they used a chi-squared test in the previous section to determine whether the dice were random:
For the guesses, 52 (29%) of the 179 consecutive pairs contained similar numbers, more than expected according to chance, but again the difference was not statistically significant, χ²(1) = 1.47, p = .23.

(but, to be honest, I've only glanced over the results, haven't taken the time to really consider the paper)
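
For what it's worth, a check like that pairs figure can be set up as a one-dimensional chi-square goodness-of-fit test in R. A sketch only; I haven't confirmed it reproduces the paper's 1.47 exactly (the authors may have applied a continuity correction or used slightly different counts):

# 52 of 179 consecutive pairs matched; under chance, 1/4 of pairs should match
observed <- c(match = 52, no_match = 179 - 52)
chisq.test(observed, p = c(0.25, 0.75))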

An ANOVA would probably be appropriate, I think, to determine whether the number of hits differs between peak and off-peak hours.
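
Something along these lines, perhaps (a sketch only; the per-participant peak/off-peak hit counts below are made-up placeholders, since the real ones would have to come from the raw data):

# hypothetical layout: one row per participant per period (placeholder values)
dat <- data.frame(
  participant = factor(rep(1:6, each = 2)),
  period      = factor(rep(c("peak", "offpeak"), times = 6)),
  hits        = c(6, 5, 5, 4, 4, 4, 6, 4, 7, 6, 6, 5)
)
# repeated-measures ANOVA: period as a within-subjects factor
summary(aov(hits ~ period + Error(participant/period), data = dat))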

I'm also wondering if a Bayesian analysis wouldn't be more appropriate.
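
A minimal Bayesian sketch, assuming a flat Beta(1, 1) prior on the hit probability and using the overall totals discussed later in the thread (63 hits in 214 trials):

hits <- 63; trials <- 214
post_a <- 1 + hits              # posterior is Beta(1 + hits, 1 + misses)
post_b <- 1 + (trials - hits)
1 - pbeta(0.25, post_a, post_b)          # posterior probability the hit rate exceeds chance
qbeta(c(0.025, 0.975), post_a, post_b)   # 95% credible interval for the hit rate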


* Most of my graduate training was in molecular biology - not much statistical analysis there. But I've been working on software to support crop science research and I've had to learn a lot more statistical methods.

There is not much data, so it shouldn't be a difficult job. I'm perfectly happy if you point me to the appropriate statistical tests and a calculator to perform them; I'll do the work myself.

We also intend to support the R software package with our next software release. R is an open-source statistical package, and quite powerful, actually. So I've been writing literate programs in R (literate programming is a way to mix code and documentation into something like literature) to document the analysis methods properly, partly to make sure that I'm generating correct code and partly to explain the analysis to other users.
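
For anyone unfamiliar with the idea, a literate R document (Sweave, in my case) looks roughly like this: LaTeX prose with embedded R chunks that run when the document is built. The binom.test call here is just an illustrative placeholder, not a settled choice of analysis:

\documentclass{article}
\begin{document}
<<overall-hit-rate>>=
hits <- 63; trials <- 214
binom.test(hits, trials, p = 0.25)
@
There were \Sexpr{hits} hits in \Sexpr{trials} trials
(\Sexpr{round(100 * hits / trials, 1)}\% versus the 25\% expected by chance).
\end{document}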

Anyway, I'd be willing to generate a literate document detailing an analysis of the data, if you'd forward it to me (dakotajudo@mac.com).

Uh, on second thought, I might not need the raw data - I may be able to run chi-squares and ANOVA on the tables in the paper. I'll look in more detail this evening.
 
I have received the raw data for a telephone telepathy experiment conducted in 2003. I'm interested in testing a few ideas I have about the experiment, but I need statisticians to help me out, because I am no statistician.

The first thing I want to do is verify the statistics given by the authors in their paper, which can be found here:

http://www.sheldrake.org/articlesnew/pdf/Lobach.pdf

There is not much data, so it shouldn't be a difficult job. I'm perfectly happy if you point me to the appropriate statistical tests and a calculator to perform them; I'll do the work myself.

It's rare that we get the raw data for a psi experiment, so this should be interesting.

~~ Paul

Paul, I'm a professional statistician. I've sent you a pm with my email address.

Beth
 
I've sent the spreadsheet to Dakota and Beth. Thanks!

The more, the merrier, though. More statisticians need apply. A discussion here about the appropriate statistical methods is certainly interesting. Remember, I was hoping to settle this without avoiding some sort of argument. :D

~~ Paul
 
Oh, by the way. I suppose we should analyze the data in four different ways, as done in the paper: all sessions, peak and nonpeak; and regular (valid) sessions, peak and nonpeak. To be honest, though, I'm not much interested in the peak vs. nonpeak thing.

~~ Paul
 
They should've used a d4! And the "visiting experimenter" should be the one identifying the caller...

/D&D nerd
//technically a statistician too, I'm not sure I want to look at the data

ETA: I might also add this isn't exactly raw data either. I'd like to see the full guess/actual caller pairs. Hits vs number of trials, not so much. And what are those "peak hours" anyway?
 
Jorghnassen said:
ETA: I might also add this isn't exactly raw data either. I'd like to see the full guess/actual caller pairs. Hits vs number of trials, not so much. And what are those "peak hours" anyway?
Indeed, the paper doesn't have the raw data. That's why the authors sent me a spreadsheet. If you decide you'd like to look at it, PM me your email address.

~~ Paul
 
Drkitten said:
I may or may not have some time in the near-term future to work on it. PM me with details and I'll see what I can manage.
I'll post a few more details here, Drkitten.

The first thing I want to do is verify that the authors have chosen reasonable statistical methods to analyze the data. There isn't much of it, only 216 trials in all. The summaries are given in the paper, the full details on a spreadsheet they sent to me.

At the same time, if the authors chose reasonable methods, I'd like to verify their results. If, on the other hand, different methods would be better, what are the results using the better methods?

I have other ideas I want to check out, but there is no reason to do so unless the results are verifiable.

~~ Paul
 
I haven't done a full analysis, just did some exploratory work, playing around with R code (not sure if I made this clear, but part of my motivation for being involved is that I'm a code geek and I'm trying to learn R).

Anyway, I generated a table of callers (not sure if it will format correctly):

Caller:  1   2   3   4
Count:  58  53  44  63


Note that this includes some observations not included in the paper.
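
Checking that table for departure from a uniform 25% split is a one-liner in R (a sketch; since these counts include trials the paper excluded, the statistic won't match the paper's figure exactly):

callers <- c(58, 53, 44, 63)   # times each caller (1-4) was selected by the dice
chisq.test(callers)            # goodness-of-fit against equal expected frequencies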

The problem is not so much one of analysis but of experimental design - it was a randomized design, but it was not a complete (balanced) design.


My thinking on this is an analogy to multiple choice exams. Consider an exam of 100 questions with four options for each question (A, B, C, D). Ideally, each answer appears the same number of times in the exam, 25 of each. Otherwise, if students knew an instructor's tendencies (myself, I think I tend to prefer C as the correct answer when writing multiple choice questions by hand), they could gain an advantage by choosing the instructor's preferred answer instead of randomly guessing when they don't know the answer.

Alternatively, with a balanced test, answering all A, or all B, etc., would result in the same test score.

In the experiment, while participants had a 1/4 chance of guessing the correct caller on any single call, they could have achieved results significantly different from chance (if I ran the chi-square correctly) simply by guessing 3 or 4 every time.

It seems that there is an extra source of error in the data that, for now, goes beyond my skills of analysis. The methods I'm most familiar with assume balance in the design.

The authors address this on the bottom of p. 93:

To acquire an indication of the "randomness" of the dice throws and the participants' guesses, we checked the frequency distributions for all trials and all consecutive pairs of trials.

The frequency distribution of dice throws and of the participants' guesses did not differ from the expected 25% for each of the four options, χ²(3) = 3.53, p > .3 and χ²(3) = 4.51, p > .2 respectively.

I think the authors are misusing the concept of the null hypothesis; but instead of trying to comment myself at this point (I should get back to work), I'd recommend this:
http://ije.oxfordjournals.org/cgi/content/full/32/5/693 (and the linked references)

PS. Also found this reference in my collection; it may be more reader friendly:

http://www.annals.org/cgi/content/full/130/12/995
 
Okay, I've reviewed the data and the statistics computed in this paper (or at least some of them). There are some statistical errors.

From the Abstract:
Analyses show a significant over-all scoring rate of 29.4% (p = .05). Almost all of this effect originates from the sessions at peak time with a scoring rate of 34.6%. Exploratory analyses show that a stronger emotional bond between participant and caller is associated with a higher hitrate. It is concluded that results provide tentative support for the hypothesis that Local Sidereal Time is related to a phenomenon like telephone telepathy. In addition, the results are in support of the existence of telephone telepathy.

The overall scoring rate is 29.4% (63 hits out of 214 trials). I computed the probability of this result using a binomial distribution with n = 214, p = .25 and s = 64. I get a p-value of .07944, which means it is not significant at p = 0.05. I computed the 95% one-sided confidence interval for the proportion of hits from this data as (.2435, ∞) , which includes 0.25. However, the results are significant at the 90% confidence ( p = 0.10) level.
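
For anyone who wants to rerun this kind of exact binomial check, a sketch in R (I haven't verified that it matches the figures above to the last digit):

# one-sided exact binomial test: 63 hits in 214 trials against chance (25%)
# also reports the one-sided confidence bound for the hit proportion
binom.test(63, 214, p = 0.25, alternative = "greater")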

The data provided did not include information about emotional bond between participant and caller, so I can’t comment on any of the analysis regarding that.

In the paper, they went into a great deal of work regarding checks on the randomness of the dice. I didn’t check those numbers. I was willing to assume they weren’t using loaded dice in determining who the caller would be.

Table 1: The data in Table 1 is correct and matches the raw data I received. They said:
Testing individual scoring rates of the participants against expected scoring rates (25%), a paired-samples t-test shows that scoring rates are significantly above chance, t(5)=2.01, p=0.05 (one-tailed) when all sessions are included (Table 1).

The paired sample t-test is not the appropriate test to use in this situation. This test requires two samples from each experimental unit. They don’t have that here. They are comparing actual results with what they expected assuming a binomial distribution with a probability of 0.25 on each trial. The results they quote here are meaningless because the underlying assumptions of the test have not been met.

Table 2: I did not analyze the data in Table 2. They stated that in some cases the protocol was not perfectly followed for some reason; for example, not all four callers were available as they were supposed to be. When they felt the protocol was not acceptably followed in a session, they dropped that session from the analysis, and Table 2 provides those results.

Difference between peak and non-peak time periods: They said
A paired-samples t-test showed that scoring rates were marginally significantly different between peak and non-peak condition, t(5)=1.60, p=.09 (one-tailed).

Here, a paired sample t-test is appropriate. When I ran this test with the data provided, I also got p = 0.09.

I don't think there was any deliberate falsification here. They just didn't do the stats right. The most significant finding from the data is that the p-value for the peak sessions is 0.017.
 
Thanks for your analysis, Beth. Why do you think your overall score p value is different from theirs? Could you give us a more detailed explanation of the problem with the paired-samples t test?

~~ Paul
 
Did you contact the authors? What do they say in response to your findings?
 
Hm, I just forwarded some results to Paul.

The overall scoring rate is 29.4% (63 hits out of 214 trials). I computed the probability of this result using a binomial distribution with n = 214, p = .25 and s = 64. I get a p-value of .07944, which means it is not significant at p = 0.05. I computed the 95% one-sided confidence interval for the proportion of hits from this data as (.2435, ∞) , which includes 0.25. However, the results are significant at the 90% confidence ( p = 0.10) level.
I set up some chi-square tests in R to duplicate the tests reported by the authors - and I got the same values. I wasn't sure why they didn't do a chi-square on this, but I ran one myself, 151:63 against an expected ratio of 3:1, and got a chi-square = 2.2492 and p = 0.1337.
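
That chi-square can be set up in R along these lines (a sketch):

# 63 hits to 151 misses, tested against the expected 1:3 split
# should give X-squared of about 2.25 on 1 df, p of about 0.13
chisq.test(c(63, 151), p = c(0.25, 0.75))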

I don't think there was any deliberate falsification here. They just didn't do the stats right. The most significant finding from the data is that the p-value for the peak sessions is 0.017.
For that, I get a chi-square probability of 0.022.

But, as I noted above, the lack of balance in the design concerns me more than the analysis.
 
I figured out what they did. Odd and likely inappropriate at best, and purposely presenting the "evidence" in the best light at worst (i.e., moderate cheating).

The paired sample t-test they report was done on the total column for table 1.

It paired the actual hits of the 6 psychics with their expected values:

12,9
9,9
8,9
10,8.75
13,8.75
11,9

This produced a t(5) of 2.012 with a p of .05 (one-tailed) or .10 (two-tailed).

Were I a skeptic, I'd speculate they wanted to do a paired t-test because it has more statistical power. Ironically, the way they did it backfired. I've been doing this almost 20 years, and this is the only example I've come across where the between group t test would actually be more powerful (moot, as using a t-test here is inappropriate anyway).

By using the expected values as one member of each "pair," the within-group variance for that group was tiny (making it much easier to get a significant difference on the actual hit rate, but only if you use the between-groups t!). So much so that the between-groups t is significant, t(10) = 2.068!
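
Both versions are easy to check in R from the Table 1 totals listed above (a sketch):

actual   <- c(12, 9, 8, 10, 13, 11)    # hits per participant, total column
expected <- c(9, 9, 9, 8.75, 8.75, 9)  # expected hits per participant
t.test(actual, expected, paired = TRUE, alternative = "greater")     # paired version: t about 2.01 on 5 df
t.test(actual, expected, var.equal = TRUE, alternative = "greater")  # pooled two-sample version: t about 2.07 on 10 df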

That is indeed irony only a geeky statistician can appreciate.

Again, being a skeptic, why did they aggregate the data (use the total column) instead of keeping the peak and non-peak times separate? Because the latter was not significant, t(11) = 1.45, p = .174.

So, flaws so far:

using a more powerful t-test when it's not justified.

using a more powerful one tailed test when two tails seem appropriate.

Not reporting DF, making it hard for anyone to replicate their analyses.

aggregating data without mentioning it.

A mistake in the table under non-peak (hits add to 26 not 27).

Dropping the ball completely on what analysis to do-- very strange to use an expected value as a pair for a correlated t; never seen it done before (it's not actual data!).

they have a 2 x 2 mixed factorial design here with peak versus non peak as a within subjects variable and psychic as a between subjects variable. The DV is technically not completely continuous but I'm ok with doing an ANOVA.

ANOVA results in the next post...
 
Oop, would need the raw data to do the anova.

I tried replicating the t-test they report for paired versus not-paired.

I got t=1.534, p = .186. I suspect it's because I used the values in the table (which have the mistake of adding to 27, when there's only 26 non-peak hits) and the people above musta used the raw data, which I'm guessing show 27 hits for the non-peak variable?

The mean for nonpeak is actually somewhat below chance; the mean for peak, though, is above (6.17 actual versus about 4.5 expected).

The way it's written, it wouldn't merit publication in anything but a D journal. Had they done the stats correctly, a C journal might take it.

Being open minded, they may have something with peak times, but it looks like only two psychics are any good. I'd follow up with those two and do some intensive testing. Once that replicates, off to get Randi's million (though I suspect the two all-stars here will show more chance-like performance in a replication).
 
Thanks for your analysis, Beth. Why do you think your overall score p value is different from theirs? Could you give us a more detailed explanation of the problem with the paired-samples t test?

~~ Paul


I think the overall score p-value is different because they didn't compute the statistics properly. I think bpesta is correct about what they did instead.

The problem with the paired samples t-test is that they weren't applying it to a set of paired samples. You can read up about it here: http://mathworld.wolfram.com/Pairedt-Test.html

The gist of the problem is the sample data doesn't match the conditions required for the test - i.e. they don't have two measurements from each of the subjects being compared.

Did you contact the authors? What do they say in response to your findings?

No. I haven't had any contact with them. Paul, who got the data from them, might want to communicate the comments back to the authors. Or you could.

Hm, I just forwarded some results to Paul.

I set up some chi-square tests in R to duplicate the tests reported by the authors - and I got the same values. I wasn't sure why they didn't do a chi-square on this, but I ran one myself, 151:63 against an expected ratio of 3:1, and got a chi-square = 2.2492 and p = 0.1337.

For that, I get a chi-square probability of 0.022.
There are a number of ways to set up the chi-squared analysis. I did one looking at the distribution of the number of hits per session. The results were similar to the binomial computations. Personally, I think the binomial computation is the better choice. Since the null hypothesis is no effect regardless of session time or subject, the data should fit a binomial distribution with p = 0.25 under the null. The binomial can be precisely computed for this situation and an exact p-value determined.
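
The exact tail probability is a one-liner in R, for example (a sketch):

# probability of 63 or more hits in 214 trials when p = 0.25
1 - pbinom(62, size = 214, prob = 0.25)
sum(dbinom(63:214, size = 214, prob = 0.25))   # same quantity, summed directly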

But, as I noted above, the lack of balance in the design concerns me more than the analysis.

I'm not concerned about that. Anytime you do a random selection without restrictions, you are likely to end up with a slight imbalance. It is unlikely to have a significant impact on the results.

I figured out what they did. Odd and likely inappropriate at best, and purposely presenting the "evidence" in the best light at worst (i.e., moderate cheating).

The paired sample t-test they report was done on the total column for table 1.

It paired the actual hits of the 6 psychics with their expected values:

12,9
9,9
8,9
10,8.75
13,8.75
11,9

This produced a t(5) of 2.012 with a p of .05 (one-tailed) or .10 (two-tailed).
I think you are correct, at least, that's the conclusion I came to about what they did. It's an inappropriate use of the two-sample test.
...[snip the detailed stat analysis]..
That is indeed irony only a geeky statistician can appreciate.
Indeed :)
So, flaws so far:

using a more powerful t-test when it's not justified.
agreed.
using a more powerful one tailed test when two tails seem appropriate.
No, I think the one-tailed test is appropriate. They are really only interested in whether or not the guesses perform better than chance, not worse.
Not reporting DF, making it hard for anyone to replicate their analyses.
Agreed
aggregating data without mentioning it.
A mistake in the table under non-peak (hits add to 26 not 27).
You're right. I missed that mistake, sorry. Even though my computations showed 26 there, I didn't notice they had 27 instead.
Dropping the ball completely on what analysis to do-- very strange to use an expected value as a pair for a correlated t; never seen it done before (it's not actual data!).
Agreed. In my opinion, the most serious mistake, but also an easy error for a non-statistician to make since it involved checking that the underlying assumptions of the test are met.
they have a 2 x 2 mixed factorial design here with peak versus non peak as a within subjects variable and psychic as a between subjects variable. The DV is technically not completely continuous but I'm ok with doing an ANOVA.

ANOVA results in the next post...

Since I have the raw data, I'll try to run an ANOVA and post the results later.

Thanks for the technical review. Nice work, you spotted stuff I missed.

Beth
 
Beth, thanks!

I wonder about the one tail being appropriate. In my world, sometimes psychic powers work against you. For example, I imagine a raw/untrained psychic who hasn't yet harnessed his/her true powers might get things backwards, sorta like a young Harry Potter. Misinterpreting the signal that it's not Joe calling as "it is Joe calling" seems about as likely as being able to tell who's calling anyway. Given that the one-tailed test makes it easier to find something, I think they should justify why they used it.

Doing significantly worse than chance and replicating that would be just as supernatural, I think, as doing better than chance.
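
For concreteness, here is the practical difference using the t value discussed upthread (a sketch):

1 - pt(2.012, 5)           # one-tailed p, just over .05
2 * (1 - pt(2.012, 5))     # two-tailed p, about .10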
 
