psychics

Okay, I'll make you happy by retracting my comment about a "large effect" as technically inaccurate, but why the focus on the strength of the effect if the likelihood of the results occurring by chance is only .48%? (I understand that you disagree with that percentage, but p=.0048 was the figure used by the authors, following the same methodology used by Milton and Wiseman in their previous article.)

I don't disagree with that figure. What I pointed out is that the figure depends upon including the results of a single study with results that are so different from the rest that it would be difficult to reasonably assume it is measuring the same thing as the rest of the studies. If the results of that study aren't included, the p-value for the meta-analysis is no longer statistically significant. The 'success' of the ganzfeld for measuring psi is essentially resting on a single, highly questionable, unpublished study.

Again, because I'm interested in the likelihood of the results occurring by chance. You seem to think that obtaining 440 hits in 1533 trials with a hit probability of 25% is within the range of what would be expected by chance. I don't think that's true, but I'd like to hear Beth's viewpoint.

That didn't answer my question.

And that evidence is . . .

Quit being coy.

I would simply note that some prior ganzfeld experiments have also shown high hit rates. (I know -- you think they weren't sufficiently tightly controlled, but it's unclear to me whether that made a difference.)

Right. You don't know whether it made a difference. And since the 'proof' of psi is really about eliminating those things that may have made a difference, you don't know whether that burden has been met. Which is the whole point of the criticism leveled against claims that psi is proven.

The authors were responding to Milton's and Wiseman's prior article, which was not exactly a study in humility, either. ;)

That's the problem with starting down this path. The ganzfeld data is too heterogeneous to combine. The guidelines for meta-analysis would say ditch the idea of combining the data ("an unwise meta-analysis can lead to highly misleading conclusions"). However, once the Battle of the Meta-analyses began, no one seemed able to stop, and so on and on we go. You can't blame Milton and Wiseman for the authors' studied indifference to just how sensitive their conclusions were to their assumptions, though.

Speak for yourself. (With the exception of my "large effect" comment, which was the first (minor) error I've ever made in my whole entire life.) ;)

I was.

Linda
 
I don't disagree with that figure. What I pointed out is that the figure depends upon including the results of a single study with results that are so different from the rest that it would be difficult to reasonably assume it is measuring the same thing as the rest of the studies. If the results of that study aren't included, the p-value for the meta-analysis is no longer statistically significant. The 'success' of the ganzfeld for measuring psi is essentially resting on a single, highly questionable, unpublished study.
First, we don't know whether the "single, highly questionable, unpublished study" was flawed.

Second, even excluding that study, in the other 39 studies under consideration there were 440 hits in 1533 trials, a hit rate of 28.7%, versus the expected hit rate of 25%. Now, I understand that the p value falls slightly outside the range of statistical significance using a Stouffer Z test, but I question whether that is the best test in this case, if what is being tested is the hypothesis that psi exists. Rather, it seems to me that a binomial model is more appropriate to test that hypothesis and, under that model, the results are easily statistically significant. However, I would like to get the thoughts of Beth or some other professional statistician on this matter. (Beth, by the way, informs me that she is busy right now, but will eventually weigh in on this thread.)
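For illustration only, here is a rough sketch of the binomial calculation I have in mind. It simply pools all 1533 trials into one binomial sample, which assumes the studies are comparable enough to combine that way:

```python
# Rough sketch only: treats all 1,533 trials as one pooled binomial sample,
# which assumes the studies are homogeneous enough to combine in this way.
from scipy.stats import binom

hits, trials, chance = 440, 1533, 0.25
hit_rate = hits / trials                              # about 28.7%
p_one_sided = binom.sf(hits - 1, trials, chance)      # P(X >= 440) under pure chance
print(f"Hit rate {hit_rate:.1%}, one-sided binomial p = {p_one_sided:.2e}")
```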

Third, there is the matter of standard versus non-standard ganzfeld replications. If the analysis Bem, Broughton, and Palmer did was proper, the hit rate for standard ganzfeld replications is even more significant, at 31.2% (and 29.5% excluding the "questionable unpublished study"). Again, I would like to get the thoughts of Beth or some other professional statistician on the propriety of the authors' "standard versus non-standard" analysis.

That didn't answer my question.
See above.

Quit being coy.
I'm not; I just want to know what your belief that there is a ganzfeld "file drawer problem" is based upon.

Right. You don't know whether it made a difference. And since the 'proof' of psi is really about eliminating those things that may have made a difference, you don't know whether that burden has been met. Which is the whole point of the criticism leveled against claims that psi is proven.
The point is that, if later, more tightly-controlled ganzfeld experiments show similar results to the earlier experiments, the likelihood is that the results of the earlier experiments were not affected by lack of tight controls. In other words, there may have been a potential problem with earlier experiments, but there is no evidence that translated into skewed results.

That's the problem with starting down this path. The ganzfeld data is too heterogeneous to combine.
So no meta-analysis is possible, even if the great majority of ganzfeld experiments produce hit rates higher than 25%? What if, in another scientific field, that same situation existed -- would a meta-analysis be used there?

The guidelines for meta-analysis would say ditch the idea of combining the data ("an unwise meta-analysis can lead to highly misleading conclusions"). However, once the Battle of the Meta-analyses began, no one seemed able to stop, and so on and on we go. You can't blame Milton and Wiseman for the authors' studied indifference to just how sensitive their conclusions were to their assumptions, though.
The point is that, even using the methodology of Milton and Wiseman, the results of the meta-analysis are statistically significant when all recent studies are included.

I was joking. :)
 
First, we don't know whether the "single, highly questionable, unpublished study" was flawed.

Whether it was flawed is irrelevant.

Second, even excluding that study, in the other 39 studies under consideration there were 440 hits in 1533 trials, a hit rate of 28.7%, versus the expected hit rate of 25%. Now, I understand that the p value falls slightly outside the range of statistical significance using a Stouffer Z test, but I question whether that is the best test in this case, if what is being tested is the hypothesis that psi exists. Rather, it seems to me that a binomial model is more appropriate to test that hypothesis and, under that model, the results are easily statistically significant. However, I would like to get the thoughts of Beth or some other professional statistician on this matter. (Beth, by the way, informs me that she is busy right now, but will eventually weigh in on this thread.)

This is an example of what I'm talking about. The Stouffer method is okay when it tells you what you want to hear, but when it doesn't you go looking for a different method of analysis until you find one that does. This violates the assumptions of hypothesis testing, btw.

I understand that it is tempting to simply combine the data from the table, but I don't think you can. I've been checking the numbers and they don't fit with the reported z-scores. There was heterogeneity in how the results were reported (e.g. multiple judges, ranking), so it looks like in many cases the z-score was calculated from the p-value, rather than from hits and misses (this fits with the methods specified by the authors of this paper and of Milton and Wiseman's). The use of the z-score was an attempt to standardize data that was otherwise too disparate to combine. Your choice of analysis further weakens the possibility of drawing any valid conclusions.
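To make the mechanics concrete, here is a toy sketch of how the Stouffer combination works; the z-scores in it are invented for the example, not the values from the table:

```python
# Toy illustration of the Stouffer method: standardize each study to a z-score
# and combine the z-scores, rather than pooling raw hits and misses.
# The z-values below are invented, not the actual study results.
import math
from scipy.stats import norm

z_scores = [1.2, -0.3, 0.8, 2.1, 0.0]
stouffer_z = sum(z_scores) / math.sqrt(len(z_scores))
p_one_sided = norm.sf(stouffer_z)                     # P(Z >= combined z)
print(f"Combined Z = {stouffer_z:.2f}, one-sided p = {p_one_sided:.3f}")
```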

Third, there is the matter of standard versus non-standard ganzfeld replications. If the analysis Bem, Broughton, and Palmer did was proper, the hit rate for standard ganzfeld replications is even more significant, at 31.2% (and 29.5% excluding the "questionable unpublished study"). Again, I would like to get the thoughts of Beth or some other professional statistician on the propriety of the authors' "standard versus non-standard" analysis.

What do you think she's going to say? That it's okay to ignore precautions in this case because it'd be really, really cool if we could show that psi exists?

See above.

That did answer my question. Basically you think it's okay to fudge when it comes to psi.

I'm not; I just want to know what your belief that there is a ganzfeld "file drawer problem" is based upon.

All fields of research have unpublished studies. Parapsychology recognizes that it has unpublished studies. There have been specific reports on unpublished studies in specific research areas, including on ganzfeld experiments. Individual researchers have stated they have performed ganzfeld research (including negative studies) that they have not published. It would be unreasonable for me to claim that we can safely ignore the possibility just because I feel like it.

The point is that, if later, more tightly-controlled ganzfeld experiments show similar results to the earlier experiments, the likelihood is that the results of the earlier experiments were not affected by lack of tight controls. In other words, there may have been a potential problem with earlier experiments, but there is no evidence that translated into skewed results.

Choosing to spin the results the way you want them to be spun still does not rule out other possibilities.

So no meta-analysis is possible, even if the great majority of ganzfeld experiments produce hit rates higher than 25%? What if, in another scientific field, that same situation existed -- would a meta-analysis be used there?

There are 5 out of 40 studies that produce a hit rate significantly greater than 25%, not a great majority. If the same situation existed in another field, the idea that anything is being measured would have been dropped long ago.

The point is that, even using the methodology of Milton and Wiseman, the results of the meta-analysis are statistically significant when all recent studies are included.

And that significance is very sensitive to your underlying assumptions. So you have an effect that is not robust, consistent or reproducible, demonstrated through an "unwise analysis that can lead to highly misleading conclusions."

And you wonder why we're not jumping on the bandwagon.

Linda
 
Whether it was flawed is irrelevant.

This is an example of what I'm talking about. The Stouffer method is okay when it tells you what you want to hear, but when it doesn't you go looking for a different method of analysis until you find one that does. This violates the assumptions of hypothesis testing, btw.

I understand that it is tempting to simply combine the data from the table, but I don't think you can. I've been checking the numbers and they don't fit with the reported z-scores. There was heterogeneity in how the results were reported (e.g. multiple judges, ranking), so it looks like in many cases the z-score was calculated from the p-value, rather than from hits and misses (this fits with the methods specified by the authors of this paper and of Milton and Wiseman's). The use of the z-score was an attempt to standardize data that was otherwise too disparate to combine. Your choice of analysis further weakens the possibility of drawing any valid conclusions.

What do you think she's going to say? That it's okay to ignore precautions in this case because it'd be really, really cool if we could show that psi exists?

That did answer my question. Basically you think it's okay to fudge when it comes to psi.
You have a vivid imagination. :) It seems to me that the binomial model is more appropriate than the Stouffer model for this type of analysis, and it has nothing to do with trying to augment the significance of the results. It also seems to me that Bem's, Broughton's, and Palmer's "standard versus non-standard" analysis was reasonable. However, I may be missing something, and that's why I would like a statistician to weigh in.

All fields of research have unpublished studies. Parapsychology recognizes that it has unpublished studies. There have been specific reports on unpublished studies in specific research areas, including on ganzfeld experiments. Individual researchers have stated they have performed ganzfeld research (including negative studies) that they have not published. It would be unreasonable for me to claim that we can safely ignore the possibility just because I feel like it.
Do you have evidence that unpublished negative studies are more of a problem with respect to ganzfeld research than other fields of research? If so, what is it?

Choosing to spin the results the way you want them to be spun still does not rule out other possibilities.
I'm not spinning the results, I'm evaluating them.

There are 5 out of 40 studies that produce a hit rate significantly greater than 25%, not a great majority. If the same situation existed in another field, the idea that anything is being measured would have been dropped long ago.
Most of these studies are too small to produce statistical significance unless the hit rate exceeds 45-50%, but most do show hit rates exceeding 25%. Those studies need to be aggregated to determine whether a small, but measurable, above chance hit rate is occurring. Your opinion about other fields is noted, but at least one statistician -- Jessica Utts -- disagrees with you. I'm not sure how Beth feels; perhaps she'll tell us when she weighs in.
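As a rough check on that 45-50% figure, here is a sketch that finds the minimum hit rate needed for one-sided significance at a few study sizes; the sizes are illustrative round numbers, not the actual studies:

```python
# For a study of n trials with a 25% chance hit rate, find the smallest number
# of hits (and the corresponding hit rate) that reaches one-sided p < 0.05.
# The n values are illustrative round numbers, not the actual study sizes.
from scipy.stats import binom

for n in (20, 30, 40, 60, 100):
    for k in range(n + 1):
        if binom.sf(k - 1, n, 0.25) < 0.05:           # P(X >= k) under chance
            print(f"n = {n:3d}: need {k:2d} hits ({k / n:.0%}) for p < 0.05")
            break
```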

And that significance is very sensitive to your underlying assumptions. So you have an effect that is not robust, consistent or reproducible, demonstrated through an "unwise analysis that can lead to highly misleading conclusions."
Ganzfeld tests have generally shown hit rates that average in the 30-35% range, which is well above chance when measured over thousands of trials.

And you wonder why we're not jumping on the bandwagon.
Considering that this is the Randi Forums, no I don't. ;)
 
You have a vivid imagination. :) It seems to me that the binomial model is more appropriate than the Stouffer model for this type of analysis, and it has nothing to do with trying to augment the significance of the results.

Part of the reason the authors used the Stouffer model was because not all studies provided dichotomous data. There was variation in how the results were measured and reported. The binomial model may be more appropriate for some of the data, but couldn't be used for all of the data.

But what I was referring to was that you were content quoting p-values from the Stouffer method until I pointed out that applying a reasonable amount of caution led to a p-value that was not significant.

It also seems to me that Bem's, Broughton's, and Palmer's "standard versus non-standard" analysis was reasonable. However, I may be missing something, and that's why I would like a statistician to weigh in.

It's not a hard question and I am easily competent enough to answer it.

When you make a distinction that is arbitrary (studies above a certain number represent replication studies, studies below a certain number represent exploration studies), good practice dictates that you test the reasonableness of where you have drawn that line through sensitivity testing. You vary where/how you draw the line over a reasonable range of values and see how that would change your conclusions. If changing those values does not lead to dramatic changes in your p-value (for example, they do not lead to p-values that are non-significant) then you can say that your choice of where to draw the line is robust and reasonable.
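In code, the idea looks roughly like this; the standardness ratings and z-scores below are invented placeholders, not the actual table:

```python
# Sketch of a sensitivity analysis: move the 'standardness' cut-off that
# separates the two groups and see whether the combined (Stouffer) p-value
# for the upper group stays significant. Ratings and z-scores are invented.
import math
from scipy.stats import norm

# (standardness rating on a 1-7 scale, z-score) -- placeholder values
studies = [(6.5, 1.9), (4.3, 2.4), (3.0, -0.2), (5.5, 0.8), (2.5, 0.4), (7.0, 1.2)]

for cutoff in (3.5, 4.0, 4.5, 5.0, 5.5):
    group = [z for rating, z in studies if rating > cutoff]
    if not group:
        continue
    stouffer_z = sum(group) / math.sqrt(len(group))
    print(f"cut-off {cutoff}: {len(group)} studies, one-sided p = {norm.sf(stouffer_z):.3f}")
```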

The authors do not report on the results of their sensitivity test, or even whether they have done one. I reported on the result of my sensitivity test. Their choice was not robust, and their choice of where to draw the line seemed to be the point at which the p-value is minimized. It makes it look like they did do sensitivity testing (in order to identify how to get the 'best' result), but didn't report the results to hide that their choice was arbitrary and that choosing different cut-off points would nullify their conclusion. Or maybe they were just sloppy, but lucky.

Either way, it does not demonstrate good practice. And whether or not parapsychologists should follow good practice when performing research and reporting on the results is not a question you need a statistician to answer. You and Utts and the authors of many of these papers seem to be saying 'no'. But as I've pointed out before, good practice is not a set of rules designed to exclude woo from the club. It is a set of practices that help us avoid making conclusions that are wrong.

Do you have evidence that unpublished negative studies are more of a problem with respect to ganzfeld research than other fields of research? If so, what is it?

I don't think that it's any more of a problem with respect to ganzfeld research than other fields of research. The effect of publication bias is usually taken into consideration when doing meta-analyses in other fields.

Most of these studies are too small to produce statistical significance unless the hit rate exceeds 45-50%, but most do show hit rates exceeding 25%. Those studies need to be aggregated to determine whether a small, but measurable, above chance hit rate is occurring. Your opinion about other fields is noted, but at least one statistician -- Jessica Utts -- disagrees with you. I'm not sure how Beth feels; perhaps she'll tell us when she weighs in.

This is right up my alley because it's very similar to medical research and we do lots of meta-analyses. I don't know what field Beth works in, so she may not be familiar with a comparable field of study. But I don't think it matters, 'cuz it's not even close. If I tried to claim that I discovered a new disease and that it could be measured with a particular blood test, and then presented data like that for the ganzfeld, I would absolutely be laughed off the stage.

Ganzfeld tests have generally shown hit rates that average in the 30-35% range, which is well above chance when measured over thousands of trials.

The point is that particular study does not support your assertion.

Considering that this is the Randi Forums, no I don't. ;)

Wow. You're really willing to go all the way with this one. I point out that most scientists are unwilling to accept conclusions drawn from unwise analyses on inconsistent and unreliable experiments, and you present it as those damned unreasonable skeptics again. Are you really fooling anyone? Are you fooling yourself?

Linda
 
Amelia Kinkade was just so inane I could barely keep myself from shouting out the whole time.

So, why are you questioning why Randi and Shermer seem so negative when it comes to psychics? Outside of carefully controlled/staged/edited situations, pretty much all psychics come off as "inane", to say the least. After seeing dozens or even hundreds of these frauds in action over the span of years, don't you think your feeling of wanting to shout at them would become semi-permanent?
 
I don't think that it's any more of a problem with respect to ganzfeld research than other fields of research. The effect of publication bias is usually taken into consideration when doing meta-analyses in other fields.

Indeed. I just read this excellent article by Ben Goldacre on the problems with homeopathy research (funnily enough, rather similar to the problems faced with psi research, although it seems to me that psi research has evolved a bit more).

Ben Goldacre said:
But back to the important stuff. Why else might there be plenty of positive trials around, spuriously? Because of something called "publication bias". In all fields of science, positive results are more likely to get published, because they are more newsworthy, there's more mileage in publishing them for your career, and they're more fun to write up. This is a problem for all of science. Medicine has addressed this problem, making people register their trial before they start, on a "clinical trials database", so that you cannot hide disappointing data and pretend it never happened.

How big is the problem of publication bias in alternative medicine? Well now, in 1995, only 1% of all articles published in alternative medicine journals gave a negative result. The most recent figure is 5% negative. This is very, very low.

There is only one conclusion you can draw from this observation. Essentially, when a trial gives a negative result, alternative therapists, homeopaths or the homeopathic companies simply do not publish it. There will be desk drawers, box files, computer folders, garages, and back offices filled with untouched paperwork on homeopathy trials that did not give the result the homeopaths wanted. At least one homeopath reading this piece will have a folder just like that, containing disappointing, unpublished data that they are keeping jolly quiet about. Hello there!

Now, you could just pick out the positive trials, as homeopaths do, and quote only those. This is called "cherry picking" the literature - it is not a new trick, and it is dishonest, because it misrepresents the totality of the literature. There is a special mathematical tool called a "meta-analysis", where you take all the results from all the studies on one subject, and put the figures into one giant spreadsheet, to get the most representative overall answer. When you do this, time and time again, and you exclude the unfair tests, and you account for publication bias, you find, in all homeopathy trials overall, that homeopathy does no better than placebos.

Emphasis mine.
 
Part of the reason the authors used the Stouffer model was because not all studies provided dichotomous data. There was variation in how the results were measured and reported. The binomial model may be more appropriate for some of the data, but couldn't be used for all of the data.

But what I was referring to was that you were content quoting p-values from the Stouffer method until I pointed out that applying a reasonable amount of caution led to a p-value that was not significant.
I was simply quoting the article, which relied on the Stouffer model. I don't think Bem, Broughton, and Palmer saw a need to argue with Milton and Wiseman regarding which model should be used because the former's analysis found statistical significance even under the Stouffer model.

It's not a hard question and I am easily competent enough to answer it.

When you make a distinction that is arbitrary (studies above a certain number represent replication studies, studies below a certain number represent exploration studies), good practice dictates that you test the reasonableness of where you have drawn that line through sensitivity testing. You vary where/how you draw the line over a reasonable range of values and see how that would change your conclusions. If changing those values does not lead to dramatic changes in your p-value (for example, they do not lead to p-values that are non-significant) then you can say that your choice of where to draw the line is robust and reasonable.

The authors do not report on the results of their sensitivity test, or even whether they have done one. I reported on the result of my sensitivity test. Their choice was not robust, and their choice of where to draw the line seemed to be the point at which the p-value is minimized. It makes it look like they did do sensitivity testing (in order to identify how to get the 'best' result), but didn't report the results to hide that their choice was arbitrary and that choosing different cut-off points would nullify their conclusion. Or maybe they were just sloppy, but lucky.

Either way, it does not demonstrate good practice. And whether or not parapsychologists should follow good practice when performing research and reporting on the results is not a question you need a statistician to answer. You and Utts and the authors of many of these papers seem to be saying 'no'. But as I've pointed out before, good practice is not a set of rules designed to exclude woo from the club. It is a set of practices that help us avoid making conclusions that are wrong.
If Bem, Broughton, and Palmer had cooked the books, I would agree with you. However, I see no evidence of that.

For example, if you break the 40 studies into the 17 that were rated as having the highest level of standardization and the remaining 23, the hit rate on the former was 33.8% (261 hits in 773 trials) and the hit rate on the latter was 26.9% (239 hits in 888 trials). Or, if you look at the eight studies that were rated a perfect 7 in terms of standardization and compare those to the nine studies that were rated the lowest in terms of standardization, the hit rate on the former was 34.1% (150 hits in 440 trials) and the hit rate on the latter was 24.3% (73 hits in 300 trials).

Now, there is an issue regarding the five studies that fell between 4.00-4.67 in terms of standardization, as the hit rate on those was a high 37.2% (71 hits in 191 trials). Perhaps the authors could have put all five of those in a neutral category, instead of putting only the two that were rated an even 4 in that category. However, even if you include those five with the nine studies that were rated the lowest in terms of standardization to form a bottom 14, the hit rate on those 14 studies was 29.3% (144 hits in 491 trials), which is below the hit rate of the studies that were rated the highest in terms of standardization.
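To put a number on that first comparison, here is a rough sketch: it simply pools the hit counts quoted above and applies a plain two-proportion z-test, which is not an analysis used in either paper, so treat it as illustration only.

```python
# Pool the hit counts quoted above for the 17 most-standardized studies and
# the other 23, and compare them with a plain two-proportion z-test.
# This is not an analysis from either paper; pooling raw counts across
# studies is itself debatable, so this only makes the comparison concrete.
import math
from scipy.stats import norm

def two_prop_z(hits1, n1, hits2, n2):
    p1, p2 = hits1 / n1, hits2 / n2
    pooled = (hits1 + hits2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return p1, p2, z, norm.sf(z)                      # one-sided p-value

p1, p2, z, p = two_prop_z(261, 773, 239, 888)         # top 17 vs. remaining 23
print(f"{p1:.1%} vs. {p2:.1%}: z = {z:.2f}, one-sided p = {p:.4f}")
```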

I don't think that it's any more of a problem with respect to ganzfeld research than other fields of research. The effect of publication bias is usually taken into consideration when doing meta-analyses in other fields.
How so? Can you give an example or two?

This is right up my alley because it's very similar to medical research and we do lots of meta-analyses. I don't know what field Beth works in, so she may not be familiar with a comparable field of study. But I don't think it matters, 'cuz it's not even close. If I tried to claim that I discovered a new disease and that it could be measured with a particular blood test, and then presented data like that for the ganzfeld, I would absolutely be laughed off the stage.
I don't think that's an apt analogy because medical research must necessarily meet a higher standard. For example, a new drug may be shown to have potential benefits, but its potential side effects may outweigh those benefits.

The point is that particular study does not support your assertion.
Your opinion is noted, but not necessarily agreed to. :)

Wow. You're really willing to go all the way with this one. I point out that most scientists are unwilling to accept conclusions drawn from unwise analyses on inconsistent and unreliable experiments, and you present it as those damned unreasonable skeptics again. Are you really fooling anyone? Are you fooling yourself?
I'm simply attempting to evaluate the evidence.
 
I was simply quoting the article, which relied on the Stouffer model. I don't think Bem, Broughton, and Palmer saw a need to argue with Milton and Wiseman regarding which model should be used because the former's analysis found statistical significance even under the Stouffer model.

Let me put it another way. You were content with the analysis that Bem, Broughton, and Palmer performed until I pointed out that applying a reasonable amount of caution led to a p-value that was no longer significant. At that point you decided that the results from a different analysis - an analysis which Bem, Broughton, Palmer, Milton and Wiseman considered inappropriate (else they would have used it) - should take precedence.

If Bem, Broughton, and Palmer had cooked the books, I would agree with you. However, I see no evidence of that.

You didn't understand what I said if you think this is about cooking the books.

For example, if you break the 40 studies into the 17 that were rated as having the highest level of standardization and the remaining 23, the hit rate on the former was 33.8% (261 hits in 773 trials) and the hit rate on the latter was 26.9% (239 hits in 888 trials). Or, if you look at the eight studies that were rated a perfect 7 in terms of standardization and compare those to the nine studies that were rated the lowest in terms of standardization, the hit rate on the former was 34.1% (150 hits in 440 trials) and the hit rate on the latter was 24.3% (73 hits in 300 trials).

Now, there is an issue regarding the five studies that fell between 4.00-4.67 in terms of standardization, as the hit rate on those was a high 37.2% (71 hits in 191 trials). Perhaps the authors could have put all five of those in a neutral category, instead of putting only the two that were rated an even 4 in that category. However, even if you include those five with the nine studies that were rated the lowest in terms of standardization to form a bottom 14, the hit rate on those 14 studies was 29.3% (144 hits in 491 trials), which is below the hit rate of the studies that were rated the highest in terms of standardization.

First of all, as you have just demonstrated, changing your underlying assumptions (finding different ways of grouping the results) changes the conclusions that you draw from the data. This is exactly my point. How do you know which conclusion is likely to be valid if any of those conclusions can be supported by the right manipulation of the data?

Second of all, even though the authors estimated hit rates for all the studies, combining hit rates is the same as combining apples and oranges in this situation. You should stop performing what we already know to be an invalid analysis if you want me to take you seriously (and no, that's not rhetorical). The estimated hit rates are not based on the same kind of data as the non-estimated hit rates. You can tell this is the case because the z-scores don't always match the listed hit rates. Use the Stouffer method instead.

How so? Can you give an example or two?

You can specifically search for unpublished studies (the idea of registering all studies before they are performed is meant to address this).

You can calculate a fail-safe N (the number of studies that would need to be in the file-drawer in order to bring the p-value above 0.05); a rough sketch of that calculation is given below.

You can visually inspect a funnel plot.

You can attempt to model the bias and make a 'correction'.
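To illustrate the fail-safe N item mentioned above, here is a sketch using Rosenthal's simple formula; the inputs are made-up placeholders rather than the actual ganzfeld figures:

```python
# Rosenthal's fail-safe N: the number of unpublished null-result studies
# (average z = 0) that would have to sit in the file drawer to pull a
# Stouffer-combined result below the one-sided 0.05 threshold (z = 1.645).
# The inputs are placeholders, not the actual ganzfeld figures.
k = 40          # number of published studies (placeholder)
sum_z = 14.0    # sum of their z-scores (placeholder)
z_crit = 1.645

fail_safe_n = (sum_z / z_crit) ** 2 - k
print(f"Fail-safe N = {fail_safe_n:.0f} hypothetical unpublished null studies")
```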

I don't think that's an apt analogy because medical research must necessarily meet a higher standard. For example, a new drug may be shown to have potential benefits, but its potential side effects may outweigh those benefits.

Ah, so I should receive a smattering of polite applause instead of laughter before everyone moves on to the next presentation and promptly forgets mine?

Your opinion is noted, but not necessarily agreed to. :)

I'm simply attempting to evaluate the evidence.

No you're not. You're looking for ways to dismiss what I say. Hence the implications that I'm not sufficiently knowledgeable, I'm biased in my evaluations, I'm close-minded, I'm unfair, etc. In response to my attempts to inform you about methods of analysis, good research practice, how evidence is evaluated and weighed, different kinds of bias and how they affect results, hypothesis testing and assumptions, etc., you treat me as though I may be lying in order to pull the wool over your eyes. You do all this in the absence of any evidence that I have ever said something that was untrue, and without bothering to inform yourself about any of these issues; you actively/deliberately avoid learning anything in order to maintain plausible deniability.

It's all quite fascinating.

Linda
 
You do all this in the absence of any evidence that I have ever said something that was untrue, and without bothering to inform yourself about any of these issues; you actively/deliberately avoid learning anything in order to maintain plausible deniability.

It's all quite fascinating.

Linda

This is a bit OT, but I just want to thank you, Linda, for explaining more about statistics and research studies in this thread (and in that homeopathy one) than I ever learned in college.
 
This is a bit OT, but I just want to thank you, Linda, for explaining more about statistics and research studies in this thread (and in that homeopathy one) than I ever learned in college.

And I appreciate the opportunity to say these things without seeing people's eyes glaze over. :)

Linda
 
Let me put it another way. You were content with the analysis that Bem, Broughton, and Palmer performed until I pointed out that applying a reasonable amount of caution led to a p-value that was no longer significant. At that point you decided that the results from a different analysis - an analysis which Bem, Broughton, Palmer, Milton and Wiseman considered inappropriate (else they would have used it) - should take precedence.
Again, it has nothing to with trying to make the results look better than they actually are -- it has to do with which model is more appropriate for this analysis.

You didn't understand what I said if you think this is about cooking the books.
So why did you make these two prior comments?

(1) "The authors 'happened' to divide the studies into those which fell above the mid-point of the scale and those that fell below."

(2) "The method that they chose isn't even the most reasonable or obvious, since it results in uneven groups and is skewed . . . Putting that all together suggests that they tried many different ways of analyzing the data and then picked only those that supported their conclusion to include in their report."

First of all, as you have just demonstrated, changing your underlying assumptions (finding different ways of grouping the results) changes the conclusions that you draw from the data. This is exactly my point. How do you know which conclusion is likely to be valid if any of those conclusions can be supported by the right manipulation of the data?

Second of all, even though the authors estimated hit rates for all the studies, combining hit rates is the same as combining apples and oranges in this situation. You should stop performing what we already know to be an invalid analysis if you want me to take you seriously (and no, that's not rhetorical). The estimated hit rates are not based on the same kind of data as the non-estimated hit rates. You can tell this is the case because the z-scores don't always match the listed hit rates. Use the Stouffer method instead.
Your analysis would make more sense to me if at least a sizeable minority of the hit rates had been estimated. However, according to the authors, in only 5 of the 40 studies were the hit rates not reported and those 5 studies featured only 158 trials, which is less than 10% of the total trials in the 40 studies. So I think the estimated hit rates are pretty much a de minimis percentage of the total and don't invalidate the authors' "standard vs. non-standard" analysis. However, I do think that this is a tricky area, and I could be wrong.

You can specifically search for unpublished studies (the idea of registering all studies before they are performed is meant to address this).

You can calculate a fail-safe N (the number of studies that would need to be in the file-drawer in order to bring the p-value above 0.05).

You can visually inspect a funnel plot.

You can attempt to model the bias and make a 'correction'.
Fine, but what has actually been done in other meta-analyses? I would like at least one example, if you have one.

Ah, so I should receive a smattering of polite applause instead of laughter before everyone moves on to the next presentation and promptly forgets mine?
No, I'm saying that you would not do the presentation because the standards for medical research and psi research are different.

No you're not. You're looking for ways to dismiss what I say. Hence the implications that I'm not sufficiently knowledgeable, I'm biased in my evaluations, I'm close-minded, I'm unfair, etc. In response to my attempts to inform you about methods of analysis, good research practice, how evidence is evaluated and weighed, different kinds of bias and how they affect results, hypothesis testing and assumptions, etc., you treat me as though I may be lying in order to pull the wool over your eyes. You do all this in the absence of any evidence that I have ever said something that was untrue, and without bothering to inform yourself about any of these issues; you actively/deliberately avoid learning anything in order to maintain plausible deniability.
I definitely don't think that you're lying, and I have found your analysis informative. At the same time, I'm not convinced by it. So, let's wait for Beth or another statistician to weigh in and we'll go from there.

It's all quite fascinating.
That's not a bad thing, is it? :)
 
This is a bit OT, but I just want to thank you, Linda, for explaining more about statistics and research studies in this thread (and in that homeopathy one) than I ever learned in college.

Ditto. I think I'll nominate one of these posts and add the note that the nomination is actually for her contributions to this entire thread.
 
Again, it has nothing to with trying to make the results look better than they actually are -- it has to do with which model is more appropriate for this analysis.

And you keep using a model which is likely inappropriate (i.e. it depends upon all the results being dichotomous and they're not).

So why did you make these two prior comments?

(1) "The authors 'happened' to divide the studies into those which fell above the mid-point of the scale and those that fell below."

(2) "The method that they chose isn't even the most reasonable or obvious, since it results in uneven groups and is skewed . . . Putting that all together suggests that they tried many different ways of analyzing the data and then picked only those that supported their conclusion to include in their report."

I think the phrase "cooking the books" means making up data. What did you mean by it? If you meant a less than straightforward analysis, I already outlined the evidence for that. They should have reported on a sensitivity analysis and they didn't. So, they were either less than straightforward or they were sloppy.

Your analysis would make more sense to me if at least a sizeable minority of the hit rates had been estimated. However, according to the authors, in only 5 of the 40 studies were the hit rates not reported and those 5 studies featured only 158 trials, which is less than 10% of the total trials in the 40 studies. So I think the estimated hit rates are pretty much a de minimis percentage of the total and don't invalidate the authors' "standard vs. non-standard" analysis. However, I do think that this is a tricky area, and I could be wrong.

It's not just that some of the studies did not provide the actual hit rates (so they had to be estimated). It's also that some of the hit rate data wasn't reported as a 'hit' or 'miss', but rather measured in some other way. For example, each picture may be ranked as to whether or not it was a target. This means that you couldn't analyze the data using a binomial distribution. But you could use the p-value from your analysis to figure out a z-score. Now, you could also take the rankings and create binomial data from that (for example, every ranking above a certain number would count as a hit), and form a crude hit rate. But that hit rate wouldn't really be comparable to the hit rate from a study that measured the outcome as a 'hit' or 'miss' because they weren't formed the same way. And these differences show up in the table. If you go through the data in the table and use the hit rate from the list to calculate a z-score, the result for many of them is different from the z-score provided in the table. Because the hit rates provided in the table were formed in different ways, they shouldn't be directly compared. The z-scores on the other hand, because they are standardized, are directly comparable. Which is presumably why the Stouffer method was chosen.
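To illustrate what I mean about working back from a p-value, here is a sketch with invented numbers:

```python
# Two routes to a z-score. Study A reports plain hits and misses, so its z
# comes straight from the counts (normal approximation to the binomial).
# Study B reports only rankings, so all that is available is a one-sided
# p-value from, say, a t test, and the z is recovered from that. The numbers
# are invented; the point is that a hit rate back-calculated for Study B is
# not constructed the same way as Study A's, while the z-scores are
# directly comparable.
import math
from scipy.stats import norm

hits, n, chance = 32, 100, 0.25                       # Study A (dichotomous data)
z_a = (hits - n * chance) / math.sqrt(n * chance * (1 - chance))

p_b = 0.03                                            # Study B (only a p-value reported)
z_b = norm.isf(p_b)                                   # inverse survival function

print(f"Study A: z = {z_a:.2f} (from counts)")
print(f"Study B: z = {z_b:.2f} (recovered from its reported p-value)")
```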

Let me make it clear that I don't think the authors' standard vs. non-standard analysis was invalid. I'm not complaining about the idea. One of the recommended ways of attempting to draw information from data that is too heterogeneous for a meta-analysis is to do sub-group analyses and meta-regression. They just didn't do a very good job of it and they way over-stated what conclusions could reasonably be drawn.

Fine, but what has actually been done in other meta-analyses? I would like at least one example, if you have one.

I'm sorry I did not make this clear. My list was of those things that are actually done in meta-analyses. They are all examples.

I searched CMAJ (because full text is available on line) for the word meta-analysis in the title and took the first entry as an example for you. It describes the use of a funnel plot to assess for publication bias. It also was a meta-analysis with heterogeneous data and it describes a sensitivity analysis.

No, I'm saying that you would not do the presentation because the standards for medical research and psi research are different.

It's clear that the standards should be different, but not in the direction that you think. Psi research would actually make far, far, far more progress if it held itself to higher standards. But we've been over this already.

I definitely don't think that you're lying, and I have found your analysis informative. At the same time, I'm not convinced by it. So, let's wait for Beth or another statistician to weigh in and we'll go from there.

I'm already giving you all the information that is needed. So when you think about it, why else would it matter whether someone else weighed in unless you think I am being dishonest?

That's not a bad thing, is it? :)

Not at all.

Linda
 
Ditto. I think I'll nominate one of these posts and add the note that the nomination is actually for her contributions to this entire thread.

I appreciate the sentiment, but these things never make it to the finals, let alone win. :(

I have a tip, though (so all is not lost ;)). If you ever want to include quoted material when you're quoting a post, instead of hitting the "quote" button, use the "quote this post in a pm to" (from the menu that pops up when you click on the poster's name). Those posts automatically include the quoted material (properly formatted), so all you have to do is copy and paste it into a reply to the thread.

Linda
 
I appreciate the sentiment, but these things never make it to the finals, let alone win. :(

I have a tip, though (so all is not lost ;)). If you ever want to include quoted material when you're quoting a post, instead of hitting the "quote" button, use the "quote this post in a pm to" (from the menu that pops up when you click on the poster's name). Those posts automatically include the quoted material (properly formatted), so all you have to do is copy and paste it into a reply to the thread.

Linda

I appreciate the tip. Thanks.

Rodney, I've read through this thread and the article. I haven't researched it in depth and checked the actual computations, but I have no desire to quarrel with anything Linda has said in terms of the analysis. I'm a bit more charitable regarding their motivations for doing things the way they have. She's right regarding the point about robustness. If one data point is skewing the results and the rest of the dataset gives a very different result without it, then it needs to be investigated. It shouldn't be discarded without cause, but if it's from a non-verifiable source, that would be cause to exclude it.
 
And you keep using a model which is likely inappropriate (i.e. it depends upon all the results being dichotomous and they're not).
I'm still not sure whether the binomial model is appropriate -- for one thing, you're calling into question my understanding of how at least 35 of the 40 ganzfeld experiments were carried out; see below.

I think the phrase "cooking the books" means making up data. What did you mean by it? If you meant a less than straightforward analysis, I already outlined the evidence for that. They should have reported on a sensitivity analysis and they didn't. So, they were either less than straightforward or they were sloppy.
By cooking the books, I mean being deliberately misleading. For example, if the authors examined all of the possible ways to do their "standardness" analysis and then picked the way that most favored their hypothesis, I would consider that to be cooking the books.

It's not just that some of the studies did not provide the actual hit rates (so they had to be estimated). It's also that some of the hit rate data wasn't reported as a 'hit' or 'miss', but rather measured in some other way. For example, each picture may be ranked as to whether or not it was a target. This means that you couldn't analyze the data using a binomial distribution. But you could use the p-value from your analysis to figure out a z-score. Now, you could also take the rankings and create binomial data from that (for example, every ranking above a certain number would count as a hit), and form a crude hit rate. But that hit rate wouldn't really be comparable to the hit rate from a study that measured the outcome as a 'hit' or 'miss' because they weren't formed the same way. And these differences show up in the table. If you go through the data in the table and use the hit rate from the list to calculate a z-score, the result for many of them is different from the z-score provided in the table. Because the hit rates provided in the table were formed in different ways, they shouldn't be directly compared. The z-scores on the other hand, because they are standardized, are directly comparable. Which is presumably why the Stouffer method was chosen.

Let me make it clear that I don't think the authors' standard vs. non-standard analysis was invalid. I'm not complaining about the idea. One of the recommended ways of attempting to draw information from data that is too heterogeneous for a meta-analysis is to do sub-group analyses and meta-regression. They just didn't do a very good job of it and they way over-stated what conclusions could reasonably be drawn.
If you're correct that the hit rate data wasn't reported as a hit or miss for some of the 35 studies (excluding the 5 in which a footnote in the article says that the hit rate was not reported), but rather was measured in some other way, then my understanding of what was done in those 35 studies is wrong. I know alternative measures were used in the past, but I thought for those 35 a hit was credited only if the recipient correctly selected the target from among four choices. How many of the 35 studies do you think used alternative measures?

I'm sorry I did not make this clear. My list was of those things that are actually done in meta-analyses. They are all examples.

I searched CMAJ (because full text is available on line) for the word meta-analysis in the title and took the first entry as an example for you. It describes the use of a funnel plot to assess for publication bias. It also was a meta-analysis with heterogeneous data and it describes a sensitivity analysis.
Thanks, but the conclusion of that article was: "Assessment of publication bias using a funnel plot was attempted, but too few studies were available to allow any meaningful judgment." I'm guessing that is a typical conclusion. To your knowledge, has there ever been a case where the conclusion was along the lines of: "The meta-analysis found the results to be statistically significant; however, when adjusted for publication bias, the results are no longer significant"?

It's clear that the standards should be different, but not in the direction that you think. Psi research would actually make far, far, far more progress if it held itself to higher standards. But we've been over this already.
What, specifically, would you suggest to improve psi research?

I'm already giving you all the information that is needed. So when you think about it, why else would it matter whether someone else weighed in unless you think I am being dishonest?
Because differing perspectives can lead to different conclusions. I would like the opinion of an expert third-party.

Not at all.
See how much we have in common? ;)
 
I appreciate the tip. Thanks.

Rodney, I've read through this thread and the article. I haven't researched it in depth and checked the actual computations, but I have no desire to quarrel with anything Linda has said in terms of the analysis. I'm a bit more charitable regarding their motivations for doing things the way they have. She's right regarding the point about robustness. If one data point is skewing the results and the rest of the dataset gives a very different result without it, then it needs to be investigated. It shouldn't be discarded without cause, but if it's from a non-verifiable source, that would be cause to exclude it.
Okay, thanks. I agree with Linda's point about robustness, and I would like to obtain more information about the 1997 Dalton study, which showed a Z score of 5.20.
 
I'm still not sure whether the binomial model is appropriate -- for one thing, you're calling into question my understanding of how at least 35 of the 40 ganzfeld experiments were carried out; see below.

I'm going to defer to the authors of both papers on this. They stated that not all of the data was presented as dichotomous, and they chose not to use a binomial analysis when it would have been a better choice if appropriate. That's good enough for me to conclude that not all of the data was presented as dichotomous and that a binomial analysis was inappropriate. If you think the authors are untrustworthy on this point, I'll leave it up to you to investigate.

By cooking the books, I mean being deliberately misleading. For example, if the authors examined all of the possible ways to do their "standardness" analysis and then picked the way that most favored their hypothesis, I would consider that to be cooking the books.

Beth is correct that I am not particularly charitable regarding their motivations. I suspect that they are just trying to be helpful - sorta like "I believe deeply that there is something there, so let's figure out how we can maximize our discovery of this effect and make progress in our investigation." My complaint, however, is that this attitude is exactly what prevents them from making progress (I've gone over this in greater detail previously). It is their continued dismissal of good research practice as mere nay-saying on the part of skeptics that makes me ungenerous.

If you're correct that the hit rate data wasn't reported as a hit or miss for some of the 35 studies (excluding the 5 in which a footnote in the article says that the hit rate was not reported), but rather was measured in some other way, then my understanding of what was done in those 35 studies is wrong. I know alternative measures were used in the past, but I thought for those 35 a hit was credited only if the recipient correctly selected the target from among four choices. How many of the 35 studies do you think used alternative measures?

I don't know. Milton and Wiseman say "some studies used different outcome measures involving ranking or rating the target and decoys, and in such cases the probability associated with the test statistic used (t test, etc.) provided the z score. When a study reported more than one main outcome measure, the mean z score represented the study's outcome." I saw this in some of the studies I checked, but it's not worth my while to check all of them. I'll leave that to you if you're interested. As I said earlier, I'm willing to trust the authors on this.

Thanks, but the conclusion of that article was: "Assessment of publication bias using a funnel plot was attempted, but too few studies were available to allow any meaningful judgment." I'm guessing that is a typical conclusion. To your knowledge, has there ever been a case where the conclusion was along the lines of: "The meta-analysis found the results to be statistically significant; however, when adjusted for publication bias, the results are no longer significant"?

Ah, I see. I didn't realize that your question marked your entry into the "I won't believe you until you provide an example that is identical in every way, including the initial letters of the authors' names, and even then I won't believe you" game. I'm not playing. You've already lost all credit with me.

What, specifically, would you suggest to improve psi research?

I've gone over this in detail with you several times. If you've truly forgotten, I'll leave it up to you to review our previous conversations.

Linda
 
I'm going to defer to the authors of both papers on this. They stated that not all of the data was presented as dichotomous, and they chose not to use a binomial analysis when it would have been a better choice if appropriate. That's good enough for me to conclude that not all of the data was presented as dichotomous and that a binomial analysis was inappropriate. If you think the authors are untrustworthy on this point, I'll leave it up to you to investigate.
I intend to. See below.

Beth is correct that I am not particularly charitable regarding their motivations. I suspect that they are just trying to be helpful - sorta like "I believe deeply that there is something there, so let's figure out how we can maximize our discovery of this effect and make progress in our investigation." My complaint, however, is that this attitude is exactly what prevents them from making progress (I've gone over this in greater detail previously). It is their continued dismissal of good research practice as mere nay-saying on the part of skeptics that makes me ungenerous.
Are you defining progress as moving toward resolving the issue of whether psi exists?

I don't know. Milton and Wiseman say "some studies used different outcome measures involving ranking or rating the target and decoys, and in such cases the probability associated with the test statistic used (t test, etc.) provided the z score. When a study reported more than one main outcome measure, the mean z score represented the study's outcome." I saw this in some of the studies I checked, but it's not worth my while to check all of them. I'll leave that to you if you're interested. As I said earlier, I'm willing to trust the authors on this.
In the Bem, Broughton, and Palmer article, five studies have the same footnote, with that footnote reading: "Hit rate not reported. Estimated from z score." In conjunction with the footnote you quoted, I believe the correct interpretation is that those five studies (four of which were included in the original 30 studies analyzed by Milton and Wiseman) were the only ones of the 40 studies to have used "different outcome measures." If I am correct, the authors of both articles could have excluded those studies (which were a small percentage of the total) from their analyses and used binomial models. I will check with one of the authors to see whether my interpretation is correct.

Ah, I see. I didn't realize that your question marked your entry into the "I won't believe you until you provide an example that is identical in every way, including the initial letters of the authors' names, and even then I won't believe you" game. I'm not playing. You've already lost all credit with me.
Sorry if I wasn't clear, but you stated: "The effect of publication bias is usually taken into consideration when doing meta-analyses in other fields." And I replied: "How so? Can you give an example or two?" You provided me with an article that concludes: "Assessment of publication bias using a funnel plot was attempted, but too few studies were available to allow any meaningful judgment." So I still see no evidence that publication bias is accounted for in a meaningful way.

I've gone over this in detail with you several times. If you've truly forgotten, I'll leave it up to you to review our previous conversations.
Okay, I'll review.
 
