
Interpreting nonsignificant results

Doesn't seem odd at all to me. If the difference is actually very small, then it will require lots of subjects to detect (reliably).

Exactly. Especially if one considers that proving no difference is impossible, and that the required sample size increases rapidly as the difference to be detected decreases.
 
Pesta, I've got to agree with Dr. Kitten and Marting on this one.

Nonetheless, there may be other reasons why the studies are not demonstrating differences which may exist. Perhaps the age groups are not different enough and more extreme groupings (e.g., <40 versus >70, which would exclude the majority of cancer patients) might have picked up a difference. Perhaps using a different scale with greater sensitivity to small effects might have done it (but again, what about clinical relevance?). Another possibility is the impact of a third variable. Specifically, all the patients in these studies had both disease A (cancer) and symptom B. It may well be that the effects of A and B on outcome C are large and therefore obscure any effect of age. A regression analysis might allow some insight into this, but that isn't currently available.
 
Is it possible to tell us what outcome you are looking at?

Could a bias be introduced into the selection process? For example, one that excludes some of the older people likely to have the outcome?

Linda
 
Linda,

You are right. I forgot to mention one bias that plagues all gerontological research: recruitment bias. Older people, especially those who are medically ill, are not highly likely to participate in research. So there is always a question of the representativeness of older samples.

I'm happy to share the variables: all the patients have advanced terminal cancer and pain and the outcome of interest is depression. Each of these is an unwieldy beast in its own right, let alone when they co-occur. I have definitely addressed these issues in the paper.
 
So far the discussion has centred on statistical significance. The last post, indicating a required sample size of 10,000, raises the question of clinical significance. In other words, there may be something going on, but is it useful?
Well, N goes as the inverse square of the effect. So 10,000 suggests an effect of about 1%. Over several million people, that could be important. For instance, a drug that reduces heart disease by 1% would save millions of lives worldwide.
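
A minimal sketch of that inverse-square relationship, assuming the conventional normal-approximation sample-size formula for a two-group comparison with alpha = .05 and power = .80 (the specific effect sizes below are illustrative, not taken from the studies under discussion):

```python
# Required per-group n grows as the inverse square of the standardized effect size d:
# n = 2 * (z_{alpha/2} + z_{power})^2 / d^2  (normal approximation, two-tailed test).
from scipy.stats import norm

alpha, power = 0.05, 0.80                      # assumed conventions
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # ~1.96 + 0.84 = 2.80

for d in (0.8, 0.4, 0.2, 0.1, 0.05):
    n_per_group = 2 * z**2 / d**2
    print(f"d = {d:4.2f}  ->  n per group ~ {n_per_group:,.0f}")
# Halving d quadruples the required n, which is why tiny effects need enormous samples.
```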
 
It seems like there might be two ways to do it.

1) calculate the presumed effect size based on the mean differences (the nonsignificant ones) found in the 4 studies you looked at. From there, figure out what sample size would have been needed to have power = .80 to reject the null.

My problem with this approach-- and I could be wrong-- is that if the 4 studies represent a type II error, then by definition they are unfairly underestimating the true effect size. Using the observed effect size then seems off.

I'm not sure if you did this, but I guess the other way to go is figure out what the smallest clinically meaningful effect size might be (say .20) and from there figure out the power these studies had to detect that effect.
That sounds right to me.

If the effect is small, then certainly it will be hard to detect. But the question is, how sure are we that it really is small, i.e., do these studies provide strong evidence or only weak evidence that the effect is small enough to be clinically unimportant. It would seem to be begging the question simply to assume that the true effect size is identical with the size of the effect seen in the studies.

The studies aren't statistically significant. This means that they are consistent, more or less, with a true effect size of zero. But they might also be more or less consistent with a true effect size that's big enough to be clinically relevant. So the next thing to look at is, how big of an effect is big enough, and, supposing the true effect were in fact this big, would the results of the studies be terribly unlikely or reasonably likely?

But I imagine, TruthSeeker, that such things as the recruitment bias you mention are in all probability quite important compared with stuff that's easier to model mathematically. So, use your expert knowledge of the field. That's why you're writing this review and not me. :)
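
For what it's worth, here is a rough sketch of the two approaches described above, using a normal approximation for a two-tailed, two-group comparison; the observed effect size, per-group n, and the .20 "clinically meaningful" threshold are all made-up illustrative values, not numbers from the four studies:

```python
# Normal-approximation power calculations for a two-tailed, two-sample comparison.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

def power_two_sample(d, n_per_group):
    """Approximate power to detect standardized effect d with n subjects per group."""
    delta = d * (n_per_group / 2) ** 0.5
    return norm.cdf(delta - z_crit) + norm.cdf(-delta - z_crit)

def n_for_power(d, power):
    """Approximate per-group n needed to detect d with the given power."""
    return 2 * (z_crit + norm.ppf(power)) ** 2 / d**2

d_observed, n_per_group = 0.08, 150   # hypothetical observed effect and study size

# Approach 1: take the observed (nonsignificant) effect at face value.
print(n_for_power(d_observed, 0.80))  # enormous n "needed" to make the observed d significant

# Approach 2: fix the smallest clinically meaningful effect and ask what power the study had.
d_clinical = 0.20                     # assumed smallest effect worth caring about
print(power_two_sample(d_clinical, n_per_group))  # ~0.41 here: weak evidence either way
```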
 
I think this is an interesting example of another form of bias inherent in null testing - ie a form of conceptual bias. Tests are generally set up with a straw-man null, which the tester thinks/expects/suspects will provide significant results - and as such, if no such link is found, there is a weight towards looking for errors in, or ways to "improve", the testing procedure rather than concluding that there was no evidence to reject the null. Whilst this is of course a rather natural human approach, it does mean that whatever the currently accepted scientifically expected norms are, they are reinforced through re-testing and tweaking of tests and their interpretations...


stuff on conceptual bias
http://htpprints.yorku.ca/archive/00...1/03denis.html
 
If the effect is small, then certainly it will be hard to detect. But the question is, how sure are we that it really is small, i.e., do these studies provide strong evidence or only weak evidence that the effect is small enough to be clinically unimportant. It would seem to be begging the question simply to assume that the true effect size is identical with the size of the effect seen in the studies.

The studies aren't statistically significant. This means that they are consistent, more or less, with a true effect size of zero. But they might also be more or less consistent with a true effect size that's big enough to be clinically relevant. So the next thing to look at is, how big of an effect is big enough, and, supposing the true effect were in fact this big, would the results of the studies be terribly unlikely or reasonably likely?

This starts to verge on special pleading. I assume that you didn't mean to veer in that direction, but that's where it's starting to go.

I don't think it's begging the question to assume that the effect size is what we measured it to be. That's the most reasonable conclusion to draw given the evidence. You're right that the effect size might be larger than we've observed, but it's just as likely to be smaller than we observed.

Most statistics -- not all, certainly, but most -- are symmetric. So if we have statistically significant evidence that the observed value of x is different from (greater than) 0, then we also have statistically significant evidence that the observed value of x is different from (less than) 2x. If TruthSeeker were interested, she's already done most of the analysis that would let her put confidence intervals around the effect size, so we could say "we are 95% confident that the true value is between here and there." (And under this analysis, a test is "significant" if and only if the confidence interval excludes 0.00. Isn't theoretical stats fun?) If neither here nor there corresponds to a "clinically" significant difference, then that's more or less a final nail in the coffin of this particular hypothesis.
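
To make the confidence-interval suggestion concrete, here is a rough sketch with made-up numbers; the standard error uses a common large-sample approximation for Cohen's d, so the interval is only approximate:

```python
from math import sqrt

d, n1, n2 = 0.15, 80, 80   # hypothetical observed effect size and group sizes
# Common large-sample approximation to the standard error of Cohen's d.
se = sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
lo, hi = d - 1.96 * se, d + 1.96 * se
print(f"observed d = {d}, approximate 95% CI = ({lo:.2f}, {hi:.2f})")
# If the interval includes 0, the test was nonsignificant; if it also includes the
# smallest clinically meaningful effect, the study hasn't ruled that effect out either.
```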
 
Jacob indeed-- I dunno where I got Stanley.

Well, this interests me. Reading the replies, I'm more convinced I am wrong, but I'm still not sure why, and it still seems like 10,000 is way too many subjects to get power = .80. If anyone can help me out here, I would appreciate it!

I thought there were two ways to do power analysis. In one, you know the true population effect size. I pulled out an old text with an example.

Assume you knew the population age difference was 3.5 points on the pain survey with a population SD of 7.

The effect size would be .50

In this example, there were 16 younger and 16 older patients (categorized as younger and older, versus using age as a continuous variable).

Delta is the effect size times the square root of n/2, where n is the per-group sample size. In this case delta = .50 * sqrt(16/2), which is about 1.41, and going to the power tables for alpha = .05, two tailed, we get .29 as the power value.
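
For anyone who wants to check the table value, a quick sketch using the normal approximation reproduces the same numbers (the 3.5-point difference, SD of 7, and n = 16 per group are the textbook values quoted above):

```python
from scipy.stats import norm

d = 3.5 / 7                  # effect size = .50
n = 16                       # per group
delta = d * (n / 2) ** 0.5   # ~1.41
z_crit = norm.ppf(0.975)     # alpha = .05, two-tailed
power = norm.cdf(delta - z_crit) + norm.cdf(-delta - z_crit)
print(round(delta, 2), round(power, 2))  # 1.41, ~0.29
```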

That's crappy power, so a valid question is how many more subs does one need to run to get power to be, say, .50?

It turns out you need 30.7 people per group for .5 power, versus the 16 we actually ran (at .29 power).

It's rare to know the population means and SD-- otherwise you wouldn't need to do the research-- so, how does one estimate power after the fact (i.e., when results turn out to be non-significant, as that's the only time you're gonna worry about power!)?

Cohen's solution was to create conventional effect sizes:

.20 = small
.50 = medium
.80 = large

We don't know the population means and SDs in truthseeker's study, but by picking a conventional effect size, we can still plug the values in to get the needed sample size for whatever level of power we desire.

In the stat book example I am looking at, they assume a large effect size (.8) and want high power (.8).

As it turns out, they need just 24.5 people per group for .80 power (with an assumed .80 effect size)!

In looking at the power table, even for .999 power and a .01 alpha, the number of subjects required is just 100 per group!

What if the effect size were only .10?

The sample size needed for .80 power would be:

n = 2 * (2.80)^2 / (.10)^2 = 1,568

We'd need 1,568 subjects per group for power = .80. That's a lot of subjects, but not 10,000 subjects!

What if the effect size were .01?

156,800 subjects needed per group. But, unless we're predicting hurricanes, I can't see where an effect size of .01 would have any practical significance.
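
A quick sketch that reproduces the per-group sample sizes above from the same formula (the textbook rounds the z-sum to 2.80, so its 1,568 and 156,800 differ trivially from these):

```python
from scipy.stats import norm

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n: 2 * (z_{alpha/2} + z_{power})^2 / d^2."""
    return 2 * (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2 / d**2

print(n_per_group(0.50, power=0.50))               # ~30.7
print(n_per_group(0.80, power=0.80))               # ~24.5
print(n_per_group(0.80, power=0.999, alpha=0.01))  # ~100
print(n_per_group(0.10, power=0.80))               # ~1,570 per group
print(n_per_group(0.01, power=0.80))               # ~157,000 per group
```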

Which leads to why I'm confused:

It makes more sense to me to pick the lowest effect size that would have practical significance, given the research question you're looking at. That effect size should be plugged into the formulas to see either how powerful the current study was (given its sample size), or how big a sample size was needed for a given level of power.

To me, it seems wrong to use the actual reported mean differences and standard deviation as the measure of the effect size you are trying to detect.

When the null is not rejected, there are two possible causes: There really is no difference, or power sucked (because of weak manipulations, less than perfectly reliable measures, or sample sizes that are too small).

I think using the observed / reported effect size value as the benchmark is way too conservative and leads to an odd conclusion: Even though I am going to conclude there's nothing here, had I run 10,000 subjects, I would have claimed this effect significant!

Seems odd. Seems to answer the wrong question (how many subjects would be needed for this trivial, observed effect to be significant). Is this the question you want answered?

I think you wanna answer this question: How much power did this study have to detect a practically important effect, assuming one exists?

Wouldn't the better sounding conclusion be something like: The present study showed nominal differences in pain perception-- the older group reported more pain than the younger group, but the difference was not significant.

This is perhaps a power issue. We know from previous research that an effect size of .20 on our pain survey is meaningful. Two groups that differ by .20 SDs or more on our scale differ importantly in terms of their perceived pain. Given that, the power to detect at least a .20 effect size from the current study was only .4. Hence, we are reluctant to claim that age has no effect on pain perception in these studies, due to their low statistical power. In fact, given that .80 power is the standard to strive for, the present studies would need to have run 200 subjects per group, instead of the 100 actually run...

Sorry for the crappy writing, but I hope my argument is clear.

So, where did I go wrong?!

In thinking about this, consider, too: Why would Cohen adopt/invent his conventional effect sizes for situations like these (which is what TS faces here) if the appropriate way to go is to use the raw data as the estimate of the effect size???


Help!
 
This starts to verge on special pleading. I assume that you didn't mean to veer in that direction, but that's where it's starting to go.

I don't think it's begging the question to assume that the effect size is what we measured it to be. That's the most reasonable conclusion to draw given the evidence. You're right that the effect size might be larger than we've observed, but it's just as likely to be smaller than we observed.

Most statistics -- not all, certainly, but most -- are symmetric. So if we have statistically significant evidence that the observed value of x is different from (greater than) 0, then we also have statistically significant evidence that the observed value of x is different from (less than) 2x. If TruthSeeker were interested, she's already done most of the analysis that would let her put confidence intervals around the effect size, so we could say "we are 95% confident that the true value is between here and there." (And under this analysis, a test is "significant" if and only if the confidence interval excludes 0.00. Isn't theoretical stats fun?) If neither here nor there corresponds to a "clinically" significant difference, then that's more or less a final nail in the coffin of this particular hypothesis.
I'm not entirely sure what we agree about and what we disagree about.

I agree that it would be a good idea to calculate a confidence interval for the effect size and see whether the interval includes something that's clinically significant. But I think this is not very different from what I suggested in my last post, namely, to decide what size effect would be clinically significant, and to determine whether the observed results would be very unlikely or not too unlikely if the true effect were of that size.

I don't think that calculating confidence intervals involves supposing that the true effect size is the same as what was observed, and then, e.g., determining on that supposition how large a study would be needed to reject in a significance test the null hypothesis of no effect.
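
One way to put a number on "very unlikely or not too unlikely": treat the observed effect size as approximately normal around the true effect and ask how probable a result this small would be if the true effect were the smallest clinically meaningful one. All of the numbers below are hypothetical stand-ins:

```python
from math import sqrt
from scipy.stats import norm

d_obs, d_clinical, n1, n2 = 0.10, 0.20, 80, 80    # made-up illustrative values
# Large-sample approximation to the standard error of an observed Cohen's d.
se = sqrt((n1 + n2) / (n1 * n2) + d_clinical**2 / (2 * (n1 + n2)))
prob = norm.cdf(d_obs, loc=d_clinical, scale=se)  # P(observed d <= 0.10 | true d = 0.20)
print(round(prob, 2))                             # ~0.26: not at all surprising
# A probability this large means the data provide only weak evidence against a
# clinically meaningful effect; a tiny probability would be real evidence against one.
```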
 
Hey, I think I agree with 69dodge, though it took me 20 paragraphs to say that.
 
I'm not sure how literally you meant this to be taken, but I guess, in the presence of a real effect, one wouldn't expect the p-values themselves to be the same across studies of different sizes: larger studies will tend to have smaller (more significant) p-values than smaller studies, in a test of the null hypothesis of no effect. So one should look instead for consistency of some other, more direct, measure of the effect size.

Not too literally. :) It's true that in the presence of a true relationship (i.e. the null is false) the p-values will tend to become smaller as the sample size increases, but if the null is correct, that won't hold.

My thinking is this: if the null is correct and there is no relationship, then the p-values could be considered to be independent random statistics with an expected value of 0.5. If the four studies have p-values of say, .21, .15, .05 and .04, then it's reasonable to speculate that a weak effect may exist but the studies lack sufficient power to identify it reliably. On the other hand, if the p-values are .78, .52, .18, .05 it seems unlikely that it is anything other than random variation.
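
A quick simulation along these lines, with arbitrary sample sizes and a weak assumed effect, shows both patterns: under a true null the p-values are roughly uniform (mean about 0.5), while under a weak real effect they crowd toward zero even though many individual studies stay nonsignificant:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, reps = 50, 2000          # arbitrary per-group n and number of simulated studies

def sim_pvalues(true_d):
    """Simulate two-sample t-test p-values when the true standardized effect is true_d."""
    pvals = []
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_d, 1.0, n)
        pvals.append(ttest_ind(a, b).pvalue)
    return np.array(pvals)

for true_d in (0.0, 0.2):
    p = sim_pvalues(true_d)
    print(f"true d = {true_d}: mean p = {p.mean():.2f}, share with p < .05 = {(p < .05).mean():.2f}")
```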
 
Thanks, everyone.

Some great suggestions here. I'm limited somewhat by the amount of information the authors provide in their manuscripts (very frustrating!) but you have given me some possibilities.
Have you tried contacting the authors directly? They might be able to give you access to the raw data.
 
