Jacob indeed-- I dunno where I got Stanley.
Well, this interests me. Reading the replies, I'm more convinced that I'm wrong, but I'm still not sure why, and it still seems like 10,000 is way too many to get power = .80. If anyone can help me out here, I would appreciate it!
I thought there were two ways to do power analysis. In one, you know the true population effect size. I pulled out an old text with an example.
Assume you knew the population age difference was 3.5 points on the pain survey with a population SD of 7.
The effect size would be d = 3.5 / 7 = .50.
In this example, there were 16 younger and 16 older patients (categorized as younger and older, versus using age as a continuous variable).
Delta is the effect size times the square root of (n / 2), where n is the per-group sample size: delta = .50 * sqrt(16 / 2) = 1.41. Going to the power tables for alpha = .05, two-tailed, we get .29 as the power value.
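If it helps to see the arithmetic, here's a quick sketch of that calculation in Python (scipy assumed; this is the normal approximation the tables are based on, and the numbers are just the ones from the example above):

```python
from scipy.stats import norm

d = 3.5 / 7            # effect size: mean difference / SD = .50
n = 16                 # per-group sample size
alpha = 0.05           # two-tailed

delta = d * (n / 2) ** 0.5        # noncentrality: .50 * sqrt(8) = 1.41
z_crit = norm.ppf(1 - alpha / 2)  # critical z of 1.96
power = norm.cdf(delta - z_crit)  # ~.29, matching the table
print(round(power, 2))
```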
That's crappy power, so a valid question is: how many more subjects would one need to run to get power up to, say, .50?
Inverting the formula (n = 2 * (delta / d)^2, with delta = 1.96 for .50 power), it turns out you need 30.7 people per group, versus the 16 we actually ran (at .29 power).
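Inverting it in code is just as easy; a hedged sketch (same normal approximation, so exact t-based software will give slightly bigger n's):

```python
from scipy.stats import norm

def n_per_group(d, power, alpha=0.05):
    """Per-group n from the two-tailed normal approximation."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z / d) ** 2

print(n_per_group(0.5, 0.5))  # ~30.7 per group, as above
```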
It's rare to know the population means and SDs-- otherwise you wouldn't need to do the research-- so, how does one estimate power after the fact (i.e., when results turn out to be non-significant, as that's the only time you're gonna worry about power!)?
Cohen's solution was to create conventional effect sizes:
.20 = small
.50 = medium
.80 = large
We don't know the population means and SDs in truthseeker's study, but by picking a conventional effect size, we can still plug the values in to get the needed sample size for whatever level of power we desire.
In the stat book example I am looking at, they assume a large effect size (.8) and want high power (.8).
As it turns out, they need just 24.5 people per group for .80 power (with an assumed .80 effect size)!
Looking at the power table, even for .999 power and a .01 alpha (still assuming that .80 effect size), the number of subjects required is just 100 per group!
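Same helper, same caveat (normal approximation, so a real power table or t-based software may differ by a subject or two):

```python
from scipy.stats import norm

def n_per_group(d, power, alpha=0.05):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z / d) ** 2

print(n_per_group(0.8, 0.80))               # ~24.5 per group
print(n_per_group(0.8, 0.999, alpha=0.01))  # ~100 per group
```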
What if the effect size were only .10?
The sample size needed for .80 power would be:
n = 2 * (2.80)^2 / (.10)^2
(where 2.80 is the critical z of 1.96 plus the z of 0.84 that corresponds to .80 power). We'd need 1568 subjects per group for power = .80. That's a lot of subjects, but not 10,000 subjects!
What if the effect size were .01?
156,800 subjects needed per group. But unless we're predicting hurricanes, I can't see where an effect size of .01 would have any practical significance.
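And the tiny-effect cases, run through the same sketch (my exact z's differ a hair from the rounded 2.80 used in the formula above):

```python
from scipy.stats import norm

def n_per_group(d, power, alpha=0.05):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z / d) ** 2

print(n_per_group(0.10, 0.80))  # ~1570 per group (1568 with z = 2.80)
print(n_per_group(0.01, 0.80))  # ~157,000 per group (156,800 with z = 2.80)
```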
Which leads to why I'm confused:
It makes more sense to me to pick the lowest effect size that would have practical significance, given the research question you're looking at. That effect size should be plugged into the formulas to see either how powerful the current study was (given its sample size), or how big a sample was needed for a given level of power.
To me, it seems wrong to use the actual reported mean differences and standard deviations as the measure of the effect size you are trying to detect.
When the null is not rejected, there are two possible causes: there really is no difference, or power sucked (because of weak manipulations, less-than-perfectly-reliable measures, or too-small sample sizes).
I think using the observed/reported effect size as the benchmark is way too conservative and leads to an odd conclusion: even though I am going to conclude there's nothing here, had I run 10,000 subjects, I would have claimed this effect significant!
Seems odd. Seems to answer the wrong question (how many subjects would be needed for this trivial, observed effect to be significant). Is this the question you want answered?
I think you wanna answer this question: what kind of power did this study have to detect a practically important effect, assuming one exists?
Wouldn't the better-sounding conclusion be something like: The present study showed nominal differences in pain perception-- the older group reported more pain than the younger group, but the difference was not significant.
This is perhaps a power issue. We know from previous research that an effect size of .20 on our pain survey is meaningful: two groups that differ by .20 SDs or more on our scale differ importantly in terms of their perceived pain. Given that, the power to detect at least a .20 effect size in the current study was only about .29. Hence, we are reluctant to claim that age has no effect on pain perception in these studies, due to their low statistical power. In fact, given that .80 power is the usual target, the present studies would have needed roughly 390 subjects per group, instead of the 100 actually run...
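For what it's worth, the numbers in that hypothetical write-up come out of the same machinery (100 per group is just a made-up n for the sake of the example):

```python
from scipy.stats import norm

d, n, alpha = 0.20, 100, 0.05
delta = d * (n / 2) ** 0.5                         # 0.20 * sqrt(50) = 1.41
power = norm.cdf(delta - norm.ppf(1 - alpha / 2))
print(round(power, 2))                             # ~.29 at n = 100 per group

z = norm.ppf(1 - alpha / 2) + norm.ppf(0.80)
print(2 * (z / d) ** 2)                            # ~392 per group for .80 power
```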
Sorry for the crappy writing, but I hope my argument is clear.
So, where did I go wrong?!
In thinking about this, consider, too: why would Cohen adopt/invent his conventional effect sizes for situations like these (which is what TS faces here) if the appropriate way to go is to use the raw data as the estimate of the effect size???
Help!