
Statistical help, please

Badly Shaved Monkey

I have performance data that I must judge and I need some help. It's not difficult stats, but I need to get it right, so I could do with some guidance on the optimal analysis.

The data are paired values. As a cohort, the group must achieve a rise in mean value of 13.2 between the two time points. The means at the two time points are 14.7 and 27.6 respectively. Difference: 12.9.

One problem is that the cohort is small and, although the data are naturally paired, 3 individuals were part of the later cohort but not of the first, so it may be that an unpaired comparison is better because the n would be larger.

I don't think any meaningful statement can be made about whether the group has exceeded or undershot its target, because ±2 SEM about the mean difference easily encompasses the target value, but please put me right if I am wrong.

I have listed the data with x's where individuals are missing from the first time point.

Time point 1
x, 16, 19, 15, 16, 19, 18, 19, 15, x, 13, 13, 5, 17, 8, 10, x, 18, 11, 19, 15, 19, 16

Time point 2
27, 27, 27, 23, 29, 33, 33, 29, 27, 27, 25, 25, 21, 31, 27, 21, 27, 29, 23, 29, 27, 33, 29

I'm very grateful for any help. I need to report these data on Monday of next week.
 
Isn't this a simple paired t-test (thus you should take a mean of the differences, not a difference of means)?

Also, I don't get the same averages as you do (e.g. for time 1, I get 15.05, taking n=20 and not including the missing cases). Is that really the data you have?
 
The beauty of the within-subjects t is that the error term is different. The variance within groups is irrelevant; all that matters is the difference scores (time 1 versus time 2), which here all have the same sign / are greater than zero.

You have a pretty large effect here-- assuming you presented the pairs right (i.e., the first value in time 1 is the same guy as the first value in time 2, etc).

Mean difference: -12.35
Std Dev of the diff: 2.601
Std error: .581
t = -21.23 !!!

effect size: 4.75!

A difference that huge can't all be due to increases in performance. I bet it's a demand characteristic where you're using some subjective measure of performance (versus, say, units produced or sales dollars) and the people doing the ratings know that there should be improvement from time 1 to time 2.

Still, though, you can spin it to be that the objective was definitely met!
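In case anyone wants to check those figures, here's a rough sketch in Python (numpy/scipy assumed, the three pairs with no time-1 score dropped, and the pairing taken exactly as listed):

Code:
# Quick check of the paired figures above, using listwise deletion of the
# three cases missing at time 1 and assuming the pairs line up as listed.
import numpy as np
from scipy import stats

t1 = np.array([16, 19, 15, 16, 19, 18, 19, 15, 13, 13,
               5, 17, 8, 10, 18, 11, 19, 15, 19, 16], dtype=float)
t2 = np.array([27, 27, 23, 29, 33, 33, 29, 27, 25, 25,
               21, 31, 27, 21, 29, 23, 29, 27, 33, 29], dtype=float)

diff = t2 - t1
n = len(diff)
print(diff.mean())                        # mean difference, about 12.35
print(diff.std(ddof=1))                   # SD of the differences, about 2.6
print(diff.std(ddof=1) / np.sqrt(n))      # standard error, about 0.58
print(stats.ttest_rel(t2, t1))            # t about 21.2 against a null difference of 0
print(diff.mean() / diff.std(ddof=1))     # standardised effect size, about 4.75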
 
Also, I don't get the same averages as you do (e.g. for time 1, I get 15.05, taking n=20 and not including the missing cases). Is that really the data you have?

I get the same (15.05).

I assume you would want the statistical test to tell you whether the actual results were significantly different from the target? In which case I would add the target increase to the time 1 scores and do a t-test between that and the time 2 scores. I don't think changing to an unrelated t-test would help, as what you gain from a couple of extra data points probably wouldn't outweigh what you lose by giving up the relationship between the data.

(You might also need to check your data fit the assumptions for parametric tests - if not, do the non-parametric equivalent.)

(PS IANAS! This is purely remembered from the stats classes in my psychology degree.)
 
Btw, inputting the data gave a mean of 15.05 for time 1 and 27.4 for time 2 (missing cases excluded).


Here are the pairs (999.0 = missing at time 1):
999.0 27.0
16.0 27.0
19.0 27.0
15.0 23.0
16.0 29.0
19.0 33.0
18.0 33.0
19.0 29.0
15.0 27.0
999.0 27.0
13.0 25.0
13.0 25.0
5.0 21.0
17.0 31.0
8.0 27.0
10.0 21.0
999.0 27.0
18.0 29.0
11.0 23.0
19.0 29.0
15.0 27.0
19.0 33.0
16.0 29.0
 
the uncorrelated / between subjects t would be improper here since the values are paired, plus you would lose a ton of statistical power.

There are options for estimating missing values, but you don't have the n size to do that, and you don't need to anyway as it's a hugely freakish effect with just n = 20.
 
t = -21.23 !!!

Still, though, you can spin it to be that the objective was definitely met!

The null is mu(t1)-mu(t2)= -13.2. There is an effect, but it's not what BSM is looking for. The results are not statistically significantly different from the null hypothesis though.
 
I'll wait for the OP to clarify but it seems silly to worry about whether 12.9 is different from 13.2 when the group is doing almost twice as good at time 2.

What are the units? Is a 0.3-unit difference meaningful?

If 13.2 is what yer testing against, then it wouldn't be different from 12.9, but now you're betting on a null.

the t for adding 13.2 to everyone's time 1 score versus time 2 is 1.461 (p=.16).
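
In case that's useful to check, here's a small sketch (same Python/scipy assumptions and 20 complete pairs as before) of the test against the 13.2 target, done both ways:

Code:
# Testing the mean change against the 13.2 target. Adding the target to the
# time-1 scores and pairing against time 2 is the same as a one-sample t-test
# of the differences against 13.2.
import numpy as np
from scipy import stats

target = 13.2
t1 = np.array([16, 19, 15, 16, 19, 18, 19, 15, 13, 13,
               5, 17, 8, 10, 18, 11, 19, 15, 19, 16], dtype=float)
t2 = np.array([27, 27, 23, 29, 33, 33, 29, 27, 25, 25,
               21, 31, 27, 21, 29, 23, 29, 27, 33, 29], dtype=float)

print(stats.ttest_rel(t2, t1 + target))            # |t| about 1.46, p about .16
print(stats.ttest_1samp(t2 - t1, popmean=target))  # same result, one-sample framing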
 
Badly Shaved Monkey said:
As a cohort, the group must achieve a rise in mean value of 13.2 between the two time points.
What does the value 13.2 represent? Is it an absolute, or relative to t1 (i.e. a percentage)?
And why "as a cohort"?

Badly Shaved Monkey said:
One problem is that the cohort is small and, although the data are naturally paired, 3 individuals were part of the later cohort but not of the first, so it may be that an unpaired comparison is better because the n would be larger.
I would tend to agree with Jorghnassen; it should be a paired test of means of differences, and the larger n does you no good.

Is there a specific reason why 3 individuals were added to the second cohort?

Badly Shaved Monkey said:
I don't think any meaningful statement can be made about whether the group has exceeded or undershot its target, because ±2 SEM about the mean difference easily encompasses the target value, but please put me right if I am wrong.
The actual result does not appear to be different from the expected result, so you can say there is no reason to state the group undershot the target (12.9 vs 13.2). What was the initial expectation - no difference, over, or under?

I'll wait for the OP to clarify but it seems silly to worry about whether 12.9 is different from 13.2 when the group is doing almost twice as good at time 2.

You're making an interpretation of the data without any basis. How do you know that the group is doing twice as good?

It may be that for the given group, a change of 13.2 is minimal and to be expected; a change of 12.9 is less than normal.
 
Dakotajudo has it right, the 13.2 is the target that has to be achieved. These are standardised educational test scores. The time points are 3 years apart. My problem is that an outside body is asserting that this group has underachieved because 12.9 < 13.2. My point is that this cannot be competently asserted from the data, which is obvious, but given that it has been asserted by people with official power over the institution concerned, it needs to be challenged.

I'll need to check the data if you're getting different means to me, I may have transcribed a number wrongly, but that does not affect the principles at play here.

Three individuals entered the group because that is what happens in groups of students rather than experimental subjects. Their existence has to be allowed for - either included in the analysis or appropriately excluded with reasons given. The numbnuts assessing the data from outside simply ignore the absence of data at the earlier time point for three individuals and make their assertions just by saying whether the difference of the means is better or worse than 13.2, which is the nationally accepted target.

I hope that is clearer. I'm posting this from my phone which makes proof-reading tricky.

Thanks
 
Because you have paired data (thus the results at time 1 and time 2 are not independent), you have to use a paired t-test, which means excluding the 3 individuals who were not there at t1. There are ways to impute the missing data, but that wouldn't be of any use here (it would increase the variance of the estimated mean difference/reduce the degrees of freedom). However, the details are moot as we cannot reject the average difference of 13.2 from the data either way.
 
I didn't know it was developmental / on kids, with three years in between observations. When I saw "performance", I assumed employees.

This is one of the few scenarios I can think of where a near 5.0 SD increase might not be impressive. Think about the magnitude of a 5.0 SD increase in performance in any other context (like, say, raising a kid's IQ from 70 to 145).

That said, showing that 12.35 isn't different from 13.2 might be hard given low power.
 
However, the details are moot as we cannot reject the average difference of 13.2 from the data either way.

I think I am not going beyond the bounds of confidentiality to say that when I made that point to the Ofsted inspector it was met with a frosty silence and a suggestion that I might benefit from more training in understanding school data. The irony of the fact that the subject of concern is Maths and the inspectors revealed a level of understanding that would fail them at GCSE was not lost on me.

I've never seen behind the scenes of one of these processes before but this blindness to proper data analysis is built into the core of how schools are assessed. There are systems in place to collect reams of numerical data but the analysis of that data is useless. I was wrongfooted by not realizing just how incompetent it could be.

A formal complaint is likely, but the fact that thousands of schools are regularly being judged like this beggars belief.
 
Do you know what their expected mean level of performance was three years ago? If they were way below it then, but came very close to the new expected three years later, would that give you some arguing power?
 
I'm not so great at stats, but here's my input. As others have said, a t-test is appropriate here. We can compute t as the following:
t = (xbar - u0) / (s/sqrt(n))

For our samples, I'm using the difference in scores here (pairs with missing samples are ignored).
xbar = 12.35 (average of differences)
u0 = 13.2 (the target score)
s = 2.601 (standard deviation)
sqrt(n) = 4.472

From this, we get t = -1.4615. I then plugged this number into an online calculator:
http://www.stat.tamu.edu/~west/applets/tdemo.html

Degrees of freedom is n-1, or 19. "Area to the right" comes to 0.92, or 92%.

So the hypothesis that the individuals did not meet their target only reaches 92% confidence--not up to the usual standard.
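
(The same lookup can be done without the applet; here's a rough sketch with scipy's t distribution, plugging in the figures above rather than re-deriving them:)

Code:
# "Area to the right" from the t distribution, using the quoted figures.
from scipy import stats

xbar, u0, s, n = 12.35, 13.2, 2.601, 20
t = (xbar - u0) / (s / n ** 0.5)   # about -1.46
df = n - 1
print(stats.t.sf(t, df))           # area to the right, about 0.92
print(stats.t.cdf(t, df))          # one-tailed p for "undershot the target", about 0.08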

- Dr. Trintignant
 
Dakotajudo has it right, the 13.2 is the target that has to be achieved. These are standardised educational test scores. The time points are 3 years apart. My problem is that an outside body is asserting that this group has underachieved because 12.9 < 13.2. My point is that this cannot be competently asserted from the data, which is obvious, but given that it has been asserted by people with official power over the institution concerned, it needs to be challenged.

But is this really a statistics problem?

The problem is not whether or not the actual improvement of a hypothetical population (estimated from a sample) is equal to 13.2. In that case, 12.9 is not different from 13.2.

The problem is whether or not a group consisting of the entire population achieves a minimum standard of improvement of 13.2. In this case, 12.9 is indeed less than 13.2.


The only statistical issue is whether this group can be used to measure the performance of a larger group - for example, if 13.2 is the expected improvement for test for an entire school, and the cohort given here is a sample of students used to measure that improvement. If that's the case, then given this sample of students, you can't say that the goal has been met, only that the results are not different from the goal.


Whether it's a fair assessment (again, what is the logic for 13.2? Why is that the minimum acceptable value?) is a different debate. Not sure you can solve it statistically.

Perhaps the desired improvement was 15, and 13.2 was selected to allow for a certain expected variance over time and across different schools?

It might be approached like a quality control problem. For example, drugs are packaged with an expected error rate (that is, your pills contain plus/minus 10% of the labeled dose). A batch that is sampled to be out of that range is discarded (in this hypothetical), even if its actual error is only 10.01%.

Not what we want to do to kids, but that's the type of problem. The ethical question becomes - if this group fails to meet the standard, is the school punished? Or will actions be taken to improve teaching (I'm cynical about the latter)?
 
Dakotajudo, I've got a weak phone signal so I need to keep my responses short or the browser crashes on me.

The subjects are the total population not a sample from a larger population.

13.2 is a number defined by the way the scores are assessed. All schools are assessed against this target in "core" subjects. The "average points score" of a cohort (educationalist jargon rather than strict stats usage, meaning the group of children in the same educational year, e.g. 8th grade in US parlance) must rise by at least that much.

So, I think the right test is a t-test of the differences in point score over the two time points versus 13.2 with rejection of the 3 cases that do not allow us to calculate their increase in point score. Yes?
 
p.s. Two successive failures to meet this standard, judged by the "one number is bigger than the other number, ignoring variance and degrees of freedom" method, lead to punishment regardless of whether that judgement is statistically unsupportable. One failure requires that the school demonstrates that it has diverted resources to correct the alleged error.
 
p.p.s. I am happy to accept that the sum of decisions made across thousands of schools may have overall statistical power to lead to well-founded strategic assessments (at least in theory). My problem is at the individual school level where poorly founded judgements have a severe effect on that one school.
 
if you used regression to predict the 3 missing time 1 values-- which doesn't seem unreasonable-- the resulting t value (with 13.2 added to all time 1 scores) is 1.703 with a p value of .103 or .0515 (one tailed).

Would the fairest conclusion then be there probably is a small effect here with the sample students doing slightly worse than national averages? It's not significant, but the power to detect the difference is only .37, with 23 pairs.

You don't seem to have the power to argue that the observed change is not smaller than the expected change.

ETA: power is .284 with 20 cases.
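
For what it's worth, here's a rough sketch of that sort of calculation (numpy/scipy plus statsmodels for the power figure; the exact numbers depend on how the regression is set up, so treat it as illustrative rather than a re-derivation of the values above):

Code:
# Regression-impute the three missing time-1 scores from time 2, then test the
# 23 differences against the 13.2 target. Power is approximated with statsmodels.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

target = 13.2
t1 = np.array([16, 19, 15, 16, 19, 18, 19, 15, 13, 13,
               5, 17, 8, 10, 18, 11, 19, 15, 19, 16], dtype=float)
t2 = np.array([27, 27, 23, 29, 33, 33, 29, 27, 25, 25,
               21, 31, 27, 21, 29, 23, 29, 27, 33, 29], dtype=float)
t2_late = np.array([27.0, 27.0, 27.0])     # time-2 scores of the three late joiners

# Simple linear regression of time 1 on time 2, fitted on the 20 complete pairs
slope, intercept = np.polyfit(t2, t1, 1)
t1_pred = slope * t2_late + intercept

diffs = np.concatenate([t2 - t1, t2_late - t1_pred])   # 23 differences
print(stats.ttest_1samp(diffs, popmean=target))        # t in the region of the 1.7 quoted

# Approximate post-hoc power to detect the observed shortfall from the target
d = abs(diffs.mean() - target) / diffs.std(ddof=1)
print(TTestPower().solve_power(effect_size=d, nobs=23, alpha=0.05))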
 