
Statistical help, please

If you used regression to predict the 3 missing time 1 values -- which doesn't seem unreasonable -- the resulting t value (with 13.2 added to all time 1 scores) is 1.703, with a p value of .103 (two-tailed) or .0515 (one-tailed).
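
In case it helps to see the mechanics, here is roughly what I mean as a sketch in Python, with made-up scores standing in for the real cohort data (none of the numbers below are the school's actual figures; the regression-based fill-in and the 13.2 shift are the only moving parts):

[code]
import numpy as np
from scipy import stats

# Made-up point scores standing in for the real cohort: 23 children,
# three of whom have no time-1 score.
t2 = np.array([28, 30, 27, 33, 29, 31, 26, 32, 30, 28, 29, 34,
               27, 31, 30, 28, 33, 29, 26, 32, 30, 27, 31], dtype=float)
t1 = t2 - 13.0 + np.linspace(-2, 2, 23)     # arbitrary time-1 scores
t1[[3, 10, 17]] = np.nan                    # the three missing cases

# Regression imputation: predict the missing time-1 scores from time 2
# using the 20 complete pairs.
complete = ~np.isnan(t1)
slope, intercept, *_ = stats.linregress(t2[complete], t1[complete])
t1_filled = np.where(np.isnan(t1), intercept + slope * t2, t1)

# Add the 13.2 target to every time-1 score, then a paired t-test vs time 2.
t_stat, p_two_tailed = stats.ttest_rel(t2, t1_filled + 13.2)
print(t_stat, p_two_tailed, p_two_tailed / 2)   # halve for a one-tailed test
[/code]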

Would the fairest conclusion then be that there is probably a small effect here, with the sample students doing slightly worse than national averages? It's not significant, but the power to detect the difference is only .37 with 23 pairs.
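
If anyone wants to sanity-check a power figure like that .37, this is the sort of calculation I have in mind (a sketch only; I'm assuming the power is computed from the observed standardised effect, t divided by the square root of the number of pairs):

[code]
from statsmodels.stats.power import TTestPower

n_pairs = 23
d = 1.703 / n_pairs ** 0.5      # effect size implied by the observed t = 1.703
power = TTestPower().power(effect_size=d, nobs=n_pairs, alpha=0.05,
                           alternative='two-sided')
print(round(d, 2), round(power, 2))
[/code]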

You don't seem to have the power to argue that the observed change is not smaller than the expected change.


eta power is .284 with 20 cases


Can I just clarify? Beyond the issue of imputing the missing values, you added 13.2 to all the t1 scores then simply did a t-test versus the t2 scores. Is that arithmetically and conceptually the same as calculating each individual's t1-t2 difference and testing whether it differs from 13.2? Does the choice of whether or not to impute the missing values affect the answer to the question?

I know I risk sounding as dumb as the people I am criticising, but I hope the difference is that I grasp the principles and am smart enough to ask the right questions. It's also 15 years since I last spent time going through the actual calculations of statistical analysis, so I hope I can be forgiven for being a bit rusty.
 
Dakotajudo, I've got a weak phone signal so I need to keep my responses short or the browser crashes on me.

The subjects are the total population not a sample from a larger population.

13.2 is a number defined by the way the scores are assessed. All schools are assessed against this target in "core" subjects. The "average points score" of a cohort (educationalist jargon rather than strict stats usage, meaning the group of children in the same educational year, e.g. 8th grade in US parlance) must rise by at least that much.

So, I think the right test is a t-test of the differences in point score over the two time points versus 13.2 with rejection of the 3 cases that do not allow us to calculate their increase in point score. Yes?

You could do a single-sample t test comparing your set of difference scores between t1 and t2 against an improvement of 13.2 (the mean from a hypothetical population that improves by 13.2 points with an unknown standard deviation). This would show that the 12.9 mean of your difference scores between T1 and T2 is not significantly different from 13.2.

I don't see how this result would really be meaningful. It would tell you that if the 'true' improvement in a hypothetical population is 13.2, your school's result has a certain probability of being sampled from a population that improves by 13.2, and this probability is higher than whatever critical value is used. Next time you repeat this without changing anything, the probability of getting an improvement of 13.2 may be high. However, you say that your results are a population, not a sample, and this population must improve by 13.2. They haven't. I don't see how a null hypothesis test might change this interpretation. You also can't base an interpretation on failure to find statistical significance where the power is low.
 
I do a little statistical analysis in evaluating the performance of instrumentation in nuclear plants. I'd have to agree with Elaedith and dakotajudo - it seems you have but one sample from which to gauge the true improvement of this population over the prescribed time period. Hard to work any meaningful numbers from that, other than to say that 12.9 < 13.2.
 
Elaedith

On the subject of sample vs population. My data is the whole population of that school, but in terms of testing I think it is correct to regard it as a sample, and the null hypothesis is that it represents a sample of a larger population with this mean improvement of 13.2.

bpesta's method strikes me as the most direct way to do the test in a manner that is easy to explain to others. I think the main issue is whether imputing the missing data is good to do or not. Imputing the data makes it more generalisable to classes where the flux of students may be even larger than 3/20.

In that regard, can someone explain how imputing missing data influences the subsequent testing? Does using the data to generate new data points affect the degrees of freedom used in deriving t? How to decide degrees of freedom is something I never fully grasped. I could do the calculations by rote, but I never really understood how to work out the degrees of freedom from first principles. I know that without that understanding my abilities to analyse data do not generalise well beyond textbook-straightforward examples.
 
Elaedith

On the subject of sample vs population. My data is the whole population of that school, but in terms of testing I think it is correct to regard it as a sample, and the null hypothesis is that it represents a sample of a larger population with this mean improvement of 13.2.

bpesta's method strikes me as the most direct way to do the test in a manner that is easy to explain to others. I think the main issue is whether imputing the missing data is good to do or not. Imputing the data makes it more generalisable to classes where the flux of students may be even larger than 3/20.

In that regard, can someone explain how imputing missing data influences the subsequent testing? Does using the data to generate new data points affect the degrees of freedom used in deriving t? How to decide degrees of freedom is something I never fully grasped. I could do the calculations by rote, but I never really understood how to work out the degrees of freedom from first principles. I know that without that understanding my abilities to analyse data do not generalise well beyond textbook-straightforward examples.
 
Hi BSM.

I found the post interesting because I have been doing this stuff for about 15 years, though I am neither a statistician nor a mathematician. Obviously, pick the method here that makes the most sense to you, and has consensus from the others.

I think the t-value I got in my last analysis can be treated conceptually like a z score. What are the odds this sample of 23 students came from a population with a mean change score of 13.2? Well, given 1.703 is less than the critical value (either one or two tailed) the "correct" conclusion is that it likely is a random sample from the population-- your kids are no different from the population, statistically.

But, like many have said above, is the 13.2 cutoff meant to be a statistical cutoff, or an all-or-nothing hurdle? I dunno.

Further complicating the issue is the very low power you have with only 23 pairs. So, concluding no difference could very likely be a type II error.

The missing values thing is currently a hot topic in psychology. There are old and new methods. Their validity depends on why the data are missing (missing completely at random, missing at random, or missing not at random).

Here's a review article (they don't recommend using regression to predict missing values, but they don't say why). The paper is technical, but if you need a cite to justify whatever you do, here it is:

http://www.iapsych.com/articles/graham2009.pdf

Can I just clarify? Beyond the issue of imputing the missing values, you added 13.2 to all the t1 scores then simply did a t-test versus the t2 scores. Is that arithmetically and conceptually the same as calculating each individual's t1-t2 difference and testing whether it differs from 13.2? Does the choice of whether or not to impute the missing values affect the answer to the question?

I know I risk sounding as dumb as the people I am criticising, but I hope the difference is that I grasp the principles and am smart enough to ask the right questions. It's also 15 years since I last spent time going through the actual calculations of statistical analysis, so I hope I can be forgiven for being a bit rusty.
 
Also, I think using the difference-score mean and SD approach and treating it as a one-sample t-test (does the sample mean come from the population?) is equivalent to the paired t-test here, which focuses on the difference scores anyway.

The standard error in the one-sample approach would be the SD of the differences divided by the square root of the sample size; the same thing as in the paired-samples t-test.
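
A quick numerical check of that equivalence, on made-up numbers rather than the real cohort, just to show the two routes give the same t:

[code]
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
t1 = rng.normal(15, 4, size=23)
t2 = t1 + rng.normal(12.9, 3, size=23)      # arbitrary improvement scores

# Route 1: add 13.2 to every time-1 score, then a paired t-test against time 2.
paired = stats.ttest_rel(t2, t1 + 13.2)

# Route 2: one-sample t-test of the individual differences against 13.2.
one_sample = stats.ttest_1samp(t2 - t1, popmean=13.2)

print(paired.statistic, one_sample.statistic)   # identical, as are the p values
[/code]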
 
Allowing that a Type II error can be made is actually part of the point that these inspectors seem unable to grasp: the sample size is small, so you cannot draw a proper conclusion, and you should not act as if you have done so.
 
Allowing that a Type II error can be made is actually part of the point that these inspectors seem unable to grasp: the sample size is small, so you cannot draw a proper conclusion, and you should not act as if you have done so.
Depends on how you're looking at things. If you consider each pair to be a datum point, then the sample size is huge - it's the entire population. If you're considering the mean of the test score differences, then you have a sample of one. But how is making a decision based on that one sample different from any other exam grading scheme?
 
13.2 is a number defined by the way the scores are assessed. All schools are assessed against this target in "core" subjects. The "average points score" of a cohort (educationalist jargon rather than strict stats usage, meaning the group of children in the same educational year, e.g. 8th grade in US parlance) must rise by at least that much.

So, I think the right test is a t-test of the differences in point score over the two time points versus 13.2 with rejection of the 3 cases that do not allow us to calculate their increase in point score. Yes?

No, the t-test is not the right test. You have no expected result, only the minimum acceptable result. Any statistical test would be invalid.

You can argue, though, that the rejection of the three cases is warranted.


I'm afraid you're engaging in an abuse of statistics. You seem to have determined what the outcome should be (that this particular cohort has met its target) and are trying to find a statistic to support that conclusion.

That's a post-hoc analysis. The a priori standard is that the mean must be > 13.2. There is no statistical test for that; either it's met, or not. You might try to argue that if the group is given a chance to retake the test, they might (just through random differences in performance on different days) achieve the minimum. But that statistical error may have already been taken into consideration. I'm troubled that you seem to think that the group setting the standard doesn't understand the statistics involved. Maybe they're experts?

Consider this analogy. I'm teaching a course; the course syllabus states that 5 exams will be given and to receive an A, you must average 90 over the 5 exams. If, at the end of the course, you earned an average score of 89, you wouldn't be able to argue that, since your 89 is not statistically different from 90, you should be given the A.

You might argue, though, that your roommate had a heart attack the night before the second exam and you had to sit with him in the hospital all night, so you did very poorly on the second exam and all your other exams were in the 90s.

This should be your argument, because, as you said above
p.p.s. I am happy to accept that the sum of decisions made across thousands of schools may have overall statistical power to lead to well-founded strategic assessments (at least in theory). My problem is at the individual school level where poorly founded judgements have a severe effect on that one school.

You're arguing for the one school, not the population. The question that really needs to be addressed is why this particular group failed to meet the minimum standard. What was the overall mean for improvement, for comparable groups?



There's a reason people scoff at statistics; some say that statistics can be used to prove anything. You're trying to do that with the t-test applied to this case.
 
I think we need more info about the 13.2 before we can know whether dak is right.

Is 13.2 the mean performance across the entire district? If so, I don't think a stat analysis would be inappropriate.

Or, is the mean actually 15, and they did some calculation such that 13.2 is indeed an all-or-nothing minimum?
 
Consider this analogy. I'm teaching a course; the course syllabus states that 5 exams will be given and to receive an A, you must average 90 over the 5 exams. If, at the end of the course, you earned an average score of 89, you wouldn't be able to argue that, since your 89 is not statistically different from 90, you should be given the A.

Actually, that seems quite reasonable, even if no teacher would accept the excuse. The scores here aren't really the full population, since they don't include all the tests a student might have taken, but didn't. No student will get the same score on a test twice in a row, due to all kinds of factors (including the roommate having a heart attack). It's not like, say, measuring height, where you'll get the same result each time (within rounding errors and all that).

Further, although the students here represent the total population of the school, they don't represent the population as a whole. Imagine that school A has 1000 students (and meets the target), while school B has 1 student (and doesn't meet the target). It doesn't seem unreasonable that school A has enough cross-section that the average score should be fairly accurate, while in school B, it's perfectly possible that they got "unlucky" with a particularly dim student, and that the one data point isn't representative of the school as a whole.
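
To put a toy number on that, here's a simulation under the (entirely made-up) assumption that every child in the country improves by 14.0 points on average, with the same spread everywhere, so every school "deserves" to pass. Small schools still miss the 13.2 target far more often, purely through sampling noise:

[code]
import numpy as np

rng = np.random.default_rng(1)
true_mean, sd, target, n_schools = 14.0, 5.0, 13.2, 10_000

for size in (1, 20, 1000):
    # Each school's children are drawn from the same national distribution,
    # so any differences between schools are pure chance.
    school_means = rng.normal(true_mean, sd, size=(n_schools, size)).mean(axis=1)
    print(size, (school_means < target).mean())
[/code]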

After all, unless I'm reading this incorrectly, the real goal is to measure the performance of the school, not the performance of the students. A subtle distinction, but important, since to properly evaluate the school's performance, one has to consider all the students that might have gone there, but didn't.

bpesta22 could be right, though--perhaps the 13.2 takes this into account, and the real goal is somewhat higher.

- Dr. Trintignant
 
There's one more way to argue this: the three students with no results at time 1, the missing cases, could represent improvements high enough to raise the school to the required 13.2 average improvement had they been there at time 1. It's grasping at straws (one could put an actual probability to it, given the other observations), but it's something that could be tried.
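
For instance, something along these lines (with made-up improvement scores standing in for the 20 observed ones) would give a rough probability by resampling plausible improvements for the three missing students:

[code]
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the 20 observed improvement scores (mean near the reported 12.9).
observed = rng.normal(12.9, 4.0, size=20)

# Resample three plausible improvements for the missing students and count how
# often the full 23-child average would reach the 13.2 target.
draws = rng.choice(observed, size=(100_000, 3), replace=True)
full_means = (observed.sum() + draws.sum(axis=1)) / 23
print((full_means >= 13.2).mean())
[/code]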
 
I'm afraid you're engaging in an abuse of statistics. You seem to have determined what the outcome should be (that this particular cohort has met its target) and are trying to find a statistic to support that conclusion.

I hope that is not the impression I had given.

I'm not trying to say that this cohort has met the standard from the data available, but that it cannot be said that they haven't met the standard.

My point is that these assessments are inherently flawed when the groups being assessed are frequently small and especially when the school is then required to take action as if the group had been shown to have underperformed.

You may have noticed I am only describing the data for one subject area. The school satisfies its targets under the other subject areas on the terms by which the inspectors work. I would say that this judgement is equally flawed. The asymmetry arises because a flawed judgement that a school has met its standard has no practical consequences in this round of assessment, just a pat on the back for a job allegedly well done.

I am not being logically inconsistent. If the school achieves 13.4 in maths next year, the system will judge it to have succeeded. I would say that this judgement would be just as ill-founded as this year's.

The target of 13.2 falls out from the way grades are calculated. The target that the school is set is that the "average points score" for a year-group must reach or exceed that level. So, I still think t-tests are appropriate if you have enough data or the variance is small enough. The problem is that most small schools do not have enough data, given the variance, for proper tests to be performed, but that action is then required of the school as if proper tests had been performed.

There are secondary targets for numbers of individual children achieving certain amounts of progress, but that's a whole other story. And let's not even begin to consider the problem of correcting for multiple comparisons.
 
The problem is whether or not a group consisting of the entire population achieves a minimum standard of improvement of 13.2. In this case, 12.9 is indeed less than 13.2.

I think this is maybe where wires are getting crossed.

It is accepted that not all individuals will achieve this target. Nationally the target is that the mean of the whole population should rise by that amount. It is not, therefore, a minimum hurdle to be vaulted. Ideally, with advancements in educational quality, the rise should improve over time. [sarcasm]And we can see from the complete absence of any suggestion of grade-inflation that governments are very good at ensuring that this does, objectively, happen.[/sarcasm]

However, at the school level, every school is judged as if its average must exceed the national level. It could be argued that this is the means of applying selective pressure. And yes, across the country on average the pressure might get applied in the right places, but this average will conceal pressure being applied in the wrong places in many individual instances.

So, the children across the whole country vary about that mean. Groups of children drawn from that larger population will vary about the mean. The question we are debating is how to know whether a group's mean differs sufficiently from the population mean to show that something special marks that group out (either specially good or specially poor teaching), and what should be done when the group one is concerned about is too small to make that assessment valid. And my problem is managing a situation where it is asserted that the teaching has been specially poor, based on data from a group whose variance overlaps the target mean.

It is very tempting to say that I think it is shocking that about half of schools are below average, and to ask why we can't all live in Lake Wobegon.
 
The underlying problem seems to me to be the discontinuous nature of the response: if a school's grade is slightly above the cutoff, everything is supposed to be perfectly okay; if slightly below, more or less drastic measures are felt to be necessary.

Arguments about exactly where the cutoff should be tend to be met with, "well, of course we can't give a very good reason why it should be precisely x and not x + 0.01 or x - 0.01, but there has to be some cutoff, after all."

No, there really doesn't.

If the response were a continuous function of the grade, with slightly worse grades resulting in only slightly more punishment (or slightly more help, or slightly more of whatever the appropriate response to a poor grade is thought to be), the urgent need, in borderline cases, to argue about whether an arbitrary cutoff has or hasn't been met, would disappear. There would be no borderline cases, because there would be no borderline.
 
