13.2 is a figure defined by the way the scores are assessed. All schools are assessed against this target in "core" subjects: the "average points score" of a cohort (educationalist jargon rather than strict statistical usage, meaning the group of children in the same educational year, e.g. 8th grade in US parlance) must rise by at least that much.
So I think the right test is a t-test of the differences in point score between the two time points against 13.2, excluding the 3 cases for which we cannot calculate an increase in point score. Yes?
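For concreteness, here is a minimal sketch of the test I have in mind; the scores, cohort size, and use of scipy below are all invented placeholders, not the real data:

```python
import numpy as np
from scipy import stats

TARGET = 13.2  # required minimum rise in average points score

# Invented scores for the same pupils at the two time points; the three
# pupils whose increase cannot be calculated have already been excluded.
before = np.array([24.1, 26.3, 22.8, 25.0, 23.7, 27.2, 24.9])
after = np.array([36.9, 39.8, 34.9, 38.4, 35.7, 41.1, 37.6])

increase = after - before  # per-pupil change in points score

# One-sample t-test of the mean increase against the 13.2 target.
t_stat, p_value = stats.ttest_1samp(increase, popmean=TARGET)
print(f"mean increase = {increase.mean():.2f}")   # ~12.91, just under target
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
```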
No, the t-test is not the right test. You have no expected result, only a minimum acceptable result; any statistical test would be invalid.
You can argue, though, that the exclusion of the three cases is warranted.
I'm afraid you're engaging in an abuse of statistics. You seem to have decided what the outcome should be (that this particular cohort has met its target) and are now looking for a statistic to support that conclusion.
That's a post-hoc analysis. The a priori standard is that the mean must be at least 13.2. There is no statistical test for that; either the standard is met or it isn't. You might try to argue that if the group were given a chance to retake the test, they might (just through random differences in performance on different days) achieve the minimum. But that source of statistical error may already have been taken into consideration when the standard was set. I'm troubled that you seem to assume the group setting the standard doesn't understand the statistics involved. Maybe they're experts?
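To put rough numbers on that retake point, here is a small simulation; every figure in it (cohort size, true mean increase, day-to-day noise) is an assumption of mine, not taken from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = 13.2
N_PUPILS = 30            # assumed cohort size
TRUE_INCREASE = 12.9     # assumed true mean increase, just under the target
DAY_NOISE_SD = 1.5       # assumed per-pupil noise on any single sitting

# Simulate 10,000 hypothetical retakes: each sitting adds independent
# day-to-day noise to every pupil's underlying increase.
retake_means = np.array([
    (TRUE_INCREASE + rng.normal(0.0, DAY_NOISE_SD, N_PUPILS)).mean()
    for _ in range(10_000)
])
print(f"share of retakes clearing the bar: {(retake_means >= TARGET).mean():.1%}")
```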
Consider this analogy. I'm teaching a course whose syllabus states that 5 exams will be given and that, to receive an A, you must average 90 over the 5 exams. If, at the end of the course, you have earned an average score of 89, you cannot argue that, since your 89 is not statistically different from 90, you should be given the A.
You might argue, though, that your roommate had a heart attack the night before the second exam and you had to sit with him in the hospital all night, so you did very poorly on that exam while all your other exams were in the 90s.
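In numbers (all invented), that argument looks like this: the overall average misses the bar, but the four unaffected exams clear it easily, which is a case for excluding the anomalous score, not for running a significance test:

```python
# All scores invented: exam 2 is the hospital night; the rest are in the 90s.
exams = [98, 52, 99, 97, 99]

overall = sum(exams) / len(exams)                  # 89.0  -> misses the A
without_exam2 = (sum(exams) - exams[1]) / (len(exams) - 1)  # 98.25 -> clear A
print(overall, without_exam2)
```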
That is the argument you should be making, because, as you said above:
"p.p.s. I am happy to accept that the sum of decisions made across thousands of schools may have enough overall statistical power to support well-founded strategic assessments (at least in theory). My problem is at the level of the individual school, where a poorly founded judgement has a severe effect on that one school."
You're arguing for the one school, not the population. The question that really needs to be addressed is why this particular group failed to meet the minimum standard. What was the mean improvement for comparable groups?
There's a reason people scoff at statistics; some say that statistics can be used to prove anything. That is what you're trying to do by applying a t-test to this case.