<snip>
Firstly, I'll address the crux of your proposition: that data differences ("dispersion") between labs suggest that the samples did not come from one source. In #76 you said:
That doesn't matter. No matter what date, if the samples don't agree, then the samples were not from the same thing.
The control samples do agree on the date, that means the control samples were from the same items.
This reveals a fundamental misconception (not restricted to you!): that heterogeneity is binary, so that things are either heterogeneous or not. In reality nothing is perfectly homogeneous; there are always real differences between subsamples, not just measurement errors. Those differences may be too small to matter but, however small, they can be demonstrated given enough data and appropriate analyses. If a data analysis doesn't show a difference, that does not mean there is no difference, only that the data and/or the analysis are inadequate to show it.
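The point that a real but small difference becomes demonstrable once there is enough data can be sketched with a simple power calculation. This is a minimal illustration, assuming a two-sided z-test comparing two subsample means with a known per-measurement standard deviation; the function name and the numbers (a true difference of 10 and an SD of 40, in arbitrary units such as radiocarbon years) are hypothetical and not taken from Damon et al.

```python
from statistics import NormalDist


def power_two_sample_z(delta: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Probability that a two-sided z-test detects a true mean difference `delta`
    between two subsamples, given n measurements of each and a known
    per-measurement standard deviation `sigma` (hypothetical helper)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true difference is `delta`
    shift = delta / (sigma * (2 / n) ** 0.5)
    # Probability the statistic lands in either rejection tail
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Same hypothetical true difference, increasing amounts of data
    for n in (4, 40, 400, 4000):
        print(n, round(power_two_sample_z(10, 40, n), 3))
```

With 4 measurements per subsample the test almost never detects the difference; with a few thousand it almost always does. The difference itself is the same throughout; only the data changed.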
This illustrates one of the problems with null hypothesis significance tests (NHST): we don't believe the null hypothesis anyway (see, for example, link). We know the ToS is not perfectly homogeneous; it is at least somewhat heterogeneous. If significance tests do not demonstrate heterogeneity, then either the data or the data analysis is inadequate.
To support your proposition you would need to show that the dispersion among the samples was so great as to be incompatible with "samples ... from the same thing". But we do not know what dispersion to expect from "samples ... from the same thing". The control samples give some relevant information, but it would be a big assumption that the ToS and the controls were similarly heterogeneous.
The non-overlap of standard deviations you used first (#70) and the chi^2 analysis used by Damon et al. both compare between-lab dispersion with (estimates of) within-lab dispersion, which is not relevant to your proposition. This link is relevant here.
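To make concrete what such a chi^2 test compares, here is a minimal sketch of a generic inverse-variance-weighted homogeneity test on lab means. It is an assumption that this matches the general form of the test Damon et al. describe; it is not a reproduction of their calculation, and the lab means and errors below are hypothetical. Note that the only yardstick for "too much" scatter is the quoted within-lab errors.

```python
import math


def chi2_homogeneity(means, sigmas):
    """Weighted chi^2 statistic testing whether lab means agree within their
    quoted standard errors (hypothetical helper). Returns (pooled mean, T, df)."""
    if len(means) != len(sigmas) or len(means) < 2:
        raise ValueError("need matching means/sigmas for at least two labs")
    weights = [1 / s ** 2 for s in sigmas]
    pooled = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    # Between-lab scatter, each squared deviation scaled by that lab's own error
    t = sum(w * (m - pooled) ** 2 for w, m in zip(weights, means))
    return pooled, t, len(means) - 1


def chi2_sf_df2(t):
    """Upper-tail probability of chi^2 with 2 degrees of freedom (closed form)."""
    return math.exp(-t / 2)


if __name__ == "__main__":
    # Hypothetical means and standard errors for three labs
    pooled, t, df = chi2_homogeneity([650, 750, 680], [30, 30, 25])
    print(round(pooled, 1), round(t, 2), df, round(chi2_sf_df2(t), 3))
```

The closed-form p-value only holds for three labs (2 degrees of freedom); for other numbers of labs a general chi^2 survival function would be needed.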
Even worse, in this data set samples and labs are inextricably confounded: it isn't possible to distinguish sample effects from lab effects. Again, the control samples give some relevant information, but it would be a big assumption that between-lab differences in the treatment of the ToS samples were the same as the differences in their treatment of the controls.
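The confounding can be shown with a design matrix. In this sketch (lab names A, B, C and subsample names s1, s2, s3 are hypothetical labels) each measurement gets an intercept, lab indicator columns and subsample indicator columns. When each lab measures only its own subsample, the lab and subsample columns carry identical information, so the matrix has fewer independent columns than parameters and the two sets of effects cannot be estimated separately. A fully crossed design, where every lab measures every subsample, does not have this problem.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank by Gaussian elimination over exact fractions (no rounding issues)."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no independent information in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(runs, labs, samples):
    """One row per measurement: intercept, lab dummies and sample dummies
    (first lab and first sample are the baselines)."""
    rows = []
    for lab, sample in runs:
        row = [1]
        row += [1 if lab == other else 0 for other in labs[1:]]
        row += [1 if sample == other else 0 for other in samples[1:]]
        rows.append(row)
    return rows


LABS = ["A", "B", "C"]
SAMPLES = ["s1", "s2", "s3"]
# Each lab measures only its own subsample, 4 times
confounded_runs = [(lab, s) for lab, s in zip(LABS, SAMPLES) for _ in range(4)]
# Every lab measures every subsample once
crossed_runs = [(lab, s) for lab in LABS for s in SAMPLES]

if __name__ == "__main__":
    for name, runs in (("confounded", confounded_runs), ("crossed", crossed_runs)):
        x = design_matrix(runs, LABS, SAMPLES)
        print(name, "parameters:", len(x[0]), "rank:", matrix_rank(x))
```

In the confounded layout the rank is 3 against 5 parameters: only the three lab/subsample combinations are identifiable, never the lab and subsample contributions on their own.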
To add to the difficulties, Damon et al. Table 1 gives only 12 results for the ToS, 4 for each sample/lab. With such a small data set no statistical test will have useful power; you really have no chance of showing undue heterogeneity unless it is massive.
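How little power 3 groups of 4 give can be sketched by simulation. This assumes the layout stated above (3 sample/lab groups, 4 results each), a known per-measurement SD, and a chi^2 test of the group means at the 5% level; `tau` is a hypothetical between-group SD that lumps sample and lab effects together, since the design cannot separate them. The function and its parameters are illustrative, not an analysis from Damon et al.

```python
import math
import random

# Critical value of chi^2 with 2 degrees of freedom at the 5% level (about 5.99)
CHI2_DF2_CRIT_05 = -2 * math.log(0.05)


def simulated_power(tau, sigma, n_per_group=4, n_sims=20000, seed=1):
    """Fraction of simulated data sets (3 groups of n_per_group results) in which
    the chi^2 test flags heterogeneity. tau: SD of the true group offsets;
    sigma: per-measurement SD, assumed known."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n_per_group)  # standard error of each group mean
    hits = 0
    for _ in range(n_sims):
        means = []
        for _ in range(3):
            true_offset = rng.gauss(0.0, tau)
            results = [rng.gauss(true_offset, sigma) for _ in range(n_per_group)]
            means.append(sum(results) / n_per_group)
        pooled = sum(means) / 3  # equal weights because all SEs are equal
        t = sum((m - pooled) ** 2 for m in means) / se ** 2
        if t > CHI2_DF2_CRIT_05:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    for tau in (0.0, 20.0, 40.0, 80.0):
        print(tau, simulated_power(tau, sigma=40.0))
```

Under these assumptions the group means are independent normal variables with variance tau^2 + SE^2, so the rejection rate has the closed form 0.05^(1/(1+r^2)) with r = tau/SE. Reaching 80% power needs 1 + r^2 of about 13.4, i.e. true between-group offsets with an SD about 3.5 times the SE of a group mean, or roughly 1.8 times the per-measurement SD when each group has 4 results.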
<snip>