Your book says Probability is the study of uncertainty. Which is it? Statistics or Probability?
If you're proposing to study statistics without probability, then bwahaha!
The study of Data is a pretty complete answer; if you can propose a better definition, I would like to hear it.
"Statistics is considered the science of uncertainty. This arises from the ways to cope with measurement and sampling error as well as dealing with uncertanties in modelling."
en.wikipedia.org
"The study of data" is too vague, too weak. If you want to limit your understanding to merely
describing data, that's fine. But it's incomplete for the purposes of understanding what is being claimed about the shroud. We're doing inferential statistics, which are inexorably governed by probability. We've taken a number of measurements of the physical properties of the shroud, and we want to infer, from the error that may arise out of measurement uncertainty, whether our measurements are probably good enough for what we are using them for. "Good enough" is not per se a statistical concept, or even a probability concept.
Because we expect measurement to be uncertain, we require probability to help us understand what to expect when we measure. We need to know whether our measurements vary in expected ways that we'll never get rid of, or in unforeseen ways that merit further investigation. We can't do that without a model to measure against, and we can't make that model without invoking probability.
You still don't give the impression you know and understand what any of the chi^2 tests or distributions are for. Your understanding of statistical models means nothing in this discussion; what matters is real measurements, not modeled ones.
You give the impression you don't understand what a statistical model is in the sense I mean. It is a mathematical expression of expectation based on what we believe the uncertain elements of a process to be, arranged according to the quantitative relationships in the process. An expression of confidence in a set of measurements from the process requires determining how well it fits expectation, and we can't set expectation without a model. When we say we've come up with a statistical model for a process we're going to investigate, that doesn't mean we've
substituted a statistical model for the measurements we're going to take. It means we've come up with a mathematical expression for how we expect the system nominally to behave so that we can compare the observed error in measuring the process to what we've determined mathematically is the expected degree of uncertainty. We don't get a yes-or-no answer in the end; we get a number that tells us how close to the model our measurements got—how well expected uncertainty explains error.
The chi-squared distribution is a probability distribution. It expresses the probability according to which the sum of the squares of a certain number of independent normal variables is expected to vary. Most real measurements are expected to vary according to the normal distribution; that variation is the uncertainty. It may be a wide distribution or a narrow one, but we can describe both our expectations for it and the outcome of actual measurement using the parameters of the normal distribution. Thus if we substitute "normally distributed measurement" for "independent normal random variable," we can use the chi-squared distribution as an expression of the expectation of aggregate behavior according to the ordinary, probabilistic nature of uncertainty.
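To make that concrete, here is a minimal simulation sketch (all of the numbers are invented for illustration): standardize each independent, normally-distributed measurement by its own uncertainty, square, and sum, and the totals land on the chi-squared distribution with as many degrees of freedom as there are measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical illustration: k independent measurements, each normally
# distributed around the same true value with its own stated uncertainty.
true_value = 13.1                        # assumed "real" length, in mm
sigmas = np.array([0.17, 0.20, 0.15])    # each measurer's stated uncertainty

# Simulate many rounds of measurement, standardize, and sum the squares.
k = len(sigmas)
n_trials = 100_000
draws = rng.normal(true_value, sigmas, size=(n_trials, k))
standardized = (draws - true_value) / sigmas
sum_sq = (standardized ** 2).sum(axis=1)

# The simulated sums of squares track the chi-squared distribution with
# k degrees of freedom.
print(sum_sq.mean(), stats.chi2(df=k).mean())                 # both near k
print(np.quantile(sum_sq, 0.95), stats.chi2(df=k).ppf(0.95))  # matching tails
```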
If I give you a stick and ask you to measure it, you could do it once and tell me it's 13.16 mm long. Or I could ask you to do it 100 times, in which case you'd be expected to come up with 100 very similar numbers. Those numbers would be expected to look like a normal distribution. The expectation of a normal distribution here is the statistical model for measuring a stick. And you could report your measurements to me in terms of the parameters of a normal distribution, say, μ = 13.05 mm, σ = 0.17 mm. Those are your descriptive statistics.
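As a quick sketch of that descriptive step (made-up numbers again), the 100 repeated measurements reduce to a sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical: 100 repeated measurements of the same stick, in mm.
measurements = rng.normal(loc=13.05, scale=0.17, size=100)

mu = measurements.mean()            # descriptive statistic: sample mean
sigma = measurements.std(ddof=1)    # descriptive statistic: sample std. dev.
print(f"mu = {mu:.2f} mm, sigma = {sigma:.2f} mm")
```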
Unbeknownst to you, I've given the same stick to other people. They also measured the stick 100 times, using whatever method they deemed appropriate. And they can also report their findings in terms of the parameters of the normal distribution that they derive from their measurements. In our model we treat those as independent, normally-distributed measurements. Knowing things like that they're independent and that they should be normally distributed is what lets us make the model the way we did.
Now, with everything properly standardized, we can expect the sum of the squares formed from the measurements from you and the others to conform to the chi-squared distribution. How we measure that fit is another model. You referred us to the Pearson test for categorical variables. That's one model you can build around the chi-squared distribution. But since we don't have categorical data, it's not the right model. The Ward & Wilson test uses a different model for testing how well an actual set of measurements, described as independent normally-distributed values, conforms to the theoretically-expected chi-squared distribution. I qualified this step by saying everything has to be properly standardized to look like the values in the abstract model. The grunt work of math that gets you there is where Pearson, Ward, Wilson, and others earn their pay.
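A minimal sketch in the spirit of that kind of combination test (my own notation and invented numbers, not Ward & Wilson's worked example): pool the independent determinations with inverse-variance weights, then sum the squared, standardized deviations from the pooled value and refer the total to a chi-squared distribution with n - 1 degrees of freedom.

```python
import numpy as np

# Hypothetical: three independent determinations of the same quantity,
# each reported as (mean, standard error). Invented numbers, not shroud data.
means  = np.array([13.05, 13.11, 12.90])
errors = np.array([0.17, 0.20, 0.15])

# Inverse-variance-weighted pooled estimate of the underlying value.
weights = 1.0 / errors**2
pooled = (weights * means).sum() / weights.sum()

# Sum of squared standardized deviations from the pooled estimate.
T = (((means - pooled) / errors) ** 2).sum()

# T is compared against the chi-squared distribution with n - 1 degrees of freedom.
df = len(means) - 1
print(f"pooled = {pooled:.3f}, T = {T:.2f}, df = {df}")
```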
What we get in the end is a number that tells us how closely our normally-distributed measurements conform to the model of how normally-distributed measurements of the same underlying value would be expected to behave, an expectation informed by the chi-squared distribution. We don't get a yes-or-no answer or any attempt to explain deviation in terms of what was measured or how. If we get a high (bad) number, it might mean that one person was using a cheap ruler, or that they got roaring drunk before doing their measurement. Or it might mean we have too few degrees of freedom in the model and we need to call in more people with rulers. When the deviation occurs broadly across independent trials, that would indicate that the stick is fundamentally unmeasurable, for whatever reason. When it's one outlier, it's not the hair-on-fire exercise the shroud enthusiasts seem to advocate. It means one person had a hard time measuring the stick competently.
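To illustrate, here is a sketch (same invented statistic as above) of how a single wayward measurer moves the number without the number itself saying who or why:

```python
import numpy as np
from scipy import stats

def combination_T(means, errors):
    """Sum of squared standardized deviations from the inverse-variance
    pooled mean, as sketched above."""
    means, errors = np.asarray(means, float), np.asarray(errors, float)
    w = 1.0 / errors**2
    pooled = (w * means).sum() / w.sum()
    return (((means - pooled) / errors) ** 2).sum()

# Hypothetical: three careful measurers agree; then a fourth has a bad day.
T_ok  = combination_T([13.05, 13.11, 12.98], [0.17, 0.20, 0.15])
T_bad = combination_T([13.05, 13.11, 12.98, 14.20], [0.17, 0.20, 0.15, 0.18])

# The statistic maps to a probability of seeing this much scatter if only the
# modeled uncertainty were at work; it does not say which ruler was cheap.
print(f"T = {T_ok:.2f},  p = {stats.chi2.sf(T_ok, df=2):.2f}")
print(f"T = {T_bad:.2f}, p = {stats.chi2.sf(T_bad, df=3):.4f}")
```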
Tolerance is not an expression of uncertainty; it is an expression of acceptable error: a measurement or statistical result that has to be within certain specifications.
No, this is a very pidgin understanding. The notion of "specification" or "acceptable error" invokes outside authority not found in the statistic itself. Statistics doesn't give you yes-or-no answers, but some human authority might do that for reasons that make sense to him. This is what I tried to get you to understand by discussing it simply in terms of hypothesis-testing
p-values, but you would have none of it. We use 95% confidence intervals and
p ≤ 0.05 not because these are magic numbers that shine inherently on their own from some statistical test. We use them because everyone just decided by fiat that these would be reasonable threshold values generally for all science. There is a threshold value
at all only because we decided that such a simplification would be generally useful. As I said, there is now considerable discussion in science whether one-size-fits-all thresholds are actually helpful, because (among other things) "significance" is more a property of the underlying fields than of statistics writ large.
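For what it's worth, the bright-line "tolerances" people quote for chi-squared tests are just the points where the distribution leaves 5% in the tail, and the 5% is convention, not mathematics. A quick sketch:

```python
from scipy import stats

# 95th-percentile (p = 0.05) cutoffs of the chi-squared distribution for a
# few degrees of freedom. The choice of 0.05 is a convention, nothing more.
for df in (1, 2, 3, 5, 10):
    print(df, round(stats.chi2.ppf(0.95, df), 2))
# df = 2 gives about 5.99, the sort of value that gets rounded off to 6.0.
```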
Yes, when I tell my technicians that a certain statistically-governed test for measuring sticks has to produce a result to within a certain tolerance, that's because I've done the modeling and I know how that result is expected to vary absent any ulterior problem, and I need to express this in terms compatible with a practical production process. That tolerance incorporates errors that I can foresee by modeling what will arise in the measurement from ordinary uncertainty. That model becomes part of documentation that I supply to regulators. But the important point is that my imposition of a tolerance value
at all is not something inherent to statistics, but rather something that arises from a practical necessity in particular circumstances such as making sticks commercially.
When I get an NCMR (non-conforming material report), one of the things I can do is apply different modeling choices (e.g.,
t-distributed versus normally-distributed for small
n) and see if that helps understand where the non-conforming value fits in a larger understanding of the process. Just because I give my technicians a bright-line tolerance for practical purposes doesn't mean that's how the underlying math behaves or the underlying process that the math is describing. It's not as if the math falls off a cliff at that point. I might sign off on a deviation after I've analyzed the measurement according to the inherent fuzziness of the problem and decided that I can still tolerate that. That's
my judgment, not Ward's or Wilson's. That hypothetical different modeling becomes part of the deviation that regulators will see, because even though it's my judgment, it's subject to regulations about how we are allowed to decide things in what I do. I have some sticks that cost me $60,000 to manufacture, so it's sometimes worth the effort.
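Here is a sketch of the kind of re-modeling I mean, with invented numbers: with only a handful of measurements, an interval built on the t-distribution is wider than one built on the normal distribution, and a value that looks non-conforming under the one can be unremarkable under the other.

```python
import numpy as np
from scipy import stats

# Hypothetical small-sample measurement of one part, in mm.
x = np.array([13.02, 13.21, 12.88, 13.34])
n = len(x)
mean, se = x.mean(), x.std(ddof=1) / np.sqrt(n)

# 95% interval assuming normally-distributed error with sigma treated as known...
z = stats.norm.ppf(0.975)
# ...versus the t-distribution, which owns up to estimating sigma from only n = 4 points.
t = stats.t.ppf(0.975, df=n - 1)

print(f"normal: {mean - z*se:.2f} to {mean + z*se:.2f} mm")
print(f"t     : {mean - t*se:.2f} to {mean + t*se:.2f} mm")   # noticeably wider
```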
The concise summary is that measurements that fall within the tolerance I've imposed
will work, while measurements outside the tolerance
may work. It's not within the technician's authority to decide that, but it's within mine.
The point is that it's not a question that can be answered by statisticians sipping Aperol spritzes over there on Corso Italia. They can be very adept at getting from rulers and sticks to a goodness-of-fit
probability. But only I know whether that value works for me in my process. There's no inherent, all-statistics-wide value that fits every case. You're trying to tell us that statisticians are the ones who determine whether to reject something based on a statistical analysis. And you've already had one statistician tell you that's not how it works. In this case, no, it's the archaeologists who determine that. This is why we ask about the authors' expertise in archaeology. Neither Casabianca nor his coauthors seem to have any qualification or stature in the field. Hence their judgment does not appear to be suitably informed.
For example, the tolerance for the chi^2 method as performed in the radiocarbon dating of the shroud is 6.0; the measured result is 6.4, which is out of specification.
And around in circles we go.