How could we use the polygraph in a real-world setting when we don't know the 'correct' result? The important point to grasp is that the polygraph is no different in this respect from any imperfect test – a medical screening test, for example. We do the basic research, conduct studies to determine the scope of validity and obtain the 'calibration' data. When we have enough confidence in the test we introduce it in the field, including QA to monitor and improve the test's performance.
But there is a difference. You can't verify the polygraph results. If you could you wouldn't need the polygraph in the first place.
Let's consider a medical test as an example: you test a patient for some viral disease. You get a negative result and send the patient home. Next morning he comes back showing symptoms. Now you know that your test was wrong.
Now consider this somewhat facetious example: You perform a polygraph test on an employee. He passes it. Next morning he comes back looking guilty: "I lied in my polygraph yesterday." And now you know your test was wrong.
It's not true that if you could verify polygraph results you wouldn't need the polygraph. There are many real-world examples of inaccurate screening tests being used when accurate diagnostic tests are available – for reasons of cost, convenience, etc.
Leaving that aside, I fully understand your point, but I don't see it as an absolute difference between polygraphy and medical screening (or any other established type of testing). It just makes QA more difficult. As with a medical screening test, you don't just introduce it because it sounds as though it ought to work; you have to have a solid body of experimental verification, and plenty of population data to use in setting the comparison parameters, detection cutoffs etc. (Actually, status, influence or a good marketing campaign can be just as important, but that's another story, and not specific to polygraphy.)
Ideally, we monitor all screening programs, estimate false-positive and detection rates, gather data to refine the calculation parameters, and incorporate the results of further experimental studies to improve the test's accuracy and scope. Bear in mind that, generally, medical tests have not been subjected to this kind of QA (evidence-based medicine is fairly new), but most of them probably worked quite well because the initial studies were sound. Also, for some tests a positive screening result indicates an increased risk that a disease will develop later (rather than an increased risk that it is present but undetected), so performance criteria can't be applied in the usual way.
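For concreteness, here is a minimal sketch of the bookkeeping that kind of QA involves. All the counts and the base rate are invented for illustration, not real polygraph data:

```python
# Minimal sketch of screening-test QA arithmetic.
# The validation counts below are invented, not real polygraph data.

tp, fn = 80, 20   # verified deceptive subjects: caught / missed
tn, fp = 850, 50  # verified truthful subjects: cleared / falsely accused

sensitivity = tp / (tp + fn)  # detection rate
specificity = tn / (tn + fp)
fp_rate = 1 - specificity

# The predictive value of a positive result depends on the base rate
# of deception in the screened population, not just on the test.
base_rate = 0.01  # assume 1% of screened subjects are actually deceptive
ppv = (sensitivity * base_rate) / (
    sensitivity * base_rate + fp_rate * (1 - base_rate)
)

print(f"detection rate = {sensitivity:.2f}")
print(f"false-positive rate = {fp_rate:.3f}")
print(f"P(deceptive | failed test) at {base_rate:.0%} base rate = {ppv:.2f}")
```

With these invented numbers, nearly 9 out of 10 failures would be false alarms at a 1% base rate, which is why the population data mentioned above matter as much as the test itself when setting cutoffs.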
It is not hopeless to attempt to apply QA to polygraph testing. If we assume that deception/guilt is correlated with some real-world effect, then in principle we can measure something that is not too far removed from sensitivity and specificity (e.g. conviction rates). And of course we must be aware of bias (a positive polygraph test may encourage the police to try harder for a conviction, etc.).
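As a hedged illustration (entirely synthetic records, with 'conviction' standing in as an imperfect proxy for guilt), cross-tabulating polygraph outcomes against later convictions yields quantities that behave like sensitivity and specificity, though the bias caveat still applies:

```python
# Sketch: using a real-world proxy (conviction) in place of ground truth.
# All records are invented; conviction is an imperfect stand-in for guilt.

records = [
    # (polygraph_failed, later_convicted) for hypothetical case files
    (True, True), (True, False), (False, False), (False, False),
    (True, True), (False, True), (True, False), (False, False),
]

tp = sum(f and c for f, c in records)
fp = sum(f and not c for f, c in records)
fn = sum(c and not f for f, c in records)
tn = sum(not f and not c for f, c in records)

proxy_sensitivity = tp / (tp + fn)  # ~ sensitivity, if conviction ~ guilt
proxy_specificity = tn / (tn + fp)

print(f"proxy sensitivity = {proxy_sensitivity:.2f}")
print(f"proxy specificity = {proxy_specificity:.2f}")
# Caveat: if failing the polygraph itself makes conviction more likely
# (police try harder), these proxy estimates are biased upward.
```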
We need independent confirmation (or otherwise) of polygraph results, but the obvious problem is that the two groups (pass and fail) are treated differently after the test, so confirmation may be impossible (for example, applicants deemed to be dishonest may not be employed). It would be possible to perform studies in which polygraph results are not acted on (they would have to be hidden), but in general, QA (in terms of performance testing and of population data analysis) would have to rely more on experimental studies.
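A small simulation makes the selection problem concrete (all rates below are assumed purely for illustration): if failed applicants are never hired, outcomes are only ever observed for those who passed, so field data alone cannot recover the test's error rates.

```python
import random

# Sketch of the selection problem: results are acted on, so the two
# groups are treated differently and outcomes are only partly observed.
# All rates below are assumed purely for illustration.

random.seed(0)
N = 100_000
P_DISHONEST = 0.05       # assumed base rate among applicants
SENS, SPEC = 0.85, 0.90  # assumed true test characteristics

observed_hires = []
for _ in range(N):
    dishonest = random.random() < P_DISHONEST
    fails = random.random() < (SENS if dishonest else 1 - SPEC)
    if not fails:
        # Only passing applicants are hired, so only their later
        # behaviour (e.g. workplace dishonesty) is ever observed.
        observed_hires.append(dishonest)

miss_rate_among_hires = sum(observed_hires) / len(observed_hires)
print(f"dishonest fraction among hires = {miss_rate_among_hires:.3%}")
# The failures' outcomes are unobservable, so sensitivity and
# specificity cannot be estimated from field data alone; hence the
# need for studies in which results are hidden and not acted on.
```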
To summarise: your objection, though valid, is not a killer argument by itself. It is another hit against routine polygraphy to add to the bag!
As to distinguishing a nervous response from a guilty one, we assume that there are in principle some detectable differences between the two types of response, and try to refine the test to amplify these differences. As digithead suggests, there are theoretical reasons to suggest that the problem will be reduced by using GKT-type questions rather than CQT – I don't know how well this has been tested.
Is that a valid assumption? Which measurement could potentially show this difference?
But in any case, regardless of whether it could be done in the future: surely the machine can't do that with current technology?
To some extent, the different polygraph outputs are redundant measures of excitation of the sympathetic nervous system. But even without any data, common observation suggests that the pattern of physiological responses to stress differs, on average, according to the cause of the stress (for example, embarrassment is more likely to cause blushing). So, the pattern of correlations between polygraph outputs should also differ. The average differences may be small (therefore the signal-to-noise ratio low) but by using sophisticated mathematical techniques in computer algorithms it is often possible to get good discrimination from what doesn't appear to be much information (or, of course, the neural network approach might work).
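Here is a sketch of the 'pattern of correlations' point on synthetic data (the channel labels, means and covariances are all assumptions, not measured values): two groups whose channel means barely differ can still be separated if the correlation structure between channels differs, e.g. with quadratic discriminant analysis, which fits a separate covariance matrix per class.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic illustration: two classes of 'responses' across three
# channels (say skin conductance, heart rate, respiration) with nearly
# identical means but different correlation patterns between channels.
rng = np.random.default_rng(0)
n = 500
cov_a = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.1],
                  [0.1, 0.1, 1.0]])  # channels 1 and 2 move together
cov_b = np.array([[1.0, 0.1, 0.8],
                  [0.1, 1.0, 0.1],
                  [0.8, 0.1, 1.0]])  # channels 1 and 3 move together
X = np.vstack([rng.multivariate_normal([0.0, 0.0, 0.0], cov_a, n),
               rng.multivariate_normal([0.1, 0.0, 0.0], cov_b, n)])
y = np.repeat([0, 1], n)

# QDA models a covariance matrix per class, so it can exploit
# differences in correlation structure that mean-based methods miss.
scores = cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=5)
print(f"cross-validated accuracy = {scores.mean():.2f}")
```

QDA is just one stand-in for the 'sophisticated mathematical techniques' mentioned above; the point is only that class differences in covariance, not just in means, carry usable signal.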
Also, we have additional information from the pattern of questions that led to a response, hesitation at a particular question etc. Analysis of this kind of information should help researchers to improve questioning strategies. digithead's point about the effectiveness of GKT vs CQT is obviously relevant here (I think it makes sense to regard them as different questioning strategies rather than different tests).
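To make the GKT idea concrete, here is a sketch of Lykken-style scoring (the response magnitudes are invented): for each question, score a hit if the relevant alternative elicits the largest response; only a subject with guilty knowledge should accumulate hits well above chance.

```python
# Sketch of Lykken-style GKT scoring with invented response magnitudes.
# Each question presents one relevant alternative among several
# controls; an innocent subject should react to the relevant one no
# more often than chance, so a hit count well above chance suggests
# concealed knowledge.

questions = [
    # (responses to each alternative, index of the relevant alternative)
    ([0.2, 0.9, 0.3, 0.1, 0.2], 1),
    ([0.8, 0.2, 0.3, 0.2, 0.1], 0),
    ([0.1, 0.3, 0.2, 0.7, 0.2], 3),
    ([0.2, 0.1, 0.9, 0.3, 0.2], 2),
    ([0.3, 0.2, 0.1, 0.2, 0.8], 4),
]

hits = sum(
    responses.index(max(responses)) == relevant
    for responses, relevant in questions
)
chance = len(questions) / len(questions[0][0])  # expected hits by chance
print(f"hits = {hits}/{len(questions)} (chance expectation = {chance:.1f})")
```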
I don't know what studies (if any) have looked at better methods of interpreting the available information, and couldn't find any relevant references in the NA report.