Some things that should be pointed out about this type of statistical analysis:
First of all, the calibration issue. What can be done, and what appears to have been done, is to take randomly selected streams of data from the RNG, of the same length as will be used experimentally, and calculate their statistical properties. This then serves as a baseline, or control data. Alternatively, if this control data matches the theoretical statistical predictions, one can simply use those predictions instead.
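To make that concrete, here is a minimal sketch of what such a calibration amounts to, assuming the RNG emits bits and the test statistic is a simple z-score on the bit count. The run length, number of control runs, and function names are my own illustrative choices, not anything taken from the experiment:

    import random
    import statistics

    def rng_stream(n_bits):
        """Stand-in for the hardware RNG: n_bits of supposedly fair bits."""
        return [random.getrandbits(1) for _ in range(n_bits)]

    def z_score(bits):
        """Z-score of the bit count against the fair-coin expectation."""
        n = len(bits)
        return (sum(bits) - n / 2) / (n / 4) ** 0.5

    # Calibration: collect many control runs of the same length as the
    # experimental runs and record the distribution of the test statistic.
    run_length = 200 * 60        # e.g. 200 bits/sec for one minute (illustrative)
    control = [z_score(rng_stream(run_length)) for _ in range(1000)]

    print("mean of control z-scores:", statistics.mean(control))
    print("sd of control z-scores:  ", statistics.stdev(control))
    # If these match the theoretical N(0, 1) prediction, the theoretical
    # distribution can be used directly instead of the empirical baseline.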
The problem with this idea of calibration, with respect to this experiment, is that they have not controlled for outside influences on the RNGs. After all, the assumption of the experiment is that something is influencing their behavior. So how do we know that nothing is influencing their behavior when we collect the control data?
Now one could argue that, since the experiment is trying to discover whether anything is influencing their behavior, this doesn't matter. On that view, the purpose of the calibration is simply to establish that the behavior is not always being externally influenced, so that they can look for the cases when it is.
The problem with this is that even if we find statistically significant evidence of external influence, we have no idea what it is. It could be anything. It could be solar or even interstellar radiation interfering. It could be a slight bias due to increased radio-wave activity. It could be any number of things we would never even think of.
It also does no good to argue that the RNGs are "shielded" from such electronic or environmental influence, because no shielding is perfect. And since they are looking for effects at the very fringe of detectability, any bias, no matter how tiny, could be responsible for them.
But all of this is irrelevant, because the statistical methods they have used to claim that they have found such evidence of external influence are simply flawed. The fact is that you simply cannot evaluate the probability of a single statistical anomaly accompanying a single world event. What they are doing is simply data mining.
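To see why this is data mining, consider a toy simulation on pure chance data: if you are free to choose the window length and offset around a single event after the fact, a "significant" deviation is quite likely to turn up somewhere. The event time, window lengths, and offsets below are hypothetical, chosen only to illustrate the point:

    import random

    def z_score(bits):
        n = len(bits)
        return (sum(bits) - n / 2) / (n / 4) ** 0.5

    # Pure noise: one day of "RNG output" at one bit per second.
    day = [random.getrandbits(1) for _ in range(86_400)]
    event_second = 43_200            # hypothetical "world event" at noon

    # Post-hoc search: try many window lengths and offsets around the event.
    best = 0.0
    for window in (60, 300, 900, 3600, 7200):
        for offset in range(-3600, 3601, 300):
            start = event_second + offset
            best = max(best, abs(z_score(day[start:start + window])))

    print("largest |z| found near the event:", round(best, 2))
    # With this many looks at the same chance data, an |z| above 2
    # ("p < 0.05") is quite likely even though nothing is going on.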
What they would have to do is clearly define what constitutes a statistical anomaly, and also what constitutes a world event. Then they could look at the long-term data and determine whether there is a statistically significant correlation between the timing and occurrence of the two. As it is, all they are doing is counting the hits and ignoring the misses. Indeed, it is not possible for them to do otherwise, since they do not bother to define what constitutes a hit until they find one, and do not define what constitutes a miss at all.
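For contrast, here is a rough sketch of the kind of pre-registered hit/miss analysis described above. The hourly z-scores, the |z| >= 3 anomaly threshold, the event list, and the use of SciPy's Fisher exact test are all my own illustrative choices, not anything from the experiment:

    import random
    from scipy.stats import fisher_exact

    # Definitions fixed *before* looking at the data (illustrative choices):
    #   anomaly     = an hour whose |z| >= 3
    #   world event = an hour on a pre-registered event list
    ANOMALY_THRESHOLD = 3.0

    def tabulate(hourly_z, event_hours):
        """2x2 table of (event hour?, anomaly?) over the whole record,
        so that misses are counted along with hits."""
        table = [[0, 0], [0, 0]]   # rows: event / no event; cols: anomaly / none
        for hour, z in enumerate(hourly_z):
            row = 0 if hour in event_hours else 1
            col = 0 if abs(z) >= ANOMALY_THRESHOLD else 1
            table[row][col] += 1
        return table

    # Synthetic stand-ins: a year of hourly z-scores (pure chance here) and a
    # made-up list of 50 pre-registered event hours.
    hourly_z = [random.gauss(0, 1) for _ in range(24 * 365)]
    event_hours = set(random.sample(range(24 * 365), 50))

    odds_ratio, p_value = fisher_exact(tabulate(hourly_z, event_hours))
    print("p-value:", p_value)   # small only if anomalies cluster in event hours

Only an analysis of this general shape, with the definitions fixed in advance and the whole record counted, could show a real correlation rather than a collection of after-the-fact hits.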
Dr. Stupid