This is a p-value calculation, using the binomial distribution.
Well, no. Since all but one respondent can see what the previous ones answered, the guesses are not independent, and they are not governed strictly by either chance or the hypothesis under test. This is the conformity effect. A properly-controlled experiment would satisfy the constraints of the binomial distribution by hiding everyone's answers from everyone else -- and, ideally, from the experimenter. (I.e., you want success or failure to be recorded automatically, without human intervention.)
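To see why that matters, here is a rough sketch in Python -- purely illustrative, with an assumed rate at which a later respondent simply echoes an answer already visible in the thread -- showing that conformity wrecks the binomial model even when every individual answer is, marginally, pure chance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_resp, n_sims, p_copy = 8, 100_000, 0.5   # p_copy is an assumption: how often a later
                                           # respondent just echoes a visible answer

hits_indep, hits_conform = [], []
for _ in range(n_sims):
    target = rng.integers(4)                        # the correct option, 0..3
    # Properly blinded: everyone guesses privately and independently.
    private = rng.integers(4, size=n_resp)
    hits_indep.append(np.sum(private == target))
    # Open thread: the first respondent guesses blind, later ones sometimes copy.
    answers = [rng.integers(4)]
    for _ in range(n_resp - 1):
        if rng.random() < p_copy:
            answers.append(rng.choice(answers))     # echo an earlier visible answer
        else:
            answers.append(rng.integers(4))         # guess independently
    hits_conform.append(np.sum(np.array(answers) == target))

print(np.mean(hits_indep), np.var(hits_indep))      # ~2.0 hits, variance ~1.5 (Binomial(8, 0.25))
print(np.mean(hits_conform), np.var(hits_conform))  # ~2.0 hits, but a much larger variance
```

The mean hit count stays at chance, but the spread does not -- and the spread is exactly what the binomial tail probabilities depend on.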
The answer is provided for example by any online binomial calculator, such as this one:
https://stattrek.com/online-calculator/binomial.aspx.
And those are helpful tools, provided one knows the underlying statistical principles well enough to input the right values, read the right results, and interpret them appropriately. One problem we'll discuss later is that convenience tools of that kind often hide elements of the underlying science from the user in the pursuit of simplicity and accessibility. Just in terms of software, there are tools with far greater appeal to serious scientists, such as MATLAB. These give you more control and greater insight into your data. Better still -- but generally relegated to professional statisticians -- is a deep understanding of the derivation and meaning of the mathematical models around which these tools are built. That lets you do things the tools can't.
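To make that concrete: the number the calculator spits out is a one-liner in any serious statistical environment. A minimal sketch in Python with scipy, chosen here only for illustration (MATLAB has an equivalent):

```python
from scipy.stats import binom

n, k, p = 8, 5, 0.25             # 8 credited answers, at least 5 correct, chance rate 1/4
p_value = binom.sf(k - 1, n, p)  # survival function: P(X >= 5) = 1 - P(X <= 4)
print(p_value)                   # 0.0272979736328125
```

Doing it this way you at least see which tail you are computing, and to what precision -- which will matter below.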
For example, in my first two tests on this forum...
Changing horses.
You were asked to solve either or both of two specific problems -- one of them the example you yourself came up with. The goal of the exercise was for you to demonstrate your competence. That's done typically by posing problems that the candidate has to solve. If the candidate poses and solves his own problems -- especially with a preprogrammed solution -- it's the fox guarding the henhouse. What you seem to have done is to Google up a canned solution to a specific kind of problem, then contrive a new example that fits the can. That's not as convincing a test of your competence as if you had answered the questions posed.
I found that eight answers were credible...
As others have rightly pointed out, this is a show-stopper. You determined that the answers were "credible" only after having seen what they were. In no way can this be considered defensible experimental practice. No, it does not matter in the least how much you assure us your (subjective) criteria for "credibility" are fair and impartial. Where subjective criteria cannot be entirely designed away, a properly-controlled experiment employs blind adjudication, ideally by a panel of judges.
Further, your assurances are undermined by the evidence. When you were able to see the answers, your judgment of credibility resulted in a high rate of success. When you were blinded to the answers during adjudication, your success rate dropped back to statistical insignificance. As Pixel42 points out, you decided to revise your credibility criteria on an already-collected data set, after seeing that it failed to achieve significance. That is all but indisputable evidence that your imposition of criteria is intentionally biased. Even among the most eminent and trustworthy of researchers, the possibility of mere unconscious bias is deemed enough to institute controls that preclude any possibility of non-blind post hoc manipulation. Your bias is evident right there in the data, so after you've done your culling it doesn't matter what numbers you plug into a statistical calculator. It's all garbage at that point.
And as your critics have all concluded, your unwillingness to apply even the most trivially-implemented controls against bias conveys the impression that the subjective factor is an intentional ingredient of your experiment. Since this is immediately fatal to your experimental design, we could stop right now; any further discussion of statistics has been rendered moot. But it turns out you're not finished making mistakes.
3 in the first test and 5 in the second...
You seem to be confused about what constitutes a trial. First you're trying to aggregate the results of two separate experiments employing two different subject pools as if they were one data set. Then you're treating a trial as the single answer provided by each of several different subjects. Since you admit that the effect under test varies from subject to subject, this violates the homogeneity constraint of the binomial distribution -- the model assumes one fixed probability of success for every trial. We'll talk more about this later.
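To illustrate (the per-respondent probabilities here are made up purely for the example): once the per-trial success probability differs from respondent to respondent, the exact distribution of the total hit count is a Poisson-binomial, not Binomial(8, 0.25), and the tail probability you would quote changes accordingly:

```python
import numpy as np
from scipy.stats import binom

# Assumed for illustration only: six respondents at the chance rate, two whose
# rate is elevated (by ability, bias, conformity -- the cause doesn't matter here).
p_each = [0.25] * 6 + [0.40, 0.40]

# Exact distribution of the total number of hits (a Poisson-binomial), built by
# convolving the individual success/failure distributions.
dist = np.array([1.0])
for p in p_each:
    dist = np.convolve(dist, [1 - p, p])

print(dist[5:].sum())                    # P(X >= 5) under the heterogeneous model
print(binom.sf(4, 8, np.mean(p_each)))   # a single-p binomial at the average rate
print(binom.sf(4, 8, 0.25))              # the chance-only binomial the claimant used
```

Three different answers to the same "at least 5 of 8" question. The homogeneity assumption is doing real work, and it isn't satisfied here.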
You refer to ganzfeld trials as the model for comparison to your method. Now that brings in a whole lot of baggage -- sensory deprivation alleged to induce a particular receptive mental state -- so let's concentrate on the telepathic-reading portion of a hypothetical ganzfeld experiment using Zener cards.
Zener cards are seen near the beginning of the original Ghostbusters. There are 25 cards -- 5 each of 5 types, the type given by the geometric figure on the face. The cards are shuffled. Then the sender deals a card from the top of the deck (at no time allowing the receiver to see it), visualizes the figure on its face, and allows the receiver to guess the card type, allegedly via telepathy. The card is then discarded. This procedure repeats for all 25 cards. The receiver (ideally) does not know whether any guess is a success or a failure, or what the figure on the card was (even after finalizing his guess). The number of successful guesses over the 25-card run is one trial, one data point.
But we're not done. The cards are then shuffled and the test repeats with the same sender and receiver. The number of successful guesses on this trial is the next data point. Lather, rinse, and repeat. Over many full-deck runs, the number of successful guesses per run for that one sender-receiver pair is expected to fit the normal distribution (centered, under chance, at 5 hits out of 25). The actual distribution of results is compared to that expected distribution and a customary set of conformance statistics is drawn up, one of them being the p-value: the probability of obtaining results at least as extreme as those observed if the guesses really were governed by chance alone (as opposed to being affected by some other phenomenon).
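If you want to see what that data set looks like, here is a quick simulation (Python again, purely illustrative) of a receiver guessing at chance over many full-deck runs:

```python
import numpy as np

rng = np.random.default_rng(1)
deck = np.repeat(np.arange(5), 5)          # 25 Zener cards: five each of five symbols

def one_run():
    cards = rng.permutation(deck)          # shuffle the deck for this run
    guesses = rng.integers(5, size=25)     # receiver guesses blind, one symbol per card
    return int(np.sum(guesses == cards))   # hits over the full deck: one data point

runs = np.array([one_run() for _ in range(10_000)])
print(runs.mean(), runs.std())             # about 5 hits per run, standard deviation about 2
```

A histogram of those runs is the bell-shaped curve the conformance statistics are computed against; a genuine telepathic effect would show up as a systematic shift away from it.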
Now the same subject pair comes in the next month and does the same test again. We didn't specify the number of full-deck trials that were done in the previous experiment. Ideally it would be an N chosen to ensure that the fit to the normal distribution is within the desired confidence interval. N.B., N is not the number of Zener cards or the number of subjects. Now the same subject pair does another set of full-deck Zener trials. If we say that N_June was 29 and N_July was 33, do we get to lump all the trials together and say N_total is 62? Generally no. Each experiment is meant to aggregate over uncontrollable factors in a way that is not suspected to change over the experiment. A new experiment assumes those factors might possibly vary. Lumping everything together is tantamount to assuming you've controlled for everything you might later want to investigate as a possible uncontrolled correlate.
In contrast to a proper run, you allow only one guess per subject and only one subject per trial. Since you can't directly compare the performance of one subject to another, you have no scientifically valid trials at all. Even then, if we had to do the Zener experiment above with senders and receivers whose telepathic ability was suspected to vary or to be interdependent, we would have to turn to more sophisticated methods of gathering and analyzing the data, such as a chi-square test.
Your data simply don't supply the statistical basis you propose.
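For what it's worth, here is the sort of thing a chi-square test buys you in that situation. The counts below are hypothetical, invented purely to show the mechanics of testing whether the hit rate is plausibly the same across subject pairs:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: hits and misses accumulated over many full-deck runs
# for three different sender/receiver pairs (800 guesses per pair).
table = [[160, 640],   # pair A
         [155, 645],   # pair B
         [190, 610]]   # pair C

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)         # a small p suggests the pairs' hit rates are not homogeneous
```

That is the kind of question your design cannot even pose, because each of your "trials" is a single guess from a single person.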
probability of success on a single trial = 0.25 because of four possible answers.
Except that's not true. As I explained earlier, when people are presented with a discrete number of ordered alternatives and asked to choose from them at random, their choices -- when aggregated -- do not produce a uniform distribution among all the alternatives. If the success choice in each trial was always the 4th alternative, then you can't use p=0.25 as the probability of a successful outcome by chance.
Experimenters have known of this effect for quite a long time. This is why, in properly-controlled experiments, the success answer varies in position among the decoys. Over sufficient trials (e.g., often up to 100) the variance from the positional bias averages out so that p is close enough to 0.25 to satisfy the constraints of the binomial statistic. Your experiment design ignores this effect.
So go to your tool and vary the probability of success within some small interval. See how it jumps back and forth across the significance limit? This is a strong indicator that you need to have a protocol that varies the positions of the success value and the decoys.
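Here is that exercise done explicitly (Python/scipy, the same computation as the online tool), letting the effective per-answer chance rate drift a little above the nominal 0.25:

```python
import numpy as np
from scipy.stats import binom

# P(at least 5 of 8 correct) as the effective per-answer chance rate varies.
for p in np.arange(0.25, 0.36, 0.02):
    tail = binom.sf(4, 8, p)
    verdict = "significant" if tail < 0.05 else "not significant"
    print(f"p = {p:.2f}  P(X >= 5) = {tail:.4f}  {verdict}")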
I actually like all quality answers, even non-credible answers can be of interest. From the online binomial calculator, I find that, if 8 answers are given, with four possible answers (like 1, 2, 3 or 4), at least 5 of those must be correct in order to have a p-value less than the conventional threshold of 5% (then p=0.02729).
Here's a salient quibble. You correctly read from the tool at the line labeled "P(X ≥ x)", which properly corresponds to the probability that at least 5 correct guesses would ensue. But the value given is 0.0272979736... I'm sure you discovered, as did I, that the tool doesn't let you copy the answer, so you had to type in the digits by hand. You gave us five decimal places. Why did you not round the last digit appropriately? And why was that the proper number of digits to report? Physics is one of those sciences where the precision of reported numbers rigorously matters, since our computers always hand us the full precision of their internal representations whether or not that precision is warranted. Statistics is another such field. Work in either one long enough and reporting the right number of significant digits, with the last one rounded correctly, becomes second nature. It's harder not to do it than to do it habitually. All right, fine -- it may just be hasty typing.
But it turns out it might not be just a quibble. Play with your tool a little bit. (Yes, I can hear the audience snickering.) Vary N and x by one in either direction. If you vary x from 5 to 4, your cumulative probability for that tail jumps quite a lot, way past the p=0.05 milestone. Set N to 9 and you still only just squeak by. Why is this happening? Because your numbers are all way too small. Your model doesn't have enough degrees of freedom to represent variance in the composition of the data in anything but the coarsest imaginable intervals.
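The same three numbers straight from scipy, so the comparison is reproducible:

```python
from scipy.stats import binom

print(binom.sf(4, 8, 0.25))   # at least 5 of N = 8:  0.0273  (under the 0.05 threshold)
print(binom.sf(3, 8, 0.25))   # at least 4 of N = 8:  0.1138  (nowhere close)
print(binom.sf(4, 9, 0.25))   # at least 5 of N = 9:  0.0489  (only just under)
```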
Small changes in the parameters of an experiment producing vast changes in the significance of the outcome are one of the signs that tell experienced experimenters their experiment is too small. Yes, the confidence interval is computable for those small numbers; that doesn't mean it's meaningful. Working out the error bounds on all your parameters would have been helpful. This is why you need more than just a simple online tool to teach you the statistics you need in order to craft an experiment that tells you what you need to know instead of what you want to hear.
Keep in mind where those numbers for N and x would come from in your experiment. Your N is improperly derived, but we'll take it arguendo. It represents the number of people whose answers you thought were serious. If that's off by one -- that is, if you had whimsically decided that just one more or one fewer response was credible -- it produces a vast difference in the statistical outcome. That's when a conscientious experimenter realizes his experiment is too statistically unstable -- too sensitive -- to produce confident data. It's where you need to look at more than just cursory cumulative probabilities.
You're not even remotely close to rigorous science here. And contrary to your incessant complaints, the people here are trying to help you by pointing out your various errors. Sadly, many seem to have concluded that you have no interest in that. I can assure you, however, that your continued bluster is probably not going to succeed. People can see you gaslighting your way through this debate. They can see you abusing and insulting your critics, trying so very hard to poison the well. But the people you're trying to discredit have, in many cases, a long history here of being able to demonstrate the correctness of their knowledge and the justification of their confidence in it. You're not going to win by trying to erode that with ham-fisted social engineering.