The other thing missing is the success criteria. You need someone whose maths is less rusty than mine to calculate how many trials you need to do for the result to be statistically significant, and how many hits are required to reach that result. Maybe JayUtah will oblige.
Sorry, just noticed this.
For a binomial distribution, the rule of thumb is that np and nq both have to be at least 10. Since p and q are both 0.5 here, you want n = 20.
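That rule of thumb can be sketched in a few lines of Python (variable names are mine, just for illustration): search for the smallest n where both np and nq reach 10.

```python
# Rule of thumb for leaning on a binomial model:
# both n*p and n*q should be at least 10.
p = 0.5      # chance of guessing right
q = 1 - p    # chance of guessing wrong

n = 1
while n * p < 10 or n * q < 10:
    n += 1

print(n)  # smallest n satisfying the rule: 20
```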
Now can it be done with fewer? Yes, but smaller values for n mean you have to get more of them right. This is because the binomial distribution is a discrete probability distribution. Trying to do hypothesis testing with this distribution by itself is the statistical equivalent of modeling the Sydney Opera House with Legos. Numbers that can change only by proportionally large, discrete steps don't fit curves very well, and significance testing is all about deciding how closely two curves fit. It's hard to do that conclusively with Lego curves. If n = 5, you essentially have to guess right all five times for the result to be significant at the 95% confidence level. You get more leeway as n increases.
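You can check the n = 5 claim with exact binomial tail probabilities, using nothing but the Python standard library (a quick sketch; the helper function is mine):

```python
from math import comb

def p_at_least(k, n, p):
    """Probability of k or more successes in n Bernoulli(p) trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# With n = 5 and p = 0.5 (pure guessing):
print(p_at_least(5, 5, 0.5))  # 5/5 right: 1/32 ~ 0.031, under the 0.05 cutoff
print(p_at_least(4, 5, 0.5))  # 4+ right: 6/32 ~ 0.188, not significant
```

So with only five trials, a perfect score is the only outcome rare enough to clear the 95% bar.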
When n is sufficiently large, the binomial distribution starts to approximate the normal distribution to the point where we can begin to exploit the properties of the normal distribution. Specifically, the z-test for significance becomes an option. At the minimum n = 20, you need to get 15 right in order for your z-value to exceed the requisite 1.96 (corresponding to the 95% confidence level).
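The arithmetic behind that z-value looks like this (a sketch of the plain normal approximation; it omits the continuity correction some texts apply):

```python
import math

n, p = 20, 0.5
hits = 15

mean = n * p                      # 10 expected hits under the null
sd = math.sqrt(n * p * (1 - p))   # sqrt(5) ~ 2.236

z = (hits - mean) / sd
print(z)  # ~2.236, which exceeds the 1.96 needed for 95% confidence
```

With only 14 hits, z is about 1.79, which falls short of 1.96, so 15 really is the minimum at n = 20.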
The standard experimental method in this kind of case introduces an indirection. It requires a lot more work, but the science is far less assailable. You do a run of, say, 10 trials. Chance says you will get the right answer 5 times. The number of right answers you actually get is your score for that run; that's one data point. Over many 10-trial runs, those scores are expected to fit a normal distribution around a mean of 5, if the null hypothesis holds. The binomial distribution predicts the null-hypothesis behavior of each run of n trials. The mean of the actual experimental scores over several runs (a different, larger n) is what gets tested for significance against the normal distribution. Obviously this means you have to do many runs, each with enough trials in it to let the score vary suitably. If you did 100 runs of ten trials each, you'd need a mean score of at least around 5.3 in order to call it significant at 95% confidence. The fewer runs you do, the fewer degrees of freedom in your model. The fewer trials in each run, the less the variance behaves like a normal distribution's. (The per-run score is still a discrete distribution -- binomial, not truly normal.) All these affect how confidently you can fit (or fail to fit) your experimental curve to the normal distribution that represents the null hypothesis.
Now these numbers are rough estimates from some quick calculations, so don't take them too seriously.