jt512 - Thanks for taking the time to read over my analysis and provide me with feedback.
The p-value of .0672 for your first test shows no evidence of bad luck. In fact, if you input the t-value into the appropriate Bayes factor calculator at
http://pcl.missouri.edu/?q=bayesfactor, I think you'll find that the Bayes factor actually favors the null hypothesis over the alternative.
The p-value of .0086 from your second test is less suggestive of bad luck than it appears. Again, I suggest you enter the appropriate data into Rouder's on-line calculator and observe the Bayes factor. My guess is that it will only modestly favor the alternative hypothesis.
I'm not sure why you think a Bayes computation is suitable for this. Could you explain? The point of the data collection we used was to provide a sample whose probability distribution under pure random chance is known, so I'm not sure why we need to bring in a subjective prior probability for the result.
However, Bayes is not something that comes up in my work, so I'm a bit rusty on it. What is the subjective prior being used? What does the 'scale r on effect size' represent? Does the calculator you linked to have a way to combine results from multiple experiments so that the final result reflects all of the known information?
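For concreteness, here is a minimal sketch of what I gather a calculator like Rouder's computes for a one-sample t test, assuming it implements the JZS Bayes factor of Rouder et al. (2009); on that reading, the 'scale r on effect size' would be the width of the Cauchy prior placed on the standardized effect size. The t-value, sample size, and default scale below are made up for illustration, not taken from this thread.

```python
# Sketch of a JZS Bayes factor for a one-sample t test (assumed to match
# Rouder's calculator; not verified against it).  BF10 > 1 favors the
# alternative, BF10 < 1 favors the null.
import numpy as np
from scipy import integrate, stats

def jzs_bf10(t_obs, n, r=0.707):
    """Bayes factor for H1 (Cauchy(0, r) prior on effect size) vs H0 (effect = 0).
    r = 0.707 is a commonly used default scale, assumed here."""
    df = n - 1
    # Marginal likelihood under H1: average the noncentral-t density of the
    # observed t over the Cauchy prior on the standardized effect size delta.
    def integrand(delta):
        return stats.nct.pdf(t_obs, df, delta * np.sqrt(n)) * stats.cauchy.pdf(delta, scale=r)
    m1, _ = integrate.quad(integrand, -np.inf, np.inf)
    m0 = stats.t.pdf(t_obs, df)  # likelihood under H0: central t density
    return m1 / m0

print(jzs_bf10(t_obs=1.85, n=100))  # made-up numbers, for illustration only
```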
In your third analysis, dataset 2014-2, the binomial analysis has multiplicity problems which you did not correct for.
It's true I didn't correct for multiplicity with the binomial tests. A trinomial distribution would be most appropriate, but I was using Excel, and the built-in binomial function is much easier than programming an exact trinomial probability distribution. I don't think the multiplicity issue would make much difference to the results (only three binomial comparisons were made), but feel free to do those computations yourself and let me know if I'm mistaken on that point.
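For concreteness, here is a rough sketch of the exact trinomial computation I was avoiding in Excel: sum the null probability of every outcome that is no more likely than the observed split. The counts and null probabilities are placeholders, not the actual 2014-2 data.

```python
# Exact trinomial (multinomial) test sketch: p-value = total null probability
# of all outcomes at least as "extreme" (i.e., no more probable) than the
# observed split.  All numbers below are made up for illustration.
import numpy as np
from scipy.stats import multinomial

observed = np.array([38, 55, 7])       # e.g. wins, losses, ties (placeholder counts)
p_null = np.array([0.45, 0.48, 0.07])  # null probabilities (placeholders)
n = int(observed.sum())

p_obs = multinomial.pmf(observed, n, p_null)
p_value = 0.0
for k1 in range(n + 1):
    for k2 in range(n - k1 + 1):
        prob = multinomial.pmf((k1, k2, n - k1 - k2), n, p_null)
        if prob <= p_obs + 1e-12:      # outcome no more probable than the observed one
            p_value += prob
print(p_value)
```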
The chi-squared tests don't have this problem and are the more appropriate analysis.
I agree.
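For reference, the single goodness-of-fit test over all three outcomes looks something like the sketch below; the counts and expected proportions are placeholders, not the real tallies.

```python
# Chi-squared goodness-of-fit sketch: one test over all three outcomes,
# which sidesteps the three-separate-binomials multiplicity issue.
# All numbers are made up for illustration.
from scipy.stats import chisquare

observed = [38, 55, 7]             # e.g. wins, losses, ties (placeholders)
null_probs = [0.45, 0.48, 0.07]    # expected proportions under the null (placeholders)
expected = [p * sum(observed) for p in null_probs]

stat, pval = chisquare(observed, f_exp=expected)
print(stat, pval)
```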
The p-value for the test on the raw data, .0108, is suggestive, but if we were to convert it to a Bayes factor, I think again we'd see that it is only weak evidence against the null. Unfortunately, I don't think Rouder has an online calculator for chi-squared tests.
Again, I'm not sure what the Bayes factor is supposed to represent here, but feel free to educate me about it.
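For what it's worth, one generic calibration that doesn't require a dedicated calculator is the Sellke–Berger bound, which converts a p-value into a lower bound on the Bayes factor in favor of the null; I'm not claiming this is the computation you have in mind, only that it gives a rough ceiling on the evidence.

```python
# Sellke-Berger calibration: for p < 1/e, -e*p*ln(p) is a lower bound on the
# Bayes factor in favor of the null; its reciprocal caps the evidence against
# the null.  This is a generic bound, not a test-specific Bayes factor.
import math

def min_bf01(p):
    return 1.0 if p >= 1 / math.e else -math.e * p * math.log(p)

p = 0.0108  # chi-squared p-value quoted above
print(min_bf01(p), 1 / min_bf01(p))  # roughly 0.13 and 7.5
```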
As to your chi-squared test on the values of the hands, the p-value is meaningless, because the values do not follow a chi-squared distribution under the null hypothesis.
I agree that this is a questionable approach. The main point of it is that the alignment of the frequency results with the values of the different hands is consistent with the hypothesis of bad luck. If you have an analysis suggestion for looking at not just the frequency of the hands but also their values, I would be open to suggestions. With only three possible outcomes, I don't feel regression is appropriate.
Additionally, your analysis overall suffers from uncorrected multiplicity, as you've conducted three sets of tests on the same "luck" hypothesis, and any one of them resulting in statistical significance would have allowed you to claim "bad luck."
I disagree that this is an issue of uncorrected multiplicity, as the datasets can be treated as independent experiments. There is some overlap between the session data and the 2013 data, but it involves a small enough number of hands that I'm comfortable with the assumption that the datasets are independent.
Further, multiplicity corrections are designed to account for some positive results occurring through random chance alone (e.g., one significant test out of 20 at 95% confidence). Since we have three out of three independent datasets showing the effect (actually more, if you include the 2011 and 2012 datasets documented earlier in this thread), I think the results can be considered robust. I'm not sure why or how it's occurring, but it does seem to be a consistent finding.
However, as with the above, if you want to look at what effect multiplicity corrections would have, please feel free to do so and share the results.
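To make that concrete, here is a sketch showing both views side by side: a Holm adjustment (treating the tests as multiple looks at one hypothesis) and Fisher's method (treating them as independent replications). It uses the three p-values quoted in this exchange purely as an illustration; whether those are exactly the tests you mean is a separate question.

```python
# Two ways to handle the multiplicity question, using the p-values quoted in
# this exchange (.0672, .0086, .0108) only as an illustration.
from scipy.stats import combine_pvalues

pvals = [0.0672, 0.0086, 0.0108]

# View 1: multiple looks at one hypothesis -> Holm step-down adjustment.
order = sorted(range(len(pvals)), key=lambda i: pvals[i])
holm, running_max = {}, 0.0
for rank, i in enumerate(order):
    running_max = max(running_max, min(1.0, (len(pvals) - rank) * pvals[i]))
    holm[i] = running_max
print("Holm-adjusted p-values:", [round(holm[i], 4) for i in range(len(pvals))])

# View 2: independent replications of the same effect -> combine with Fisher's method.
stat, p_combined = combine_pvalues(pvals, method='fisher')
print("Fisher combined p-value:", round(p_combined, 4))
```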
Furthermore, your analysis raises questions about selection bias.* Why were these specific datasets chosen for analysis and not others that could have been chosen instead?
You can read through the thread for answers to most of these questions. The all-in hands were chosen after discussion and a conclusion that they represented a true picture of random results uncontaminated by issues of skill during play. The {A,K}, {Q,8}, and {5,2} data was selected by my husband for similar reasons, albeit without the additional discussion, as it's clear the cards dealt at the beginning of a hand are unaffected by skill. Personally, as I mentioned in my write-up (footnote 2), I tried to persuade him to write down every hand he dealt, but that slows down the game (I tried it, it does), and he did not want to impose that on his fellow poker players.
Were the starting and stopping criteria decided in advance, and unrelated to the results? Were there other tests that were run whose results you have not shown? Or were there other data that you looked at but which did not seem promising and hence were not formally analyzed?
Starting and stopping criteria were decided in advance for the 2013 dataset. Other tests were run prior to the 2013 analysis and are discussed earlier in this thread. There is no other data we looked at that was not included in the analysis. The session data collection (which started in 2013) is ongoing. I want to test some hypotheses about what he can change and whether or not it will have an impact on the results. Theoretically, nothing should affect the random chance of the cards dealt. But these results have a low probability of occurring under the null, while they are consistent with the alternative hypothesis.
Finally, whether you admit it or not, the hypothesis that people have an attribute called "luck" is a supernatural hypothesis, and no matter how statistically significant your results may be, the probability that those results are due to errors in the experiment is overwhelmingly greater than the probability that they are due to the supernatural hypothesis. If not due to random error, then it is almost certain that significant results are due to systematic error, and the smaller the p-values, the more that systematic error is the favored explanation. Convincingly small p-values of supernatural hypotheses give us an opportunity to learn how a well-intentioned experiment can go wrong.
Yes, I'm well aware of this issue, as is my dh. However, all he can do is collect the data to the best of his ability. How would you suggest he improve his data collection to test whether his observation is correct, that is, whether he actually gets results worse than random chance predicts?
The question is, did he include the hands that brought the supposed anomaly to his attention in the dataset? If so, then the dataset is biased.
No, he did not include that session, as he didn't start collecting data until after he noticed he seemed to be getting a lot of {5,2} hands dealt to him. Since he didn't know if it was just observational bias, he included a 'good' hand {A,K} and an 'average' hand {Q,8} for comparison purposes, to determine if that was the case. It appears to be so.