New telepathy test, the sequel.

JayUtah is, in my estimation, the finest example of an ISF member that one could ask for.

Thanks, you've given me some big shoes to fill.

It is true that he is human and sometimes lets a little frustration show through his posts...

For which Loss Leader and the other moderators have had no qualms about levying sanctions. I haven't received any chastisement from a moderator that I didn't deserve.

Finding any flaw in his understanding of experimental science is close to unheard of.

It happened in this thread. The other guy was right, and I cheerfully accepted the correction. This is why we do science in teams.

He gave me that one and I still count it as a triumph.

It was well argued. How could I resist?
 
The hit rate of 50% does not have any particular significance in my tests. I would like (ideally) a hit rate of 100% for credible answers, and a hit rate of 0% for non-credible ones.
Just do the maths and prove you actually know how to do it, as anyone who claimed to be doing serious testing would.

How many people would have to pick the correct answer out of 4 for the result of your testing to be statistically significant? That's all you're actually being asked. You should have done those calculations before you even started your tests, as any competent scientist would have, so you should have been able to answer the question straight away. Instead you spend your time slandering Yahoo Answers and Loss Leader.
 
How many people would have to pick the correct answer out of 4 for the result of your testing to be statistically significant?
This is a p-value calculation, using the binomial distribution. The answer is provided for example by any online binomial calculator, such as this one: https://stattrek.com/online-calculator/binomial.aspx.

For example, in my first two tests on this forum, I found that eight answers were credible (3 in the first test and 5 in the second), and that only seven of those were correct (a hit rate of 7/8 = 87.5%). The p-value is p = 0.00038146973, which is statistically significant (probability of success on a single trial = 0.25 because of four possible answers).

I actually did such a p-value calculation at the end of my second test, using a somewhat different method:
...
It may be interesting to introduce a credibility threshold, equal to CR = 5, for example. Then, GregInAustin's answer (CR = 2) is eliminated, and I obtain 3 + 4 = 7 ("strongly") credible answers for the two tests (on this forum, so far), all of which are numerically correct. The probability for this is equal to p = (1/4)^7 = 6.10 x 10^-5 (assuming a 25% probability of answering correctly, for each answer). ...
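For anyone who wants to check those two figures without the online calculator, here is a minimal Python sketch; the helper name binom_tail is mine, and it assumes (as the calculation above does) independent guesses with a 25% chance of success each:

[code]
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k hits out of n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# At least 7 correct out of 8 credible answers, chance level 0.25:
print(binom_tail(8, 7, 0.25))   # 0.00038146972...

# All 7 "strongly" credible answers correct:
print(0.25**7)                  # 6.103515625e-05, i.e. 6.10 x 10^-5
[/code]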
 
I asked "How many people would have to pick the correct answer out of 4 for the result of your testing to be statistically significant?", not "How many people whose answers you like would have to pick the correct answer out of 4 for the result of your testing to be statistically significant?", and you didn't actually answer either question.
 
I asked "How many people would have to pick the correct answer out of 4 for the result of your testing to be statistically significant?", not "How many people whose answers you like would have to pick the correct answer out of 4 for the result of your testing to be statistically significant?", and you didn't actually answer either question.
I actually like all quality answers; even non-credible answers can be of interest. From the online binomial calculator, I find that, if 8 answers are given, with four possible answers (like 1, 2, 3 or 4), at least 5 of those must be correct in order to have a p-value less than the conventional threshold of 5% (then p=0.02729).
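A short sketch along the same lines (again assuming independent answers with a 25% chance each; the scan over hit counts is just for illustration) shows where that threshold comes from:

[code]
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# For 8 answers with a 25% chance of a correct guess, find the smallest number of
# hits whose tail probability drops below the conventional 5% threshold.
for hits in range(9):
    p_val = binom_tail(8, hits, 0.25)
    print(hits, round(p_val, 5), "significant" if p_val < 0.05 else "not significant")
# 5 hits is the first line below 0.05 (p ~ 0.02730)
[/code]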
 
Great.

Now all you have to do is achieve such a result in a properly designed and controlled test, i.e. one which (among other things) does not rely on the subjective judgement of the experimenter, let alone the test's subject.

If you agree to such a test there are plenty of people here who can help you design and run it. If it's clearly a fair test I'm confident we could muster eight (non-sarcastic) participants for it.
 
You have a pretty serious victimization complex going on there. Dial down the paranoia a few notches, please.

But the paranoia is the whole point; dial it down and there's nothing left. The reason Michel H is assigning a credibility rating to answers and making sure that the ones with the "wrong" answer are downgraded is that he sincerely believes that the people giving wrong answers are lying. These "tests" on Yahoo Answers are not scientific enquiries, they're hostile interrogations, and his aim is to make everyone admit that they can read his mind. The victimisation complex is not a bug, or even a feature; it's a mission statement.

Dave
 
I don't think we need people on this forum who seem to enjoy constantly attacking and disparaging others. The goal of a forum like this is helping each other in a spirit of honesty. If your goal is to attack others, I think you should quit posting.

I have to second (or third or fourth) what Loss Leader said about Jay. I have learned a great deal from his postings. I believe your problem is that he keeps taking you to task for your misuse of science and flawed tests and results. Unfortunately, I believe he keeps talking to you on a level that is above your understanding, though your professed education level says you should understand it all perfectly.
 
I found that eight answers were credible

This right here is a major problem.

You need to eliminate the ability, or the need, for you to determine whether an answer to your test is credible and then use that determination to choose which answers count as correct.

This process of yours makes your test appear biased and unscientific.
 
This is a p-value calculation, using the binomial distribution. The answer is provided for example by any online binomial calculator, such as this one: https://stattrek.com/online-calculator/binomial.aspx.

For example, in my first two tests on this forum, I found that eight answers were credible (3 in the first test and 5 in the second), and that only seven of those were correct (a hit rate of 7/8 = 87.5%). The p-value is p = 0.00038146973, which is statistically significant (probability of success on a single trial = 0.25 because of four possible answers).

I actually did such a p-value calculation at the end of my second test, using a somewhat different method:


I'm not at all sure that you've answered the question. The question was: How many trials with a 25% chance of a randomly correct answer would you have to run to note with a 95% degree of confidence that people will get the right answer 50% of the time?

Your numbers don't appear to answer that question.
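For what it's worth, one possible reading of that question — how many trials n, at a 25% chance level, would be needed before getting at least half the answers right yields p < 0.05 — can be scanned in a few lines. This is only a sketch of that one reading, not a proper power calculation, and the range of n is arbitrary:

[code]
from math import ceil, comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# For each sample size n, test whether getting at least half the answers right
# would be significant at the 5% level when chance alone gives 25%.
for n in range(4, 21):
    hits = ceil(n / 2)
    p_val = binom_tail(n, hits, 0.25)
    print(n, hits, round(p_val, 4), "significant" if p_val < 0.05 else "")
[/code]

Note how the verdict flips back and forth for small n (an artefact of the ceiling on the hit count) before settling, which is itself a reminder of how fragile very small samples are.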
 
This right here is a major problem.

You need to eliminate the ability, or the need, for you to determine whether an answer to your test is credible and then use that determination to choose which answers count as correct.

This process of yours makes your test appear biased and unscientific.
A previous test we helped him design and run here allowed him to continue to assess the credibility of the answers but obliged him to make those assessments without knowing whether the answer was correct or not, to remove any (conscious or unconscious) bias. Needless to say that test produced the result expected by chance. Once he knew which answers were correct he simply changed his credibility ratings to favour them and claimed victory anyway.

I think that was the point when I finally accepted there was nothing we could do to help him, and resolved to stop responding. And yet here I am. :(
 
The thought had crossed my mind. It all depends on if #4 is the answer he claims to have written.

Well, if it had been "1" then I have no doubt that 'your first answer would have been most credible' :rolleyes:. So we're up to a 50% chance. Had it been 2 or 3, then the number of words or something else equally irrelevant would have been used for justification.
 
I have to second (or third or fourth) what Loss Leader said about Jay. I have learned a great deal from his postings. I believe your problem is that he keeps taking you to task for your misuse of science and flawed tests and results. Unfortunately, I believe he keeps talking to you on a level that is above your understanding, though your professed education level says you should understand it all perfectly.

+1 from me as well. I'd also draw attention to Pixel42's contributions over many years and several iterations of this thread. She has patiently offered Michel practical suggestions for testing protocols that would have provided meaningful results (as she has for many formal and informal claimants over the years), only to have them rejected because the results would have reflected reality and not the fiction the OP wants to believe.
 
Yahoo just removed, once again, an interesting question from the Alternative category. The question is:
Which of the following is true?

1. The Earth is flat.

2. It's not unusual to be followed by translucent/transparent entities.
...
8. Domestic cats were genetically engineered by an alien race.

9. There's another bad sun behind our sun.

10. There have been 6/7 clones of Barack Obama (maybe more).
The link of the question is: https://answers.yahoo.com/question/index?qid=20200109121050AAhdcmn , the URL of the Alternative category is: https://answers.yahoo.com/dir/index?sid=396547171&link=list.
 
This is a p-value calculation, using the binomial distribution.

Well, no. Since all but one respondent can see what the previous ones answered, each guess is neither independent nor governed strictly by chance or by the hypothesis under test. This is the conformity effect. A properly-controlled experiment would satisfy the constraints of the binomial distribution by hiding everyone's answers from each other. And, ideally, from the experimenter. (I.e., you want success or failure to be noted automatically, without human intervention.)

The answer is provided for example by any online binomial calculator, such as this one: https://stattrek.com/online-calculator/binomial.aspx.

And those are helpful tools, provided one knows the underlying statistical principles so that one inputs the right values, reads the right results, and interprets them appropriately. One problem we'll discuss later is that those kinds of convenience tools often hide elements of the underlying science from the user in the pursuit of a simple, easily-accessible tool. Just in terms of software, there are many more tools that have greater appeal to serious scientists, such as MATLAB. These give you more control and greater insight into your data. Better still -- but generally relegated to professional statisticians -- is a deep understanding of the derivation and meaning of the mathematical models around which these tools are built. This lets you do things the tools can't.

For example, in my first two tests on this forum...

Changing horses.

You were asked to solve either or both of two specific problems -- one of them the example you yourself came up with. The goal of the exercise was for you to demonstrate your competence. That's done typically by posing problems that the candidate has to solve. If the candidate poses and solves his own problems -- especially with a preprogrammed solution -- it's the fox guarding the henhouse. What you seem to have done is to Google up a canned solution to a specific kind of problem, then contrive a new example that fits the can. That's not as convincing a test of your competence as if you had answered the questions posed.

I found that eight answers were credible...

As others have rightly pointed out, this is a show-stopper. You determined that the answers were "credible" after having seen how they answered. In no way can this be considered defensible experimental practice. No, it does not matter in the least how much you assure us your (subjective) criteria for "credibility" are fair and impartial. A properly-controlled experiment employs blind adjudication by a panel of judges where subjective criteria cannot be entirely controlled away.

Further, your assurance was undermined by evidence. When you were able to see the answers, your judgment of credibility resulted in a high rate of success. When you were blinded to the answer during adjudication, your success rate dropped back to statistical insignificance. As Pixel42 points out, you decided to revise your credibility criteria on an already-decided data set, after seeing that it failed to achieve significance. This is fairly indisputable evidence that your imposition of criteria is intentionally biased. Even among the most eminent and trustworthy of researchers, the possibility for even unconscious bias is deemed enough to institute controls that preclude any possibility of non-blind post hoc manipulation. Your bias is evident right there in the data, so after you've done your culling it doesn't matter what numbers you plug into a statistical calculator. It's all garbage at that point.

And as your critics have all concluded, your unwillingness to apply even the most trivially-implemented controls against bias conveys the impression that the subjective factor is an intentional ingredient of your experiment. Since this is an immediately fatal condition for your experiment design, we could stop right now because any further discussion of statistics has been rendered moot. But it turns out you're not finished making mistakes.

3 in the first test and 5 in the second...

You seem to be confused about what constitutes a trial. First you're trying to aggregate the results of two separate experiments employing two different subject pools as if it were one data set. Then you're considering a trial to be the single answer provided by each of several different subjects. Since you admit that the outcome under test varies from subject to subject, this violates the homogeneity constraint of the binomial distribution. We'll talk more about this later.

You refer to ganzfeld trials as the model for comparison to your method. Now that brings in a whole lot of baggage that is alleged to induce a certain mental state. Let's concentrate on the telepathic reading portion of a hypothetical ganzfeld experiment using Zener cards.

Zener cards are seen near the beginning of the original Ghostbusters. There are 25 cards -- 5 each of 5 types, the type given by the geometric figure on the face. The cards are shuffled. Then the sender deals a card from the top of the deck (at no time allowing the receiver to see it), visualizes the figure on its face, and allows the receiver to guess the card type allegedly via telepathy. The card is then discarded. This procedure repeats for all 25 cards. The receiver (ideally) does not know whether any guess is a success or a failure, or what the figure on the card was (even after finalizing his guess). The number of successful guesses over the 25-card run is one trial, one data point.

But we're not done. The cards are then shuffled and the test repeats with the same sender and receiver. The number of successful guesses on this trial is the next data point. Lather, rinse, and repeat. The number of successful guesses per run for that one pair of subjects, over many full-deck runs, is expected to fit the normal distribution. The actual distribution of responses is compared to the normal distribution and a customary set of conformance statistics is drawn up, one of them being the p-value describing the probability that an ostensibly normally-distributed set of experimentally-obtained values is actually normally distributed (as opposed to being affected by some other phenomenon).
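To make the shape of those per-run data concrete, here is a small simulation sketch. It models only the chance baseline: the receiver is assumed to guess uniformly at random with no feedback, and the symbol names and number of runs are illustrative:

[code]
import random
from collections import Counter

SYMBOLS = ["circle", "cross", "waves", "square", "star"]

def zener_run():
    """One full-deck run: 25 cards (5 of each symbol); the hit count is one data point."""
    deck = SYMBOLS * 5
    random.shuffle(deck)
    # Chance-level receiver: guesses uniformly at random, sees no cards, gets no feedback.
    return sum(random.choice(SYMBOLS) == card for card in deck)

# Many full-deck runs for one sender/receiver pair.
runs = [zener_run() for _ in range(10000)]
print("mean hits per run:", sum(runs) / len(runs))      # about 5 expected by chance
print("hit-count distribution:", sorted(Counter(runs).items()))
[/code]

Plotted, that hit-count distribution is the roughly bell-shaped curve the observed data get compared against.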

Now the same subject pair comes in the next month and does the same test again. We didn't specify the number of full-deck trials that were done in the previous experiment. Ideally it would be an N chosen to ensure that the fit to the normal distribution is within the desired confidence interval. N.B., N is not the number of Zener cards or the number of subjects. Now the same subject pair does another set of full-deck Zener trials. If we say that N for June was 29 and N for July was 33, do we get to lump all the trials together and say the total N is 62? Generally no. Each experiment is meant to aggregate over uncontrollable factors in a way that is not suspected to change over the experiment. A new experiment assumes those factors might possibly vary. Lumping everything together is tantamount to assuming you've controlled for everything you might later want to investigate as a possible uncontrolled correlate.

In contrast to a proper run, you allow only one guess per subject and only one subject per trial. Since you can't directly compare the performance of one subject to another, you have no actual scientifically valid trials. Even so, if we had to do the Zener experiment above with senders and receivers whose telepathic ability was suspected to vary or to be interdependent, we would have to turn to more sophisticated methods of gathering and analyzing data, such as the chi-square test.

Your data simply don't supply the statistical basis you propose.

probability of success on a single trial = 0.25 because of four possible answers.

Except that's not true. As I explained earlier, when people are presented with a discrete number of ordered alternatives and asked to choose from them at random, their choices -- when aggregated -- do not produce a uniform distribution among all the alternatives. If the success choice in each trial was always the 4th alternative, then you can't use p=0.25 as the probability of a successful outcome by chance.

Experimenters have known of this effect for quite a long time. This is why, in properly-controlled experiments, the success answer varies in position among the decoys. Over sufficient trials (i.e., often up to 100) the variance from the positional bias averages out so that p is close enough to 0.25 to satisfy the binomial statistic constraints. Your experiment design ignores this effect.

So go to your tool and vary the probability of success within some small interval. See how it jumps back and forth across the significance limit? This is a strong indicator that you need to have a protocol that varies the positions of the success value and the decoys.
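Here is that exercise in a few lines rather than the web tool, using the same 5-of-8 example; the particular chance levels tried are arbitrary, meant only to mimic a modest positional bias:

[code]
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 5 or more hits out of 8 answers, with the per-guess chance level shifted slightly
# away from the idealised 0.25 to mimic positional bias among four options.
for chance in (0.22, 0.25, 0.28, 0.30, 0.33):
    p_val = binom_tail(8, 5, chance)
    print(chance, round(p_val, 4), "significant" if p_val < 0.05 else "not significant")
[/code]

The verdict flips between roughly 0.28 and 0.30, a shift small enough that positional bias could plausibly account for it.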

I actually like all quality answers; even non-credible answers can be of interest. From the online binomial calculator, I find that, if 8 answers are given, with four possible answers (like 1, 2, 3 or 4), at least 5 of those must be correct in order to have a p-value less than the conventional threshold of 5% (then p=0.02729).

Here's a salient quibble. You correctly read from the tool the line labeled "P(X ≥ x)", which properly corresponds to the probability that at least 5 correct guesses ensue. But the value given is 0.0272979736... I'm sure you discovered, as did I, that the tool doesn't let you copy the answer. So you had to type in the actual digits. You gave us five decimal places, typed by hand. Why did you not round the last digit appropriately (0.02730 rather than 0.02729)? And why, according to you, was that the proper number of digits? Physics is one of those sciences where the precision of numbers rigorously matters, since our computers always give us the maximum precision of their number representations no matter how appropriate that ends up being. Statistics is another one of those fields. Work in either one long enough, and providing the right number of significant digits and rounding the last one correctly becomes second nature. It's harder not to do it than to do it habitually. All right, fine, it may just be hasty typing.

But it turns out it might not be just a quibble. Play with your tool a little bit. (Yes, I can hear the audience snickering.) Vary N and x by one number in either direction. If you vary x from 5 to 4, your cumulative probability for that tail jumps quite a lot, way past the p = 0.05 milestone. Set N to 9 and you still just barely squeak by. Why is this happening? Because your numbers are all way too small. Your model doesn't have enough degrees of freedom to represent variance in the composition of the data in anything but the coarsest imaginable intervals.
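And the same kind of check for N and x in a few lines, holding the chance level at the idealised 0.25; the specific (n, hits) pairs are just the one-step nudges described above:

[code]
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Vary the number of answers counted (n) and the number of hits (x) by one each way.
for n, hits in ((8, 5), (7, 5), (9, 5), (8, 4), (8, 6)):
    p_val = binom_tail(n, hits, 0.25)
    verdict = "significant" if p_val < 0.05 else "not significant"
    print(f"n={n}, x={hits}: p = {p_val:.5f} ({verdict})")
[/code]

One hit fewer pushes p well past 0.05, while one more answer in the pool with the same hit count only just squeaks under it.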

Small changes in the parameters of one's experiment that produce vast changes in the significance of the outcome are one of the things that tell experienced experimenters that their experiment is too small. Yes, the confidence interval is computable for those small numbers. That doesn't mean it's meaningful. Working out the error bounds in all your parameters would have been helpful. This is why you need more than just a simple online tool to teach you the statistics you need in order to craft an experiment that tells you what you need to know instead of what you want to hear.

Keep in mind where those numbers for N and x would come from in your experiment. Your N is improperly derived, but we'll take it arguendo. It represents the number of people whose answers you thought were serious. If that's off by one -- that is, if you whimsically decided that just one more or less response was credible -- it produces a vast difference in the statistical outcome. That's when a conscientious experimenter realizes his experiment is too statistically unstable -- too sensitive -- to produce confident data. It's where you need to look at more than just cursory cumulative probabilities.

You're not even remotely close to rigorous science here. And contrary to your incessant complaints, the people here are trying to help you by pointing out your various errors. Sadly, many seem to have concluded that you have no interest in that. However, I can assure you that your continued bluster is probably not going to succeed. People can see you gaslighting your way through this debate. They can see you abusing and insulting your critics, trying so very hard to poison the well. But the people you're trying to discredit have, in many cases, a long history here of being able to demonstrate the correctness of their knowledge and the justification of their confidence in it. You're not going to win by trying to erode that with ham-fisted social engineering.
 
