Except that when you do a live performance in the listening room, you're not in the Concertgebouw. So you lost your reference.
Only if the Concertgebouw is your reference. I think most would be satisfied with accurate reproduction of live instruments in their listening/living room. But if you really want the Concertgebouw, my second method can be used.
"Which sounds more realistic" is testing preference and only preference. No testable, verifiable equivalence there.
Sure there is - just repeat the experiment many times, with many different listeners. Assuming there is a clear preference on average between A and B (where those are two different recording methods, or two different stereo systems, or whatever), that's a win. If there isn't a clear preference, change something else and A/B that instead. The end result of many iterations of that is likely to be pretty decent.
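To put a rough number on "a clear preference on average", here is a minimal sketch of one way to check it: an exact two-sided sign (binomial) test against the assumption that listeners are just guessing 50/50. The function name and the listener counts are made up for illustration, and this is only one of several reasonable tests, not a prescribed protocol.

```python
from math import comb


def ab_preference_p_value(prefers_a: int, prefers_b: int) -> float:
    """Two-sided exact sign test: how surprising is this A/B split
    if listeners actually had no preference (50/50 guessing)?"""
    n = prefers_a + prefers_b
    if n == 0:
        raise ValueError("need at least one listener verdict")
    k = max(prefers_a, prefers_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it for the two-sided test; cap at 1 for near-even splits.
    return min(1.0, 2 * tail)


# Hypothetical tallies, not real data.
print(ab_preference_p_value(15, 5))   # 15 of 20 listeners picked A
print(ab_preference_p_value(11, 9))   # nearly even split
```

With 15 of 20 listeners picking A, the p-value comes out around 0.04, so guessing alone would rarely produce a split that lopsided. With 11 versus 9 it is nowhere near significant, which is the "change something else and A/B that instead" case.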
Your third method has some hope of working, but now, how do you record this information? You have two ears, and your head moves around at a concert. You have to capture a lot more than the mere pressure (which is 1/4 of the information at a single point in the atmosphere) at one point in order to do this.
Yes, I'd do it with two microphones on a dummy head, and take care of head movements somehow (either actually move the head, or average over phases, or something I'd have to think more carefully about).
But I actually think this third method is probably the weakest of the three, because certain aspects of the sound field cannot possibly be reproduced accurately in some random listening room (like reflected sound with all its crazily complex phase and delay patterns, which is most of the sound in a concert hall). So knowing what to focus on and what to let slide requires knowing how humans perceive sound and how to fool them - which is really hard. The first two methods take that out of the equation by using a human to tell you what they perceive/prefer, and by averaging over many people to beat down the BS level.
I'm not saying measurements like that are useless - far from it, they are crucial to set a baseline and just get close. And the more we learn about human audio perception the more useful they will be. But I think in the end real live humans are the ideal "microphone" for that sort of test, because what we care about is what real live humans perceive, not what mics record.