Ah, I didn't get the ez/yes connection. I see now.
Let's add that "bai" (pronounced sort of like "bye") means "yes".
That's right. I was actually in a situation with several Basque members of the Spanish Air Force for whom I served as Spanish/English interpreter. At lunch, the waitress asked, "More coffee?", and one of them said, "Ez, ez" (Basque for "no, no"), which sounded to the waitress like "Yes, yes". She poured the coffee.

That's how I learned my only Basque word.
Setting aside the flying analogy, I think my point is still valid, and it boils down to:
a) The OP announces a major revolution, comparable to what an iPhone would have been 40 years ago: "There are programs like Dragon that are very good, but this just seems beyond the pale insofar as its uncanny subtlety of recognition".
b) All the posts that follow describe standard speech recognition software, much like Dragon, with all the known problems: different accents, fast speech, the need for training, and so on. There is no iPhone 40 years in advance.
I'm not sure what your point is here, but the lecture was about a novel approach to speech recognition that seems to have some promise. I have said that the algorithm does not involve much more than contextual clues and phoneme recognition at this point, apparently relying on a huge lexicon of "probabilities". In other words, it is relatively primitive and yet does a credible job, and it holds great potential for future iterations.
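To give a feel for what I mean, here is a toy Python sketch of how "phoneme recognition plus a huge lexicon of probabilities plus contextual clues" could work. Every name and number in it is invented for illustration; I am not claiming this is how the lecturer's system is actually built:

```python
# Toy sketch of "phoneme recognition plus a lexicon of probabilities".
# The lexicon, phoneme strings, and numbers are all invented for
# illustration; this is NOT the lecturer's actual system.
from itertools import product

# P(word | phoneme chunk): a stand-in for the acoustic/lexicon lookup
LEXICON = {
    "T UW": {"two": 0.5, "too": 0.3, "to": 0.2},
    "B IY": {"be": 0.6, "bee": 0.4},
}

# P(word | previous word): a stand-in for the "contextual clues"
BIGRAMS = {
    ("two", "be"): 0.05, ("too", "be"): 0.10, ("to", "be"): 0.70,
}

def decode(phoneme_chunks):
    """Score every candidate word sequence by lexicon probability times
    bigram context probability, and return the highest-scoring one."""
    best_words, best_score = None, 0.0
    for seq in product(*(LEXICON[c].items() for c in phoneme_chunks)):
        score, prev = 1.0, None
        for word, p in seq:
            score *= p * (BIGRAMS.get((prev, word), 0.01) if prev else 1.0)
            prev = word
        if score > best_score:
            best_words, best_score = [w for w, _ in seq], score
    return best_words

# Acoustics alone prefer "two" (0.5 > 0.2), but context picks "to be".
print(decode(["T UW", "B IY"]))  # -> ['to', 'be']
```

A real system would search over segmentations and much longer contexts, but the shape of the computation is the same: lookups in big probability tables, not a theory of grammar.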
Sorry if I waxed a little too metaphorical, but I had just returned from the talk and a conversation with one of the program's proponents. Though aimed at a general audience, the talk was quite interesting in certain ways.
As someone mentioned, Dragon works much better for them than the program in question; they have apparently tuned Dragon to their voice.
I'm not sure you understand just how enormous a problem it is to write a computer program that recognizes speech and converts it to text as well as a human can. Humans can actually tune out background conversations and noises, and they don't do it with digital low- and high-pass filters. It is a hugely complex process.
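For what it's worth, here is a minimal Python/scipy sketch of why filtering is not an answer to the cocktail-party problem. The "voices" are synthetic sine mixtures, purely for illustration: both sit inside the same speech band, so no band-pass setting removes one without removing the other:

```python
# Minimal sketch of why band-pass filtering can't do what human
# attention does. The "voices" are synthetic sine mixtures, not speech.
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000                      # sample rate, Hz
t = np.arange(fs) / fs          # one second of signal

# Two "speakers": both have energy inside the normal speech band,
# which is exactly the problem.
voice_a = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 1800 * t)
voice_b = np.sin(2 * np.pi * 180 * t) + 0.5 * np.sin(2 * np.pi * 2200 * t)
mixture = voice_a + voice_b

# A band-pass over the classic telephone band (300-3400 Hz).
b, a = butter(4, [300, 3400], btype="band", fs=fs)
filtered = lfilter(b, a, mixture)

# Both speakers' in-band components (1800 Hz and 2200 Hz) pass through
# essentially untouched: the filter removes the lows, not the other voice.
spectrum = np.abs(np.fft.rfft(filtered))
freqs = np.fft.rfftfreq(len(filtered), 1 / fs)
for f in (1800, 2200):
    print(f, spectrum[np.argmin(np.abs(freqs - f))] > 1000)  # True, True
```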
There are contextual ambiguities that need to be resolved, social rules to take into account, and many levels of alternate pronunciations. One reason it is so difficult, but not impossible, to do speech recognition is that we don't understand very well how humans learn a language and how syntax, phonology and semantics work at a deep level in the human mind. The problem is at once scientific, psychological, social and mathematical.
It is the rare layman who fully appreciates the intricacies of natural language. This is why it is so difficult to discuss certain arcane features without someone invoking the usual "Well, I say this because it 'sounds' right" line. The issue, among others, is: why does it sound right if it is an utterance you have never heard before? How do we know what we know, since no one taught us these subtle points? How do we make linguistic or stylistic generalizations?
If you are interested, I suggest having a look at topics such as
Trace behavior,
Generative Semantics, and
Generative Syntax.
ETA: What I have not mentioned, probably because I take it for granted, is that it would be ideal (Occam's razor) to have rules that govern the behavior of the speech recognition engine (a theory of language) rather than sheer brute force: computer memory and speed. This is why I asked the lecturer, "Did you just throw lots of computer power at the problem?"
The answer "Yes", told me, as a former student of theoretical linguistics, that there was no novel earth shattering theory of syntax or semantics involved.