Computer Speech Recognition

Regarding that, isn't there a "thing" about constructing trick sentences that appear at first to parse one way but actually use less obvious grammar or word meanings, so they read very confusingly?

Also, out of curiosity I tried a few of the sentences in Dragon to compare it with the competition; it had no problem with "there/their/they're", occasionally got frieze right after I corrected it the first time, and got the last two sail/sales correct but never the first. (It seemed to be choosing based on whether the word was a noun or a verb, but that might have been a coincidence or just one part of the selection.)
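For what it's worth, that noun/verb guess can be sketched in a few lines. This is a made-up toy (the cue lists, the phonetic key, and the rules are all my own invention, not Dragon's actual algorithm), but it shows how local context alone can pick a spelling for an ambiguous sound:

```python
# Toy homophone disambiguation: choose "sail" vs. "sale" from the part
# of speech suggested by the preceding word. Hypothetical rules only,
# not Dragon's actual algorithm.

# Words that typically precede a noun vs. a verb (invented cue lists).
NOUN_CUES = {"a", "the", "every", "this"}
VERB_CUES = {"to", "will", "can", "we"}

# Fake phonetic key mapping to the two candidate spellings.
HOMOPHONES = {
    "seyl": {"noun": "sale", "verb": "sail"},
}

def pick_homophone(prev_word, phonetic_key):
    """Choose a spelling for an ambiguous sound from local context."""
    senses = HOMOPHONES[phonetic_key]
    if prev_word.lower() in NOUN_CUES:
        return senses["noun"]
    if prev_word.lower() in VERB_CUES:
        return senses["verb"]
    return senses["noun"]  # fall back to the commoner reading

print(pick_homophone("the", "seyl"))  # noun context -> "sale"
print(pick_homophone("to", "seyl"))   # verb context -> "sail"
```

A real recognizer would of course weigh many more cues than the single preceding word, which is presumably why Dragon sometimes gets it wrong.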

(Edit: of course, remembered almost immediately after posting - this is what I was thinking of.)

I was wondering if Dragon was any better or worse. I love those garden path sentences. Sometimes I hear someone saying something like:
I found some antique items that I don't know what Harry wants to do with them.
This is a kind of maze that the mind thinks it can navigate, but then it begins to feel like a cul-de-sac somewhere around ..."Harry wants...". It's a violation of the complex NP constraint.

This link from the page you provided is pretty good: http://en.wikipedia.org/wiki/List_of_linguistic_example_sentences

It seems that self-embedding is limited universally to two levels in natural languages. Here is an example with three levels:
The rat the cat the dog bit chased escaped.
 
Well, I tried "The Impossible Dream" with United Kingdom English. Here's what came out (all errors in bold):

To dream the impossible dream to fight the on beatable phones to bare with on bearable sorrow to run where are the brave they're not go to write the on right double rome to love pure and chased from afar to try when your arms are too weary to reach the unreachable star this is my quest to follow that star no matter how hopeless no matter how far to fight for the right without question or poles for to be willing to mark to hell for a heavenly kors and I know if I'll only be true to this glorious quest that's my heart will live peaceful and karim when I'm lied to my rest and the world will be better for this that 1 man scorned and covered with scars still strove with his last ounce of courage to reach the unreachable star

Not too bad, I think, for a poetic text. Some errors were weird: I don't know why it understood "kors" instead of "cause", or "karim" instead of "calm".

I also had several tries at the word "pause", also putting it into other sentences: I got "poles", "polls", "pores" or "pauls" but never the correct version.

Is there a way to get it to do punctuation?
 
Is there a way to get it to do punctuation?


Just say "period" or "comma". I tried "colon" and "semicolon", but they don't work.

I have seen strange errors also. Without "training" it seems like it might have a long way to go. In my case, I think I have a strange accent on some words that it just can't get.

As I said above, /udderly/ doesn't hack it, and I had to say /utterly/, with a real /t/ sound.

ETA: I just tried "exclamation point" and "question mark"; they both worked, producing the symbols /!/ and /?/. "Dash", "parentheses" and "hyphen" don't seem to work.
 
And a quiet room with no background chatter, televisions, radios or wind noise from a fan.

Humans are very good at "filtering" out background noise, focusing on the speech, not just the physical sounds, a talent computers will have a difficult time acquiring.
 
I am very excited to have found this, and I'm hoping some of our JREF blind or otherwise disabled members might find this as interesting and useful as I have.

Comments would be much appreciated.

As a longtime Dragon NaturallySpeaking user (over 8 years), and after about 2 minutes of trying out the link you posted, I found it to be much, much worse than Dragon. I would guess the accuracy rate to be about 5% and I would estimate my Dragon software to be closer to 95%.
 
Humans are very good at "filtering" out background noise, focusing on the speech, not just the physical sounds, a talent computers will have a difficult time acquiring.
Yes, computers certainly do have a difficult time filtering out background noises.

When I use Dragon it has to be virtually silent, with no background noise and no air blowing across the microphone. And if someone walks in and starts talking while you're using the software, there is no telling how it will interpret what the other person is saying and type it out on the screen.

I giggle every time I see a Dragon commercial on television with someone using the software in an office environment. None of the versions I've ever used would come close to functioning properly with that kind of background noise.

Their commercials would be more accurate if they were to show the user smashing their headset and computer out of frustration because of the inaccuracies due to the background noise.
 
I've just come across this thread. I'm not sure if I'm going to get flamed for this post, but I have to say it: this thread perfectly mimics the behavior of deluded applicants to the MDC.

I can fly. In fact, I'm levitating while typing this post:
There are programs like Dragon that are very good, but this just seems beyond the pale insofar as its uncanny subtlety of recognition.
[...]
I must say this is very close to the "holy grail" of computerized speech recognition.
[...]
The principal investigator (PI) gave a lecture in China, and a simultaneous machine translation into Mandarin was done, and it was reportedly excellent.


How high? A few inches. How long? A few seconds. How did I manage to levitate while typing an entire post? Well...
according to the lecturer it is not yet suitable for closed captioning and is readily defeated by ambient noise.
Hint: how did they manage to translate into Mandarin a whole lecture (I suppose, with the usual ambient noise in any lecture) with excellent quality?


Okay, here I am. Now I'm going to fly in front of your eyes [applicant jumps a few times] Well, this is all I can get now, but other times it was a lot longer.
[...] Um, no.
[...] No.
[...] Better, but still not right.
[...] Okay, I am not so impressed. I am trying to speak very clearly and slowly (in earlier attempts, not pasted here, I was talking fast, and that was an absolute disaster).
[...] I read the back jacket of a book. This one is extremely inaccurate.


I failed because the ceiling was too low, and I was afraid of hitting it and hurting myself if I flew too high.
It could be a number of things. For example if the microphone is not very good or if there are background noises then it will not work properly.


[To the non-flying applicant] You've learned how to jump. Good work.
The program is trying to teach people to use what it considers proper enunciation and cadence. Once you have demonstrated you are sufficiently compliant, it will move on to educating you to think only as big brother wishes.


I mean, come on, this was supposed to be a "holy grail". More than 30 languages, each with its dialects. Excellent-quality instant translation into Mandarin. Basque, where "es" means "no" (I thought it was "ez", but I'm definitely not an authoritative source about Basque)...

...and 28 posts later the thread subject is how to properly pronounce the phonemes and how to eliminate ambient noise. Doesn't it sound somehow weird?
 
Hint: how did they manage to translate into Mandarin a whole lecture (I suppose, with the usual ambient noise in any lecture) with excellent quality?
There was a video showing the guy with a little boom mike speaking, and the text was shown simultaneously in English. It was a lot more accurate than my testing of my own voice. The project's director may have trained the program for his voice. I have no idea.

Or perhaps he was a total fraud.:rolleyes:

As for Basque, I know one word. It is spelt "ez", correct. Pronounced "es", which sounds like "yes" which can result in problems in English.

jojonete's analogy to flying is not very informative or accurate. The problem of speech to text is more like the apparent impossibility of creating a shirt pocket device capable of feats like the iPhone, 40 years ago.
 
As for Basque, I know one word. It is spelt "ez", correct. Pronounced "es", which sounds like "yes" which can result in problems in English.

Ah, I didn't get the es/yes connection. I see now.
Let's add that "bai" (pronounced sort of like "bye") means "yes".

jojonete's analogy to flying is not very informative or accurate. The problem of speech to text is more like the apparent impossibility of creating a shirt pocket device capable of feats like the iPhone, 40 years ago.

Well, I just saw so many similarities with extraordinary claims that I couldn't help pointing them out. However, I do agree that claiming to be able to fly is far less credible than claiming to be able to recognize spoken words through a computer.

Taking away the analogy to flying, I think my point is still valid and it boils down to:
a) The OP announces a major revolution, comparable to what an iPhone would have been 40 years ago: "There are programs like Dragon that are very good, but this just seems beyond the pale insofar as its uncanny subtlety of recognition".
b) All other posts following that describe standard speech recognition software, much like Dragon, with all the known problems about different accents, quick speaking, need for training, etc. There's no iPhone 40 years in advance.
 
Ah, I didn't get the es/yes connection. I see now.
Let's add that "bai" (pronounced sort of like "bye") means "yes".
That's right. I was actually in a situation with several Basque members of the Spanish Air Force for whom I served as Spanish/English interpreter. At lunch, the waitress asked, "More coffee?", and one guy said, "Ez, ez", which sounded to the waitress like "Yes yes". She poured the coffee.:D That's how I learned my only Basque word.
Taking away the analogy to flying, I think my point is still valid and it boils down to:
a) The OP announces a major revolution, comparable to what an iPhone would have been 40 years ago: "There are programs like Dragon that are very good, but this just seems beyond the pale insofar as its uncanny subtlety of recognition".
b) All other posts following that describe standard speech recognition software, much like Dragon, with all the known problems about different accents, quick speaking, need for training, etc. There's no iPhone 40 years in advance.
I'm not sure what your point is here, but the lecture was about a novel approach to speech recognition that seems to have some promise. As I have said, the algorithm does not involve much more than contextual clues and phoneme recognition at this point, but apparently relies on a huge lexicon of "probabilities". In other words, it is relatively primitive and yet does a credible job, and it represents great potential for future iterations.
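For the curious, a "lexicon of probabilities" of the kind described can be sketched as an n-gram rescorer over candidate transcriptions. Everything here is invented for illustration (the toy bigram table and the floor probability are made up); it only shows the shape of the idea, not the actual system:

```python
# Sketch: rescoring candidate transcriptions with a tiny bigram
# "lexicon of probabilities". The numbers are fabricated for
# illustration, not taken from any real system.
from math import log

# Toy bigram log-probabilities (made up).
BIGRAM_LOGP = {
    ("without", "pause"): log(0.6),
    ("without", "poles"): log(0.1),
    ("or", "pause"):      log(0.5),
    ("or", "polls"):      log(0.2),
}
DEFAULT_LOGP = log(0.01)  # unseen word pairs get a small floor probability

def sentence_score(words):
    """Sum of bigram log-probabilities over the word sequence."""
    return sum(BIGRAM_LOGP.get((a, b), DEFAULT_LOGP)
               for a, b in zip(words, words[1:]))

def best_candidate(candidates):
    """Pick the candidate transcription the model finds most probable."""
    return max(candidates, key=sentence_score)

print(best_candidate([
    ["without", "question", "or", "polls"],
    ["without", "question", "or", "pause"],
]))  # -> ['without', 'question', 'or', 'pause']
```

With real data the table would hold millions of n-grams estimated from a large corpus, and the recognizer's acoustic scores would be combined with these language-model scores rather than used alone.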

Sorry if I waxed a little too metaphorical, but I had just returned from the talk, and conversation with one of the program's proponents. Though aimed at a general audience, the talk was quite interesting in certain ways.

As someone mentioned, Dragon for them is much better than the program in question. They have apparently tuned Dragon to their voice.

I'm not sure you understand just how enormous a problem it is to write a computer program to recognize speech and convert to text as well as a human can do it. Humans can actually tune out background conversations and noises. They don't do it with digital low/high pass filters. It is a hugely complex process.

There are contextual ambiguities that need to be resolved, social rules to take into account, and many levels of alternate pronunciations. One reason it is so difficult, but not impossible, to do speech recognition is that we don't understand very well how humans learn a language and how syntax, phonology and semantics work at a deep level in the human mind. The problem is at once scientific, psychological, social and mathematical.

It is the rare layman who fully appreciates the complexities and intricacies of natural language. This is why it is so difficult to discuss certain arcane features without someone invoking the usual "Well, I say this because it 'sounds' right" type of dialog. The issue, among others, is: why does it sound right if it is an utterance you have never heard before? How do we know what we know, since no one taught us these subtle points? How do we make linguistic or stylistic generalizations?

If you are interested, I suggest having a look at topics such as Trace behavior, Generative Semantics, and Generative Syntax.

ETA: What I have not mentioned, probably because I take it for granted, is that it would be ideal (Occam's razor) to have rules which govern the behavior of the speech recognition engine (a theory of language), rather than sheer brute force, computer memory and speed. This is why I asked the lecturer, "Did you just throw lots of computer power at the problem?"
The answer, "Yes", told me, as a former student of theoretical linguistics, that there was no novel, earth-shattering theory of syntax or semantics involved.
 
Top tip for those in countries where your language is not the local language: use your hands to speak. If you get asked "Do you want more coffee?", don't just say no; cover your mug with your hands.
 
Top tip for those in countries where your language is not the local language: use your hands to speak. If you get asked "Do you want more coffee?", don't just say no; cover your mug with your hands.
That gesture would have probably worked, but he leaned back and made the gesture of both palms in the air, as in "Stop", saying "ez, ez".:D

I was in a Chinese restaurant and asked our favorite waiter how to ask for "A bottle of Tsing Tao" in Mandarin. He obliged.

The pronunciation can be heard by using this link to Google translate, then click on the speaker icon:

一瓶青岛


I was practicing my newly learned phrase to myself, perfecting the tones, when another waiter walked by and heard me. Moments later, I had another (unwanted) bottle of beer on the table.:)
 
I hope this is better than current systems. Yesterday I called the electricity company, and the computer asked for my address. I was surprised that it repeated it exactly, but then it asked me to confirm with yes or no, and even after 6 tries it didn't understand my simple "yes". :mad:
 
Sorry if I waxed a little too metaphorical, but I had just returned from the talk, and conversation with one of the program's proponents.
[etc.]
Okay, I'll agree to everything here (including the "etc." part). Let's just say you were too excited from the talk, and I was too excited by the obvious parallel between this thread and others (at least, obvious to me when I found this for the first time); so I guess we were both too metaphorical.

Sure, recognizing a voice is no easy task. The most similar thing I've ever attempted is reading a tape recorded from an MSX and a CPC-464 (I tried both). I did succeed, but it took several days - and that was a recording designed to be easily machine-readable. Even separating "silence" from "sound" was not trivial at all.
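For anyone curious, the first step of that tape-reading job - separating silence from sound - can be sketched as a frame-energy threshold. The frame length and threshold here are made up for the example; a real recording needs calibration (and some hysteresis so the label doesn't flicker):

```python
# A minimal energy-threshold segmenter, in the spirit of separating
# "silence" from "sound" on an old data tape. The frame length and
# threshold are arbitrary illustrative values.

def segment(samples, frame_len=4, threshold=0.1):
    """Label each frame of the signal as 'sound' or 'silence'
    by its mean absolute amplitude."""
    labels = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / len(frame)
        labels.append("sound" if energy > threshold else "silence")
    return labels

# A fabricated signal: quiet, loud, quiet.
signal = [0.01, -0.02, 0.0, 0.01,
          0.5, -0.6, 0.4, -0.5,
          0.0, 0.01, -0.01, 0.02]
print(segment(signal))  # -> ['silence', 'sound', 'silence']
```

On a noisy cassette the hard part is choosing the threshold: tape hiss raises the noise floor, which is exactly why even this step was not trivial.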
 
I made a point of asking if this was used for any closed captioning, and I was told it was not.

Is $200/hour the rate for the closed captioners? I don't feel sorry for them any more if so.:D

I routinely transcribe interviews from tape, and the industry average is that you need something like 5-7 hours of work for 1 hour of tape. A closed captioner may need even more, because he has to condense the text. And you need quite a bit of general education and knowledge, because you're confronted with wildly differing topics. So I don't think that's an unreasonable rate.

FWIW, I was not impressed by the performance of the Dutch voice recognition.
 
Interesting.

I work for one of the major medical transcription companies in the U.S. Since about 2008 they have transitioned to a VR system, and most of us have become editors, at least those of us who stayed on after extensive outsourcing and pay slashing. The doctors dictate by phone, VR does its thing, and I fix it, checking VR's result against the audio. This system is now fairly good at tuning out background noise, and it does a fair job of recognizing plain English and medical terminology. But only fair.

From what I understand, quite a bit of money goes into perfecting these systems, but it is still very rare for me to receive a report that doesn't require at least moderate editing. Some reports need little work; some need so much that it would be easier to scrap the whole thing and type it. People don't often talk in a straight line, VR does not do punctuation well, and even with the best dictators the reports need careful editing for poorly recognized material, some of it critical. For example, a clear dictator the other day dictated the medication amiodarone and VR put amlodipine.

It is capable of a certain amount of logic, for lack of a better word, but it is unable to recognize when a physician refers to a female patient as "he", and it still makes frequent sound-alike errors, as in Lasix instead of LASIK. It picks up some context, but it's up to the MT/editor to know the difference between affect/effect, weather/whether, etc. Some providers speak well enough to need no help at all; more often we have providers who thank us for making them sound good on paper. There is an art to doing this without changing the meaning of what they say even slightly, and it's something VR cannot do.
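A sound-alike pair like amiodarone/amlodipine can at least be flagged automatically with plain edit distance. This is my own toy sketch, not anything the VR vendors actually do (they would presumably use phonetic codes and acoustic confidences rather than spelling):

```python
# Sketch: flagging sound-alike drug names with edit distance.
# The drug list and distance cutoff are illustrative assumptions.

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def confusable(word, lexicon, max_dist=4):
    """Return lexicon entries within max_dist edits of `word`."""
    return [w for w in lexicon
            if w != word and edit_distance(word, w) <= max_dist]

drugs = ["amiodarone", "amlodipine", "metformin", "lasix"]
print(confusable("amiodarone", drugs))
```

Spelling distance is a crude stand-in for acoustic similarity, but it is enough to put amlodipine on a "double-check this" list whenever amiodarone is recognized, which is roughly the safety net a human editor provides.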

VR does not work well with dictators who mumble, talk quickly, or talk from cell phones on the freeway with the top down... It also gets hung up on formatting, leaves out little words like "a", "and", and "the", and doesn't recognize where headings should go. It has made great strides since it was first rolled out, but if the ultimate goal is to eliminate the need for human eyes, it's not there yet. The quality of the medical reports still depends on the skill of the transcriptionist, and many of the best have fled. I have no idea if overall quality has suffered. Colleges are still churning out new transcriptionists, but this is an industry where experience matters but no longer pays accordingly, at least where VR exists.

As far as ESL, our VR works as well with these providers as any other, as long as they speak at a moderate pace and enunciate.

I personally like VR - less wear and tear on my wrists than typing. But I think there's a bit of misperception out there about what it can do.

IMO it will likely require human eyes for a while to come, even if it's the provider doing a quick review.
 
I work for one of the major medical transcription companies in the U.S. Since about 2008 they have transitioned to a VR system, and most of us have become editors, at least those of us who stayed on after extensive outsourcing and pay slashing. The doctors dictate by phone, VR does its thing, and I fix it, checking VR's result against the audio. This system is now fairly good at tuning out background noise, and it does a fair job of recognizing plain English and medical terminology. But only fair.
Also interesting. I'm glad my native tongue is not near the top of the priority list of speech recognition companies. :)

I get the impression that most, or all, of the jobs you get are dictations by one doctor in a fairly controlled environment, who consciously dictates. That's quite a different setting from an animated interview, where people interrupt each other, you have to recognize who is speaking each time, et cetera. VR would certainly show its limitations there.

My own transcription speed, btw, is much higher than I mentioned above - between 2 and 3 hours per hour of tape - because I use a chord keyboard. A couple of times I've made a live transcript - because the client wanted a rough transcript immediately - and later tried to edit it from tape into an exact transcript. That actually cost me more time than starting from scratch.
 
