Should I integrate DNS in my application?
Hi,
I was using the MS speech engine in my application, which takes an audio file as input rather than real-time dictation from a microphone; however, the accuracy is disappointing. I've been looking into the DNS SDK and found that it has promising accuracy. In particular, the AudioMining SDK includes features that let users search the audio file, which is another important function that I need in my application.
I'm considering using the DNS SDK, but I noticed from this forum that it costs above $5000, compared with the MS engine, which can be downloaded for free online. Now I'm concerned about two things:
1) How accurate will the DNS SDK be when taking an audio file as input?
2) Unlike the MS engine, for which you can search forums for help for free, when you encounter problems coding with DNS you need to purchase technical support, which sometimes turns out to be ineffective and unhelpful. Is it worth purchasing DNS? I'm worried that I might run into the same dilemmas again and have a hard time finding help, which would make the money spent on the purchase worthless.
Thanks for any suggestions.
Carol

Carol,
The first thing that you have to understand is that speech recognition is not yet speaker independent. That means that accuracy in transcribing audio files is still heavily dependent upon training speech recognition to recognize the unique characteristics of the speaker's speech (pronunciation/enunciation). Therefore, and specifically with Windows Speech Recognition (WSR), the ability to accurately recognize what a speaker is saying is still heavily dependent upon training the user profile to that speaker’s speech. Even though DNS has a very good speaker-independent Acoustic Model, this is only viable for single speakers. The moment that you start throwing multiple speakers at it, recognition accuracy begins to degrade. Speech recognition does not yet understand conversational speech or recognize multiple speakers with any degree of accuracy above about 30 to 60%.
Second, I see that you did some homework, but you missed some significant points. Audio mining is only available in the DNS SDK Server version. Instead of talking about $5000 worth of SDK and developer tools, you're now talking about approximately $15,000 worth. $5000 covers the cost of the developer tools, DNS runtime module, SDK support, and a few other odds and ends for the DNS SDK Client version. Audio mining is not included in the Client version.
Third, I think you overestimate the value of AudioMining for what you are wanting to do. AudioMining can parse streaming audio for specific content and extract it, as well as give you some parameters. In addition, it can be used along with the SDK and the DNS runtime module for transcription. However, unless you are experienced at using the SDK for this purpose, you're not going to get much use out of it. Granted, the technical support contract that you purchase along with the DNS SDK Server version will help you to learn certain aspects, but I think you will be disappointed with the end result (i.e., accuracy) for the reasons specified above, even though DNS is much better at this than WSR.
Fourth, the basic difference between DNS and WSR is that WSR is embedded in the operating system and all you really need in order to do any kind of development is SAPI 5.3, which Microsoft has always provided at no charge. Nevertheless, and I assume from your post that you have a reasonable understanding of SAPI, SAPI is only part of the picture. The reason that Nuance charges for the developer tools which comprise the SDK is that many of them are proprietary and patented. For example, you're not provided with the DNS user interface. The runtime module only contains the core DNS components, without the user interface that you see when you purchase a full copy of DNS. In addition, the runtime module is an evaluation-only version. You cannot use it to distribute any product that you might develop with it without either requiring the end-user to purchase a copy of DNS or purchasing runtime licensing, for which the minimum is generally $50,000 worth of run-time licenses. A lot has probably changed since I was the DNS SDK Program Manager, but I would assume that much of what I originally set up in terms of developer tools (i.e., the original Developer Suite) is probably still intact, give or take those modifications that Nuance may have made to the overall package since I left at the end of December of 2001 (L&H/ScanSoft).
Fifth, the support contract for the SDK is not to be compared to the support for end-users of DNS; it is simply not the same. The support for the SDK is given by the developers responsible for it. In that sense you have options for both online/phone support and SDK training. How much of the SDK training, which is not on-site but at Nuance, is provided as part of the initial support contract, I can't be sure. However, when I was the SDK Program Manager, the actual training was provided as part of the support contract on-site in Burlington, Mass., but the developer had to pay for their own transportation, as well as for any materials that were provided during the training. That's pretty much industry-standard. Nevertheless, SDK support and training is, or at least was when I was doing it, very comprehensive and user oriented. The only problem was that training classes were conducted only when a sufficient number of SDK users had accumulated for training. So, there are some pros and cons, ups and downs, etc., but the quality of training is not one of them. The problem is mainly timing and accessibility relative to the access to the programmers and the fact that they can't be training one or two users here or there.
Lastly, to a certain degree I can tell from your question and the manner in which you speak about the SDK that it probably won't fill your needs or suit your purposes. The way you phrase your questions is kind of like walking into a luxury car showroom, pointing to a Rolls-Royce, and asking "How much?" If you have to ask, you can't afford it. In the case of the SDK, if you have to ask, it won't be of much use to you. Unfortunately, one of the weak points with Nuance is that unless you're willing to slap down your Visa card, they won't give you a whole lot of information beyond what's available on the website. That's the general problem with sales. They have basically been told to get the customer's money first and answer questions second.
Based on what you have provided in your post, I think it would be a waste of your time and money to even go this route. Beyond the cost and the difficulty of applying it, the results would definitely be inconsistent and disappointing in terms of accuracy. The only reasonable degree of accuracy that you would achieve by going in this direction would be if you were to create a user for each unique speaker. If you're trying to capture audio for transcription from multiple speakers, you're not going to get significantly better results than you are with WSR. Speech recognition simply isn't there yet, so I wouldn't waste my time or my money going the SDK Server route. The best you can hope for is to keep doing what you're currently doing with WSR and edit the output if possible. The only other possibility would be to do it with DNS using a standard user profile and then edit the result. You would probably find that you would get a little better accuracy using this approach, but still not much above 60 to 70%.
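For anyone who wants to put their own numbers next to figures like the 60 to 70% above, here is a minimal Python sketch. Nothing in this thread says how those percentages were measured, so this assumes the common word error rate approach: count the word-level substitutions, insertions and deletions needed to turn the engine's output into a hand-corrected reference transcript. The function names are made up for illustration and are not part of SAPI or the DNS SDK.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # word missing from output
                           cur[j - 1] + 1,       # extra word in output
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """1 - WER, floored at 0 because WER can exceed 1 with many insertions."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

Transcribe a minute or two of the recording by hand, pass that as reference and the engine output as hypothesis, and you can compare WSR and DNS on the same file instead of guessing. Punctuation is not stripped here, so remove it from both strings first if the engine inserts it.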
Chuck Runquist
GEMCCON - The Choice of Intelligence
Speech Recognition Consulting and Training
We would often be sorry if our wishes were gratified. - Aesop (620 BC - 564 BC)
Chuck,
Thanks for your detailed explanation of my question. From what you said, I feel it's impossible to improve the accuracy with any kind of speech engine, since DNS is said to be the best speech engine yet its accuracy is not much above 60 to 70%. However, my boss has shown me an application named Docsoft: after uploading an audio file, it can generate a decent transcript compared with the transcript I got from using SAPI 5.1, which gets some words right at random but most of them wrong. It seems the speech engine is just guessing, and the accuracy can hardly be above 30%. I thought about many ways to improve the accuracy, but none of them seems feasible in this case.
1) It seems impossible to train the engine on an audio file.
2) Some people suggested that I create a grammar with a smaller vocabulary myself, but how am I to know in advance what words will be uttered in the audio?
The audio format is 16-bit, 44100 Hz, mono, and the grammar I'm using is dictation. Everything seems valid, but the results are always disappointing. What can I do to improve the accuracy? I've been stuck on this problem for a long time, which is why I started to look at DNS. Will it help if I use SAPI 5.3?
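One quick thing worth ruling out is whether the file really has the format described above. The sketch below uses Python's standard wave module to read the header of a PCM WAV file and compare it with 16-bit, 44100 Hz, mono. The helper names describe_wav and matches_format are made up for this example, and the check says nothing about what format a particular engine prefers; it only confirms what is in the file.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic format information from a PCM WAV file header."""
    # wave.open raises wave.Error for files that are not uncompressed PCM WAV.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


def matches_format(info: dict, channels: int = 1, bits: int = 16,
                   rate: int = 44100) -> bool:
    """True if the file has the expected channels, bit depth and sample rate."""
    actual = (info["channels"], info["bits_per_sample"], info["sample_rate_hz"])
    return actual == (channels, bits, rate)
```

Calling matches_format(describe_wav("recording.wav")) on the input file (the file name is a placeholder) shows at a glance whether a stereo or 8-bit file has slipped in before blaming the engine.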