Well, it has been a month since I got accepted into this year's Google Summer of Code. The community bonding period within the CMU Sphinx organization has been a great time for me. Our organization has six GSoC students this year working on different projects. We introduced ourselves to each other over the cmusphinx-gsoc mailing list and had a few conversations over chat. Thanks to Carol Smith, I received my welcome package from Google on May 19th, and a free ACM membership too :)
It has been three days since GSoC 2012 officially started. Prior to that, I familiarized myself with a few different things with the help of my mentor. He created a wiki page for our projects at http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation. Troy and I are also going to blog at http://cmusphinx.sourceforge.net and update the wiki there during the summer, so please check there for important updates too.
Currently, my goal is to build a web interface which allows users to evaluate their pronunciation. Some of the sub-tasks have already been accomplished, and some of them are still ongoing:
Work accomplished:
- Created an initial web interface which allows users to record and play back their speech using the open source wami-recorder, which is being developed by the Spoken Language Systems group at MIT.
- When the recording is completed, the wave file is uploaded to the server for processing.
- Sphinx3 forced alignment is used to align the phoneme string expected from the utterance with the recorded speech, in order to calculate time endpoints and acoustic scores for each phoneme.
- I tried many different output arguments in sphinx3_align from http://cmusphinx.sourceforge.net/wiki/sphinx4:sphinxthreealigner and successfully produced the phoneme acoustic scores using two recognition passes.
- In the first pass, I use -phlabdir as an argument to get a .lab file as output, which contains the list of recognized phonemes.
- In the second pass, I use that list to get acoustic scores for each phoneme using -wdsegdir as an input argument.
- Later, I integrated sphinx3 forced alignment with the wami-recorder microphone recording applet so that the user sees the acoustic scores after uploading their recording.
- Please try this link to test it: http://talknicer.net/~ronanki/test
- Wrote a program to convert a list of each phoneme's "neighbors," or most similar other phonemes, from the Worldbet phonetic alphabet to CMUbet. The list was provided by the project mentor.
- Wrote a program to take a string of phonemes representing an expected utterance as input and produce a sphinx3 recognition grammar consisting of a string of alternatives representing each expected phoneme and all of its neighboring phonemes, for automatic edit distance scoring.
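To make the last item a bit more concrete, here is a minimal Python sketch of the idea, not the actual program. It assumes a JSGF-style output (the grammar format really used with sphinx3 may differ), and the neighbor table and the function name `build_alternatives_grammar` are made up for illustration; the real neighbor list comes from the mentor's data.

```python
def build_alternatives_grammar(expected, neighbors, name="pronunciation"):
    """Build a JSGF-style grammar: one group of alternatives per expected phoneme.

    expected  -- list of phoneme symbols for the expected utterance
    neighbors -- dict mapping a phoneme to its most similar other phonemes
    NOTE: JSGF is an assumption here, chosen only to show the structure.
    """
    groups = []
    for phoneme in expected:
        # Expected phoneme first, then its neighbors, skipping duplicates.
        alternatives = [phoneme]
        for other in neighbors.get(phoneme, []):
            if other not in alternatives:
                alternatives.append(other)
        groups.append("( " + " | ".join(alternatives) + " )")
    body = " ".join(groups)
    return f"#JSGF V1.0;\ngrammar {name};\npublic <utterance> = {body};\n"


# Hypothetical neighbor table, for illustration only.
NEIGHBORS = {
    "HH": ["F"],
    "AH": ["AA", "AO"],
    "L": ["R", "W"],
    "OW": ["AO", "UH"],
}

if __name__ == "__main__":
    print(build_alternatives_grammar(["HH", "AH", "L", "OW"], NEIGHBORS))
```

Each expected phoneme becomes one parenthesized group, so the recognizer can pick either the expected phoneme or one of its neighbors at every position.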
Ongoing work:
- Reading about Worldbet, OGIbet, ARPAbet, and CMUbet, the different ASCII-based phonetic alphabets and their mappings between each other and the International Phonetic Alphabet.
- Enhancing the first pass of recognition described above using the generated alternative neighboring phoneme grammars, to find phonemes which match the recorded speech more closely than the expected phonemes without relying on complex post-processing of acoustic score statistics.
- Trying more parameters and options to derive acoustic scores for each phoneme from sphinx3 forced alignment.
- Writing an exemplar score aggregation algorithm to find the mean, standard deviation, and their expected error for each phoneme in a phrase from a set of recorded exemplar pronunciations of that phrase.
- Writing an algorithm which can detect mispronunciations by comparing a recording's acoustic scores to the expected mean and standard deviation for each phoneme, and aggregating those scores to biphones, words, and the entire phrase.
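The last two items are still in progress, so the following is only a rough Python sketch of the direction, under several assumptions of mine: every exemplar recording has been aligned to the same phoneme sequence, a lower acoustic score means a worse match, and a phoneme is flagged when it falls more than a chosen number of standard deviations below the exemplar mean. It only computes the standard error of the mean, not of the standard deviation. The function names and the threshold of 2.0 are hypothetical.

```python
import math
import statistics


def aggregate_exemplars(exemplars):
    """Per-phoneme mean, standard deviation, and standard error of the mean.

    exemplars -- list of recordings; each recording is a list of
                 (phoneme, acoustic_score) pairs in the same order.
    """
    stats = []
    for position in zip(*exemplars):
        phoneme = position[0][0]
        scores = [score for _, score in position]
        mean = statistics.mean(scores)
        # Sample standard deviation needs at least two exemplars.
        std = statistics.stdev(scores) if len(scores) > 1 else 0.0
        stderr = std / math.sqrt(len(scores))
        stats.append({"phoneme": phoneme, "mean": mean, "std": std, "stderr": stderr})
    return stats


def flag_mispronunciations(recording, stats, threshold=2.0):
    """Return (phoneme, z_score) pairs scoring far below the exemplar mean.

    Assumes lower acoustic scores are worse; threshold is an arbitrary choice.
    """
    flagged = []
    for (phoneme, score), ref in zip(recording, stats):
        if ref["std"] == 0:
            continue  # no spread in the exemplars, so no z-score
        z = (score - ref["mean"]) / ref["std"]
        if z < -threshold:
            flagged.append((phoneme, z))
    return flagged


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    exemplars = [
        [("HH", -100.0), ("AH", -200.0)],
        [("HH", -110.0), ("AH", -210.0)],
        [("HH", -90.0), ("AH", -190.0)],
    ]
    stats = aggregate_exemplars(exemplars)
    print(flag_mispronunciations([("HH", -105.0), ("AH", -260.0)], stats))
```

The per-phoneme z-scores could then be averaged over biphones, words, or the whole phrase to get the higher-level scores mentioned above.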