The basic scoring routine for the pronunciation evaluation system is now available at http://talknicer.net/~ronanki/test/. It generates output for each phoneme in the phrase and displays the total score.
These are the things I've accomplished in the fifth week of GSoC 2012:
1. Edit-distance neighbor grammar generation:
Earlier, I did this with:
(a) a single-phone decoder
http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_1phone.txt
(b) a three-phone decoder (contextual)
http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_3phones.txt
(c) an entire phrase decoder with neighboring phones
http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_compgram.txt
This week, I added two more decoders: a word decoder and a complete phrase decoder that tests each phoneme in turn.
Word decoder: I used sox to split each wav file into words based on the forced-alignment output, and then presented each word as follows.
Ex: the word "with" is presented as
public <phonelist> = ( (W | L | Y) (IH) (TH) );
public <phonelist> = ( (W) (IH | IY | AX | EH) (TH) );
public <phonelist> = ( (W) (IH) (TH | S | DH | F | HH) );
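A minimal sketch of how such per-position neighbor grammars could be generated. The `NEIGHBORS` table here is a small illustrative subset and the `make_grammars` helper is an assumption for illustration, not the actual script:

```python
# Generate one JSGF-style grammar per phone position, substituting
# edit-distance neighbors for the phone at that position only.
# NEIGHBORS is an illustrative subset, not a full confusion table.
NEIGHBORS = {
    "W": ["L", "Y"],
    "IH": ["IY", "AX", "EH"],
    "TH": ["S", "DH", "F", "HH"],
}

def make_grammars(phones):
    """Return one grammar string per position in `phones`."""
    grammars = []
    for i, _ in enumerate(phones):
        parts = []
        for j, p in enumerate(phones):
            if j == i:
                # Allow the original phone or any of its neighbors here.
                alts = [p] + NEIGHBORS.get(p, [])
                parts.append("(%s)" % " | ".join(alts))
            else:
                # Every other position is fixed to the expected phone.
                parts.append("(%s)" % p)
        grammars.append("public <phonelist> = ( %s );" % " ".join(parts))
    return grammars

for g in make_grammars(["W", "IH", "TH"]):
    print(g)
```

Running this on the phones of "with" reproduces the three grammars shown above, one per position.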
The accuracy turned out to be better than the single-phone and three-phone decoders, and about the same as the entire phrase decoder. The output for a sample test phrase is at http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_words.txt
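The word-splitting step can be sketched as follows. This builds sox `trim` commands from forced-alignment word segments; the segment tuple format and output file naming are assumptions for illustration, not the actual script:

```python
# Build sox commands that cut a recording into per-word wav files,
# given forced-alignment word segments as (word, start_sec, end_sec).
def sox_split_commands(wav, segments):
    cmds = []
    base = wav.rsplit(".", 1)[0]
    for i, (word, start, end) in enumerate(segments):
        out = "%s_%02d_%s.wav" % (base, i, word)
        # sox input.wav output.wav trim <start> <duration>
        duration = round(end - start, 3)
        cmds.append(["sox", wav, out, "trim", str(start), str(duration)])
    return cmds

cmds = sox_split_commands("phrase1.wav", [("with", 0.42, 0.71)])
```

Each command list can then be run with `subprocess.check_call` to produce one wav file per word.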
Complete phrase decoder using each phoneme: This is similar to the entire phrase decoder, except that this time I supplied neighboring phones for one phoneme at a time while fixing the rest of the phonemes in the phrase. It is not an efficient approach, since it takes more time to decode, but its accuracy is better than all of the previous methods. The output is at http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_phrases.txt
The code for the above methods is uploaded to the cmusphinx SourceForge repository at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/neighborphones_decode/
Please follow the README file in each folder for detailed instructions on how to use them.
2. Scoring paradigm:
Phrase_wise:
The current basic scoring routine, deployed at http://talknicer.net/~ronanki/test/, aligns the test recording with the utterance using forced alignment in Sphinx and generates a phone segmentation file. Each phoneme in the file is then compared with the mean and standard deviation of the respective phone in the phrase statistics (http://talknicer.net/~ronanki/phrase_data/phrase1_stats.txt), and standard scores are calculated from the z-scores of the acoustic score and duration.
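The per-phone scoring step can be sketched as below. The statistics dictionary layout, the equal weighting of the two z-scores, and the sign convention are assumptions for illustration; the deployed routine may combine them differently:

```python
# Convert a phone's acoustic score and duration into z-scores against
# that phone's exemplar statistics, then combine them into one score.
def phone_score(acoustic, duration, stats):
    za = (acoustic - stats["acoustic_mean"]) / stats["acoustic_std"]
    zd = (duration - stats["duration_mean"]) / stats["duration_std"]
    # Penalize deviation in either direction from the exemplar mean,
    # so 0.0 is a perfect match and more negative is worse.
    return -(abs(za) + abs(zd)) / 2.0

stats = {"acoustic_mean": -5.0, "acoustic_std": 1.5,
         "duration_mean": 0.12, "duration_std": 0.03}
s = phone_score(-6.5, 0.15, stats)  # roughly one std out on both dimensions
```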
Random_phrase:
I also derived statistics (mean score, standard deviation of score, mean duration) for each phone in the CMU phone set, irrespective of context, using the exemplar recordings for all three phrases (http://talknicer.net/~ronanki/phrase_data/phrases.txt) that I have as of now. So, given a test utterance, I can compare each phone in a random phrase against the respective phone statistics.
The statistics are at http://talknicer.net/~ronanki/phrase_data/all_phrases_stats (the count column represents the number of times each phone occurred).
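The aggregation behind that statistics table can be sketched as follows. The input row format (phone, acoustic score, duration) is an assumption for illustration:

```python
import statistics

# Aggregate per-phone statistics over exemplar segmentations,
# irrespective of context. Each row: (phone, acoustic_score, duration).
def phone_stats(rows):
    by_phone = {}
    for phone, score, dur in rows:
        by_phone.setdefault(phone, []).append((score, dur))
    table = {}
    for phone, vals in by_phone.items():
        scores = [s for s, _ in vals]
        durs = [d for _, d in vals]
        table[phone] = {
            "count": len(vals),  # number of times the phone occurred
            "score_mean": statistics.mean(scores),
            "score_std": statistics.pstdev(scores),
            "dur_mean": statistics.mean(durs),
        }
    return table

t = phone_stats([("IH", -4.0, 0.1), ("IH", -6.0, 0.2), ("W", -5.0, 0.1)])
```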
Things to do in the upcoming week:
1. Use an edit-distance grammar to derive standard scores such that only a minimal effective training data set is required. [Mentor note: was "no training data," which is excluded.]
2. Use the same grammar to detect words that have two different correct pronunciations (e.g., READ/RED).
3. In the random-phrase scoring method, add another column to store the position of each phone with respect to the word (or silence), so that each phone has three statistics and can be compared better with the exemplar phonemes based on position.
4. Link all of these modules together to try to match the experts' scores.
5. Provide feedback to the user with underlined mispronunciations, or numerical labels.
Future tasks:
1. Use CART models in training to better match the statistics for each phoneme in the test utterance against the training data, based on contextual information.
2. Use phonological (power-normalized cepstral?) features instead of mel-cepstral features, which are expected to better represent the state of pronunciation.
3. Develop a complete web-based system so that end users can test their pronunciation efficiently.