Sunday, August 26, 2012

Ronanki: GSoC 2012 Pronunciation Evaluation: Summary and Conclusions

This article briefly summarizes the implementation of the GSoC 2012 Pronunciation Evaluation project.

I started with Sphinx forced alignment and obtained spectral-matching acoustic scores and durations at the phone and word level using WSJ models. After that, I concentrated mainly on two things as part of the GSoC 2012 project: edit-distance neighbor-phone decoding, and scoring routines for both text-dependent and text-independent systems.

Edit-distance Neighbor phones decoding:

1. I started with a single-phone decoder and then explored a three-phone decoder, a word decoder and a complete phrase decoder, providing neighbor phones as alternatives to the expected phone.
2. The decoding results showed that word-level and phrase-level decoding using JSGF give almost the same results.
3. This method helps to detect mispronunciations at the phone level, and it can also detect homographs if the percentage of decoding errors can be reduced.
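As a rough illustration of the idea in the list above, the sketch below builds a JSGF grammar in which each expected phone may be replaced by one of its neighbor phones. The neighbor table, the function name and the grammar name are assumptions made for this example; they are not taken from the project's scripts.

```python
# Hypothetical neighbor table: phones treated as close alternatives
# to each expected phone (illustrative values only).
NEIGHBORS = {
    "AE": ["AA", "EH"],
    "T": ["D"],
}


def neighbor_grammar(expected_phones, name="phonelist"):
    """Return JSGF text accepting each expected phone or one of its neighbors."""
    slots = []
    for phone in expected_phones:
        alternatives = [phone] + NEIGHBORS.get(phone, [])
        if len(alternatives) == 1:
            slots.append(phone)
        else:
            slots.append("(" + " | ".join(alternatives) + ")")
    # Silence is allowed at both ends of the phone sequence.
    body = " ".join(["SIL"] + slots + ["SIL"])
    return "#JSGF V1.0;\ngrammar {0};\npublic <{0}> = {1};\n".format(name, body)


grammar_text = neighbor_grammar(["K", "AE", "T"])
print(grammar_text)
```

If the decoder picks an alternative instead of the expected phone, that position is a candidate mispronunciation.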

Scoring Routines:

The text-dependent method is based on exemplars for each phrase. First, the mean acoustic score and mean duration, along with their deviations, are calculated for each phone in the phrase from the exemplar recordings. Given a test recording, each phone in the phrase is then compared with the exemplar statistics: z-scores are calculated, and normalized scores are derived from the maximum and minimum z-scores of the exemplar recordings. All phone scores are aggregated into a word score, and all word scores are then aggregated with part-of-speech (POS) weights to give the complete phrase score.
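The exemplar-based scoring described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses absolute z-scores, the population standard deviation, and a plain average for the word score, none of which is confirmed by the project code. The exemplar durations are made up.

```python
from statistics import mean, pstdev


def z_score(value, mu, sigma):
    """Deviation of a value from the exemplar mean, in standard deviations."""
    return 0.0 if sigma == 0 else (value - mu) / sigma


def normalized_score(value, exemplar_values):
    """Scale the test z-score to [0, 1] using the exemplar z-score range."""
    mu, sigma = mean(exemplar_values), pstdev(exemplar_values)
    # Assumption: absolute z-scores, so deviations in either direction count.
    exemplar_z = [abs(z_score(v, mu, sigma)) for v in exemplar_values]
    z_min, z_max = min(exemplar_z), max(exemplar_z)
    if z_max == z_min:
        # Degenerate case: the exemplars have no spread to scale against.
        return 1.0
    z = min(max(abs(z_score(value, mu, sigma)), z_min), z_max)
    return 1.0 - (z - z_min) / (z_max - z_min)


def word_score(phone_scores):
    """Assumption: a word score is the plain average of its phone scores."""
    return mean(phone_scores)


# Made-up exemplar durations (in frames) for one phone of one phrase.
exemplar_durations = [10, 12, 14]
print(normalized_score(12, exemplar_durations))  # at the exemplar mean
print(normalized_score(20, exemplar_durations))  # far outside the exemplar range
```

A test phone at the exemplar mean gets the highest score, and anything beyond the most extreme exemplar is clipped to the lowest score.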

The text-independent method is based on predetermined statistics built from any corpus. In this project, I used the TIMIT corpus to build statistics for each phone based on its position (begin/middle/end) in the word. Given any test file, the acoustic score and duration of each phone are compared with the corresponding phone statistics based on this contextual information. The scoring method is the same as that of the text-dependent system.
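A minimal sketch of the position-based lookup, assuming a small hypothetical statistics table keyed by phone and position. The table values, the function names, and the choice to treat a one-phone word as "begin" are all assumptions for illustration.

```python
# Hypothetical table: (phone, position) -> (mean duration, std dev) in frames.
# Positions follow the begin/middle/end scheme: 0 = begin, 1 = middle, 2 = end.
PHONE_STATS = {
    ("K", 0): (8.0, 2.0),
    ("AE", 1): (11.0, 3.0),
    ("T", 2): (7.0, 2.5),
}


def phone_positions(phones):
    """Label each phone of a word as begin (0), middle (1) or end (2)."""
    last = len(phones) - 1
    labels = []
    for i, phone in enumerate(phones):
        if i == 0:
            position = 0
        elif i == last:
            position = 2
        else:
            position = 1
        labels.append((phone, position))
    return labels


def duration_z_scores(phones, durations):
    """Compare each phone's duration with the statistics for its position."""
    scores = []
    for (phone, position), duration in zip(phone_positions(phones), durations):
        mu, sigma = PHONE_STATS[(phone, position)]
        scores.append((phone, position, (duration - mu) / sigma))
    return scores


print(duration_z_scores(["K", "AE", "T"], [8.0, 14.0, 7.0]))
```

The same lookup would apply to acoustic scores; only the table contents change.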

Please try our demo and help us by giving feedback.

Documentation and codes:
All code is uploaded to the cmusphinx svn, and raw documentation of the project is also available.

The pronunciation evaluation system helps users improve their pronunciation through repeated attempts, and it lets them correct themselves by giving the necessary feedback at the phone and word level. I couldn't complete some of the things I mentioned earlier in the project, but I hope to keep contributing to this project in the future.

This summer has been a great experience for me. Google Summer of Code 2012 has finally ended. I would like to thank my mentor James Salsman for his time, continuous efforts and help. The way he motivated me really helped me stay focused on the project. I would also like to thank my friends Troy Lee, Nickolay and Biksha Raj for their help and comments during the project.

Wednesday, August 22, 2012

GSoC 2012: #Troy Pronunciation Evaluation Week 7 Status

Last week, I was still working on the data collection website.

Many thanks to Robert for trying out the website and listing the issues he encountered:

Issue #1: The Student Page is under construction

The first stage of the website was to collect exemplar recordings, so the student page was not implemented at that time.

Issue #2: The inconvenient birthdate control

The birthdate control is now replaced with the standard HTML5 <input type="datetime"> control. Because the datetime input control is a new element in HTML5, currently only Chrome, Safari and Opera support the popup date selection. On other browsers, which have no support yet, the control is simply displayed as an input box. The user can type in the date, and the background script checks whether the format is correct.

Issue #3: The incorrect error message "Invalid date format" on the additional information update page

After digging into the source code for several hours, I found that the bug lies in the order in which the mysql-related functions are invoked. The processing steps in the additional information update page are as follows:
a) the client side posts the user's input to the server;
b) the server side first uses the mysql_escape_string function to preprocess the user information, to ensure the security of later mysql queries;
c) the format of each field is checked, including whether the user entered a valid date;
d) the mysql database is updated with the new information.
As the mysql server is only needed in step d), I put the database connection code after step c), without knowing that the mysql_escape_string function also requires a mysql database connection. In the previous implementation, mysql_escape_string therefore returned an empty string, which led to the "Invalid date format" error.

Secondly, the exemplar recording page is updated with the following features:
1) Automatically moving to the next utterance after the user records and plays back the current recording;
2) Extra navigation controls for selecting the phrase to record;
3) When the user opens the exemplar recording page, the first unrecorded utterance is the first one shown to the user;
4) The enabling and disabling of the player's recording and playback buttons is connected with the database information, i.e. if the user has recorded the phrase before, both the recording and playback buttons are enabled; otherwise only recording is allowed.

The third major part done last week is the student page, which was previously left empty.
On the student page, users can now practice their pronunciation by recording the phrases in the database and listening to the exemplar recordings in the system. The features are:
1) Full recording and playback functionality, as on the exemplar recording page;
2) When navigating to each phrase, up to 5 randomly chosen exemplar recordings are retrieved from the database and listed on the page to help the students;
3) Additionally, to put some exemplar recordings in the system, I had to manually transcribe several sentences and add the recordings myself. Once many people are contributing exemplar recordings, I won't need to do manual transcription any more.

This week, there are two major tasks to be done: integration with Ronanki's evaluation scripts and the mid-term report.

Tuesday, August 21, 2012

Ronanki: GSoC 2012 Pronunciation Evaluation Final week Report

Here comes my final report for the Pronunciation Evaluation project. The demo system has been modified a little. You can try it out and test the text-independent system.

Last week, I tested the system with both Indian and US accents. For the US accent, I don't have any mispronunciation data, so I tested only with the SA1 and SA2 (TIMIT) sentences. For the Indian accent, I prepared a data set with both correct pronunciations and mispronunciations, which is available for download.

The results are available online. The scripts for evaluating the database are uploaded to the svn project folder. Phonological features are provided in svn, but I couldn't build models with them in time.

The project and the required scripts are available for download.
Please go through the README files provided in each folder.

Finally, I would like to thank my mentor James Salsman, Nickolay, Biksha Raj and the rest of the community for helping me all the time. I hope to keep contributing to this project over time.

Ronanki: GSoC 2012 Pronunciation Evaluation week 12

This week, I am trying to extend the TIMIT statistics to 5 or 6 per phoneme based on syllable position; alternatively, I can use CART modelling to predict duration and acoustic score from training data. I have done this to some extent using wagon in speech tools.

Regarding mispronunciation detection accuracy, I collected data from 8 non-native speakers, with 5 words each recorded 10 times in both correct and wrong ways, and 5 sentences each recorded 3-5 times in both correct and wrong ways. The database and its description are available online.

I need to split each speaker's data into individual files, which is a tedious task and is taking some time. So far I have completed one speaker's data, and the current text-independent system is doing well: with a common threshold for all words, 46 out of 50 correctly pronounced words are detected as good pronunciations and 42 out of 50 wrongly pronounced words are detected as mispronunciations. It will take one or two more days to give complete statistics.
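The figures for this one speaker can be turned into the usual detection rates. The short sketch below only works through the arithmetic from the counts quoted above; the function and key names are made up for the example.

```python
def detection_rates(correct_accepted, correct_total, wrong_rejected, wrong_total):
    """Overall accuracy plus the two error rates of a good/mispronounced detector."""
    return {
        "accuracy": (correct_accepted + wrong_rejected) / (correct_total + wrong_total),
        # Correct words that were wrongly flagged as mispronounced.
        "false_rejection": (correct_total - correct_accepted) / correct_total,
        # Wrong words that were wrongly accepted as good pronunciation.
        "false_acceptance": (wrong_total - wrong_rejected) / wrong_total,
    }


rates = detection_rates(46, 50, 42, 50)
print(rates)
```

With these counts, 88 of the 100 words are classified correctly, 4 of the 50 correct words are rejected, and 8 of the 50 wrong words are accepted.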

In parallel, I completed the phonological features and generated acoustic models for the TIMIT database, because I had difficulty finding the complete set of wav files for the WSJ database. However, both decoding and forced alignment failed with the new models built on phonological features. I also failed to generate appropriate models with sphinx mfc features: even when they were generated properly, I got no results from the forced-alignment or decode functions after substituting them for the WSJ models. I will try to overcome these issues by next week.

Ronanki: GSoC 2012 Pronunciation Evaluation Week 11

This week, I managed to do only the data collection required to evaluate the project.

The database collection is over, but the data is on different servers and I am trying to bring it into one place. Part of the data for one speaker, along with a description of the data, is available online.

Ronanki: GSoC 2012 Pronunciation Evaluation Week 10

This week, I explored CART models a little, but couldn't complete the work. The models are trained using wagon in speech tools with the following contextual information:

Current phone, previous phone, next phone, syllable position, phonological features, phone type, etc.

The complete list of features is available online.

Once training is complete, the output is a tree, which is given as input along with the test data to wagon_test in speech tools; wagon_test then predicts the duration of each phone from the contextual information using the tree structure.
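To give a feel for what a regression tree does with contextual features, here is a toy sketch of choosing a single split by minimizing squared error. It is not wagon and does not use wagon's file formats; a real CART tool builds binary splits recursively with stopping criteria. The feature names and durations below are invented for illustration.

```python
from statistics import mean


def squared_error(values):
    """Sum of squared deviations from the mean (regression impurity)."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)


def best_split(rows, target, features):
    """Pick the categorical feature whose grouping gives the lowest total error."""
    best = None
    for feature in features:
        groups = {}
        for row in rows:
            groups.setdefault(row[feature], []).append(row[target])
        error = sum(squared_error(group) for group in groups.values())
        if best is None or error < best[1]:
            # Each leaf predicts the mean duration of its group.
            best = (feature, error, {k: mean(v) for k, v in groups.items()})
    return best


# Toy training data (made-up durations in frames).
ROWS = [
    {"phone": "AE", "position": "middle", "duration": 12},
    {"phone": "AE", "position": "end", "duration": 18},
    {"phone": "IH", "position": "middle", "duration": 10},
    {"phone": "IH", "position": "end", "duration": 16},
]

feature, error, leaf_means = best_split(ROWS, "duration", ["phone", "position"])
print(feature, error, leaf_means)
```

In this toy data, splitting on position explains the durations better than splitting on phone identity, so the predicted duration for a new phone is the mean of its position group.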

Regarding replacing the traditional MFCC features with either PNCC or phonological features, I need to build acoustic models for the WSJ database using these features instead of MFCC. This is in progress, and once the acoustic models are built, the rest of the testing process is the same.

Work to do:
By next week, I should be able to complete one of these two tasks, and the other thereafter. In the final week, I will upload all code to svn and integrate these new techniques with the current working pronunciation evaluation model.

Ronanki: GSoC 2012 Pronunciation Evaluation week 9

This week, I finished my random phrase pronunciation evaluation, which is now in the testing phase.

The system can provide evaluation scores for any random sentence. It also gives feedback on mispronunciation and rate of duration at the word level. Please test the system and mail me any bugs you find. Please avoid proper nouns and punctuation marks while testing the system.

For this, I processed the entire TIMIT dataset, and the statistics for each phone are computed at three positions: Begin/Middle/End (0/1/2). The count in the last column represents the number of times each phone occurred at each position. The statistics are available online.
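A minimal sketch of how such a per-phone, per-position table could be built from aligned data. The input tuples and the function name are hypothetical, and the population standard deviation is an assumption; the project's own scripts may compute the deviation differently.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_position_stats(observations):
    """Group (phone, position, duration) tuples into mean, deviation and count."""
    grouped = defaultdict(list)
    for phone, position, duration in observations:
        grouped[(phone, position)].append(duration)
    return {
        key: (mean(values), pstdev(values), len(values))
        for key, values in grouped.items()
    }


# Made-up observations: phone, position (0 = begin, 1 = middle, 2 = end), frames.
observations = [("AE", 0, 10), ("AE", 0, 14), ("AE", 1, 9)]
stats = build_position_stats(observations)
print(stats)
```

The count is kept alongside the mean and deviation because statistics built from only a few occurrences are less reliable.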

Next week, I am going to implement CART models so that each phone can be compared with the respective phone in a better context. Regarding features, I studied Power Normalized Cepstral Coefficients (PNCC), which are more robust for speech recognition even in noisy environments. PNCC are 13-dimensional and computationally more costly than MFCC, but perform better than MFCC in speech recognition. I downloaded the available matlab code and am trying some experiments on the nTIMIT database. I also implemented phonological mapping from the current spectral features (MFCC) using an ANN. Currently, I am testing speech recognition with all these features.

Ronanki: GSoC 2012 Pronunciation Evaluation week 8

This week, I mainly concentrated on integrating everything with the web.

The following items are integrated:

1. A file upload option with different formats (wav/wma/mp3) is provided.
2. All test cases are evaluated while recording, and only recordings that are close to the perfect case are allowed.
3. The calculate score button provides the feedback page (some of the columns in the UI are still under construction).
4. Phrase entry as per the user's choice, followed by the score calculation page, is also under construction.
5. As of now, statistics for random phrase entry as per the user's choice are derived from the TIMIT database, which covers 630 speakers with 10 recordings from each.

Next Tasks:

1. Feature extraction (Power-Normalized Cepstral Coefficients and phonological features)
2. CART models (for efficient score calculation in random phrase method based on contextual information)

Regarding the under-construction pages, the back-end code has been developed and uploaded to sourceforge. Only the web pages need to be built dynamically; this will be done in parallel with the next tasks.

Ronanki: GSoC 2012 Pronunciation Evaluation week 7

Last week, I spent the first few days continuing to work on spectral features and phonological features and on their mapping based on neural network training. Based on forced alignment (or manual labels, if they exist), the phonological features for each phone in a phrase are repeated against its spectral features. I am looking into the CSLU Toolkit, which uses a neural net for feature-to-diphone decoding; I stopped at that point and will return to it after the mid-evaluation.

Later, I worked on integrating the acoustic/duration scores and edit-distance grammar decoding with the current website for exemplar outlier analysis.

I tried many test cases, such as:
1. silence
2. noisy speech
3. Junk speech
4. Random sentence
5. Actual sentence shortened in the end
6. Actual sentence skipped the beginning

In test cases 1-6, the forced alignment did not reach the final state and failed to create the phone segmentation file and the label file, which contain the acoustic scores and phone labels respectively.

7. Actual sentence
8. Actual sentence with more silence both at beginning and end
9. Actual sentence with one small word skip in the middle of phrase
10. Similar sounding sentences such as
Ex. Utterance: 
Approach the teaching of pronunciation with more confidence
Tested Similar sounding: 
a. Approach the teaching opponents the nation with over confidence
b. Approach the preaching opponents the nation with confidence

In test cases 7-10, the forced alignment worked and generated acoustic scores and phone labels. I then moved on to testing the accuracy of edit-distance grammar decoding on cases 7-10, so that I can set a threshold parameter to distinguish cases (7, 8, 9) from case (10).

Earlier, I tested cases 7 and 8 with the phrase decoder in edit-distance mode and reported an accuracy of around 73%; the accuracy is < 40% for case 10, so I can easily set the threshold parameter as accuracy = x>0.4 ? T : F
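The threshold test can be sketched as below. How the decoding accuracy is actually computed in the project is not stated here, so this example assumes one common definition: one minus the Levenshtein distance between expected and decoded phones, divided by the number of expected phones. The function names and phone sequences are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two phone sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref in enumerate(reference, start=1):
        current = [i]
        for j, hyp in enumerate(hypothesis, start=1):
            cost = 0 if ref == hyp else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def decoding_accuracy(reference, hypothesis):
    """Assumed definition: 1 - distance / number of expected phones."""
    if not reference:
        return 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


def is_expected_phrase(reference, hypothesis, threshold=0.4):
    """Mirror of the rule accuracy = x > 0.4 ? T : F."""
    return decoding_accuracy(reference, hypothesis) > threshold


expected = ["AH", "P", "R", "OW", "CH"]
print(is_expected_phrase(expected, ["AH", "P", "R", "IY", "CH"]))
print(is_expected_phrase(expected, ["AH", "X", "Y", "Z", "W"]))
```

The first hypothesis differs in one phone out of five and passes the threshold; the second differs in four and is rejected.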

I also discussed with my mentor James Salsman how to weight words by part of speech for the phrase output score. Here is what he proposed; all the units are in dB, which represents relative loudness in English.
my (%wt, %pos); # scoring weights and names of parts of speech
$wt{'q'} = 1.0; $pos{'q'} = 'quantifier';
$wt{'n'} = 0.9; $pos{'n'} = 'noun';
$wt{'v'} = 0.9; $pos{'v'} = 'verb';
$wt{'-'} = 0.8; $pos{'-'} = 'negative';
$wt{'w'} = 0.8; $pos{'w'} = 'adverb';
$wt{'m'} = 0.8; $pos{'m'} = 'adjective';
$wt{'o'} = 0.7; $pos{'o'} = 'pronoun';
$wt{'s'} = 0.6; $pos{'s'} = 'possessive';
$wt{'p'} = 0.6; $pos{'p'} = 'preposition';
$wt{'c'} = 0.5; $pos{'c'} = 'conjunction';
$wt{'a'} = 0.4; $pos{'a'} = 'article';
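One way these weights might be used is a weighted average of word scores. The sketch below copies the weights from the Perl table above; the weighted-average formula itself and the function name are assumptions, since the exact aggregation is not spelled out here.

```python
# Weights copied from the Perl table above, keyed by the same one-letter tags.
POS_WEIGHT = {
    "q": 1.0, "n": 0.9, "v": 0.9, "-": 0.8, "w": 0.8, "m": 0.8,
    "o": 0.7, "s": 0.6, "p": 0.6, "c": 0.5, "a": 0.4,
}


def phrase_score(word_scores):
    """Weighted average of (word_score, pos_tag) pairs."""
    total_weight = sum(POS_WEIGHT[tag] for _, tag in word_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(score * POS_WEIGHT[tag] for score, tag in word_scores)
    return weighted_sum / total_weight


# A well-pronounced noun and a badly pronounced article (made-up scores).
print(phrase_score([(1.0, "n"), (0.0, "a")]))
```

With this formula, a mistake on an article lowers the phrase score less than the same mistake on a noun or verb.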

I hope to finish this and launch the site, with everything integrated, before the mid-evaluation submission.

Ronanki: GSoC 2012 Pronunciation Evaluation Week 6

I have uploaded all my code (except a few ongoing pieces). Please follow the README files in each folder for detailed instructions on how to use them.

This week, I concentrated on new features for speech recognition. I read a paper on Power-Normalized Cepstral Coefficients [1], which are more robust for speech recognition, and a few papers on phonological features [2],[3]. I hope to investigate mapping the acoustic speech features of each phoneme, derived from machine phonetic transcription, to phonological features. Using this mapping, mispronunciations at the phone level can be identified using phonological features along with acoustic pronunciation scores and edit distances. I have derived some mappings based on those papers.

Ongoing tasks:
1. In the random phrase scoring method, another column is added to store the position of each phone with respect to the word (begin/middle/end), so that each phone will have three sets of statistics.
2. Standard word scores are derived along with phoneme standard (acoustic + duration) scores in the current forced alignment.
3. Linking the edit-distance algorithm with the pronunciation evaluation website.
4. Completing a full-fledged website with all test cases (junk speech, silence, misread, etc.) before the mid-evaluation, and publicizing the system so that it can be tested by a large number of users.


[1] Chanwoo Kim and Richard M. Stern, "Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition", ICASSP 2012.

[2] Katrin Kirchhoff, Gernot A. Fink and Gerhard Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition", Speech Communication 37 (2002) 303–319.

[3] S. King and P. Taylor, "Detection of phonological features in continuous speech using neural networks", Computer Speech and Language, vol. 14, no. 4, pp. 333–353, 2000.