Saturday, July 14, 2012

Daily progress reports in comments here

Quick mentor note: We are converting to daily progress reports which I will combine into draft blog posts that the students will proofread, copy-edit, and approve for publication.  This will help keep all three of us on schedule. Sorry I am behind. The good news is that both students made it from "on schedule" to "ahead of schedule" in a sprint for the evaluations.

Congratulations, Troy and Ronanki!

Please post your daily-ish (4 or more per week) progress reports here. Thanks!

Tuesday, July 10, 2012

Troy: GSoC 2012 Pronunciation Evaluation Week 5

Sorry for the late update. Here is what I did in Week 5, which was mainly problem solving.

1) Working around the Flash-based recorder update that prevented users from enabling their microphones.

Before the Flash Player 11.2 and 11.3 updates, the audio recorder I created using Flex worked fine: users could simply right-click the recorder and select "Settings" to allow microphone access. With the new updates, however, that option is disabled, with no error message.

To solve this problem, people suggested adding the website to Adobe's online global privacy settings list. However, after many attempts, that still did not work for the audio recorder.

Furthermore, http://englishcentral.com/, which also uses Flash-based recording, pops up the Flash microphone privacy settings dialog from its recording button (a microphone image). Checking the accessibility of the microphone in code and prompting with the settings dialog when necessary provided the solution:

First, check whether a microphone is available; if not, show the Flash microphone list dialog and ask the user to plug in a microphone:

var mic:Microphone = Microphone.getMicrophone();

if (!mic) {
    // No microphone detected: tell the user, then open the Flash
    // settings panel on the microphone tab so one can be selected.
    Alert.show("No microphone available");
    debug("No microphone available");
    Security.showSettings("microphone");
}

Otherwise, check whether the microphone is accessible; if it is muted, prompt with the privacy dialog to ask the user to allow microphone access:

if (mic.muted) {
    // Access is blocked: open the Flash privacy panel so the
    // user can allow microphone access for this site.
    debug("Microphone muted!");
    Security.showSettings("privacy");
}

With these tests during the initialization stage of the Flash recorder, users can enable microphone access right at the beginning. Interestingly, after doing this, the "Settings" option of the Flash object becomes clickable again.

Looking back at the code, the solution seems obvious, but before you know the answer it is really hard to find.

2) Cross-browser Flash recorder compatibility

With the Flash recorder problem solved as above, I happily updated the source code in the trunk and on our server, hoping to see the site working nicely. But the browser showed that the Flash recorder could not load; the only information I got was "Error 2046".

To solve this problem, I Googled a bunch of pages and tried several suggestions. The first was to clear the browser cache, set the Flash Player to not save a local cache, and then re-enable its local cache (in effect, clearing the Flash Player's local cache). That gave some progress, changing "Error 2046" into "Error 2032".

For "Error 2032", there are mainly two groups of explanations. One says something is wrong with the URLs in the ActionScript HTTP requests, which seems unlikely because those URLs are definitely correct and are under the same folder as the player. The other is a runtime shared library (RSL) problem with the mxmlc Flash compiler. To solve the RSL linkage problem, go to the "Flex Build Path" properties page, select the "Library path" tab, and change the framework linkage to "Merged into code".

[Mentor note: Requesting compatibility with earlier versions of Flash ActionScript using compiler switches may or may not help here.]

3) Adding a password change page

4) Refining the extra user-information update page to show existing user information when available, instead of always showing the default values.

The website for exemplary recordings is now at a usable stage. 

This week, I will try to accomplish these things:

1) Phrase data entry for administrators (with text, exemplar pronunciations, homograph disambiguation, phonemes, parts of speech per word, etc.);

2) Design recording prompts to start our exemplary recording data collection;

3) Bug fixing and system testing;

4) Study Amazon Mechanical Turk and start thinking about how to incorporate our speech data collection onto that platform.



Ronanki: GSoC 2012 Pronunciation Evaluation Week 5

The basic scoring routine for the pronunciation evaluation system is now available at http://talknicer.net/~ronanki/test/. It generates output for each phoneme in the phrase and displays the total score.

These are the things I've accomplished in the fifth week of GSoC 2012:

1. Edit-distance neighbor grammar generation:

Earlier, I did this with the single-phone and three-phone decoders described in the Week 4 notes below.

This week, I added two more decoders: a word decoder and a complete phrase decoder that substitutes neighbors for one phoneme at a time.

Word decoder: I used sox to split each wav file into words based on the forced-alignment output, and then represented each word as follows.

For example, the word "with" is represented as:

public <phonelist> = ( (W | L | Y) (IH) (TH) );

public <phonelist> = ( (W) (IH | IY | AX | EH) (TH) );

public <phonelist> = ( (W) (IH) (TH | S | DH | F | HH) ); 
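Grammar generation along these lines can be sketched in Python as follows; the neighbor map here is a small hypothetical excerpt, not the actual table used in the project:

```python
# Sketch: generate JSGF-style edit-distance neighbor grammars for a word.
# For each position in the word's phoneme sequence, one grammar lets that
# phoneme be replaced by acoustically similar neighbors while the other
# positions stay fixed. NEIGHBORS is a hypothetical excerpt for "with".

NEIGHBORS = {
    "W":  ["L", "Y"],
    "IH": ["IY", "AX", "EH"],
    "TH": ["S", "DH", "F", "HH"],
}

def neighbor_grammars(phonemes):
    """Return one grammar body per phoneme position."""
    grammars = []
    for i, target in enumerate(phonemes):
        parts = []
        for j, ph in enumerate(phonemes):
            if i == j:
                # Allow the target phoneme or any of its neighbors here.
                alts = " | ".join([ph] + NEIGHBORS.get(ph, []))
                parts.append("(%s)" % alts)
            else:
                # All other positions are fixed to the expected phoneme.
                parts.append("(%s)" % ph)
        grammars.append("public <phonelist> = ( %s );" % " ".join(parts))
    return grammars

for g in neighbor_grammars(["W", "IH", "TH"]):
    print(g)
```

Running this reproduces the three grammars listed above, one per phoneme position.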

The accuracy turned out to be better than the single-phone and three-phone decoders, and the same as the entire-phrase decoder. The output for a sample test phrase is at http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_words.txt

Complete phrase decoder using each phoneme: This is again similar to the entire-phrase decoder, but this time I supplied neighboring phones for one phoneme at a time and fixed the rest of the phonemes in the phrase. It is not a good approach in terms of speed, since it takes more time to decode, but its accuracy is better than all the previous methods. The output is at http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_phrases.txt

The code for the above methods is uploaded to the CMU Sphinx SourceForge repository at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/neighborphones_decode/

Please follow the README file in each folder for detailed instructions on how to use them.

2. Scoring paradigm:

Phrase-wise:
The current basic scoring routine deployed at http://talknicer.net/~ronanki/test/ aligns the test recording with the utterance using forced alignment in Sphinx and generates a phone segmentation file. Each phoneme in the file is then compared with the mean and standard deviation of the respective phone in the phrase statistics (http://talknicer.net/~ronanki/phrase_data/phrase1_stats.txt), and standard scores are calculated from the z-scores of the acoustic score and duration.
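The per-phoneme standard-score computation can be sketched as follows; the statistics values and the way the two z-scores are combined are illustrative assumptions, not the deployed routine:

```python
# Sketch: score one phoneme of a test utterance against exemplar statistics.
# Each stats entry holds the mean and standard deviation of the acoustic
# score and duration for a phone; the values here are made up for the demo.

PHRASE_STATS = {
    "W":  {"ac_mean": -95000.0, "ac_std": 12000.0, "dur_mean": 0.09, "dur_std": 0.03},
    "IH": {"ac_mean": -80000.0, "ac_std": 10000.0, "dur_mean": 0.07, "dur_std": 0.02},
}

def phone_score(phone, acoustic, duration, stats=PHRASE_STATS):
    """Combine acoustic-likelihood and duration z-scores into one number."""
    s = stats[phone]
    z_ac = (acoustic - s["ac_mean"]) / s["ac_std"]
    z_dur = (duration - s["dur_mean"]) / s["dur_std"]
    # Average of absolute z-scores: 0 is exemplary, larger is worse.
    # (The real routine may weight or map these differently.)
    return (abs(z_ac) + abs(z_dur)) / 2.0

# Acoustic score exactly at the mean, duration one std. dev. long:
print(round(phone_score("W", -95000.0, 0.12), 2))  # 0.5
```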

Random phrase:
I also derived statistics (mean score, standard deviation of score, mean duration) for each phone in the CMU phone set, irrespective of context, using the exemplar recordings for all three phrases (http://talknicer.net/~ronanki/phrase_data/phrases.txt) that I have as of now. So, if a test utterance is given, I can test each phone in a random phrase against the respective phone statistics.

Statistics are at http://talknicer.net/~ronanki/phrase_data/all_phrases_stats (the count column represents the number of times each phone occurred).
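Deriving such per-phone statistics can be sketched like this; the segmentation tuples stand in for forced-alignment output and are hypothetical:

```python
# Sketch: accumulate per-phone statistics (mean/std of acoustic score, mean
# duration, occurrence count) over exemplar segmentations, irrespective of
# context. The segmentation records below are hypothetical stand-ins for
# forced-alignment output.
import math
from collections import defaultdict

def phone_stats(segments):
    """segments: iterable of (phone, acoustic_score, duration) tuples."""
    buckets = defaultdict(list)
    for phone, score, dur in segments:
        buckets[phone].append((score, dur))
    stats = {}
    for phone, obs in buckets.items():
        scores = [s for s, _ in obs]
        durs = [d for _, d in obs]
        n = len(obs)
        mean = sum(scores) / n
        var = sum((s - mean) ** 2 for s in scores) / n  # population variance
        stats[phone] = {
            "count": n,                       # times the phone occurred
            "score_mean": mean,
            "score_std": math.sqrt(var),
            "dur_mean": sum(durs) / n,
        }
    return stats

demo = [("W", -90.0, 0.08), ("W", -110.0, 0.10), ("IH", -70.0, 0.06)]
print(phone_stats(demo)["W"]["count"])  # 2
```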

Things to do in the upcoming week:

1. Use an edit-distance grammar to derive standard scores such that only a minimal effective training data set is required. [Mentor note: was "no training data," which is excluded.]
2. Use the same grammar to detect words that have two different correct pronunciations (e.g., READ/RED).
3. In the random-phrase scoring method, add another column storing the position of each phone with respect to its word (or SILence), so that each phone has three statistics and can be compared better with the exemplar phonemes based on position.
4. Link all these modules together to try to match the experts' scores.
5. Provide feedback to the user with underlined mispronunciations, or numerical labels.

Future tasks:

1. Use CART models in training to better match the statistics for each phoneme in the test utterance with the training data, based on contextual information.
2. Use phonological (power-normalized cepstral?) features instead of mel-cepstral features, which are expected to better represent the state of pronunciation.
3. Develop a complete web-based system so that end users can test their pronunciation efficiently.

Wednesday, July 4, 2012

Ronanki: GSoC 2012 Pronunciation Evaluation Week 4

The source code for the functions below has been uploaded to http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/
Here are some brief notes on how to use those programs:

Method 1: (phoneme decode)
Path:
neighborphones_decode/one_phoneme/
Steps To Run:
1. Use split_wav2phoneme.py to split a sample wav file into individual phoneme wav files
Usage: python split_wav2phoneme.py <input_phoneseg_file> <complete_phone_list> <input_wav_file> <out_split_dir>
2. Create a split.ctl file from the extracted split_wav directory
3. Run the feature_extract.sh program to extract features for the individual phoneme wav files
4. Java Speech Grammar Format (JSGF) files are already created in FSG_phoneme
5. Run jsgf2fsg.sh in FSG_phoneme to convert from JSGF to FSG
6. Run decode_1phoneme.py to get the required output in output_decoded_phones.txt
Usage: python decode_1phoneme.py <input_split_ctl_file> <output_phone_file>

Method 2: (Three phones decode)
Path: 
neighborphones_decode/three_phones/
Steps To Run:
1. Use split_wav2threephones.py to split a sample wav file into three-phone wav segments, the two outer phones serving as contextual information for the middle one
Usage: python split_wav2threephones.py <input_phoneseg_file> <ngb_key_mapper> <input_wav_file> <out_split_dir>
2. Create a split.ctl file from the extracted split_wav directory
3. Run the feature_extract.sh program to extract features for the individual phoneme wav files
4. Java Speech Grammar Format (JSGF) files are already created in FSG_phoneme
5. Run jsgf2fsg.sh in FSG_phoneme to convert from JSGF to FSG
6. Run decode_3phones.py to get the required output in output_decoded_phones.txt
Usage: python decode_3phones.py <input_split_ctl_file> <output_phone_file>

Method 3: (Single/Batch phrase decode)
Path: 
neighborphones_decode/phrases/
Steps To Run:
1. Construct the grammar file (JSGF) using my earlier phonemes2ngbphones scripts, then use jsgf2fsg in sphinxbase to convert from JSGF to FSG, which serves as the input language model for sphinx3_decode
2. Provide the input arguments, such as the grammar file, features, acoustic models, etc., for the input test phrase
3. Run the decode.sh program to get the required output in sample.out
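The phrase-decoding steps above can be sketched as a small Python driver. The tool names (jsgf2fsg, sphinx3_decode) come from the post itself, but the flags and file names below are assumptions about the local setup, not verified invocations:

```python
# Sketch: assemble the JSGF -> FSG -> decode pipeline as a list of commands.
# The tool names come from the steps above; the exact flags and file names
# are hypothetical and depend on the local sphinxbase/sphinx3 installation.
import subprocess

def pipeline_commands(jsgf_path, fsg_path, ctl_path, model_dir):
    """Return the commands for converting a grammar and decoding against it."""
    return [
        # Convert the JSGF grammar to an FSG language model.
        ["jsgf2fsg", "-jsgf", jsgf_path, "-fsg", fsg_path],
        # Decode the test phrase features against the FSG.
        ["sphinx3_decode", "-fsg", fsg_path, "-ctl", ctl_path, "-hmm", model_dir],
    ]

for cmd in pipeline_commands("with.jsgf", "with.fsg", "split.ctl", "model/hmm"):
    print(" ".join(cmd))
    # subprocess.check_call(cmd)  # uncomment to actually run the tools
```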

Troy: GSoC 2012 Pronunciation Evaluation Week 4

[Project mentor note: I have been holding these more recent blog posts pending some issues with Adobe Flash security updates, which periodically break cross-platform audio upload web browser solutions. We have decided to plan for a fail-over scheme using low-latency HTTP POST multipart/form-data binary Speex uploads to provide backup in case Flash/rtmplite fails again in the future. This might also support most mobile devices. Please excuse the delay, and rest assured that progress continues and will be announced once we are confident that we won't need to contradict ourselves as browser technology for audio upload continues to develop. --James Salsman]

The data collection website now provides basic capabilities. Anyone interested, please check out http://talknicer.net/~li-bo/datacollection/login.php and give it a try. If you encounter any problems, please let us know.

Here are my accomplishments from last week:

1) Discussed the project schema design with the project mentor and created the database in MySQL. The current schema is shown at http://talknicer.net/w/Database_schema. During development of the user interface, slight modifications were made to refine the database schema, such as replacing the age field in the users table: storing the user's birth date is much better. Other similar changes were made. I learned that good database design comes from practice, not pure imagination.
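The birth-date-versus-age point can be illustrated with a short sketch (hypothetical, not the project's actual schema or code): a stored age column goes stale every year, whereas age can always be derived from a stored birth date on demand:

```python
# Sketch: why storing a birth date beats storing an age. A stored age goes
# stale, while the current age is always derivable from the birth date.
# This is an illustration only, not the project's actual code.
from datetime import date

def age_on(birth_date, today):
    """Compute a user's age in whole years as of `today`."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

print(age_on(date(1990, 8, 20), date(2012, 7, 14)))  # 21
```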

2) Implemented the two types of user registration pages: one for students and one for exemplar uploaders. To avoid redundant work and impose fewer constraints on user types, the registration process involves two steps: basic registration and an extra information update. For students, only the basic step is mandatory, but exemplar uploaders have to fill out both forms.

3) Added extra supporting functionality for user management, including password reset and mode selection for users with more than one type.

4) Incorporated the audio recorder into the website for recording and uploading to the server.

This week I plan to:

1)  Complete the user interface for adding phrase prompts;

2) Test the resulting system; 

3) Design the pronunciation learning game for student users.