Tuesday, June 19, 2012

Ronanki: GSoC 2012 Pronunciation Evaluation Week 3

I finally finished trying different methods for edit-distance grammar decoding. Here is what I have tried so far:

1. I used sox to split each input wave file into individual phonemes based on the forced alignment output. Then, I tried decoding each phoneme against its neighboring phonemes. The decoding output matched the expected phonemes only 12 out of 41 times for the exemplar recordings of the phrase "Approach the teaching of pronunciation with more confidence."

The accuracy for that method of edit distance scoring was 12/41 (29%) -- this naive approach didn't work well.
2. I used sox to split each input wave file into three-phoneme segments, based on the forced alignment output and each phoneme's position within its word. If a phoneme is at the beginning of its word, I used a grammar like <current phone> <next> <next2next>; if it is a middle phoneme, <previous> <current> <next>; and if it is at the end, <previous2previous> <previous> <current>. In each case I supplied the neighboring phones as alternatives for the current phone and fixed the other two. For example, the phoneme IH in the word "with" is encoded as: ((W) (IH|IY|AX|EH) (TH))

The accuracy was 19/41 (46.3%) -- better, because of the additional contextual information.

3. I used the entire phrase with each phoneme encoded in a sphinx3_decode grammar file for matching a sequence of alternative neighboring phonemes which looks something like this:

#JSGF V1.0;
grammar phonelist;
public <phonelist> = (SIL (AH|AE|ER|AA) (P|T|B|HH) (R|Y|L) (OW|AO|UH|AW) (CH|SH|JH|T) (DH|TH|Z|V) (AH|AE|ER|AA) (T|CH|K|D|P|HH) (IY|IH|IX) (CH|SH|JH|T) (IH|IY|AX|EH) (NG|N) (AH|AE|ER|AA) (V|F|DH) (P|T|B|HH) (R|Y|L) (AH|AE|ER|AA) (N|M|NG) (AH|AE|ER|AA) (N|M|NG) (S|SH|Z|TH) (IY|IH|IX) (EY|EH|IY|AY) (SH|S|ZH|CH) (AH|AE|ER|AA) (N|M|NG) (W|L|Y) (IH|IY|AX|EH) (TH|S|DH|F|HH) (M|N) (AO|AA|ER|AX|UH) (R|Y|L) (K|G|T|HH) (AA|AH|ER|AO) (N|M|NG) (F|HH|TH|V) (AH|AE|ER|AA) (D|T|JH|G|B) (AH|AE|ER|AA) (N|M|NG) (S|SH|Z|TH) SIL);

The accuracy for this method of edit distance scoring was 30/41 (73.2%) -- the more contextual information provided, the better the accuracy.
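A grammar like this can be generated mechanically from the expected phoneme sequence and a table of neighboring phonemes. Here is a minimal Python sketch of that generation step; the NEIGHBORS map is a small hypothetical stand-in for the real neighbor lists (produced by the conversion program described in the Week 1 post below), and "phone.gram" is a hypothetical output filename:

# Sketch: emit a whole-phrase JSGF grammar (method 3 above) from the expected
# phoneme sequence.  NEIGHBORS and the output filename are hypothetical.
NEIGHBORS = {"AH": ["AE", "ER", "AA"], "P": ["T", "B", "HH"], "R": ["Y", "L"],
             "OW": ["AO", "UH", "AW"], "CH": ["SH", "JH", "T"]}

def phrase_grammar(phones, name="phonelist"):
    alts = ["(%s)" % "|".join([p] + NEIGHBORS.get(p, [])) for p in phones]
    return "#JSGF V1.0;\ngrammar %s;\npublic <%s> = (SIL %s SIL);\n" % (
        name, name, " ".join(alts))

with open("phone.gram", "w") as f:
    f.write(phrase_grammar(["AH", "P", "R", "OW", "CH"]))  # just "approach", for brevity

Method 2's three-phone window grammars can be produced the same way, expanding alternatives only for the phone under test and keeping its two context phones fixed.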

Here is some sample output, with the two phoneme sequences written one below the other for comparison.

Forced-alignment output:  AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N S IY EY SH AH N W IH TH M

Decoder output:  ER P R UH JH DH AH CH IY CH IY N AH V P R ER N AH NG Z IY EY SH AH N W IH TH M
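Because both passes are constrained to produce exactly one phone per expected position, the two sequences line up one-to-one and agreement figures like 30/41 can be counted directly. A minimal sketch, using the truncated excerpts above (so the count it prints is not the full 30/41 figure):

# Sketch: count position-by-position agreement between the forced-alignment
# phones and the decoder phones; both sequences have the same length here.
aligned = ("AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N S IY EY "
           "SH AH N W IH TH M").split()
decoded = ("ER P R UH JH DH AH CH IY CH IY N AH V P R ER N AH NG Z IY EY "
           "SH AH N W IH TH M").split()

matches = sum(1 for a, d in zip(aligned, decoded) if a == d)
print("%d/%d (%.1f%%)" % (matches, len(aligned), 100.0 * matches / len(aligned)))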

In this case, both are forced outputs, so if someone skips or inserts something while recording the phrase, this approach may not work well. We need to think of a method to handle that. Would a separate decoder pass with a grammar that tests for whole-word or syllable insertions and deletions work?

Things to do for next week:

1. We are trying to combine acoustic standard scores (and durations) from forced alignment with edit distance scoring from a grammar, a combination which has been reported to correspond better with the judgments of human expert phonologists.

2. Complete a basic demo of pronunciation evaluation (without edit distance scoring) from exemplar recordings, converting phoneme acoustic scores and durations to normally distributed scores and deriving their means and standard deviations, so that we can produce per-phoneme acoustic and duration standard scores for newly uploaded recordings (a sketch of that conversion follows this list).

3. Finalize the method for mispronunciation detection at phoneme and word level.
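For item 2 above, the conversion to standard scores is a per-phoneme z-score against the exemplar statistics. A minimal sketch, with hypothetical placeholder statistics standing in for the real means and standard deviations derived from the exemplar recordings:

# Sketch: convert a new recording's per-phoneme acoustic score and duration
# into standard (z) scores, given per-phoneme means and standard deviations
# collected from the exemplar recordings.  The numbers below are hypothetical.
ACOUSTIC_STATS = {"AH": (-60000.0, 8000.0), "P": (-55000.0, 6000.0)}  # mean, std
DURATION_STATS = {"AH": (0.09, 0.03), "P": (0.07, 0.02)}              # seconds

def standard_scores(phone, acoustic_score, duration):
    a_mean, a_std = ACOUSTIC_STATS[phone]
    d_mean, d_std = DURATION_STATS[phone]
    return (acoustic_score - a_mean) / a_std, (duration - d_mean) / d_std

print(standard_scores("AH", -63864.0, 0.12))   # (acoustic z-score, duration z-score)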

Troy: GSoC 2012 Pronunciation Evaluation Week 3

Week 3 accomplishments:

1. Tailored the previous ActionScript/MXML audio recorder to provide only audio recording and playback functionality, and began building interfaces for interaction with the web site pages using JavaScript.

2. Discussed database design and schema with the project mentor and continued refining and testing the schema and initial database records.

Plans for Week 4:

1. Fix the database schema for prompts to handle word lists with (possibly multiple) pronunciations and parts of speech, along with a separate text string for phrase display which can include arbitrary punctuation and, because of that punctuation, might not have clear word boundaries--such as this phrase in dashes--etc. (one possible way to tokenize such a display string is sketched after this list).

2. Create a separate registration interface for users who will be uploading exemplar pronunciation recordings.
  
3. Create an interface to add phrase prompts and mark their words' disambiguated pronunciation and parts of speech.
  
4. Create the interface to upload exemplar recordings for prompts.

5. Think about game play and refine its schema once the basic features are decided.
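For item 1 above, one possible way to relate a punctuation-laden display string back to the prompt's word list is to strip punctuation before matching. This is purely an illustration of the problem, not a decided design:

# Sketch: tokenize a display string (which may contain arbitrary punctuation)
# into candidate words for matching against the prompt's word list.  Purely
# illustrative; the actual schema and matching rules are still being decided.
import re

def display_to_words(display_text):
    return [w.lower() for w in re.findall(r"[A-Za-z']+", display_text)]

print(display_to_words("Approach the teaching of pronunciation--with more confidence--etc."))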

Sunday, June 10, 2012

Ronanki: GSoC 2012 Pronunciation Evaluation Week 2

[It is my fault this update is late, not Ronanki's. --James Salsman]

Following last week's discussion describing how to obtain phoneme acoustic scores from sphinx3_align, here is some additional detail pertaining to two of the necessary output arguments:

1. Following up on the discussion at https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4583225, I was able to produce acoustic scores for each frame, and thereby also for each phoneme in a single recognition pass.  Add the following code to the write_stseg function in main_align.c and use the state segmentation parameter -stsegdir as an argument to the program:

    char str2[1024];
    align_stseg_t *tmp1;

    /* Walk the state segmentation list and print, for each frame,
       the phone name, phone ID, senone ID, HMM state, and acoustic score. */
    for (i = 0, tmp1 = stseg; tmp1; i++, tmp1 = tmp1->next) {
        mdef_phone_str(kbc->mdef, tmp1->pid, str2);
        fprintf(fp, "FrameIndex %d Phone %s PhoneID %d SenoneID %d state %d Ascr %11d \n",
            i, str2, tmp1->pid, tmp1->sen, tmp1->state, tmp1->score);
    }

2. By using the phone segmentation parameter -phsegdir as an argument to the program, the acoustic scores for each phoneme can be calculated. The output sequence for the word "approach" is as follows:

         SFrm  EFrm   SegAScr       Phone
            0     9    -64725       SIL
           10    21    -63864       AH SIL P b
           22    33   -126819       P AH R i
           34    39    -21470       R P OW i
           40    51    -69577       OW R CH i
           52    64    -55937       CH OW DH e
Each phoneme in the "Phone" column is represented as <Aligned_phone> <Previous_phone> <Next_phone> <position_in_the_word (b-begin, i-middle, e-end)>.  The full command line usage for this output is:

$ sphinx3_align -hmm wsj_all_cd30.mllt_cd_cont_4000 -dict cmu.dic -fdict phone.filler -ctl phone.ctl -insent phone.insent -cepdir feats -phsegdir phonesegdir -phlabdir phonelabdir -stsegdir statesegdir -wdsegdir aligndir -outsent phone.outsent
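The per-phoneme acoustic scores in the phone segmentation output shown above can then be pulled out with a small parser. Here is a minimal Python sketch, assuming the whitespace-separated "SFrm EFrm SegAScr Phone" layout shown above, the standard 100 frames per second, and a hypothetical file path:

# Sketch: parse a sphinx3_align phone segmentation file into
# (phone, start_frame, end_frame, acoustic_score, duration) tuples.
def parse_phseg(path, frame_rate=100.0):
    entries = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4 or not fields[0].isdigit():
                continue  # skip the header row and any trailer lines
            sfrm, efrm, ascr = int(fields[0]), int(fields[1]), int(fields[2])
            phone = fields[3]  # base phone; context and position fields follow
            entries.append((phone, sfrm, efrm, ascr, (efrm - sfrm + 1) / frame_rate))
    return entries

# Hypothetical path under the -phsegdir directory used above:
for phone, sfrm, efrm, ascr, dur in parse_phseg("phonesegdir/phrase1.phseg"):
    print(phone, ascr, dur)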

Work in progress:

1. It's very important to weight word scores by the words' parts of speech (articles don't matter very much if they are omitted, but nouns, adjectives, verbs, and adverbs are the most important). Troy has designed a basic database schema at http://talknicer.net/w/Database_schema in which the part of speech is one of the fields in the "prompts" table, along with acoustic and duration standard scores in the "scores" table.

2. I put the exemplar recordings that the project mentor had collected for three phrases at http://talknicer.net/~ronanki/Datasets/, with one subdirectory there for each of the three phrases.  The description of the phrases is at http://talknicer.net/~ronanki/Datasets/files/phrases.txt.

3. I ran sphinx3_align on that sample data set and wrote a program to calculate the mean and standard deviation of each phoneme's acoustic scores, along with the mean duration of each phoneme (a sketch of that computation follows this list). I also generated the neighboring phonemes for each of the phrases; the output is written in this file: http://talknicer.net/~ronanki/Datasets/out_ngb_phonemes.insent

4. I also tried some of the other sphinx3 executables such as sphinx3_decode, sphinx3_livepretend, and sphinx3_continuous for mispronunciation detection. For the sentence "Approach the teaching of pronunciation with more confidence." (phrase 1), I used this command:

$ SPHINX3DECODE -hmm ${WSJ} -fsg phone.fsg -dict basicphone.dic -fdict phone.filler -ctl new_phone.ctl -hyp phone.out -cepdir feats -mode allphone -hypseg phone_hypseg.out -op_mode 2

The decoder, sphinx3_decode, produced this output:

P UH JH DH CH IY CH Y N Z Y EY SH AH W Z AO K AA F AH N Z

The forced alignment system, sphinx3_align, produced this output: 

AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N S IY EY SH AH N W IH TH M AO R K AA N F AH D AH N S

The sphinx3_livepretend and sphinx3_continuous commands produce output in words, using language models and acoustic models along with a complete dictionary of expected words:

approach to teaching opponents the nation with more confidence
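For item 3 above, the per-phoneme statistics can be accumulated directly from the forced-alignment results of the exemplar recordings. A minimal sketch, with a few made-up observations standing in for the real alignment output:

# Sketch: accumulate per-phoneme acoustic-score and duration statistics
# (mean and standard deviation) over all exemplar recordings of a phrase.
from collections import defaultdict
from math import sqrt

def phoneme_stats(observations):
    # observations: iterable of (phone, acoustic_score, duration) tuples
    scores, durations = defaultdict(list), defaultdict(list)
    for phone, score, duration in observations:
        scores[phone].append(score)
        durations[phone].append(duration)

    def mean_std(values):
        m = sum(values) / float(len(values))
        return m, sqrt(sum((v - m) ** 2 for v in values) / len(values))

    return dict((phone, (mean_std(scores[phone]), mean_std(durations[phone])))
                for phone in scores)

stats = phoneme_stats([("AH", -63864, 0.12), ("AH", -58210, 0.10), ("P", -126819, 0.12)])
print(stats["AH"])   # ((mean, std) of acoustic scores, (mean, std) of durations)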

Plans for the coming week:

1. Write and test audio upload and pronunciation evaluation for per-phoneme standard scores.

2. Since there are many deletions in the edit distance scoring grammars tried so far, we need to modify the grammar file and/or the method we are using to detect whether neighboring phonemes match more closely. Here is my idea for finding neighboring phonemes using dynamic programming (a sketch of step (b) appears after this list):

a. Run the decoder to get the best possible output

b. Align the decoder output to the forced-alignment output using a dynamic programming string matching algorithm.

c. The aligned output will have the same number of phones as the forced alignment. So, we need to test two things for each phoneme:
  • If the phone is the same as the expected phoneme, nothing more needs to be done.
  • If the phone is not the expected phoneme, check whether it appears in the list of the expected phoneme's neighboring phonemes.

d. Then, we can run sphinx3_align with this outcome against the same wav file to check whether the acoustic scores actually indicate a better match. 

3. As an alternative to the above, I used sox to split each input wave file into individual phoneme wav files using the forced alignment phone labels, and then ran a separate recognition pass on each tiny speech segment. Now, I am writing separate grammar files of the neighboring phonemes for each phoneme. Once I complete them, I will check the output of the decoder for each phoneme segment. This should provide a more accurate assessment of mispronunciations.

4. I will update the wiki here at http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation with my current tasks and milestones.
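For step (b) of item 2 above, a standard dynamic-programming (Levenshtein-style) alignment works. Here is a minimal sketch that aligns the decoder output to the expected (forced-alignment) phones and then applies the checks of step (c); the NEIGHBORS map is a hypothetical stand-in for the real neighbor lists:

# Sketch: align decoder output to the expected phones, then classify each
# expected phone as a match, a neighbor, missing, or a mismatch.
NEIGHBORS = {"AH": ["AE", "ER", "AA"], "OW": ["AO", "UH", "AW"], "CH": ["SH", "JH", "T"]}

def align(expected, decoded):
    n, m = len(expected), len(decoded)
    # dp[i][j] = minimum edit cost to align expected[:i] with decoded[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if expected[i - 1] == decoded[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    # Backtrace, keeping exactly one decoded phone (or a gap) per expected phone.
    pairs, i, j = [], n, m
    while i > 0:
        diag = dp[i - 1][j - 1] + (0 if expected[i - 1] == decoded[j - 1] else 1) if j > 0 else None
        if j > 0 and dp[i][j] == diag:
            pairs.append((expected[i - 1], decoded[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            j -= 1                                   # decoder insertion: drop it
        else:
            pairs.append((expected[i - 1], None))    # decoder deletion: leave a gap
            i -= 1
    return list(reversed(pairs))

for expected, observed in align("AH P R OW CH".split(), "ER P OW CH".split()):
    if observed == expected:
        status = "match"
    elif observed is None:
        status = "missing"
    elif observed in NEIGHBORS.get(expected, []):
        status = "neighbor"
    else:
        status = "mismatch"
    print(expected, observed, status)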

Tuesday, June 5, 2012

Troy: GSoC 2012 Pronunciation Evaluation Week 2

These are the things I've accomplished in the second week of GSoC 2012:

1. Set up a cron job for the rtmplite server to automatically check whether the process is still running or not. If it is stopped, restart it. This will allow the server to stay up if the machine gets rebooted, and will allow the server to spawn subprocesses without being stopped by job control as happens when the process is put into the background from a terminal shell. To accomplish this, I first created a .process file in my home directory with the rtmplite server's process id number as its sole contents. You can use 'top' or 'ps' to find out the process id of the server.  Then I created this shell script file to check the status of the rtmplite server process:
#!/bin/sh
# File holding the rtmplite server's process id; exefile and dataroot must be
# set to the rtmplite server script and its data directory before this point.
pidfile=~/.process
if [ -e "$pidfile" ]
then
    # check whether the process is running
    rtmppid=`/usr/bin/head -n 1 ${pidfile} | /usr/bin/awk '{print $1}'`;
    # restart the process if not running
    if [ ! -d /proc/${rtmppid} ]
    then
       /usr/bin/python ${exefile} -r ${dataroot} &
       rtmppid=$!
       echo "${rtmppid}" > ${pidfile}
       echo `/bin/date` "### rtmplite process restarted with pid: ${rtmppid}"
    fi
fi

This script first checks whether the .process file exists. If we want the cron job to stop checking for this process temporarily (such as when we apply patches to the program), we can simply delete this file and it won't check on or try to restart the server; after our maintenance, recreate the file with the new process id, and the checking will automatically resume.

The last and most important step is to schedule this task in cron by creating the following entry with the command crontab -e:
 * * * * * [path_to_the_script]/check_status.sh

This causes cron to run the script every minute, so the rtmplite server process is checked once per minute.
2. Implemented web server user login and registration pages using MySQL and HTML. We use a MySQL database for storing user information, so I designed and created the following table for user information in the server's MySQL database:

Field     Type           Comments
userid    INTEGER        Compulsory, automatically increased, primary key
email     VARCHAR(200)   Compulsory; users are identified by email
password  VARCHAR(50)    Compulsory; encrypted using SHA1; at least 8 alphanumeric characters
name      VARCHAR(100)   Not compulsory, default NULL
age       INTEGER        Not compulsory, default NULL, accepted values [0,150]
sex       CHAR(1)        Not compulsory, default NULL, accepted values {'M', 'F'}
native    CHAR(1)        Not compulsory, default NULL, accepted values {'Y', 'N'}; indicates whether the user is a native English speaker
place     VARCHAR(1000)  Not compulsory, default NULL; the place where the user lived between the ages of 6 and 8
accent    CHAR(1)        Not compulsory, default NULL, accepted values {'Y', 'N'}; indicates whether the user has a self-reported accent

This table was created by the following SQL command:

CREATE TABLE users (
   userid INTEGER NOT NULL AUTO_INCREMENT,
   email VARCHAR(200) NOT NULL,
   password VARCHAR(50) NOT NULL,
   name VARCHAR(100),
   age INTEGER,
   sex SET('M', 'F'),
   native SET('Y', 'N') DEFAULT 'N',
   place VARCHAR(1000),
   accent SET('Y', 'N'),
   CONSTRAINT PRIMARY KEY (userid),
   CONSTRAINT chk_age CHECK (age>=0 AND age<=150)
);
I also prototyped the login and simple registration pages in HTML. Here are their preliminary screenshots:



If you like, you can go to this page to help us test the system: http://talknicer.net/~li-bo/datacollection/login.php. On the server, we use PHP to retrieve the form information from the login and registration pages, perform an update or query in the MySQL database, and then send data back as HTML.

The recording interface has also been modified to use HTML instead of pure Flex as before. The page currently displays well, but there is no event interaction between HTML and Flash yet.


3. Database schema design for the entire project:  Several SQL tables have been designed to store the various information used by all aspects of this project. Detailed table information can be found on our wiki page: http://talknicer.net/w/Database_schema. Here is a brief discussion.

First, the user table shown above will be augmented to keep two additional kinds of user information: one for normal student users and one for those who provide exemplar recordings. Student users who can provide correct pronunciations should also be allowed to contribute to the exemplar recordings. Also, if exemplar recorders register through the website, they have to show that they are proficient enough to contribute qualified exemplar recordings, so we should be able to use the student evaluation system to qualify them for uploading exemplar contributions.

There are several other tables for additional information: a languages table listing the languages defined by the ISO, in case we extend the project to other languages; a region table to capture some idea of the user's accent; and a prompts table for the list of text resources that will be used for pronunciation evaluation. There are also tables to log the recordings the users make and tables for the sets of tests stored in the system.


Here are my plans for the coming week: 

1. Discuss details of the game specification to finish the last part of schema design. 

2. Figure out how to integrate the Flash audio recorder with the HTML interface using bidirectional communication between ActionScript and JavaScript. 

3. Implement the student recording interface. 

4. Further tasks can be found at: http://talknicer.net/w/To_do_list

Friday, June 1, 2012

Ronanki: GSoC 2012 Pronunciation Evaluation Week 1 Status

Last week, I accomplished the following:
  1. Successfully tested producing phoneme acoustic scores from sphinx3_align using two recognition passes. I was able to use the state segmentation parameter -stsegdir as an argument to the program to obtain acoustic scores for each frame, and thereby for each phoneme as well. However, the program's output still has to be decoded into integer format, which I will try to do by the end of next week.
  2. Last week I wrote a program which converts a list of each phoneme's "neighbors," or most similar other phonemes, provided by the project mentor, from the Worldbet phonetic alphabet to CMUbet. But yesterday, when I compared both files manually, I found that some of the phones were mismatched, so I re-checked my code and fixed the bug. The corrected program takes a string of phonemes representing an expected utterance as input and produces a sphinx3 recognition grammar consisting of a string of alternatives representing each expected phoneme and all of its neighboring phonemes, for automatic edit distance scoring.
All the programs I have written so far are checked in at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki using subversion. (Similarly, Troy's code is checked in at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/troy.)


Here is the procedure for using that code to obtain neighboring phonemes of CMUbet from a file which contains a string of phonemes:
  • To convert Worldbet phonetic alphabet to CMUbet 
  • Usage: python convert_world2cmu.py <input_worldbet_phone> <input_key_map> <output_cmubet_phone>
  • To convert input list of phonemes to neighboring phones 
  • Usage: python convert2_ngbphones.py <input_phoneme_list> <input_phone_map> <output_neighboring_phone_list>
  • Example: "I had faith in them" (arctic_a0030), a sentence from the ARCTIC database:
  • <input_phoneme_list> AY HH AE D F EY TH IH N DH EH M (arctic_a0030)
    <output_neighboring_phone_list> {AY|AA|IY|OY|EY} {HH|TH|F|P|T|K} {AE|EH|ER|AH} {D|T|JH|G|B} {F|HH|TH|V} {EY|EH|IY|AY} {TH|S|DH|F|HH} {IH|IY|AX|EH} {N|M|NG} {DH|TH|Z|V} {EH|IH|AX|ER|AE} {M|N} (arctic_a0030)
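A minimal sketch of those two conversion steps follows; the Worldbet-to-CMUbet entries and the neighbor lists below are small hypothetical subsets, not the real key map and tables:

# Sketch: map Worldbet symbols to CMUbet with a key map, then expand each
# CMUbet phone into the {phone|neighbor|...} form shown above.  Both maps
# here are hypothetical subsets for illustration only.
WORLDBET_TO_CMUBET = {"ai": "AY", "h": "HH", "&": "AE", "d": "D"}
NEIGHBORS = {"AY": ["AA", "IY", "OY", "EY"], "HH": ["TH", "F", "P", "T", "K"],
             "AE": ["EH", "ER", "AH"], "D": ["T", "JH", "G", "B"]}

def to_cmubet(worldbet_phones):
    return [WORLDBET_TO_CMUBET.get(p, p) for p in worldbet_phones]

def with_neighbors(cmubet_phones):
    return " ".join("{%s}" % "|".join([p] + NEIGHBORS.get(p, [])) for p in cmubet_phones)

print(with_neighbors(to_cmubet(["ai", "h", "&", "d"])))
# {AY|AA|IY|OY|EY} {HH|TH|F|P|T|K} {AE|EH|ER|AH} {D|T|JH|G|B}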