tag:blogger.com,1999:blog-80124011485878995012024-03-06T11:37:50.149+05:30Pronunciation Evaluationwith CMU Sphinx3 speech recognition: a Google Summer of Code diaryJames Salsmanhttp://www.blogger.com/profile/11470879525772180030noreply@blogger.comBlogger30125tag:blogger.com,1999:blog-8012401148587899501.post-21612410173304361152013-05-02T11:17:00.000+05:302013-05-02T11:17:19.328+05:30Data collection funded; moving GSoC to MoodleTwo big news items: we have a sponsor for the data collection effort, and in the 2013 Google Summer of Code, we're going to try to integrate with Moodle. More soon.James Salsmanhttp://www.blogger.com/profile/11470879525772180030noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-2201875408456170192012-11-17T23:01:00.000+05:302012-11-18T05:56:58.887+05:30UPDATED: Status updateI wish that this were a progress report instead of a status update, but so far we haven't raised enough to begin data collection with Mechanical Turk. We have had a <a href="https://docs.google.com/open?id=0B73LgocyHQnfS0g5ZEw1aFNKT2s">paper accepted for publication</a> and we are trying to get in to the <a href="https://developers.google.com/compute/">Google Compute Engine</a> to save expenses for the huge Amazon bill for asking people who claim to have good pronunciation and reading skill to record exemplars. The problem is that the number of such exemplars needs to be relatively large. For those of you familiar with the <a href="http://talknicer.com/d/">TalkNicer demo</a>, this is the "exemplar sufficiency index" and it needs to meet a certain threshold for at least 5,000 words of instructional material before I feel comfortable committing to an expensive data collection effort.<br />
<br />
So in summary, <a href="http://talknicer.com/slics/">please donate more, or if you have already donated, please ask multiple people to at least match your donation</a>. It will be worth it.<br />
<br />
Update: How much more do we need? About $4,000 based on the preliminary per-phoneme exemplar sufficiency index including English homographs and Mechanical Turk performance expectation estimates. Also updated: <a class="ot-anchor" href="http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation" style="background-color: white; color: #3366cc; cursor: pointer; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;">cmusphinx.sourceforge.net/wiki/pronunciation_evaluation</a><br />
<br />
Further update: I am very sorry about delaying Troy's posts here (it was due to the WebRTC and related questions) but they have been available at e.g. <a href="http://cmusphinx.sourceforge.net/2012/08/gsoc-2012-pronunciation-evaluation-troy-project-conclusions/">cmusphinx.sourceforge.net/2012/08/gsoc-2012-pronunciation-evaluation-troy-project-conclusions</a>James Salsmanhttp://www.blogger.com/profile/11470879525772180030noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-22401906826185654712012-08-26T17:12:00.002+05:302013-08-06T14:15:42.794+05:30Ronanki: GSoC 2012 Pronunciation Evaluation: Summary and Conclusions<div dir="ltr" style="text-align: left;" trbidi="on">
This article briefly summarizes the implementation of the GSoC 2012 Pronunciation Evaluation project.<br />
<br />
I started with Sphinx forced alignment and obtained the spectral-matching acoustic scores and durations at the phone and word levels using WSJ models. After that, as part of the GSoC 2012 project, I concentrated mainly on two things: edit-distance neighbor-phone decoding, and scoring routines for both text-dependent and text-independent systems.<br />
<br />
<b>Edit-distance Neighbor phones decoding:</b><br />
<br />
1. I started with a single-phone decoder and then explored a three-phone decoder, a word decoder, and a complete-phrase decoder, providing neighbor phones as alternatives to the expected phone.<br />
2. The decoding results showed that word-level and phrase-level decoding using JSGF are almost the same.<br />
3. This method helps to detect mispronunciations at the phone level, and can detect homographs as well if the percentage of decoding errors can be reduced.<br />
<br />
<b>Scoring Routines:</b><br />
<br />
<b>Text-dependent: </b><br />
This method is based on exemplars for each phrase. First, the mean acoustic score and mean duration, along with their deviations, are calculated for each phone in the phrase from the exemplar recordings. Then, given a test recording, each phone in the phrase is compared with the exemplar statistics: z-scores are calculated, and normalized scores are derived from them using the maximum and minimum z-scores of the exemplar recordings. All phone scores are aggregated to get a word score, and all word scores are then aggregated with part-of-speech (POS) weights to get the complete phrase score.<br />
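The scoring steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's actual code: the post does not give the exact normalization formula, so the sketch assumes min-max scaling of the absolute z-score (clamped to [0, 1], with 1 meaning closest to the exemplars), a plain mean for the word score, and a weighted mean for the phrase score. The function names and the POS weight values are hypothetical.

```python
from statistics import mean, pstdev


def z_score(value, mu, sigma):
    """Standard score; defined as 0.0 when the exemplars show no spread."""
    return 0.0 if sigma == 0 else (value - mu) / sigma


def normalized_score(value, exemplar_values):
    """Map a test measurement (acoustic score or duration) to [0, 1].

    Assumption: min-max scaling of |z| against the exemplars' own |z| range.
    """
    mu, sigma = mean(exemplar_values), pstdev(exemplar_values)
    exemplar_z = [abs(z_score(v, mu, sigma)) for v in exemplar_values]
    z_min, z_max = min(exemplar_z), max(exemplar_z)
    z = abs(z_score(value, mu, sigma))
    if z_max == z_min:
        # Degenerate case: the exemplars give no usable range.
        return 1.0 if z <= z_max else 0.0
    return max(0.0, min(1.0, 1.0 - (z - z_min) / (z_max - z_min)))


def word_score(phone_scores):
    """Assumption: a word score is the plain mean of its phone scores."""
    return mean(phone_scores)


def phrase_score(word_scores, pos_weights):
    """Weighted mean of word scores; the POS weights are hypothetical inputs."""
    total = sum(pos_weights)
    return sum(w * s for w, s in zip(pos_weights, word_scores)) / total
```

For example, with exemplar durations of 1.0, 2.0 and 3.0, a test duration of 2.0 sits at the exemplar mean and receives the highest score under this scaling, while anything beyond the exemplars' range is clamped to 0.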
<br />
<b>Text-independent:</b><br />
This method is based on predetermined statistics built from any corpus. In this project, I used the TIMIT corpus to build statistics for each phone based on its position (begin/middle/end) in the word. Given any test file, the acoustic score and duration of each phone are compared with the corresponding phone statistics, selected using this contextual information. The scoring method is the same as that of the text-dependent system.<br />
<br />
<b>Demo:</b><br />
Please try our demo @ <a href="http://talknicer.net/~ronanki/test/">http://talknicer.net/~ronanki/test/</a> and help us by giving feedback.<br />
<br />
<b>Documentation and code:</b><br />
All code is uploaded to the cmusphinx svn @<br />
<a href="http://sourceforge.net/p/cmusphinx/code/HEAD/tree/branches/speecheval/ronanki/">http://sourceforge.net/p/cmusphinx/code/HEAD/tree/branches/speecheval/ronanki/</a> and raw documentation of the project can be found <a href="http://researchweb.iiit.ac.in/~srikanth.ronanki/Pronunciation_scoring_documentation.docx" target="_blank">here</a>.<br />
<br />
<b>Conclusions:</b><br />
The pronunciation evaluation system can help users improve their pronunciation by letting them try multiple times and correct themselves using the feedback given at the phone and word levels. I couldn't complete some of the things I mentioned earlier during the project, but I hope to keep contributing to this project in the future.<br />
<br />
This summer has been a great experience for me. Google Summer of Code 2012 has finally ended. I would like to thank my mentor James Salsman for his time, continuous efforts and help. The way he motivated me really helped me to stay focused on the project throughout. I would also like to thank my friend Troy Lee, as well as Nickolay and Biksha Raj, for their help and comments during the project. </div>
srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-67910472356078532432012-08-22T05:46:00.000+05:302012-11-18T05:49:51.159+05:30GSoC 2012: #Troy Pronunciation Evaluation Week 7 Status<br />
<div style="font-family: arial; font-size: small;">
Last week, I was still working on the data collection website.</div>
<div style="font-family: arial; font-size: small;">
<br /></div>
<div style="font-family: arial; font-size: small;">
Many thanks to Robert (<span style="font-family: arial, sans-serif;">butler1970@gmail.com) for trying out the website and listing the issues he encountered on this page: </span><a href="https://www.evernote.com/pub/butler1970/cmusphinx#b=11634bf8-7be9-479f-a20e-6fa1e54b322b&n=398dc728-b3f0-4ceb-8ccf-89295b98a6d7">https://www.evernote.com/pub/butler1970/cmusphinx#b=11634bf8-7be9-479f-a20e-6fa1e54b322b&n=398dc728-b3f0-4ceb-8ccf-89295b98a6d7</a></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;"><br /></span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">Issue #1: The student page is under construction</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;"><br /></span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">The first stage of the website was to collect exemplar recordings, so the student page was not implemented at that time. </span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;"><br /></span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">Issue #2: The inconvenient birthdate control</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;"><br /></span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">The birthdate control has now been replaced with the standard HTML5 &lt;input type="datetime"&gt; control. Because the datetime input control is new in HTML5, currently only Chrome, Safari and Opera support the popup date selection. On other browsers, which have no support yet, the control is simply displayed as an input box. The user can just type in the date, and the background script checks whether the format is correct.</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;"><br /></span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">Issue #3: The incorrect error message "Invalid date format" on the additional information update page</span></div>
<div style="font-family: arial; font-size: small;">
<br /></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">After digging into the source code for several hours, I found that the bug lies in the order in which the MySQL-related functions are invoked. The processing steps on the additional information update page are as follows:</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">a) the client side posts the user's input to the server;</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">b) the server side first uses the mysql_escape_string function to preprocess the user information, to ensure the security of later MySQL queries;</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">c) the server checks the format of each field, including whether the user entered a valid date;</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">d) the server updates the MySQL database with the new information.</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">Since the MySQL server is only needed in step d), I had placed the database connection code after step c), without knowing that the mysql_escape_string function also requires a MySQL database connection. In the previous implementation, mysql_escape_string therefore returned an empty string, which led to the "Invalid date format" error. </span></div>
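One way to make this kind of ordering bug impossible is to validate the raw input without touching the database at all, and to let the database driver handle quoting through a parameterized query. The sketch below illustrates that idea in Python with the standard sqlite3 module; it is an analogue for illustration, not the website's PHP code. The table name, column names, function names and the YYYY-MM-DD date format are all assumptions.

```python
import sqlite3
from datetime import datetime


def parse_birthdate(text):
    """Return a date if text is a valid YYYY-MM-DD date, otherwise None."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None


def update_birthdate(conn, user_id, raw_birthdate):
    """Validate first (no connection needed), then update the database."""
    birthdate = parse_birthdate(raw_birthdate)
    if birthdate is None:
        return "Invalid date format"
    # The placeholders let the driver quote the values, so no separate
    # escaping step has to run before validation.
    conn.execute(
        "UPDATE users SET birthdate = ? WHERE id = ?",
        (birthdate.isoformat(), user_id),
    )
    conn.commit()
    return "OK"
```

With this ordering, the validation result no longer depends on whether a database connection has been opened yet.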
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;"><br /></span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">Secondly, the exemplar recording page has been updated with the following features:</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">1) Automatically moving to the next utterance after the user records and plays back the current recording;</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">2) Adding extra navigation controls for recording phrase selection;</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">3) When the user opens the exemplar recording page, the first unrecorded utterance is the first one shown to the user;</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">4) Connecting the enabling and disabling of the player's recording and playback buttons to the database information, i.e. if the user has recorded the phrase before, both the recording and playback buttons are enabled; otherwise only recording is allowed.</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;"><br /></span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">The third major part done last week is the student page, which was previously left empty.</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">On the student page, users can now also practice their pronunciation by recording the phrases in the database and listening to the exemplar recordings in the system. The features are:</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">1) Full recording and playback functionality, as on the exemplar recording page;</span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">2) When the user navigates to a phrase, up to 5 exemplar recordings are retrieved at random from the database and listed on the page to help the students. </span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">3) Additionally, to put some exemplar recordings in the system, I had to manually transcribe several sentences and add the recordings myself. Once many people are contributing exemplar recordings, I will no longer need to do manual transcription.</span></div>
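The "at most 5 random exemplars per phrase" behaviour in feature 2 can be sketched in a few lines of Python. This is only an illustration; the function name is hypothetical, and the real site presumably performs the selection in its own server-side code or in the database query.

```python
import random


def pick_exemplars(recordings, limit=5):
    """Return up to `limit` recordings chosen at random, without repeats."""
    pool = list(recordings)
    # Never ask for more items than exist, so short lists are returned whole.
    return random.sample(pool, min(limit, len(pool)))
```

Capping the sample size at the number of available recordings matters early on, when a phrase may have fewer than 5 exemplars.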
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;"><br /></span></div>
<div style="font-family: arial; font-size: small;">
<span style="font-family: arial, sans-serif;">This week, there are two major tasks to be done: integration with Ronanki's evaluation scripts and the mid-term report. </span></div>
<div style="font-family: arial; font-size: small;">
<br /></div>
Anonymoushttp://www.blogger.com/profile/09375125981129389911noreply@blogger.com1tag:blogger.com,1999:blog-8012401148587899501.post-2400927883504234692012-08-21T09:45:00.001+05:302013-08-06T14:17:40.341+05:30Ronanki: GSoC 2012 Pronunciation Evaluation Final week Report<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">Here is my final report for the Pronunciation Evaluation project. The demo system has been modified slightly. You can try it and test the text-independent system @ <a href="http://talknicer.net/~ronanki/test">http://talknicer.net/~ronanki/test</a></span><br />
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
Last week, I tested the system with both Indian and US accents. For the US accent, I don't have any mispronunciation data, so I tested only with the SA1 and SA2 (TIMIT) sentences. For the Indian accent, I prepared a dataset with both correct pronunciations and mispronunciations, which can be downloaded at <a href="http://talknicer.net/~ronanki/Database.tar.tgz" style="color: #1155cc;" target="_blank">http://talknicer.net/~ronanki/Database.tar.tgz</a></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
The results are provided at <a href="http://talknicer.net/~ronanki/results/">http://talknicer.net/~ronanki/results/</a><span style="background-color: transparent;">. The scripts for evaluating the database are uploaded to the svn project folder. Phonological features are provided in svn, but I couldn't build models with them in time.</span></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
The project and the required scripts can be downloaded from<br />
<a href="http://sourceforge.net/p/cmusphinx/code/HEAD/tree/branches/speecheval/ronanki/">http://sourceforge.net/p/cmusphinx/code/HEAD/tree/branches/speecheval/ronanki/</a><br />
<span style="background-color: transparent;">Please go through the README files provided in each folder.</span></div>
<div>
<br /></div>
<div>
Finally, I would like to thank my mentor James Salsman, Nickolay, Biksha Raj and the rest of the community for helping me all the time. I hope to keep contributing to this project over time. </div>
</div>
srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com3tag:blogger.com,1999:blog-8012401148587899501.post-8735216802332102862012-08-21T09:42:00.002+05:302012-08-21T11:02:38.499+05:30Ronanki: GSoC 2012 Pronunciation Evaluation week 12<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: white;">This week, </span><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">I am trying to extend the TIMIT statistics to 5 or 6 per phoneme based on syllable position; alternatively, I can use CART modelling to predict duration and acoustic score from training data. I have done this to some extent using wagon in Speech Tools.</span><br />
<div>
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><br /></span></div>
<div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
Regarding mispronunciation detection accuracy, I collected data from 8 non-native speakers, with 5 words each recorded 10 times in both correct and incorrect ways and 5 sentences each recorded 3-5 times in both correct and incorrect ways. The database is available at <a href="http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/" style="color: #1155cc;" target="_blank">http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/</a> and its description is at <a href="http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/description.txt" style="color: #1155cc;" target="_blank">http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/description.txt</a></div>
</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
I need to split each speaker's data into individual files, which is tedious and is taking some time. So far, I have completed one speaker's data, and the current text-independent system is doing well: 46 of the 50 correctly pronounced words were detected as good pronunciations, and 42 of the 50 mispronounced words were detected as mispronunciations, using a common threshold for all words. It will take one or two more days to produce complete statistics. </div>
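The figures for this one speaker (46 of 50 correct words accepted, 42 of 50 wrong words flagged) can be turned into rates with a few lines of Python. The function and key names are hypothetical; only the four counts come from the experiment above.

```python
def detection_rates(correct_accepted, correct_total, wrong_flagged, wrong_total):
    """Summarize a threshold-based mispronunciation detector from raw counts."""
    return {
        # Correct pronunciations recognized as good.
        "correct_accept_rate": correct_accepted / correct_total,
        # Mispronunciations recognized as such.
        "mispronunciation_detect_rate": wrong_flagged / wrong_total,
        # All right decisions over all test words.
        "overall_accuracy": (correct_accepted + wrong_flagged)
        / (correct_total + wrong_total),
    }


rates = detection_rates(46, 50, 42, 50)
```

Here the two per-class rates are 46/50 and 42/50, and the overall accuracy is 88/100. These numbers cover a single speaker and a single common threshold, so they are preliminary.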
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
In parallel, I completed the phonological features and generated acoustic models for the TIMIT database, because I had difficulty finding the complete set of wav files for the WSJ database. However, both decoding and forced alignment failed with the new models trained on phonological features. I also failed to generate appropriate models with Sphinx MFC features: even though the models were generated without apparent errors, I got no results from the forced-alignment or decoding functions when I substituted them for the WSJ models. I will try to overcome these issues by next week.</div>
</div>
srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-83278063986194616692012-08-21T09:36:00.002+05:302012-08-21T09:36:38.241+05:30Ronanki: GSoC 2012 Pronunciation Evaluation Week 11<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">This week, I only managed to do the data collection required to evaluate the project.</span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><br /></span>
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">The database collection is complete, but the data is spread across different servers, and I am trying to bring it together in one place. You can find part of the data for one speaker here @ </span><a href="http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/Sru/" style="background-color: white; color: #1155cc; font-family: arial, sans-serif; font-size: 13px;" target="_blank">http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/Sru/</a><br />
<br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">The description of the data is at </span><a href="http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/description.txt" style="background-color: white; color: #1155cc; font-family: arial, sans-serif; font-size: 13px;" target="_blank">http://researchweb.iiit.ac.in/<wbr></wbr>~srikanth.ronanki/GSoC/PE_<wbr></wbr>database/description.txt</a> </div>
srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com2tag:blogger.com,1999:blog-8012401148587899501.post-2426015084922012592012-08-21T09:34:00.004+05:302012-08-21T11:03:14.021+05:30Ronanki: GSoC 2012 Pronunciation Evaluation Week 10<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">This week, I explored CART models a little, but couldn't complete them. The models are trained using wagon in Speech Tools with the following contextual information:</span><br />
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
Current phone, previous phone, next phone, syllable position, phonological features, phone type, etc.</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
The complete list of features is at the URL below:</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<a href="http://talknicer.net/~ronanki/dur.desc"><span style="color: #1155cc;">http://talknicer.net/~ronanki/</span><span style="background-color: transparent;">dur.desc</span></a></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
Once training is complete, the output is a tree. This tree is given, along with the test data, as input to wagon_test in Speech Tools, which then predicts the duration of each phone from its contextual information using the tree structure.</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br clear="all" />
<div>
Regarding replacing the traditional MFCC features with either PNCC or phonological features, I need to build acoustic models for the WSJ database using these features instead of MFCC. This is in progress, and once the acoustic models are built, the rest of the testing process is the same. </div>
<div>
<br /></div>
<div>
<b>Work to do:</b></div>
<div>
By next week, I should be able to complete one of these two, and the other one thereafter. In the final week, I will upload all code to svn and integrate these new techniques with the current working pronunciation evaluation model @ <a href="http://talknicer.net/~ronanki/" style="color: #1155cc;" target="_blank">http://talknicer.net/~ronanki/<wbr></wbr></a></div>
</div>
</div>
srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-18187140038055284032012-08-21T09:33:00.001+05:302012-08-21T09:33:15.208+05:30Ronanki: GSoC 2012 Pronunciation Evaluation week 9<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
This week, I finished my random-phrase pronunciation evaluation, which is now in the testing phase @ <a href="http://talknicer.net/~ronanki/test/index.html"><span style="color: #1155cc;"><span style="background-color: white;">http://talknicer.net/~ronanki/</span></span>test/index.html</a></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
The system can provide evaluation scores for any random sentence. It also gives feedback on mispronunciation and on the rate of duration at the word level. Please test the system and mail me any bugs you find. Please avoid entering proper nouns and punctuation marks while testing the system.</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
To do this, I evaluated the entire TIMIT dataset and computed the statistics for each phone at three positions: </div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Begin/Middle/End (0/1/2). The count in the last column is the number of times each phone occurred at each position. <span style="background-color: white;">The statistics are @ </span><a href="http://talknicer.net/~ronanki/phrase_data/statistics/TIMIT_statistics.txt"><span style="color: #1155cc;"><span style="background-color: white;">http://talknicer.net/~ronanki/</span></span>phrase_data/statistics/TIMIT_statistics.txt</a></div>
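A minimal Python sketch of how such position-dependent statistics could be accumulated is given below. It is an illustration under stated assumptions, not the script used for TIMIT: the input format (each word as a list of phone and duration pairs), the function names, and the choice to treat a single-phone word as "begin" are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def phone_position(index, n_phones):
    """Return 0 (begin), 1 (middle) or 2 (end) for a phone within a word."""
    if index == 0:
        return 0  # Assumption: a single-phone word counts as "begin".
    if index == n_phones - 1:
        return 2
    return 1


def build_statistics(words):
    """Collect (mean, deviation, count) of durations per (phone, position).

    `words` is an iterable of words, each a list of (phone, duration) pairs.
    """
    samples = defaultdict(list)
    for phones in words:
        for i, (phone, duration) in enumerate(phones):
            samples[(phone, phone_position(i, len(phones)))].append(duration)
    return {
        key: (mean(values), pstdev(values), len(values))
        for key, values in samples.items()
    }
```

The same accumulation would apply to acoustic scores; the count stored with each entry corresponds to the last column described above.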
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Next week, I am going to implement CART models so that each phone can be compared with the corresponding phone in a better-matched context. Regarding features, I studied Power Normalized Cepstral Coefficients (PNCC), which are more robust for speech recognition, even in noisy environments. PNCC are 13-dimensional and computationally more expensive than MFCC, but perform better than MFCC in speech recognition. I downloaded the available MATLAB code @ <a href="http://www.cs.cmu.edu/~robust/archive/algorithms/PNCC_IEEETran/"><span style="color: #1155cc;">http://www.cs.cmu.edu/~</span>robust/archive/algorithms/PNCC_IEEETran/</a> and am trying some experiments on the nTIMIT database. I also implemented a phonological mapping from the current spectral features (MFCC) using an ANN. Currently, I am testing speech recognition using all these features. </div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
</div>
</div>
srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-34411887695989302432012-08-21T09:26:00.001+05:302012-08-21T09:26:05.838+05:30Ronanki: GSoC 2012 Pronunciation Evaluation week 8<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">This week, I mainly concentrated on integrating everything with web @</span><br />
<a href="http://talknicer.net/~ronanki/test/"><span style="color: #1155cc; font-family: arial, sans-serif; font-size: x-small;"><span style="text-align: -webkit-auto;">http://talknicer.net/~ronanki/</span></span>test/</a></div>
<br style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;" />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;"><b>The following are the ones which are integrated:</b></span><br />
<br style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;" />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">1. File upload option with different formats (wav/wma/mp3) is provided.</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">2. All test cases are evaluated during recording, and only </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">recordings that are close to the perfect case are allowed.</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">3. The calculate score button provides the feedback page @ </span><a href="http://talknicer.net/~ronanki/test/scores_page.html"><span style="color: #1155cc; font-family: arial, sans-serif; font-size: x-small;"><span style="text-align: -webkit-auto;">http://talknicer.net/~ronanki/</span></span>test/scores_page.html</a> <span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">(still some of the columns in UI are under construction)</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">4. Phrase entry as per user's choice and then score calculation page @ </span><a href="http://talknicer.net/~ronanki/test/random.html"><span style="color: #1155cc; font-family: arial, sans-serif; font-size: x-small;"><span style="text-align: -webkit-auto;">http://talknicer.net/~ronanki/</span></span>test/random.html</a><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;"> is also under </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">construction.</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">5. As of now, statistics for random phrase entry as per user's choice </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">are derived from TIMIT database which covers 630 speakers with 10 </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">recordings from each one.</span><br />
<br style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;" />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;"><b>Next Tasks:</b></span><br />
<br style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;" />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">1. Feature extraction (Power-Normalized Cepstral Coefficients and </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">phonological features)</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">2. CART models (for efficient score calculation in random phrase </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">method based on contextual information)</span><br />
<br style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;" />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">For the pages under construction, the back-end code has been developed </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">and uploaded to SourceForge. Only the web pages still need to be built </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">dynamically. That will be done in parallel with the next tasks listed above.</span>
</div>
srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-82957614557512686612012-08-21T09:24:00.004+05:302012-08-21T09:24:39.852+05:30Ronanki: GSoC 2012 Pronunciation Evaluation week 7<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Last week, I spent the first few days continuing to work on spectral features and phonological features, and on mapping between them with neural network training. Based on the forced alignment (or on manual labels, if they exist), the phonological features <a href="http://talknicer.net/~ronanki/phonological_features/feature_stream"><span style="color: #1155cc;">http://talknicer.net/</span>~ronanki/phonological_features/feature_stream</a> for each phone in a phrase are repeated against its spectral features. I am looking at the CSLU Toolkit, which uses a neural net for feature-to-diphone decoding, and I have paused at that point to pick it up again after the mid-term evaluation.</div>
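<div>
As a rough illustration of repeating phone-level phonological features against frame-level spectral features, here is a minimal Python sketch. The segment format (phone, first frame, last frame) and the feature vectors are assumptions made for the example; they are not taken from the feature_stream file.</div>

```python
def expand_to_frames(segments, phone_features):
    """Repeat each phone's phonological feature vector once per frame.

    segments: list of (phone, start_frame, end_frame), end inclusive
              (hypothetical layout of a forced-alignment segmentation).
    phone_features: dict mapping phone -> list of feature values
                    (hypothetical values, for illustration only).
    """
    frames = []
    for phone, start, end in segments:
        for _ in range(end - start + 1):
            # Copy the vector so later edits to one frame do not affect others.
            frames.append(list(phone_features[phone]))
    return frames


if __name__ == "__main__":
    segs = [("W", 0, 2), ("IH", 3, 4)]
    feats = {"W": [1, 0], "IH": [0, 1]}
    for row in expand_to_frames(segs, feats):
        print(row)
```

<div>
The result has one row per frame, so it can be paired line by line with a spectral feature matrix of the same length.</div>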
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Later, I worked on integrating acoustic/duration scores and edit-distance grammar decoding into the current website for exemplar outlier analysis.<br />
<br />
I tried many test cases, such as:<br />
1. Silence<br />
2. Noisy speech<br />
3. Junk speech<br />
4. Random sentence<br />
5. Actual sentence cut short at the end<br />
6. Actual sentence with the beginning skipped<br />
<br />
In test cases 1-6, the forced alignment did not reach the final state and failed to create the phone segmentation file and the label file, which contain the acoustic scores and the phone labels respectively.<br />
<br />
7. Actual sentence<br />
8. Actual sentence with extra silence at both the beginning and the end<br />
9. Actual sentence with one small word skipped in the middle of the phrase<br />
10. Similar-sounding sentences, such as:</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Ex. Utterance: </div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Approach the teaching of pronunciation with more confidence</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Similar-sounding sentences tested: </div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
a. Approach the teaching opponents the nation with over confidence</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
b. Approach the preaching opponents the nation with confidence</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
In test cases 7-10, the forced alignment worked and generated acoustic scores and phone labels. I then moved on to testing the accuracy of edit-distance grammar decoding on cases 7-10, so that I can set a threshold parameter to distinguish cases (7, 8, 9) from case (10).<br />
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Earlier, I tested cases 7 and 8 with the phrase decoder in edit-distance decoding and reported an accuracy of around 73%. For case 10 the accuracy is below 40%, so I can easily set the threshold parameter, for example: accept = (accuracy &gt; 0.4) ? T : F</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
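<div>
The threshold test described above can be sketched in Python as follows. How the accuracy is actually computed is not stated in this post, so the position-by-position comparison in phone_accuracy is an assumption, as are both function names; only the 0.4 threshold comes from the text.</div>

```python
def phone_accuracy(expected, decoded):
    """Fraction of positions where the decoded phone equals the expected one.

    Assumes both sequences have the same length, which holds when the
    grammar only allows substitutions (an assumption for this sketch).
    """
    if len(expected) != len(decoded):
        raise ValueError("sequences must have the same length")
    if not expected:
        return 0.0
    matches = sum(1 for e, d in zip(expected, decoded) if e == d)
    return matches / len(expected)


def is_intended_phrase(accuracy, threshold=0.4):
    """Accept the recording as the intended phrase if accuracy exceeds the threshold."""
    return accuracy > threshold


if __name__ == "__main__":
    acc = phone_accuracy(["W", "IH", "TH"], ["W", "IY", "TH"])
    print(acc, is_intended_phrase(acc))
```

<div>
With the figures reported above, an accuracy of 0.73 would be accepted and anything below 0.40 would be rejected.</div>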
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
I also discussed with my mentor, James Salsman, how to weight words by part of speech in the phrase output score. Here is what he proposed; all the units are in dB, representing relative loudness in English.</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
my (%wt, %pos); # scoring weights and names of parts of speech</div>
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'q'} = 1.0; $pos{'q'} = 'quantifier';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'n'} = 0.9; $pos{'n'} = 'noun';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'v'} = 0.9; $pos{'v'} = 'verb';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'-'} = 0.8; $pos{'-'} = 'negative';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'w'} = 0.8; $pos{'w'} = 'adverb';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'m'} = 0.8; $pos{'m'} = 'adjective';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'o'} = 0.7; $pos{'o'} = 'pronoun';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'s'} = 0.6; $pos{'s'} = 'possessive';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'p'} = 0.6; $pos{'p'} = 'preposition';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'c'} = 0.5; $pos{'c'} = 'conjunction';</span><br />
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">$wt{'a'} = 0.4; $pos{'a'} = 'article';</span><br />
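<div>
To show how such weights might enter a phrase score, here is a minimal Python sketch. The weight table is copied from the list above; combining word scores as a weighted average is an assumption, since the post does not say how the weights are applied, and weighted_phrase_score is a hypothetical name.</div>

```python
# Part-of-speech weights copied from the proposal above.
POS_WEIGHTS = {
    "q": 1.0,  # quantifier
    "n": 0.9,  # noun
    "v": 0.9,  # verb
    "-": 0.8,  # negative
    "w": 0.8,  # adverb
    "m": 0.8,  # adjective
    "o": 0.7,  # pronoun
    "s": 0.6,  # possessive
    "p": 0.6,  # preposition
    "c": 0.5,  # conjunction
    "a": 0.4,  # article
}


def weighted_phrase_score(word_scores):
    """Weighted average of word scores (an assumed combination rule).

    word_scores: list of (score, pos_code) pairs, one per word.
    """
    if not word_scores:
        raise ValueError("at least one word is required")
    total_weight = sum(POS_WEIGHTS[pos] for _, pos in word_scores)
    weighted_sum = sum(score * POS_WEIGHTS[pos] for score, pos in word_scores)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # A well-pronounced noun and a badly pronounced article.
    print(weighted_phrase_score([(1.0, "n"), (0.0, "a")]))
```

<div>
Under this rule a poorly pronounced article lowers the phrase score less than a poorly pronounced noun would.</div>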
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<span style="color: #222222; font-family: arial, sans-serif;"><br /></span></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<span style="color: #222222; font-family: arial, sans-serif;">I hope to integrate everything and launch the site before the mid-evaluation submission.</span></div>
<div>
<span style="color: #222222; font-family: arial, sans-serif;"><br /></span></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
</div>
</div>
srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-68653753830471675062012-08-21T09:24:00.000+05:302012-08-21T09:24:00.070+05:30Ronanki: GSoC 2012 Pronunciation Evaluation Week 6<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
I have uploaded all my code (except for a few scripts still in progress) here: </div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<a href="http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/" target="_blank"><span style="color: #1155cc;">http://cmusphinx.svn.</span>sourceforge.net/viewvc/cmusphinx/branches/speecheval/<wbr></wbr>ronanki/scripts/</a>. Please follow the README files in each folder for detailed instructions on how to use them. </div>
</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<span class="Apple-style-span" style="border-collapse: collapse;"></span><br />
<div>
This week, I concentrated on new features for speech recognition. I read a paper on Power-Normalized Cepstral Coefficients [1], which make speech recognition more robust, and a few papers on phonological features [2],[3]. I hope to investigate mapping the acoustic speech features of each phoneme, derived from machine phonetic transcription, to phonological features. Using this mapping, mispronunciations at the phone level can be identified using phonological features along with acoustic pronunciation scores and edit distances. I have put an initial mapping, based on those papers, at <a href="http://talknicer.net/~ronanki/phonological_features/"><span style="color: #1155cc;">http://talknicer.net/~</span>ronanki/phonological_features/</a>.</div>
</div>
<div>
<br /></div>
<div>
<b>Ongoing tasks:</b></div>
<div>
1. <span style="color: #222222; font-family: arial, sans-serif; font-size: 13px;">In the random phrase scoring method, another column has been added to store the position of each phone within its word (begin/middle/end), so that each phone has three sets of statistics</span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px;"> </span><br />
<a href="http://talknicer.net/~ronanki/phrase_data/all_phrases_stats_position">http://talknicer.net/~ronanki/phrase_data/all_phrases_stats_position</a>
</div>
<div>
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px;">2. Standard word scores are derived</span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px;"> </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px;">along with phoneme standard (acoustic + duration) scores in the current forced-alignment.</span></div>
<div>
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px;">3. Linking the edit-distance algorithm with the pronunciation evaluation website</span></div>
<div>
4. Complete a full-fledged website at <a href="http://talknicer.net/~ronanki/test/"><span style="color: #1155cc;">http://talknicer.net/~</span>ronanki/test/</a> with all test cases (junk speech, silence, misreadings, etc.) before the mid-evaluation, and publicize the system so that it can be tested by a large number of users. </div>
<div>
<br /></div>
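<div>
The begin/middle/end column from task 1 could be produced along these lines. This is a minimal Python sketch; the function name and the choice to label a one-phone word as "begin" are assumptions, not something taken from the uploaded scripts.</div>

```python
def label_positions(word_phones):
    """Tag each phone of a word with its position: begin, middle or end.

    A single-phone word is labeled "begin" here; that tie-break is an
    assumption made for this sketch.
    """
    last = len(word_phones) - 1
    labeled = []
    for index, phone in enumerate(word_phones):
        if index == 0:
            position = "begin"
        elif index == last:
            position = "end"
        else:
            position = "middle"
        labeled.append((phone, position))
    return labeled


if __name__ == "__main__":
    print(label_positions(["W", "IH", "TH"]))
```

<div>
Keying the per-phone statistics on the (phone, position) pair instead of the phone alone then gives up to three sets of statistics per phone.</div>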
<div>
<b>References:</b></div>
<div>
<br /></div>
<div>
<div>
<div>
[1] Chanwoo Kim and Richard M. Stern, "Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition", ICASSP 2012.</div>
</div>
<div>
<a href="http://www.cs.cmu.edu/~chanwook/MyPapers/ICASSP2012_PNCC_ver09.pdf"><span style="color: #1155cc;">http://www.cs.cmu.edu/~</span>chanwook/MyPapers/ICASSP2012_PNCC_ver09.pdf</a></div>
</div>
<div>
<br /></div>
<div>
[2] Katrin Kirchhoff, Gernot A. Fink, Gerhard Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition", Speech Communication 37 (2002) 303–319.</div>
<div>
<div>
<a href="http://www.sciencedirect.com/science/article/pii/S0167639301000206"><span style="color: #1155cc;">http://www.sciencedirect.com/</span>science/article/pii/S0167639301000206</a></div>
</div>
<div>
<br /></div>
<div>
[3] <span style="color: #888888; font-family: arial, sans-serif; font-size: 13px;"> </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px;">S. King and P. Taylor, “Detection of phonological features in continuous speech </span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px;">using neural networks,” Computer Speech and Language, vol. 14, no. 4, pp. 333–</span><span style="color: #222222; font-family: arial, sans-serif; font-size: 13px;">353, 2000.</span></div>
<div>
<a href="http://www.era.lib.ed.ac.uk/bitstream/1842/1001/1/King_Taylor_csl2000.pdf"><span style="color: #1155cc;">http://www.era.lib.ed.ac.uk/</span>bitstream/1842/1001/1/King_Taylor_csl2000.pdf</a></div>
</div>
srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-69266793309977229842012-07-14T01:19:00.003+05:302012-07-14T01:19:33.283+05:30Daily progress reports in comments hereQuick mentor note: We are converting to daily progress reports which I will combine into draft blog posts that the students will proofread, copy-edit, and approve for publication. This will help keep all three of us on schedule. Sorry I am behind. The good news is that both students made it from "on schedule" to "ahead of schedule" in a sprint for the evaluations. <br />
<br />
Congratulations, Troy and Ronanki!<br />
<br />
Please post your daily-ish (4 or more per week) progress reports here. Thanks!James Salsmanhttp://www.blogger.com/profile/11470879525772180030noreply@blogger.com7tag:blogger.com,1999:blog-8012401148587899501.post-1128772740997271132012-07-10T12:25:00.002+05:302012-07-10T12:25:34.197+05:30Troy: GSoC 2012 Pronunciation Evaluation Week 5<div class="gmail_quote">
Sorry for the late update. The following are the things I did in Week 5; mainly problem solving.<br />
<div>
<br /></div>
<div>
1) Solving a problem caused by a Flash player update, which prevented users of the Flash-based recorder from using their microphones. </div>
<div>
<br /></div>
<div>
Before the Flash player 11.2 and 11.3 updates, the audio recorder I created using Flex worked fine. Users could simply right-click the recorder and select "Settings" to allow microphone access. However, with the new updates, that option is disabled without any error message. </div>
<div>
<br /></div>
<div>
To solve this problem, people suggested adding websites to the online global privacy list. However, even after many attempts, that did not work for the audio recorder. </div>
<div>
<br /></div>
<div>
Furthermore, <a href="http://englishcentral.com/" target="_blank">http://englishcentral.com/</a>, which also uses Flash-based recording, opens a popup window from its recording button (a microphone image) showing the Flash microphone privacy settings dialogue. Checking in code whether the microphone is accessible, and prompting with the settings dialogue when necessary, provides the solution:</div>
<div>
<br /></div>
<div>
First, check whether a microphone is available. If not, show the Flash object's microphone list dialogue and ask the user to plug in a microphone:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: 'courier new', monospace;">var mic:Microphone = Microphone.getMicrophone();</span></div>
<div>
<span style="font-family: 'courier new', monospace;"><br /></span></div>
<div>
<span style="font-family: 'courier new', monospace;">if(!mic) {</span></div>
<div>
<span style="font-family: 'courier new', monospace;"><span style="white-space: pre-wrap;"> </span>Alert.show("No microphone available");</span></div>
<div>
<span style="font-family: 'courier new', monospace;"><span style="white-space: pre-wrap;"> </span>debug("No microphone available");</span></div>
<div>
<span style="font-family: 'courier new', monospace;"><span style="white-space: pre-wrap;"> </span>Security.showSettings("microphone");</span></div>
<div>
<span style="font-family: 'courier new', monospace;">}</span></div>
</div>
<div>
<br /></div>
<div>
Otherwise, check whether the microphone is accessible. If it is muted, show the privacy dialogue to ask the user to allow microphone access:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: 'courier new', monospace;">if(mic.muted) {</span></div>
<div>
<span style="font-family: 'courier new', monospace;"><span style="white-space: pre-wrap;"> </span>debug("Microphone muted!");</span></div>
<div>
<span style="font-family: 'courier new', monospace;"><span style="white-space: pre-wrap;"> </span>Security.showSettings("privacy");</span></div>
<div>
<span style="font-family: 'courier new', monospace;">}</span></div>
</div>
<div>
<br /></div>
<div>
With these checks in the initialization stage of the Flash recorder, users can enable microphone access right at the start. Interestingly, after doing this, the "Settings" option of the Flash object is clickable again. </div>
<div>
<br /></div>
<div>
Looking back, the code that solves the problem seems obvious, but it was really hard to find before knowing the answer. </div>
<div>
<br /></div>
<div>
2) Cross-browser Flash recorder compatibility</div>
<div>
<br /></div>
<div>
With the Flash recorder problem solved as above, I happily updated the source code in the trunk and on our server, hoping to see the site working nicely. But the browser showed that the Flash recorder could not load, and the only information I got was "Error 2046".</div>
<div>
<br /></div>
<div>
To try to solve this problem, I searched through a number of pages and tried several suggestions. The first was to clear the browser cache, then set the Flash player to not save a local cache, and then re-enable its local cache (a way of clearing the Flash player's local cache). That made some progress, changing "Error 2046" to "Error 2032". </div>
<div>
<br /></div>
<div>
For "Error 2032", there are mainly two groups of explanations. One says that something is wrong with the URLs in ActionScript's HTTPRequests, which seems unlikely because those URLs are definitely correct and are in the same folder as the player. The other points to a runtime shared library (RSL) problem with the mxmlc Flash compiler. To solve the RSL linkage problem, go to the "Flex Build Path" properties page, open the "Library path" tab, and change the framework linkage to "Merged into code". <br />
<br />
[Mentor note: Requesting compatibility with earlier versions of Flash ActionScript using compiler switches may or may not help here.]</div>
<div>
<br /></div>
<div>
3) Adding a password change page</div>
<div>
<br /></div>
<div>
4) Refining the user extra information update page to reflect the existing user information if available, instead of always showing the default values.</div>
<div>
<br /></div>
<div>
The website for exemplary recordings is now at a usable stage. </div>
<div>
<br /></div>
<div>
This week, I will try to accomplish the following:</div>
<div>
<br /></div>
<div>
1) Phrase data entry for administrators (with text, exemplar pronunciations, homograph disambiguation, phonemes, parts of speech per word, etc.);</div>
<div>
<br /></div>
<div>
2) Design recording prompts to start our exemplary recording data collection;</div>
<div>
<br /></div>
<div>
3) Bug fixing and system testing;</div>
<div>
<br /></div>
<div>
4) Study Amazon Mechanical Turk and start thinking about how to move our speech data collection onto that platform.</div>
<div>
<br /></div>
<div>
<br /></div>
</div>
<br />Anonymoushttp://www.blogger.com/profile/09375125981129389911noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-62552138138694561182012-07-10T12:10:00.000+05:302012-07-10T12:10:06.556+05:30Ronanki: GSoC 2012 Pronunciation Evaluation Week 5<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">The basic scoring routine for the pronunciation evaluation system is now available at </span><a href="http://talknicer.net/~ronanki/test/" style="background-color: white; color: #1155cc; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;" target="_blank">http://talknicer.net/~<wbr></wbr>ronanki/test/</a><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">. The output is generated for each phoneme in the phrase and displays the total score.</span><br />
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
These are the things I've accomplished in the fifth week of GSoC 2012:</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b>1. Edit-distance neighbor grammar generation:</b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Earlier, I did this with: </div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
(a) a single-phone decoder<br />
<a href="http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_1phone.txt" style="color: #1155cc;" target="_blank">http://talknicer.net/~<wbr></wbr>ronanki/phrase_data/results_<wbr></wbr>edit_distance/output_1phone.<wbr></wbr>txt</a></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br />
(b) a three-phone decoder (contextual)<br />
<a href="http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_3phones.txt" style="color: #1155cc;" target="_blank">http://talknicer.net/~<wbr></wbr>ronanki/phrase_data/results_<wbr></wbr>edit_distance/output_3phones.<wbr></wbr>txt</a></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br />
(c) an entire phrase decoder with neighboring phones<br />
<a href="http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_compgram.txt" style="color: #1155cc;" target="_blank">http://talknicer.net/~<wbr></wbr>ronanki/phrase_data/results_<wbr></wbr>edit_distance/output_compgram.<wbr></wbr>txt</a></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
This week, I added two more decoders: a word decoder and a complete phrase decoder that varies one phoneme at a time.</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b>word-decoder:</b> I used sox to split each wav file into words based on forced-alignment output and then presented each word as follows. </div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b>Ex:</b> word - "with" is presented as</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
public &lt;phonelist&gt; = ( (W | L | Y) (IH) (TH) );</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
public &lt;phonelist&gt; = ( (W) (IH | IY | AX | EH) (TH) );</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
public &lt;phonelist&gt; = ( (W) (IH) (TH | S | DH | F | HH) ); </div>
<br />
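<div>
The three grammar rules for "with" shown above follow a simple pattern: one rule per phone, where that phone is replaced by itself plus its neighbors and the other phones stay fixed. A minimal Python sketch of that pattern follows. The NEIGHBORS table contains only the alternatives visible in the example, and the function name is hypothetical; the real scripts in neighborphones_decode may work differently.</div>

```python
# Neighboring phones taken only from the "with" example; the full
# table used by the project is not reproduced here.
NEIGHBORS = {
    "W": ["L", "Y"],
    "IH": ["IY", "AX", "EH"],
    "TH": ["S", "DH", "F", "HH"],
}


def neighbor_grammars(phones, neighbors):
    """Return one JSGF rule per phone, varying only that phone."""
    rules = []
    for target in range(len(phones)):
        parts = []
        for index, phone in enumerate(phones):
            if index == target:
                # The expected phone comes first, then its neighbors.
                alternatives = [phone] + neighbors.get(phone, [])
                parts.append("(" + " | ".join(alternatives) + ")")
            else:
                parts.append("(" + phone + ")")
        rules.append("public <phonelist> = ( " + " ".join(parts) + " );")
    return rules


if __name__ == "__main__":
    for rule in neighbor_grammars(["W", "IH", "TH"], NEIGHBORS):
        print(rule)
```

<div>
Each rule would go into its own JSGF file, since the rules share the name phonelist.</div>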
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
The accuracy turned out to be better than with the single-phone and three-phone decoders, and the same as with the entire phrase decoder. The output for a sample test phrase is at <a href="http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_words.txt" style="color: #1155cc;" target="_blank">http://talknicer.net/~<wbr></wbr>ronanki/phrase_data/results_<wbr></wbr>edit_distance/output_words.txt</a></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
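<div>
For the word decoder's splitting step (sox plus forced-alignment boundaries), the commands could be built roughly as below. This Python sketch only constructs the command lines and does not run them. The segment layout (word, first frame, last frame), the rate of 100 frames per second, and the output file naming are assumptions; "trim" with a start and a length is the standard sox effect for cutting a section out of a file.</div>

```python
def sox_trim_commands(wav_path, word_segments, out_dir, frame_rate=100):
    """Build one `sox ... trim start length` command per word.

    word_segments: list of (word, start_frame, end_frame), end inclusive
                   (assumed layout of the forced-alignment output).
    frame_rate: frames per second of the alignment (assumed to be 100).
    """
    commands = []
    for index, (word, start_frame, end_frame) in enumerate(word_segments):
        start = start_frame / frame_rate
        length = (end_frame - start_frame + 1) / frame_rate
        out_path = f"{out_dir}/{index:03d}_{word}.wav"
        commands.append(
            ["sox", wav_path, out_path, "trim", f"{start:.2f}", f"{length:.2f}"]
        )
    return commands


if __name__ == "__main__":
    for cmd in sox_trim_commands("in.wav", [("with", 120, 149)], "out"):
        print(" ".join(cmd))
```

<div>
Each command list can be passed to subprocess.run once sox is installed and the output directory exists.</div>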
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b>Complete phrase decoder using each phoneme: </b>This is similar to the entire phrase decoder, but this time I supplied neighboring phones for one phoneme at a time and kept the rest of the phonemes in the phrase fixed. It is not a good approach in terms of speed, since decoding takes longer, but the accuracy is better than with all the previous methods. The output is at <a href="http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_phrases.txt" style="background-color: white; text-align: left;">http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_phrases.txt</a><br />
<br />
The code for the above methods has been uploaded to the CMUSphinx SourceForge repository at <a href="http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/neighborphones_decode/" style="background-color: white;">http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/neighborphones_decode/</a></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br />
Please follow the README file in each folder for detailed instructions on how to use them.<br />
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b>2. Scoring paradigm:</b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b><br /></b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b>Phrase_wise:</b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
The current basic scoring routine, which is deployed at <a href="http://talknicer.net/~ronanki/test/" style="color: #1155cc;" target="_blank">http://talknicer.net/~<wbr></wbr>ronanki/test/</a>, aligns the test recording with the utterance using forced alignment in Sphinx and generates a phone segmentation file. Each phoneme in the file is then compared with the mean and standard deviation of the respective phone in phrase_statistics (<a href="http://talknicer.net/~ronanki/phrase_data/phrase1_stats.txt" style="color: #1155cc;" target="_blank">http://talknicer.net/~<wbr></wbr>ronanki/phrase_data/phrase1_<wbr></wbr>stats.txt</a>), and standard scores are calculated from the z-scores of acoustic_score and duration.</div>
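<div>
The z-score comparison described above can be sketched in Python as follows. The dictionary keys and function names are hypothetical, and the post does not say how the two z-scores are combined into one standard score, so this sketch simply returns both.</div>

```python
def z_score(value, mean, std):
    """Standard z-score; returns 0.0 when the deviation is zero."""
    if std == 0:
        return 0.0
    return (value - mean) / std


def phone_z_scores(acoustic_score, duration, stats):
    """Compare one test phone with the exemplar statistics for that phone.

    stats: dict with keys mean_score, std_score, mean_duration and
           std_duration (hypothetical names for this sketch).
    """
    z_acoustic = z_score(acoustic_score, stats["mean_score"], stats["std_score"])
    z_duration = z_score(duration, stats["mean_duration"], stats["std_duration"])
    return z_acoustic, z_duration


if __name__ == "__main__":
    exemplar = {
        "mean_score": -150.0,
        "std_score": 50.0,
        "mean_duration": 6.0,
        "std_duration": 2.0,
    }
    print(phone_z_scores(-250.0, 10.0, exemplar))
```

<div>
A phone whose z-scores lie far from zero differs strongly from the exemplar recordings in acoustic score or in duration.</div>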
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b><br /></b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b>Random_phrase:</b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
I also derived statistics (mean score, standard deviation of the score, mean duration) for each phone in the CMU phone set, irrespective of context, using the exemplar recordings for all three phrases (<a href="http://talknicer.net/~ronanki/phrase_data/phrases.txt" style="color: #1155cc;" target="_blank">http://talknicer.net/~<wbr></wbr>ronanki/phrase_data/phrases.<wbr></wbr>txt</a>) that I have so far. So, given a test utterance, I can compare each phone in the random phrase with the statistics for that phone. </div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
Statistics are at: <a href="http://talknicer.net/~ronanki/phrase_data/all_phrases_stats" style="color: #1155cc;" target="_blank">http://talknicer.net/~<wbr></wbr>ronanki/phrase_data/all_<wbr></wbr>phrases_stats</a> (the count column gives the number of times each phone occurred)</div>
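<div>
Context-independent statistics of this kind could be gathered roughly as in the Python sketch below. The input layout (phone, acoustic score, duration) and the use of the population standard deviation are assumptions; the actual all_phrases_stats file may have been computed differently.</div>

```python
from collections import defaultdict
from statistics import mean, pstdev


def phone_statistics(segments):
    """Aggregate per-phone statistics over all exemplar segments.

    segments: iterable of (phone, acoustic_score, duration) tuples
              (assumed layout for this sketch).
    Returns a dict: phone -> count, mean score, std score, mean duration.
    """
    grouped = defaultdict(list)
    for phone, score, duration in segments:
        grouped[phone].append((score, duration))

    stats = {}
    for phone, rows in grouped.items():
        scores = [score for score, _ in rows]
        durations = [duration for _, duration in rows]
        stats[phone] = {
            "count": len(rows),
            "mean_score": mean(scores),
            # Population standard deviation; 0.0 for a single occurrence.
            "std_score": pstdev(scores),
            "mean_duration": mean(durations),
        }
    return stats


if __name__ == "__main__":
    data = [("IH", -100.0, 5), ("IH", -200.0, 7), ("TH", -50.0, 4)]
    for phone, row in phone_statistics(data).items():
        print(phone, row)
```

<div>
Phones with a low count have unreliable statistics, which is one reason a larger exemplar set is needed.</div>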
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b>Things to do in the upcoming week:</b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b><br /></b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
1. Use of an edit-distance grammar to derive standard scores so that only a minimal effective training data set is required. [Mentor note: was "no training data," which is excluded.]</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
2. Use of the same grammar to detect words that have two different correct pronunciations (e.g., READ/RED)</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
3. In a random phrase scoring method, another column can be added to store the position of each phone with respect to word (or SILence) such that each phone will have three statistics and can be compared better with the exemplar phonemes based on position.</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
4. Link all those modules to try to match experts' scores.</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
5. Provide feedback to the user with underlined mispronunciations, or numerical labels.</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b>Future tasks:</b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
<b><br /></b></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
1. Use of CART models in training to better match the statistics for each phoneme in the test utterance with the training data, based on contextual information</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
2. Use of phonological (power normalized cepstral?) features instead of mel-cepstral features, which are expected to better represent the state of pronunciation.</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">
3. Develop a complete web-based system so that end users can test their pronunciation efficiently.</div>
</div>srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-86801000658877371792012-07-04T09:44:00.001+05:302012-07-04T11:20:45.779+05:30Ronanki: GSoC 2012 Pronunciation Evaluation Week 4<div dir="ltr" style="text-align: left;" trbidi="on">
The source code for the functions below has been uploaded to <a href="http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/">http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/</a><br />
<span style="background-color: white;">Here are some brief notes on how to use those programs:</span><br />
<br />
<b>Method 1: (phoneme decode)</b><br />
<b>Path:</b><br />
neighborphones_decode/one_phoneme/<br />
<b style="background-color: white;">Steps To Run:</b><br />
1. Use split_wav2phoneme.py to split a sample wav file into individual phoneme wav files<br />
Usage: python split_wav2phoneme.py <input_phoneseg_file> <complete_phone_list> <input_wav_file> <out_split_dir><br />
2. Create split.ctl file using extracted split_wav directory<br />
3. Run feature_extract.sh program to extract features for individual phoneme wav files<br />
4. Java Speech Grammar Format (JSGF) files are already created in FSG_phoneme<br />
5. Run jsgf2fsg.sh in FSG_phoneme to convert from jsgf to fsg.<br />
6. Run decode_1phoneme.py to get the required output in output_decoded_phones.txt<br />
<span style="background-color: white;">Usage: python decode_1phoneme.py <input_split_ctl_file> <output_phone_file></span><br />
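<br />
Step 2 has no script of its own in the notes above. Here is a minimal sketch of how split.ctl might be generated, assuming the split directory holds one .wav file per phoneme and that the control file lists one utterance ID per line without the file extension. The function name write_ctl and the sorted ordering are assumptions for illustration, not part of the uploaded scripts:<br />

```python
import os


def write_ctl(split_dir, ctl_path):
    """Write one utterance ID (file name without .wav) per line."""
    ids = sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(split_dir)
        if name.lower().endswith(".wav")
    )
    with open(ctl_path, "w") as ctl:
        for utt_id in ids:
            ctl.write(utt_id + "\n")
    return ids
```

If the phoneme files must stay in utterance order, the file names need a sortable prefix (for example a zero-padded index), since this sketch sorts them lexicographically.<br />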
<br />
<b>Method 2: (Three phones decode)</b><br />
<b style="background-color: white;">Path:</b><span style="background-color: white;"> </span><br />
neighborphones_decode/three_phones/<br />
<b style="background-color: white;">Steps To Run:</b><br />
1. Use split_wav2threephones.py to split a sample wav file into individual wav files of three phones each, with the other two phones serving as contextual information for the target phone.<br />
Usage: python split_wav2threephones.py <input_phoneseg_file> <ngb_key_mapper> <input_wav_file> <out_split_dir><br />
2. Create split.ctl file using extracted split_wav directory<br />
3. Run feature_extract.sh program to extract features for individual phoneme wav files<br />
4. Java Speech Grammar Format (JSGF) files are already created in FSG_phoneme<br />
5. Run jsgf2fsg.sh in FSG_phoneme to convert from jsgf to fsg.<br />
6. Run decode_3phones.py to get the required output in output_decoded_phones.txt<br />
<span style="background-color: white;">Usage: python decode_3phones.py <input_split_ctl_file> <output_phone_file></span><br />
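<br />
A minimal sketch of how the three-phone window and its grammar might be chosen, assuming the position rule from the Week 3 experiments (first phone of a word: current plus the next two; last phone: the previous two plus current; otherwise previous, current, next). The function names and the neighbor map are hypothetical and are not taken from split_wav2threephones.py:<br />

```python
def three_phone_window(word_phones, index):
    """Return (start, end) slice bounds of the window around word_phones[index]."""
    n = len(word_phones)
    if n < 3:
        return 0, n              # assumption: short words use the whole word
    if index == 0:
        return 0, 3              # current, next, next-to-next
    if index == n - 1:
        return n - 3, n          # previous-to-previous, previous, current
    return index - 1, index + 2  # previous, current, next


def window_grammar(word_phones, index, neighbors):
    """Fix the context phones and allow neighbor alternatives for the target."""
    start, end = three_phone_window(word_phones, index)
    parts = []
    for i in range(start, end):
        phone = word_phones[i]
        if i == index:
            alts = [phone] + [p for p in neighbors.get(phone, []) if p != phone]
            parts.append("(" + "|".join(alts) + ")")
        else:
            parts.append("(" + phone + ")")
    return "(" + " ".join(parts) + ")"


# Example: the phoneme IH in the word "with"
print(window_grammar(["W", "IH", "TH"], 1, {"IH": ["IY", "AX", "EH"]}))
```

For this input the sketch prints ((W) (IH|IY|AX|EH) (TH)), the same encoding given for "with" in the Week 3 post.<br />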
<br />
<b>Method 3: (Single/Batch phrase decode)</b><br />
<b style="background-color: white;">Path:</b><span style="background-color: white;"> </span><br />
neighborphones_decode/phrases/<br />
<b style="background-color: white;">Steps To Run:</b><br />
<span style="background-color: white;">1. Run decode.sh program to get the required output in sample.out</span><br />
<span style="background-color: white;">2. Provide the input arguments such as grammar file, feats, acoustic models etc., for the input test phrase</span><br />
<span style="background-color: white;">3. Construct grammar file (JSGF) using my earlier scripts from phonemes2ngbphones and then use jsgf2fsg in sphinxbase to convert from JSGF to FSG which serves as input Language Model to sphinx3_decode</span></div>srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-81989796897893698292012-07-04T09:42:00.001+05:302012-07-04T09:42:20.405+05:30Troy: GSoC 2012 Pronunciation Evaluation Week 4[Project mentor note: I have been holding these more recent blog posts pending some issues with Adobe Flash security updates which periodically break cross-platform audio upload web browser solutions. We have decided to plan for a fail-over scheme using low-latency HTTP POST multipart/form-data binary Speex uploads to provide backup in case Flash/rtmplite fails again in the future. This might also support most of the mobile devices. Please excuse the delay and rest assured that progress continues and will continue to be announced at such time as we are confident that we won't need to contradict ourselves as browser technology for audio upload continues to develop. --James Salsman]<br />
<br />
The data collection website now provides basic capabilities. If you are interested, please check out <a href="http://talknicer.net/~li-bo/datacollection/login.php">http://talknicer.net/~li-bo/datacollection/login.php</a> and give it a try. If you encounter any problems, please let us know.<br />
<br />
Here are my accomplishments from last week:<br />
<div>
<br /></div>
<div>
1) Discussed the project schema design with <span style="background-color: white;">the project mentor</span><span style="background-color: white;"> and created the database with MySQL. The current schema is shown at </span><a href="http://talknicer.net/w/Database_schema" style="background-color: white;">http://talknicer.net/w/Database_schema</a><span style="background-color: white;">. During the development of the user interface, slight modifications were made to refine the database schema, such as the age field in the users table: storing the user's birth date is much better. Other similar changes were made. I learned that good database design comes from practice, not purely from imagination. </span></div>
<div>
<br /></div>
<div>
2) Implemented the two types of user registration pages: one for students and one for exemplar uploaders. To avoid redundant work and allow for fewer constraints on types of users, the registration process involves two steps: one basic registration and one extra information update. For students, only the basic one is mandatory, but the exemplar uploaders have to fill out two separate forms. </div>
<div>
<br /></div>
<div>
3) Added extra supporting functionality for user management, including password reset and mode selection for users with more than one type.</div>
<div>
<br /></div>
<div>
4) Incorporated the audio recorder with the website for recording and uploading to servers. </div>
<div>
<br /></div>
<div>
This week I plan to:</div>
<div>
<br /></div>
<div>
1) Complete the user interface for adding phrase prompts;</div>
<div>
<br /></div>
<div>
2) Test the resulting system; </div>
<div>
<br /></div>
<div>
3) Design the pronunciation learning game for student users.</div>Anonymoushttp://www.blogger.com/profile/09375125981129389911noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-74124238220757048602012-06-19T05:56:00.001+05:302012-06-19T14:47:24.909+05:30Ronanki: GSoC 2012 Pronunciation Evaluation Week 3<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: -webkit-auto;">
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">I finally finished trying different methods for edit-distance grammar decoding. Here is what I have tried so far:</span><br />
<br />
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">1. I used sox to split each input wave file into individual phonemes based on the forced alignment output. Then, I tried decoding each phoneme against its neighboring phonemes. The decoding output matched the expected phonemes only 12 out of 41 times for the exemplar recordings in the phrase "Approach the teaching of pronunciation with more confidence" </span><br />
<br />
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">The accuracy for that method of edit distance scoring was 12/41 (29%) -- This naive approach didn't work well.</span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif;">2. I used sox to split each input wave file into three-phoneme segments based on the forced alignment output and the position of the phoneme. If a phoneme is at the beginning of its word, I used a grammar like: <current phone> <next> <next2next>, if it is a middle phoneme: <previous> <current> <next>, and if it is at the end: <previous2previous> <previous> <current>. I supplied neighboring phones for the current phone and fixed the other two. For example, the phoneme IH in the word "with" is encoded as: </span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace;">((W) (IH|IY|AX|EH) (TH)) </span><br />
<br />
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">The accuracy was 19/41 (46.3%) -- better because of more contextual information.</span><br />
<br />
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">3. I used the entire phrase with each phoneme encoded in a </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">sphinx3_decode</span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;"> grammar file for matching a sequence of alternative neighboring phonemes which looks something like this:</span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-size: 13px;">#JSGF V1.0;</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-size: 13px;">grammar phonelist;</span></span><br />
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">public <phonelist> = (SIL (AH|AE|ER|AA) (P|T|B|HH) (R|Y|L) (OW|AO|UH|AW) (CH|SH|JH|T) (DH|TH|Z|V)(AH|AE|ER|AA) (T|CH|K|D|P|HH) (IY|IH|IX) (CH|SH|JH|T) (IH|IY|AX|EH) (NG|N) (AH|AE|ER|AA) (V|F|DH) (P|T|B|HH)(R|Y|L) (AH|AE|ER|AA) (N|M|NG) (AH|AE|ER|AA) (N|M|NG) (S|SH|Z|TH) (IY|IH|IX) (EY|EH|IY|AY) (SH|S|ZH|CH) (AH|AE|ER|AA) (N|M|NG) (W|L|Y) (IH|IY|AX|EH) (TH|S|DH|F|HH) (M|N) (AO|AA|ER|AX|UH) (R|Y|L) (K|G|T|HH) (AA|AH|ER|AO) (N|M|NG) (F|HH|TH|V) (AH|AE|ER|AA) (D|T|JH|G|B) (AH|AE|ER|AA) (N|M|NG) (S|SH|Z|TH) SIL);</span><br />
<br />
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">The accuracy for this method of edit distance scoring was 30/41 (73.2%) -- the more contextual information provided, the better the accuracy.</span></div>
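<div>
A grammar like the one in method 3 could be generated from the expected phoneme sequence and a table of confusable neighbors. This is only a sketch: the function name phrase_jsgf and the neighbor table are hypothetical, and the two neighbor sets in the example are copied from the grammar above rather than from the project's own neighbor list.</div>

```python
def phrase_jsgf(phones, neighbors, name="phonelist"):
    """Build a JSGF grammar with one alternative group per expected phoneme."""
    groups = []
    for phone in phones:
        alts = [phone] + [p for p in neighbors.get(phone, []) if p != phone]
        groups.append("(" + "|".join(alts) + ")")
    # Leading and trailing silence, as in the grammar shown above
    body = " ".join(["SIL"] + groups + ["SIL"])
    return (
        "#JSGF V1.0;\n"
        "grammar " + name + ";\n"
        "public <" + name + "> = (" + body + ");\n"
    )


# Example with the first two phonemes of "approach"
example_neighbors = {"AH": ["AE", "ER", "AA"], "P": ["T", "B", "HH"]}
print(phrase_jsgf(["AH", "P"], example_neighbors))
```

<div>
The resulting text would still need to be converted from JSGF to FSG before it can be passed to the decoder.</div>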
<div style="text-align: -webkit-auto;">
<br /></div>
<div style="text-align: -webkit-auto;">
</div>
<div style="text-align: left;">
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">Here is some sample output, with the two phoneme sequences written one below the other for comparison.</span></div>
<div style="text-align: left;">
<div style="text-align: -webkit-auto;">
<span style="color: #222222; font-family: arial, sans-serif; font-size: x-small;"><br /></span></div>
<div style="text-align: -webkit-auto;">
</div>
<div style="text-align: left;">
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">Forced-alignment output: </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px; text-align: -webkit-auto;"><b>AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N S IY EY SH AH N W IH TH M</b></span></div>
<div style="text-align: -webkit-auto;">
<span style="color: #222222; font-family: arial, sans-serif; font-size: x-small;"><br /></span></div>
</div>
<div style="text-align: left;">
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; text-align: -webkit-auto;">Decoder output: </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px; text-align: -webkit-auto;"><b>ER P R UH JH DH AH CH IY CH IY N AH V P R ER N AH NG Z IY EY SH AH N W IH TH M</b></span><br />
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; text-align: -webkit-auto;"><br /></span></div>
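<div>
The position-by-position comparison of the two sequences can be sketched as follows. It assumes both sequences have the same length, which holds here because both outputs are forced to the same grammar. The sequences are the portions shown above, and the function name is hypothetical.</div>

```python
forced = ("AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N "
          "S IY EY SH AH N W IH TH M").split()
decoded = ("ER P R UH JH DH AH CH IY CH IY N AH V P R ER N AH NG "
           "Z IY EY SH AH N W IH TH M").split()


def positional_accuracy(reference, hypothesis):
    """Count positions where the decoded phoneme equals the expected one."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return matches, len(reference)


matches, total = positional_accuracy(forced, decoded)
print(f"{matches}/{total} ({100.0 * matches / total:.1f}%)")
```

<div>
On the 30 phonemes shown above, this counts 21 matches. The 30/41 figure reported for method 3 refers to the full 41-phoneme phrase.</div>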
<div style="text-align: left;">
<div style="text-align: -webkit-auto;">
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif;">In this case, both are forced outputs. So, if someone skips or inserts something </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif;">during phrase recording, it may not work well. We need to think of a method </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif;">to solve this. Would a separate decoder pass with a grammar that tests for whole-word or syllable insertions and deletions work?</span></div>
</div>
<br />
<b style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">Things to do for next week:</b><br />
<br />
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">1. </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">We are trying to combine acoustic standard scores (and duration) from </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">forced alignment with an edit distance scoring grammar, which was </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">reported to have better correspondence with human expert phonologists.</span><br />
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<span style="color: #222222; font-family: arial, sans-serif;"><br /></span></div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<span style="color: #222222; font-family: arial, sans-serif;">2. Complete a basic demo of the pronunciation evaluation without edit distance scoring from exemplar recordings using conversion of phoneme acoustic scores and durations to normally distributed scores, and then using those to derive their means and standard deviations, so we can produce per-phoneme acoustic and duration standard scores for new uploaded recordings.</span></div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<span style="color: #222222; font-family: arial, sans-serif;"><br /></span></div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<span style="color: #222222; font-family: arial, sans-serif;">3. Finalize the method for mispronunciation detection at phoneme and word level.</span></div>
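<div>
The per-phoneme standard scores in item 2 can be sketched as a z-score against the exemplar statistics. This sketch skips the conversion to normally distributed scores mentioned above, because that transformation is not specified here. The function names and the numbers in the example are made up for illustration.</div>

```python
from statistics import mean, stdev


def phoneme_stats(exemplar_scores):
    """Map each phoneme to (mean, sample standard deviation) of its exemplar scores."""
    return {
        phone: (mean(values), stdev(values))
        for phone, values in exemplar_scores.items()
        if len(values) >= 2  # stdev needs at least two samples
    }


def standard_score(value, stats, phone):
    """Distance of a new score from the exemplar mean, in standard deviations."""
    mu, sigma = stats[phone]
    if sigma == 0:
        return 0.0  # assumption: identical exemplars give no usable spread
    return (value - mu) / sigma


# Made-up acoustic scores for one phoneme, for illustration only
stats = phoneme_stats({"AH": [-60000, -64000, -68000]})
print(standard_score(-62000, stats, "AH"))
```

<div>
The same two functions apply to durations if the score lists are replaced by per-phoneme frame counts.</div>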
</div>srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-37053481729841048622012-06-19T03:48:00.004+05:302012-06-19T03:48:48.836+05:30Troy: GSoC 2012 Pronunciation Evaluation Week 3Week 3 accomplishments:<br />
<div>
<br /></div>
<div>
1. Tailored the previous ActionScript/MXML audio recorder to provide only audio recording and playback functionality and began interfaces for interaction with the web site pages using JavaScript. </div>
<div>
<br /></div>
<div>
2. Discussed database design and schema with the project mentor and continued refining and testing the schema and initial database records.</div>
<div>
<br /></div>
<div>
Plans for Week 4:</div>
<div>
<br /></div>
<div>
<div>
1. Fix the database schema for prompts to handle word lists with (possibly multiple) pronunciations and parts of speech, along with a separate text string for phrase display which can include arbitrary punctuation and might not have as clear word boundaries because of that punctuation--such as this phrase in dashes--etc.</div>
<div>
<br /></div>
<div>
2. Create separate registration interface for users who will be uploading exemplar pronunciation recordings.</div>
<div>
</div>
<div>
3. Create an interface to add phrase prompts and mark their words' disambiguated pronunciation and parts of speech.</div>
<div>
</div>
<div>
4. Create the interface to upload exemplar recordings for prompts.</div>
<div>
<br /></div>
<div>
5. Think about game play and refine its schema once the basic features are decided.</div>
</div>Anonymoushttp://www.blogger.com/profile/09375125981129389911noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-55509298779413774872012-06-10T10:50:00.003+05:302012-06-10T11:42:57.928+05:30Ronanki: GSoC 2012 Pronunciation Evaluation Week 2<div dir="ltr" style="text-align: left;" trbidi="on">
<i>[It is my fault this update is late, not Ronanki's. --James Salsman]</i><br />
<br />
Following last week's discussion describing how to obtain phoneme acoustic scores from sphinx3_align, here is some additional detail pertaining to two of the necessary output arguments:<br />
<br />
1. Following up on the discussion at <a href="https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4583225" style="font-family: Arial, Helvetica, sans-serif;">https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4583225</a><span style="font-family: Arial, Helvetica, sans-serif;">, I was able to produce acoustic scores for each frame, and thereby also for each phoneme in a single recognition pass. </span><span style="line-height: 18px;"><span style="font-family: Arial, Helvetica, sans-serif;">Add the following code to the </span><span style="font-family: 'Courier New', Courier, monospace;">write_stseg</span><span style="font-family: Arial, Helvetica, sans-serif;"> function in </span><span style="font-family: 'Courier New', Courier, monospace;">main_align.c</span><span style="font-family: Arial, Helvetica, sans-serif;"> and use the </span></span><span style="background-color: white; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">state segmentation parameter </span><span style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">-stsegdir</span><span style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;"> </span><span style="background-color: white; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">as an argument to the program:</span>
<br />
<div style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">
<span style="background-color: white; font-size: 13px; line-height: 18px;"><br /></span></div>
<div style="color: #222222;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small; line-height: 18px;"> char str2[1024];</span></div>
<div style="color: #222222;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small; line-height: 18px;"> align_stseg_t *tmp1;</span></div>
<div style="color: #222222;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small; line-height: 18px;"><br /></span></div>
<div style="color: #222222;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small; line-height: 18px;"> for (i = 0, tmp1 = stseg; tmp1; i++, tmp1 = tmp1->next) {</span></div>
<div style="color: #222222;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small; line-height: 18px;"> mdef_phone_str(kbc->mdef, tmp1->pid, str2);</span></div>
<div style="color: #222222;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small; line-height: 18px;"> fprintf(fp, "FrameIndex %d Phone %s PhoneID %d SenoneID %d state %d Ascr %11d \n",</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small; line-height: 18px;"> i, str2, tmp1->pid, tmp1->sen, tmp1->state, tmp1->score);</span></div>
<div style="color: #222222;">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small; line-height: 18px;"> }</span></div>
<div style="color: #222222; line-height: 18px;">
<br /></div>
<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">2. By using the phone segmentation parameter </span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">-phsegdir</span><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;"> as an argument to the program, the acoustic scores for each phoneme can be calculated. </span><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">The output sequence for the word "approach" </span><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">is as follows</span><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">:</span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"><b><span style="background-color: white; color: #222222; line-height: 18px;"> </span><span style="background-color: white; color: #222222; line-height: 18px;"> SFrm EFrm SegAScr Phone</span>
<span style="background-color: white; color: #222222;"></span></b></span><br />
<span style="background-color: white; color: #222222;"><span style="font-family: 'Courier New', Courier, monospace; line-height: 18px;"><b> 0 9 -64725 SIL</b></span></span><br />
<span style="background-color: white; color: #222222;"><span style="font-family: 'Courier New', Courier, monospace; line-height: 18px;"><b> 10 21 -63864 AH SIL P b</b></span></span><br />
<span style="background-color: white; color: #222222;"><span style="font-family: 'Courier New', Courier, monospace; line-height: 18px;"><b> 22 33 -126819 P AH R i</b></span></span><br />
<span style="background-color: white; color: #222222;"><span style="font-family: 'Courier New', Courier, monospace; line-height: 18px;"><b> 34 39 -21470 R P OW i</b></span></span><br />
<span style="background-color: white; color: #222222;"><span style="font-family: 'Courier New', Courier, monospace; line-height: 18px;"><b> 40 51 -69577 OW R CH i</b></span></span><br />
<span style="background-color: white; color: #222222;"><span style="font-family: 'Courier New', Courier, monospace; line-height: 18px;"><b> 52 64 -55937 CH OW DH e</b></span></span><br />
<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">Each phoneme in the "Phone" column is represented as <Aligned_phone> <Previous_phone> <Next_phone> <position_in_the_word (b-begin, i-middle, e-end)>. The full command line usage for this output is:</span><br />
<div>
<div style="font-size: 13px; line-height: 18px;">
<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="line-height: 18px;">
<span style="font-family: 'Courier New', Courier, monospace;"><span style="background-color: white; color: #222222;">$ sphinx3_align -hmm </span><span style="color: #222222;">wsj_all_cd30.mllt_cd_cont_4000</span><span style="background-color: white; color: #222222;"> -dict cmu.dic -fdict phone.filler -ctl phone.ctl -insent phone.insent -cepdir feats -phsegdir phonesegdir -phlabdir phonelabdir -stsegdir statesegdir -wdsegdir aligndir </span><span style="background-color: white; color: #222222;">-outsent phone.outsent</span></span></div>
<div style="font-size: 13px; line-height: 18px;">
<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
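<div>
The phone segmentation lines shown above can be read into per-phoneme records along these lines. This is a sketch only: the function name is hypothetical, header lines must be skipped by the caller, and the conversion to seconds assumes a rate of 100 frames per second, which should be checked against the front-end settings actually used.</div>

```python
FRAMES_PER_SECOND = 100  # assumption: 10 ms frame shift


def parse_phseg_line(line):
    """Parse one 'SFrm EFrm SegAScr Phone' line of phone segmentation output."""
    fields = line.split()
    start, end, ascr = int(fields[0]), int(fields[1]), int(fields[2])
    context = fields[4:]  # empty for entries without context, such as SIL
    prev_phone, next_phone, position = (context + [None] * 3)[:3]
    frames = end - start + 1
    return {
        "phone": fields[3],
        "previous": prev_phone,
        "next": next_phone,
        "position": position,  # b = begin, i = middle, e = end
        "start_frame": start,
        "end_frame": end,
        "acoustic_score": ascr,
        "duration_frames": frames,
        "duration_seconds": frames / FRAMES_PER_SECOND,
    }


print(parse_phseg_line("  10   21   -63864 AH SIL P b"))
```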
<div style="font-size: 13px; line-height: 18px;">
<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif;"><b>Work in progress:</b></span></div>
<br />
<div style="font-size: 13px;">
<div style="line-height: 18px;">
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;">1. It's very important to weight word scores by the words' </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;">part of speech </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;">(articles don't matter very much if they are omitted, but nouns, </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;">adjectives, verbs, and adverbs are the most important.) Troy has designed a basic database schema at <a href="http://talknicer.net/w/Database_schema">http://talknicer.net/w/Database_schema</a> </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;">in which the part of speech is one of the fields in the "prompts" table along with acoustic and duration standard scores in the "scores" table. </span></div>
</div>
<div style="font-size: 13px; line-height: 18px;">
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;"><br /></span></div>
<div style="font-size: 13px; line-height: 18px;">
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;">2. I put some exemplar recordings that the project mentor had collected for three phrases at <a href="http://talknicer.net/~ronanki/Datasets/">http://talknicer.net/~ronanki/Datasets/</a>, with one subdirectory for each phrase. </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;">The phrases are described at </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;"><a href="http://talknicer.net/~ronanki/Datasets/files/phrases.txt">http://talknicer.net/~ronanki/Datasets/files/phrases.txt</a>.</span><br />
<br /></div>
<div>
<div style="font-size: 13px; line-height: 18px;">
<span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;">3. I ran </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: 'Courier New', Courier, monospace; line-height: normal; text-align: -webkit-auto;">sphinx3_align</span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; line-height: normal; text-align: -webkit-auto;"> for that sample data set. I wrote a program to calculate mean and standard deviations of phoneme acoustic scores, and the mean duration of each phoneme. </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; text-align: -webkit-auto;">I also generated neighbor phonemes for each of the phrases, and the output is written in this file: <a href="http://talknicer.net/~ronanki/Datasets/out_ngb_phonemes.insent">http://talknicer.net/~ronanki/Datasets/out_ngb_phonemes.insent</a></span><br />
<br /></div>
<div style="font-size: 13px; line-height: 18px;">
<span style="background-color: white; color: #222222; font-family: arial, sans-serif;">4. </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; text-align: -webkit-auto;">I also tried some of the other sphinx3 executables such as </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: 'Courier New', Courier, monospace; text-align: -webkit-auto;">sphinx3_decode</span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; text-align: -webkit-auto;">, </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: 'Courier New', Courier, monospace; text-align: -webkit-auto;">sphinx3_livepretend</span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; text-align: -webkit-auto;">, and </span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: 'Courier New', Courier, monospace; text-align: -webkit-auto;">sphinx3_continuous</span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; text-align: -webkit-auto;"> for mispronunciation detection. For the sentence, "</span><span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; text-align: -webkit-auto;">Approach the teaching of pronunciation with more confidence." (phrase 1), I used this command:</span></div>
<div style="text-align: -webkit-auto;">
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-size: 13px;">
<div>
<span style="font-family: 'Courier New', Courier, monospace;">$ SPHINX3DECODE -hmm ${WSJ} -fsg phone.fsg -dict basicphone.dic -fdict phone.filler -ctl new_phone.ctl -hyp phone.out -cepdir feats -mode allphone -hypseg phone_hypseg.out -op_mode 2</span></div>
<div>
<div style="font-family: arial, sans-serif;">
<br /></div>
<div>
<div>
<span style="font-family: arial, sans-serif;">The decoder, </span><span style="font-family: 'Courier New', Courier, monospace;">sphinx3_decode</span><span style="font-family: arial, sans-serif;">, produced this output:</span><br />
<div style="font-family: arial, sans-serif;">
<br /></div>
<span style="font-family: 'Courier New', Courier, monospace;"><b>P UH JH DH CH IY CH Y N Z Y EY SH AH W Z AO K AA F AH N Z</b></span></div>
<div>
<br />
<span style="font-family: arial, sans-serif;">The forced alignment system, </span><span style="font-family: 'Courier New', Courier, monospace;">sphinx3_align</span><span style="font-family: arial, sans-serif;">, produced this output: </span><br />
<div style="font-family: arial, sans-serif;">
<br /></div>
<span style="font-family: 'Courier New', Courier, monospace;"><b>AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N S IY EY SH AH N W IH TH M AO R K AA N F AH D AH N S</b></span></div>
</div>
</div>
<div>
<br /></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;">The </span><span style="font-family: 'Courier New', Courier, monospace;">sphinx3_livepretend</span><span style="font-family: arial, sans-serif;"> and </span><span style="font-family: 'Courier New', Courier, monospace;">sphinx3_continuous</span><span style="font-family: arial, sans-serif;"> commands produce word-level output using language models and acoustic models along with a complete dictionary of expected words:</span></div>
<div style="font-family: arial, sans-serif;">
<br /></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"><b>approach to teaching opponents the nation with more confidence</b></span></div>
</div>
</div>
<div style="font-size: 13px; line-height: 18px;">
<b><br /></b></div>
<div style="font-size: 13px; line-height: 18px;">
<b><span style="font-family: Arial, Helvetica, sans-serif;">Plans for the coming week:</span></b><br />
<b><br /></b></div>
<div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
1. Write and test audio upload and pronunciation evaluation for per-phoneme standard scores.</div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
<br /></div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
2. Since there are many deletions in the edit distance scoring grammars tried so far, we need to modify the grammar file and/or the method we are using to detect whether neighboring phonemes match more closely. Here is my idea of finding neighboring phonemes by dynamic programming:</div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
<br /></div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
a. Run the decoder to get the best possible output<br />
<br /></div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
b. Align the decoder output to forced-alignment output using a dynamic programming string matching algorithm </div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
<br />
c. The aligned output will have the same number of phones as the forced alignment output. So, we need to test two things for each phoneme:</div>
<div style="background-color: rgba(255, 255, 255, 0.917969); text-align: -webkit-auto;">
<ul style="color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: left;">
<li><span style="text-align: left;">If the phone is the same as the expected phoneme, there is no need to do anything.</span></li>
<li><span style="text-align: left;">If the phone differs from the expected phoneme, check whether it appears in the list of neighboring phonemes of the expected phoneme.</span></li>
</ul>
</div>
<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
<br />
d. Then, we can run sphinx3_align with this outcome against the same wav file to check whether the acoustic scores actually indicate a better match. </div>
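<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
Steps b and c above could be sketched roughly as follows. This is only an illustration in Python, not the code checked in for the project: the function names <code>align_to_reference</code> and <code>classify</code> are hypothetical, the edit costs are all assumed to be 1, and the neighbor list used in the example is a small hand-typed excerpt.</div>

```python
def align_to_reference(reference, hypothesis):
    """Align decoder phones to forced-alignment phones with edit-distance DP.

    Returns one entry per reference phone: the decoder phone aligned to it,
    or None where the decoder deleted that phone. Inserted phones are dropped.
    """
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = edit distance between reference[:i] and hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the alignment.
    aligned = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        sub = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
        if cost[i][j] == sub:
            aligned[i - 1] = hypothesis[j - 1]  # match or substitution
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1  # deletion: reference phone has no decoder phone
        else:
            j -= 1  # insertion: extra decoder phone is ignored
    return aligned


def classify(reference, aligned, neighbors):
    """Label each expected phone as match, neighbor, other, or deleted."""
    labels = []
    for expected, got in zip(reference, aligned):
        if got is None:
            labels.append("deleted")
        elif got == expected:
            labels.append("match")
        elif got in neighbors.get(expected, ()):
            labels.append("neighbor")
        else:
            labels.append("other")
    return labels


if __name__ == "__main__":
    # Hypothetical excerpt of a neighbor map, for illustration only.
    neighbors = {"AE": ["EH", "ER", "AH"]}
    expected = "AY HH AE D".split()
    decoded = "AY HH EH D".split()
    aligned = align_to_reference(expected, decoded)
    print(list(zip(expected, aligned, classify(expected, aligned, neighbors))))
```

<div style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal; text-align: -webkit-auto;">
Phones labeled "neighbor" or "other" would then be the candidates to re-check with sphinx3_align in step d.</div>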
<div style="background-color: rgba(255, 255, 255, 0.917969); text-align: -webkit-auto;">
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal;">
<br /></div>
<div>
<span style="color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal;">3. As an alternative to the above, I used </span><span style="color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: normal;">sox</span><span style="font-family: arial, sans-serif;"> to split each input wave file into individual phoneme wav files using the forced alignment phone labels, and then ran a separate recognition pass on each tiny speech segment. Now, I am writing separate grammar files for the neighboring phonemes of each phoneme. Once I complete them, I will check the output of the decoder for each phoneme segment. This should provide a more accurate assessment of mispronunciations.</span></div>
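<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal;">
As a rough illustration of the splitting step, the sketch below builds one <code>sox ... trim</code> command per phone segment. It is written in Python for illustration and is not the script actually used: the helper name <code>sox_trim_commands</code>, the output file naming, the 100 frames-per-second rate, and the inclusive start/end frame convention are all assumptions to check against the real forced-alignment output.</div>

```python
FRAME_RATE = 100  # assumed number of feature frames per second


def sox_trim_commands(wav_path, segments, out_prefix, frame_rate=FRAME_RATE):
    """Build one sox command per (phone, start_frame, end_frame) segment.

    The commands are returned as argument lists and are not executed here.
    End frames are assumed to be inclusive.
    """
    commands = []
    for index, (phone, start_frame, end_frame) in enumerate(segments):
        start = start_frame / frame_rate
        duration = (end_frame - start_frame + 1) / frame_rate
        out_path = f"{out_prefix}_{index:03d}_{phone}.wav"
        # sox IN OUT trim START DURATION keeps DURATION seconds from START
        commands.append(
            ["sox", wav_path, out_path, "trim", f"{start:.2f}", f"{duration:.2f}"]
        )
    return commands


if __name__ == "__main__":
    # Made-up frame numbers, for illustration only.
    example_segments = [("AH", 0, 9), ("P", 10, 14)]
    for command in sox_trim_commands("phrase1.wav", example_segments, "phrase1"):
        print(" ".join(command))
```

<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal;">
Each argument list could be passed to <code>subprocess.run</code> once the frame convention has been confirmed.</div>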
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal;">
<br /></div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal;">
4. I will update the wiki here at <a href="http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation">http://cmusphinx.<wbr></wbr>sourceforge.net/wiki/<wbr></wbr>pronunciation_evaluation</a> with my current tasks and milestones.</div>
<div style="color: #222222; font-family: arial, sans-serif; font-size: 13px; line-height: normal;">
<br /></div>
</div>
</div>
</div>
</div>
</div>srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-87334804899676479152012-06-05T23:40:00.000+05:302012-06-06T00:38:25.537+05:30Troy: GSoC 2012 Pronunciation Evaluation Week 2<span style="font-family: Arial, Helvetica, sans-serif;">These are the things I've accomplished in the second week of GSoC 2012:</span><br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">1. Set up a cron job for the rtmplite server to automatically check whether the process is still running or not. If it is stopped, restart it. This will allow the server to stay up if the machine gets rebooted, and will allow the server to spawn subprocesses without being stopped by job control as happens when the process is put into the background from a terminal shell. </span><span style="font-family: Arial, Helvetica, sans-serif;">To accomplish this, I first created a </span><span style="font-family: 'Courier New', Courier, monospace;">.process</span><span style="font-family: Arial, Helvetica, sans-serif;"> file in my home directory with the rtmplite server's process id number as its sole contents. You can use 'top' or 'ps' to find out the process id of the server. </span><span style="font-family: Arial, Helvetica, sans-serif;">Then I created this shell script file to check the status of the rtmplite server process:</span><br />
<pre><span style="color: #906030; font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace; font-size: x-small;">pidfile</span><span style="color: #303030; font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace; font-size: x-small;">=~/.process
</span><span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace; font-size: x-small;"><span style="color: green; font-weight: bold;">if</span> <span style="color: #303030;">[</span> -e <span style="background-color: #fff0f0;">"$pidfile"</span> <span style="color: #303030;">]</span>
<span style="color: green; font-weight: bold;">then</span>
<span style="color: grey;"> # check whether the process is running</span>
<span style="color: #906030;"> rtmppid</span><span style="color: #303030;">=</span><span style="background-color: #fff0f0;">`</span>/usr/bin/head -n 1 <span style="color: green; font-weight: bold;">${</span><span style="color: #906030;">pidfile</span><span style="color: green; font-weight: bold;">}</span> | /usr/bin/awk <span style="background-color: #fff0f0;">'{print $1}'</span><span style="background-color: #fff0f0;">`</span>;
<span style="color: grey;"> # restart the process if not running</span>
<span style="color: green; font-weight: bold;"> if</span> <span style="color: #303030;">[</span> ! -d /proc/<span style="color: green; font-weight: bold;">${</span><span style="color: #906030;">rtmppid</span><span style="color: green; font-weight: bold;">}</span> <span style="color: #303030;">]</span>
<span style="color: green; font-weight: bold;"> then</span>
/usr/bin/python <span style="color: green; font-weight: bold;">${</span><span style="color: #906030;">exefile</span><span style="color: green; font-weight: bold;">}</span> -r <span style="color: green; font-weight: bold;">${</span><span style="color: #906030;">dataroot</span><span style="color: green; font-weight: bold;">}</span> &
<span style="color: #906030;"> rtmppid</span><span style="color: #303030;">=</span><span style="color: #906030;">$!</span>
<span style="color: #007020;"> echo</span> <span style="background-color: #fff0f0;">"${rtmppid}"</span> > <span style="color: green; font-weight: bold;">${</span><span style="color: #906030;">pidfile</span><span style="color: green; font-weight: bold;">}</span>
<span style="color: #007020;"> echo</span> <span style="background-color: #fff0f0;">`</span>/bin/date<span style="background-color: #fff0f0;">`</span> <span style="background-color: #fff0f0;">"### rtmplite process restarted with pid: ${rtmppid}"</span>
<span style="color: green; font-weight: bold;"> fi</span>
<span style="color: green; font-weight: bold;">fi
</span></span><span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;">
</span><span style="font-family: arial, helvetica, sans-serif;">This script first checks whether the .process file exists. To stop the cron job from checking on the process temporarily (such as when we apply patches to the program), we can simply delete this file, and the script will neither check on nor try to restart the server; after our maintenance, we recreate the file with the new process id, and the checking resumes automatically.
</span><span style="font-family: arial, helvetica, sans-serif;">
The last and most important step is to schedule this task in cron by adding the following entry with the command </span><span style="font-family: 'Courier New', Courier, monospace;">crontab -e</span><span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;">
</span><pre><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;"> * * * * * [path_to_the_script]/check_status.sh
</span>
</span><span style="font-family: arial, helvetica, sans-serif;">This causes the cron system to run this script every minute, thereby checking the rtmplite server process every minute.</span></pre>
<pre><span style="font-family: arial, helvetica, sans-serif;">2. Implemented web server user login and registration pages using MySQL and HTML. </span><span style="font-family: arial, helvetica, sans-serif;">We use a MySQL database for storing user information, so I designed and created this table for user information in the server's MySQL database:</span></pre>
</pre>
<span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;">
</span><br />
<table border="1" style="border-spacing: 0px; font-family: Arial;"><tbody>
<tr><th>Field</th><th>Type</th><th>Comments</th></tr>
<tr><td>userid</td><td>INTEGER</td><td>Compulsory, automatically increased, primary key</td></tr>
<tr><td>email</td><td>VARCHAR(200)</td><td>Compulsory, users are identified by emails</td></tr>
<tr><td>password</td><td>VARCHAR(50)</td><td>Compulsory, hashed using SHA1, at least 8 alphanumeric characters</td> </tr>
<tr><td>name</td><td>VARCHAR(100)</td><td>Not compulsory, default 'NULL'</td></tr>
<tr><td>age</td><td>INTEGER</td><td>Not compulsory, default 'NULL', accepted values [0,150]</td></tr>
<tr><td>sex</td><td>CHAR(1)</td><td>Not compulsory, default 'NULL', accepted values {'M', 'F'}</td></tr>
<tr><td>native</td><td>CHAR(1)</td><td>Not compulsory, default 'NULL', accepted values {'Y', 'N'}. Indicates whether the user is a native English speaker.</td></tr>
<tr><td>place</td><td>VARCHAR(1000)</td><td>Not compulsory, default 'NULL'. Indicates where the user lived between the ages of 6 and 8.</td></tr>
<tr><td>accent</td><td>CHAR(1)</td><td>Not compulsory, default 'NULL', accepted values {'Y', 'N'}. Indicates whether the user has a self-reported accent.</td></tr>
</tbody></table>
<span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;">
</span><br />
<span style="font-family: arial, helvetica, sans-serif;">This table was created by the following SQL command:</span><span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;"><br /></span><br />
<pre><span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;"><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>CREATE</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>TABLE</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> users (
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> userid </span><span class="s2" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">INTEGER</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>NOT</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>NULL</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> AUTO_INCREMENT,
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> email </span><span class="s2" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">VARCHAR</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">(</span><span class="s3" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>200</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">) </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>NOT</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>NULL</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">,
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> password </span><span class="s2" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">VARCHAR</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">(</span><span class="s3" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>50</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">) </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>NOT</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>NULL</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">,
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> name </span><span class="s2" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">VARCHAR</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">(</span><span class="s3" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>100</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">),
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> age </span><span class="s2" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">INTEGER</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">,
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> sex </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>SET</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">(</span><span class="s4" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">'M'</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">, </span><span class="s4" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">'F'</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">),
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> native </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>SET</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">(</span><span class="s4" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">'Y'</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">, </span><span class="s4" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">'N'</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">) </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>DEFAULT</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><span class="s4" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">'N'</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">,
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> place </span><span class="s2" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">VARCHAR</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">(</span><span class="s3" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>1000</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">),
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> accent </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>SET</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">(</span><span class="s4" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">'Y'</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">, </span><span class="s4" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">'N'</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">),
</span><span class="s5" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><b style="font-family: 'Courier New', Courier, monospace; font-size: small;">CONSTRAINT</b><span class="s5" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><b style="font-family: 'Courier New', Courier, monospace; font-size: small;">PRIMARY</b><span class="s5" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><b style="font-family: 'Courier New', Courier, monospace; font-size: small;">KEY</b><span class="s5" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> (userid),
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>CONSTRAINT</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> chk_age </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>CHECK</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> (age</span><span class="s6" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">>=</span><span class="s3" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>0</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> </span><span class="s1" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>AND</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"> age</span><span class="s6" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><=</span><span class="s3" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"><b>150</b></span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">)
</span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">);</span></span></pre>
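<div style="font-family: arial, helvetica, sans-serif;">
MySQL versions before 8.0.16 parse <code>CHECK</code> constraints but do not enforce them, so rules such as the age range would also need to be checked in the application. The sketch below shows one way the rules from the table could be expressed. It is in Python purely for illustration (the server itself uses PHP), the function names are hypothetical, and it reads "at least 8 alphanumeric characters" as "only alphanumeric characters, at least 8 of them", which is an assumption.</div>

```python
import hashlib
import re


def validate_user(email, password, age=None, sex=None):
    """Return a list of field names that violate the table's rules."""
    errors = []
    if not email or len(email) > 200:
        errors.append("email")
    # Assumed reading: password is alphanumeric only, 8 characters or more.
    if not re.fullmatch(r"[A-Za-z0-9]{8,}", password or ""):
        errors.append("password")
    if age is not None and not 0 <= age <= 150:
        errors.append("age")
    if sex is not None and sex not in ("M", "F"):
        errors.append("sex")
    return errors


def hash_password(password):
    """Return the SHA1 hex digest (40 characters) of the password."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(validate_user("user@example.com", "abc12345", age=200))
    print(len(hash_password("abc12345")))
```

<div style="font-family: arial, helvetica, sans-serif;">
The 40-character hex digest fits in the <code>VARCHAR(50)</code> password column.</div>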
<span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;"><span style="font-family: arial, helvetica, sans-serif;">I also prototyped the login and simple registration pages in HTML. Here are their preliminary screenshots:</span><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhX3ZAn0G2oobp0tiQYuaJrXjRiEnsgGE7qzWM68esy_-o6ezxEHqb17YkxJyyHEREQEB67RBDG1S-6cCZ4hiam-rVKPCozImv4YlVK7mnlQJrEw5Jrggi25EJ4oNxjSPIrsa0EIdIo_GfE/s1600/login-756956.png" style="font-family: arial, helvetica, sans-serif;"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5749875687284161314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhX3ZAn0G2oobp0tiQYuaJrXjRiEnsgGE7qzWM68esy_-o6ezxEHqb17YkxJyyHEREQEB67RBDG1S-6cCZ4hiam-rVKPCozImv4YlVK7mnlQJrEw5Jrggi25EJ4oNxjSPIrsa0EIdIo_GfE/s320/login-756956.png" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTUB3hhYzmLDPFbrcu7y2ppPUIkkziUU1xNLQDA8_hzpdB4q9h5wQ0WqT9CWbKuTDAnSr9PH68TtQW1LVmy5n_YLI-eHhQXkQfAn503w7qjJBM206BkS5Pdve9iTfhYcogGbadQGBoCGiK/s1600/register-757946.png" style="font-family: arial, helvetica, sans-serif;"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5749875692195794242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTUB3hhYzmLDPFbrcu7y2ppPUIkkziUU1xNLQDA8_hzpdB4q9h5wQ0WqT9CWbKuTDAnSr9PH68TtQW1LVmy5n_YLI-eHhQXkQfAn503w7qjJBM206BkS5Pdve9iTfhYcogGbadQGBoCGiK/s320/register-757946.png" /></a><br />
</span><br />
<pre><span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;"><span style="font-family: arial, helvetica, sans-serif;">If you like, you can go to this page to help us test the system: </span><span style="background-color: transparent; font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;"><span style="font-family: arial, helvetica, sans-serif;"><a href="http://talknicer.net/~li-bo/datacollection/login.php" target="_blank">http://talknicer.net/~li-bo/datacollection/login.php</a>. On the server, we use PHP to retrieve the form information from the login and registration pages, perform an update or query in the MySQL database, and then send data back in HTML.
</span></span><span style="font-family: arial, helvetica, sans-serif;">
The recording interface has also been modified to use HTML instead of pure Flex as before. The page currently displays well, but there is no event interaction between HTML and Flash</span><span style="font-family: arial, helvetica, sans-serif;"> yet.
</span><span style="font-family: arial, helvetica, sans-serif;">
3. Database schema design for the entire project: S</span><span style="font-family: arial, helvetica, sans-serif;">everal SQL tables have been designed to store the various information used by all aspects of this project. Detailed table information can be found on our wiki page: </span><a href="http://talknicer.net/w/Database_schema" style="font-family: arial, helvetica, sans-serif;">http://talknicer.net/w/Database_schema</a><span style="font-family: arial, helvetica, sans-serif;">. Here is a brief discussion.
</span><span style="font-family: arial, helvetica, sans-serif;">
First, the </span><a href="http://talknicer.net/w/Database_schema#Users" style="font-family: arial, helvetica, sans-serif;">user table</a><span style="font-family: arial, helvetica, sans-serif;"> shown above will be augmented to keep two additional kinds of user information: one for normal student users and one for those who are providing exemplar recordings. Student users, when they can provide correct pronunciation, should also be allowed to contribute to the exemplar recordings. Also if exemplar recorders register through the website, they have to show they are proficient enough to contribute a qualified exemplar recording, so we should be able to use the student evaluation system to qualify them for uploading exemplar contributions.
</span><span style="font-family: arial, helvetica, sans-serif;">
There are several other tables for additional information, such as </span><a href="http://talknicer.net/w/Database_schema#Languages" style="font-family: arial, helvetica, sans-serif;">languages</a><span style="font-family: arial, helvetica, sans-serif;"> for a list of languages defined by the ISO in case we extend our project to other languages; a </span><a href="http://talknicer.net/w/Database_schema#Regions" style="font-family: arial, helvetica, sans-serif;">region table</a><span style="font-family: arial, helvetica, sans-serif;"> to store some idea of the user's accent; and a </span><a href="http://talknicer.net/w/Database_schema#Prompts" style="font-family: arial, helvetica, sans-serif;">prompts table</a><span style="font-family: arial, helvetica, sans-serif;"> for the list of text resources that will be used for pronunciation evaluation. </span><span style="font-family: arial, helvetica, sans-serif;">There are also </span><a href="http://talknicer.net/w/Database_schema#Recordings" style="font-family: arial, helvetica, sans-serif;">tables</a><span style="font-family: arial, helvetica, sans-serif;"> to log the recordings that users make and tables for the sets of </span><a href="http://talknicer.net/w/Database_schema#Tests" style="font-family: arial, helvetica, sans-serif;">tests</a><span style="font-family: arial, helvetica, sans-serif;"> stored in the system.
</span><span style="font-family: arial, helvetica, sans-serif;">
Here are my plans for the coming week:
</span><span style="font-family: arial, helvetica, sans-serif;">1. Discuss details of the game specification to finish the last part of schema design.
</span><span style="font-family: arial, helvetica, sans-serif;">
2. Figure out how to integrate the Flash audio recorder with the HTML interface using bidirectional communication between ActionScript and JavaScript.
3</span><span style="font-family: arial, helvetica, sans-serif;">. Implement the student recording interface.
4.</span><span style="font-family: arial, helvetica, sans-serif;"> Further tasks can be found at: </span><a href="http://talknicer.net/w/To_do_list" style="font-family: arial, helvetica, sans-serif;">http://talknicer.net/w/To_do_list</a><span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;">
</span>
</span></pre>
<span style="font-family: 'UbuntuBeta Mono', 'Ubuntu Mono', monospace;">
</span>Anonymoushttp://www.blogger.com/profile/09375125981129389911noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-75132345751102247182012-06-01T12:32:00.002+05:302012-06-01T12:39:12.442+05:30Ronanki: GSoC 2012 Pronunciation Evaluation Week 1 Status<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">Last week, I accomplished the following:</span><br />
<ol>
<li><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">Successfully tested producing phoneme acoustic scores from sphinx3_align using two recognition passes. I was able to use the state segmentation parameter </span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18px;">-stsegdir</span><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;"> as an argument to the program to obtain acoustic scores for each frame and thereby for each phoneme as well. However, the program's output still needs to be decoded into integer format, which I will try to do by the end of next week.</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;"><span style="background-color: white; color: #222222; font-size: 13px; line-height: 18px;">Last week I wrote a program which converts a list of each phoneme's "neighbors," or most similar other phonemes, provided by the project mentor from the Worldbet phonetic alphabet to CMUbet. But yesterday, when I compared both files manually, I found that some of the phones were mismatched, so I re-checked my code and fixed the bug. The corrected</span><span style="background-color: white; color: #222222; font-size: 13px; line-height: 18px;"> program takes a string of phonemes representing an expected utterance as input and produces a sphinx3 recognition grammar consisting of a string of alternatives representing each expected phoneme and all of its neighboring phonemes for automatic edit distance scoring. </span></span></li>
</ol>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: small;"><span style="background-color: white; color: #222222; line-height: 18px;">All the programs I have written so far are checked in at </span><a href="http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/">http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki</a> <span style="background-color: white; color: #222222; line-height: 18px;">using su</span><span style="background-color: white; color: #222222; line-height: 18px;">bversion. (Similarly, Troy's code is checked in at </span><a href="http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/troy/">http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/troy</a>.)</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">
<span style="background-color: white; color: #222222; font-size: 13px; line-height: 18px;">Here is the procedure for using that code to obtain neighboring phonemes of CMUbet from a file which contains a string of phonemes:</span></span>
<br />
<ul style="text-align: left;">
<li><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; line-height: 18px;">To convert Worldbet phonetic alphabet to CMUbet </span></li>
<span style="font-family: Arial, Helvetica, sans-serif;"><span style="background-color: white; color: #222222; font-size: 13px; line-height: 18px;"><b>Usage:</b> python convert_world2cmu.py &lt;input_worldbet_phone&gt; &lt;input_key_map&gt; &lt;output_cmubet_phone&gt;</span>
</span>
<li><span style="font-family: Arial, Helvetica, sans-serif;"><span style="background-color: white; color: #222222; font-size: 13px; line-height: 18px;">To convert input list of phonemes to neighboring phones</span><span style="background-color: white; color: #222222; font-size: small; line-height: 18px;"> </span></span></li>
<span style="font-family: Arial, Helvetica, sans-serif;"><b style="color: #222222; font-size: 13px; line-height: 18px;">Usage:</b><span style="background-color: white; color: #222222; font-size: small; line-height: 18px;"> </span><span style="background-color: white; color: #222222; font-size: 13px; line-height: 18px;">python convert2_ngbphones.py &lt;input_phoneme_list&gt; &lt;input_phone_map&gt; &lt;output_neighboring_phone_list&gt;</span>
</span></ul>
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;"><b style="background-color: white; color: #222222; font-size: 13px; line-height: 18px;">Ex:</b><span style="background-color: white; color: #222222; font-size: 13px; line-height: 18px;"> </span><span style="background-color: white; color: #222222; font-size: 13px; line-height: 18px;">"I had faith in them" (arctic_a0030) - a sentence from arctic database:</span></span></li>
<span style="font-family: Arial, Helvetica, sans-serif; font-size: small;"><span style="background-color: white; color: #222222; line-height: 18px;">&lt;input_phoneme_list&gt; AY HH AE D F EY TH IH N DH EH M (arctic_a0030)</span>
</span><span style="background-color: white; color: #222222;"><div style="color: black; line-height: normal;">
<span style="font-family: Arial, Helvetica, sans-serif;"><span style="font-size: small;">&lt;output_neighboring_phone_list&gt; <span style="color: #222222; line-height: 18px;">{AY|AA|IY|OY|EY} {HH|TH|F|P|T|K} {AE|EH|ER|AH} {D|T|JH|G|B} {F|HH|TH|V} {EY|EH|IY|AY} {TH|S|DH|F|HH} {IH|IY|AX|EH} {N|M|NG} {DH|TH|Z|V} {EH|IH|AX|ER|AE} {M|N} (arctic_a0030)</span></span>
</span></div>
</span></ul>
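<div>
The neighbor expansion step can be sketched in a few lines of Python. This is only an illustration, not the checked-in convert2_ngbphones.py: the function name and the in-memory map are hypothetical, and the map covers only the twelve phonemes from the arctic_a0030 example above, whereas the real script reads its map from the phone map file.</div>

```python
# Hypothetical neighbor map, copied from the arctic_a0030 example output above.
# The real script loads its map from <input_phone_map>; this covers only the
# phonemes that occur in "I had faith in them".
NEIGHBORS = {
    "AY": ["AA", "IY", "OY", "EY"],
    "HH": ["TH", "F", "P", "T", "K"],
    "AE": ["EH", "ER", "AH"],
    "D": ["T", "JH", "G", "B"],
    "F": ["HH", "TH", "V"],
    "EY": ["EH", "IY", "AY"],
    "TH": ["S", "DH", "F", "HH"],
    "IH": ["IY", "AX", "EH"],
    "N": ["M", "NG"],
    "DH": ["TH", "Z", "V"],
    "EH": ["IH", "AX", "ER", "AE"],
    "M": ["N"],
}


def to_neighbor_alternatives(phonemes, neighbors=NEIGHBORS):
    """Turn 'AY HH ...' into '{AY|AA|...} {HH|TH|...} ...'.

    Each group lists the expected phoneme first, then its neighbors.
    A phoneme missing from the map becomes a group containing only itself.
    """
    groups = []
    for phoneme in phonemes.split():
        alternatives = [phoneme] + neighbors.get(phoneme, [])
        groups.append("{" + "|".join(alternatives) + "}")
    return " ".join(groups)


if __name__ == "__main__":
    print(to_neighbor_alternatives("AY HH AE D F EY TH IH N DH EH M"))
```

<div>
Putting the expected phoneme first in each group keeps the original utterance readable from the output.</div>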
</div>
</div>srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-47469956523585231112012-05-30T04:13:00.003+05:302012-05-30T07:57:05.154+05:30Troy: GSoC 2012 Pronunciation Evaluation Week 1The first week of GSoC 2012 has already been busy. Here is what I have accomplished so far:<br />
<br />
<ol>
<li>To measure the effect of the Speex recording "quality" parameter (which the client sets from 0 to 10), I recorded the same Sphinx3 test utterance ("NO ONE AT THE STATE DEPARTMENT WANTS TO LET SPIES IN") from a constant source recording with the quality varying from 0 to 10. As shown on the graph, the higher the Speex quality parameter, the larger the .FLV file. Judging from my own listening, greater quality parameter values do result in better quality, but it is difficult to hear the differences above level 7. I also generated alignment scores to see whether the quality affects the alignment. However, from the results shown in the following graph, the acoustic scores seem essentially identical for the different recordings. To be on the safe side in case of background and line noise, for now we will use a Speex recording quality parameter of 8.<img alt="graph" height="247" src="https://docs.google.com/spreadsheet/oimg?key=0AnmjwEuGJ5xldHR5N25feWxpOGhja3QyVWRxVjFEc2c&oid=2&zx=puoyjnpt1v3q" title="" width="400" /></li>
<li>The rtmplite server is now configured to save its uploaded files to the <b>[path_to_webroot]/data</b> directory on the server. The initial audioRecorder applet will place its recordings in the <b>[path_to_webroot]/data/audioRecorder</b> directory, and each user will have a separate folder (e.g. <b>[path_to_webroot]/data/audioRecorder/user1</b>). For each recorded utterance, the file name now has the format <b>[sentence name]_[quality level].flv</b></li>
<li>The conversion from .FLV Speex uploads to .WAV PCM audio files is done entirely in the rtmplite server, using a process spawned by Python's subprocess.Popen() function calling ffmpeg. As soon as rtmplite closes the FLV file, the conversion is performed, and the converted WAV file has exactly the same path and name except for the suffix, which is .wav instead of .flv. <a href="https://plus.google.com/102046902944478837894" target="_blank">Guillem</a> suggested the sox command for the conversion, but it doesn't recognize .flv files directly. Other possibilities included speexdec, but that won't open .flv files either.</li>
<li>In the audioRecorder client, the user interface now waits for NetConnection and NetStream events to open and close successfully before proceeding with other events. In addition, a 0.5-second delay has been inserted at the beginning and end of the recording button click event to avoid inadvertently trimming the start or end of the recording. </li>
</ol>
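<div>
The FLV-to-WAV conversion step from item 3 might look roughly like the sketch below. It is not the code that runs inside rtmplite: the function names are hypothetical, and the mono 16 kHz output settings are my assumption about what the recognizer wants, not something stated above. Only the ffmpeg options -y, -i, -ac and -ar are used.</div>

```python
import os
import subprocess


def wav_path_for(flv_path):
    """Return the same path and name with the suffix changed to .wav."""
    root, _extension = os.path.splitext(flv_path)
    return root + ".wav"


def build_ffmpeg_command(flv_path, sample_rate=16000):
    """Build the ffmpeg argument list (mono, 16 kHz by assumption)."""
    return [
        "ffmpeg",
        "-y",                     # overwrite an existing .wav without asking
        "-i", flv_path,           # input: the Speex audio inside the FLV file
        "-ac", "1",               # one audio channel
        "-ar", str(sample_rate),  # output sample rate in Hz
        wav_path_for(flv_path),
    ]


def start_conversion(flv_path):
    """Spawn ffmpeg without waiting for it to finish.

    stdin is pointed at the null device so the child process never
    tries to read from a terminal.
    """
    return subprocess.Popen(
        build_ffmpeg_command(flv_path),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

<div>
Passing the command as a list rather than a shell string avoids quoting problems if a sentence name ever contains spaces.</div>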
<div>
My plans for the 2nd week are:</div>
<div>
<ol>
<li>Solve a problem encountered in converting FLV files to WAV using ffmpeg with Python's Popen() function. If the main Python script (call it test.py for example) is run from a terminal as "python test.py", then everything works fine. However, if I put it in the background with "python test.py &" and log off the server, then every time Popen() is invoked, the whole process hangs with a "Stopped + test.py &" message. I will try to find a way to work around this issue. Maybe if I start the process from cron (after checking whether it is already running, using a process ID number stored in a .pid text file), it will start subprocesses without stopping as it does when detached from a terminal.</li>
<li>Finish the upload interface. There will be two kinds of interfaces: one for students and one for exemplar pronunciations. For the students, we will display from one to five cue phrases below space for a graphic or animation, assuming the smallest screen possible using HTML which would also look good in a larger window. For the exemplar recordings, we just need to display one phrase but we should also have per-upload form fields (e.g., name, age, sex, native speaker (y/n?), where speaker lived ages 6-8 (which determines their accent), self-reported accent, etc.) which should persist across multiple uploads by the same user (perhaps using HTTP cookies.) I want to integrate those fields with the mysql database running on our server, so I will need to create a SQL schema with some CREATE TABLE statements to hold all those fields, the filenames, maybe recording durations, the date and time, and perhaps other information.</li>
<li>Test the rtmplite upload server to make sure it works correctly and without race conditions during simultaneous uploads from multiple users, and both sequential and simultaneous recording uploads by the same user, just to be on the safe side.</li>
<li>Further milestones are listed at <a href="http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation#milestones1">http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation#milestones1</a></li>
</ol>
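<div>
As a starting point for the schema mentioned in item 2, here is a rough sketch. To keep it runnable anywhere it uses Python's built-in sqlite3 module as a stand-in for the MySQL server, and every table and column name is a placeholder of mine. The MySQL version would need its own types (AUTO_INCREMENT keys, DATETIME columns and so on).</div>

```python
import sqlite3

# Hypothetical schema: one row per speaker, one row per uploaded recording.
SCHEMA = """
CREATE TABLE speakers (
    speaker_id            INTEGER PRIMARY KEY,
    name                  TEXT,
    age                   INTEGER,
    sex                   TEXT,
    native_speaker        INTEGER,  -- 1 = yes, 0 = no
    residence_ages_6_to_8 TEXT,     -- where the speaker lived at ages 6-8
    self_reported_accent  TEXT
);
CREATE TABLE recordings (
    recording_id     INTEGER PRIMARY KEY,
    speaker_id       INTEGER NOT NULL REFERENCES speakers(speaker_id),
    phrase           TEXT NOT NULL,
    filename         TEXT NOT NULL,
    duration_seconds REAL,
    recorded_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def create_database(path=":memory:"):
    """Create both tables and return the open connection."""
    connection = sqlite3.connect(path)
    connection.executescript(SCHEMA)
    return connection


if __name__ == "__main__":
    db = create_database()
    speaker = db.execute(
        "INSERT INTO speakers (name, age, native_speaker) VALUES (?, ?, ?)",
        ("example speaker", 30, 1),
    )
    db.execute(
        "INSERT INTO recordings (speaker_id, phrase, filename) VALUES (?, ?, ?)",
        (speaker.lastrowid, "I had faith in them", "arctic_a0030_8.flv"),
    )
    print(db.execute("SELECT COUNT(*) FROM recordings").fetchone()[0])
```

<div>
Keeping the speaker fields in their own table means they are entered once and shared by all of that speaker's uploads.</div>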
</div>Anonymoushttp://www.blogger.com/profile/09375125981129389911noreply@blogger.com4tag:blogger.com,1999:blog-8012401148587899501.post-44209835804087266082012-05-25T00:32:00.002+05:302012-05-25T00:32:26.653+05:30Ronanki: GSoC Work Prior to the Official Start<div dir="ltr" style="text-align: left;" trbidi="on">
Well, it has been a month since I got accepted into this year's Google Summer of Code. This has been a great time for me, during the community bonding period within the CMU Sphinx organization. Our organization has six GSoC students this year working on different projects. We introduced ourselves to each other over the cmusphinx-gsoc mailing list and had a few conversations over chat. Thanks to Carol Smith, I received my welcome package from Google on May 19th, and a free ACM membership too :)<br />
<br />
It has been three days since GSoC 2012 started officially. Prior to that, I became familiarized with a few different things with the help of my mentor. He created a wiki page for our projects at <a href="http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation" target="_blank">http://cmusphinx.sourceforge.<wbr></wbr>net/wiki/pronunciation_<wbr></wbr>evaluation</a>. Troy and I are also going to blog at <a href="http://cmusphinx.sourceforge.net/">http://cmusphinx.sourceforge.net</a> and update the wiki there during this summer. So please check there for important updates too.<br />
<br />
Currently, my goal is to build a web interface which allows users to evaluate their pronunciation. Some of the sub-tasks have already been accomplished, and some of them are still ongoing:<br />
<br />
<b>Work accomplished:</b><br />
<ul style="text-align: left;">
<li>Created an initial web interface which allows users to record and play back their speech using the open source <a href="http://code.google.com/p/wami-recorder/">wami-recorder</a>, which is being developed by the Spoken Language Systems group at MIT.</li>
<li>When the recording is completed, the wave file is uploaded to the server for processing.</li>
<li>Sphinx3 forced alignment is used to align a phoneme string expected from the utterance with the recorded speech to calculate time endpoints and acoustic scores for each phoneme.</li>
<li>I tried many different output arguments in sphinx3_align from
<a href="http://cmusphinx.sourceforge.net/wiki/sphinx4:sphinxthreealigner">http://cmusphinx.sourceforge.net/wiki/sphinx4:sphinxthreealigner</a> and successfully tested producing the phoneme acoustic scores using two recognition passes.</li>
<ul>
<li>In the first pass, I use <span style="font-family: 'Courier New', Courier, monospace;">-phlabdir</span> as an argument to get a .lab file as output, which contains the list of recognized phonemes.</li>
<li>In the second pass, I use that list to get acoustic scores for each phoneme using <span style="font-family: 'Courier New', Courier, monospace;">-wdsegdir</span> as an input argument.</li>
</ul>
<li>Later, I integrated sphinx3 forced alignment with the wami-recorder microphone recording applet so that the user sees the acoustic scores after uploading their recording.</li>
<li>Please try this link to test it: <a href="http://talknicer.net/~ronanki/test/">http://talknicer.net/~ronanki/test</a></li>
<li>Wrote a program to convert a list of each phoneme's "neighbors," or most similar other phonemes, provided by the project mentor from the Worldbet phonetic alphabet to CMUbet.
</li>
<li>Wrote a program to take a string of phonemes representing an expected utterance as input and produce a sphinx3 recognition grammar consisting of a string of alternatives representing each expected phoneme and all of its neighboring phonemes, for automatic edit distance scoring.
</li>
</ul>
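<div>
To illustrate what "edit distance scoring" could mean here, the sketch below computes the standard Levenshtein distance between the expected phoneme sequence and the sequence the recognizer actually chose. Whether the project ends up using exactly this measure is an assumption on my part, and the function name is made up for the example.</div>

```python
def edit_distance(expected, recognized):
    """Levenshtein distance between two phoneme sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `expected` into `recognized`.
    """
    # previous[j] = distance between the expected phonemes processed so far
    # and the first j recognized phonemes.
    previous = list(range(len(recognized) + 1))
    for i, expected_phoneme in enumerate(expected, start=1):
        current = [i]
        for j, recognized_phoneme in enumerate(recognized, start=1):
            cost = 0 if expected_phoneme == recognized_phoneme else 1
            current.append(min(
                previous[j] + 1,         # expected phoneme was dropped
                current[j - 1] + 1,      # extra phoneme was recognized
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    expected = "AY HH AE D F EY TH".split()
    recognized = "AY HH EH D F EY S".split()
    print(edit_distance(expected, recognized))
```

<div>
A distance of zero means the recognizer picked the expected phoneme at every position. Larger values suggest more neighboring phonemes matched the speech better than the expected ones.</div>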
<div>
<b>Ongoing work:</b></div>
<ul>
<li>Reading about Worldbet, OGIbet, ARPAbet, and CMUbet, the different ASCII-based phonetic alphabets and their mappings between each other and the International Phonetic Alphabet.</li>
<li>Will be enhancing the first pass of recognition described above using the generated alternative neighboring phoneme grammars to find phonemes which match the recorded speech more closely than the expected phonemes without using complex post-processing acoustic score statistics.</li>
<li>Trying more parameters and options to derive acoustic scores for each phoneme from sphinx3 forced alignment.</li>
<li>Writing exemplar score aggregation algorithms to find the means, standard deviations, and their expected error for each phoneme in a phrase from a set of recorded exemplar pronunciations of that phrase.</li>
<li>Writing an algorithm which can detect mispronunciations by comparing a recording's acoustic scores to the expected mean and standard deviation for each phoneme, and aggregating those scores to biphones, words, and the entire phrase.</li>
</ul>
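<div>
The last two items could be prototyped along the following lines. This is a sketch under several assumptions, not the finished algorithm: the function names, the 2.0 threshold, and the choice to flag only scores far below the exemplar mean are all placeholders. It also assumes every exemplar recording yields one acoustic score per phoneme position, with higher scores meaning a better match.</div>

```python
import math
import statistics


def aggregate_exemplars(exemplar_scores):
    """Per-phoneme mean, standard deviation, and standard error.

    `exemplar_scores` holds one list per exemplar recording, each with
    one acoustic score per phoneme position of the same phrase.
    At least two exemplars are needed for a sample standard deviation.
    """
    stats = []
    for position_scores in zip(*exemplar_scores):
        mean = statistics.mean(position_scores)
        stdev = statistics.stdev(position_scores)
        stderr = stdev / math.sqrt(len(position_scores))
        stats.append((mean, stdev, stderr))
    return stats


def flag_mispronunciations(scores, stats, threshold=2.0):
    """Flag phonemes scoring more than `threshold` deviations below the mean."""
    flags = []
    for score, (mean, stdev, _stderr) in zip(scores, stats):
        z = (score - mean) / stdev if stdev > 0 else 0.0
        flags.append(z < -threshold)
    return flags


if __name__ == "__main__":
    # Made-up scores for a two-phoneme phrase from three exemplars.
    exemplars = [[-100.0, -200.0], [-110.0, -210.0], [-90.0, -190.0]]
    stats = aggregate_exemplars(exemplars)
    print(flag_mispronunciations([-101.0, -260.0], stats))
```

<div>
Aggregating to biphones, words, and the whole phrase would then combine these per-phoneme values. How best to combine them is still open.</div>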
</div>srikanth ronankihttp://www.blogger.com/profile/15384976912513321039noreply@blogger.com0tag:blogger.com,1999:blog-8012401148587899501.post-25021634116500557932012-05-25T00:02:00.000+05:302012-05-25T00:02:03.708+05:30Troy: GSoC Before Week OneGoogle Summer of Code 2012 officially started this Monday (21 May). Our weekly reports are expected to begin next Monday, but here is a brief overview of the preparations we have accomplished during the "community bonding period."<br />
<div>
<br /></div>
<div>
We started with a group chat including our mentor James and the other student Ronanki. The project details are becoming more clear to me, from the chat and subsequent email communications. For my project, the major focuses will be:</div>
<div>
<br /></div>
<div>
1) A web portal for automatic pronunciation evaluation audio collection; and</div>
<div>
2) An Android-based mobile automatic pronunciation evaluation app.</div>
<div>
<br /></div>
<div>
The core of these two applications is automatic pronunciation evaluation based on edit distance grammars, using CMU Sphinx3.<br />
<br />
Here are the preparations I have accomplished during the bonding period:</div>
<div>
<ol>
<li>Trying out the basic <a href="http://code.google.com/p/wami-recorder/" target="_blank">wami-recorder</a> demo on my school's server;</li>
<li>Modifying <a href="http://code.google.com/p/rtmplite/" target="_blank">rtmplite</a> for audio recording. Rtmplite is a Python implementation of an RTMP server with the minimum support needed for real-time streaming and recording using Adobe's AMF0 protocol. On the server side, the RTMP server daemon process listens on TCP port 1935 by default for connections and media data streaming. On the client side, the Flash client needs to use Adobe ActionScript 3's NetConnection class to set up a session with the server, and the NetStream class for audio and video streaming and for microphone recording. The demo application has been set up at: <a href="http://talknicer.net/~li-bo/testClient/bin-debug/testClient.html">http://talknicer.net/~li-bo/testClient/bin-debug/testClient.html</a></li>
<li>Based on my understanding of the demo application, which does the real time streaming and recording of both audio and video, I started to write my own audio recorder which is a key user interface component for both the web-based audio data collection and the evaluation app. The basic version of the recorder was hosted at: <a href="http://talknicer.net/~li-bo/audioRecorder/audioRecorder.html">http://talknicer.net/~li-bo/audioRecorder/audioRecorder.html</a> . The current implementation:</li>
<ol>
<li>Distinguishes recordings from different users with user IDs;</li>
<li>Loads pre-defined text sentences to display for recording, which will be useful for pronunciation exemplar data collection;</li>
<li>Performs real-time audio recording;</li>
<li>Can play back the recordings from the server; and </li>
<li>Has basic event control logic, such as to prevent users from recording and playing at the same time, etc.</li>
</ol>
<li>I have also learned from <a href="http://cmusphinx.sourceforge.net/wiki/sphinx4:sphinxthreealigner">http://cmusphinx.sourceforge.net/wiki/sphinx4:sphinxthreealigner</a> how to get phoneme acoustic scores from "forced alignment" using sphinx3. Generating the phoneme alignment scores takes two steps. The details of how to perform that alignment can be found in the more tech-oriented posts on my personal blog at <a href="http://troylee2008.blogspot.com/2012/05/testing-cmusphinx3-alignment.html">http://troylee2008.blogspot.com/2012/05/testing-cmusphinx3-alignment.html</a> and <a href="http://troylee2008.blogspot.com/2012/05/cmusphinx3-phoneme-alignment.html">http://troylee2008.blogspot.com/2012/05/cmusphinx3-phoneme-alignment.html</a>.</li>
</ol>
<div>
Currently, these tasks are ongoing:</div>
</div>
<div>
<ol>
<li>Set up the server side process to manage user recordings, i.e., distinguishing between users and different utterances.</li>
<li>Figure out how to use ffmpeg, speexdec, and/or sox to automatically convert the recorded server side FLV files to PCM .wav files after the users upload the recordings. </li>
<li>Verify the recording parameters against the recording and speech recognition quality, possibly taking the network bandwidth into consideration.</li>
<li>Incorporate delays between network and microphone events in the recorder. The current version does not wait for the network events (such as connection setup, data packet transmission, etc.) to finish successfully before processing the next user event, which can often cause the recordings to be clipped.</li>
</ol>
</div>
<div>
<ol> </ol>
</div>Anonymoushttp://www.blogger.com/profile/09375125981129389911noreply@blogger.com0