Wednesday, May 30, 2012

Troy: GSoC 2012 Pronunciation Evaluation Week 1

The first week of GSoC 2012 has already been busy. Here is what I have accomplished so far:

  1. To measure the effect of the Speex recording "quality" parameter (which the client can set from 0 to 10), I recorded the same Sphinx3 test utterance ("NO ONE AT THE STATE DEPARTMENT WANTS TO LET SPIES IN") from a constant source, varying the quality from 0 to 10. As shown on the graph, the higher the Speex quality parameter, the larger the resulting .FLV file. Judging from my own listening, greater quality parameter values do produce better audio, but it is difficult to hear the differences above level 7. I also generated alignment scores to see whether the quality affects the alignment; however, as the following graph shows, the acoustic scores are essentially identical across the recordings. To be on the safe side in case of background and line noise, for now we will use a Speex recording quality parameter of 8.
  2. The rtmplite server is now configured to save its uploaded files to the [path_to_webroot]/data directory on the server. The initial audioRecorder applet will place its recordings in the [path_to_webroot]/data/audioRecorder directory, with a separate folder for each user (e.g., [path_to_webroot]/data/audioRecorder/user1). For each recorded utterance, the file name now has the format [sentence name]_[quality level].flv.
  3. The conversion from .FLV Speex uploads to .WAV PCM audio files is done entirely in the rtmplite server, using a process spawned by Python's subprocess.Popen() function to call ffmpeg. As soon as rtmplite closes the FLV file, the conversion runs; the converted WAV file has exactly the same path and name except for the suffix, .wav instead of .flv. Guillem suggested the sox command for the conversion, but it doesn't recognize .flv files directly; speexdec was another possibility, but it won't open .flv files either.
  4. In the audioRecorder client, the user interface now waits for the NetConnection and NetStream events to open and close successfully before proceeding with other events. A 0.5-second delay has also been inserted at the beginning and end of the record-button click event to avoid inadvertently trimming the start or end of the recording.
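The conversion step in item 3 can be sketched roughly as follows. This is a minimal stand-alone sketch, not the actual rtmplite hook: the function names are mine, and the real server code calls something like this right after it closes each uploaded .flv file.

```python
import os
import subprocess


def wav_path_for(flv_path):
    """Derive the output path: same directory and name, .wav suffix."""
    base, _ = os.path.splitext(flv_path)
    return base + ".wav"


def convert_flv_to_wav(flv_path):
    """Convert an uploaded Speex .flv to a PCM .wav file with ffmpeg.

    Intended to be called as soon as the server closes the .flv file.
    """
    wav_path = wav_path_for(flv_path)
    # -y overwrites any stale output; ffmpeg decodes the Speex stream
    # inside the FLV container and writes PCM WAV alongside it.
    proc = subprocess.Popen(
        ["ffmpeg", "-y", "-i", flv_path, wav_path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc.communicate()  # wait for the conversion to finish
    return wav_path
```

So an upload saved as [path_to_webroot]/data/audioRecorder/user1/phrase_8.flv would be converted in place to phrase_8.wav in the same folder.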
My plans for the second week are:
  1. Solve a problem encountered when converting FLV files to WAV using ffmpeg with Python's Popen() function. If the main Python script is run from a terminal, everything works great. However, if I put it in the background with "&" and log off the server, every time Popen() is invoked the whole process hangs with a "Stopped + &" error message. I will try to figure out a way to work around this issue. Perhaps if I start the process from cron (after checking a .pid text file to see whether it is already running), it will spawn subprocesses without stopping the way it does when detached from a terminal.
  2. Finish the upload interface. There will be two kinds of interfaces: one for students and one for exemplar pronunciations. For the students, we will display from one to five cue phrases below space for a graphic or animation, using HTML that assumes the smallest possible screen but also looks good in a larger window. For the exemplar recordings, we need to display only one phrase, but we should also have per-upload form fields (e.g., name, age, sex, native speaker (y/n), where the speaker lived at ages 6-8 (which determines their accent), self-reported accent, etc.) which should persist across multiple uploads by the same user (perhaps using HTTP cookies). I want to integrate those fields with the MySQL database running on our server, so I will need to create a SQL schema with some CREATE TABLE statements to hold all those fields, the filenames, perhaps recording durations, the date and time, and possibly other information.
  3. Test the rtmplite upload server to make sure it works correctly, without race conditions, during simultaneous uploads from multiple users, as well as during both sequential and simultaneous recording uploads by the same user, just to be on the safe side.
  4. Further milestones are listed at
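One likely cause of the "Stopped + &" hang in plan item 1 is shell job control: a backgrounded process (or its child) that tries to read from or write to the controlling terminal gets SIGTTIN/SIGTTOU and is suspended. Assuming that is what is happening here, a sketch of a workaround is to redirect all three standard streams and detach the child into its own session (the function name and log filename are my own, for illustration):

```python
import os
import subprocess


def spawn_detached(cmd):
    """Spawn a subprocess that never touches the controlling terminal.

    When the parent script is backgrounded with "&" and the terminal
    goes away, a child that reads stdin or writes to the tty can be
    suspended by SIGTTIN/SIGTTOU, showing up as "Stopped".  Redirecting
    the standard streams and starting a new session avoids that.
    """
    devnull = open(os.devnull, "rb")
    log = open("convert.log", "ab")  # capture ffmpeg chatter instead of the tty
    return subprocess.Popen(
        cmd,
        stdin=devnull, stdout=log, stderr=log,
        preexec_fn=os.setsid)  # detach from the controlling terminal
```

Running the script from cron, as suggested above, sidesteps the problem for the same reason: cron jobs have no controlling terminal to begin with.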

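For the schema in plan item 2, a first sketch of the CREATE TABLE statements might look like the following. The table and column names here are my guesses at the fields listed above, and SQLite stands in for the MySQL server purely so the sketch is self-contained; the MySQL version would differ in minor type details.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS speakers (
    speaker_id   INTEGER PRIMARY KEY,
    name         TEXT,
    age          INTEGER,
    sex          TEXT,
    native       INTEGER,          -- 1 = native speaker, 0 = not
    home_ages_6_8 TEXT,            -- where the speaker lived at ages 6-8
    self_accent  TEXT              -- self-reported accent
);
CREATE TABLE IF NOT EXISTS recordings (
    recording_id INTEGER PRIMARY KEY,
    speaker_id   INTEGER REFERENCES speakers(speaker_id),
    filename     TEXT NOT NULL,    -- [sentence name]_[quality level].flv
    duration_s   REAL,             -- recording duration in seconds
    uploaded_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

The per-speaker fields live in their own table so they persist across multiple uploads by the same user, matching the cookie-based persistence planned for the form.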

  1. Sorry for my ignorance, but why do you use Sphinx3 and not the latest version (4)?

  2. I can answer that, as it was my decision as the project mentor. Sphinx3 is written in C, while Sphinx4 is in Java, which takes much more memory and time to run. We hope to create stand-alone versions which run in limited memory on mobile devices and OLPC laptops. The Viterbi beam search algorithm is very memory-bus-bandwidth intensive, so using the Java Virtual Machine instead of native code, along with the memory overhead that Java data structures entail (not to mention unpredictable garbage collection in managed memory), would be completely impractical for response time, memory utilization, and battery life on mobile and OLPC systems.

    Furthermore, Sphinx4 currently has some problems doing forced alignment of phonemes:

    1. Thanks for the explanation. I didn't know that Sphinx3 was implemented in C. Speaking of which, why the decision to completely change the language from one version to the next? And is version 3 still under development, or has it completely lost support?

      I am very happy to know that you are thinking of creating stand-alone versions for mobile devices. I think the work you are developing is very important! Congratulations =)

    2. The decision to move from C to Java for Sphinx 4 was made a decade ago, and the rationale is documented. Sphinx3 is still maintained and is very stable. Thank you for your kind words.