Wednesday, May 30, 2012

Troy: GSoC 2012 Pronunciation Evaluation Week 1

The first week of GSoC 2012 has already been busy. Here is what I have accomplished so far:

  1. To measure the effect of the Speex recording "quality" parameter (which is set by the client from 0 to 10), I recorded the same Sphinx3 test utterance ("NO ONE AT THE STATE DEPARTMENT WANTS TO LET SPIES IN") from a constant source with the quality varying from 0 to 10. As shown on the graph, the higher the Speex quality parameter, the larger the resulting .FLV file. Judging from my own listening, greater quality parameter values do result in better audio, but it is difficult to hear the differences above level 7. I also generated alignment scores to see whether the quality affects the alignment. However, from the results shown in the following graph, the acoustic scores seem essentially identical across the different recordings. To be on the safe side in case of background and line noise, for now we will use a Speex recording quality parameter of 8.
  2. The rtmplite server is now configured to save its uploaded files to the [path_to_webroot]/data directory on the server. The initial audioRecorder applet will place its recordings in the [path_to_webroot]/data/audioRecorder directory, and each user will have a separate folder (e.g. [path_to_webroot]/data/audioRecorder/user1). For each recorded utterance, the file name is now in the format [sentence name]_[quality level].flv.
  3. The conversion from .FLV Speex uploads to .WAV PCM audio files is done entirely in the rtmplite server, using a process spawned by Python's subprocess.Popen() function calling ffmpeg. After rtmplite closes the FLV file, the conversion is performed immediately, and the converted WAV file has exactly the same path and name except for the suffix, which is .wav instead of .flv. Guillem suggested the sox command for the conversion, but it doesn't recognize .flv files directly. Another possibility was speexdec, but that won't open .flv files either.
  4. In the audioRecorder client, the user interface now waits for NetConnection and NetStream events to open and close successfully before proceeding with other events. A 0.5-second delay has also been inserted at the beginning and end of the recording button click event to avoid inadvertently trimming the front or end of the recording. 
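The FLV-to-WAV conversion in item 3 can be sketched in Python as follows. This is a minimal illustrative sketch, not the server's actual code (which is 2012-era Python 2 with Popen()); the stdin redirect and session detachment are hedges against the background-process hang described in the week 2 plans below:

```python
import subprocess
from pathlib import Path

def flv_to_wav_command(flv_path):
    """Build the ffmpeg command that converts an uploaded .flv Speex
    recording to a PCM .wav file with the same path and base name."""
    wav_path = str(Path(flv_path).with_suffix(".wav"))
    return ["ffmpeg", "-y", "-i", str(flv_path), wav_path], wav_path

def convert_flv_to_wav(flv_path):
    """Run the conversion immediately after rtmplite closes the FLV."""
    cmd, wav_path = flv_to_wav_command(flv_path)
    subprocess.run(cmd,
                   stdin=subprocess.DEVNULL,   # never block reading a tty
                   stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL,
                   start_new_session=True,     # detach from the terminal
                   check=True)
    return wav_path
```

For example, a file uploaded as [path_to_webroot]/data/audioRecorder/user1/sent1_8.flv would be converted in place to sent1_8.wav in the same folder.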
My plans for the 2nd week are:
  1. Solve a problem encountered when converting FLV files to WAV using ffmpeg via Python's Popen() function. If the main Python script is run from a terminal as "python", then everything works great. However, if I put it in the background and log off the server with "python &", then every time Popen() is invoked, the whole process hangs with a "Stopped + &" error message. I will try to find a way to work around this issue. Perhaps if I start the process from cron (after checking whether it is already running via a process ID number in a .pid text file), it will spawn subprocesses without stopping the way it does when detached from a terminal.
  2. Finish the upload interface. There will be two kinds of interfaces: one for students and one for exemplar pronunciations. For the students, we will display from one to five cue phrases below space for a graphic or animation, assuming the smallest possible screen, using HTML that will also look good in a larger window. For the exemplar recordings, we need only display one phrase, but we should also have per-upload form fields (e.g., name, age, sex, native speaker (y/n), where the speaker lived at ages 6-8 (which determines their accent), self-reported accent, etc.) which should persist across multiple uploads by the same user (perhaps using HTTP cookies). I want to integrate those fields with the MySQL database running on our server, so I will need to create a SQL schema with some CREATE TABLE statements to hold all those fields, the filenames, perhaps recording durations, the date and time, and other information.
  3. Test the rtmplite upload server to make sure it works correctly and without race conditions during simultaneous uploads from multiple users, and both sequential and simultaneous recording uploads by the same user, just to be on the safe side.
  4. Further milestones are listed at
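A first cut at the exemplar-metadata schema from item 2 might look like the following. This is a sketch using Python's built-in sqlite3 module for illustration (the server will use MySQL, with minor type changes), and every table and column name here is an assumption, not a decided design:

```python
import sqlite3

# Hypothetical schema for exemplar uploads: per-speaker fields that
# persist across uploads, plus one row per uploaded recording.
SCHEMA = """
CREATE TABLE IF NOT EXISTS speakers (
    speaker_id   INTEGER PRIMARY KEY,
    name         TEXT,
    age          INTEGER,
    sex          TEXT,
    native       TEXT,     -- native speaker? 'y' or 'n'
    region_6_8   TEXT,     -- where the speaker lived ages 6-8
    self_accent  TEXT      -- self-reported accent
);
CREATE TABLE IF NOT EXISTS recordings (
    recording_id INTEGER PRIMARY KEY,
    speaker_id   INTEGER REFERENCES speakers(speaker_id),
    filename     TEXT NOT NULL,
    duration_sec REAL,
    uploaded_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

def init_db(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The same CREATE TABLE statements, with MySQL types and AUTO_INCREMENT keys, would go into the server database.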

Friday, May 25, 2012

Ronanki: GSoC Work Prior to the Official Start

Well, it has been a month since I got accepted into this year's Google Summer of Code. The community bonding period within the CMU Sphinx organization has been a great time for me. Our organization has six GSoC students this year working on different projects. We introduced ourselves to each other over the cmusphinx-gsoc mailing list and had a few conversations over chat. Thanks to Carol Smith, I received my welcome package from Google on May 19th, and a free ACM membership too :)

It has been three days since GSoC 2012 started officially. Prior to that, I familiarized myself with a few different things with the help of my mentor. He created a wiki page for our projects; Troy and I are also going to blog and update the wiki there during the summer, so please check there for important updates too.

Currently, my goal is to build a web interface which allows users to evaluate their pronunciation. Some of the sub-tasks have already been accomplished, and some of them are still ongoing:

Work accomplished:
  • Created an initial web interface which allows users to record and play back their speech using the open-source wami-recorder, which is being developed by the Spoken Language Systems group at MIT.
  • When the recording is completed, the wave file is uploaded to the server for processing.
  • Sphinx3 forced alignment is used to align a phoneme string expected for the utterance with the recorded speech, to calculate time endpoints and acoustic scores for each phoneme.
  • I tried many different output arguments to sphinx3_align and successfully tested producing the phoneme acoustic scores using two recognition passes.
    • In the first pass, I use -phlabdir as an argument to get a .lab file as output, which contains the list of recognized phonemes.
    • In the second pass, I use that list to get acoustic scores for each phoneme using -wdsegdir as an input argument.
  • Later, I integrated sphinx3 forced alignment with the wami-recorder microphone recording applet so that the user sees the acoustic scores after uploading their recording.
  • Please try this link to test it:
  • Wrote a program to convert a list of each phoneme's "neighbors," or most similar other phonemes, provided by the project mentor from the Worldbet phonetic alphabet to CMUbet.
  • Wrote a program to take a string of phonemes representing an expected utterance as input and produce a sphinx3 recognition grammar consisting of a string of alternatives representing each expected phoneme and all of its neighboring phonemes, for automatic edit distance scoring. 
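The grammar generation in the last item can be sketched as follows. The neighbor table here is a tiny invented fragment (the real one comes from the mentor's Worldbet-to-CMUbet list), and the parenthesized-alternation syntax is just one plausible rendering, not necessarily exactly what sphinx3 expects:

```python
# Hypothetical fragment of the phoneme-neighbor table (CMUbet).
NEIGHBORS = {
    "AH": ["AA", "ER"],
    "T":  ["D", "K"],
    "S":  ["Z", "SH"],
}

def phoneme_alternatives(phonemes):
    """Turn an expected phoneme sequence into a grammar string where
    each position allows the expected phoneme or any of its neighbors."""
    groups = []
    for p in phonemes:
        alts = [p] + NEIGHBORS.get(p, [])
        groups.append("( " + " | ".join(alts) + " )")
    return " ".join(groups)
```

Running the recognizer against such a grammar lets an edit-distance comparison reveal which neighbors, if any, matched better than the expected phonemes.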
Ongoing work:
  • Reading about Worldbet, OGIbet, ARPAbet, and CMUbet, the different ASCII-based phonetic alphabets and their mappings between each other and the International Phonetic Alphabet.
  • Will be enhancing the first pass of recognition described above using the generated alternative neighboring phoneme grammars to find phonemes which match the recorded speech more closely than the expected phonemes without using complex post-processing acoustic score statistics.
  • Trying more parameters and options to derive acoustic scores for each phoneme from sphinx3 forced alignment.
  • Writing an exemplar score aggregation algorithm to find the means, standard deviations, and their expected error for each phoneme in a phrase from a set of recorded exemplar pronunciations of that phrase.
  • Writing an algorithm which can detect mispronunciations by comparing a recording's acoustic scores to the expected mean and standard deviation for each phoneme, and aggregating those scores to biphones, words, and the entire phrase.
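The detection and aggregation ideas in the last two items might be sketched like this. The z-score formulation and the -2.0 threshold are assumptions for illustration, not the project's final statistics:

```python
from statistics import mean

def phoneme_z_scores(observed, exemplar_stats):
    """observed: {position: acoustic_score} for one recording.
    exemplar_stats: {position: (mean, stdev)} from exemplar recordings."""
    z = {}
    for pos, score in observed.items():
        mu, sigma = exemplar_stats[pos]
        z[pos] = (score - mu) / sigma if sigma > 0 else 0.0
    return z

def flag_mispronunciations(z_scores, threshold=-2.0):
    """Flag phonemes whose score falls far below the exemplar mean."""
    return [pos for pos, z in z_scores.items() if z < threshold]

def phrase_score(z_scores):
    """Aggregate per-phoneme z-scores into one phrase-level score;
    the same averaging could be applied per biphone or per word."""
    return mean(z_scores.values())
```
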

Troy: GSoC Before Week One

Google Summer of Code 2012 officially started this Monday (21 May). Our weekly reports are expected to begin next Monday, but here is a brief overview of the preparations we accomplished during the "community bonding period."

We started with a group chat including our mentor James and the other student, Ronanki. The project details are becoming clearer to me from the chat and subsequent email communications. For my project, the major focuses will be:

1) A web portal for automatic pronunciation evaluation audio collection; and
2) An Android-based mobile automatic pronunciation evaluation app.

The core of these two applications is edit distance grammar-based automatic pronunciation evaluation using CMU Sphinx3.

Here are the preparations I have accomplished during the bonding period:
  1. Trying out the basic wami-recorder demo on my school's server;
  2. Changing rtmplite for audio recording. Rtmplite is a Python implementation of an RTMP server with the minimum support needed for real-time streaming and recording using Adobe's AMF0 protocol. On the server side, the RTMP daemon listens on TCP port 1935 by default for connections and media data streaming. On the client side, the Flash application uses Adobe ActionScript 3's NetConnection function to set up a session with the server, and the NetStream function for audio and video streaming as well as microphone recording. The demo application has been set up at:
  3. Based on my understanding of the demo application, which does real-time streaming and recording of both audio and video, I started to write my own audio recorder, which is a key user interface component for both the web-based audio data collection and the evaluation app. The basic version of the recorder was hosted at: . The current implementation:
    1. Distinguishes recordings from different users with user IDs;
    2. Loads pre-defined text sentences to display for recording, which will be useful for pronunciation exemplar data collection;
    3. Performs real-time audio recording;
    4. Can play back the recordings from the server; and 
    5. Has basic event control logic, such as to prevent users from recording and playing at the same time, etc.
  4. Also, I have learned how to get phoneme acoustic scores from "forced alignment" using sphinx3. To generate the phoneme alignment scores, two steps are needed. The details of how to perform that alignment can be found in my more tech-oriented posts on my personal blog.
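The two alignment passes mentioned above can be captured as simple command construction. Only the -phlabdir (first pass) and -wdsegdir (second pass) flags come from the posts; the shared arguments (control file, acoustic models, dictionary, transcripts) vary per setup, so they are left as caller-supplied placeholders here:

```python
def align_commands(common_args, phlab_dir, wdseg_dir):
    """Build the two sphinx3_align command lines: pass 1 writes the
    recognized phoneme labels (-phlabdir), and pass 2 uses that list
    to produce per-phoneme acoustic scores (-wdsegdir)."""
    pass1 = ["sphinx3_align"] + common_args + ["-phlabdir", phlab_dir]
    pass2 = ["sphinx3_align"] + common_args + ["-wdsegdir", wdseg_dir]
    return pass1, pass2
```

Each list can then be handed to subprocess for execution, with pass 2 run only after pass 1 completes.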
Currently, these tasks are ongoing:
  1. Set up the server side process to manage user recordings, i.e., distinguishing between users and different utterances.
  2. Figure out how to use ffmpeg, speexdec, and/or sox to automatically convert the recorded server side FLV files to PCM .wav files after the users upload the recordings. 
  3. Verify the recording parameters against the recording and speech recognition quality, possibly taking the network bandwidth into consideration.
  4. Incorporating delays between network and microphone events in the recorder. The current version does not wait for network events (such as connection setup, data packet transmission, etc.) to finish successfully before processing the next user event, which can often cause the recordings to be clipped.

Saturday, May 19, 2012

Kickstarter Launched: Choose Your Reading and Pronunciation Adventure!

An expert Android programmer named Guillem Perez wanted to help with the Pronunciation Evaluation project for the Google Summer of Code, but was too late to apply by the deadline. Due to that and the continuing need for funds to collect exemplar pronunciations and produce large quantities of high quality instructional content and graphics, we have launched a Kickstarter project: Choose Your Reading and Pronunciation Adventure.

Thanks to Google and the CMU Sphinx organization's generous sponsorship, we only need to raise less than half of the previous fundraising goal. Within the first two hours of the Kickstarter's launch, we already raised $150 in pledges.

Please keep the momentum going by pledging generously and spreading the word far and wide.  Thank you!

Friday, May 18, 2012

One Laptop Per Child hardware request

[Note: The One Laptop Per Child organization requires that we blog this request for their hardware which we hope to support --jps]

Dear OLPC Contributors Program:

Here is our Google Summer of Code team's project proposal for a pronunciation evaluation game to teach reading and spoken English on OLPC hardware....

1. Project Title & Shipment Detail

Name of Project: Pronunciation Evaluation for Google Summer of Code 2012

Number of Laptops (or other hardware) You Request to Borrow:  Three: one each of the XO-1, XO-1.5, and XO-1.75 laptops

Loan Length: 6 months at least; preferably ongoing for support and maintenance purposes

2. Team Participants:

James Salsman, Srikanth Ronanki, Troy Lee

Past Experience/Qualifications: Please see and

3. Objectives

We are building a free, open source, choose-your-own-adventure style game to teach beginning and intermediate English reading and pronunciation to learners of all ages and backgrounds. It uses an automatic pronunciation evaluation system based on the CMU Sphinx3 speech recognition system, which detects mispronunciations at the phoneme level and provides feedback scores and learner adaptation with phoneme, biphone, word, and phrase scores, based on standardized phoneme acoustic scores and durations plus edit distance scoring using alternate pronunciation grammars.  We would like to build clients and stand-alone systems for OLPC hardware.

4. Plan of Action

Please see and

5. Needs

Why is this project needed?  There is currently no oral reading tutor or pronunciation tutor available on OLPC systems, and very few such free systems on the popular hardware platforms. The potential benefits of such systems, when they are easily available and engaging, are phenomenal: please see the research showing that speech recognition-based reading instruction can be more effective per time spent than instruction from a teacher working with only two students at once.

Locally? The need is world-wide.

In the greater OLPC/Sugar community?  Yes.

Outside the community?  Yes.

Why can't this project be done in emulation using non-XO machines?  It uses audio input from microphone hardware.

Why are you requesting the number of machines you are asking for?  We would like one of each kind of laptop so that we can correctly accommodate any differences in audio input.  We will ask other OLPC users to help test.

Will you consider (1) salvaged/rebuilt or (2) damaged XO Laptops?  Yes, as long as they perform identically to nominal systems for audio input and display, and are not prohibitively difficult to develop with.

6. Sharing Deliverables

Project URL—where you'll Blog specific ongoing progress:

How will you convey tentative ideas & results back to the OLPC/Sugar community, prior to completion?  James has known Sameer Verma for years, but as Sameer is usually very busy, we request an OLPC mentor.

How will the final fruits of your labor be distributed to children or community members worldwide?  The client and stand-alone applications will be made available for free, with installation instructions kept current on the OLPC wiki, and we will ask that they be evaluated for official support.

Will your work have any possible application or use outside our community?  Certainly. We also intend to support cross-platform web browsers with Flash/speex/rtmplite and wami-recorder, Android phones and tablets, and eventually iOS platforms.

If yes, how will these people be reached?  The availability will be announced on education email lists, free marketplace listings, and we will publish at least one peer-reviewed report on our work.

Have you investigated working with nearby XO Lending Libraries or Project Groups?  Until July I am in rural Colorado.

7. Quality/Mentoring

Would your Project benefit from Support, Documentation and/or Testing people?  Yes.

Teachers' input into Usability?  This is unlikely because language tutors using speech recognition designed with substantial teacher input have not been entirely successful in the marketplace, even with research showing their clear advantages, possibly because they are insufficiently engaging for youth.  Our design is based on an adventure game where selecting alternative phrases by reading them out loud reveals a branching story incrementally, instead of trying to emulate oral reading from pages of a book.

How will you promote your work?  As above:  availability will be announced on education email lists, free marketplace listings, and we will publish at least one peer-reviewed report on our work.

Can we help you with an experienced mentor from the OLPC/Sugar community?  Yes, please put us in contact with a mentor who has audio input development experience.

8. Timeline (Start to Finish)

Client development should be web-based as much as possible, but a very small external microphone application will likely be required, probably comprising a speex encoder, a set of Record/Stop-Play-Submit buttons, and perhaps gain and/or volume controls. Its development will proceed alongside the development of the other clients and will be completed by August. We would like to retain the OLPC hardware for support, debugging, and software maintenance.

Please see and for milestones and detailed development schedules.

We will blog continuing progress and obstacles at

[X] I agree to pass on the laptop(s) to a local OLPC group or other interested contributors in case I do not have need for the laptop(s) anymore or in case my project progress stalls.