Thursday, March 29, 2012

GSoC 2012 Application Thoughts and Literature Review

When I was considering applying to Google Summer of Code 2012, my friends suggested that I choose an organization suited to my area of research. CMUSphinx is one such organization, and it provides a good open source platform for learning more about speech recognition. I was thrilled when I came across Pronunciation Evaluation in the list of suggested projects on the CMUSphinx wiki, because I had worked on a similar word-level project earlier, using limited speech corpora, as part of a winter school organized by IIIT-H and CMU under the guidance of Dr. Kishore Prahallad and Dr. Bhiksha Raj. I thought it would be a wonderful opportunity to work on Pronunciation Evaluation with CMU Sphinx for GSoC 2012. Feedback on pronunciation is vital for spoken language processing. Automatic pronunciation evaluation and feedback can help non-native speakers identify their errors, learn sounds and vocabulary, and improve their pronunciation performance.

Thanks to Google, I found the project mentor's email address. He and I have already had several conversations by email and text chat, and he has been very helpful. I would like to work with him, and I am looking forward with great enthusiasm to working on this project. If I am given the opportunity, this will be my first experience as an open source project contributor.

So far I have read six research papers related to automatic pronunciation evaluation using speech recognition in preparation for my project application. Below is a brief overview of them:

1. Franco et al. (2000) “The SRI EduSpeak System: Recognition and Pronunciation Scoring for Language Learning” http://louis.speech.sri.com/papers/instil2000-eduspeak.pdf

The EduSpeak system is a software development toolkit that enables developers to use speech recognition and pronunciation scoring technology. The paper presents some adaptation techniques to recognize both native and non-native speech in a speaker-independent manner. The system provides automatic evaluation of pronunciation quality for computer-assisted language learning (CALL) applications. The authors developed algorithms to grade pronunciation quality of non-native speakers independent of the text, which is to say without the system knowing the content of the text in advance. A posterior probability-based phone-level mispronunciation detection scheme is implemented in the EduSpeak toolkit.
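To make the idea of posterior-based phone scoring concrete, here is a minimal sketch of my own. It is not code from the EduSpeak toolkit, and the function names, the input layout (one list of per-phone log-likelihoods per frame), and the optional priors are all assumptions made for illustration. It computes the average log posterior of the intended phone over the frames of one aligned segment; a low score would suggest a possible mispronunciation, with the threshold left to be tuned on real data.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def phone_posterior_score(frame_loglikes, target, log_priors=None):
    """Average log posterior of phone `target` over one segment.

    frame_loglikes: list of frames, each a list of acoustic
        log-likelihoods, one per phone model (hypothetical layout).
    target: index of the phone the speaker was supposed to say.
    log_priors: optional list of log phone priors (assumed uniform
        when omitted).
    """
    total = 0.0
    for loglikes in frame_loglikes:
        if log_priors is not None:
            joint = [ll + lp for ll, lp in zip(loglikes, log_priors)]
        else:
            joint = list(loglikes)
        # log P(target | frame) = log joint(target) - log sum_j joint(j)
        total += joint[target] - log_sum_exp(joint)
    return total / len(frame_loglikes)
```

For example, `phone_posterior_score([[-1.0, -10.0]], 0)` is close to zero because the intended phone dominates the frame, while the same frame scored against phone 1 gives a strongly negative value.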

2. M.A. Peabody (2011) “Methods for Pronunciation Assessment in Computer Aided Language Learning” (Massachusetts Institute of Technology: Ph.D. thesis) http://groups.csail.mit.edu/sls/publications/2011/Peabody_Thesis_PHD_2011.pdf

This thesis focuses on the problem of identifying mispronunciations made by non-native speakers using a CALL system. Eleven years of practice in this field have shown that text-independent systems have serious limitations, and that detecting mispronunciations requires a large corpus of speech with human judgements of pronunciation quality. While typical approaches rely on expert phoneticians, Peabody obtained phone-level judgements of pronunciation quality from non-expert, crowdsourced, word-level judgements. He also proposed a novel method for transforming mel-frequency cepstral coefficients (MFCCs) into a feature space that represents four key positions of English vowel production for robust pronunciation evaluation. I hope to evaluate a similar technique in my GSoC project if time permits.

3. Moustroufas and Digalakis (2007) “Automatic pronunciation evaluation of foreign speakers using unknown text” http://www.telecom.tuc.gr/~vas/papers/csl-pronunciation-evaluation.pdf

This paper presents various techniques to evaluate the pronunciation of students of a foreign language, again without using any knowledge of the uttered text. The authors used native speech corpora for training pronunciation evaluation. They experimented with different kinds of Gaussian mixture models (GMMs) and different values of a grammar probability weight (GPW) parameter in the hidden Markov models (HMMs) to evaluate their pronunciation scoring.

4. Abdou et al. (2006) “Computer Aided Pronunciation Learning System Using Speech Recognition Techniques” http://www.cs.toronto.edu/~asamir/papers/is06.pdf

This paper describes the implementation of a speech-enabled computer-aided pronunciation learning system called HAFSS. The system was developed for teaching Arabic pronunciation to non-native speakers. It uses a speech recognizer and a phoneme duration classification algorithm to detect pronunciation errors. The authors also used maximum likelihood linear regression (MLLR) speaker adaptation algorithms.

5. Bhat et al. (2010) “Pronunciation scoring for Indian English learners using a phone recognition system” http://www.ee.iitb.ac.in/daplab/publications/cbg_kls_pr_iitm2010.pdf

These authors designed a pronunciation scoring system based on a phone recognizer, built with both the popular HTK and CMU Sphinx speech recognition toolkits. The system was evaluated on Indian English speech with models trained on the TIMIT database. They used forced alignment decoding with both HTK and Sphinx3.

6. Pakhomov et al. (2008) “Forced-Alignment and Edit-Distance Scoring for Vocabulary Tutoring Applications” Lecture Notes in Computer Science 5246:443-50 http://www.springerlink.com/content/l0385t6v425j65h7/

This paper describes the evaluation of different automatic speech recognition (ASR) technologies applied to the assessment of young children's basic English vocabulary. The authors used the HTK version 3.4 toolkit for ASR. They calculated acoustic confidence scores using forced alignment and compared those to the edit distance between the expected and actual ASR output. They trained three types of phoneme-level language models: fixed phonemes, free phonemes and a biphone model.
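The edit-distance comparison mentioned above can be sketched with the standard Levenshtein algorithm applied to phoneme sequences. This is only an illustration under my own assumptions, not the scoring code from the paper: the function name is hypothetical, the phoneme labels in the example are ARPAbet-style placeholders, and every insertion, deletion, and substitution is given a cost of 1.

```python
def edit_distance(expected, recognized):
    """Levenshtein distance between two phoneme sequences.

    expected: list of phonemes for the target word.
    recognized: list of phonemes output by the recognizer.
    All edit operations cost 1 (an assumption for this sketch).
    """
    # prev[j] holds the distance between the first i-1 expected
    # phonemes and the first j recognized phonemes.
    prev = list(range(len(recognized) + 1))
    for i, exp_phone in enumerate(expected, 1):
        curr = [i]
        for j, rec_phone in enumerate(recognized, 1):
            substitution = prev[j - 1] + (0 if exp_phone == rec_phone else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

For instance, `edit_distance("K AE T".split(), "K AH T".split())` counts a single substitution. Dividing the result by the length of the expected sequence would give a rough per-word error rate that could be compared against an acoustic confidence score.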