Anastasios Anastasakos

Speaker Normalization Methods for Speaker Independent Speech Recognition

June 25, 1997
4:30 PM
206 Egan

Abstract

Variability in speaker-independent (SI) acoustic models is attributed to two sources: intra-speaker variation, which is related to the phonetic content of speech, and inter-speaker variation, which is independent of the information content of the speech signal and is caused by differences in the way speakers produce speech. This thesis considers the problem of reducing the effect of inter-speaker variation in an HMM-based speech recognition system. Emphasis is placed on model-based approaches to speaker normalization that incorporate linear transformations in the training of HMM-based acoustic models.

A maximum likelihood formulation, termed Speaker Adaptive Training (SAT), is developed that aims at reducing the inter-speaker variability in the training data jointly with the estimation of the HMM parameters. In the proposed SAT method, the phonetic and speaker variation sources are decoupled and the speaker-induced variation is explicitly modeled. The SAT acoustic models capture the intra-speaker variability, rather than both the intra- and inter-speaker variability as is the case with conventional SI acoustic models. This method of HMM training is well suited for tasks that involve some form of adaptation that aims to minimize the difference between the acoustic models and a test speaker. By modeling the phonetically relevant variation of speech more accurately, the SAT acoustic models can be adapted more accurately to the test speakers. The SAT parameter estimation is formulated as an extension to the Baum-Welch algorithm, and a specific solution is presented for the case in which individual speaker characteristics are modeled via a multivariate linear regression model.
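
The joint estimation described above can be illustrated with a toy sketch. It is not the thesis's algorithm: it assumes scalar features, unit variances, hard class labels in place of Baum-Welch state occupancies, and one affine map (a, b) per speaker standing in for the multivariate linear regression model. All names (fit_affine, sat_train, the synthetic data) are hypothetical. The sketch alternates between fitting each speaker's map to the current canonical means and re-estimating the canonical means given those maps, starting from pooled SI-style means.

```python
# Toy SAT-style alternating estimation (illustrative assumptions only:
# scalar features, unit variances, hard labels, one affine map per speaker).

def fit_affine(xs, mus):
    """Least-squares fit of x ~ a * mu + b for a single speaker."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_mu = sum(mus) / n
    var_mu = sum((m - mean_mu) ** 2 for m in mus)
    cov = sum((m - mean_mu) * (x - mean_x) for m, x in zip(mus, xs))
    a = cov / var_mu
    b = mean_x - a * mean_mu
    return a, b

def total_error(data, mu, transforms):
    """Sum of squared residuals x - (a * mu[k] + b) over all speakers."""
    return sum((x - (a * mu[k] + b)) ** 2
               for obs, (a, b) in zip(data, transforms)
               for k, x in obs)

def sat_train(data, num_classes, iterations=10):
    # SI-style initialization: pooled class means, identity speaker maps.
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for obs in data:
        for k, x in obs:
            sums[k] += x
            counts[k] += 1
    mu = [s / c for s, c in zip(sums, counts)]
    transforms = [(1.0, 0.0)] * len(data)
    history = [total_error(data, mu, transforms)]
    for _step in range(iterations):
        # Step 1: re-fit each speaker's affine map to the canonical means.
        transforms = [
            fit_affine([x for _k, x in obs], [mu[k] for k, _x in obs])
            for obs in data
        ]
        # Step 2: re-estimate canonical means given the speaker maps.
        num = [0.0] * num_classes
        den = [0.0] * num_classes
        for obs, (a, b) in zip(data, transforms):
            for k, x in obs:
                num[k] += a * (x - b)
                den[k] += a * a
        mu = [n / d for n, d in zip(num, den)]
        history.append(total_error(data, mu, transforms))
    return mu, transforms, history

# Synthetic data: three "phone" classes seen through three speakers.
true_mu = [-1.0, 0.0, 2.0]
speakers = [(1.0, 0.0), (1.5, 0.5), (0.7, -1.0)]
offsets = [0.05, -0.03, 0.02]  # small deterministic perturbations
data = [
    [(k, a * true_mu[k] + b + offsets[(k + s) % 3]) for k in range(3)]
    for s, (a, b) in enumerate(speakers)
]
mu, transforms, history = sat_train(data, num_classes=3)
print("SI-style error:", history[0], "after SAT-style training:", history[-1])
```

Because each step minimizes the squared error over one set of parameters with the other held fixed, the error in history cannot increase from one iteration to the next. The first entry corresponds to the pooled, SI-style model with no speaker maps.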

The proposed training algorithm is evaluated in large-vocabulary continuous speech recognition tasks. Experiments compare the recognition accuracy of the SAT acoustic models with that of common SI acoustic models when different speaker adaptation scenarios are applied in the recognition stage. The experimental results demonstrate the efficacy of the SAT acoustic models in accurately adapting to the test speakers. Specifically, in experiments using the Wall Street Journal corpus, the SAT acoustic models achieve up to 10% additional reduction in word error rate relative to the SI acoustic models and result in overall reductions of up to 25% for batch supervised speaker adaptation.
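
For readers unfamiliar with the metric, a relative reduction in word error rate is computed as sketched below. The function name and the error rates are hypothetical and chosen only to show the arithmetic; they are not figures from the thesis.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fractional reduction of new_wer relative to baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical word error rates in percent, for illustration only.
adapted_si_wer = 10.0   # assumed: adapted SI models
adapted_sat_wer = 9.0   # assumed: adapted SAT models
gain = relative_reduction(adapted_si_wer, adapted_sat_wer)
print(f"relative WER reduction: {gain:.0%}")
```

With these assumed numbers, a one-point absolute drop from a 10% baseline is a 10% relative reduction.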

Thesis Committee:
Prof. J. Makhoul (advisor)
First Reader: Richard M. Schwartz
Prof. M. Salehi
Prof. J.G. Proakis