A Computational Framework for Exploring the Role of Speech Production in Speech Processing/Recognition


Prasanta K Ghosh, University of Southern California, Department of Electrical Engineering, 3740 McClintock Ave.


Wednesday, 22 December 2010 (All day)


  • AG-80

It has been shown several times that speech recognition accuracy improves when direct measurements of speech articulation are used in addition to the talker's speech acoustics. However, access to such direct articulation data during speech processing/recognition is not feasible in practice. In my presentation, I shall show that speech production/articulation features can be estimated from the speech signal of any arbitrary talker, even though these features are not directly available from the talker. To estimate these production-oriented features, I shall describe a talker-independent acoustic-to-articulatory inversion framework based on a generalized smoothness criterion. The framework requires parallel articulatory and acoustic data from only a single subject, and this training subject need not be one of the talkers. Using the estimated features improves acoustic-feature-based recognition accuracy by ~4% (absolute) in a phonetic recognition experiment on the TIMIT corpus. Interestingly, when the training subject is interpreted as a listener, the production-oriented features, and hence the speech recognition, can be interpreted as listener-specific. We will see that such a listener-specific framework for speech processing/recognition provides a production-oriented explanation of the variability in recognition accuracy among non-native listeners. We will also see that the listener-specific framework acts as a bridge between the scientific and technological viewpoints on the role of speech production in speech perception in human speech communication.
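
As a rough illustration of what a smoothness criterion can look like in an inversion setting, the sketch below smooths a noisy per-frame articulatory estimate by penalizing its second differences. This is a generic textbook formulation, not necessarily the generalized smoothness criterion presented in the talk; the function name smooth_trajectory, the weight lam, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def smooth_trajectory(y, lam=10.0):
    """Hypothetical helper: return x minimizing ||x - y||^2 + lam * ||D x||^2,
    where D is the second-difference operator (a simple roughness penalty).
    y has shape (T,) or (T, K) for K articulatory channels."""
    y = np.asarray(y, dtype=float)
    T = y.shape[0]
    # Second-difference operator of shape (T-2, T); each row is [1, -2, 1].
    D = np.diff(np.eye(T), n=2, axis=0)
    # Setting the gradient to zero gives the normal equations (I + lam * D^T D) x = y.
    A = np.eye(T) + lam * (D.T @ D)
    return np.linalg.solve(A, y)

# Synthetic example (assumed data): a slow trajectory plus frame-level noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
noisy = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(t.size)
smoothed = smooth_trajectory(noisy, lam=10.0)
```

A larger lam favors smoother trajectories at the cost of fidelity to the frame-level estimates; a trajectory that is already linear in time has zero second difference and is returned unchanged.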