This project is supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
The visual cues from a talker's face and articulators (lips, teeth, tongue) are important for speech comprehension. Trained professionals can understand what is being said purely by watching lip movements (lip reading) [1]. For ordinary listeners and the hearing-impaired population, the presence of visual speech signals has been shown to significantly improve comprehension, even when those signals are synthetic [2]. The benefit of adding visual speech signals is most pronounced when the acoustic signal is degraded by background noise, communication-channel distortion, or reverberation.
In many scenarios, such as telephony, speech communication nevertheless remains purely acoustic. The visual modality may be absent because no camera is available, because the communication channel's bandwidth is limited, or because of privacy concerns. One way to improve speech comprehension in these scenarios is to synthesize a talking face from the acoustic speech in real time at the receiver's side. A key challenge of this approach is ensuring that the generated visual signals, especially the lip movements, are well coordinated with the acoustic signals; otherwise the synthesized face introduces additional confusion rather than reducing it.
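To make the audio-to-face idea concrete, below is a minimal sketch of one common formulation: a sequence model that maps audio features to facial landmark positions frame by frame. This is not the system from the paper; the feature choice (80-band log-mel spectrogram), the network shape (a unidirectional LSTM, since real-time operation rules out bidirectional processing), and the landmark count are all illustrative assumptions.

import torch
import torch.nn as nn

class AudioToLipLandmarks(nn.Module):
    """Illustrative sketch: map log-mel audio frames to 2D lip landmarks."""

    def __init__(self, n_mels=80, hidden=256, n_landmarks=20):
        super().__init__()
        # Unidirectional LSTM so each output frame depends only on past audio,
        # which is required for real-time synthesis at the receiver's side.
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        # Each landmark is an (x, y) coordinate pair.
        self.head = nn.Linear(hidden, n_landmarks * 2)

    def forward(self, mel):
        # mel: (batch, frames, n_mels)
        h, _ = self.rnn(mel)
        out = self.head(h)  # (batch, frames, n_landmarks * 2)
        return out.view(mel.size(0), mel.size(1), -1, 2)

model = AudioToLipLandmarks()
mel = torch.randn(1, 100, 80)   # one utterance, 100 log-mel frames (dummy input)
landmarks = model(mel)          # (1, 100, 20, 2): lip positions per audio frame
print(landmarks.shape)

In a full system, such a model would be trained on video with tracked facial landmarks paired with the corresponding audio, and the predicted landmarks would then drive a rendering stage that produces the final face video.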
Watch our presentation at ICASSP 2020. Demos start at 8'35".
Paper: Sefik Emre Eskimez, Ross K. Maddox, Chenliang Xu, and Zhiyao Duan, "End-to-end generation of talking faces from noisy speech," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[1] Barbara Dodd and Ruth Campbell, Eds., "Hearing by eye: The psychology of lip-reading," Lawrence Erlbaum Associates, 1987.
[2] Ross K. Maddox et al., "Auditory selective attention is enhanced by a task-irrelevant temporally coherent visual stimulus in human listeners," eLife, vol. 4, 2015.