Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

[paper link][code link]

This project is supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


What is the problem?

Speech communication does not solely depend on the acoustic signal. Visual cues, when present, also play a vital role. The presence of visual cues improves speech comprehension in noisy environments and for the hard-of-hearing population. Consequently, researchers developed systems that can automatically generate talking faces from speech in order to provide the visual cues when they are not available. These systems can increase the accessibility of abundantly available audio-only resources for the hearing impaired population. They can also find wide applications in entertainment, education, and healthcare.

During speech communication, emotion has a direct impact on the transmitted message and can change the meaning drastically [1]. Studies have shown that predicting emotions purely from speech audio is quite difficult for untrained people [2] and that we heavily rely on visual cues in emotion interpretation [3]. Therefore, to make the visual rendering more realistic and to improve speech communication, it is important for automatic talking face generation systems to render visual emotion expressions.

What is our approach?

Instead of inferring emotion from the input speech, in this work we propose to use emotion as an explicit condition input to our system. The motivation is to decouple the speech and emotion conditions, which allows us to manipulate the expressed emotion during face video generation. The figure below shows the system overview, which employs the generative adversarial network (GAN) framework. Our generator network architecture builds on our previous work [4], modified to accept the emotion condition input. We use two discriminators: one classifies the emotion expressed in a video, and the other distinguishes real from generated video frames.
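To illustrate the conditioning idea, the sketch below shows one common way to inject a categorical emotion label into a speech-driven generator: encode the label as a one-hot vector and concatenate it with the per-frame audio features before they enter the network. This is a minimal, hypothetical sketch in numpy (the helper names, the 80-dim feature size, and the concatenation strategy are illustrative assumptions, not the exact implementation from the paper).

```python
import numpy as np

# The six CREMA-D emotion categories used in this project.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness"]

def emotion_onehot(label: str) -> np.ndarray:
    """One-hot encode an emotion label (hypothetical helper)."""
    vec = np.zeros(len(EMOTIONS), dtype=np.float32)
    vec[EMOTIONS.index(label)] = 1.0
    return vec

def condition_audio_features(audio_feats: np.ndarray, emotion_label: str) -> np.ndarray:
    """Tile the emotion code along the time axis and concatenate it with the
    per-frame audio features, so every frame the generator sees carries both
    the speech content and the desired emotion."""
    onehot = emotion_onehot(emotion_label)
    tiled = np.tile(onehot, (audio_feats.shape[0], 1))  # (T, 6)
    return np.concatenate([audio_feats, tiled], axis=1)  # (T, D + 6)

# Example: 100 frames of 80-dim audio features, conditioned on "happiness".
feats = np.random.randn(100, 80).astype(np.float32)
conditioned = condition_audio_features(feats, "happiness")
print(conditioned.shape)  # (100, 86)
```

Because the emotion code is independent of the audio, swapping the label at inference time changes the rendered facial expression without touching the speech input, which is exactly the mismatched audio/video emotion setting shown in the results below.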

Our Results

The following examples are generated from CREMA-D test samples. The image and speech inputs to the network are unseen during training.

Generated videos conditioned on the speech emotion label (i.e., generated visual emotion matches that of speech)

      anger               disgust               fear             happiness             neutral               sadness

Generated videos conditioned on an emotion label different from the speech emotion label (i.e., generated visual emotion differs from that of speech)

A (audio emotion), V (video emotion)

  A: anger, V: disgust       A: anger, V: fear     A: anger, V: happiness     A: anger, V: neutral     A: anger, V: sadness

  A: disgust, V: anger     A: disgust, V: fear     A: disgust, V: happiness     A: disgust, V: neutral     A: disgust, V: sadness

  A: fear, V: anger       A: fear, V: disgust     A: fear, V: happiness       A: fear, V: neutral       A: fear, V: sadness

  A: happiness, V: anger   A: happiness, V: disgust   A: happiness, V: fear     A: happiness, V: neutral   A: happiness, V: sadness

  A: neutral, V: anger     A: neutral, V: disgust     A: neutral, V: fear     A: neutral, V: happiness   A: neutral, V: sadness

  A: sadness, V: anger     A: sadness, V: disgust     A: sadness, V: fear     A: sadness, V: happiness   A: sadness, V: neutral

Test It Yourself

Our pre-trained talking face model, along with the offline generation and training code, is available here.


Citation: S. E. Eskimez, Y. Zhang, and Z. Duan, "Speech Driven Talking Face Generation From a Single Image and an Emotion Condition," IEEE Transactions on Multimedia, vol. 24, pp. 3480-3490, 2022, doi: 10.1109/TMM.2021.3099900. [paper link]


[1] M. Alpert, R. L. Kurtzberg, and A. J. Friedhoff, "Transient voice changes associated with emotional stimuli," Archives of General Psychiatry, vol. 8, no. 4, pp. 362-365, 1963.

[2] S. E. Eskimez, K. Imade, N. Yang, M. Sturge-Apple, Z. Duan, and W. Heinzelman, "Emotion classification: how does an automated system compare to naive human coders?" in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2274-2278.

[3] A. Esposito, "The perceptual and cognitive role of visual and auditory channels in conveying emotional information," Cognitive Computation, vol. 1, no. 3, pp. 268-278, 2009.

[4] S. E. Eskimez, R. K. Maddox, C. Xu, and Z. Duan, "End-to-end generation of talking faces from noisy speech," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1948-1952.