This project is supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
Speech communication does not rely on the acoustic signal alone; visual cues, when present, also play a vital role. Visual cues improve speech comprehension in noisy environments and for the hard-of-hearing population. Consequently, researchers have developed systems that automatically generate talking faces from speech, providing visual cues when they are not otherwise available. Such systems can make the abundant audio-only resources more accessible to the hearing-impaired population, and they also have wide applications in entertainment, education, and healthcare.
During speech communication, emotion has a direct impact on the transmitted message and can change the meaning drastically [1]. Studies have shown that predicting emotions purely from speech audio is quite difficult for untrained people [2] and that we heavily rely on visual cues in emotion interpretation [3]. Therefore, to make the visual rendering more realistic and to improve speech communication, it is important for automatic talking face generation systems to render visual emotion expressions.
Instead of inferring emotion from the input speech, in this work we use emotion as a conditioning input to the system. The motivation is to decouple the speech and emotion conditions, which allows emotions to be manipulated during face video generation. The figure below shows the system overview, which employs the generative adversarial network (GAN) framework. The generator architecture builds on our previous work [4], modified to accept the emotion condition. We use two discriminators: one to classify the emotion expressed in a video, and another to distinguish real from generated video frames.
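To make this setup concrete, below is a minimal PyTorch sketch of an emotion-conditioned GAN with two discriminators. It is an illustration under assumed shapes and simple fully connected modules, not the actual architecture of [4] or of the published system; the tensor sizes, module names, and unweighted loss terms are all assumptions.

```python
# Illustrative sketch only: toy emotion-conditioned GAN with a frame (real/fake)
# discriminator and an emotion discriminator. Shapes and modules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 6  # anger, disgust, fear, happiness, neutral, sadness


class Generator(nn.Module):
    """Maps (identity image, speech features, emotion label) to a short face video."""
    def __init__(self, audio_dim=128, frames=16):
        super().__init__()
        self.frames = frames
        in_dim = 3 * 64 * 64 + audio_dim + NUM_EMOTIONS
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),
            nn.Linear(512, frames * 3 * 64 * 64),
        )

    def forward(self, img, audio, emotion):
        # Concatenate the flattened image, speech features, and one-hot emotion code.
        emo = F.one_hot(emotion, NUM_EMOTIONS).float()  # emotion: LongTensor of class ids
        x = torch.cat([img.flatten(1), audio, emo], dim=1)
        return self.net(x).view(-1, self.frames, 3, 64, 64)


class FrameDiscriminator(nn.Module):
    """Scores frames as real or generated, averaged over the clip."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, video):  # video: (batch, frames, 3, 64, 64)
        return self.net(video.flatten(2)).mean(dim=1)


class EmotionDiscriminator(nn.Module):
    """Classifies the emotion expressed in a video clip."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, NUM_EMOTIONS))

    def forward(self, video):
        return self.net(video.flatten(2)).mean(dim=1)  # average per-frame logits


def training_step(G, D_frame, D_emo, opt_g, opt_d, img, audio, real_video, emotion):
    # --- Update both discriminators (opt_d holds the parameters of D_frame and D_emo) ---
    fake_video = G(img, audio, emotion).detach()
    d_real, d_fake = D_frame(real_video), D_frame(fake_video)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
              + F.cross_entropy(D_emo(real_video), emotion))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Update the generator: fool the frame discriminator, express the target
    #     emotion, and stay close to the ground-truth frames ---
    fake_video = G(img, audio, emotion)
    d_out = D_frame(fake_video)
    loss_g = (F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
              + F.cross_entropy(D_emo(fake_video), emotion)
              + F.l1_loss(fake_video, real_video))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Because the emotion label enters the generator as a separate condition rather than being inferred from the speech, it can be set independently of the audio at generation time; this is what produces the audio/video emotion mismatches shown in the examples below.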
The following examples are generated from CREMA-D test samples; the image and speech inputs were not seen during training.
[Video gallery] Example videos for the six emotion categories (anger, disgust, fear, happiness, neutral, sadness), followed by mismatched combinations labeled A (audio emotion) and V (video emotion), covering every ordered pair of distinct emotions (e.g., A: anger, V: disgust).
Publication: S. E. Eskimez, Y. Zhang, and Z. Duan, "Speech Driven Talking Face Generation From a Single Image and an Emotion Condition," IEEE Transactions on Multimedia, vol. 24, pp. 3480-3490, 2022, doi: 10.1109/TMM.2021.3099900. [paper link]
[1] M. Alpert, R. L. Kurtzberg, and A. J. Friedhoff, "Transient voice changes associated with emotional stimuli," Archives of General Psychiatry, vol. 8, no. 4, pp. 362-365, 1963.
[2] S. E. Eskimez, K. Imade, N. Yang, M. Sturge-Apple, Z. Duan, and W. Heinzelman, "Emotion classification: how does an automated system compare to naive human coders?" in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2274-2278.
[3] A. Esposito, "The perceptual and cognitive role of visual and auditory channels in conveying emotional information," Cognitive Computation, vol. 1, no. 3, pp. 268-278, 2009.
[4] S. E. Eskimez, R. K. Maddox, C. Xu, and Z. Duan, "End-to-end generation of talking faces from noisy speech," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1948-1952.