Audio-Visual Speech Source Separation

This project is partially supported by the US National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding", the National Natural Science Foundation of China under Grant No. 61473167, 61751308 and 61876095, and the German Research Foundation (DFG) in Project Crossmodal Learning DFG TRR-169.
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of these funding agencies.

What is the problem?

Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers. Audio-only approaches perform unsatisfactorily when the speakers are of the same gender or share similar voice characteristics. This is due to the difficulty of learning feature representations that both separate voices within single frames and stream voices across time. Visual signals of speech (e.g., lip movements), if available, can be leveraged to learn better feature representations for separation.

In this project, we propose two novel models to solve the source permutation problem in state-of-the-art speaker-independent speech separation methods. The first proposed model is an Audio-Visual Matching network (AV-Match), which learns the correspondence between voice fluctuations and lip movements. The proposed matching-based audio-visual network can be combined with any audio-only speech separation method to improve the separation quality. However, the separation ability of the matching-based model is limited by the performance of the audio-only method. We thus further develop an end-to-end Audio-Visual Deep Clustering model (AVDC) that integrates visual information into the process of learning better feature representations (embeddings) for Time-Frequency (T-F) bin clustering. This fusion-based model overcomes the limitations of the matching-based one and improves the separation quality by a large margin.

The Audio-Visual Matching Approach

The proposed audio-visual matching model integrates both motion (optical flow) and appearance (gray image) information of the lip region to correct the spectrogram masks predicted by the audio-only separation model [1] for better separation. As shown in the figure below, the permutation problem primarily exists in the first quarter of the time frames of the masks predicted by the audio-only model, and the proposed audio-visual matching network can correct this problem by assigning the predicted masks to the correct speakers.

Fig. 1. The proposed audio-visual matching assisted speech separation framework.

Audio and visual streams are encoded as frame-wise embeddings, and we compute inner products of temporally aligned audio and visual embeddings as the similarity measure. Every five audio frames correspond to one video frame. With these similarities, we can assign the separated sources to the correct speakers, thus relieving the permutation problem.
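As a rough illustration of the matching step, the sketch below scores each (speaker, separated source) pair by summing inner products of temporally aligned embeddings, then picks the assignment with the highest total score. The function names, the array shapes, the repetition of each video frame five times for alignment, and the utterance-level summation are assumptions made for illustration; the AV-Match network itself may align and aggregate the similarities differently.

```python
from itertools import permutations

import numpy as np

AUDIO_FRAMES_PER_VIDEO_FRAME = 5  # ratio stated in the text


def match_score(audio_emb, visual_emb):
    """Sum of inner products between temporally aligned embeddings.

    audio_emb:  (T_audio, D) frame-wise embedding of one separated source
    visual_emb: (T_video, D) frame-wise embedding of one speaker's lip region
    (shapes are assumptions for this sketch)
    """
    # Assumption: align the streams by repeating each video frame five times.
    visual_up = np.repeat(visual_emb, AUDIO_FRAMES_PER_VIDEO_FRAME, axis=0)
    n = min(len(audio_emb), len(visual_up))
    return float(np.sum(audio_emb[:n] * visual_up[:n]))


def assign_sources(audio_embs, visual_embs):
    """Return a tuple perm where source perm[k] is assigned to speaker k."""
    num = len(visual_embs)
    # scores[k, s]: similarity between speaker k and separated source s
    scores = np.array(
        [[match_score(a, v) for a in audio_embs] for v in visual_embs]
    )
    # Exhaustive search is fine for the 2- and 3-speaker cases discussed here.
    return max(
        permutations(range(num)),
        key=lambda p: sum(scores[k, p[k]] for k in range(num)),
    )
```

For example, if the audio-only model outputs the two sources in swapped order, `assign_sources` returns `(1, 0)`, meaning that source 1 belongs to speaker 0 and source 0 belongs to speaker 1.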

Fig. 2. The proposed Audio-Visual Matching network (AV-Match).

The matching-based method has the following drawbacks:

The Audio-Visual Fusion Approach

As shown in the following figure, the Audio-Visual Deep Clustering (AVDC) model receives inputs similar to those of the matching-based model and directly predicts T-F masks for the speakers. The predicted masks are used to reconstruct the source signals from the speech mixture’s magnitude and phase spectrograms. The contributions of the AVDC model are threefold:

Fig. 3. The proposed audio-visual deep clustering speech separation framework.
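The mask-based reconstruction mentioned above can be sketched as follows: each predicted T-F mask scales the mixture magnitude, and the mixture phase is reused for every source. The function name and array shapes are assumptions for illustration, and the final inverse STFT that turns each complex spectrogram back into a waveform is left out.

```python
import numpy as np


def apply_masks(mixture_stft, masks):
    """Estimate per-source complex spectrograms from T-F masks.

    mixture_stft: complex array of shape (F, T), the STFT of the mixture
    masks:        real array of shape (S, F, T), one mask per speaker
    (shapes are assumptions for this sketch)
    """
    magnitude = np.abs(mixture_stft)
    phase = np.angle(mixture_stft)
    # Each source gets a masked copy of the mixture magnitude
    # combined with the unmodified mixture phase.
    return masks * magnitude * np.exp(1j * phase)
```

When the masks sum to one in every T-F bin, as binary masks from clustering do, the estimated source spectrograms add up to the mixture spectrogram.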

The core part of this approach is the Audio-Visual Deep Clustering (AVDC) model, which is illustrated below. It adopts a two-stage fusion strategy to integrate the audio and visual modalities. The first-stage fusion computes speaker-wise audio-visual T-F embeddings for each speaker in the mixture, while the second-stage fusion concatenates these audio-visual embeddings with the audio-only embedding computed using an audio-only Deep Clustering (DC) method for the final clustering of T-F bins.

Fig. 4. The proposed Audio-Visual Deep Clustering network (AVDC).
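To make the second-stage fusion more concrete, the sketch below concatenates the speaker-wise audio-visual embeddings with the audio-only embedding along the feature axis and clusters the resulting T-F vectors into binary masks. It assumes plain k-means on the concatenated vectors, as in standard deep clustering [1], with a simple farthest-point initialization; the names, shapes, and the absence of any learned layers between concatenation and clustering are assumptions rather than a description of the actual AVDC network.

```python
import numpy as np


def cluster_tf_bins(av_embs, audio_emb, num_speakers, num_iters=20):
    """Cluster fused T-F embeddings into one binary mask per speaker.

    av_embs:   list of (F, T, D_av) arrays, one per speaker (hypothetical shapes)
    audio_emb: (F, T, D_a) audio-only embedding (hypothetical shape)
    """
    # Second-stage fusion: concatenate all embeddings for each T-F bin.
    fused = np.concatenate(list(av_embs) + [audio_emb], axis=-1)
    num_freq, num_time, dim = fused.shape
    x = fused.reshape(num_freq * num_time, dim).astype(float)

    # Deterministic farthest-point initialization of the cluster centers.
    centers = [x[0]]
    for _ in range(1, num_speakers):
        dists = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(dists)])
    centers = np.stack(centers)

    def assign(points, ctrs):
        # Index of the nearest center for every T-F bin.
        d = ((points[:, None, :] - ctrs[None, :, :]) ** 2).sum(axis=-1)
        return d.argmin(axis=1)

    # Standard k-means iterations: assign bins, then update centers.
    for _ in range(num_iters):
        labels = assign(x, centers)
        for k in range(num_speakers):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    labels = assign(x, centers)

    masks = [(labels == k).reshape(num_freq, num_time) for k in range(num_speakers)]
    return np.stack(masks).astype(float)
```

The returned array has shape (num_speakers, F, T), and exactly one mask is active in each T-F bin.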

Experiments on 2-speaker mixtures

Comparison of the separation results

Fig. 5. Comparison of the separation results (MEAN+/-STD) on 2-speaker mixtures in the GRID dataset. Methods with a superscript * show results when the optimal source permutation is used. Methods with a subscript "SPK3" show results with a model trained on 3-speaker mixtures but evaluated on 2-speaker mixtures.
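For readers unfamiliar with the "optimal source permutation" mentioned in the caption, the sketch below shows one way such an oracle assignment could be computed: try every ordering of the estimated sources and keep the one with the best average score against the ground-truth signals. The function names are hypothetical, a plain signal-to-noise ratio stands in for the SDR metric, and the search is done over whole utterances; the evaluation behind the reported numbers may differ in all three respects.

```python
from itertools import permutations

import numpy as np


def snr_db(reference, estimate, eps=1e-12):
    """Plain SNR in dB; a simplified stand-in for SDR (assumption)."""
    noise = reference - estimate
    return 10.0 * np.log10(
        (np.sum(reference ** 2) + eps) / (np.sum(noise ** 2) + eps)
    )


def best_permutation(references, estimates):
    """Return (perm, score) where estimates[perm[k]] is matched to references[k]."""
    n = len(references)

    def mean_score(perm):
        return np.mean(
            [snr_db(references[k], estimates[perm[k]]) for k in range(n)]
        )

    # Oracle search: uses the ground-truth signals, so it is only an
    # upper bound for evaluation and not something available at test time.
    best = max(permutations(range(n)), key=mean_score)
    return best, float(mean_score(best))
```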

Based on the results in Fig. 5, we draw the following conclusions:

Separation Demos

Original mixture and ground-truth:

Audio-based deep clustering method [1]:

AV-Match method:

AVDC method:

Experiments on 3-speaker mixtures

Comparison of the separation results

Fig. 6. Comparison of the separation results (MEAN+/-STD) on 3-speaker mixtures in the GRID dataset. Methods with a superscript * show results when the optimal source permutation is used. Methods with a subscript "SPK2" show results with a model trained on 2-speaker mixtures but evaluated on 3-speaker mixtures.

Based on the results in Fig. 6, we draw the following conclusions:

Separation Demos

Original mixture and ground-truth:

Audio-based deep clustering method [1]:

AVDC method:

Ablation Study

Fig. 7. Ablation study of the proposed AVDC model on 2-speaker mixtures (left) and 3-speaker mixtures (right) in both the GRID and TCD-TIMIT datasets. Boxplots of ∆SDR with 5-fold cross validation are shown on different types of speech mixtures for DC (AVDC without the AV-branch in second-stage fusion), AVDC-WOA (AVDC without the audio-branch in second-stage fusion), and AVDC.

For the ablation study, we draw the following conclusions:

Audio-visual Embedding Visualization

Fig. 8. Visualization of audio-visual embeddings of a 3-speaker (3-female) mixture in the test set of the GRID dataset. Each of the first three subfigures shows the PCA in two dimensions of a speaker’s audio-visual embedding vectors of all T-F bins. The target speaker is separated from the other two speakers that are still mixed. The last subfigure shows the PCA in two dimensions of the combined (concatenated) 3-speaker embedding vectors of all T-F bins. All of the speakers are separated.
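A two-dimensional PCA projection of the kind shown in this figure can be computed as in the sketch below, where each row is the embedding vector of one T-F bin. The function name and input layout are assumptions for illustration, and the exact preprocessing used to produce the figure is not specified here.

```python
import numpy as np


def pca_2d(embeddings):
    """Project (N, D) embedding vectors onto their first two principal axes."""
    # Center the data so the principal axes describe variance, not the mean.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

Plotting the two returned columns against each other, colored by the dominant speaker of each T-F bin, gives a scatter plot comparable to the subfigures described above.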


Rui Lu, Zhiyao Duan, and Changshui Zhang, Listen and look: audio-visual matching assisted speech source separation, IEEE Signal Processing Letters, vol. 25, no. 9, pp. 1315-1319, 2018.

Rui Lu, Zhiyao Duan, and Changshui Zhang, Audio–Visual Deep Clustering for Speech Separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1697-1712, 2019.


[1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 41st International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

[2] M. Kolbaek, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, pp. 1901–1913, 2017.

[3] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” in Proc. ACM SIGGRAPH, 2018.

[4] A. Gabbay, A. Shamir, and S. Peleg, “Visual speech enhancement,” in Proc. Interspeech, 2018, pp. 1170–1174.