Audiovisual Singing Voice Separation

Bochen Li, Yuxuan Wang, and Zhiyao Duan

This project is a collaboration with the ByteDance AI Lab and is partially supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Publications

Bochen Li, Yuxuan Wang, and Zhiyao Duan, "Audiovisual Singing Voice Separation," Transactions of the International Society for Music Information Retrieval, 4(1), pp. 195–209, 2021. DOI: http://doi.org/10.5334/tismir.108 [pdf]

Background / Motivation

Method

Model structure
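
The model structure is described in full in the TISMIR paper above. Purely as an illustration of the general idea, and not the published architecture, below is a minimal sketch of an audiovisual mask-estimation network in PyTorch: an audio encoder/decoder operating on magnitude spectrograms, with per-frame embeddings of mouth-region crops tiled over frequency and fused at the bottleneck. All layer choices and dimensions here are assumptions.

import torch
import torch.nn as nn


class AudiovisualSeparator(nn.Module):
    """Illustrative audiovisual mask estimator (not the published model)."""

    def __init__(self, visual_dim=128):
        super().__init__()
        # Audio encoder: downsample the magnitude spectrogram in frequency.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=(2, 1), padding=1), nn.ReLU(),
        )
        # Visual encoder: one embedding per video frame (mouth-region crop).
        self.visual_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, visual_dim),
        )
        # Decoder: fuse both streams and predict a soft vocal mask.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64 + visual_dim, 32, 3, stride=(2, 1),
                               padding=1, output_padding=(1, 0)), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=(2, 1),
                               padding=1, output_padding=(1, 0)), nn.Sigmoid(),
        )

    def forward(self, spec, frames):
        # spec:   (batch, 1, freq, time) mixture magnitude spectrogram;
        #         freq must be divisible by 4 for this particular sketch.
        # frames: (batch, time, 3, H, W) mouth crops, one per spectrogram frame
        a = self.audio_enc(spec)                       # (B, 64, freq/4, T)
        b, t = frames.shape[:2]
        v = self.visual_enc(frames.flatten(0, 1))      # (B*T, visual_dim)
        v = v.view(b, t, -1).permute(0, 2, 1)          # (B, visual_dim, T)
        # Tile the visual embedding over frequency and concatenate channels.
        v = v.unsqueeze(2).expand(-1, -1, a.shape[2], -1)
        mask = self.decoder(torch.cat([a, v], dim=1))  # (B, 1, freq, T)
        return mask * spec                             # masked vocal estimate

A dummy forward pass such as AudiovisualSeparator()(torch.rand(1, 1, 512, 100), torch.rand(1, 100, 3, 64, 64)) returns a masked spectrogram with the same shape as the audio input.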

Results

Demo 1

Vocal separation results on the URSing dataset, which was recorded in a sound booth, a different scenario from the training/validation data. (A sketch of how such comparisons are typically scored follows the audio examples below.)


Original mixture

Ground-truth solo vocal

Result from audio-based method

Result from proposed audiovisual method
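
When a ground-truth vocal is available, comparisons like the one above are commonly scored with BSS Eval metrics such as SDR. As a hedged sketch (the file names are hypothetical, and this page does not state the exact metric used), one way to score a separated vocal against its ground truth with mir_eval:

import numpy as np
import librosa
import mir_eval

SR = 16000
reference, _ = librosa.load("ground_truth_vocal.wav", sr=SR, mono=True)
estimate, _ = librosa.load("separated_vocal.wav", sr=SR, mono=True)

# Trim to a common length, then evaluate the estimate against the reference.
n = min(len(reference), len(estimate))
sdr, _, _, _ = mir_eval.separation.bss_eval_sources(
    reference[np.newaxis, :n], estimate[np.newaxis, :n])
print(f"SDR: {sdr[0]:.2f} dB")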


Demo 2

Evaluations on a cappella songs downloaded from YouTube. (A sketch of applying a separation model to such in-the-wild mixtures follows the examples below.)

Original mixture

Separated vocal from audio-based method

Separated solo vocal from proposed method
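
For in-the-wild songs like these there is no ground truth, so evaluation is by listening. The sketch below shows how a trained mask-based separator would be applied to such a mixture; it reuses the illustrative AudiovisualSeparator above (untrained, with placeholder visual frames) and a hypothetical mixture.wav, whereas a real pipeline would load trained weights and mouth crops tracked from the singer's video.

import numpy as np
import torch
import librosa
import soundfile as sf

SR, N_FFT, HOP = 16000, 1022, 256  # n_fft chosen to give 512 frequency bins

mix, _ = librosa.load("mixture.wav", sr=SR, mono=True)  # hypothetical input
stft = librosa.stft(mix, n_fft=N_FFT, hop_length=HOP)
mag, phase = np.abs(stft), np.exp(1j * np.angle(stft))

spec = torch.from_numpy(mag).float()[None, None]    # (1, 1, 512, T)
frames = torch.zeros(1, spec.shape[-1], 3, 64, 64)  # placeholder mouth crops
model = AudiovisualSeparator().eval()               # weights untrained here
with torch.no_grad():
    vocal_mag = model(spec, frames)[0, 0].numpy()

# Reuse the mixture phase to invert the masked magnitude to a waveform.
vocal = librosa.istft(vocal_mag * phase, hop_length=HOP, length=len(mix))
sf.write("vocal_estimate.wav", vocal, SR)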


Demo 3

Evaluations on randomly mixed samples, the same scenario as the training/validation data. (A sketch of constructing such mixtures follows the examples below.)

Original mixture

Ground-truth solo vocal

Result from audio-based method

Result from proposed audiovisual method
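
A hedged sketch of how randomly mixed samples like these can be constructed: pair a solo vocal with an unrelated accompaniment at a random gain, keeping the scaled vocal as the paired ground truth. The file names are hypothetical, and this is one common recipe rather than necessarily the paper's exact procedure.

import numpy as np
import librosa
import soundfile as sf

SR = 16000
vocal, _ = librosa.load("solo_vocal.wav", sr=SR, mono=True)
accomp, _ = librosa.load("accompaniment.wav", sr=SR, mono=True)

# Crop to a shared length, then mix the vocal in at a random gain.
n = min(len(vocal), len(accomp))
gain = np.random.uniform(0.5, 1.0)
mixture = gain * vocal[:n] + accomp[:n]

# Scale mixture and target by the same factor so the pair stays consistent.
peak = max(1.0, float(np.abs(mixture).max()))
sf.write("random_mixture.wav", mixture / peak, SR)
sf.write("target_vocal.wav", gain * vocal[:n] / peak, SR)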