This project is supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
Music performance is a multi-modal art form. For thousands of years, people have enjoyed music performances at live concerts through both hearing and sight. The visual modality is a natural part of this experience, and when available, it can be very helpful for solving many MIR tasks that are challenging with an audio-only approach. Audio-visual analysis of music performance bridges cutting-edge techniques from computer vision and computer audition.
Despite increased recent interest, progress in jointly using the audio and visual modalities for the analysis of music performances has been rather slow. One of the main reasons, we argue, is the lack of datasets. We created the University of Rochester Multi-modal Music Performance (URMP) dataset, which covers 44 classical chamber music pieces ranging from duets to quintets. More details about the creation process, dataset features, and download instructions are available at the above link.
We also created an audio-visual singing dataset named URSing. It comprises audio and video recordings of a number of singing performances. Each song contains an isolated track of the solo singing voice and its mixture with the accompaniment track. We anticipate that the dataset will be useful for multi-modal analysis of singing performances, such as audio-visual singing voice separation, and will serve as ground truth for evaluations. More details about the creation process, dataset features, and download instructions are available at the above link.
The problem is defined on chamber music performances featuring Western instruments in ensemble settings, where each player performs one part and constitutes one sound source. Given the performance video and the corresponding score/audio tracks, the system identifies the association between players in the video and the individual tracks.
In audio-visual recordings of music performances, visual cues from instrument players exhibit good temporal correspondence with the audio signals and the music content. These correspondences provide useful information for estimating source associations, which enables novel research and applications and is essential for leveraging visual information to analyze individual sound sources in music performances. Potential applications include:
The system models three typical types of correspondences: 1) between body motions (e.g., bowing for string instruments and sliding for trombone) and note onsets, 2) between finger motions (e.g., fingering for most woodwind/brass instruments) and note onsets, and 3) between vibrato hand motions (e.g., rolling of the fingering hand for string instruments) and pitch fluctuations.
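As an illustration of the first correspondence type, the minimal sketch below (not the exact model from our papers) scores how well a per-frame motion signal for one player matches the note onsets of one track; the inputs motion_energy and onset_times are hypothetical and assumed to come from motion analysis and from the score, respectively.

```python
# Hedged sketch: correlate a player's motion activity with a track's note onsets.
# Both inputs are hypothetical; the actual system uses more elaborate models.
import numpy as np

def onset_motion_score(motion_energy, onset_times, fps=30.0):
    """motion_energy : per-video-frame motion magnitude of one player
    onset_times   : note onset times (seconds) of one score/audio track"""
    n = len(motion_energy)
    # Impulse train marking the video frames that contain a note onset.
    onsets = np.zeros(n)
    idx = np.clip(np.round(np.asarray(onset_times) * fps).astype(int), 0, n - 1)
    onsets[idx] = 1.0
    # Motion *changes* tend to co-occur with onsets (e.g., bow-direction changes).
    motion_change = np.abs(np.diff(motion_energy, prepend=motion_energy[0]))
    # Normalized correlation serves as the player-track association score.
    mc = (motion_change - motion_change.mean()) / (motion_change.std() + 1e-8)
    on = (onsets - onsets.mean()) / (onsets.std() + 1e-8)
    return float(np.mean(mc * on))
```

Scores of this kind, computed for every player-track pair, feed the assignment step evaluated below.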
Details of the three modules are illustrated below:
We calculate the association accuracy on the testing excerpts (~17K excerpts of ensembles with up to 5 players performing simultaneously, i.e., quintets, with durations varying from 5 sec to 30 sec). A testing excerpt is counted as correct only when all of its tracks are correctly associated. For a quintet excerpt, for example, a random guess has a 1/120 (= 1/5!) chance of producing a correct association.
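A hedged sketch of this evaluation protocol (function and variable names are illustrative): pairwise association scores for an excerpt are assumed to be collected in a players-by-tracks matrix, the best permutation is found exhaustively (feasible for at most 5 players, i.e., 5! = 120 permutations), and an excerpt counts as correct only if the entire permutation matches the ground truth.

```python
# Illustrative evaluation sketch; score matrices and ground truth are assumed given.
import itertools
import numpy as np

def best_assignment(score_matrix):
    """Exhaustive search over player-to-track permutations (at most 5! = 120)."""
    n = score_matrix.shape[0]
    best_perm, best_val = None, -np.inf
    for perm in itertools.permutations(range(n)):
        # perm[t] is the player assigned to track t.
        val = sum(score_matrix[p, t] for t, p in enumerate(perm))
        if val > best_val:
            best_perm, best_val = perm, val
    return best_perm

def association_accuracy(score_matrices, true_perms):
    """Fraction of excerpts whose full assignment matches the ground truth."""
    correct = sum(best_assignment(S) == tuple(gt)
                  for S, gt in zip(score_matrices, true_perms))
    return correct / len(score_matrices)
```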
The numerical evaluation is shown in the following figures:
Bochen Li, Karthik Dinesh, Zhiyao Duan and Gaurav Sharma, See and listen: score-informed association of sound tracks to players in chamber music performance videos, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2906-2910. <pdf> <slides>
Bochen Li, Chenliang Xu, and Zhiyao Duan, Audio-visual source association for string ensembles through multi-modal vibrato analysis, in Proc. the 14th Sound and Music Computing Conference (SMC), 2017, pp. 159-166. (best paper award) <pdf> <slides>
Bochen Li, Karthik Dinesh, Chenliang Xu, Gaurav Sharma, and Zhiyao Duan, Online audio-visual source association for chamber music performances, Transactions of the International Society for Music Information Retrieval, vol. 2, no. 1, pp. 29–42, 2019. <pdf>
Multi-pitch analysis (MPA) of polyphonic music is important in many music information retrieval (MIR) tasks, including automatic music transcription, music source separation, and audio-score alignment. It can be performed at different levels: Multi-pitch Estimation (MPE) estimates the concurrent pitches and the number of pitches (polyphony) in each time frame; Multi-pitch Streaming (MPS) goes one step further and also assigns the pitch estimates to different sound sources. This is challenging for approaches based on audio alone due to the polyphonic nature of the audio signals. Video of the performance, when available, can help alleviate some of these difficulties.
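For illustration only, the snippet below contrasts the two output levels with made-up data (MIDI pitch numbers and source names are hypothetical):

```python
# MPE: per-frame sets of concurrent pitches, with no source identity.
mpe_output = [
    {60, 64, 67},   # frame 0: C4, E4, G4 sounding together
    {60, 64, 67},   # frame 1
    {62, 65, 69},   # frame 2: the chord changes
]

# MPS: the same pitches streamed into per-source trajectories
# (None would mark frames in which a source is silent).
mps_output = {
    "violin_1": [67, 67, 69],
    "violin_2": [64, 64, 65],
    "cello":    [60, 60, 62],
}
```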
We propose to detect play/non-play (P/NP) activities from musical performance videos using optical flow analysis to help with audio-based multi-pitch analysis. Specifically, the detected P/NP activity provides a more accurate estimate of the instantaneous polyphony (i.e., the number of pitches at a time instant), and also helps assign pitch estimates only to active sound sources. This is illustrated in the following figure:
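A minimal sketch, with assumed inputs, of the first idea: the number of visually active players in a frame caps the instantaneous polyphony, so only the strongest audio pitch candidates are kept.

```python
# Hedged sketch: use per-frame P/NP labels to bound the polyphony of an
# audio-based multi-pitch estimator. All inputs are hypothetical.
import numpy as np

def constrain_polyphony(pitch_candidates, saliences, pnp_frame):
    """pitch_candidates : candidate pitches for one audio frame
    saliences        : their salience values from the audio model
    pnp_frame        : boolean play/non-play label per player for this frame"""
    polyphony = int(np.sum(pnp_frame))      # visually estimated number of pitches
    order = np.argsort(saliences)[::-1]     # strongest candidates first
    return [pitch_candidates[i] for i in order[:polyphony]]
```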
The P/NP activity labels for each player are detected by optical flow estimation and supervised classification:
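The sketch below illustrates this step under assumptions of our own (the feature design and classifier choice are illustrative rather than the paper's exact pipeline): dense optical flow is computed inside each player's region, summarized into per-frame statistics, and classified into play/non-play with a supervised classifier.

```python
# Hedged sketch of per-player P/NP detection from dense optical flow.
import cv2
import numpy as np
from sklearn.svm import SVC

def flow_features(gray_frames):
    """Summarize optical-flow magnitude inside one player's cropped region.
    gray_frames : list of consecutive grayscale crops for that player."""
    feats = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)          # per-pixel motion magnitude
        feats.append([mag.mean(), mag.std(), mag.max()])
    return np.asarray(feats)

# Supervised classification of each frame's feature vector into play / non-play.
# train_feats and train_labels are assumed to come from annotated training videos.
# clf = SVC(kernel="rbf").fit(train_feats, train_labels)
# pnp_labels = clf.predict(flow_features(test_gray_frames))
```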
As a first attempt toward audio-visual multi-pitch analysis of multi-instrument music performances, we demonstrate the concept on 11 string ensembles including duets, trios, quartets, and quintets. The evaluation of P/NP activity detection is performed on each individual player. The evaluations of MPE and MPS are performed on a set expanded from the original 11 ensembles, e.g., from a quintet we can further create 10 duets, 10 trios, and 5 quartets.
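The size of this expanded set follows from simple combinatorics, as the snippet below verifies with hypothetical player labels:

```python
# From one quintet's five isolated tracks, every subset of 2-4 players can be mixed.
from itertools import combinations
from math import comb

players = ["vn1", "vn2", "va", "vc", "db"]     # hypothetical quintet labels
duets    = list(combinations(players, 2))      # C(5, 2) = 10
trios    = list(combinations(players, 3))      # C(5, 3) = 10
quartets = list(combinations(players, 4))      # C(5, 4) = 5
assert (len(duets), len(trios), len(quartets)) == (comb(5, 2), comb(5, 3), comb(5, 4))
```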
Karthik Dinesh*, Bochen Li*, Xinzhao Liu, Zhiyao Duan and Gaurav Sharma, Visually informed multi-pitch analysis of string ensembles, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 3021-3025. (* equal contribution) <pdf> <slides>
Vibrato is an expressive performance technique in which the pitch of a note fluctuates periodically around its nominal value; on string instruments it is produced by rolling the fingering (left) hand on the fingerboard.
Previous literature on vibrato analysis has focused on monophonic audio, and there is no existing audio-based approach for vibrato detection and analysis of multiple simultaneous sources in a polyphonic music mixture. For some instruments, such as strings, vibrato is often visible from the left-hand motion, and this visual information does not degrade as the audio information does when polyphony increases. This motivates our proposed approach of vibrato detection and analysis through video-based analysis of the fine motion of the left hand. The following figure compares pitch contours extracted from polyphonic audio and from visual motions.
System overview of the proposed video-based vibrato detection and analysis framework
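As a rough illustration of the detection idea (a sketch under our own assumptions, not the exact method of the paper), one can test whether a tracked left-hand trajectory contains a dominant oscillation in a typical vibrato-rate band of roughly 4-9 Hz:

```python
# Hedged sketch: frequency-domain test for vibrato in a left-hand motion signal.
import numpy as np

def detect_vibrato(hand_positions, fps=30.0, band=(4.0, 9.0), ratio_thresh=0.4):
    """hand_positions : per-frame left-hand displacement along the fingerboard
    direction for one player (hypothetical tracking output)."""
    x = np.asarray(hand_positions, dtype=float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_energy = spectrum[in_band].sum()
    total_energy = spectrum[1:].sum() + 1e-12   # ignore the DC component
    return band_energy / total_energy > ratio_thresh
```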
Bochen Li, Karthik Dinesh, Gaurav Sharma, and Zhiyao Duan, Video-based vibrato detection and analysis for polyphonic string music, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 123-130. (best paper nomination) <pdf> <slides>