Audio-visual Analysis of Music Performance

This project is supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

 

Table of Contents:

  • Background
  • Dataset Creation
  • Source Association
  • Visually-informed Multi-pitch Analysis
  • Visually-informed Vibrato Analysis

Background

Music performance is a multi-modal art form. For thousands of years, people have enjoyed music performances at live concerts through both hearing and sight. The visual modality is a natural part of this experience, and when available it can be very helpful for solving many MIR tasks that are challenging with an audio-only approach. Audio-visual analysis of music performance sits at the intersection of computer vision and computer audition, bridging cutting-edge techniques from both fields.


Dataset Creation

Despite increased recent interest, progress in jointly using the audio and visual modalities for the analysis of music performances has been rather slow. One of the main reasons, we argue, is the lack of datasets. We created the University of Rochester Multi-modal Music Performance (URMP) dataset, which covers 44 classical chamber music pieces ranging from duets to quintets. More details about the creation process, dataset features, and the download entry are available on the dataset page.


We also created an audio-visual singing dataset named URSing. It comprises a number of singing performances as audio and video recordings. Each song contains an isolated track of the solo singing voice and its mixture with the accompaniment track. We anticipate that the dataset will be useful for multi-modal analysis of singing performances, such as audio-visual singing voice separation, and will serve as ground truth for evaluations. More details about the creation process, dataset features, and the download entry are available on the dataset page.


Source Association

What is the Problem?

The problem is defined on chamber music performances featuring Western instruments played as an ensemble, with each instrument producing one sound source. Given the performance video and the corresponding score/audio tracks, the system identifies the association between players in the video and the tracks in the other modalities.

Motivation

In audio-visual recordings of music performances, visual cues from the instrument players exhibit good temporal correspondence with the audio signals and the music content. These correspondences provide useful information for estimating source associations, which is essential for leveraging the visual information to analyze individual sound sources in music performances and enables novel research and applications. Potential applications include:

  • Augmented video streaming service (let users click on a player in the video and isolate/enhance the corresponding source of the audio)
  • Augmented sheet music display interface (on each score track, the visual performance of the corresponding player is displayed)
  • Remixing of audio sources along with automatic video scene recomposition
  • Online video streaming (a camera that automatically pans to and focuses on the soloist)

Method

The system models three typical types of correspondence: 1) between body motions (e.g., bowing for string instruments and sliding for trombone) and note onsets, 2) between finger motions (e.g., fingering for most woodwind/brass instruments) and note onsets, and 3) between vibrato hand motions (e.g., rolling of the fingering hand for string instruments) and pitch fluctuations.



The details of the three modules are illustrated for the following correspondences; a simplified sketch of how the per-module scores can be combined into a final association follows the list:

  • The correspondence between body motions and note onsets

  • The correspondence between finger motions and note onsets

  • The correspondence between hand rolling motion and pitch fluctuations
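
As a concrete illustration (not the exact formulation in the papers), the sketch below assumes that each module has already produced a score matrix S[i, j] measuring how well video track i corresponds to audio/score track j; the modality scores are fused by a weighted sum, and the one-to-one player-to-track assignment is found with the Hungarian algorithm. The fusion weights, matrix values, and function names are hypothetical.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(score_matrices, weights=None):
        """Fuse per-modality correspondence scores and pick the best one-to-one
        assignment of video tracks (rows) to audio/score tracks (columns)."""
        weights = weights or [1.0] * len(score_matrices)
        fused = sum(w * S for w, S in zip(weights, score_matrices))
        rows, cols = linear_sum_assignment(-fused)  # negate: maximize total score
        return {int(r): int(c) for r, c in zip(rows, cols)}

    # Example: 3 players, hypothetical scores from the body-motion and vibrato modules.
    S_body = np.array([[0.9, 0.2, 0.1], [0.3, 0.8, 0.2], [0.1, 0.3, 0.7]])
    S_vibrato = np.array([[0.7, 0.4, 0.2], [0.2, 0.9, 0.3], [0.3, 0.1, 0.8]])
    print(associate([S_body, S_vibrato]))  # {0: 0, 1: 1, 2: 2}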

Results

We calculate the association accuracy to evaluate the association performance on the testing excerpts (~17K excerpts of ensembles with up to 5 players performing simultaneously, i.e., quintets, with durations varying from 5 sec to 30 sec). A testing excerpt is counted as a correct association only when all of its tracks are correctly associated. For a quintet excerpt, for example, a random guess has a 1/120 (= 1/5!) chance of producing a correct association.
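
For reference, a minimal sketch of this exact-match metric and the random-guess baseline (the function names and the representation of an association as a player-to-track mapping are ours, not from the paper):

    from math import factorial

    def exact_match_accuracy(predictions, ground_truths):
        """An excerpt counts as correct only if every track is associated correctly."""
        correct = sum(pred == truth for pred, truth in zip(predictions, ground_truths))
        return correct / len(ground_truths)

    def random_guess_accuracy(num_players):
        """Chance level for exact-match association, e.g., 1/120 for a quintet."""
        return 1.0 / factorial(num_players)

    print(random_guess_accuracy(5))  # 0.00833... = 1/120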

The numerical evaluation results are shown in the following figures:



Publication

Bochen Li, Karthik Dinesh, Zhiyao Duan and Gaurav Sharma, See and listen: score-informed association of sound tracks to players in chamber music performance videos, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2906-2910. <pdf> <slides>

Bochen Li, Chenliang Xu, and Zhiyao Duan, Audio-visual source association for string ensembles through multi-modal vibrato analysis, in Proc. The 14th Sound and Music Computing Conference (SMC), 2017, pp. 159-166. (best paper award) <pdf> <slides>

Bochen Li, Karthik Dinesh, Chenliang Xu, Gaurav Sharma, and Zhiyao Duan, Online audio-visual source association for chamber music performances, Transactions of the International Society for Music Information Retrieval, vol. 2, no. 1, pp. 29–42, 2019. <pdf>


Visually-informed Multi-pitch Analysis

What is the Problem?

Multi-pitch analysis (MPA) of polyphonic music is important for many music information retrieval (MIR) tasks, including automatic music transcription, music source separation, and audio-score alignment. It can be performed at different levels: Multi-pitch Estimation (MPE) estimates the concurrent pitches and the number of pitches (polyphony) in each time frame; Multi-pitch Streaming (MPS) goes one step further and also assigns the pitch estimates to different sound sources. Both are challenging for approaches based on audio alone due to the polyphonic nature of the audio signals. Video of the performance, when available, can help alleviate some of these difficulties.

Method

We propose to detect the play/non-play (P/NP) activities from musical performance videos using optical flow analysis to help with audio-based multi-pitch analysis. Specifically, the detected P/NP activity provides a more accurate estimate of the instantaneous polyphony (i.e., the number of pitches at a time instant), and also helps with assigning pitch estimates only to active sound sources. This is illustrated in the following figure, and a toy sketch of the idea is given after it:

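As a toy illustration of this idea (not the actual MPE/MPS algorithm in the paper), the sketch below caps the per-frame polyphony at the number of players detected as playing and restricts pitch-to-source assignment to the active players; the names and the salience-based candidate selection are our assumptions.

    import numpy as np

    def constrain_with_pnp(salience, pnp):
        """salience: per-frame scores for candidate pitches; pnp: binary play/non-play
        labels for each player in the same frame. Returns the retained pitch
        candidates and the players that pitches may be streamed to."""
        max_polyphony = int(pnp.sum())                       # visual polyphony estimate
        keep = np.argsort(salience)[::-1][:max_polyphony]    # strongest candidates only
        active_players = np.flatnonzero(pnp)                 # eligible sources for MPS
        return keep, active_players

    salience = np.array([0.9, 0.1, 0.6, 0.4])   # hypothetical per-candidate salience
    pnp = np.array([1, 0, 1])                   # players 0 and 2 are playing
    print(constrain_with_pnp(salience, pnp))    # (array([0, 2]), array([0, 2]))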

The P/NP activity labels for each player are detected by optical flow estimation and supervised classification, as shown in the following figure and sketched in simplified form after it:

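A minimal sketch of that pipeline, assuming OpenCV for dense optical flow and scikit-learn for the classifier; the feature (mean flow magnitude inside each player's bounding box) and the SVM choice are our assumptions, not necessarily the exact features used in the paper.

    import cv2
    import numpy as np
    from sklearn.svm import SVC

    def flow_magnitude_features(frames, bbox):
        """Mean dense optical-flow magnitude inside a player's bounding box,
        one value per consecutive frame pair."""
        x, y, w, h = bbox
        feats = []
        prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
        for frame in frames[1:]:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            mag = np.linalg.norm(flow[y:y + h, x:x + w], axis=2)
            feats.append(mag.mean())
            prev = gray
        return np.array(feats)

    # Hypothetical training: windows of flow features labeled as play (1) / non-play (0).
    # clf = SVC(kernel="rbf").fit(X_train, y_train)   # X_train: (num_windows, window_len)
    # pnp_labels = clf.predict(X_test)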

Results

As the first attempt towards audio-visual multi-pitch analysis of multi-instrument musical performances, we demonstrate the concept on 11 string ensembles including duets, trios, quartets, and quintets. The evaluation of the P/NP activity detection is performed on each individual player. The evaluations of MPE and MPS are performed on an expanded set derived from the original 11 ensembles, e.g., from a quintet ensemble we can further create 10 duets, 10 trios, and 5 quartets (i.e., C(5,2), C(5,3), and C(5,4) sub-ensembles, respectively).

  • Experimental results of the P/NP Detection and MPE on the original 11 pieces:
    P1-P5 stand for the (up to) 5 players of an ensemble. MPE accuracy is compared across three systems: the traditional audio-based method (Audio), the visually-informed method using the detected P/NP labels (Video PNP), and the visually-informed method using ground-truth P/NP labels (GT PNP), which sets the upper bound of the proposed strategy given perfect P/NP detection results.

  • Experimental results of the MPE/MPS accuracy on the expanded set:


Publication

Karthik Dinesh*, Bochen Li*, Xinzhao Liu, Zhiyao Duan and Gaurav Sharma, Visually informed multi-pitch analysis of string ensembles, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 3021-3025. (* equal contribution) <pdf> <slides>


Visually-informed Vibrato Analysis

What is the Problem?

Vibrato is:

  • An important artistic effect in music
  • Pitch modulation of a note in a periodic fashion
  • Characterized by vibrato rate and vibrato extent

Previous literature on vibrato analysis focuses almost exclusively on monophonic audio, and there is no existing audio-based approach for vibrato detection and analysis of multiple simultaneous sources in a polyphonic music mixture. For some instruments such as strings, vibrato is often visible from the left-hand motion, and this visual information does not degrade as the audio information does when polyphony increases. This motivates our proposed approach of vibrato detection and analysis through video-based analysis of the fine motion of the left hand. The following figure compares pitch contours extracted from polyphonic audio with those derived from the visual motion.

Method

System overview of the proposed video-based vibrato detection and analysis framework


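To make the rate/extent terminology concrete, here is a minimal sketch (our own simplification, not the paper's algorithm) that decides whether a note contour contains vibrato and estimates the vibrato rate and extent. It works on any roughly periodic frame-wise contour, e.g., a pitch track in cents or the tracked left-hand displacement, and assumes the frame rate is known; the 4-9 Hz band and the peak-dominance threshold are our assumptions.

    import numpy as np

    def analyze_vibrato(contour, frame_rate, rate_range=(4.0, 9.0)):
        """Detect vibrato in a frame-wise contour and estimate its rate (Hz) and
        extent (half the peak-to-peak amplitude, in the contour's own unit)."""
        x = contour - np.mean(contour)                  # remove the note's mean value
        spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
        freqs = np.fft.rfftfreq(len(x), d=1.0 / frame_rate)
        band = (freqs >= rate_range[0]) & (freqs <= rate_range[1])
        peak = band.nonzero()[0][np.argmax(spectrum[band])]
        is_vibrato = spectrum[peak] > 0.5 * spectrum[1:].max()   # in-band peak dominates
        return bool(is_vibrato), float(freqs[peak]), float((x.max() - x.min()) / 2.0)

    # Example: a synthetic 6 Hz vibrato with a +/-50 cent extent at 100 frames/s.
    t = np.arange(0, 1.0, 0.01)
    contour = 6000 + 50 * np.sin(2 * np.pi * 6 * t)
    print(analyze_vibrato(contour, frame_rate=100))     # approx (True, 6.0, 50)
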
Results

  • Overall vibrato detection evaluations for 2 audio-based methods and 2 video-based (proposed) methods.


  • Vibrato detection performance grouped by polyphony number (left) and instrument type (right). The performance of the audio-based methods degrades as the polyphony number increases or the fundamental frequency decreases, while the video-based methods remain stable.


  • Evaluation of the video-based method for estimating vibrato rate and extent on all detected vibrato notes.


Publication

Bochen Li, Karthik Dinesh, Gaurav Sharma, and Zhiyao Duan, Video-based vibrato detection and analysis for polyphonic string music, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2017, 123-130. (best paper nomination) <pdf> <slides>


More updates to come ...