This project is supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
Music performance is a multi-modal art form. For thousands of years, people have enjoyed music performances at live concerts through both hearing and sight. The visual modality is a natural part of this experience, and when available, it can be very helpful for solving many MIR tasks that are challenging with an audio-only approach. Audio-visual analysis of music performance bridges cutting-edge techniques from computer vision and computer audition.
Despite increased recent interest, progress in jointly using the audio and visual modalities for the analysis of music performances has been rather slow. One of the main reasons, we argue, is the lack of datasets. We created the University of Rochester Multi-modal Music Performance (URMP) dataset, which covers 44 classical chamber music pieces ranging from duets to quintets. More details about the creation process, dataset features, and download instructions are available at the above link.
We also created an audio-visual singing dataset named URSing. It comprises audio and video recordings of a number of singing performances. Each song contains an isolated track of the solo singing voice and its mixture with the accompaniment track. We anticipate that the dataset will be useful for multi-modal analysis of singing performances, such as audio-visual singing voice separation, and will serve as ground truth for evaluations. More details about the creation process, dataset features, and download instructions are available at the above link.
The problem is defined on chamber music performances featuring Western instruments in ensemble settings, where each player performs one part and constitutes one sound source. Given the performance video and the corresponding score/audio tracks, the system identifies the association between players in the video and the individual tracks.
In audio-visual recordings of music performances, visual cues from instrument players exhibit good temporal correspondence with the audio signals and the music content. These correspondences provide useful information for estimating source associations, which enables novel research and applications and is essential for leveraging visual information to analyze individual sound sources in music performances. Potential applications include:
The system models three typical types of correspondences: 1) between body motions (e.g., bowing for string instruments and sliding for trombone) and note onsets, 2) between finger motions (e.g., fingering for most woodwind/brass instruments) and note onsets, and 3) between vibrato hand motions (e.g., rolling of the fingering hand for string instruments) and pitch fluctuations.
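As an illustration of the first correspondence type, the minimal sketch below (not the exact model from our papers) scores how well a per-frame motion signal for one player matches the note onsets of one track; the inputs motion_energy and onset_times are hypothetical and assumed to come from motion analysis and from the score, respectively.

```python
# Hedged sketch: correlate a player's motion activity with a track's note onsets.
# Both inputs are hypothetical; the actual system uses more elaborate models.
import numpy as np

def onset_motion_score(motion_energy, onset_times, fps=30.0):
    """motion_energy : per-video-frame motion magnitude of one player
    onset_times   : note onset times (seconds) of one score/audio track"""
    n = len(motion_energy)
    # Impulse train marking the video frames that contain a note onset.
    onsets = np.zeros(n)
    idx = np.clip(np.round(np.asarray(onset_times) * fps).astype(int), 0, n - 1)
    onsets[idx] = 1.0
    # Motion *changes* tend to co-occur with onsets (e.g., bow-direction changes).
    motion_change = np.abs(np.diff(motion_energy, prepend=motion_energy[0]))
    # Normalized correlation serves as the player-track association score.
    mc = (motion_change - motion_change.mean()) / (motion_change.std() + 1e-8)
    on = (onsets - onsets.mean()) / (onsets.std() + 1e-8)
    return float(np.mean(mc * on))
```

Scores of this kind, computed for every player-track pair, feed the assignment step evaluated below.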
Details of the three modules are illustrated below:
We calculate the association accuracy on the testing excerpts (~17K excerpts of ensembles with up to 5 players performing simultaneously, i.e., quintets, with durations varying from 5 sec to 30 sec). A testing excerpt is counted as correct only when all of its tracks are correctly associated. For a quintet excerpt, for example, a random guess has a 1/120 (= 1/5!) chance of producing a correct association.
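A hedged sketch of this evaluation protocol (function and variable names are illustrative): pairwise association scores for an excerpt are assumed to be collected in a players-by-tracks matrix, the best permutation is found exhaustively (feasible for at most 5 players, i.e., 5! = 120 permutations), and an excerpt counts as correct only if the entire permutation matches the ground truth.

```python
# Illustrative evaluation sketch; score matrices and ground truth are assumed given.
import itertools
import numpy as np

def best_assignment(score_matrix):
    """Exhaustive search over player-to-track permutations (at most 5! = 120)."""
    n = score_matrix.shape[0]
    best_perm, best_val = None, -np.inf
    for perm in itertools.permutations(range(n)):
        # perm[t] is the player assigned to track t.
        val = sum(score_matrix[p, t] for t, p in enumerate(perm))
        if val > best_val:
            best_perm, best_val = perm, val
    return best_perm

def association_accuracy(score_matrices, true_perms):
    """Fraction of excerpts whose full assignment matches the ground truth."""
    correct = sum(best_assignment(S) == tuple(gt)
                  for S, gt in zip(score_matrices, true_perms))
    return correct / len(score_matrices)
```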
The numerical evaluation is shown in the following figures:
Bochen Li, Karthik Dinesh, Zhiyao Duan and Gaurav Sharma, See and listen: score-informed association of sound tracks to players in chamber music performance videos, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2906-2910. <pdf> <slides>
Bochen Li, Chenliang Xu, and Zhiyao Duan, Audio-visual source association for string ensembles through multi-modal vibrato analysis, in Proc. the 14th Sound and Music Computing Conference (SMC), 2017, pp. 159-166. (best paper award) <pdf> <slides>
Bochen Li, Karthik Dinesh, Chenliang Xu, Gaurav Sharma, and Zhiyao Duan, Online audio-visual source association for chamber music performances, Transactions of the International Society for Music Information Retrieval, vol. 2, no. 1, pp. 29–42, 2019. <pdf>
Multi-pitch analysis (MPA) of polyphonic music is important in many music information retrieval (MIR) tasks, including automatic music transcription, music source separation, and audio-score alignment. It can be performed at different levels: Multi-pitch Estimation (MPE) estimates the concurrent pitches and the number of pitches (polyphony) in each time frame; Multi-pitch Streaming (MPS) goes one step further and also assigns the pitch estimates to different sound sources. This is challenging for approaches based on audio alone due to the polyphonic nature of the audio signals. Video of the performance, when available, can help alleviate some of these difficulties.
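For illustration only, the snippet below contrasts the two output levels with made-up data (MIDI pitch numbers and source names are hypothetical):

```python
# MPE: per-frame sets of concurrent pitches, with no source identity.
mpe_output = [
    {60, 64, 67},   # frame 0: C4, E4, G4 sounding together
    {60, 64, 67},   # frame 1
    {62, 65, 69},   # frame 2: the chord changes
]

# MPS: the same pitches streamed into per-source trajectories
# (None would mark frames in which a source is silent).
mps_output = {
    "violin_1": [67, 67, 69],
    "violin_2": [64, 64, 65],
    "cello":    [60, 60, 62],
}
```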
We propose to detect play/non-play (P/NP) activities from musical performance videos using optical flow analysis to help with audio-based multi-pitch analysis. Specifically, the detected P/NP activity provides a more accurate estimate of the instantaneous polyphony (i.e., the number of pitches at a time instant), and also helps assign pitch estimates only to active sound sources. This is illustrated in the following figure:
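A minimal sketch, with assumed inputs, of the first idea: the number of visually active players in a frame caps the instantaneous polyphony, so only the strongest audio pitch candidates are kept.

```python
# Hedged sketch: use per-frame P/NP labels to bound the polyphony of an
# audio-based multi-pitch estimator. All inputs are hypothetical.
import numpy as np

def constrain_polyphony(pitch_candidates, saliences, pnp_frame):
    """pitch_candidates : candidate pitches for one audio frame
    saliences        : their salience values from the audio model
    pnp_frame        : boolean play/non-play label per player for this frame"""
    polyphony = int(np.sum(pnp_frame))      # visually estimated number of pitches
    order = np.argsort(saliences)[::-1]     # strongest candidates first
    return [pitch_candidates[i] for i in order[:polyphony]]
```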
The P/NP activity labels for each player are detected by optical flow estimation and supervised classification:
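The sketch below illustrates this step under assumptions of our own (the feature design and classifier choice are illustrative rather than the paper's exact pipeline): dense optical flow is computed inside each player's region, summarized into per-frame statistics, and classified into play/non-play with a supervised classifier.

```python
# Hedged sketch of per-player P/NP detection from dense optical flow.
import cv2
import numpy as np
from sklearn.svm import SVC

def flow_features(gray_frames):
    """Summarize optical-flow magnitude inside one player's cropped region.
    gray_frames : list of consecutive grayscale crops for that player."""
    feats = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)          # per-pixel motion magnitude
        feats.append([mag.mean(), mag.std(), mag.max()])
    return np.asarray(feats)

# Supervised classification of each frame's feature vector into play / non-play.
# train_feats and train_labels are assumed to come from annotated training videos.
# clf = SVC(kernel="rbf").fit(train_feats, train_labels)
# pnp_labels = clf.predict(flow_features(test_gray_frames))
```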
As a first attempt toward audio-visual multi-pitch analysis of multi-instrument music performances, we demonstrate the concept on 11 string ensembles including duets, trios, quartets, and quintets. The evaluation of P/NP activity detection is performed on each individual player. The evaluations of MPE and MPS are performed on a set expanded from the original 11 ensembles, e.g., from a quintet we can further create 10 duets, 10 trios, and 5 quartets.
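The size of this expanded set follows from simple combinatorics, as the snippet below verifies with hypothetical player labels:

```python
# From one quintet's five isolated tracks, every subset of 2-4 players can be mixed.
from itertools import combinations
from math import comb

players = ["vn1", "vn2", "va", "vc", "db"]     # hypothetical quintet labels
duets    = list(combinations(players, 2))      # C(5, 2) = 10
trios    = list(combinations(players, 3))      # C(5, 3) = 10
quartets = list(combinations(players, 4))      # C(5, 4) = 5
assert (len(duets), len(trios), len(quartets)) == (comb(5, 2), comb(5, 3), comb(5, 4))
```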
Karthik Dinesh*, Bochen Li*, Xinzhao Liu, Zhiyao Duan and Gaurav Sharma, Visually informed multi-pitch analysis of string ensembles, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 3021-3025. (* equal contribution) <pdf> <slides>
Vibrato is an expressive performance technique in which the pitch of a note fluctuates periodically around its nominal value; on string instruments it is produced by rolling the fingering (left) hand on the fingerboard.
Previous literature on vibrato analysis has focused on monophonic audio, and there is no existing audio-based approach for vibrato detection and analysis of multiple simultaneous sources in a polyphonic music mixture. For some instruments, such as strings, vibrato is often visible from the left-hand motion, and this visual information does not degrade as the audio information does when polyphony increases. This motivates our proposed approach of vibrato detection and analysis through video-based analysis of the fine motion of the left hand. The following figure compares pitch contours extracted from polyphonic audio and from visual motions.
System overview of the proposed video-based vibrato detection and analysis framework
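As a rough illustration of the detection idea (a sketch under our own assumptions, not the exact method of the paper), one can test whether a tracked left-hand trajectory contains a dominant oscillation in a typical vibrato-rate band of roughly 4-9 Hz:

```python
# Hedged sketch: frequency-domain test for vibrato in a left-hand motion signal.
import numpy as np

def detect_vibrato(hand_positions, fps=30.0, band=(4.0, 9.0), ratio_thresh=0.4):
    """hand_positions : per-frame left-hand displacement along the fingerboard
    direction for one player (hypothetical tracking output)."""
    x = np.asarray(hand_positions, dtype=float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_energy = spectrum[in_band].sum()
    total_energy = spectrum[1:].sum() + 1e-12   # ignore the DC component
    return band_energy / total_energy > ratio_thresh
```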
Bochen Li, Karthik Dinesh, Gaurav Sharma, and Zhiyao Duan, Video-based vibrato detection and analysis for polyphonic string music, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 123-130. (best paper nomination) <pdf> <slides>