AIR Lab | Resources

Datasets

URSing Dataset

We introduce a dataset for facilitating audio-visual analysis of singing performances. The dataset comprises a number of songs in which the singers’ solo voices are recorded in isolation. For each song, we provide high-quality audio recordings of the solo singing voice and of its mix with the accompaniment, along with a video recording of the vocal soloist’s upper body that captures facial expressions and lip movements. We anticipate that the dataset will be useful for developing audio-visual source separation systems. Note that some of the accompaniment tracks include backing vocals, which makes audio-only singing voice separation more challenging and encourages researchers to integrate the soloist’s visual information into the separation process. We also anticipate that the dataset will be useful for other multi-modal information retrieval tasks, such as audio-visual expression analysis, audio-visual correspondence, and audio-visual lyrics transcription. A more detailed description and download link are here.

URMP Dataset

We created a dataset for facilitating audio-visual analysis of musical performances. The dataset comprises a number of simple multi-instrument musical pieces assembled from coordinated but separately recorded performances of individual tracks. We anticipate that the dataset will be useful as “ground truth” for evaluating audio-visual techniques for music source separation, transcription, and performance analysis. A more detailed description and sample data are here.

Bach10

The Bach10 dataset is a polyphonic music dataset that can be used for a variety of research problems, such as multi-pitch estimation and tracking, audio-score alignment, and source separation. The dataset consists of audio recordings of each part and of the full ensemble for ten four-part J.S. Bach chorales, together with their MIDI scores, the ground-truth alignment between the audio and the score, the ground-truth pitch values of each part, and the ground-truth notes of each piece. The four parts of each piece (Soprano, Alto, Tenor and Bass) are performed on violin, clarinet, saxophone and bassoon, respectively. A more detailed description is here. Dataset Download

Ground-truth pitches for the PTDB-TUG speech dataset:

The Pitch-Tracking Database from Graz University of Technology (PTDB-TUG) is a speech database for pitch tracking. It contains microphone and laryngograph signals of 20 native English speakers reading the TIMIT corpus. The database also provides reference pitch trajectories, which were calculated from the laryngograph signals using the RAPT pitch tracking algorithm [1]. Here, we provide another version of the reference pitch trajectories, calculated from the microphone signals using the Praat pitch tracking algorithm [2]. We found that about 85% of the Praat-generated ground-truth pitches agree with the RAPT-generated ground-truth pitches. Praat-generated Reference Pitch Trajectories Download

[1] D. Talkin, “A robust algorithm for pitch tracking (RAPT),” in Speech Coding and Synthesis (W.B. Kleijn and K.K. Paliwal, eds.), pp. 495–518, Elsevier Science B.V., 1995.
[2] P. Boersma, “Praat, a system for doing phonetics by computer,” Glot International, vol. 5, no. 9/10, pp. 341–345, 2001.
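
As an illustration of how an agreement figure between two reference pitch tracks could be computed, here is a minimal Python sketch. The page does not state the agreement criterion that was used, so everything here is an assumption: the function name `pitch_agreement`, the 50-cent tolerance, the convention that 0 Hz marks an unvoiced frame, and the requirement that both tracks share the same frame grid. The pitch values are made-up toy numbers, not data from PTDB-TUG.

```python
import math

def pitch_agreement(ref_hz, est_hz, tol_cents=50.0):
    """Fraction of frames on which two pitch tracks agree.

    Assumed criterion (not taken from the dataset description): a frame
    agrees when both tracks are unvoiced (<= 0 Hz), or when both are
    voiced and differ by at most tol_cents.
    """
    if len(ref_hz) != len(est_hz):
        raise ValueError("tracks must have the same number of frames")
    if not ref_hz:
        return 0.0
    agree = 0
    for r, e in zip(ref_hz, est_hz):
        if r <= 0 and e <= 0:
            # Both tracks mark the frame as unvoiced.
            agree += 1
        elif r > 0 and e > 0:
            # Both voiced: compare on a logarithmic (cents) scale.
            if abs(1200.0 * math.log2(e / r)) <= tol_cents:
                agree += 1
        # Voiced in one track and unvoiced in the other: disagreement.
    return agree / len(ref_hz)

# Toy frame-wise tracks in Hz (hypothetical values).
rapt_track = [0.0, 120.0, 121.0, 0.0, 200.0]
praat_track = [0.0, 121.0, 180.0, 110.0, 201.0]
agreement = pitch_agreement(rapt_track, praat_track)
print(f"agreement: {agreement:.2f}")
```

A stricter or looser tolerance, or scoring voiced frames only, would change the resulting percentage, so the criterion should be reported alongside any such number.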

Non-stationary Noise:

For research on speech enhancement, we collected recordings of ten kinds of non-stationary noise: birds, casino, cicadas, computer keyboard, eating chips, frogs, jungle, machine guns, motorcycles, and ocean. The recording of each noise is between one and three minutes long. Dataset Download.

Code

Code for recent projects can be accessed from the Publications page. AIR Lab also has a GitHub page here that hosts code of some projects.

Piano Music Transcription:

The code is available here.

Sound Search by Vocal Imitation:

This code performs sound search by vocal imitation using a Semi-Siamese Convolutional Network (SCN) described in the paper "Yichi Zhang and Zhiyao Duan, IMINET: convolutional semi-siamese networks for sound search by vocal imitation, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 304-308." Given the spectrogram of a vocal imitation, the code compares it with the spectrogram of each sound candidate in the dataset. The imitation-recording pairs with the highest similarity are chosen and returned to the user. SYMM-IMINET_WASPAA2017_Code.rar
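
The retrieval step (score every candidate against the imitation, then return the best matches) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: cosine similarity on short feature vectors stands in for the similarity that the trained network would produce, and the names `cosine_similarity`, `rank_candidates`, `top_k`, and the toy sound library are all hypothetical, not part of the released code.

```python
import math

def cosine_similarity(a, b):
    """Stand-in similarity; the released code uses a trained network instead."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_candidates(imitation, candidates, top_k=3, similarity=cosine_similarity):
    """Score each candidate against the imitation and return the top_k
    (name, score) pairs, highest similarity first."""
    scored = [(name, similarity(imitation, feat)) for name, feat in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy feature vectors (hypothetical); real inputs would be spectrograms.
imitation = [0.9, 0.1, 0.0]
library = {
    "dog_bark": [1.0, 0.0, 0.0],
    "siren": [0.0, 1.0, 0.0],
    "rain": [0.0, 0.0, 1.0],
}
top = rank_candidates(imitation, library, top_k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

Passing the similarity as a parameter keeps the ranking logic independent of how the score is produced, so a learned model could be swapped in without changing `rank_candidates`.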

Multi-pitch Estimation & Streaming:

This code performs Multi-pitch Estimation (MPE) and Multi-pitch Streaming (MPS) on polyphonic music or multi-talker speech. For a piece of polyphonic audio composed of monophonic harmonic sound sources, this program first estimates pitches in each time frame, then it streams these pitch estimates across time into pitch trajectories (streams), each of which corresponds to a sound source. mpe_mps.zip
The MPE and MPS code is also available separately. mpe.zip, mps.zip

Multi-pitch Estimation & Streaming Evaluation:

This toolbox evaluates multi-pitch analysis results. It compares the estimated pitch content with the ground-truth pitch content and outputs several error measures. See the help text of each file for details of its measures. mpa_eval.zip
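
To make the idea of comparing estimated and ground-truth pitch content concrete, here is a minimal Python sketch of frame-level precision, recall, and accuracy. It is not the toolbox's implementation, and its measures may differ. The assumptions are: pitches are given in Hz as one list per frame, an estimate counts as correct when it lies within 50 cents of a ground-truth pitch, each ground-truth pitch can be matched at most once, and matching is greedy. All function names are hypothetical.

```python
import math

def cents(f1, f2):
    """Absolute distance between two frequencies in cents."""
    return abs(1200.0 * math.log2(f1 / f2))

def count_matches(est, ref, tol_cents):
    """Greedily match each estimate to the nearest unused reference pitch."""
    unused = list(ref)
    matched = 0
    for e in est:
        best = None
        for r in unused:
            d = cents(e, r)
            if d <= tol_cents and (best is None or d < best[0]):
                best = (d, r)
        if best is not None:
            unused.remove(best[1])
            matched += 1
    return matched

def evaluate(est_frames, ref_frames, tol_cents=50.0):
    """Frame-level precision, recall and accuracy over all frames."""
    if len(est_frames) != len(ref_frames):
        raise ValueError("estimate and reference need the same number of frames")
    tp = fp = fn = 0
    for est, ref in zip(est_frames, ref_frames):
        m = count_matches(est, ref, tol_cents)
        tp += m
        fp += len(est) - m   # estimates with no matching reference pitch
        fn += len(ref) - m   # reference pitches that were missed
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }

# Toy example with three frames (hypothetical pitches in Hz).
estimated = [[220.0, 330.0], [440.0], []]
reference = [[221.0, 392.0], [440.0, 660.0], [262.0]]
scores = evaluate(estimated, reference)
print(scores)
```

In this toy case two estimates are correct, one is spurious, and three reference pitches are missed, which gives a precision of 2/3, a recall of 2/5, and an accuracy of 1/3.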