The goal of this project was to explore the efficacy of filterbank learning for low-level music information retrieval tasks such as automatic music transcription. Here, filterbank learning corresponds to replacing the standard feature extraction stage, where features such as Mel-Spectrogram or the Constant-Q Transform (CQT) are typically employed, with a bank of learnable complex filters.
In the classic filterbank learning approach, each complex filter is represented as two separate 1D convolutional filters corresponding to its real and imaginary parts. Audio is fed directly into each filter, and L2 pooling is applied across each real/imaginary pair of responses to compute the magnitude response of the corresponding complex filter.
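The classic approach described above can be sketched as a small PyTorch module. This is a minimal illustration, not the authors' exact configuration: the filter count, kernel size, and stride below are assumed values, and the channel layout (interleaved real/imaginary pairs) is one of several equivalent choices.

```python
import torch
import torch.nn as nn

class LearnableFilterbank(nn.Module):
    """Bank of learnable complex filters with L2 (magnitude) pooling."""

    def __init__(self, n_filters=256, kernel_size=513, stride=256):
        super().__init__()
        # A single Conv1d holds both parts: even output channels act as
        # the real filters and odd channels as the imaginary filters.
        self.conv = nn.Conv1d(1, 2 * n_filters, kernel_size,
                              stride=stride, bias=False)
        self.n_filters = n_filters

    def forward(self, audio):
        # audio: (batch, samples) -> add a channel axis for Conv1d
        responses = self.conv(audio.unsqueeze(1))
        b, _, frames = responses.shape
        # Group each real/imaginary pair, then take the L2 norm across
        # the pair to obtain each complex filter's magnitude response.
        pairs = responses.view(b, self.n_filters, 2, frames)
        return torch.sqrt((pairs ** 2).sum(dim=2))

fb = LearnableFilterbank()
magnitude = fb(torch.randn(4, 16000))  # (batch, n_filters, frames)
```

Because the two parts share one convolution, the whole frontend remains an ordinary differentiable layer and can be trained jointly with the downstream transcription model.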
In addition to the classic filterbank learning approach, we experimented with various extensions to learn frontend filters for the Onsets & Frames piano transcription model. We used the MAESTRO dataset for training, validation, and testing, and we also evaluated models on the MAPS dataset.
Several different filterbank initialization strategies were also investigated to see what kind of impact initialization had on the filterbank learning process. These strategies include random initialization, Variable-Q initialization, and harmonic comb initialization.
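Two of these initialization strategies can be sketched as follows. The function names, sample rate, and scaling are illustrative assumptions, not the paper's exact parameterization; the harmonic comb kernel simply stacks windowed sinusoids at integer multiples of a root frequency.

```python
import numpy as np

def random_init(n_filters, kernel_size, seed=None):
    """Small random weights, analogous to standard conv initialization."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_filters, kernel_size)) * 0.01

def harmonic_comb_init(f0, n_harmonics, kernel_size, sr=16000):
    """Windowed sum of sinusoids at integer multiples of f0, so the
    filter responds to a harmonic stack rooted at f0 (e.g. a piano note)."""
    t = np.arange(kernel_size) / sr
    window = np.hanning(kernel_size)
    comb = sum(np.cos(2 * np.pi * f0 * h * t)
               for h in range(1, n_harmonics + 1))
    return window * comb / n_harmonics
```

A Variable-Q initialization would follow the same pattern, but with one complex sinusoid per filter whose bandwidth (and hence window length) varies smoothly with center frequency.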
In general, the learned filterbanks did not outperform standard time-frequency representations such as the Mel-Spectrogram or CQT as frontends for piano transcription. However, the performance gap between models trained jointly with the learned filterbanks and those trained with the standard features was quite small. Furthermore, one surprising result is that the randomly initialized filterbanks fell only slightly behind those initialized with the Variable-Q strategy.
The following figures illustrate some of the filters learned in the experiment with random initialization where both filterbank learning extensions (Hilbert transform and variational dropout) were applied.
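The Hilbert transform extension mentioned above can be illustrated with a short NumPy sketch: only the real part of each filter is learned, and the imaginary part is derived via the Hilbert transform so that each real/imaginary pair forms an analytic signal with a one-sided spectrum. This is the standard FFT-based construction (as in `scipy.signal.hilbert`); the paper's exact formulation may differ in detail.

```python
import numpy as np

def analytic_pair(real_filter):
    """Given a real filter, return the (real, imaginary) parts of the
    corresponding analytic filter via the FFT-based Hilbert transform."""
    n = len(real_filter)
    spectrum = np.fft.fft(real_filter)
    # Zero the negative frequencies and double the positive ones.
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spectrum * h)
    # The real part of the analytic signal equals the input filter.
    return analytic.real, analytic.imag
```

Tying the imaginary part to the real part in this way halves the number of learned frontend parameters and constrains every filter to be analytic by construction.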
Please refer to the arXiv version of the paper for more examples of learned filters from each experiment.
Frank Cwitkowitz, Mojtaba Heydari, and Zhiyao Duan, "Learning Sparse Analytic Filters for Piano Transcription," in Proc. of the Sound and Music Computing Conference (SMC), 2022, pp. 209–216.
This work was funded by National Science Foundation grants IIS-1846184 and DGE-1922591. We would also like to thank Dr. Juan Cockburn and Dr. Andres Kwasinski for their guidance during preliminary work.