PI: Zhiyao Duan (University of Rochester)
Co-PI: Bryan Pardo (Northwestern University)
Students: Yichi Zhang, Sefik Emre Eskimez, Rui Lu (University of Rochester); Bongjun Kim, Fatemeh Pishdadian, Max Morrison (Northwestern University)
Acknowledgment: This material is based upon work supported by the National Science Foundation under Grant No. 1617107 and No. 1617497.
Award title: III: Small: Collaborative Research: Algorithms for Query by Example of Audio Databases
Duration: September 1, 2016 to August 31, 2020 (Estimated)
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Acknowledgment: We acknowledge NVIDIA's Titan X Pascal GPU donation for this research.
We propose to make general audio databases content-searchable using vocal imitation of the desired sound as the query key: a user vocalizes the audio concept they have in mind, and the system retrieves audio recordings that are similar, in some way, to the vocalization. This would complement text-based search by reducing the search space, making search through large audio databases easier and quicker when text tags are available, and making search possible even when they are not.
Existing ways to index and search audio documents rely on text metadata and text-based search engines. This approach, however, is neither efficient nor effective in many scenarios, for example when recordings lack text tags or when tags fail to describe the acoustic qualities the user has in mind.
A sound-retrieval-by-vocal-imitation system addresses these issues [1]. Vocal imitation is commonly used in human communication and can also serve as a novel mode of human-computer interaction. Presented with an audio recording as the query, the system compares the query with the sound files in the library and returns the files most similar to the query. It can also be combined with text-based search to make retrieval more efficient, effective, and intuitive.
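As a concrete illustration of this retrieval loop, the sketch below ranks library sounds by the cosine similarity between fixed-length feature vectors computed from log-mel spectrograms. The hand-crafted fingerprint and the helper names (`fingerprint`, `cosine`, `search_library`) are illustrative assumptions for exposition only, not the learned representations or matching models developed in this project.

```python
# Minimal sketch: rank library sounds by similarity to a vocal-imitation query.
# The time-averaged log-mel "fingerprint" is a simple stand-in for the learned
# feature representations studied in this project.
import numpy as np
import librosa

def fingerprint(path, sr=16000, n_mels=64):
    """Fixed-length feature vector: time-averaged log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)
    return logmel.mean(axis=1)  # shape: (n_mels,)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def search_library(query_path, library_paths, top_k=5):
    """Return the top_k library files most similar to the vocal-imitation query."""
    q = fingerprint(query_path)
    scored = [(p, cosine(q, fingerprint(p))) for p in library_paths]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Example usage (file names are placeholders):
# results = search_library("imitation.wav", ["siren.wav", "dog_bark.wav", "engine.wav"])
```

In the full system, such a hand-crafted fingerprint would be replaced by learned features and the fixed cosine score by a trained matching model, which motivates the two challenges below.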
There are two main challenges in designing vocal-imitation-based sound search systems: feature representation and matching algorithms.
(1) Feature representations of vocal imitations and real sounds need to be robust to the different aspects (e.g., pitch, timbre, rhythm) that people emphasize when imitating different sounds. They also need to account for differences between imitations and real sounds caused by the physical constraints of the human vocal system.
(2) The matching algorithm needs to work with these feature representations to discern target sounds from irrelevant ones for a given query; one possible realization is sketched after this list.
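One common way to address both challenges jointly is a two-tower (Siamese-style) network: one encoder absorbs the characteristics of vocal imitations, another those of real recordings, and a similarity score over their embeddings ranks candidates. The PyTorch sketch below is a schematic under that assumption; the layer sizes and the `Encoder`/`TwoTowerMatcher` names are hypothetical, not the specific architecture developed in this project.

```python
# Schematic two-tower matcher: separate encoders for imitations and real sounds,
# with a similarity score used to discern target sounds from irrelevant ones.
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a (batch, 1, n_mels, n_frames) spectrogram to a unit-length embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-length output
        )
        self.fc = nn.Linear(32, emb_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)  # (batch, 32)
        return nn.functional.normalize(self.fc(h), dim=1)

class TwoTowerMatcher(nn.Module):
    """Separate towers for imitations and recordings; cosine score ranks candidates."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.imitation_enc = Encoder(emb_dim)
        self.recording_enc = Encoder(emb_dim)

    def forward(self, imitation_spec, recording_spec):
        q = self.imitation_enc(imitation_spec)
        r = self.recording_enc(recording_spec)
        return (q * r).sum(dim=1)  # cosine similarity of unit-normalized embeddings

# Training would push the scores of matching (imitation, recording) pairs above
# those of non-matching pairs, e.g. with a contrastive or cross-entropy objective.
```

Keeping the two encoders separate is one way to accommodate the systematic differences between imitations and real sounds noted in (1), while the learned similarity score addresses the matching problem in (2).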
The sound retrieval system developed in this work will be broadly useful across media production tools for both video and audio. Ordinary users and professional sound designers alike will be able to find appropriate sound effects to enhance the soundtracks of their videos.
This work would be transformative for biodiversity monitoring (e.g., automatic identification of bird species in field recordings of birdsong) and for search through existing audio/video collections where hand-labeling the content with searchable tags is currently impractical. More broadly, technologies that enable search and retrieval through sound examples empower the visually impaired by allowing interfaces that use sound as the primary interaction modality.
Other application areas include diagnosis from audio examples. Callers to National Public Radio’s well-known “Car Talk” show typically vocalize the sound of their ailing auto to help the hosts diagnose the problem. One could imagine a database of typical auto sounds and problems that would let a user search by vocal imitation, or by a field recording of the car, to obtain an initial diagnosis (e.g., “your car needs a new starter motor”). A similar approach could aid in the medical diagnosis of a cough. The technology developed in this work would also be useful for automatic aids to language learning (e.g., identifying native vs. non-native accents).
PI Duan teaches in the Audio and Music Engineering program at Rochester, where audio-related research has proven to be a successful way to attract diverse college students into STEM disciplines. The proposed work gives the PIs an opportunity to broaden participation in STEM fields, and they are committed to doing so by disseminating the results in their teaching and by actively recruiting students from underrepresented groups to participate in the research.