Sound Search by Vocal Imitation

PI: Zhiyao Duan (University of Rochester), Co-PI: Bryan Pardo (Northwestern University)
Students: Yichi Zhang, Sefik Emre Eskimez, Rui Lu (University of Rochester); Bongjun Kim, Fatemeh Pishdadian, Max Morrison (Northwestern University)


Acknowledgment: This material is based upon work supported by the National Science Foundation under Grant No. 1617107 and No. 1617497.

Award title: III: Small: Collaborative Research: Algorithms for Query by Example of Audio Databases

Duration: September 1, 2016 to August 31, 2020 (Estimated)

Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Acknowledgment: We acknowledge NVIDIA's Titan X Pascal GPU donation for this research.

Project Goals

We propose to make general audio databases content-searchable using vocal imitation of the desired sound as the query key: A user vocalizes the audio concept in mind and the system retrieves audio recordings that are similar, in some way, to the vocalization. This would complement text-based search to reduce the search space, making search through large audio databases easier and quicker when text tags are available, and making search possible even in cases where text tags are not available.

Why Is It Useful?

Existing ways to index and search audio documents are based on text metadata and text-based search engines. Text-based search, however, is neither efficient nor effective in many scenarios.

  • Much of the audio in user-contributed online repositories (e.g., SoundCloud, Freesound) has metadata that does not describe the details of the audio content, making the content undiscoverable through a text-based search.
  • Even files labeled with content-relevant tags often lack tags specific enough to narrow a search, so queries can return hundreds or thousands of examples.
  • Even for audio libraries that are carefully designed with a hierarchical taxonomy and detailed text labels (e.g., sound effect libraries), finding a specific sound requires users to be familiar with the taxonomy and to remember the sound's detailed descriptors, an ability that typically only experienced sound production engineers have.
  • Even for these experts, difficulties remain. Many sounds, especially computer-synthesized sounds, have no semantic meaning and are often labeled with the parameters of the synthesizers that produced them, making text-based search very unintuitive.

A sound retrieval system based on vocal imitation addresses these issues [1]. Vocal imitation is commonly used in human communication and can be employed for novel human-computer interaction. Presented with an audio recording as a query, the system compares the query with sound files in the library and returns files similar to the query. It can also be combined with text-based search to make search more efficient, effective, and intuitive.

Research Challenges

There are two main challenges in designing vocal-imitation-based sound search systems: feature representation and matching algorithms.

(1) Feature representations of vocal imitations and real sounds need to be robust to the different aspects (e.g., pitch, timbre, rhythm) that humans emphasize when imitating different sounds. They also need to account for differences between imitations and real sounds that arise from the physical constraints of the human vocal system.

(2) The matching algorithm needs to work with the feature representations to discern target sounds from irrelevant ones for a given query.
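To make the retrieval pipeline concrete, the following is a toy sketch of the general idea, not the project's actual features or matcher: each sound is summarized by a time-averaged log-magnitude spectrogram, and library sounds are ranked by cosine similarity to the query. All function names and parameter values here are illustrative assumptions.

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=256):
    """Short-time log-magnitude spectrogram using a sliding Hann-windowed FFT."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    mags = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return np.log(mags + 1e-8)  # small floor avoids log(0)

def feature_vector(signal):
    """Collapse the spectrogram over time into a fixed-length spectral summary."""
    return log_spectrogram(signal).mean(axis=0)

def rank_library(query, library):
    """Return library keys sorted by cosine similarity to the query, best first."""
    q = feature_vector(query)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cos(q, feature_vector(sig)) for name, sig in library.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy demo: two synthetic "library" tones and a query imitating the 440 Hz one.
sr = 8000
t = np.arange(sr) / sr
library = {"tone_440": np.sin(2 * np.pi * 440 * t),
           "tone_2000": np.sin(2 * np.pi * 2000 * t)}
query = np.sin(2 * np.pi * 450 * t)  # imitation is slightly off-pitch
print(rank_library(query, library))  # the 440 Hz tone should rank first
```

A real system would replace the time-averaged spectrum with learned representations that tolerate the pitch, timbre, and timing distortions a human voice introduces; this sketch only illustrates the feature-then-match structure shared by such systems.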

Broader Impacts

The sound retrieval system developed in this work will be broadly useful across media production tools for both video and audio. Ordinary people and professional sound designers alike will be able to find appropriate sound effects to enhance the soundtracks of their videos.

This work would be transformative for biodiversity monitoring (e.g., automatic identification of bird species in field recordings of birdsongs) and for search through existing audio/video collections where it is currently impractical to hand-label the content with searchable tags. All technologies that facilitate search and retrieval through sound examples are empowering for the visually impaired, as they allow for interfaces that focus on sound as the interaction modality.

Other application areas include diagnosis from audio examples. Callers to National Public Radio's well-known "Car Talk" show typically vocalize the sound of their ailing auto to help the hosts diagnose the problem. One could imagine a database of typical auto sounds and problems that would let one search by vocal imitation, or by a field recording of the car, to get an initial diagnosis (e.g., "your car needs a new starter motor"). A similar approach could aid in the medical diagnosis of a cough. The technology developed in this work would also be useful for automatic aids to language learning (e.g., identifying native vs. non-native accents).

PI Duan teaches in the Audio and Music Engineering program at Rochester, where audio-related research has been shown to be a successful way to attract diverse college students into STEM disciplines. The proposed work provides an opportunity for the PIs to broaden participation in STEM fields by disseminating the results in teaching and by actively recruiting members of underrepresented groups to participate in the research, both of which the PIs are committed to doing.
