NSF Projects

Music

Music Generation

Euterpe: A Web Framework for Interactive Music Systems

A prototyping web framework designed to help you deploy your interactive musical system or agent on the web without having to learn web programming.

Draw and Listen! A Sketch-based System for Music Inpainting

We present a system that converts a user's hand-drawn curves to a melody, for filling in missing measures in a monophonic music piece.

FolkDuet: When Counterpoint Meets Chinese Folk Melodies

A deep reinforcement learning method that transfers counterpoint patterns from J.S. Bach chorales to compose countermelodies for Chinese folk melodies.

BachDuet: Human-Machine Counterpoint Improvisation

A deep learning system that allows a human musician to improvise a duet counterpoint with a machine partner in real time. We hope that this system will help revitalize the improvisation culture in classical music education and performance!

Online Music Accompaniment by Reinforcement Learning

We propose a reinforcement learning framework for online music accompaniment in the style of Western counterpoint. The reward model is trained from J.S. Bach chorales to model intra- and inter-part interaction.

Part-Invariant Model for Music Generation and Harmonization

We present a neural language (music) model for symbolic multi-part music. Our model is part-invariant, i.e., it can process or generate any part (voice) of a music score consisting of an arbitrary number of parts, using a single trained model. After training, generation is performed by Gibbs sampling.
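As a rough illustration of what generation by Gibbs sampling can look like for a multi-part score, here is a minimal sketch. It is not the project's implementation: `toy_conditional` is a hypothetical stand-in for the trained model, and the grid layout (parts by time steps, MIDI pitch numbers 60 to 72) is an assumption made for the example. The same conditional function is called for every part, which mirrors the part-invariant idea.

```python
import random

def gibbs_sample(grid, conditional, n_sweeps, rng):
    """Resample each cell of `grid` in place, one at a time, given all other cells."""
    n_parts, n_steps = len(grid), len(grid[0])
    for _ in range(n_sweeps):
        for p in range(n_parts):
            for t in range(n_steps):
                # The conditional sees the whole current grid and returns
                # candidate pitches with unnormalized weights for cell (p, t).
                pitches, weights = conditional(grid, p, t)
                grid[p][t] = rng.choices(pitches, weights=weights, k=1)[0]
    return grid

def toy_conditional(grid, p, t):
    """Hypothetical stand-in for a trained model (assumption, not the real one).

    Favors pitches close to the previous note in the same part.
    """
    pitches = list(range(60, 73))
    prev = grid[p][t - 1] if t > 0 else 66
    weights = [1.0 / (1 + abs(pitch - prev)) for pitch in pitches]
    return pitches, weights

rng = random.Random(0)
# Start from a random 3-part, 8-step score and refine it by repeated sweeps.
score = [[rng.randrange(60, 73) for _ in range(8)] for _ in range(3)]
sampled = gibbs_sample(score, toy_conditional, n_sweeps=20, rng=rng)
```

In a real system the conditional would come from the trained network, and the number of sweeps would be chosen empirically.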

Audio-Visual Analysis and Generation

Audiovisual Singing Voice Separation

Separating a song into vocal and accompaniment components is an active research topic, and recent years have seen improved performance from supervised training using deep learning techniques. We propose to use the visual information corresponding to the singers’ vocal activities to further improve the quality of the separated vocal signals.

Audio-visual Analysis of Music Performance

We propose to leverage visual information captured from music performance videos to advance several music information retrieval (MIR) tasks, such as source association, multi-pitch analysis, and vibrato analysis. We also created two audio-visual music performance datasets, covering different musical instruments and voice.

Skeleton Plays Piano: Generating Pianist Movement from MIDI Data

We train a model that takes MIDI data as input and outputs the visual performance as expressive body movements of a pianist. It can be used for demonstrations for music learners, in immersive music enjoyment systems, or for human-computer interaction in automatic accompaniment systems. We show all the demo videos of the generated visual performances (as skeleton key points) compared with real human performances of the same pieces.

Audio Analysis

Music Rhythm Analysis

A series of approaches to real-time beat, downbeat, and meter tracking for general music audio and singing voices.

Guitar Tablature Transcription with Inhibition

A methodology for regularizing guitar tablature transcription systems using an inhibition loss with weights derived from co-occurrence likelihoods estimated using a collection of symbolic tablature.
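To make the idea of an inhibition loss concrete, the sketch below shows one plausible form, which is not necessarily the formulation used in the project. It assumes each tablature frame is given as a set of active class indices (for example, string and fret positions), sets the weight between two classes to one minus how often they occur together, and penalizes the product of their predicted probabilities. The function names and the weighting formula are assumptions for illustration only.

```python
def cooccurrence_weights(frames, n_classes):
    """Assumed weighting: w[i][j] = 1 - P(i and j both active | i or j active)."""
    weights = [[0.0] * n_classes for _ in range(n_classes)]
    for i in range(n_classes):
        for j in range(i + 1, n_classes):
            both = sum(1 for f in frames if i in f and j in f)
            either = sum(1 for f in frames if i in f or j in f)
            # Pairs never observed at all are treated as fully inhibiting.
            w = 1.0 - both / either if either else 1.0
            weights[i][j] = weights[j][i] = w
    return weights

def inhibition_loss(probs, weights):
    """Penalize pairs of classes that are predicted together but rarely co-occur."""
    n = len(probs)
    return sum(
        weights[i][j] * probs[i] * probs[j]
        for i in range(n)
        for j in range(i + 1, n)
    )

# Toy symbolic data: classes 0 and 1 always appear together, class 2 alone.
frames = [{0, 1}, {0, 1}, {2}]
weights = cooccurrence_weights(frames, n_classes=3)
compatible = inhibition_loss([0.9, 0.8, 0.1], weights)
incompatible = inhibition_loss([0.9, 0.1, 0.8], weights)
```

With these toy frames, predicting classes 0 and 1 together incurs a smaller penalty than predicting classes 0 and 2 together, which is the behavior the regularizer is meant to encourage.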

Learning Sparse Analytic Filters for Piano Transcription

A methodology for learning a bank of sparse analytic filters to use as a frontend for music transcription models.

Piano Music Transcription into Music Notation

A complete piano music transcription system, from transcribing notes from the audio waveform to arranging them into readable score notation.

Score Following for Expressive Piano Performance

We address the "sustained effect" in piano music performance, caused by the use of the sustain pedal or legato articulations. Due to this effect, energy from the sustained notes mixes with that of the following notes in a way that is not notated in the score, which results in delay errors in score following systems. We propose to modify the audio feature representations to reduce the sustained effect and enhance the robustness of score following systems.

Automatic Lyrics Display for A Live Chorus Performance

Live musical performances (e.g., choruses, concerts, and operas) often require the display of lyrics for the convenience of the audience. We propose a computational system to automate this real-time lyrics display process using signal processing techniques.

Speech

Speech Anti-Spoofing

We study anti-spoofing systems that improve the reliability of speaker verification systems against synthetic and converted voices. We propose methods to generalize anti-spoofing to unseen synthetic attacks and channel variation.

Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

We propose an end-to-end talking face generation system that takes a speech utterance, a face image, and an emotion condition (e.g., happy or angry) as input and renders a talking face expressing that emotion.

End-to-End Generation of Talking Faces from Noisy Speech

We propose a system that can generate talking faces from noisy input speech and a reference image.

Generating 2D and 3D Talking Face Landmarks from Noisy Speech

We propose to use an LSTM network to generate 2D landmarks of a talking face from acoustic speech and a 1D convolutional network to generate 3D landmarks from noisy speech waveforms.

Adversarial Training for Speech Super-Resolution

We propose an adversarial training method for speech super-resolution or speech bandwidth extension.

Audio-Visual Speech Source Separation

We propose an Audio-Visual Deep Clustering (AVDC) model that integrates visual information into the process of learning better feature representations (embeddings) for Time-Frequency (T-F) bin clustering.

General Sounds

Sound Search by Vocal Imitation

We propose to make general audio databases content-searchable using vocal imitation of the desired sound as the query key: A user vocalizes the audio concept in mind and the system retrieves audio recordings that are similar, in some way, to the vocalization.