A particle filtering approach to online beat tracking based on beat activation values calculated by a recurrent neural network.
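To illustrate the idea, here is a minimal NumPy sketch of a particle filter driven by per-frame RNN beat activations; the state representation (beat phase and period per particle), the weighting rule, and all constants are illustrative assumptions rather than the paper's exact model.

```python
import numpy as np

FPS = 100            # activation frames per second (assumed)
N = 1000             # number of particles
rng = np.random.default_rng(0)

# Each particle tracks a hypothesis: frames until the next beat (phase)
# and the beat period in frames (tempo), here limited to 60-200 BPM.
period = rng.uniform(FPS * 60 / 200, FPS * 60 / 60, N)
phase = rng.uniform(0, period)
weights = np.full(N, 1.0 / N)

def step(activation):
    """Advance one frame and reweight particles by the RNN beat activation."""
    global phase, period, weights
    phase -= 1.0
    crossed = phase <= 0                      # particles predicting a beat now
    phase[crossed] += period[crossed]         # wrap to the next predicted beat
    period += rng.normal(0.0, 0.05, N)        # small tempo diffusion
    # Reward beat-predicting particles when the activation is high, and
    # non-predicting particles when it is low.
    weights *= np.where(crossed, 1e-3 + activation, 1e-3 + (1.0 - activation))
    weights /= weights.sum()
    if 1.0 / np.sum(weights ** 2) < N / 2:    # low effective sample size
        idx = rng.choice(N, size=N, p=weights)
        phase, period = phase[idx].copy(), period[idx].copy()
        weights = np.full(N, 1.0 / N)
    return crossed.mean() > 0.5               # crude per-frame beat decision
```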
A deep reinforcement learning method that transfers counterpoint patterns from J.S. Bach chorales to compose countermelodies for Chinese folk melodies.
A deep learning system that allows a human musician to improvise a duet counterpoint with a machine partner in real time. We hope that this system will help revitalize the improvisation culture in classical music education and performance!
We propose a reinforcement learning framework for online music accompaniment in the style of Western counterpoint. The reward model is trained on J.S. Bach chorales to model intra- and inter-part interactions.
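As a rough illustration of how such a framework fits together, here is a minimal REINFORCE-style sketch in which a pre-trained reward model scores each generated note against the melody context; the network sizes, the one-hot state encoding, and the reward_model interface are placeholder assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

VOCAB = 128   # MIDI pitch vocabulary (assumed)

# Policy maps (current melody note, previous counterpoint note) to a
# distribution over the next counterpoint note.
policy = nn.Sequential(nn.Linear(2 * VOCAB, 256), nn.ReLU(), nn.Linear(256, VOCAB))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reinforce_step(melody_onehot, prev_note_onehot, reward_model):
    """Sample one countermelody note and update the policy with its reward."""
    state = torch.cat([melody_onehot, prev_note_onehot], dim=-1)
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    # The reward model scores intra- and inter-part fit; detach it so that
    # only the policy is updated by this step.
    reward = reward_model(state, action).detach()
    loss = -dist.log_prob(action) * reward    # REINFORCE policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action
```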
A system that automatically generates music compositions in a musically plausible way.
We train a model that takes MIDI data as input and outputs a visual performance in the form of expressive body movements of a pianist. It can be used for demonstrations for music learners, immersive music enjoyment systems, or human-computer interaction in automatic accompaniment systems. We show demo videos of the generated visual performances (as skeleton keypoints) compared with real human performances of the same pieces.
A complete piano music transcription system, from transcribing notes from the audio waveform to arranging them into readable score notation.
We create an audio-visual, multi-track, multi-instrument music performance dataset that comprises a number of chamber music pieces, each assembled from coordinated but separately recorded performances of the individual tracks. With ground-truth pitch/note annotations and clean individual audio tracks available, it can be used for multi-modal analysis of music performance.
We introduce a dataset for facilitating audio-visual analysis of singing performances. The dataset comprises a number of singing performances as audio and video recordings. Each song contains an isolated track of the solo singing voice and its mixture with the accompaniment track. We anticipate that the dataset will be useful for multi-modal analysis of singing performances, such as audiovisual singing voice separation, and serve as ground truth for evaluations.
We address the "sustain effect" in piano performance, caused by use of the sustain pedal or legato articulation. Due to this effect, energy from sustained notes mixes with that of the following notes in a way not notated in the score, which often causes delay errors in score following systems. We propose to modify the audio feature representation to reduce the sustain effect and improve the robustness of score following systems.
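As one illustration of this kind of feature modification, the sketch below keeps only the spectral energy that exceeds a decaying envelope of past frames, so lingering pedaled or legato notes contribute less to each frame; the decay constant and this particular scheme are assumptions, not necessarily the representation used in the paper.

```python
import numpy as np

def onset_emphasized(spec, decay=0.9):
    """Suppress sustained energy in a magnitude spectrogram.

    spec: (freq, time) magnitude spectrogram.
    Returns a feature keeping only per-bin energy that exceeds a decaying
    envelope of past frames, i.e., newly arriving (onset-like) energy.
    """
    out = np.zeros_like(spec)
    env = np.zeros(spec.shape[0])             # per-bin decaying envelope
    for t in range(spec.shape[1]):
        out[:, t] = np.maximum(spec[:, t] - env, 0.0)   # new energy only
        env = np.maximum(decay * env, spec[:, t])
    return out
```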
Live musical performances (e.g., choruses, concerts, and operas) often require the display of lyrics for the convenience of the audience. We propose a computational system that automates this real-time lyrics display process using signal processing techniques.
We propose to leverage visual information captured from music performance videos to advance several music information retrieval (MIR) tasks, such as source association, multi-pitch analysis, and vibrato analysis.
We propose an end-to-end talking face generation system that can take a speech utterance, a face image, and an emotion condition (e.g., happy, angry, etc.) as input, to render a talking face expressing that emotion.
We propose a system that generates talking faces from noisy input speech and a reference image.
We propose to use a convolutional neural network to generate 3D landmarks of a talking face from the acoustic speech waveform.
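A minimal PyTorch sketch of such a model is shown below: a 1-D convolutional network mapping raw waveform windows to 3D landmark coordinates. The layer sizes and the number of landmarks (68) are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Wave2Landmarks3D(nn.Module):
    """Raw speech waveform -> 3D landmark coordinates (hypothetical sizes)."""
    def __init__(self, n_landmarks=68):
        super().__init__()
        self.n_landmarks = n_landmarks
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=32, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                # pool over time
        )
        self.fc = nn.Linear(128, n_landmarks * 3)   # (x, y, z) per landmark

    def forward(self, wav):                   # wav: (batch, 1, samples)
        h = self.conv(wav).squeeze(-1)        # (batch, 128)
        return self.fc(h).view(-1, self.n_landmarks, 3)
```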
We propose an adversarial training method for speech super-resolution or speech bandwidth extension.
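For illustration, here is a hedged sketch of a standard adversarial training step for bandwidth extension, with a generator that upsamples narrowband speech and a discriminator on wideband audio; both networks, the L1 reconstruction term, and the loss weighting are placeholders rather than the proposed method's exact objective.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def train_step(G, D, opt_g, opt_d, narrowband, wideband, lam=10.0):
    """One adversarial update for speech bandwidth extension (sketch)."""
    fake = G(narrowband)
    # Discriminator: real wideband -> 1, generated -> 0.
    real_logits = D(wideband)
    fake_logits = D(fake.detach())
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator while staying close to the target.
    g_logits = D(fake)
    g_loss = bce(g_logits, torch.ones_like(g_logits)) + lam * l1(fake, wideband)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```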
We propose an Audio-Visual Deep Clustering (AVDC) model to integrate visual information into the process of learning better feature representations (embeddings) for time-frequency (T-F) bin clustering.
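To show where such embeddings are used, here is a minimal sketch of the clustering stage in the deep-clustering paradigm: each T-F bin's embedding (assumed already computed, e.g., audio-visually) is assigned to a source by K-means, yielding binary separation masks. Shapes and the number of sources are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def masks_from_embeddings(emb, n_sources=2):
    """Cluster T-F bin embeddings into sources and build binary masks.

    emb: (time, freq, dim) embedding for each T-F bin.
    Returns one (time, freq) binary mask per source, to be applied to the
    mixture spectrogram for separation.
    """
    T, F, D = emb.shape
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(emb.reshape(-1, D))
    labels = labels.reshape(T, F)
    return [(labels == k).astype(np.float32) for k in range(n_sources)]
```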
We propose to use an LSTM network to generate landmarks of a talking face from acoustic speech.
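Below is a minimal PyTorch sketch of this kind of model, assuming per-frame acoustic features (e.g., MFCCs) as input and 2D landmark coordinates per frame as output; all sizes are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class Speech2Landmarks(nn.Module):
    """Per-frame acoustic features -> 2D landmarks per frame (assumed sizes)."""
    def __init__(self, n_feats=39, n_landmarks=68):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, 256, num_layers=2, batch_first=True)
        self.fc = nn.Linear(256, n_landmarks * 2)   # (x, y) per landmark

    def forward(self, feats):                 # feats: (batch, frames, n_feats)
        h, _ = self.lstm(feats)               # (batch, frames, 256)
        return self.fc(h)                     # (batch, frames, n_landmarks * 2)
```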
We propose to make general audio databases content-searchable using vocal imitation of the desired sound as the query key: A user vocalizes the audio concept in mind and the system retrieves audio recordings that are similar, in some way, to the vocalization.
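As a sketch of the retrieval step, assuming some embedding function that maps both imitations and database recordings into a shared space (the embedder itself is the hard part and is a placeholder here), ranking candidates by cosine similarity might look like:

```python
import numpy as np

def search(query_emb, db_embs, top_k=5):
    """Rank database recordings by cosine similarity to the vocal imitation.

    query_emb: (dim,) embedding of the user's imitation.
    db_embs:   (n_items, dim) embeddings of the database recordings.
    """
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    return np.argsort(-(db @ q))[:top_k]      # indices of the best matches
```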