Journal Papers

‡: undergraduate student

[38] Ge Zhu, Yutong Wen, and Zhiyao Duan, Audio generation through score-based generative modeling: design principles and implementation, accepted by Foundations and Trends in Signal Processing, 2026. <code>

[37] Raquel Norel, Jennifer Gewandter, Zhengwu Zhang, Anika Tahsin, Chadi G Abdallah, John Markman, Zhiyao Duan, Guillermo Cecchi, and Paul Geha, Turning patients' open-ended narratives of chronic pain into quantitative measures: Natural language processing study, JMIR Human Factors, vol. 12, 2025.

[36] Ge Zhu, Jordan Darefsky‡, and Zhiyao Duan, Cacophony: An improved contrastive audio-text model, IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 32, pp. 4867-4879, 2024. <arXiv> <code>

[35] Mojtaba Heydari and Zhiyao Duan, BeatNet+: Real-time rhythm analysis for diverse music audio, Transactions of the International Society for Music Information Retrieval (TISMIR), vol. 7, no. 1, pp. 274-287, 2024. <pdf> <code>

[34] Yujia Yan and Zhiyao Duan, Measure by measure: Enabling automatic music composition with modern staff notation, Transactions of the International Society for Music Information Retrieval (TISMIR), vol. 7, no. 1, pp. 228-245, 2024. <pdf> <code> <web>

[33] Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, and Nicholas J. Bryan, MusicHiFi: Fast high-fidelity stereo vocoding, IEEE Signal Processing Letters, vol. 31, pp. 2365-2369, 2024. <arXiv> <web>

[32] Zhiyao Duan*, Peter van Kranenburg*, Juhan Nam*, and Preeti Rao*, Editorial for TISMIR special collection: Cultural diversity in MIR research, Transactions of the International Society for Music Information Retrieval, Special Collection on Cultural Diversity in MIR Research, vol. 6, no. 1, pp. 203-205, 2024. (* authors in alphabetical order) <pdf> <special collection>

[31] Yongyi Zang‡*, Christodoulos Benetatos*, and Zhiyao Duan, Euterpe: A web framework for music interaction and creation, Journal of the Audio Engineering Society, vol. 71, no. 11, pp. 738-752, 2023. (* equal contribution) <pdf> <web> <code>

[30] Tong Shan, Casper E. Wenner, Chenliang Xu, Zhiyao Duan, and Ross K. Maddox, Speech-in-noise comprehension is improved when viewing a deep-neural-network-generated talking face, Trends in Hearing, vol. 26, pp. 1-10, 2022. <pdf>

[29] Ge Zhu, Jordan Darefsky‡, Fei Jiang, Anton Selitskiy, and Zhiyao Duan, Music source separation with generative flow, IEEE Signal Processing Letters, vol. 29, pp. 2288-2292, 2022. <pdf> <poster> <code> <web>

[28] Christodoulos Benetatos and Zhiyao Duan, Draw and listen! A sketch-based system for music inpainting, Transactions of the International Society for Music Information Retrieval, vol. 29, pp. 2288-2292, 2022. <pdf> <web> <demo> <code>

[27] Sefik Emre Eskimez, You Zhang, and Zhiyao Duan, Speech driven talking face generation from a single image and an emotion condition, IEEE Transactions on Multimedia, vol. 24, pp. 3480-3490, 2022. <link> <arXiv> <code> <web>

[26] Bochen Li, Yuxuan Wang, and Zhiyao Duan, Audiovisual singing voice separation, Transactions of the International Society for Music Information Retrieval, vol. 4, no. 1, pp. 195-209, 2021. <arXiv> <dataset>

[25] You Zhang, Fei Jiang, and Zhiyao Duan, One-class learning towards synthetic voice spoofing detection, IEEE Signal Processing Letters, vol. 28, pp. 937-941, 2021. <pdf> <code> <poster> <slides> <video>

[24] Fei Jiang and Zhiyao Duan, Speaker attractor network: generalizing speech separation to unseen numbers of sources, IEEE Signal Processing Letters, vol. 27, pp. 1859-1863, 2020. <pdf> <code>

[23] Sefik Emre Eskimez, Ross Maddox, Chenliang Xu, and Zhiyao Duan, Noise-resilient training method for face landmark generation from speech, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 27-38, 2020. <pdf> <web>

[22] Bochen Li, Karthik Dinesh, Chenliang Xu, Gaurav Sharma, and Zhiyao Duan, Online audio-visual source association for chamber music performances, Transactions of the International Society for Music Information Retrieval, vol. 2, no. 1, pp. 29-42, 2019. <pdf> <web>

[21] Rui Lu, Zhiyao Duan, and Changshui Zhang, Audio-visual deep clustering for speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 11, pp. 1697-1712, 2019. <pdf> <web>

[20] Sefik Emre Eskimez, Kazuhito Koishida, and Zhiyao Duan, Adversarial training for speech super-resolution, IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 347-358, 2019. <pdf> <web>

[19] Zhiyao Duan*, Slim Essid*, Cynthia C. S. Liem*, Gaël Richard*, and Gaurav Sharma*, Audio-visual analysis of music performances, IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 63-73, 2019. (* authors in alphabetical order) <pdf>

[18] Emmanouil Benetos*, Simon Dixon*, Zhiyao Duan*, and Sebastian Ewert*, Automatic music transcription: an overview, IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20-30, 2019. (* authors in alphabetical order) <pdf>

[17] Yichi Zhang, Bryan Pardo, and Zhiyao Duan, Siamese style convolutional neural networks for sound search by vocal imitation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 429-441, 2019. <pdf> <web>

[16] Bochen Li*, Xinzhao Liu*, Karthik Dinesh, Zhiyao Duan, and Gaurav Sharma, Creating a multitrack classical music performance dataset for multi-modal music analysis: challenges, insights, and applications, IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522-535, 2019. (* equal contribution) <pdf> <web>

[15] Rui Lu, Zhiyao Duan, and Changshui Zhang, Listen and look: audio-visual matching assisted speech source separation, IEEE Signal Processing Letters, vol. 25, no. 9, pp. 1315-1319, 2018. <pdf> <web>

[14] Sefik Emre Eskimez, Peter Soufleris, Zhiyao Duan, and Wendi Heinzelman, Front-end speech enhancement for commercial speaker verification systems, Speech Communication, vol. 99, pp. 101-113, 2018. <pdf>

[13] Shiwei Yu, Hongjuan Zhang, and Zhiyao Duan, Singing voice separation by low-rank and sparse spectrogram decomposition with pre-learned dictionaries, Journal of the Audio Engineering Society, vol. 65, no. 5, pp. 377-388, 2017.

[12] Andrea Cogliati, Zhiyao Duan, and Brendt Wohlberg, Piano transcription with convolutional sparse lateral inhibition, IEEE Signal Processing Letters, vol. 24, no. 4, pp. 392-396, 2017. <pdf>

[11] David Temperley, Iris Ren, and Zhiyao Duan, Mediant mixture and "blue notes" in rock: An exploratory study, Music Theory Online, vol. 23, no. 1, 2017. <pdf> <examples>

[10] Bochen Li and Zhiyao Duan, An approach to score following for piano performances with the sustained effect, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2425-2438, 2016. <pdf> <web>

[9] Na Yang, Jianbo Yuan, Yun Zhou, Ilker Demirkol, Zhiyao Duan, Wendi Heinzelman, and Melissa Sturge-Apple, Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification, International Journal of Speech Technology, doi:10.1007/s10772-016-9364-2, 2016. <pdf>

[8] Andrea Cogliati, Zhiyao Duan, and Brendt Wohlberg, Context-dependent piano music transcription with convolutional sparse coding, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2218-2230, 2016. <pdf> <web>

[7] Yichi Zhang and Zhiyao Duan, Supervised and unsupervised sound retrieval by vocal imitation, Journal of the Audio Engineering Society, vol. 64, no. 7/8, pp. 533-543, 2016. <pdf> <web>

[6] Francisco J. Rodriguez-Serrano, Zhiyao Duan, Pedro Vera-Candeas, Bryan Pardo, and Julio J. Carabias-Orti, Online score-informed source separation with adaptive instrument models, Journal of New Music Research, vol. 44, no. 2, pp. 83-96, 2015. <pdf>

[5] Zafar Rafii, Zhiyao Duan, and Bryan Pardo, Combining rhythm-based and pitch-based methods for background and melody separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1884-1893, 2014. <pdf>

[4] Zhiyao Duan, Jinyu Han, and Bryan Pardo, Multi-pitch streaming of harmonic sound mixtures, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 138-150, 2014. <pdf> <code>

[3] Zhiyao Duan and Bryan Pardo, Soundprism: an online system for score-informed source separation of music audio, IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205-1215, 2011. <pdf> <slides> <sound files> <code>

[2] Zhiyao Duan, Bryan Pardo, and Changshui Zhang, Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions, IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2121-2133, 2010. <pdf> <code>

[1] Zhiyao Duan, Yungang Zhang, Changshui Zhang, and Zhenwei Shi, Unsupervised single-channel music source separation by average harmonic structure modeling, IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 766-778, 2008. <pdf> <sound files>

Peer-reviewed Conference Papers

‡: undergraduate student

[93] Frank Cwitkowitz, Christodoulos Benetatos, Qixin Deng, Huiran Yu, Nathan Pruyne‡, Patrick O'Reilly, Hugo Flores Garcia, Zhiyao Duan, and Bryan Pardo, HARP 3.0: Generalizing I/O and API support for machine learning in digital audio workstations, in NeurIPS 2025 Workshop on AI for Music, 2025. <pdf> <demo>

[92] Yu Zhang, Baotong Tian, and Zhiyao Duan, Conan: A chunkwise online network for zero-shot adaptive voice conversion, accepted by IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025. <arXiv>

[91] You Zhang, Andrew Francl, Ruohan Gao, Paul Calamia, Zhiyao Duan, and Ishwarya Ananthabhotla, Towards Perception-Informed Latent HRTF Representations, accepted by IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025. (best student paper award) <arXiv>

[90] Frank Cwitkowitz and Zhiyao Duan, Investigating an overfitting and degeneration phenomenon in self-supervised multi-pitch estimation, accepted by International Society for Music Information Retrieval Conference (ISMIR), 2025. <arXiv>

[89] Kyungbok Lee‡, You Zhang, and Zhiyao Duan, Audio visual segmentation through text embeddings, in Proc. IEEE International Conference on Image Processing (ICIP), 2025. <arXiv>

[88] You Zhang*, Baotong Tian*, Lin Zhang, and Zhiyao Duan, PartialEdit: Identifying partial deepfakes in the era of neural speech editing, in Proc. Interspeech, 2025. (* equal contribution) <web>

[87] Geoffroy Peeters, Zafar Rafii, Magdalena Fuentes, Zhiyao Duan, Emmanouil Benetos, Juhan Nam, and Yuki Mitsufuji, Twenty-five years of MIR research: Achievements, practices, evaluations, and future challenges, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.

[86] You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, and Zhiyao Duan, SVDD 2024: The inaugural singing voice deepfake detection challenge, in Proc. IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 782-787. <arXiv> <web> <slides> <poster>

[85] Kyungbok Lee‡, You Zhang, and Zhiyao Duan, A multi-stream fusion approach with one-class learning for audio-visual deepfake detection, in Proc. IEEE International Workshop on Multimedia Signal Processing (MMSP), 2024. DOI: 10.1109/MMSP61759.2024.10743671. <arXiv> <slides> <code>

[84] Samuele Cornell, Jordan Darefsky‡, Zhiyao Duan, and Shinji Watanabe, Generating data with text-to-speech and large-language models for conversational speech recognition, in Interspeech Workshop on SynData4GenAI, 2024. <arXiv>

[83] Yujia Yan and Zhiyao Duan, Scoring intervals using non-hierarchical transformer for automatic piano transcription, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2024, pp. 973-980. (best paper runner-up) <arXiv> <poster, video> <slides> <code>

[82] Huiran Yu and Zhiyao Duan, Note-level transcription of choral music, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2024, pp. 182-188. <pdf, poster, video, slides>

[81] Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, and Zhiyao Duan, CtrSVDD: A benchmark dataset and baseline analysis for controlled singing voice deepfake detection, in Proc. Interspeech, 2024, pp. 4783-4787. <pdf> <web>

[80] Zehua Li, Meiying Chen, Yi Zhong, and Zhiyao Duan, GTR-Voice: Articulatory phonetics informed controllable expressive speech synthesis, in Proc. Interspeech, 2024, pp. 1775-1779. <pdf> <web>

[79] Yongyi Zang‡*, Yi Zhong*, Frank Cwitkowitz, and Zhiyao Duan, SynthTab: Leveraging synthesized data for guitar tablature transcription, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1286-1290. (* equal contribution) <arXiv> <web> <code>

[78] Yongyi Zang‡*, You Zhang*, Mojtaba Heydari, and Zhiyao Duan, SingFake: Singing voice deepfake detection, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12156-12160. (* equal contribution) <arXiv> <web>

[77] Enting Zhou‡, You Zhang, and Zhiyao Duan, Learning arousal-valence representation from categorical emotion labels of speech, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12126-12130. <arXiv>

[76] Ge Zhu, Yutong Wen‡, Marc-André Carbonneau, and Zhiyao Duan, EDMSound: Spectrogram based diffusion models for efficient and high-quality audio synthesis, in NeurIPS 2023 Workshop on Machine Learning for Audio, 2023. <pdf> <poster> <code> <web>

[75] Hugo Flores Garcia, Christodoulos Benetatos, Patrick O'Reilly, Aldo Aguilar, Zhiyao Duan, and Bryan Pardo, HARP: Bringing deep learning to the DAW with hosted, asynchronous, remote processing, in NeurIPS 2023 Workshop on Machine Learning for Creativity and Design (ML4CD), 2023. <pdf> <code>

[74] Yutong Wen‡, You Zhang, and Zhiyao Duan, Mitigating cross-database differences for learning unified HRTF representation, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023. <pdf> <slides>

[73] Qiaoyu Yang, Frank Cwitkowitz, and Zhiyao Duan, Harmonic analysis with neural semi-CRF, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 676-683. <paper, poster, video> <code>

[72] Meiying Chen and Zhiyao Duan, ControlVC: Zero-shot voice conversion with time-varying controls on pitch and speed, in Proc. Interspeech, 2023, pp. 2098-2102. <arXiv> <code> <demo>

[71] Yongyi Zang‡, You Zhang, and Zhiyao Duan, Phase perturbation improves channel robustness for speech spoofing countermeasures, in Proc. Interspeech, 2023, pp. 3162-3166. <arXiv>

[70] Ge Zhu, Yujia Yan, Juan-Pablo Caceres, and Zhiyao Duan, Transcription free filler word detection with neural semi-CRFs, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. <pdf> <slides> <video> <code>

[69] Siwen Ding, You Zhang, and Zhiyao Duan, SAMO: speaker attractor multi-center one-class learning for voice anti-spoofing, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. <pdf> <code>

[68] You Zhang, Yuxiang Wang, and Zhiyao Duan, HRTF field: unifying measured HRTF magnitude representation with neural fields, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. (top 3% of all accepted papers) <arXiv> <code> <poster>

[67] Mojtaba Heydari, Ju-Chiang Wang, and Zhiyao Duan, SingNet: A real-time singing voice beat and downbeat tracking system, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. <pdf> <video>

[66] Abudukelimu Wuerkaixi, Kunda Yan‡, You Zhang, Zhiyao Duan, and Changshui Zhang, DyViSE: Dynamic vision-guided speaker embedding for audio-visual speaker diarization, in Proc. IEEE International Workshop on Multimedia Signal Processing (MMSP), 2022. <pdf>

[65] Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, and Changshui Zhang, Rethinking audio-visual synchronization for active speaker detection, in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2022. <arXiv>

[64] You Zhang, Ge Zhu, and Zhiyao Duan, A probabilistic fusion framework for spoofing aware speaker verification, in Proc. The Speaker and Language Recognition Workshop (Odyssey), 2022, pp. 77-84. <pdf> <link> <code> <video> <slides>

[63] Mojtaba Heydari and Zhiyao Duan, Singing beat tracking with self-supervised front-end and linear transformers, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2022, pp. 617-624. <pdf, poster, video> <code>

[62] Frank Cwitkowitz, Jonathan Driedger, and Zhiyao Duan, A data-driven methodology for considering feasibility and pairwise likelihood in deep learning based guitar tablature transcription systems, in Proc. The Sound and Music Computing Conference (SMC), 2022, pp. 131-138. <pdf> <code> <slides> <poster>

[61] Frank Cwitkowitz, Mojtaba Heydari, and Zhiyao Duan, Learning sparse analytic filters for piano transcription, in Proc. The Sound and Music Computing Conference (SMC), 2022, pp. 209-216. <pdf> <code> <slides> <poster>

[60] Mojtaba Heydari, Matthew McCallum, Andreas Ehmann, and Zhiyao Duan, A novel 1d state space for efficient music rhythmic analysis, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 421-425. <pdf> <video> <code>

[59] Ge Zhu, Frank Cwitkowitz, and Zhiyao Duan, A study of the robustness of raw waveform based speaker embeddings under mismatched conditions, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7657-7661. <arXiv> <slides> <poster> <code>

[58] Rui Lu, Baigong Zheng, Jiarui Hai, Fei Tao, Zhiyao Duan*, and Ji Liu, Progressive teacher-student training framework for music tagging, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 3129-3133. (* work at Kuaishou Technology)

[57] Yujia Yan, Frank Cwitkowitz, and Zhiyao Duan, Skipping the frame-level: Event-based piano transcription with neural semi-CRFs, in Proc. The Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021. <pdf> <code>

[56] Xinhui Chen*, You Zhang*, Ge Zhu*, and Zhiyao Duan, UR channel-robust synthetic speech detection system for ASVspoof 2021, in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge Workshop (ASVspoof), 2021, pp. 75-82. (* equal contribution) <pdf> <link> <code> <video>

[55] Mojtaba Heydari, Frank Cwitkowitz, and Zhiyao Duan, BeatNet: A real-time music integrated beat and downbeat tracker, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 270-277. <pdf> <video> <code>

[54] Abudukelimu Wuerkaixi‡, Christodoulos Benetatos, Zhiyao Duan, and Changshui Zhang, CollageNet: Fusing arbitrary melody and accompaniment into a coherent song, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 786-793. <pdf>

[53] You Zhang, Ge Zhu, Fei Jiang, and Zhiyao Duan, An empirical study on channel effects for synthetic voice spoofing countermeasure systems, in Proc. Interspeech, 2021, pp. 4309-4313. <pdf> <link> <code> <video> <slides>

[52] Ge Zhu, Fei Jiang, and Zhiyao Duan, Y-vector: Multiscale waveform encoder for speaker embedding, in Proc. Interspeech, 2021, pp. 96-100. <pdf> <code>

[51] Mojtaba Heydari and Zhiyao Duan, Don't look back: An online beat tracking method using RNN and enhanced particle filtering, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 236-240. <pdf> <video>

[50] Yuxiang Wang, You Zhang, Zhiyao Duan, and Mark Bocko, Global HRTF personalization using anthropometric measures, in Audio Engineering Society 150th Convention, 2021. <pdf> <code>

[49] Nan Jiang, Sheng Jin‡, Zhiyao Duan, and Changshui Zhang, When counterpoint meets Chinese folk melodies, accepted by The Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS), 2020. <pdf+supplemental> <poster> <video> <web> <code>

[48] Christodoulos Benetatos, Joseph VanderStel, and Zhiyao Duan, BachDuet: A deep learning system for human-machine counterpoint improvisation, in Proc. International Conference on New Interfaces for Musical Expression (NIME), 2020, pp. 635-640. <pdf> <slides> <video> <web>

[47] Sefik Emre Eskimez, Ross K. Maddox, Chenliang Xu, and Zhiyao Duan, End-to-end generation of talking faces from noisy speech, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1948-1952. <pdf> <slides> <video> <web>

[46] Yichi Zhang, Junbo Hu, Yiting Zhang‡, Bryan Pardo, and Zhiyao Duan, Vroom!: A search engine for sounds by vocal imitation queries, in Proc. ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR), 2020, pp. 23-32. <pdf> <slides> <video> <web> <Vroom! search engine>

[45] Nan Jiang, Sheng Jin‡, Zhiyao Duan, and Changshui Zhang, RL-Duet: Online music accompaniment generation using deep reinforcement learning, in Proc. AAAI, 2020, pp. 710-718. <pdf> <slides> <poster> <web>

[44] Lele Chen, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu, Hierarchical cross-modal talking face generation with dynamic pixel-wise loss, in Proc. CVPR, 2019. <pdf> <poster> <video>

[43] Bongjun Kim, Madhav Ghei‡, Bryan Pardo, and Zhiyao Duan, Vocal Imitation Set: a dataset of vocally imitated sound events using the AudioSet ontology, in Proc. Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 2018. <pdf> <dataset>

[42] Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu, Audio-visual event localization in unconstrained videos, in Proc. European Conference on Computer Vision (ECCV), 2018. <pdf>

[41] Lele Chen, Zhiheng Li‡, Ross Maddox, Zhiyao Duan, and Chenliang Xu, Lip movements generation at a glance, in Proc. European Conference on Computer Vision (ECCV), 2018.

[40] Bochen Li, Akira Maezawa, and Zhiyao Duan, Skeleton plays piano: online generation of pianist body movements from MIDI performance, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2018, pp. 218-224. <pdf> <slides> <demo>

[39] Yujia Yan, Ethan Lustig, Joseph VanderStel, and Zhiyao Duan, Part-invariant model for music generation and harmonization, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2018, pp. 204-210. <pdf> <web>

[38] Sefik Emre Eskimez, Ross K. Maddox, Chenliang Xu, and Zhiyao Duan, Generating talking face landmarks from speech, in Proc. International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), 2018. <pdf> <poster> <code>

[37] Zhihan Zhou‡, Yichi Zhang, and Zhiyao Duan, Joint speaker diarization and recognition using convolutional and recurrent neural networks, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2496-2500. <pdf> <poster>

[36] Xueyang Wang, Ryan Stables, Bochen Li, and Zhiyao Duan, Score-aligned polyphonic microtiming estimation, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 361-365. <pdf> <poster>

[35] Sefik Emre Eskimez, Zhiyao Duan, and Wendi Heinzelman, Unsupervised learning approach to feature analysis for automatic speech emotion recognition, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5099-5103. <pdf> <poster>

[34] Yichi Zhang and Zhiyao Duan, Visualization and interpretation of Siamese style convolutional neural networks for sound search by vocal imitation, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2406-2410. <pdf> <slides> <code>

[33] Rui Lu, Zhiyao Duan, and Changshui Zhang, Multi-scale recurrent neural network for sound event detection, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 131-135.

[32] Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, and Chenliang Xu, Deep cross-modal audio-visual generation, in Proc. ACM International Conference on Multimedia Thematic Workshops, 2017, pp. 349-357. <pdf>

[31] Yichi Zhang and Zhiyao Duan, IMINET: convolutional semi-siamese networks for sound search by vocal imitation, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 304-308. <pdf> <poster> <code>

[30] Rui Lu, Zhiyao Duan, and Changshui Zhang, Metric learning based data augmentation for environmental sound classification, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 1-5. <pdf> <slides>

[29] Bochen Li, Karthik Dinesh, Gaurav Sharma, and Zhiyao Duan, Video-based vibrato detection and analysis for polyphonic string music, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 123-130. (best paper nomination) <pdf> <slides>

[28] Andrea Cogliati and Zhiyao Duan, A metric for music notation transcription accuracy, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 407-413. <pdf> <poster>

[27] Bochen Li, Chenliang Xu, and Zhiyao Duan, Audio-visual source association for string ensembles through multi-modal vibrato analysis, in Proc. The 14th Sound and Music Computing Conference (SMC), 2017, pp. 159-166. (best paper award) <pdf> <slides>

[26] Bochen Li, Karthik Dinesh, Zhiyao Duan, and Gaurav Sharma, See and listen: score-informed association of sound tracks to players in chamber music performance videos, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2906-2910. <pdf> <slides>

[25] Karthik Dinesh*, Bochen Li*, Xinzhao Liu, Zhiyao Duan, and Gaurav Sharma, Visually informed multi-pitch analysis of string ensembles, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 3021-3025. (* equal contribution) <pdf> <slides>

[24] Rui Lu, Kailun Wu, Zhiyao Duan, and Changshui Zhang, Deep ranking: triplet MatchNet for music metric learning, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 121-125. <pdf> <slides>

[23] Sefik Emre Eskimez, Melissa Sturge-Apple, Zhiyao Duan, and Wendi Heinzelman, WISE: web-based interactive speech emotion classification, in Proc. 4th Workshop on Sentiment Analysis where AI meets Psychology (SAAIP), 2016, pp. 2-7. <pdf> <slides>

[22] Andrea Cogliati, David Temperley, and Zhiyao Duan, Transcribing human piano performances into music notation, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 758-764. <pdf>

[21] Sefik Emre Eskimez, Kenneth Imade‡, Na Yang, Melissa Sturge-Apple, Zhiyao Duan, and Wendi Heinzelman, Emotion classification: how does an automated system compare to naive human coders?, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2274-2278. <pdf> <slides>

[20] Yichi Zhang and Zhiyao Duan, IMISOUND: An unsupervised system for sound query by vocal imitation, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2269-2273. <pdf> <slides>

[19] Andrea Cogliati, Zhiyao Duan, and Brendt Wohlberg, Piano music transcription with fast convolutional sparse coding, in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2015. <pdf>

[18] Yichi Zhang and Zhiyao Duan, Retrieving sounds by vocal imitation recognition, in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2015. <pdf> <poster> <code>

[17] Jun Zhou, Shuo Chen, and Zhiyao Duan, Rotational reset strategy for online semi-supervised NMF-based speech enhancement for long recordings, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2015. <pdf> <poster>

[16] Bochen Li and Zhiyao Duan, Score following for piano performances with sustain-pedal effects, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2015, pp. 469-475. <pdf> <poster>

[15] Andrea Cogliati and Zhiyao Duan, Piano music transcription modeling note temporal evolution, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 429-433. <pdf>

[14] Zhiyao Duan and David Temperley, Note-level music transcription by maximum likelihood sampling, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2014, pp. 181-186. <pdf>

[13] Zhiyao Duan, Bryan Pardo, and Laurent Daudet, A novel cepstral representation for timbre modeling of sound sources in polyphonic mixtures, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 7495-7499. <pdf> <poster> <code>

[12] Jonathan Springer, Zhiyao Duan, and Bryan Pardo, Approaches to multiple concurrent species bird song recognition, in the 2nd International Workshop on Machine Listening in Multisource Environments, ICASSP, 2013. <pdf> <poster>

[11] Zhiyao Duan, Gautham J. Mysore, and Paris Smaragdis, Speech enhancement by online non-negative spectrogram decomposition in non-stationary noise environments, in Proc. Interspeech, 2012, pp. 594-597. <pdf> <slides> <sound files>

[10] Zhiyao Duan, Gautham J. Mysore, and Paris Smaragdis, Online PLCA for real-time semi-supervised source separation, in Proc. International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), LNCS 7191, pp. 34-41, 2012. <pdf> <slides>

[9] Zhiyao Duan and Bryan Pardo, Aligning semi-improvised music audio with its lead sheet, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2011, pp. 513-518. <pdf> <poster> <sound files>

[8] Zhiyao Duan and Bryan Pardo, A state space model for online polyphonic audio-score alignment, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 197-200. <pdf> <poster> <sound files>

[7] Zhiyao Duan, Jinyu Han, and Bryan Pardo, Song-level multi-pitch tracking by heavily constrained clustering, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 57-60. <pdf> <slides>

[6] Zhiyao Duan, Jinyu Han, and Bryan Pardo, Harmonically informed multi-pitch tracking, in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2009, pp. 333-338. <pdf> <slides>

[5] Zhiyao Duan, Lie Lu, and Changshui Zhang, Collective annotation of music from multiple semantic categories, in Proc. International Conference on Music Information Retrieval (ISMIR), 2008, pp. 237-242. <pdf> <poster>

[4] Zhiyao Duan, Lie Lu, and Changshui Zhang, Audio tonality mode classification without tonic annotations, in Proc. International Conference on Multimedia & Expo (ICME), 2008, pp. 1361-1364. <pdf> <poster>

[3] Zhiyao Duan and Changshui Zhang, A probabilistic approach to multiple fundamental frequency estimation from the amplitude spectrum peaks, in Proc. Music, Brain and Cognition workshop in the Twenty-first Annual Conference on Neural Information Processing Systems (NIPS), 2007. <pdf> <slides> <poster>

[2] Zhiyao Duan, Dan Zhang, Changshui Zhang, and Zhenwei Shi, Multi-pitch estimation based on partial event and support transfer, in Proc. International Conference on Multimedia & Expo (ICME), 2007, pp. 216-219. <pdf> <poster> <sound files>

[1] Nelson Lee, Zhiyao Duan, and Julius O. Smith, Excitation signal extraction for guitar tones, in Proc. International Computer Music Conference (ICMC), 2007, pp. 450-457. <pdf>

Conference Abstracts and Non-peer-reviewed Papers

[16] Ge Zhu, Yutong Wen, and Zhiyao Duan, A review on score-based generative models for audio applications, arXiv:2506.08457, 2025. <pdf>

[15] Christodoulos Benetatos and Zhiyao Duan, Score reduction for guitar through reinforcement learning, in International Society for Music Information Retrieval (ISMIR) Late Breaking & Demos, 2024. <pdf, poster, video>

[14] Christodoulos Benetatos, Frank Cwitkowitz, Nathan Pruyne‡, Hugo Flores Garcia, Patrick O'Reilly, Zhiyao Duan, and Bryan Pardo, HARP 2.0: Expanding hosted, asynchronous, remote processing for deep learning in the DAW, in International Society for Music Information Retrieval (ISMIR) Late Breaking & Demos, 2024. <pdf, poster, video> <code>

[13] Jordan Darefsky‡, Ge Zhu, and Zhiyao Duan, Parakeet: A natural sounding, conversational text-to-speech model, Blog post, May 12, 2024. <web>

[12] Frank Cwitkowitz and Zhiyao Duan, Toward fully self-supervised multi-pitch estimation, arXiv:2402.15569, 2024. <pdf>

[11] You Zhang, Yuxiang Wang, Mark Bocko, and Zhiyao Duan, Grid-agnostic personalized head-related transfer function modeling with neural fields, in Acoustical Society of America 184th Meeting, 2023. <link> (Signal Processing at the ASA Student Paper Award - Second Place)

[10] Samantha E. Lettenberger, Maryam Zafar, Julia M. Soto, You Zhang, Ge Zhu, Aaron J. Masino, Grace Nkrumah, Emma Waddell, Kelsey Spear, Abigail Arky, Rajbir Toor, Emily Hartman, Jacob Epifano, Rich Christie, Zhiyao Duan, and Ray Dorsey, Words spoken daily: A novel measure of cognition, in International Congress of Parkinson’s Disease and Movement Disorders (MDS), 2023. <link>

[9] Yuxiang Wang, You Zhang, Zhiyao Duan, and Mark Bocko, Employing deep learning method to predict global head-related transfer functions from scanned head geometry, in Acoustical Society of America 181st Meeting, 2021. <link>

[8] Qiaoyu Yang‡, Panzhen Wu‡, and Zhiyao Duan, Large-scale analysis of lyrics and melodies in Cantonese pop songs, Late Breaking & Demos in International Society for Music Information Retrieval Conference, 2021. <pdf>

[7] Mingrui Yuan‡ and Zhiyao Duan, Spoofing speaker verification systems with deep multi-speaker text-to-speech synthesis, arXiv:1910.13054, 2019. <pdf>

[6] Christodoulos Benetatos and Zhiyao Duan, BACHDUET: A human-machine duet improvisation system, Late Breaking & Demos in the International Society for Music Information Retrieval Conference, 2019. <pdf> <video>

[5] Andrea Cogliati, Mina Attin, and Zhiyao Duan, Annotating ECG signals with deep neural networks, American Heart Association-Trauma and Cardiac Resuscitation Symposium, Anaheim, California, November 2017.

[4] Yukun Chen‡, Yichi Zhang, and Zhiyao Duan, Sound event detection using convolutional neural networks, 2017 Detection and Classification of Acoustic Scenes and Events (DCASE).

[3] Andrea Cogliati, Zhiyao Duan, and Brendt Wohlberg, Transcribing piano music in the time domain into music notation, The 5th Joint Meeting of the Acoustical Society of America and Acoustical Society of Japan, Honolulu, Hawaii, December 2016.

[2] Bochen Li, Zhiyao Duan, and Gaurav Sharma, Associating players to sound tracks for musical performance videos, Late Breaking Demo in the International Society for Music Information Retrieval Conference, 2016.

[1] Iris Yuping Ren, David Temperley, and Zhiyao Duan, Blue notes in rock: An exploratory study, The 6th workshop on Cognitively Based Music Informatics Research (CogMIR), 2016.

Book Chapters

[3] You Zhang, Fei Jiang, Ge Zhu, Xinhui Chen, and Zhiyao Duan, Generalizing Voice Presentation Attack Detection to Unseen Synthetic Attacks and Channel Variation, in Marcel, S., Fierrez, J., Evans, N. (eds). Handbook of Biometric Anti-Spoofing: Presentation Attack Detection and Vulnerability Assessment. Springer, Singapore, 2023. <pdf> <code>

[2] Bryan Pardo, Antoine Liutkus, Zhiyao Duan, and Gaël Richard, Applying source separation to music, Audio Source Separation and Speech Enhancement, Wiley, 2018.

[1] Bryan Pardo, Zafar Rafii, and Zhiyao Duan, Audio source separation in a musical context, Springer Handbook of Systematic Musicology, Springer-Verlag Berlin Heidelberg, 2017.

Patents

[2] Andrea Cogliati, Zhiyao Duan, and Brendt Wohlberg, Context-Dependent Piano Music Transcription with Convolutional Sparse Coding, U.S. Patent 9779706 issued in 2017.

[1] Gautham J. Mysore, Paris Smaragdis, and Zhiyao Duan, Online Non-negative Source Separation, U.S. Patent filed in 2011.

Theses

[10] Enting Zhou, Utilizing Discrete Emotion Labels for Speech Dimensional Emotion Representation Learning, Undergraduate Thesis, Department of Computer Science, University of Rochester, April 2023. Advisor: Zhiyao Duan. Reading Committee: Ross K. Maddox and Jiebo Luo.

[9] Bochen Li, Multi-modal Analysis for Music Performances, Ph.D. Dissertation, Department of Electrical and Computer Engineering, University of Rochester, August 2020. Advisor: Zhiyao Duan. Reading Committee: Mark Bocko, Chenliang Xu and Gaurav Sharma. Chair: David Temperley. (2021 Outstanding PhD Dissertation Award at the University of Rochester)

[8] Yichi Zhang, Sound Search by Vocal Imitation, Ph.D. Dissertation, Department of Electrical and Computer Engineering, University of Rochester, December 2019. Advisor: Zhiyao Duan. Reading Committee: Wendi Heinzelman and Chenliang Xu. Chair: Zhen Bai.

[7] Sefik Emre Eskimez, Robust Techniques for Generating Talking Faces from Speech, Ph.D. Dissertation, Department of Electrical and Computer Engineering, University of Rochester, July 2019. Co-Advisors: Wendi Heinzelman and Zhiyao Duan. Reading Committee: Ross K. Maddox. Chair: Chenliang Xu.

[6] Andrea Cogliati, Toward a Human-Centric Automatic Piano Music Transcription System, Ph.D. Dissertation, Department of Electrical and Computer Engineering, University of Rochester, December 2017. Advisor: Zhiyao Duan. Reading Committee: Mark Bocko, David Temperley and Brendt Wohlberg. Chair: John Lambropoulos.

[5] Jonathan Downing, Joint Source Separation and Dereverberation of Single-channel Drum Kit Recordings, M.S. Thesis, Department of Electrical and Computer Engineering, University of Rochester, December 2016. Advisor: Zhiyao Duan. Reading Committee: Gonzalo Mateos and David Temperley.

[4] Xinzhao Liu, Creating an Audio-visual Musical Performance Dataset for Enhanced Multi-pitch Analysis, M.S. Thesis, Department of Electrical and Computer Engineering, University of Rochester, May 2016. Advisor: Zhiyao Duan. Reading Committee: Gaurav Sharma and David Temperley.

[3] Andrew Trahan, A Two Part Event-Based Drum Kit Transcription System, M.S. Thesis, Department of Electrical and Computer Engineering, University of Rochester, May 2014. Advisor: Zhiyao Duan. Reading Committee: Jack Mottley and David Temperley.

[2] Zhiyao Duan, Computational Music Audio Scene Analysis, Ph.D. Dissertation, Department of Electrical Engineering and Computer Science, Northwestern University, August 2013. Advisor: Bryan Pardo. Reading Committee: Thrasyvoulos N. Pappas, Michael Honig, and DeLiang Wang. <pdf>

[1] Zhiyao Duan, Research on Polyphonic Music Pitch Estimation, M.S. Thesis, Department of Automation, Tsinghua University, July 2008. (in Chinese). Advisor: Changshui Zhang.