Adversarial Training for Speech Super-Resolution

S. Emre Eskimez, Kazuhito Koishida, and Zhiyao Duan


This is a collaboration with Microsoft Research. This project is partially supported by the National Science Foundation under grant No. 1617107, titled "III: Small: Collaborative Research: Algorithms for Query by Example of Audio Databases."
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of these funding agencies.

Paper link

Overview

Speech super-resolution, or speech bandwidth expansion, aims to upsample a given speech signal by generating the missing high-frequency content. In this paper, we propose a deep neural network approach that exploits adversarial training ideas shown to be effective in image super-resolution. Specifically, our proposed network follows the Generative Adversarial Network (GAN) setup, where the generator is a convolutional autoencoder with 1D convolution kernels that generates high-frequency log-power spectra from the low-frequency log-power spectra of the input speech. We train with both a reconstruction loss and an adversarial loss, and we employ a recent regularization method that penalizes the gradient norm of the discriminator to stabilize training.

We compare the proposed approach with two state-of-the-art neural network baselines and evaluate all methods with both objective speech quality measures and subjective perceptual and intelligibility tests. Results show that the proposed method outperforms both baselines in both objective and subjective evaluations. To gain insight into the network architecture, we analyze key parameters of the proposed network, including the number of layers, the number of convolution kernels, and the relative weight of the reconstruction and adversarial losses. In addition, we analyze the computational complexity of our method and the baselines and discuss ways to estimate the phase. We further develop a noise-resilient version of the proposed approach by training the network with noisy speech inputs; objective evaluation validates the noise-resilient property on unseen noise types.
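As a minimal illustration of the input/output representation described above (not the authors' exact feature pipeline — frame length, hop size, and the band split here are assumptions), the following NumPy sketch computes log-power spectra and splits them into the low-frequency band the generator consumes and the high-frequency band it is trained to predict, for a 2x setting:

```python
import numpy as np

def log_power_spectra(signal, frame_len=512, hop=256):
    """Frame and window the signal, then compute per-frame log-power spectra."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)           # (n_frames, frame_len // 2 + 1)
    return np.log(np.abs(spec) ** 2 + 1e-10)     # log-power, floored for stability

# Toy wideband signal; in practice this would be recorded speech.
x = np.sin(2 * np.pi * 0.05 * np.arange(4096))
lps = log_power_spectra(x)

# For 2x super-resolution, the lower half of the frequency bins is the
# generator's input and the upper half is its training target.
n_bins = lps.shape[1]
low_band = lps[:, : n_bins // 2]     # observed low-frequency content
high_band = lps[:, n_bins // 2 :]    # missing high-frequency content to predict
```

Working on log-power spectra keeps the target real-valued and compressed in dynamic range, which is why a separate phase estimation step is needed at synthesis time.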

SSR-GAN Samples


2x Super-Resolution

Low Resolution
SSR-GAN Flipped-Phase
SSR-GAN Griffin-Lim
High Resolution
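The "Flipped-Phase" samples above reuse the phase of the observed low band for the generated high band rather than estimating it. A minimal sketch of that idea, under the assumption that the high-band phase is a mirrored, sign-flipped copy of the low-band phase (the exact variant used in the paper may differ, and the bin counts here are toy values):

```python
import numpy as np

def flipped_phase_spectrum(low_mag, high_mag, low_phase):
    """Combine predicted high-band magnitudes with a flipped copy of the
    low-band phase, returning a complex spectrum ready for the inverse FFT."""
    # Mirror the low-band phase (reversed and negated) into the high band.
    high_phase = -low_phase[::-1]
    full_mag = np.concatenate([low_mag, high_mag])
    full_phase = np.concatenate([low_phase, high_phase])
    return full_mag * np.exp(1j * full_phase)

# Toy example: 128 observed low-frequency bins, 128 high-frequency bins
# predicted by the generator (magnitudes here are placeholders).
low_mag = np.linspace(1.0, 0.1, 128)
high_mag = np.linspace(0.1, 0.01, 128)
low_phase = np.linspace(-np.pi, np.pi, 128)

spectrum = flipped_phase_spectrum(low_mag, high_mag, low_phase)
frame = np.fft.irfft(spectrum)   # time-domain frame for overlap-add synthesis
```

This approach is essentially free at inference time, at the cost of a high band whose phase is only loosely consistent with its magnitudes.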

4x Super-Resolution

Low Resolution
SSR-GAN Flipped-Phase
SSR-GAN Griffin-Lim
High Resolution
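The "Griffin-Lim" samples use iterative phase estimation instead: starting from an arbitrary phase, the standard Griffin-Lim algorithm alternates between inverse and forward STFTs while clamping the magnitudes to their target values. A self-contained NumPy sketch of that algorithm (frame length, hop size, and iteration count are assumptions, not the paper's settings):

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Windowed short-time Fourier transform, one row per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([np.fft.rfft(x[i * hop:i * hop + frame_len] * window)
                     for i in range(n_frames)])

def istft(spec, frame_len=512, hop=128):
    """Inverse STFT via windowed overlap-add with squared-window normalization."""
    window = np.hanning(frame_len)
    n_frames = spec.shape[0]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += np.fft.irfft(spec[i], n=frame_len) * window
        norm[i * hop:i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=30):
    """Iteratively estimate a phase consistent with the target magnitudes."""
    phase = np.random.uniform(-np.pi, np.pi, mag.shape)
    for _ in range(n_iter):
        x = istft(mag * np.exp(1j * phase))   # back to the time domain
        phase = np.angle(stft(x))             # keep the phase, discard magnitudes
    return istft(mag * np.exp(1j * phase))

# Toy target: magnitudes of a sinusoid; real use would take the observed
# low-band magnitudes concatenated with the generated high-band magnitudes.
target = np.abs(stft(np.sin(2 * np.pi * 0.1 * np.arange(8192))))
y = griffin_lim(target)
```

Griffin-Lim produces a more magnitude-consistent phase than the flipped-phase shortcut, but each iteration costs a full analysis/synthesis pass, which matters for low-latency use.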