This is the companion webpage for the paper:
Zhiyao Duan, Gautham J. Mysore and Paris Smaragdis, "Speech enhancement by online non-negative spectrogram decomposition in non-stationary noise environments," in Proc. Interspeech, 2012. <pdf> <slides>
As described in the paper, we carried out experiments using the NOIZEUS speech dataset [1]. We collected noise files by recording them ourselves or downloading them from the internet.
For the proposed method, we varied the Dirichlet prior ramp length tau from 0 (no prior at all) to 20 (a prior throughout all iterations). The comparison tables below show only the results for tau=10; the last table shows the effect of other values of tau.
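We compare the proposed method with four categories of conventional speech enhancement methods, all of which are online algorithms. We use P.C. Loizou's implementations [1]: spectral subtraction (MB, multi-band spectral subtraction [2]), Wiener filtering (Wiener-as, based on a priori SNR estimation [3]), statistical-model-based estimation (log-MMSE [4]), and the subspace approach (KLT [5]).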
We also compare with an offline spectrogram decomposition method: probabilistic latent component analysis (PLCA) [6].
All of the above methods train their noise models on the same noise-only excerpts, which do not appear in the test mixtures.
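The test mixtures in the tables below are listed by input SNR, from -10 dB to 10 dB. As a minimal sketch of how a mixture at a given SNR can be built, the following scales a noise signal so that the speech-to-noise power ratio over the whole signal matches the target. The function names `mix_at_snr` and `measure_snr_db` and the toy sinusoidal signals are hypothetical, and the page does not say how the SNR of the actual mixtures was measured (for example, over the whole file or over active speech only), so the whole-signal definition here is an assumption. Plain lists keep the sketch dependency-free; an array library would be the usual choice for real audio.

```python
import math


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def measure_snr_db(speech, noise):
    """Whole-signal SNR in dB between a speech and a noise signal."""
    return 10.0 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech/noise power ratio equals `snr_db`, then add.

    Returns the mixture and the scaled noise (assumption: equal-length inputs).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Noise power needed for the target SNR: P_speech / 10^(snr_db / 10).
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Toy stand-ins for a clean utterance and a noise recording.
speech = [math.sin(0.05 * t) for t in range(8000)]
noise = [0.3 * math.sin(1.3 * t + 0.7) for t in range(8000)]

for level in (-10, -5, 0, 5, 10):
    mixture, scaled = mix_at_snr(speech, noise, level)
    print(level, round(measure_snr_db(speech, scaled), 6))
```

Scaling the noise rather than the speech keeps the clean reference unchanged across SNR levels, which is convenient when the same clean file is used to score every mixture.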
We evaluate speech enhancement results using two metrics: Perceptual Evaluation of Speech Quality (PESQ) [7] and source-to-distortion ratio (SDR) [8].

Noise: computer keyboard (training excerpt). Each entry is PESQ/SDR. Proposed, MB, Wiener-as, log-MMSE, and KLT are online methods; PLCA is an offline method.

SNR (dB) | Noisy speech | Clean speech | Proposed (tau=10) | MB | Wiener-as | log-MMSE | KLT | PLCA |
---|---|---|---|---|---|---|---|---|
-10 | WAV | WAV | WAV (1.65/1.13) | WAV (1.17/-5.52) | WAV (0.80/-7.42) | WAV (0.88/-7.14) | WAV (0.75/-7.31) | WAV (1.40/0.67) |
-5 | WAV | WAV | WAV (1.92/5.35) | WAV (1.42/-1.34) | WAV (1.06/-3.01) | WAV (1.14/-2.71) | WAV (0.96/-2.93) | WAV (2.05/5.03) |
0 | WAV | WAV | WAV (2.14/9.62) | WAV (1.41/1.82) | WAV (1.03/0.27) | WAV (1.13/0.70) | WAV (0.93/0.18) | WAV (2.20/9.52) |
5 | WAV | WAV | WAV (2.39/11.38) | WAV (1.77/6.40) | WAV (1.41/5.24) | WAV (1.57/5.67) | WAV (1.25/5.11) | WAV (2.50/9.85) |
10 | WAV | WAV | WAV (2.70/12.37) | WAV (2.14/10.93) | WAV (1.83/10.26) | WAV (1.96/10.77) | WAV (1.72/10.09) | WAV (3.01/10.94) |

Noise: casino (training excerpt). Each entry is PESQ/SDR. Proposed, MB, Wiener-as, log-MMSE, and KLT are online methods; PLCA is an offline method.

SNR (dB) | Noisy speech | Clean speech | Proposed (tau=10) | MB | Wiener-as | log-MMSE | KLT | PLCA |
---|---|---|---|---|---|---|---|---|
-10 | WAV | WAV | WAV (1.23/-6.79) | WAV (0.80/-9.95) | WAV (0.74/-10.10) | WAV (0.78/-10.08) | WAV (0.76/-10.14) | WAV (2.09/-8.37) |
-5 | WAV | WAV | WAV (1.62/0.33) | WAV (1.39/-5.84) | WAV (1.40/-4.89) | WAV (1.47/-4.28) | WAV (1.40/-4.07) | WAV (1.61/-1.12) |
0 | WAV | WAV | WAV (1.51/3.81) | WAV (1.48/0.04) | WAV (1.50/0.77) | WAV (1.54/1.73) | WAV (1.39/1.56) | WAV (1.49/3.80) |
5 | WAV | WAV | WAV (1.88/6.33) | WAV (1.80/5.60) | WAV (1.84/6.53) | WAV (1.89/6.98) | WAV (1.64/6.84) | WAV (1.88/6.27) |
10 | WAV | WAV | WAV (1.88/5.60) | WAV (2.38/11.95) | WAV (2.23/12.74) | WAV (2.43/13.10) | WAV (2.44/13.85) | WAV (2.06/8.49) |

We now show the tradeoff between noise reduction and speech distortion that is controlled by the Dirichlet prior ramp length parameter tau.
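We use three measures from [8]: source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifacts ratio (SAR). A higher SIR indicates less residual noise, a higher SAR indicates fewer processing artifacts, and SDR reflects both.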
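To make the three measures concrete, here is a simplified sketch of the decomposition defined in [8]: the enhanced signal is split into a target component, an interference component, and an artifact component, and each measure is an energy ratio in dB. This sketch assumes gain-only allowed distortions (a single scalar per source), whereas [8] also defines variants that allow filtering, so its numbers will generally differ from those in the tables. The function name `sdr_sir_sar` and the toy signals are hypothetical, and the page does not state which implementation was used for the reported results.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def ratio_db(num, den):
    """Energy ratio of two signals in dB."""
    return 10.0 * math.log10(dot(num, num) / dot(den, den))


def sdr_sir_sar(estimate, speech, noise):
    """Gain-only SDR, SIR, SAR of `estimate` given true speech and noise."""
    # Target: projection of the estimate onto the clean speech alone.
    a = dot(estimate, speech) / dot(speech, speech)
    s_target = [a * s for s in speech]

    # Projection onto span{speech, noise}: solve the 2x2 Gram system
    # with Cramer's rule (requires speech and noise not to be collinear).
    g_ss, g_sn, g_nn = dot(speech, speech), dot(speech, noise), dot(noise, noise)
    b_s, b_n = dot(estimate, speech), dot(estimate, noise)
    det = g_ss * g_nn - g_sn * g_sn
    c_s = (b_s * g_nn - b_n * g_sn) / det
    c_n = (b_n * g_ss - b_s * g_sn) / det
    projected = [c_s * s + c_n * n for s, n in zip(speech, noise)]

    # Interference: explained by the sources but not by the target.
    e_interf = [p - t for p, t in zip(projected, s_target)]
    # Artifacts: not explained by any source.
    e_artif = [e - p for e, p in zip(estimate, projected)]
    e_total = [i + r for i, r in zip(e_interf, e_artif)]

    sdr = ratio_db(s_target, e_total)
    sir = ratio_db(s_target, e_interf)
    sar = ratio_db(projected, e_artif)  # projected = s_target + e_interf
    return sdr, sir, sar


# Toy example: an "enhanced" signal with some residual noise and artifacts.
toy_speech = [math.sin(0.05 * t) for t in range(2000)]
toy_noise = [math.sin(1.3 * t + 0.7) for t in range(2000)]
toy_artifact = [math.cos(0.31 * t) for t in range(2000)]
toy_estimate = [s + 0.1 * n + 0.05 * r
                for s, n, r in zip(toy_speech, toy_noise, toy_artifact)]
print([round(v, 2) for v in sdr_sir_sar(toy_estimate, toy_speech, toy_noise)])
```

Because the SDR denominator contains both error terms, SDR can never exceed SIR, which is why stronger noise suppression (higher SIR) does not necessarily raise SDR if it also introduces artifacts (lower SAR).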

Noise: birds (training excerpt). Each entry is SDR/SIR/SAR for the proposed method with the given prior ramp length tau.

SNR (dB) | Noisy speech | Clean speech | tau=0 | tau=1 | tau=5 | tau=10 | tau=15 | tau=20 |
---|---|---|---|---|---|---|---|---|
-10 | WAV | WAV | WAV (-2.55/1.79/1.65) | WAV (-1.68/5.21/0.46) | WAV (0.31/9.13/1.43) | WAV (0.48/10.12/1.39) | WAV (1.14/12.06/1.77) | WAV (1.65/13.06/2.18) |
-5 | WAV | WAV | WAV (1.31/4.86/5.08) | WAV (5.27/13.21/6.23) | WAV (6.52/15.55/7.22) | WAV (5.20/15.97/5.69) | WAV (5.35/17.57/5.70) | WAV (5.08/18.13/5.37) |
0 | WAV | WAV | WAV (7.72/12.64/9.64) | WAV (9.83/21.83/10.14) | WAV (9.26/22.73/9.48) | WAV (8.51/22.34/8.72) | WAV (9.87/23.29/10.09) | WAV (9.69/23.46/9.89) |
5 | WAV | WAV | WAV (10.69/15.23/12.71) | WAV (8.89/18.78/9.42) | WAV (9.00/24.54/9.14) | WAV (8.53/24.13/8.67) | WAV (7.93/24.88/8.03) | WAV (8.82/25.27/8.93) |
10 | WAV | WAV | WAV (15.14/20.57/16.65) | WAV (14.15/30.17/14.26) | WAV (13.52/31.26/13.59) | WAV (13.45/31.01/13.53) | WAV (12.58/32.61/12.62) | WAV (12.84/31.66/12.90) |
[1] Loizou, P.C. Speech Enhancement: Theory and Practice, Taylor and Francis, 2007.
[2] Kamath, S. and Loizou, P.C., "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," in Student Research Abstracts of Proc. ICASSP, 2002.
[3] Scalart, P. and Filho, J., "Speech enhancement based on a priori signal to noise estimation," in Proc. ICASSP, pp. 629--632, 1996.
[4] Ephraim, Y. and Malah, D., "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust. Speech Signal Process., 33:443--445, 1985.
[5] Hu, Y. and Loizou, P.C., "A generalized subspace approach for enhancing speech corrupted by colored noise," IEEE Trans. Speech Audio Process., pp. 334--341, 2003.
[6] Smaragdis, P., Raj, B. and Shashanka, M., "A probabilistic latent variable model for acoustic modeling," in Workshop of Advances in Models for Acoustic Processing, NIPS, 2006.
[7] Rix, A., Beerends, J., Hollier, M. and Hekstra, A., "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, pp. 749--752, 2001.
[8] Vincent, E., Fevotte, C. and Gribonval, R., "Performance measurement in blind audio source separation," IEEE Trans. on Audio Speech Lang. Process., 14(4):1462--1469, 2006.
For any questions or comments, please contact us at zhiyaoduan00 AT gmail DOT com.