Computational Intelligence for Speech Enhancement using Deep Neural Network
Abstract
In real-time conditions, the received speech signal contains background noise and reverberation. These disturbances degrade speech quality; it is therefore important to suppress the noise and improve the intelligibility and quality of the speech signal. Speech enhancement is the primary task in any real-time application that handles speech signals. The proposed method removes babble noise, one of the most challenging noise types, and recovers the clean speech. The corrupted speech signal is enhanced by a deep neural network-based denoising algorithm in which an ideal ratio mask is applied to the noisy speech to separate out the clean speech signal. Evaluation of the enhanced speech with performance metrics such as short-time objective intelligibility and the signal-to-noise ratio of the denoised speech shows that the proposed method improves both speech intelligibility and speech quality.
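The masking scheme summarized above can be sketched in a few lines. The following is a minimal NumPy illustration of an ideal ratio mask (IRM) applied to magnitude spectrograms, under an additive-noise assumption; it is not the paper's implementation, and the function names are chosen here purely for illustration.

```python
# Illustrative sketch of ideal-ratio-mask (IRM) denoising, assuming
# additive noise in the magnitude-spectrogram domain. Not the paper's
# implementation; function names are illustrative.
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5):
    # IRM(t, f) = (|S|^2 / (|S|^2 + |N|^2))^beta, bounded in [0, 1].
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta

def snr_db(reference, estimate):
    # Signal-to-noise ratio (dB) of an estimate against the clean reference.
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.abs(rng.normal(size=(257, 100)))        # stand-in |S(t, f)|
    noise = 0.5 * np.abs(rng.normal(size=(257, 100)))  # stand-in |N(t, f)|
    noisy = clean + noise                              # additive-noise model
    mask = ideal_ratio_mask(clean, noise)
    enhanced = noisy * mask                            # point-wise masking
    print(f"SNR before masking: {snr_db(clean, noisy):.2f} dB")
    print(f"SNR after masking:  {snr_db(clean, enhanced):.2f} dB")
```

In a supervised system the mask is predicted by the network from noisy features rather than computed from the (unknown) clean and noise spectra; the oracle mask above defines the training target.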
Keywords
Deep Neural Network, Noisy Speech, Speech Enhancement, Feature Extraction, Speech Quality, Computational Intelligence
References
1. J. Li, L. Deng, R. Haeb-Umbach, Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications, 1st ed., Academic Press, Orlando, FL, USA, 2015.
2. B. Li, Y. Tsao, K.C. Sim, An investigation of spectral restoration algorithms for deep neural networks-based noise robust speech recognition, [in:] Proceedings of Interspeech, Lyon, France, pp. 3002–3006, 2013.
3. H. Levitt, Noise reduction in hearing aids: An overview, Journal of Rehabilitation Research and Development, 38(1), 111–121, 2001.
4. A. Chern, Y.-H. Lai, Y.-P. Chang, Y. Tsao, R.Y. Chang, H.-W. Chang, A smartphone-based multi-functional hearing assistive system to facilitate speech recognition in the classroom, IEEE Access, 5: 10339–10351, 2017, doi: 10.1109/ACCESS.2017.2711489.
5. J. Li, L. Yang, J. Zhang, Y. Yan, Comparative intelligibility investigation of single-channel noise reduction algorithms for Chinese, Japanese and English, Journal of the Acoustical Society of America, 129(5): 3291–3301, 2011, doi: 10.1121/1.3571422.
6. J. Li, S. Sakamoto, S. Hongo, M. Akagi, Y. Suzuki, Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication, Speech Communication, 53(5): 677–689, 2011, doi: 10.1016/j.specom.2010.04.009.
7. Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(2): 443–445, 1985, doi: 10.1109/TASSP.1985.1164550.
8. S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2): 113–120, Apr. 1979, doi: 10.1109/TASSP.1979.1163209.
9. D. Hepsiba, J. Justin, Role of deep neural network in speech enhancement: A review, [in:] J. Hemanth, T. Silva, A. Karunananda [Eds.], Artificial Intelligence, SLAAI-ICAI 2018. Communications in Computer and Information Science, Vol. 890, Springer, Singapore, 2019, doi: 10.1007/978-981-13-9129-3_8.
10. P. Scalart, J.V. Filho, Speech enhancement based on a priori signal to noise estimation, [in:] Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 629–633, 1996, doi: 10.1109/ICASSP.1996.543199.
11. W. Xue, A.H. Moore, M. Brookes, P.A. Naylor, Modulation-domain multichannel Kalman filtering for speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10): 1833–1847, 2018, doi: 10.1109/TASLP.2018.2845665.
12. J. Du, Q. Huo, A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions, [in:] Proceedings of Interspeech, pp. 569–572, Brisbane, Australia, 2008.
13. B. Kollmeier, R. Koch, Speech enhancement based on physiological and psychoacoustical models of modulation perception and binaural interaction, The Journal of the Acoustical Society of America, 95(3): 1593–1602, 1994, doi: 10.1121/1.408546.
14. H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of America, 87(4): 1738–1752, 1990, doi: 10.1121/1.399423.
15. H. Hermansky, N. Morgan, RASTA processing of speech, IEEE Transactions on Speech and Audio Processing, 2(4): 578–589, 1994, doi: 10.1109/89.326616.
16. T. Dau, D. Püschel, A quantitative model of the “effective” signal processing in the auditory system, The Journal of the Acoustical Society of America, 99(6): 3615–3622, 1996, doi: 10.1121/1.414959.
17. K. Han, Y. Wang, D.L. Wang, W.S. Woods, I. Merks, T. Zhang, Learning spectral mapping for speech dereverberation and denoising, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(6): 982–992, 2015, doi: 10.1109/TASLP.2015.2416653.
18. S. Davis, P. Mermelstein, Comparison of parametric representations of monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4): 357–366, 1980, doi: 10.1109/TASSP.1980.1163420.
19. Y. Zhao, Z.-Q. Wang, D.L. Wang, Two-stage deep learning for noisy-reverberant speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1): 53–62, 2019, doi: 10.1109/TASLP.2018.2870725.
20. Y. Wang, A. Narayanan, D.L. Wang, On training targets for supervised speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12): 1849–1858, 2014, doi: 10.1109/TASLP.2014.2352935.
21. J. Benesty, S. Makino, J.D. Chen, Speech Enhancement, Springer, New York, NY, USA, 2005.
22. P.C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, Boca Raton, FL, USA, 2013, doi: 10.1201/9781420015836.
23. H.-Y. Lee, J.-W. Cho, M. Kim, H.-M. Park, DNN-based feature enhancement using DOA constrained ICA for robust speech recognition, IEEE Signal Processing Letters, 23(8): 1091–1095, August 2016, doi: 10.1109/LSP.2016.2583658.
24. Y. Shao, S. Srinivasan, D.L. Wang, Incorporating auditory feature uncertainties in robust speaker identification, [in:] Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2007, pp. 277–280, 2007.
25. Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, A regression approach to speech enhancement based on deep neural networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1): 7–19, 2015, doi: 10.1109/TASLP.2014.2364452.
26. IEEE, IEEE recommended practice for speech quality measurements, IEEE Transactions on Audio and Electroacoustics, 17: 225–246, 1969.
27. Y. Hu, P. Loizou, Subjective evaluation and comparison of speech enhancement algorithms, Speech Communication, 49: 588–601, 2007, https://ecs.utdallas.edu/loizou/speech/noizeus/.
28. K. Tan, D. Wang, Towards model compression for deep learning based speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 1785–1794, 2021, doi: 10.1109/TASLP.2021.3082282.
29. F. Bao, W. Abdulla, A new ratio mask representation for CASA-based speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1): 7–19, 2018, doi: 10.1109/TASLP.2018.2868407.
30. Y. Liu, H. Zhang, X. Zhang, L. Yang, Supervised speech enhancement with real spectrum approximation, [in:] Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5746–5750, 2019, doi: 10.1109/ICASSP.2019.8683691.
31. C. Valentini-Botinhao, J. Yamagishi, Speech enhancement of noisy and reverberant speech for text-to-speech, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(8): 1420–1433, 2018, doi: 10.1109/TASLP.2018.2828980.
32. J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, H.-M. Wang, Audio-visual speech enhancement using multimodal deep convolutional neural networks, IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2): 117–128, 2018, doi: 10.1109/TETCI.2017.2784878.
33. P. Pujol, S. Pol, C. Nadeu, A. Hagen, H. Bourlard, Comparison and combination of features in a hybrid HMM/MLP and a HMM/GMM speech recognition system, IEEE Transactions on Speech and Audio Processing, 13(1): 14–22, 2005, doi: 10.1109/TSA.2004.834466.
34. Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, Cross-language transfer learning for deep neural network-based speech enhancement, [in:] Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, pp. 336–340, 2014, doi: 10.1109/ISCSLP.2014.6936608.
35. Z.-Q. Wang, D.L. Wang, Robust speech recognition from ratio masks, [in:] Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5720–5724, 2016, doi: 10.1109/ICASSP.2016.7472773.
36. W. Yuan, A time–frequency smoothing neural network for speech enhancement, Speech Communications, 124: 75–84, 2020, doi: 10.1016/j.specom.2020.09.002.
37. T. Lavanya, T. Nagarajan, P. Vijayalakshmi, Multi-level single channel speech enhancement using a unified framework for estimating magnitude and phase spectra, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28: 1315–1327, 2020, doi: 10.1109/TASLP.2020.2986877.
38. K. Sekiguchi, Y. Bando, A.A. Nugraha, K. Yoshii, T. Kawahara, Semi-supervised multichannel speech enhancement with a deep speech prior, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12): 2197–2212, 2019, doi: 10.1109/TASLP.2019.2944348.
39. F.B. Gelderblom, T.V. Tronstad, E.M. Viggen, Subjective evaluation of a noisereduced training target for deep neural network-based speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(3): 583–594, 2020, doi: 10.1109/TASLP.2018.2882738.
40. T. Kawase, M. Okamoto, T. Fukutomi, Y. Takahashi, Speech enhancement parameter adjustment to maximize accuracy of automatic speech recognition, IEEE Transactions on Consumer Electronics, 66(2): 125–133, 2020, doi: 10.1109/TCE.2020.2986003.
41. D. Baby, T. Viratanen, J.F. Gemmeke, H. van Hamme, Coupled dictionaries for exemplarbased speech enhancement and automatic speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(11): 1788–1799, 2015, doi: 10.1109/TASLP.2015.2450491.
42. P.-S. Huang, M. Kim, M. Hasegawa-Johnson, P. Smaragdis, Joint optimization of masks and deep recurrent neural networks for monaural source separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12): 2136–2147, 2015, doi: 10.1109/TASLP.2015.2468583.
43. L. Sun, J. Du, L.-R. Dai, C.-H. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement, [in:] 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA), pp. 136–140, 2017, doi: 10.1109/HSCMA.2017.7895577.
44. W.-C. Huang, H.-T. Hwang, Y.-H. Peng, Y. Tsao, H.-M. Wang, Voice conversion based on cross-domain features using variational auto encoders, [in:] 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 51–55, 2018, doi: 10.1109/ISCSLP.2018.8706604.
45. W. Han, C. Wu, X. Zhang, Q. Zhang, S. Bai, Joint optimization of modified ideal ratio mask and deep neural networks for monaural speech enhancement, [in:] Proceedings of 2017 9th International Conference on Communication Software and Networks (ICCSN), pp. 1070–1074, 2017, doi: 10.1109/ICCSN.2017.8230275.
46. D.S. Williamson, Y. Wang, D.L. Wang, Complex ratio masking for monaural speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(3): 483–492, 2016, doi: 10.1109/TASLP.2015.2512042.
47. J. Ming, D. Crookes, Speech enhancement based on full-sentence correlation and clean speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(3): 531–543, 2017, doi: 10.1109/TASLP.2017.2651406.
48. R. Jaiswal, D. Romero, Implicit Wiener filtering for speech enhancement in non-stationary noise, [in:] 2021 11th International Conference on Information Science and Technology (ICIST), pp. 39–47, 2021, doi: 10.1109/ICIST52614.2021.9440639.