Long-Term Spectral Pseudo-Entropy (LTSPE): A New Robust Feature for Speech Activity Detection
Subject Areas: Speech Processing

Mohammad Rasoul Kahrizi 1 *, Seyed Jahanshah Kabudian 2
1 - Razi University
2 - Razi University
Keywords: Audio Signal Processing, Speech Processing, Speech Activity Detection (SAD), Speech Recognition, Voice Activity Detection (VAD), Robust Feature, LTSPE
Abstract:
Speech detection systems are audio classifiers used to recognize, detect, or mark the parts of an audio signal that contain human speech. Applications include speech enhancement, noise cancellation, speaker identification, and reducing the size of audio signals for communication and storage, among many others. Here, a novel robust feature named Long-Term Spectral Pseudo-Entropy (LTSPE) is proposed for speech detection; its purpose is to improve performance when combined with other features while maintaining acceptable accuracy on its own. To this end, the proposed method is compared with other new and well-known methods in this field under two conditions: with a well-known speech enhancement algorithm applied to improve the quality of the audio signals, and without speech enhancement. The experiments use the MUSAN dataset, which contains a large number of audio signals in the form of music, speech, and noise, together with several well-known machine learning methods. Accuracy and error are measured by the F-Score and the Equal Error Rate (EER), respectively. Experimental results on the MUSAN dataset show that combining the proposed LTSPE feature with other features improves detector performance; moreover, LTSPE achieves higher accuracy and lower error than similar features.
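To make the idea of a long-term spectral entropy-style feature concrete, the following is a minimal, illustrative sketch in Python/NumPy. It is not the paper's exact LTSPE definition: the function name ltspe_like_feature, the frame sizes, the plain power-spectrum normalization, and the moving-average long-term aggregation over a fixed context are all assumptions made here for illustration.

```python
import numpy as np

def ltspe_like_feature(signal, fs, frame_len=0.025, frame_shift=0.010, context=30):
    """Illustrative long-term spectral entropy-style feature.

    NOTE: a hedged sketch, not the LTSPE definition from the paper; frame
    sizes, the entropy normalization, and the long-term aggregation (a
    moving average over `context` frames) are illustrative assumptions.
    """
    n = int(frame_len * fs)
    hop = int(frame_shift * fs)
    window = np.hamming(n)

    # Short-time power spectra of overlapping, windowed frames.
    frames = [signal[i:i + n] * window
              for i in range(0, len(signal) - n + 1, hop)]
    power = np.abs(np.fft.rfft(np.asarray(frames), axis=1)) ** 2

    # Normalize each spectrum to a probability-like distribution and
    # compute its spectral entropy per frame.
    p = power / (power.sum(axis=1, keepdims=True) + 1e-12)
    entropy = -np.sum(p * np.log(p + 1e-12), axis=1)

    # Long-term aggregation: average entropy over a sliding context window.
    kernel = np.ones(context) / context
    return np.convolve(entropy, kernel, mode="same")

# Example: a harmonic (speech-like) segment followed by a noise segment.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220 * t)      # tonal stand-in for voiced speech
noise = np.random.randn(fs)             # noise-like segment
feat = ltspe_like_feature(np.concatenate([tone, noise]), fs)
print(feat[:5], feat[-5:])              # tonal frames yield lower entropy
```

The intuition this sketch captures is the one behind entropy-based speech detectors: voiced speech concentrates energy in a few harmonics, giving a low spectral entropy, while broadband noise spreads energy across frequency and scores high; aggregating over a long-term context smooths frame-level fluctuations before classification.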