Whispered Speech Emotion Recognition with Gender Detection using BiLSTM and DCNN
Subject Areas: Signal Processing
Aniruddha Mohanty 1*, Ravindranath C. Cherukuri 2
1 - CHRIST (DEEMED TO BE UNIVERSITY)
2 - CHRIST (DEEMED TO BE UNIVERSITY)
Keywords: Whispered Speech, Emotion Recognition, Speech Features, Data Corpus, BiLSTM, DCNN.
Abstract:
Emotions are human mental states at a particular instant in time, shaped by one's circumstances, mood, and relationships with others. Identifying emotions from whispered speech is complicated because the conversation may be confidential. The representation of speech depends on the amount of information it conveys. Whispered speech is intelligible but is a low-intensity signal that differs from normal speech, which makes emotion identification particularly difficult. Both prosodic and spectral speech features help to identify emotions. Emotions in whispered speech are identified using prosodic features such as zero-crossing rate (ZCR) and pitch, together with spectral features including the spectral centroid, chroma STFT, Mel-scale spectrogram, Mel-frequency cepstral coefficients (MFCC), shifted delta cepstra (SDC), and spectral flux. The proposed implementation has two parts: in the first, a Bidirectional Long Short-Term Memory (BiLSTM) network identifies the speaker's gender from the SDC and pitch features; in the second, a Deep Convolutional Neural Network (DCNN) identifies the emotion. The implementation is evaluated on the wTIMIT data corpus and achieves 98.54% accuracy. Because emotional expression differs across genders, this gender-aware design performs better than traditional approaches. The approach can support online learning management systems, mobile applications, detection of cyber-criminal activity, emotion detection for older people, automatic speaker identification and authentication, forensics, and surveillance.
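As a rough illustration of the feature set named in the abstract, the sketch below extracts the prosodic and spectral descriptors with librosa. The sampling rate, pitch range, and the simplified SDC parameterization (d, p, k) are assumptions for demonstration, not the exact configuration used in the paper.

```python
import numpy as np
import librosa


def shifted_delta_cepstra(cepstra, d=1, p=3, k=7):
    """Simplified shifted delta cepstra: stack k delta-cepstra blocks,
    each computed over +/- d frames and advanced by p frames
    (assumed parameterization)."""
    deltas = librosa.feature.delta(cepstra, width=2 * d + 1)
    return np.vstack([np.roll(deltas, -i * p, axis=1) for i in range(k)])


def extract_features(path, sr=16000):
    """Extract the prosodic and spectral features listed in the abstract."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    stft_mag = np.abs(librosa.stft(y))
    return {
        # prosodic features
        "zcr": librosa.feature.zero_crossing_rate(y),
        "pitch": librosa.yin(y, fmin=50, fmax=500, sr=sr),  # YIN pitch estimate
        # spectral features
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr),
        "chroma_stft": librosa.feature.chroma_stft(y=y, sr=sr),
        "mel_spectrogram": librosa.feature.melspectrogram(y=y, sr=sr),
        "mfcc": mfcc,
        "sdc": shifted_delta_cepstra(mfcc),
        # spectral flux: frame-to-frame change of the magnitude spectrum
        "spectral_flux": np.sqrt(np.sum(np.diff(stft_mag, axis=1) ** 2, axis=0)),
    }
```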
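The two-stage pipeline (BiLSTM for gender, DCNN for emotion) could be prototyped in Keras as sketched below. The layer counts, unit sizes, and input shapes are illustrative assumptions rather than the architecture reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_gender_bilstm(n_frames, n_features):
    """Stage 1: BiLSTM over per-frame SDC + pitch features -> gender (binary)."""
    return models.Sequential([
        layers.Input(shape=(n_frames, n_features)),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),
    ])


def build_emotion_dcnn(n_mels, n_frames, n_emotions):
    """Stage 2: DCNN over a Mel-spectrogram 'image' -> emotion classes."""
    return models.Sequential([
        layers.Input(shape=(n_mels, n_frames, 1)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(n_emotions, activation="softmax"),
    ])


# Example wiring (feature dimensions are assumed: 13 MFCCs * 7 SDC blocks + pitch = 92)
gender_model = build_gender_bilstm(n_frames=300, n_features=92)
gender_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

emotion_model = build_emotion_dcnn(n_mels=128, n_frames=300, n_emotions=7)
emotion_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

In this sketch the gender prediction from stage one is available as an auxiliary input or routing signal for the emotion classifier; how the two stages are combined in practice follows the paper's design rather than this illustration.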