Enhancing Speaker Identification System Based on MFCC Feature Extraction and Gated Recurrent Unit Network

Sharif-Noughabi, M.; Razavi, Seyyed Mohammad; Taghipour-gorjikolaie, Mehran

doi:10.61186/jist.48366.12.48.254

Manuscript ID: 2024102548366 Pages: 254-263

Article Type: Original Research

Subject Areas : Speech Processing

M. Sharif-Noughabi ¹, Seyyed Mohammad Razavi ²*, Mehran Taghipour-gorjikolaie ³

1 - Department of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
2 - Department of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
3 - Department of Electrical and Computer Engineering, University of Birjand, Birjand, Iran

Received: 2024-10-25 Accepted : 2024-12-28 Published : 2025-03-05

Keywords: Speaker Identification, Gated Recurrent Unit Network (GRU), Convolutional Neural Network (CNN), MFCC

Abstract :

Identifying people from their speech signals is one form of biometric recognition. A speaker identification (SI) system can be implemented in many ways, and many researchers have recently focused on deep neural networks. Recurrent neural networks are one class of deep network, in which memory and recurrence are handled by layers such as Long Short-Term Memory (LSTM) or the Gated Recurrent Unit (GRU). In this paper, we propose a new classifier structure for the SI system that combines a convolutional neural network with two GRU layers (CNN+GRU) and significantly improves the recognition rate. MFCC coefficients, extracted as cell arrays from each speech period of duration Pt, serve as the sequence vectors fed to the proposed classifier. Experiments on two databases, LibriSpeech and VoxCeleb1, show that the SI system outperforms baseline methods. The system performs better as Pt grows: on the LibriSpeech database with 251 speakers, recognition accuracy is 92.94% for Pt = 1 s and rises to 99.92% for Pt = 9 s. The proposed CNN+GRU classifier shows almost no sensitivity to speaker gender.
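To make the role of the period length Pt concrete, the following minimal Python sketch shows how a waveform might be cut into Pt-second periods and how many analysis frames, and hence MFCC vectors, each period would contribute to the input sequence. It is an illustration only, not the authors' implementation: the function names, the 16 kHz sampling rate, the 25 ms window, the 10 ms hop, and the choice to drop a trailing partial period are all assumptions not stated in the abstract.

```python
from typing import List, Sequence


def split_into_periods(signal: Sequence[float], sample_rate: int,
                       pt_seconds: float) -> List[Sequence[float]]:
    """Cut a waveform into consecutive, non-overlapping periods of pt_seconds.

    Assumption: a trailing remainder shorter than one full period is dropped.
    """
    period_len = int(round(pt_seconds * sample_rate))
    if period_len <= 0:
        raise ValueError("pt_seconds * sample_rate must be at least one sample")
    n_periods = len(signal) // period_len
    return [signal[i * period_len:(i + 1) * period_len] for i in range(n_periods)]


def frames_per_period(period_len: int, win_len: int, hop_len: int) -> int:
    """Count the full analysis frames in one period.

    Each frame would yield one MFCC vector, so this is the sequence length
    that a recurrent classifier would see for that period.
    """
    if period_len < win_len:
        return 0
    return 1 + (period_len - win_len) // hop_len


# Hypothetical settings, chosen only for illustration.
SAMPLE_RATE = 16_000                    # assumed sampling rate in Hz
WIN_LEN = SAMPLE_RATE * 25 // 1000      # assumed 25 ms window (400 samples)
HOP_LEN = SAMPLE_RATE * 10 // 1000      # assumed 10 ms hop (160 samples)

if __name__ == "__main__":
    for pt in (1, 9):
        n_frames = frames_per_period(pt * SAMPLE_RATE, WIN_LEN, HOP_LEN)
        print(f"Pt = {pt} s gives {n_frames} frames per period")
```

Under these assumed settings a longer Pt gives the classifier a proportionally longer MFCC sequence per decision, which is consistent with the reported rise in accuracy as Pt increases.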
