Enhancing Speaker Identification System Based on MFCC Feature Extraction and Gated Recurrent Unit Network
Subject Areas: Speech Processing
M. Sharif-Noughabi 1, S. M. Razavi 2 *, M. Taghipour-gorjikolaie 3
1 - Birjand University
2 - Birjand University
3 - Birjand University
Keywords: speaker identification, gated recurrent unit network (GRU), convolutional neural network (CNN), MFCC
Abstract:
Identifying people from their speech signals is one of the established biometric methods. A speaker identification (SI) system can be implemented in many different ways, and many researchers have recently focused on deep neural networks. One type of deep neural network is the recurrent neural network, in which the memory and recurrent parts are handled by layers such as Long Short-Term Memory (LSTM) or the Gated Recurrent Unit (GRU). In this paper, we propose a new classifier structure for the speaker identification system that combines a convolutional neural network with two GRU layers (CNN+GRU) and significantly improves the recognition rate. MFCC coefficients, extracted as cell arrays from each speech period of duration Pt, serve as the sequence vectors at the input of the proposed classifier. Experiments on two databases, LibriSpeech and VoxCeleb1, show that the SI system performs better than the basic methods. The system also performs better as Pt grows: on the LibriSpeech database with 251 speakers, recognition accuracy is 92.94% for Pt=1s and rises to 99.92% for Pt=9s. The proposed CNN+GRU classifier shows almost no sensitivity to speaker gender.
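To make the role of Pt concrete, the following is a minimal sketch of how an utterance might be cut into periods of Pt seconds and how many MFCC vectors each period would then contribute to the input sequence. The abstract does not give the framing settings, so the 25 ms window, the 10 ms hop, and the function names split_into_periods and frames_per_period are assumptions made for illustration only; the 16 kHz rate is the LibriSpeech sampling rate.

```python
# Sketch only: window and hop values are illustrative assumptions,
# not settings reported in the abstract.

def split_into_periods(num_samples, sample_rate, pt_seconds):
    """Return (start, end) sample indices of consecutive, non-overlapping
    periods of pt_seconds each; a trailing partial period is dropped."""
    period_len = int(round(pt_seconds * sample_rate))
    if period_len <= 0:
        raise ValueError("pt_seconds * sample_rate must be at least one sample")
    count = num_samples // period_len
    return [(i * period_len, (i + 1) * period_len) for i in range(count)]


def frames_per_period(sample_rate, pt_seconds, win_seconds=0.025, hop_seconds=0.010):
    """Number of full analysis frames (one MFCC vector each) in one period,
    assuming no padding at the period edges."""
    period_len = int(round(pt_seconds * sample_rate))
    win_len = int(round(win_seconds * sample_rate))
    hop_len = int(round(hop_seconds * sample_rate))
    if period_len < win_len:
        return 0
    return 1 + (period_len - win_len) // hop_len


if __name__ == "__main__":
    # A hypothetical 20-second utterance sampled at 16 kHz.
    print(split_into_periods(20 * 16000, 16000, 9))
    for pt in (1, 9):
        print(pt, frames_per_period(16000, pt))
```

Under these assumed settings, a period with Pt=1s yields 98 frames and a period with Pt=9s yields 898 frames, so a longer Pt hands the recurrent layers a much longer sequence per decision. That is consistent with, though not a proof of, the accuracy trend reported above.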