Designing a Semi-Intelligent Crawler for Creating a Persian Question Answering Corpus Called Popfa
Subject Areas : Natural Language Processing
Hadi Sharifian 1 , Nasim Tohidi 2 , Chitra Dadkhah 3 *
1 - K.N. Toosi University of Technology, Faculty of Computer Engineering
2 - K.N. Toosi University of Technology, Faculty of Computer Engineering
3 - K.N. Toosi University of Technology, Faculty of Computer Engineering
Keywords: Question Answering, Persian Corpus, Religious Questions, Medical Questions, Natural Language Processing
Abstract :
Question answering is an interesting field of natural language processing in which researchers can test their ability to tackle the demanding Turing test. Computer scientists are continually developing and improving question answering systems for various natural languages, especially English. In Persian, however, it is not easy to advance these systems, mainly because the language is low-resource and lacks sufficient corpora. In this paper, therefore, a Persian question answering text corpus is created. It covers religious, midwifery, and youth marriage topics, along with the question types commonly encountered in Persian usage. The most important challenge was devising a method for gathering data in Persian and for facilitating and expanding the data gathering process. To overcome this challenge, SIC (Semi-Intelligent Crawler) is proposed: it crawls Persian websites, gathers text, and finally imports it into a database. The outcome of this research is a corpus called Popfa, which stands for POrsesh Pasokh (question answering) in FArsi. The corpus contains more than 53,000 standard questions and answers, and it has been evaluated with standard approaches. All the questions in Popfa are answered by specialists in two general topics: religious and medical questions. Researchers can therefore now use this corpus for research on Persian question answering.