A Corpus for Evaluation of Cross Language Text Re-use Detection Systems
Subject Areas: Pattern Recognition
Salar Mohtaj 1, Habibollah Asghari 2 *
1 - Faculty IV, Technische Universität Berlin, Germany
2 - ICT Research Institute, ACECR, Tehran, Iran
Keywords: Cross language plagiarism detection, Corpus, Text re-use detection, Obfuscation
Abstract:
In recent years, the availability of documents on the Internet, together with automatic translation systems, has increased plagiarism, especially across languages. Cross-lingual plagiarism occurs when the source text is in one language and the plagiarized or re-used text is in another. Various methods for automatic cross-language text re-use detection have been developed to assist human experts in analyzing documents for plagiarism cases. Evaluating the performance of these systems and algorithms requires standard evaluation resources. In constructing cross-lingual plagiarism detection corpora, most earlier studies have concentrated on English and other European language pairs and have paid less attention to low-resource languages. In this paper, we investigate a method for constructing an English-Persian cross-language plagiarism detection corpus from parallel bilingual sentences, which are used to artificially generate passages with various degrees of paraphrasing. The plagiarized passages are inserted into topically related English and Persian Wikipedia articles to produce more realistic text documents. The proposed approach can be applied to other less-resourced languages. The compiled corpus was evaluated with both intrinsic and extrinsic methods, and the evaluation indicates that it is suitable for inclusion in an evaluation framework for assessing cross-language plagiarism detection systems. The corpus is free and publicly available for research purposes.
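The construction idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `PlagiarismCase` record, the choice of paragraph breaks as insertion points, and the use of English as the source side are all assumptions made for illustration, and the paraphrasing (obfuscation) step is omitted entirely. The sketch only shows how a run of aligned sentence pairs could be turned into a passage pair, inserted into two host documents, and annotated with character offsets.

```python
import random
from dataclasses import dataclass


@dataclass
class PlagiarismCase:
    """Hypothetical annotation record: where each passage sits in its host document."""
    source_offset: int
    source_length: int
    suspicious_offset: int
    suspicious_length: int


def insert_at_paragraph_break(document: str, passage: str, rng: random.Random) -> tuple[str, int]:
    """Insert `passage` at a random paragraph boundary; return the new text and the passage offset."""
    paragraphs = document.split("\n\n")
    position = rng.randint(0, len(paragraphs))  # randint is inclusive on both ends
    paragraphs.insert(position, passage)
    # Every paragraph before the passage contributes its length plus the 2-character separator.
    offset = sum(len(p) + 2 for p in paragraphs[:position])
    return "\n\n".join(paragraphs), offset


def build_case(sentence_pairs, english_doc, persian_doc, n_sentences, rng):
    """Pick `n_sentences` consecutive aligned pairs and plant them in both host documents.

    Assumption: English is treated as the source side and Persian as the suspicious side.
    """
    start = rng.randint(0, len(sentence_pairs) - n_sentences)
    chosen = sentence_pairs[start:start + n_sentences]
    english_passage = " ".join(en for en, _ in chosen)
    persian_passage = " ".join(fa for _, fa in chosen)
    new_english, en_offset = insert_at_paragraph_break(english_doc, english_passage, rng)
    new_persian, fa_offset = insert_at_paragraph_break(persian_doc, persian_passage, rng)
    case = PlagiarismCase(en_offset, len(english_passage), fa_offset, len(persian_passage))
    return new_english, new_persian, case


# Usage with placeholder data (the bracketed strings stand in for real Persian sentences).
pairs = [
    ("Sentence one.", "[fa sentence 1]"),
    ("Sentence two.", "[fa sentence 2]"),
    ("Sentence three.", "[fa sentence 3]"),
]
rng = random.Random(0)
en_doc, fa_doc, case = build_case(pairs, "Intro.\n\nBody.", "[fa intro]\n\n[fa body]", 2, rng)
planted = en_doc[case.source_offset:case.source_offset + case.source_length]
print(planted)
```

Recording offsets at insertion time means the ground-truth annotations never have to be recovered by searching the documents afterwards, which is the property a detection benchmark needs.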