List of Articles by Subject: Natural Language Processing


    • Open Access Article

      1 - A Persian Fuzzy Plagiarism Detection Approach
      Shima Rakian, Faramarz Safi Esfahani, Hamid Rastegari
      Plagiarism is a common problem in all organizations that deal with electronic content. At present, plagiarism detection tools detect only word-for-word or exact-copy phrases, and paraphrasing often goes undetected. One of the successful and applicable methods for paraphrase detection is the fuzzy method. In this study, a new fuzzy approach, called Persian Fuzzy Plagiarism Detection (PFPD), is proposed to detect external plagiarism in Persian texts. The proposed approach compares paraphrased texts with the aim of recognizing text similarities. External plagiarism detection evaluates a query document against a document collection. To avoid unnecessary comparisons, the tool compares suspicious documents hierarchically at different levels. The method adapts the fuzzy model to the Persian language and improves on previous methods for evaluating the degree of similarity between two sentences. Experiments were performed on three corpora, TMC, Irandoc, and a corpus extracted from prozhe.com, to assess the performance of the proposed method. The results show that using the proposed method for candidate document retrieval and text similarity evaluation increases precision, recall, and F-measure by 22.41, 17.61, and 18.54 percent on average, respectively, compared with one of the best previous fuzzy methods.
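
      As a rough illustration of the fuzzy-matching idea behind PFPD, the sketch below scores a sentence pair by token overlap and maps the score onto fuzzy similarity classes. It is a minimal sketch under assumed membership shapes and cut-offs, not the authors' implementation; token_overlap and the class names are invented for the example.

```python
# A minimal sketch of fuzzy sentence-to-sentence similarity, in the spirit of
# fuzzy plagiarism detection. Not the authors' PFPD implementation: the
# membership functions and cut-offs below are illustrative assumptions.

def token_overlap(s1: str, s2: str) -> float:
    """Jaccard overlap between the token sets of two sentences."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def fuzzy_similarity_label(s1: str, s2: str) -> str:
    """Map raw overlap to fuzzy similarity classes via simple memberships."""
    x = token_overlap(s1, s2)
    # Triangular-style memberships for three fuzzy sets (assumed cut-offs).
    low = max(0.0, 1 - 2 * x)
    medium = 1 - abs(2 * x - 1)
    high = max(0.0, 2 * x - 1)
    return max((low, "low"), (medium, "paraphrase-suspect"), (high, "near-copy"))[1]

print(fuzzy_similarity_label("the cat sat on the mat", "the cat sat on a mat"))
```
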
    • Open Access Article

      2 - Opinion Mining in Persian Language Using Supervised Algorithms
      Saeedeh Alimardani, Abdollah Aghaei
      The rapid growth of the Internet has resulted in a large amount of user-generated content in social media, forums, blogs, and the like. Automatic analysis is needed to extract valuable information from this content. Opinion mining is the process of analyzing opinions, sentiments, and emotions to recognize people's preferences about different subjects. One of the main tasks of opinion mining is classifying a text document into positive or negative classes. Most research in this field has applied opinion mining to English. Although Persian is spoken in several countries, few studies have addressed opinion mining in Persian. In this article, a comprehensive study of opinion mining for Persian is conducted to examine its performance under different conditions. First, we create a Persian SentiWordNet using the Persian WordNet. This lexicon is then used to weight features. The results of applying three machine learning algorithms, support vector machine (SVM), naive Bayes (NB), and logistic regression, are compared before and after weighting by the lexicon. Experiments show that SVM and logistic regression achieve better results in most cases, and that applying semantic orientation (SO) improves the accuracy of logistic regression. Increasing the number of instances and using an unbalanced dataset have a positive effect on the performance of opinion mining. Overall, this research provides better results than previous research on opinion mining in Persian.
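
      The lexicon-weighting scheme described above can be approximated as follows: a TF-IDF representation whose columns are boosted when a term appears in a sentiment lexicon, then fed to one of the compared classifiers. This is a minimal sketch assuming scikit-learn; the toy lexicon, documents, and boost values stand in for the paper's Persian SentiWordNet and data.

```python
# A minimal sketch of supervised polarity classification with lexicon-based
# feature weighting, assuming scikit-learn. The lexicon and documents here
# are placeholders, not the paper's Persian resources.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

lexicon = {"good": 2.0, "great": 2.0, "bad": 2.0, "awful": 2.0}  # boost weights

docs = ["good product great price", "awful service bad quality"]
labels = [1, 0]  # 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
# Re-weight columns whose term appears in the sentiment lexicon.
for term, boost in lexicon.items():
    if term in vec.vocabulary_:
        X[:, vec.vocabulary_[term]] *= boost

clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```
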
    • Open Access Article

      3 - A Fuzzy Approach for Reducing Ambiguity in Text Similarity Estimation (Case Study: Persian Web Contents)
      Hamid Ahangarbahan, Gholamali Montazer
      Finding similar web content has great utility in the academic community and in software systems. Many methods and metrics in the literature measure the extent of text similarity between documents, with applications especially in plagiarism detection systems. However, most of them take neither the ambiguity inherent in word or text comparison nor structural features into account. As a result, previous methods lack the accuracy to deal with vague information. Using structural features and considering the ambiguity inherent in words improves the identification of similar content. In this paper, a new method is proposed that takes both lexical and structural features into consideration when measuring text similarity. After preprocessing and removing stopwords, each text is divided into general words and domain-specific knowledge words. Two fuzzy inference systems, one lexical and one structural, are then designed to assess lexical and structural text similarity. The proposed method was evaluated on Persian paper abstracts from the International Conference on e-Learning and e-Teaching (ICELET) corpus. The results show that the proposed method achieves a precision of 75% and detects 81% of the similar cases.
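
      To make the two-system fuzzy design concrete, the toy below runs one Mamdani-style inference step that combines a lexical and a structural similarity score. The membership shapes, rules, and weights are assumptions for illustration, not the paper's tuned systems.

```python
# A toy Mamdani-style fuzzy inference step combining a lexical and a
# structural similarity score into one verdict. The membership shapes and
# rules are illustrative assumptions, not the paper's tuned system.
def tri(x, a, b, c):
    """Triangular membership function on points (a, b, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def similar_degree(lexical: float, structural: float) -> float:
    """Fire two simple rules and aggregate with max (Mamdani-style)."""
    # Rule 1: IF lexical is high AND structural is high THEN similar.
    r1 = min(tri(lexical, 0.5, 1.0, 1.5), tri(structural, 0.5, 1.0, 1.5))
    # Rule 2: IF lexical is medium THEN somewhat similar (half weight).
    r2 = 0.5 * tri(lexical, 0.2, 0.5, 0.8)
    return max(r1, r2)

print(round(similar_degree(0.85, 0.7), 3))
```
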
    • Open Access Article

      4 - An Improved Sentiment Analysis Algorithm Based on Appraisal Theory and Fuzzy Logic
      Azadeh Roustakiani, Neda Abdolvand, Saeideh Rajaei Harandi
      Millions of comments and opinions are posted daily on websites such as Twitter and Facebook, where users share their views on various topics. People need to know the opinions of others in order to purchase consciously. Businesses also need customer opinions and big data analysis to keep their services customer-friendly, manage customer complaints and suggestions, increase financial benefits, and evaluate products, as well as for marketing and business development. With the development of social media, the importance of sentiment analysis has increased, and it has become a very popular topic among computer scientists and researchers because of its many uses in market and customer feedback analysis. Most sentiment analysis methods settle for splitting comments into three categories: negative, positive, and neutral. Appraisal theory, however, considers other characteristics of opinion, such as attitude, graduation, and orientation, which enables more precise analysis. Therefore, this research proposes an algorithm that increases the accuracy of sentiment analysis by combining appraisal theory and fuzzy logic. The algorithm was tested on the Stanford movie review data (25,000 film reviews) and compared with a reliable dictionary, reaching an accuracy of 95%. The results of this research can help with managing customer complaints and suggestions, marketing and business development, and product testing.
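
      The appraisal-plus-fuzzy idea can be sketched as a lexicon scorer in which graduation terms scale, and orientation terms flip, the score of the next attitude word. All word lists and weights below are invented placeholders, not the paper's resources.

```python
# A minimal sketch of appraisal-informed scoring: an attitude lexicon whose
# scores are scaled by graduation (intensifiers) and flipped by orientation
# (negation). All word lists here are illustrative assumptions.
attitude = {"excellent": 1.0, "boring": -0.8, "fine": 0.3}
graduation = {"very": 1.5, "slightly": 0.5}   # scale the next attitude word
negators = {"not", "never"}                    # flip the next attitude word

def appraisal_score(text: str) -> float:
    score, scale, flip = 0.0, 1.0, 1.0
    for tok in text.lower().split():
        if tok in graduation:
            scale = graduation[tok]
        elif tok in negators:
            flip = -1.0
        elif tok in attitude:
            score += flip * scale * attitude[tok]
            scale, flip = 1.0, 1.0  # modifiers apply to one word only
    return score

print(appraisal_score("not very boring and slightly fine"))
```
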
    • Open Access Article

      5 - Farsi Conceptual Text Summarizer: A New Model in Continuous Vector Space
      Mohammad Ebrahim Khademi, Mohammad Fakhredanesh, Seyed Mojtaba Hoseini
      Traditional methods of summarization were very costly and time-consuming, which led to the emergence of automatic methods for text summarization. Extractive summarization is an automatic method for generating a summary by identifying the most important sentences of a text. In this paper, two innovative approaches are presented for summarizing Persian texts. In these methods, using a combination of deep learning and statistical methods, we cluster the concepts of the text and, based on the importance of the concepts in each sentence, extract the sentences that carry the most conceptual weight. The first, unsupervised method achieves state-of-the-art results on the Pasokh single-document corpus without using any hand-crafted features, compared with the best supervised Persian methods. To better understand the results, we also evaluated the human summaries produced by the contributing authors of the Pasokh corpus as a measure of the success rate of the proposed methods; in terms of recall, the methods achieved favorable results. In the second method, by introducing and increasing a title-effect coefficient, average ROUGE-2 improved by 0.4% on the Pasokh single-document corpus over the first method, and average ROUGE-1 improved by 3% on the Khabir news corpus.
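
      A minimal extractive sketch in the spirit of concept clustering is shown below: sentence vectors are clustered and the sentence nearest each centroid is kept. TF-IDF stands in for the paper's continuous word embeddings, and the toy sentences are invented.

```python
# A minimal extractive-summarization sketch: cluster sentence vectors and
# keep the sentence nearest each centroid. TF-IDF stands in for the paper's
# continuous-vector-space representation; data is invented.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The flood damaged hundreds of homes in the north.",
    "Rescue teams evacuated residents from flooded areas.",
    "The stock market closed slightly higher on Monday.",
    "Officials said flood recovery will take several months.",
]
X = TfidfVectorizer().fit_transform(sentences)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

summary = []
for c in range(km.n_clusters):
    idx = np.where(km.labels_ == c)[0]
    # The member nearest the centroid represents the cluster's "concept".
    dists = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[c], axis=1)
    summary.append(sentences[idx[np.argmin(dists)]])
print(" ".join(sorted(summary, key=sentences.index)))
```
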
    • Open Access Article

      6 - SGF (Semantic Graphs Fusion): A Knowledge-based Representation of Textual Resources for Text Mining Applications
      Morteza Jaderyan, Hassan Khotanlou
      The proper representation of textual documents has been the greatest challenge in text mining applications. In this paper, a knowledge-based representation model for text analysis applications is introduced. The proposed functionality is achieved by integrating structured knowledge into the core components of the system. Semantic, lexical, syntactic, and structural features are identified by the pre-processing module. An enrichment module identifies contextually similar concepts and concept maps to improve the representation. The information content of documents and the enriched content are then fused (merged) into the graphical structure of a semantic network to form a unified and comprehensive representation of the documents. The 20Newsgroups and Reuters-21578 datasets are used for evaluation. The results suggest that the proposed method exhibits high accuracy, recall, and precision, and that even when only a small portion of the information content is available, the method performs well in standard text mining applications.
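
      The fusion step can be pictured as merging per-document concept graphs into one weighted semantic network, as in the toy below (assuming networkx). The nodes and edges are invented, and the paper's knowledge-base enrichment is not reproduced.

```python
# A toy illustration of fusing document concepts into one semantic graph,
# assuming networkx. Node and edge choices are invented for the example.
import networkx as nx

def doc_graph(concept_pairs):
    """Build a small co-occurrence graph over a document's concepts."""
    g = nx.Graph()
    for a, b in concept_pairs:
        w = g.get_edge_data(a, b, {"weight": 0})["weight"]
        g.add_edge(a, b, weight=w + 1)
    return g

g1 = doc_graph([("bank", "loan"), ("loan", "interest")])
g2 = doc_graph([("bank", "loan"), ("bank", "credit")])

# Fusion: union of nodes and edges, summing weights of edges shared by both.
fused = nx.Graph()
for g in (g1, g2):
    for a, b, d in g.edges(data=True):
        w = fused.get_edge_data(a, b, {"weight": 0})["weight"]
        fused.add_edge(a, b, weight=w + d["weight"])
print(sorted(fused.edges(data="weight")))
```
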
    • Open Access Article

      7 - DeepSumm: A Novel Deep Learning-Based Multi-Lingual Multi-Document Summarization System
      Shima Mehrabi, Seyed Abolghassem Mirroshandel, Hamidreza Ahmadifar
      With the increasing amount of textual information accessible via the Internet, a summarization system that can generate summaries of information on demand seems necessary. Summarization has long been studied by natural language processing researchers, and today, with improved processing power and computational tools, efforts to improve summarization performance continue, especially with more powerful learning algorithms such as deep learning. In this paper, a novel multi-lingual multi-document summarization system based on deep learning techniques is proposed; it is among the first Persian summarization systems to use deep learning. The proposed system ranks sentences based on a set of predefined features using a deep artificial neural network. A comprehensive study of the effect of different features was also conducted to find the best possible feature combination. The performance of the proposed system is evaluated on standard baseline datasets in Persian and English. The results demonstrate the effectiveness and success of the proposed summarization system in both languages; it can be said that the proposed method achieves state-of-the-art performance in Persian and English.
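
      A minimal sketch of the ranking idea follows: a small feed-forward network scores each sentence from a handful of features. The feature names, sizes, and toy labels are assumptions, not DeepSumm's actual design.

```python
# A minimal sketch of deep-learning sentence ranking for extractive
# summarization: a small feed-forward net scores hand-made sentence features.
# Feature choices and toy labels are assumptions, not DeepSumm's design.
import torch
import torch.nn as nn

# Features per sentence: [position, length ratio, mean tf-idf, title overlap]
feats = torch.tensor([[0.0, 0.9, 0.42, 0.6],
                      [0.5, 0.4, 0.18, 0.1],
                      [1.0, 0.7, 0.35, 0.3]])
gold = torch.tensor([[1.0], [0.0], [1.0]])  # 1 = sentence kept in summary

ranker = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
opt = torch.optim.Adam(ranker.parameters(), lr=0.05)
for _ in range(200):  # tiny training loop on toy data
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy(ranker(feats), gold)
    loss.backward()
    opt.step()

scores = ranker(feats).squeeze(1).tolist()
print([round(s, 2) for s in scores])  # rank sentences by these scores
```
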
    • Open Access Article

      8 - Recognizing Transliterated English Words in Persian Texts
      Ali Hoseinmardy, Saeedeh Momtazi
      One of the most important problems in text processing systems is word mismatch, which limits access to the required information in information retrieval and lowers accuracy in text classification and clustering. If the text processing engine does not recognize similar or related words used in the same sense, it may fail to guide the user to the appropriate result. Various statistical techniques have been proposed to bridge this vocabulary gap; e.g., if two words are frequently used in similar contexts, they have similar or related meanings. Synonyms and similar words, however, are only one category of related words that statistical approaches are expected to capture. Another category is the pair of an original word in one language and its transliteration from another language. Such pairs are common in non-English languages: instead of using the original word from the target language, a writer may borrow the English word and merely transliterate it into the target language. Since this writing style is used in limited texts, transliterated words are far less frequent than original words, and available corpus-based techniques are therefore unable to capture their concept. In this article, we propose two approaches to overcome this problem: (1) neural network-based transliteration, and (2) available machine translation/transliteration tools such as Google Translate and Behnevis. Our experiments on a dataset provided for this purpose show that the combination of the two approaches detects English words with 89.39% accuracy.
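
      A much-simplified version of the detection idea is sketched below: romanize a Persian token with a character map and test the result against an English vocabulary. The character map and the single vocabulary entry are toy assumptions; the paper's neural transliteration and external tools are not reproduced.

```python
# A toy back-transliteration check: romanize a Persian token with a small
# character map and see whether the result is a known English word. The
# mapping and vocabulary are simplified assumptions for illustration.
char_map = {"ک": "k", "ا": "a", "م": "m", "پ": "p", "ی": "i", "و": "u",
            "ت": "t", "ر": "r", "س": "s", "ن": "n"}
english_vocab = {"kampiutr"}  # toy: romanized form standing in for "computer"

def romanize(token: str) -> str:
    return "".join(char_map.get(ch, "?") for ch in token)

def looks_transliterated(token: str) -> bool:
    return romanize(token) in english_vocab

print(romanize("کامپیوتر"))            # rough romanization of the token
print(looks_transliterated("کامپیوتر"))  # True under the toy vocabulary
```
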
    • Open Access Article

      9 - Utilizing Gated Recurrent Units to Retain Long Term Dependencies with Recurrent Neural Network in Text Classification
      Nidhi Chandra, Laxmi Ahuja, Sunil Kumar Khatri, Himanshu Monga
      The classification of text is one of the key areas of research in natural language processing. Most organizations receive customer reviews and feedback on their products and want to review them quickly so they can act on them. Manual review would take considerable time and effort and could affect product sales, so organizations have asked their IT departments to leverage machine learning algorithms to process such text in real time. Gated recurrent units (GRUs), an extension of the recurrent neural network that adds a gating mechanism, provide such an approach. Recurrent neural networks (RNNs) have proven to be the main alternative for sequence classification and are able to retain information from past outcomes and use it for performance adjustment. The GRU model mitigates gradient problems, allowing the model to learn long-term dependencies in text data, which benefits use cases such as sentiment analysis. This paper presents a text classification technique using sequential word embeddings processed by a gated recurrent unit with a sigmoid function in a recurrent neural network. It focuses on classifying text with the GRU method, which embeds text into fixed-size matrices and specifically informs the network of long-term dependencies. We applied the GRU model to a movie review dataset and obtained a classification accuracy of 87%.
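
      A minimal GRU classifier in the described style might look like the following, assuming TensorFlow/Keras and its bundled IMDB movie-review data; the hyperparameters are illustrative, not the paper's configuration.

```python
# A minimal sketch of a GRU text classifier, assuming TensorFlow/Keras and
# the bundled IMDB movie-review data; hyperparameters are illustrative.
import tensorflow as tf

vocab, maxlen = 10000, 200
(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.imdb.load_data(num_words=vocab)
x_tr = tf.keras.preprocessing.sequence.pad_sequences(x_tr, maxlen=maxlen)
x_te = tf.keras.preprocessing.sequence.pad_sequences(x_te, maxlen=maxlen)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab, 64),            # word embeddings
    tf.keras.layers.GRU(64),                         # gated recurrent layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive/negative
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=2, batch_size=128, validation_data=(x_te, y_te))
```
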
    • Open Access Article

      10 - Word Sense Induction in Persian and English: A Comparative Study
      Masood Ghayoomi
      Words in natural language have forms and meanings, and there is not always a one-to-one match between them. This property causes words to have more than one meaning; as a result, a text processing system faces the challenge of determining the precise meaning of a target word in a sentence. Lexical resources or lexical databases, such as WordNet, can help, but because they are developed manually, they become outdated as time passes and the language changes. Moreover, lexical resources may be domain dependent and thus unusable for open-domain natural language processing tasks. These drawbacks strongly motivate unsupervised machine learning approaches that induce word senses from natural data. To this end, clustering can be used such that each cluster corresponds to a sense. In this paper, we study the performance of a word sense induction model along three variables: (a) the target language: we run the induction process on Persian and English; (b) the type of clustering algorithm: both parametric clustering algorithms, including hierarchical and partitioning, and non-parametric clustering algorithms, including probabilistic and density-based, are used to induce senses; and (c) the context of the target word used to build the vectors for clustering: the vectors are created either from the whole sentence containing the target word or from a limited window of surrounding words. We evaluate the clustering performance externally and introduce a normalized joint evaluation metric to compare the models. The experimental results on both Persian and English test data show that the window-based partitioning K-means algorithm obtains the best performance.
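
      The clustering setup can be illustrated as below: each occurrence of an ambiguous word is represented by its context, and the contexts are clustered so that each cluster approximates one sense. Bag-of-words vectors and the toy sentences stand in for the paper's vector construction.

```python
# A minimal word-sense-induction sketch: represent each occurrence of the
# ambiguous word "bank" by its context, then cluster the contexts so each
# cluster approximates one sense. Data and vectorization are toy choices.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

contexts = [
    "deposited money at the bank before noon",
    "the bank approved the loan yesterday",
    "fishing on the river bank at dawn",
    "the river bank eroded after the flood",
]
# Bag-of-words over each occurrence's context stands in for richer vectors.
X = CountVectorizer(stop_words="english").fit_transform(contexts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for sent, sense in zip(contexts, labels):
    print(sense, "|", sent)
```
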
    • Open Access Article

      11 - A Survey on Multi-document Summarization and Domain-Oriented Approaches
      Mahsa Afsharizadeh, Hossein Ebrahimpour-Komleh, Ayoub Bagheri, Grzegorz Chrupała
      Before the advent of the World Wide Web, lack of information was the problem; with the web, we instead face an explosive amount of information in every area of search. This extra information is troublesome and prevents quick and correct decisions: the problem of information overload. Multi-document summarization is an important solution to this problem, producing in a short time a brief summary containing the most important information from a set of documents while preserving their main concepts. When the input documents belong to a specific domain, for example medicine or law, summarization faces additional challenges, and domain-oriented summarization methods use the special characteristics of that domain to generate summaries. This paper introduces the purpose of multi-document summarization systems and discusses domain-oriented approaches. Various methods have been proposed for multi-document summarization; this survey reviews the categorizations that other authors have made of these methods and also categorizes them into six groups: machine learning, clustering, graph, Latent Dirichlet Allocation (LDA), optimization, and deep learning. We review the methods in each group and compare the advantages and disadvantages of the groups, and we discuss the standard datasets used in this field, evaluation measures, challenges, and recommendations.
    • Open Access Article

      12 - Deep Transformer-based Representation for Text Chunking
      Parsa Kavehzadeh, Mohammad Mahdi Abdollah Pour, Saeedeh Momtazi
      Text chunking is one of the basic tasks in natural language processing. Most models proposed in recent years address chunking together with other sequence labeling tasks and are mostly based on recurrent neural networks (RNNs) and conditional random fields (CRFs). In this article, we use state-of-the-art transformer-based models in combination with a CRF, a Long Short-Term Memory (LSTM)-CRF, and a simple dense layer to study the impact of different pre-trained models on overall text chunking performance. To this end, we evaluate BERT, RoBERTa, Funnel Transformer, XLM, XLM-RoBERTa, BART, and GPT2 as candidate contextualized models. Our experiments show that all transformer-based models except GPT2 achieve close, high scores on text chunking. Due to its unidirectional architecture, GPT2 performs relatively poorly on text chunking compared with the bidirectional transformer-based architectures. Our experiments also reveal that adding an LSTM layer to transformer-based models does not significantly improve the results, since the LSTM contributes no additional features that would help the model extract more information from the input beyond what the deep contextualized models already provide.
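
      As a shape-level illustration of chunking as token classification with a pre-trained transformer (assuming Hugging Face Transformers), the sketch below attaches an untrained classification head to BERT; the tag set and checkpoint are illustrative, and the paper's CRF and LSTM-CRF heads are not reproduced.

```python
# A minimal sketch of transformer-based chunking as token classification,
# assuming Hugging Face Transformers. The tag set and checkpoint are
# illustrative; the classification head here is randomly initialized.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["B-NP", "I-NP", "B-VP", "I-VP", "O"]  # a tiny chunk tag set
name = "bert-base-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(
    name, num_labels=len(labels))

enc = tok("The quick fox jumped", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits        # shape: (1, seq_len, num_labels)
pred = logits.argmax(-1)[0].tolist()
print([labels[i] for i in pred])        # untrained tags, shown for shape only
```
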
    • Open Access Article

      13 - An Efficient Sentiment Analysis Model for Crime Articles’ Comments using a Fine-tuned BERT Deep Architecture and Pre-Processing Techniques
      Sovon Chakraborty, Muhammad Borhan Uddin Talukdar, Portia Sikdar, Jia Uddin
      The prevalence of social media these days allows users to exchange views on a multitude of events, and public comments on high-profile crimes can be analyzed to understand how overall mass sentiment changes over time. In this paper, a specialized dataset comprising public comments about contemporary crime events from various online platforms is developed and used. The comments are manually annotated with one of three polarity values: positive, negative, or neutral. Before feeding the data to the model, pre-processing is applied to eliminate the dispensable parts of each comment. A deep Bidirectional Encoder Representations from Transformers (BERT) model is then used for sentiment analysis on the pre-processed crime data. To evaluate the model's performance, the F1 score, ROC curve, and a heatmap are used. Experimental results show an F1 score of 89% on the test dataset. In addition, the proposed model outperforms other state-of-the-art machine learning and deep learning models, exhibiting higher accuracy with fewer trainable parameters; since fewer trainable parameters mean lower complexity, the proposed model may be a suitable option for portable IoT devices.
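
      A minimal three-class fine-tuning step in this style, assuming Hugging Face Transformers, is sketched below; the comments and labels are invented, not the crime-comment dataset.

```python
# A minimal sketch of three-class sentiment fine-tuning with BERT, assuming
# Hugging Face Transformers; the texts and labels below are invented.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # 0=negative, 1=neutral, 2=positive

texts = ["Justice was served today.", "No opinion on the verdict."]
y = torch.tensor([2, 1])
enc = tok(texts, padding=True, truncation=True, return_tensors="pt")

opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
out = model(**enc, labels=y)   # one illustrative fine-tuning step
out.loss.backward()
opt.step()
print(out.logits.argmax(-1).tolist())
```
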
    • Open Access Article

      14 - Persian Ezafe Recognition Using Neural Approaches
      Habibollah Asghari, Heshaam Faili
      Persian Ezafe recognition aims to automatically identify occurrences of Ezafe (the short vowel /e/), which should be pronounced but is usually not orthographically represented. The task is similar to diacritization and vowel restoration in Arabic. Ezafe recognition can be used for disambiguation in text-to-speech (TTS) systems and in various other language processing tasks such as syntactic parsing and semantic role labeling. In this paper, we propose two neural approaches for the automatic recognition of Ezafe markers in Persian texts: a neural sequence labeling method and a neural machine translation (NMT) approach. Syntactic features are proposed for use in the neural models, and we apply various combinations of lexical features such as word forms, part-of-speech tags, and the ending letters of words. These features were statistically derived using a large annotated Persian text corpus and optimized by forward selection. To evaluate our approaches, we examined nine baseline models, including state-of-the-art approaches to Ezafe recognition in Persian text. Our experiments show that the neural approaches with the optimized features drastically improve on the baselines, including the conditional random field method, the best-performing baseline. Although the NMT approach performs better than the other baseline approaches, it does not reach the performance of the neural sequence labeling method, whose best F1-measure is 96.29%.
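
      The sequence labeling formulation can be illustrated with a toy BiLSTM tagger that assigns each token a binary Ezafe tag, as below (assuming PyTorch). The vocabulary, sentence, and sizes are invented, and the paper's optimized feature set is not included.

```python
# A toy BiLSTM sequence labeler for Ezafe marks: each token gets a binary
# tag (1 = carries Ezafe). Vocabulary, data, and sizes are invented for
# illustration and do not reproduce the paper's feature combinations.
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "ketab": 1, "bozorg": 2, "man": 3}  # toy romanized tokens
sent = torch.tensor([[1, 2, 3]])   # "ketab-e bozorg-e man" (my big book)
tags = torch.tensor([[1, 1, 0]])   # Ezafe follows the first two tokens

class EzafeTagger(nn.Module):
    def __init__(self, v, d=16, h=16):
        super().__init__()
        self.emb = nn.Embedding(v, d, padding_idx=0)
        self.lstm = nn.LSTM(d, h, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * h, 2)  # per-token logits: no-Ezafe / Ezafe

    def forward(self, x):
        return self.out(self.lstm(self.emb(x))[0])

model = EzafeTagger(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(100):  # tiny training loop on one toy sentence
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(sent).view(-1, 2), tags.view(-1))
    loss.backward()
    opt.step()
print(model(sent).argmax(-1).tolist())  # expected: [[1, 1, 0]]
```
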