List of Articles: Natural Language Processing

      • Open Access Article

        1 - Computing Semantic Similarity of Documents Based on Semantic Tensors
        Navid Bahrami, Amir H. Jadidinejad, Mojdeh Nazari
        Exploiting the semantic content of texts has always been an important and challenging problem in Natural Language Processing, owing to its wide range of applications such as retrieving documents related to a query, document classification, and computing the semantic similarity of documents. In this paper, a novel corpus-based approach for computing the semantic similarity of texts is proposed, using the Wikipedia corpus organized as a three-dimensional tensor. For this purpose, the semantic vectors of the words in a document are first obtained from the vector space derived from the words in Wikipedia articles; the semantic vector of the document is then formed from its word vectors. Consequently, the semantic similarity of documents can be measured by comparing their semantic vectors. The vector space built from the Wikipedia corpus suffers from the curse of dimensionality because of its high-dimensional vectors: vectors in a high-dimensional space tend to be very similar to one another, so identifying the most appropriate semantic vector for a word becomes meaningless. The proposed approach therefore mitigates the curse of dimensionality by reducing the dimensionality of the vector space through random indexing. Moreover, random indexing significantly reduces the memory consumption of the proposed approach by shrinking the vector space. Handling synonymous and polysemous words also becomes feasible in the proposed approach by means of the structured co-occurrences captured through random indexing.
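        As a rough illustration of the random-indexing step described above, the sketch below builds low-dimensional word vectors from sparse random index vectors, sums them into document vectors, and compares documents by cosine similarity. This is a minimal sketch of the general technique, not the paper's implementation; the dimensionality, the number of nonzero entries, the context window, and the toy corpus are all illustrative assumptions.

        ```python
        # Minimal random-indexing sketch (illustrative parameters, not the paper's).
        import numpy as np
        from collections import defaultdict

        DIM = 300      # reduced vector-space dimensionality (assumption)
        NONZERO = 10   # number of +/-1 entries per random index vector (assumption)

        rng = np.random.default_rng(42)

        def index_vector():
            """Sparse ternary random vector: mostly zeros, a few +/-1 entries."""
            v = np.zeros(DIM)
            pos = rng.choice(DIM, size=NONZERO, replace=False)
            v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
            return v

        def build_word_vectors(documents, window=2):
            """Accumulate each word's context vector from nearby words' index vectors."""
            index = defaultdict(index_vector)            # one fixed random vector per word
            context = defaultdict(lambda: np.zeros(DIM))
            for doc in documents:
                tokens = doc.lower().split()
                for i, w in enumerate(tokens):
                    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                    for j in range(lo, hi):
                        if i != j:
                            context[w] += index[tokens[j]]
            return context

        def doc_vector(doc, word_vectors):
            """Represent a document as the sum of its words' context vectors."""
            vecs = [word_vectors[w] for w in doc.lower().split() if w in word_vectors]
            return np.sum(vecs, axis=0) if vecs else np.zeros(DIM)

        def cosine(a, b):
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b / denom) if denom else 0.0

        corpus = ["the cat sat on the mat", "a dog sat on the rug",
                  "stock markets fell sharply today"]
        wv = build_word_vectors(corpus)
        d1 = doc_vector("the cat on the mat", wv)
        d2 = doc_vector("a dog on the rug", wv)
        print(cosine(d1, d2))   # higher for semantically closer documents
        ```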
      • Open Access Article

        2 - DeepSumm: A Novel Deep Learning-Based Multi-Lingual Multi-Documents Summarization System
        Shima Mehrabi, Seyed Abolghassem Mirroshandel, Hamidreza Ahmadifar
        With the increasing amount of textual information accessible via the internet, a summarization system that can generate summaries of information on demand seems necessary. Summarization has long been studied by natural language processing researchers, and today, with improved processing power and the development of computational tools, efforts to improve the performance of summarization systems continue, especially by utilizing more powerful learning algorithms such as deep learning. In this paper, a novel multi-lingual multi-document summarization system based on deep learning techniques is proposed; it is among the first Persian summarization systems to use deep learning. The proposed system ranks sentences based on a set of predefined features using a deep artificial neural network. A comprehensive study of the effect of different features was also conducted to find the best possible feature combination. The performance of the proposed system is evaluated on standard baseline datasets in Persian and English. The evaluation results demonstrate the effectiveness and success of the proposed summarization system in both languages; the proposed method achieves state-of-the-art performance in both Persian and English.
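        To make the sentence-ranking idea concrete, the sketch below scores sentences with a small feedforward network over a few hand-crafted features and keeps the top-ranked ones as the summary. It is a minimal sketch under assumed choices: the three features (position, length, overlap with a pseudo-title), the network size, and the toy training scores are placeholders, not DeepSumm's actual configuration.

        ```python
        # Extractive summarization by neural sentence ranking (toy setup).
        import numpy as np
        from sklearn.neural_network import MLPRegressor

        def features(sentences):
            """Per-sentence features: relative position, length, pseudo-title overlap."""
            title_words = set(sentences[0].lower().split())  # first sentence as pseudo-title (assumption)
            feats = []
            for i, s in enumerate(sentences):
                words = s.lower().split()
                overlap = len(title_words & set(words)) / max(len(words), 1)
                feats.append([i / len(sentences), len(words) / 40.0, overlap])
            return np.array(feats)

        # Toy training data: feature vectors paired with salience scores in [0, 1].
        X_train = np.array([[0.0, 0.5, 0.6], [0.5, 0.3, 0.1], [0.9, 0.2, 0.0],
                            [0.1, 0.6, 0.5], [0.7, 0.4, 0.2]])
        y_train = np.array([0.9, 0.3, 0.1, 0.8, 0.2])

        model = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
        model.fit(X_train, y_train)

        doc = ["Deep learning improves text summarization.",
               "Researchers evaluated systems on Persian and English data.",
               "The weather was pleasant that day.",
               "Neural sentence ranking selects the most salient sentences."]
        scores = model.predict(features(doc))
        top = np.argsort(scores)[::-1][:2]              # keep the two highest-ranked sentences
        summary = " ".join(doc[i] for i in sorted(top)) # restore original sentence order
        print(summary)
        ```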
      • Open Access Article

        3 - A Customized Web Spider for Why-QA Pairs Corpus Preparation
        Manvi Breja
        Considering the growth of research on improving the performance of non-factoid question answering systems, there is a need for an open-domain non-factoid dataset. Some datasets are available for non-factoid and even how-type questions, but no appropriate dataset exists that comprises only open-domain why-type questions covering the full range of question formats. Why-questions play a significant role and are asked in every domain. They are more complex and difficult for a system to answer automatically, since why-questions seek the reasoning behind the task involved. They are prevalent and asked out of curiosity by real users, so answering them depends on the users' needs, knowledge, context, and experience. This paper develops a customized web crawler, illustrated in the sketch below, for gathering a set of why-questions, irrespective of domain, from five popular question answering sources: Answers.com, Yahoo! Answers, Suzan Verberne's open-source dataset, Quora, and Ask.com. Along with the questions, their category, document title, and appropriate answer candidates are also maintained in the dataset. The distribution of why-questions according to their type and category is also illustrated. To the best of our knowledge, this is the first sufficiently large dataset of 2,000 open-domain why-questions with their relevant answers, which will further help stimulate research on improving the performance of non-factoid why-QA systems.
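        As a hedged illustration of the filtering step such a crawler performs, the sketch below extracts only why-type questions from already-fetched HTML. The markup (an h2 tag with a question class) and the sample page are hypothetical; the real sites listed above each have their own structure.

        ```python
        # Filter why-type questions out of fetched HTML (hypothetical markup).
        import re
        from html.parser import HTMLParser

        WHY_PATTERN = re.compile(r"^\s*why\b", re.IGNORECASE)

        class QuestionExtractor(HTMLParser):
            """Collect the text of elements tagged as questions, keeping why-questions."""
            def __init__(self):
                super().__init__()
                self.in_question = False
                self.questions = []
            def handle_starttag(self, tag, attrs):
                if tag == "h2" and ("class", "question") in attrs:
                    self.in_question = True
            def handle_data(self, data):
                if self.in_question and WHY_PATTERN.match(data):
                    self.questions.append(data.strip())
            def handle_endtag(self, tag):
                if tag == "h2":
                    self.in_question = False

        # A stand-in for a fetched page; a real crawler would download this.
        page = """
        <h2 class="question">Why does ice float on water?</h2>
        <h2 class="question">How do magnets work?</h2>
        <h2 class="question">Why is the sky blue?</h2>
        """
        parser = QuestionExtractor()
        parser.feed(page)
        print(parser.questions)   # only the why-type questions are kept
        ```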
      • Open Access Article

        4 - An Aspect-Level Sentiment Analysis Based on LDA Topic Modeling
        Sina Dami, Ramin Alimardani
        Sentiment analysis is a process through which the beliefs, sentiments, allusions, behaviors, and tendencies in written language are analyzed using Natural Language Processing (NLP) techniques. This process essentially consists of discovering and understanding people's positive or negative sentiments regarding a product or entity in text. The increased significance of sentiment analysis has coincided with the growth of social media such as surveys, blogs, Twitter, etc. The present study uses topic modeling based on latent Dirichlet allocation (LDA) to extract and represent thematic features, together with a support vector machine (SVM) to classify and analyze sentiments at the aspect level. LDA extracts latent topics by observing all the texts, assigning to each word a probability of belonging to each topic. Through this approach, the important features that represent the thematic aspects of the text are extracted and fed to a support vector machine for classification. The SVM is a powerful classification algorithm that can accurately separate complex data by mapping it to a much higher-dimensional space and constructing an optimal hyperplane. Empirical results on real datasets indicate that the proposed model is promising and performs better than the baseline methods in terms of precision (89.78% on average), recall (78.92% on average), and F-measure (83.50% on average).
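        A minimal sketch of the described pipeline, using scikit-learn: bag-of-words counts feed an LDA model whose per-document topic mixtures are then classified by an SVM. The corpus, labels, topic count, and kernel below are toy placeholders, not the paper's configuration.

        ```python
        # LDA topic features fed into an SVM classifier (toy data).
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.pipeline import Pipeline
        from sklearn.svm import SVC

        docs = ["the battery life is excellent and charging is fast",
                "terrible battery, it dies within an hour",
                "the screen is bright with vivid colors",
                "dull screen and washed-out colors, very disappointing"]
        labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative (toy labels)

        pipeline = Pipeline([
            ("counts", CountVectorizer()),                      # bag-of-words counts
            ("lda", LatentDirichletAllocation(n_components=4,   # per-doc topic mixtures
                                              random_state=0)),
            ("svm", SVC(kernel="rbf")),                         # classify topic features
        ])
        pipeline.fit(docs, labels)
        print(pipeline.predict(["the battery drains quickly"]))
        ```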
      • Open Access Article

        5 - Designing a Semi-Intelligent Crawler for Creating a Persian Question Answering Corpus Called Popfa
        Hadi Sharifian, Nasim Tohidi, Chitra Dadkhah
        Question answering in natural language processing is an interesting field for researchers to examine their ability in solving the tough Alan Turing test. Every day, computer scientists try hard to develop and promote question answering systems in various natural languages, especially English. In Persian, however, advancing these systems is not easy; the main problem is the scarcity of resources and corpora in this language. Thus, in this paper, a Persian question answering text corpus is created, covering a wide range of topics related to religion, midwifery, and youth marriage, as well as question types commonly encountered in Persian language usage. The most important challenge in this regard was introducing a method for gathering data in Persian and facilitating and expanding the data gathering process. SIC (Semi-Intelligent Crawler) is proposed as a solution that overcomes this challenge by crawling Persian websites, gathering text, and finally importing it into a database. The outcome of this research is a corpus called Popfa, which stands for POrsesh Pasokh (question answering) in FArsi. This corpus contains more than 53,000 standard questions and answers, and it has been evaluated with standard approaches. All the questions in Popfa are answered by specialists in two general topics: religious and medical questions. Researchers can now use this corpus for research on Persian question answering.
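        To illustrate the crawl-and-store loop such a system needs, the sketch below walks a link frontier breadth-first and writes question-answer pairs into a SQLite table. The fetch_page stub and the example URLs stand in for SIC's actual fetching and Persian text extraction, which the abstract does not detail.

        ```python
        # Crawl loop storing question-answer pairs in a database (stubbed fetching).
        import sqlite3
        from collections import deque

        def fetch_page(url):
            """Stub: a real crawler would perform an HTTP GET and parse the page here."""
            fake_site = {
                "https://example.ir/q/1": ("سوال نمونه یک؟", "پاسخ نمونه یک.",
                                           ["https://example.ir/q/2"]),
                "https://example.ir/q/2": ("سوال نمونه دو؟", "پاسخ نمونه دو.", []),
            }
            return fake_site.get(url)

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE qa (url TEXT PRIMARY KEY, question TEXT, answer TEXT)")

        frontier, seen = deque(["https://example.ir/q/1"]), set()
        while frontier:
            url = frontier.popleft()
            if url in seen:
                continue
            seen.add(url)
            page = fetch_page(url)
            if page is None:
                continue
            question, answer, links = page
            conn.execute("INSERT OR IGNORE INTO qa VALUES (?, ?, ?)",
                         (url, question, answer))
            frontier.extend(links)    # follow outgoing links breadth-first

        print(conn.execute("SELECT COUNT(*) FROM qa").fetchone()[0], "pairs stored")
        ```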