Computing Semantic Similarity of Documents Based on Semantic Tensors
: Semantic Web
Amir H. Jadidinejad
Natural Language Processing,
Exploiting semantic content of texts due to its wide range of applications such as finding related documents to a query, document classification and computing semantic similarity of documents has always been an important and challenging issue in Natural Language Processing. In this paper, using Wikipedia corpus and organizing it by three-dimensional tensor structure, a novel corpus-based approach for computing semantic similarity of texts is proposed. For this purpose, first the semantic vector of available words in documents are obtained from the vector space derived from available words in Wikipedia articles, then the semantic vector of documents is formed according to their words vector. Consequently, measuring the semantic similarity of documents can be done by comparing their semantic vectors. The vector space of the corpus of Wikipedia will cause the curse of dimensionality challenge because of the existence of the high-dimension vectors. Usually vectors in high-dimension space are very similar to each other; in this way, it would be meaningless and vain to identify the most appropriate semantic vector for the words. Therefore, the proposed approach tries to improve the effect of the curse of dimensionality by reducing the vector space dimensions through random indexing. Moreover, the random indexing makes significant improvement in memory consumption of the proposed approach by reducing the vector space dimensions. The addressing capability of synonymous and polysemous words in the proposed approach will be feasible by means of the structured co-occurrence through random indexing.