Density Measure in Context Clustering for Distributional Semantics of Word Sense Induction
Subject Areas: Data Mining
1 - Institute for Humanities and Cultural Studies
Keywords: Word Sense Induction, Word Embedding, Clustering, Silhouette Score, Unsupervised Machine Learning, Distributional Semantics, Density
Abstract:
Word Sense Induction (WSI) aims at inducing word senses from data without using prior knowledge. Since no labeled data are involved, researchers have approached the task with clustering techniques. Clustering algorithms are of two types: parametric and non-parametric. Although non-parametric clustering algorithms are more suitable for inducing word senses, their shortcomings make them impractical. Parametric clustering algorithms, on the other hand, show competitive results, but they suffer from a major problem: the number of clusters must be fixed in advance. The main contribution of this paper is to show that using the silhouette score, normally an internal evaluation metric, to measure cluster density in a parametric clustering algorithm such as K-means captures word senses better than the state-of-the-art models on the WSI task. To this end, a word embedding approach is used to represent the contextual information of words as vectors. To capture the context in these vectors, we propose two modes of experiment: the vectors are built from either the whole sentence or a limited number of words surrounding the target word in its local context. Experimental results based on the V-measure evaluation metric show that the two modes of our proposed model outperform the state-of-the-art models by 4.48% and 5.39%, respectively. Moreover, the average and maximum numbers of clusters in the outputs of our proposed models are close to those of the gold data.
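To make the two context modes concrete, the following Python sketch (our illustration, not the paper's released code) builds a vector for one occurrence of a target word by averaging pretrained word embeddings, e.g. word2vec vectors exposed as a token-to-vector mapping, over its context. Averaging is an assumed composition choice here, since the abstract does not specify how the context embeddings are combined.

import numpy as np

def context_vector(tokens, target_idx, embeddings, window=None):
    # window=None: whole-sentence mode; window=k: use at most k words
    # on each side of the target word (local-context mode).
    if window is None:
        context = tokens[:target_idx] + tokens[target_idx + 1:]
    else:
        lo = max(0, target_idx - window)
        context = (tokens[lo:target_idx]
                   + tokens[target_idx + 1:target_idx + 1 + window])
    vecs = [embeddings[w] for w in context if w in embeddings]
    if not vecs:
        return None  # no context word has an embedding
    return np.mean(vecs, axis=0)  # averaged context representation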
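The cluster-selection step can be sketched in the same spirit: for each target word, K-means is run with several candidate cluster counts, and the clustering with the highest mean silhouette score, i.e. the densest and best-separated one, is kept. The scikit-learn usage below is our assumption (the search range k_max in particular is illustrative, not a value from the paper); the induced labels can then be scored against gold senses with sklearn.metrics.v_measure_score, the external metric reported in the abstract.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def induce_senses(X, k_max=10, random_state=0):
    # X: one row per occurrence of the target word (its context vector).
    best_k, best_score, best_labels = None, -1.0, None
    # silhouette_score requires 2 <= k <= n_samples - 1.
    for k in range(2, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        score = silhouette_score(X, labels)  # mean silhouette over all points
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels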