Word Sense Induction in Persian and English: A Comparative Study

Words in natural language have forms and meanings, and there is not always a one-to-one match between them. Because of this property, a word can have more than one meaning; as a result, a text processing system faces challenges in determining the precise meaning of a target word in a sentence. Lexical resources or lexical databases, such as WordNet, can help, but because they are developed manually, they become outdated over time as the language changes. Moreover, lexical resources may be domain dependent, which makes them unusable for open-domain natural language processing tasks. These drawbacks are a strong motivation to use unsupervised machine learning approaches to induce word senses from natural data. To reach this goal, clustering can be utilized such that each cluster represents a sense. In this paper, we study the performance of a word sense induction model with respect to three variables: a) the target language: in our experiments, we run the induction process on Persian and English; b) the type of the clustering algorithm: both parametric clustering algorithms, including hierarchical and partitioning, and non-parametric clustering algorithms, including probabilistic and density-based, are utilized to induce senses; c) the context of the target words captured in the vectors created for clustering: the input vectors of the clustering algorithms are created either from the whole sentence in which the target word occurs or from a limited window of words surrounding the target word. We evaluate the clustering performance externally. Moreover, we introduce a normalized, joint evaluation metric to compare the models. The experimental results for both Persian and English test data showed that the window-based partitioning K-means algorithm obtained the best performance.

Language, as a means of communication between human beings, is composed of two components [1]: form and meaning. The "form" can be represented either via an audio signal transmitted through a voice channel from a speaker to a recipient, or via an orthographic form through the writing system and the alphabet of the language. In text processing, the orthographic form of the language is taken into consideration. Ambiguity is a property of natural language that causes challenges in text processing. There are two types of ambiguity: a) syntactic ambiguity, and b) lexical ambiguity. The sentence "I saw the man with a telescope.", for instance, is an example of syntactic ambiguity: it can mean either "I used a telescope to see the man" or "I saw the man who carried a telescope".
There are two causes of lexical ambiguity [2, p: 146]: (a) polysemy, where a word has more than one meaning, such as /rošan/ (light/bright) in /range rošan/ (light color) and /Ɂotāqe rošan/ (bright room) in Persian, or "plane" in "fly by plane" and "cut by plane" in English; and (b) homonymy, where the word is both a homograph and a homophone, such as /rox/ (rook/face/roc) in /mohreye rox/ (the rook piece [in chess]), /roxe zibāye Ɂu/ (her beautiful face), and /parandeye rox/ (the roc bird) in Persian, or "bank" (financial place/side of river) in English. The sentences in Example (1)a-f, which contain the target word "bank", are grouped (clustered) in Figure 1. Based on the semantic similarity of the target word "bank" in the sentences, one group belongs to the concept "financial place" (bank₁) and the other group belongs to the concept "side of river" (bank₂).
(1) a. He cashed a check at the bank.
b. She sat on the bank of the river and watched the currents.
c. They detected frauds in the bank.
d. I saw a deer near the river bank.
e. That bank holds the mortgage on my home.
f. They pulled the canoe up on the bank.
Lexical ambiguity in text processing is more pronounced in languages that use the Arabic script in their writing system, such as Persian, where short vowels are usually not written, than in languages that use a phonemic orthography, such as English. In text processing, both polysemy and homonymy are treated as one problem. The context of the ambiguous target word plays a very important role in determining and disambiguating its meaning.
The Word Sense Induction (WSI) task means that the machine has to induce word senses from the natural data automatically without prior knowledge. This task uses an unsupervised machine learning approach and it can be defined as a clustering task. The example in Figure 1 represents the idea of how clustering can identify the senses of a word. One property of this task is that no initial training data is required. This paper focuses on WSI and aims at inducing the meaning of both polysemous and homonymous Persian and English words from their local contexts and comparing the performance of the clustering algorithms. One additional contribution of this paper is introducing a normalized, joint, external evaluation metric to be able to compare the models more accurately against the naïve baselines.
The paper is structured as follows: after the introduction, Section 2 describes the semantic representation methods used for the clustering task. Section 3 reviews the related work on WSI. In Section 4, our models for both Persian and English are proposed. The obtained results as well as our proposed joint evaluation metric are discussed in Section 5; finally, the paper is concluded in Section 6.

2-1-Distributional Semantics
Ambiguity is one of the properties of natural language. According to the idea proposed by Wittgenstein [3], the meaning of a word can be determined by its usage in the language. Following this idea, Harris [4] proposed, in the framework of "distributional semantics", that words which are used in the same local contexts tend to have similar meanings. Based on this idea, the "distributional hypothesis" was proposed, and Firth [5] emphasized that "the local context of the word plays an important role in determining words' senses". Miller and Charles [6] proposed the "strong contextual hypothesis", according to which two words are semantically similar to the extent that they have similar contexts. Based on this hypothesis, the words "year", "date", and "Wednesday" in Example (2) are semantically similar.
(2) a. I go to the cinema this year.
b. I go to the cinema on this date.
c. I go to the cinema this Wednesday.
Since the context plays a very important role in capturing the meaning of a word, precise encoding of the word's context information is required. To this end, Peirsman and Geeraerts [7] introduced three types of linguistic contexts to be extracted from a large corpus: a) the document-based model, where the words in the same paragraph or in the same document are used as the context [8,9]; b) the syntax-based model, where words are compared according to their syntactic relations, from dependency relations [10,11,12,13] to the combinatory categorial grammar [14]; and c) the word-based model, where word-word co-occurrence statistics are extracted from a fixed window size. These word co-occurrences resemble the "bag-of-words" model [9].
Song et al. [15] introduced two general approaches to represent context information in "distributional semantics": a) using the Bayesian model utilized in topic modeling [16], and b) using a feature-based model to represent the semantic information as a vector. The latter uses a vector space model to represent the vectorized semantic information of words. The vectors can be used in the clustering task to induce the meanings of words. The advantage of a vector space model is that it compresses the information about the words and their contexts into a representation called "word embedding". Computing the geometric distance between the vectors makes it possible to decide how similar two words tend to be. Euclidean distance and cosine distance are two well-known methods for computing the geometric distance between vectors [17]. In addition, there are studies that try to better represent distributional semantics by combining word embeddings with knowledge bases, known as the "knowledge embedding model" [18], by enriching word embeddings with ontologies [19], and by utilizing a contextualized knowledge embedding model as a joint model where word embedding and sense embedding (sense representations of the words in the local context from sense-tagged corpora) are combined with knowledge bases [20].
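The two distances mentioned above can be sketched in a few lines of plain Python. This is only an illustration: the function names are chosen here and are not taken from the paper, and cosine distance is assumed to mean one minus the cosine similarity.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Cosine distance ignores vector length and compares directions only, so two vectors that point the same way have a distance close to zero even when their Euclidean distance is large.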

2-2-Modeling Methods
To use word embedding methods for capturing the local context information of a word and compressing it into a vector, two kinds of methods can be utilized: a) matrix decomposition techniques, and b) neural network-based techniques. The Global Vector (GloVe) representation [21] uses the matrix decomposition technique to provide the distributional representation of words. The Continuous Skip-gram (Skip-gram) and Continuous Bag Of Words (CBOW) models [22] use the neural network-based technique to represent the contextual information of a word in a vector. In this paper, we use the Skip-gram model for capturing the contextual information of the target word in a vector.

2-3-Context Clustering
There are two major types of clustering algorithms in terms of defining the number of clusters: parametric and non-parametric. The two well-known parametric clustering approaches commonly used in natural language processing applications are partitioning and hierarchical. Partitioning clustering, such as the K-means algorithm [23], is centroid-based and computes the distance of individual vectors to the centroid. Hierarchical clustering uses a statistical criterion to compute the distance between clusters. This algorithm is either agglomerative (bottom-up) or divisive (top-down). We use the divisive clustering algorithm for the WSI task.
The common property of parametric algorithms is that they require a pre-defined number of clusters. Therefore, the State-Of-The-Art (SOTA) techniques in the field have performed their experiments with a pre-defined number of clusters; e.g., the model proposed by Huang et al. [24] used the K-means algorithm with 3 clusters, i.e. k=3. To obtain a better estimate of the number of clusters, Ghayoomi [25] utilized the silhouette score [26] in Equation (1) as a metric to define the number of clusters. Using this method to identify the number of clusters outperformed the SOTA results.
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{1} \]
where a(i) is the average dissimilarity of element i to the other elements in the same cluster, computed by Equation (2); and b(i) is the minimum average distance between element i and the elements of each of the other clusters, computed by Equation (3).
\[ a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j) \tag{2} \]
\[ b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j) \tag{3} \]
where d(i,j) is the distance between elements i and j, C_i is the cluster to which element i belongs (with j being another element of this cluster in Equation (2)), and C_k is a cluster of which element i is not a member.
In this research, we use the silhouette score as a metric for each cluster to decide about the best number of clusters: the higher the score, the better the clustering result.
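The selection step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes the standard silhouette definition (for each element, the mean distance to its own cluster is compared with the smallest mean distance to any other cluster), Euclidean distance, and a hypothetical helper `best_labeling` that receives one candidate labeling per tested number of clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0.0 else (b - a) / denom)
    return sum(scores) / len(scores)


def best_labeling(points, candidate_labelings):
    """Hypothetical helper: keep the candidate clustering with the highest score."""
    return max(candidate_labelings, key=lambda labels: silhouette_score(points, labels))
```

In practice, each candidate labeling would come from running the clustering algorithm with a different number of clusters, and the labeling with the highest score would be kept.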
Non-parametric clustering algorithms are another approach for the WSI task. The number of senses (clusters) is unknown in advance, and the algorithms have to find the senses themselves. The Chinese Restaurant Process (CRP) [27] and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [28] are the two non-parametric clustering algorithms that we use for this goal. CRP is described with the metaphor of customers entering a Chinese restaurant: each customer either sits at a table that is already occupied or takes a new table. The algorithm uses a probabilistic Bayesian model. DBSCAN uses a density-based model to find the best partitioning of clusters.
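The seating choice in CRP can be illustrated with its standard prior probabilities. This is a sketch under stated assumptions: the concentration parameter `alpha` and the function name are introduced here for illustration and do not come from the paper, and a full WSI model would combine this prior with how well the context fits each sense.

```python
def crp_seating_probabilities(table_sizes, alpha=1.0):
    """Prior probability of joining each occupied table, then of opening a new one.

    An occupied table is chosen in proportion to the number of customers already
    seated there; a new table is chosen in proportion to alpha (an assumed
    concentration parameter).
    """
    total = sum(table_sizes) + alpha
    probabilities = [size / total for size in table_sizes]
    probabilities.append(alpha / total)
    return probabilities
```

With two occupied tables of sizes 3 and 1 and `alpha = 1.0`, the larger table is the most likely choice, which is the "rich get richer" behavior of the process; in the WSI setting, tables correspond to senses and customers to contexts.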
In this paper, we compare the performance of both parametric and non-parametric algorithms for the WSI task in Persian and English. The study of the algorithms themselves and their properties is out of the scope of this paper.

3-Related Works on WSI
Clustering the context to distinguish the senses of the target polysemous or homonymous word is one of the main approaches in WSI. In this approach, the number of clusters indicates the number of the target word's senses. Huang et al. [24] used the K-means algorithm with word embedding to cluster word contexts. Neelakantan et al. [29] predicted each sense of a word as a context cluster assignment. Their model worked based on the K-means algorithm. In these two studies, a fixed number of clusters, namely 3 clusters, was defined to run the K-means clustering. Li and Jurafsky [30] proposed using CRP as a non-parametric model to capture the senses dynamically. In their approach, the model decided either to generate a new sense for each context or to assign the context to an already generated sense. Wang et al. [31] proposed a model that uses weighted topic modeling for sense induction. Amrami and Goldberg [32] utilized the biLM model, a bidirectional recurrent neural network model proposed by Peters et al. [33], for WSI and extended the model such that predicted word probabilities were used in the language model. Alagić et al. [34] used the lexical substitution model to induce word senses; in this model, words which belong to a cluster should be substitutable in an appropriate context. The proposed model was compared against manual substitution along with other clustering evaluation metrics. Corrêa and Amancio [35] proposed a model to capture the structural relationship among contexts. To this end, they used the complex network proposed by Perozzi et al. [36] for context embedding. Tallo [37] used sentence embedding for WSI and investigated the encoding of linguistic properties of words in the embedding. Dong and Wang [38] used WSI in the medical domain to enhance sense inventories. They evaluated four models, namely context clustering, two types of word clustering, and sparse coding in the word vector space. Among them, the sparse coding model proposed by Arora et al. [39] outperformed the other models in discovering more complete word senses.
As reported by Song et al. [15], the K-means parametric model used by Neelakantan et al. [29] outperforms the CRP algorithm proposed by Li and Jurafsky [30] on the SemEval2010 WSI task [40]. As Song et al. [15] stated, the main reason for these results is the poor performance of CRP in deciding whether to assign a word to a new cluster. In the results of the two models, the K-means algorithm used 3 clusters as the pre-defined, fixed number of clusters, while CRP ended up with fewer clusters on average than the best average number of clusters for both the noun and verb categories in the SemEval2010 WSI task. This indicates that relaxing the pre-defined number of clusters in K-means can further improve the performance on the task.

4-Architecture of the Proposed Model
The clustering model we proposed in our research is represented in Figure 2. As can be seen in the figure, the model is constructed of three modules and datasets which are described below.

4-1-Major Modules of the Model
The model contains three modules: vectorization, clustering, and evaluation. In vectorization, the word vectors are first created from the big corpus of each language described in Section 4.2. Three parameters should be set in advance for the vectorization of words: a) the number of dimensions of each vector; b) the number of surrounding words of the target word in the local context; c) the information to be considered in vectorization, which is the word forms in our case. The setting of the parameters is described in Section 5.3. The vector of each instance that contains the target word is created in two modes: a) in the first mode, hereafter called the "SentContext" mode, the weighted vectors of the words in a sentence are summed up to build the vector of the instance that includes the target word, and this sum is then normalized by the sentence length; b) in the second mode, hereafter called the "WinContext" mode, only the limited surrounding context of the target word is used to build the vector of the instance.
It has to be mentioned that not all words in a sentence are content words; there is a closed list of frequently used function words, such as prepositions, conjunctions, coordinators, etc. These words can be considered as stop words. We use a weighting method to increase the impact of content words and reduce the impact of function words. To this end, we use TF-IDF (Term Frequency-Inverse Document Frequency) [41] to assign a weight to each word.
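The two context modes can be sketched with one small function. This is an illustrative sketch rather than the authors' code: the names `context_vector`, `embeddings`, and `weights` are hypothetical, the target word itself is assumed to be skipped, out-of-vocabulary words are assumed to be ignored, and the sum is assumed to be normalized by the number of words that contributed.

```python
def context_vector(tokens, target_index, embeddings, weights, window=None):
    """Weighted average of the word vectors around one occurrence of a target word.

    window=None uses the whole sentence (SentContext mode);
    window=4 uses 4 words before and 4 words after the target (WinContext mode).
    """
    if window is None:
        positions = range(len(tokens))
    else:
        start = max(0, target_index - window)
        end = min(len(tokens), target_index + window + 1)
        positions = range(start, end)

    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    count = 0
    for pos in positions:
        if pos == target_index:
            continue  # assumption: the target word does not describe its own context
        word = tokens[pos]
        if word not in embeddings:
            continue  # assumption: words without a vector are ignored
        weight = weights.get(word, 1.0)  # e.g. a TF-IDF weight
        for d, value in enumerate(embeddings[word]):
            total[d] += weight * value
        count += 1
    if count == 0:
        return total
    return [x / count for x in total]
```

Calling the function with `window=None` and with `window=4` on the same instance gives the two inputs that the clustering algorithms would receive in the two modes.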
In the next step, all instances of the target word are clustered based on their vector representation. We assume that each cluster shows one sense of the word. In the clustering module, we utilize both parametric and non-parametric clustering algorithms described in Section 2.3. The parametric algorithms are run based on the two context modes. For clustering, the data should be reformatted from word forms to the vector space model described above. More precise vectors result in better clustering performance.
It should be added that the embedding process has two steps: word embedding is performed before clustering, and after the clustering step, sense embedding is done to capture the semantic distribution of the target word with respect to its meaning in the local context. In the evaluation module, two evaluation criteria, namely F-measure and V-measure, in addition to a joint metric, are used. These metrics are explained in more detail in Section 5.2. In the evaluation process, the instances of the test data are added to the data pool to be clustered, and the induction results of the test data are compared with the corresponding gold standard labels. To this end, we used the toolkit developed in the SemEval2010 WSI task [40] (https://www.cs.york.ac.uk/semeval2010_WSI/files/evaluation.zip) that does this mapping.

4-2-Datasets
To run our experiments, we require three datasets for each of Persian and English: a big corpus, a data pool, and test data. The big corpus is used for training word embedding as well as sense embedding to identify the senses of the target words based on the clustering output. The data pool is used for clustering the target words based on their context, and the test data is used for evaluating the models.
The big corpus that we use for creating the Persian word embedding contains over 538 million word tokens and was developed by Ghayoomi [42]. This corpus is a composition of several other corpora, including a) the Persian Linguistic DataBase [43], which is a balanced Persian corpus containing both historical and contemporary Persian; in this research, we only use the contemporary dataset; b) the Newspaper Corpus, which is a collection of news crawled from the online archives of several Persian newspapers; c) the Hamshahri Corpus [44], which is another news corpus collected from the online archive of the Hamshahri Newspaper; d) the Bijankhan Corpus [45], which is a fraction of Peykare [46], the Persian Text Corpus; and e) the Persian Wikipedia corpus, which contains 361,479 articles downloaded from the dump of Persian Wikipedia articles in July 2016.
The big corpus that we use for creating English word vectors is the Westbury Lab Wikipedia Corpus developed by Shaoul and Westbury [47]. This corpus, which is freely available, is collected from the dump of English Wikipedia articles in April 2010. The corpus contains almost 990 million word tokens of the general domain, and it has been used for similar tasks as reported in the literature [24,29]. It should be mentioned that documents shorter than 2000 characters are excluded from the corpus.
To evaluate the clustering results of the Persian WSI experiments, we use the test data developed by Ghayoomi [42]. This dataset is standardized based on the SemEval2010 framework. In this dataset, 20 Persian words, which are either polysemous or homonymous, are selected from FarsNet [48], the Persian WordNet. For each target word, 100 sentences are manually annotated; as a result, the test dataset contains 2000 instances in total. Moreover, 279,567 unannotated sentences which contain any of the target words are selected from the big corpus as the data pool.
To evaluate the clustering results of the English WSI experiments, we use the SemEval2010 dataset for the WSI task [40] that is mostly from the news domain. In total, 100 words (50 verbs and 50 nouns) are the target words in this dataset. This dataset contains 8,915 instances as test data with sense annotation and 888,722 unannotated sentences in the data pool. Table 1 summarizes the statistical information of the data pool, the test data, and the size of the big corpus for Persian and English.

5-1-Baselines
To evaluate the performance of the clustering algorithms, we use two naïve baselines introduced in SemEval2010 [40]: a) the Most Frequent Sense (MFS), where all instances are assigned to a single cluster that contains the most frequent sense; b) one sense per cluster, hereafter called 1S1C, where each instance is assigned to an individual cluster; therefore, the number of clusters is equal to the number of instances.
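The two naïve baselines are simple enough to write down directly. The sketch below is only illustrative: the function names and the instance identifiers are hypothetical, and cluster labels are represented as integers.

```python
def mfs_baseline(instance_ids):
    """MFS: every instance of the target word goes into one single cluster."""
    return {instance_id: 0 for instance_id in instance_ids}


def one_sense_per_cluster_baseline(instance_ids):
    """1S1C: every instance gets its own cluster, so there are as many clusters as instances."""
    return {instance_id: cluster for cluster, instance_id in enumerate(instance_ids)}
```

These two labelings are the extreme cases of a partition: one cluster for everything and one cluster per instance.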
In addition, there are two SOTA results reported in the literature: a) the CRP algorithm utilized by Li and Jurafsky [30] for non-parametric clustering; and b) the K-means algorithm proposed by Neelakantan et al. [29] for parametric clustering. In this K-means algorithm, there is no optimization of the number of clusters, and 3 senses are assumed as the pre-defined number of senses for each English word. Hereafter, we call this model Kmeans-3.
All of the basic baselines and the SOTA models are also run on the Persian data to compare the clustering performance, regardless of the dependency of the algorithms on the data.

5-2-Evaluation Metrics
To evaluate the performance of the clustering results, we utilize two well-known external evaluation metrics which are commonly used for WSI, namely F-measure [49] and V-measure [50]. In addition, we propose a new normalized, joint evaluation metric, called J-measure, for a fair evaluation of the models.

F-Measure
F-measure computes the accuracy of information retrieval as in Equation (4).
\[ F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R} \tag{4} \]
where P is precision, R is recall, and β is a weighting parameter. If β > 1, more weight is assigned to recall, and in case β < 1, more weight is assigned to precision. If β = 1, precision and recall are considered equally. Equations (5) and (6) compute precision and recall, respectively. In all equations, K is the CLUSTER set, which is the set of hypothesized clusters from the clustering output, and C is the CLASS set, which is the correct partitioning of the data; i.e., for a target dataset with N elements, we have two partitions: the guess partition K, and the gold partition C.
\[ P(c_i, k_j) = \frac{n_{ij}}{|k_j|} \tag{5} \]
\[ R(c_i, k_j) = \frac{n_{ij}}{|c_i|} \tag{6} \]
where n_ij is the number of members of class c_i ∈ C that are elements of cluster k_j ∈ K.
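The role of the weighting parameter can be made concrete with a short sketch. It assumes the usual weighted F-measure, (1 + β²)PR / (β²P + R), and computes precision and recall for one class and one cluster from their overlap; the function names are illustrative, and how the per-pair scores are aggregated over all classes and clusters is left out.

```python
def precision_recall(gold_class, cluster):
    """Precision and recall of one cluster with respect to one gold class.

    Both arguments are collections of instance identifiers; the overlap plays
    the role of n_ij in the text.
    """
    overlap = len(set(gold_class) & set(cluster))
    return overlap / len(cluster), overlap / len(gold_class)


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For a cluster with precision 0.5 and recall 1.0, a β above 1 raises the score and a β below 1 lowers it, because recall is the stronger of the two components.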

V-Measure
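Another alternative for evaluating clustering is the entropy-based approach proposed by Rosenberg and Hirschberg [50]. Different entropy-based evaluation metrics have been proposed for clustering so far [51,52]. Among them, the V-measure metric proposed by Rosenberg and Hirschberg [50] is the most popular one. V-measure computes the harmonic mean of homogeneity, h, and completeness, c, of clustering as stated in Equation (7).
\[ V_\beta = \frac{(1 + \beta)\, h\, c}{\beta\, h + c} \tag{7} \]
Homogeneity means that each CLUSTER contains only a few CLASSes. The best case of homogeneity is when each cluster consists of samples of only one class. Completeness, which is symmetrical to homogeneity, means that each CLASS appears in only a few CLUSTERs. The best case of completeness is when all samples of the same class are within a single cluster.
As Rosenberg and Hirschberg [50] explained, homogeneity and completeness are formally defined in (8) and (9):
\[ h = \begin{cases} 1 & \text{if } H(C, K) = 0 \\ 1 - \dfrac{H(C|K)}{H(C)} & \text{otherwise} \end{cases} \tag{8} \]
where
\[ H(C|K) = -\sum_{k=1}^{|K|} \sum_{c=1}^{|C|} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{c=1}^{|C|} a_{ck}}, \qquad H(C) = -\sum_{c=1}^{|C|} \frac{\sum_{k=1}^{|K|} a_{ck}}{N} \log \frac{\sum_{k=1}^{|K|} a_{ck}}{N} \]
\[ c = \begin{cases} 1 & \text{if } H(K, C) = 0 \\ 1 - \dfrac{H(K|C)}{H(K)} & \text{otherwise} \end{cases} \tag{9} \]
where
\[ H(K|C) = -\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{k=1}^{|K|} a_{ck}}, \qquad H(K) = -\sum_{k=1}^{|K|} \frac{\sum_{c=1}^{|C|} a_{ck}}{N} \log \frac{\sum_{c=1}^{|C|} a_{ck}}{N} \]
C = {c_i | i = 1, …, n} is the set of CLASSes, K = {k_j | j = 1, …, m} is the set of CLUSTERs, N is the number of data points in the dataset, and a_ck is the number of elements of class c in cluster k.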
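Homogeneity, completeness, and V-measure can be computed from two label lists with the sketch below. It follows the entropy-based definitions of Rosenberg and Hirschberg [50] as commonly implemented; the function name and the returned tuple are choices made here, and the degenerate cases are handled by checking for zero entropy.

```python
import math
from collections import Counter


def v_measure(gold, predicted, beta=1.0):
    """Return (homogeneity, completeness, V-measure) for two equally long label lists."""
    n = len(gold)
    joint = Counter(zip(gold, predicted))   # a_ck: members of class c in cluster k
    class_sizes = Counter(gold)
    cluster_sizes = Counter(predicted)

    def entropy(sizes):
        return -sum((s / n) * math.log(s / n) for s in sizes.values())

    h_c = entropy(class_sizes)
    h_k = entropy(cluster_sizes)
    h_c_given_k = -sum((a / n) * math.log(a / cluster_sizes[k]) for (c, k), a in joint.items())
    h_k_given_c = -sum((a / n) * math.log(a / class_sizes[c]) for (c, k), a in joint.items())

    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    denom = beta * homogeneity + completeness
    v = 0.0 if denom == 0 else (1 + beta) * homogeneity * completeness / denom
    return homogeneity, completeness, v
```

On a toy gold labeling with two classes of two instances each, placing every instance in its own cluster yields perfect homogeneity and a V-measure of about 0.67, while placing everything in one cluster yields a V-measure of 0.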

The Proposed Evaluation Metric to Evaluate the Clustering Performance
The advantage of V-measure over F-measure is that the evaluation takes completeness as well as homogeneity into consideration, while F-measure considers only the distribution of classes in clusters, i.e. homogeneity in the clustering, and does not care whether the number of classes in each cluster is minimized. This difference indicates that V-measure is more reliable than F-measure. On the other hand, V-measure alone assigns a high score to the partitioning with one instance per cluster, because in such a partitioning the number of classes in each cluster is perfectly minimized. This indicates that, despite its advantages, V-measure is not a reliable metric on its own. Therefore, to accurately evaluate the performance of the clustering result, we need to consider both metrics.
The results of the two metrics represent two extremes such that there is a trade-off between them, i.e. in most cases, if V-measure is high, F-measure is low, and vice versa. For instance, if the SOTA scores based on V- and F-measures are compared against the naïve baselines in the WSI task, it can be seen that the naïve baselines, namely 1S1C and MFS, obtain better scores than the advanced SOTA clustering algorithms, and the SOTA models are not able to beat the simple baselines. This shows that V- and F-measures in Equations (4) and (7) are not suitable for comparing the clustering performance accurately. As a result, we propose a normalized, joint metric, called J-measure, in Equation (10), which is the harmonic mean of V- and F-measures. The obtained score is unified such that both homogeneity and completeness are included.
\[ J_\beta = \frac{(1 + \beta^2)\, F\, V}{\beta^2 V + F} \tag{10} \]
where F is F-measure, obtained from Equation (4), V is V-measure, obtained from Equation (7), and β is the weighting parameter. If β > 1, then more weight is assigned to F-measure; therefore, only homogeneity in clustering is considered. In case β < 1, more weight is assigned to V-measure to consider both homogeneity and completeness. If β = 1, then there is a uniform distribution over F- and V-measure. If β = 1 in Equations (4), (7) and (10), then Equation (10) can be rewritten as Equation (11) to show how precision, recall, homogeneity, and completeness relate to each other:
\[ J = \frac{4\, P\, R\, h\, c}{P\, R\, (h + c) + h\, c\, (P + R)} \tag{11} \]
Table 2: Results of the baselines, SOTAs, and the experimented models for Persian according to V-measure (V), F-measure (F) and J-measure (J) criteria
Table 3: Results of the baselines, SOTAs, and the experimented models for English according to V-measure (V), F-measure (F) and J-measure (J) criteria
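The joint metric can be sketched as a weighted harmonic mean of the two scores. With equal weights this is simply 2FV / (F + V), which follows directly from the description of J-measure as the harmonic mean of V- and F-measures; the weighted form used below, in which β above 1 favors F-measure, is an assumption made for this illustration, as is the function name.

```python
def j_measure(f_score, v_score, beta=1.0):
    """Weighted harmonic mean of F-measure and V-measure.

    beta = 1 weights both scores equally; in this assumed form,
    beta > 1 moves the result toward the F-measure.
    """
    if f_score == 0.0 or v_score == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * f_score * v_score / (b2 * v_score + f_score)
```

Because a harmonic mean is dominated by the smaller value, a model that is strong on only one of the two metrics is penalized: for F = 0.8 and V = 0.2 the equally weighted score is 0.32, and it drops to 0 when either metric is 0.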

5-3-Setup of Experiments
In this study, we experimentally compare the performance of several clustering algorithms to induce Persian and English word senses. The clustering algorithms require a vector representation of the data. To this end, the Gensim Python library (https://radimrehurek.com/gensim/index.html) is utilized to create the word vectors according to the following settings: a) employing the Skip-gram model to capture the context of words; b) setting the context to 8 words (4 words before and 4 words after the target word), similar to Huang et al. [24], to extract the information of the local contexts of words; c) setting the vector size to 300 dimensions, similar to Neelakantan et al. [29]; and d) using the words with frequency 5 and above to build the word vectors. In the next step, the context vector of each instance is created as the weighted average of the word vectors in its context. To this end, we compute the TF-IDF of each word, following the idea proposed by Neelakantan et al. [29], and use it as the weight of the word's vector when computing the context vector.
The partitioning and hierarchy-based clustering algorithms are run in two modes, SentContext and WinContext, described in Section 4.1. In the WinContext mode, the context is set to 8 words to match the context used to build the word vectors. As a result, we perform our experiments by considering 4 words before and 4 words after the target word.
We also compute the two-tailed t-test to compare the performance of the models and study how statistically significant the difference between the models is.

5-4-Results and Discussion
Tables 2 and 3 summarize the results obtained by using various algorithms for inducing Persian and English word senses. Among the basic baselines, 1S1C obtained a higher V-measure score than the MFS baseline, but its F-measure score is the lowest. The results for the MFS baseline are the reverse. Although the 1S1C baseline considers homogeneity and completeness properties, the MFS baseline takes only homogeneity into consideration.
Among the two clustering approaches used for the SOTA models, the parametric clustering algorithm implemented in the Kmeans-3 model obtained a higher result than the CRP model based on the J-measure criterion. The difference between the models based on the J-measure was statistically significant (p < 0.05). It has to be mentioned that the F-measure results for both models are almost the same. This showed that in terms of homogeneity, the models behaved the same; but considering the completeness property, the advantage of the Kmeans-3 model over the CRP model was highlighted.
In addition to the SOTA techniques, we utilized different parametric and non-parametric methods in our study. We utilized the DBSCAN model, as a non-parametric algorithm, for inducing word senses. The model could not beat the CRP model as a baseline according to the J-measure results for both Persian and English. We further observed that the performance of the DBSCAN model was very similar to the MFS baseline, since it had a high F-measure score, which means that this clustering algorithm ends up with one single cluster in most cases and only homogeneity was taken into consideration.
As mentioned, we used two modes in our experiments, SentContext and WinContext. To have a fair comparison between the modes, we ran the WinContext mode based on the Kmeans-3 model for both Persian and English to be able to compare the results with the SentContext mode of Kmeans-3 as one of the SOTA models.
According to the results, the WinContext mode of the Kmeans-3 model outperformed the Kmeans-3 model in SentContext mode for both Persian and English based on V-measure. The difference between the modes of the Kmeans-3 model was statistically significant (p < 0.01). The superiority of the Kmeans-3 model in WinContext mode was reflected in the J-measure. This result indicated that the surrounding words of the target word in the local context have a major impact on determining the meaning of the target word, and not all of the words in the sentence are effective. Comparing the results based on F-measure, the WinContext mode obtained a higher result than the SentContext mode for Persian; however, SentContext achieved a better F-measure than WinContext for English.
Comparing the proposed models of parametric clustering in either WinContext or SentContext mode with the baselines indicated that none of the models had been able to beat the two naïve baselines: the 1S1C baseline based on V-measure, and the MFS baseline based on F-measure. Therefore, it was not possible to compare and rank the models fairly. J-measure, however, filled the gap. According to the results of the proposed evaluation metric, i.e. J-measure, the proposed WSI models outperformed the naïve baselines. The score of the joint metric has also made it possible to compare the proposed models with the SOTA models.
We further utilized two parametric algorithms to induce word senses. First, we used the divisive algorithm in SentContext and WinContext modes for both Persian and English. According to the V-measure results, the divisive algorithm outperformed the MFS baseline as well as CRP.
As can be seen, the WinContext mode of the divisive algorithm obtained a higher result than the SentContext mode for both Persian and English. This showed that the divisive algorithm required a narrow context to determine the meaning of words. The differences between the modes were statistically significant (p < 0.05). It has to be mentioned that, for Persian, neither mode of the divisive clustering algorithm was able to beat the respective mode of the Kmeans-3 model according to J-measure, whereas, for the English dataset, the WinContext mode of the divisive clustering outperformed the respective mode of the Kmeans-3 model based on V-measure, which was also reflected in J-measure.
In addition to the divisive algorithm, we used the K-means algorithm enhanced with the silhouette score, hereafter called Kmeans-silhouette, to find the best number of clusters in the two modes for both Persian and English. According to the V-measure results, the WinContext mode of this algorithm beat the SentContext mode for both datasets. The difference between the modes was statistically significant for the Persian data (p < 0.05) but not for the English dataset. Compared to Kmeans-3, the SOTA baseline, the WinContext mode of the Kmeans-silhouette model beat the respective mode of the Kmeans-3 model for both datasets according to V-measure, with a statistically significant difference (p < 0.05). This shows that the surrounding words in the local context are important for K-means clustering to induce word senses. For the English data, the SentContext mode of Kmeans-silhouette also beat the SentContext mode of the Kmeans-3 model with a statistically significant difference (p < 0.05); for the Persian data it did not, and the SentContext mode performed slightly worse. Comparing the Kmeans-silhouette model to the divisive algorithm for both modes of the two languages, the Kmeans-silhouette model beat the divisive clustering algorithm with a statistically significant difference (p < 0.05).
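The idea of selecting the number of clusters with the silhouette score can be sketched as follows. This is a minimal illustration using scikit-learn, not the implementation used in the experiments: the candidate range of 2 to 6 clusters, the random seed, the function name `kmeans_silhouette`, and the synthetic two-dimensional data are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def kmeans_silhouette(vectors, k_candidates=range(2, 7), seed=0):
    """Run K-means for each candidate k and keep the clustering with the
    highest mean silhouette score. Returns (best_k, labels), or
    (None, None) if no candidate k is valid for the data."""
    vectors = np.asarray(vectors, dtype=float)
    best_k, best_score, best_labels = None, float("-inf"), None
    for k in k_candidates:
        # The silhouette score is defined only for 2 <= k <= n_samples - 1.
        if k < 2 or k >= len(vectors):
            continue
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(vectors)
        if len(set(labels.tolist())) < 2:
            continue  # degenerate clustering, silhouette undefined
        score = silhouette_score(vectors, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


# Synthetic stand-in for the context vectors of one target word:
# three well-separated groups, playing the role of three senses.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
demo = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])

best_k, labels = kmeans_silhouette(demo)
print("selected number of clusters:", best_k)
```

Unlike a model with a fixed number of clusters such as Kmeans-3, this procedure lets the number of induced senses vary per target word, at the cost of running K-means once for every candidate value.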
We further ranked the models and found the best model, which has a reasonably good performance based on J-measure. In general, the WinContext mode of the Kmeans-silhouette model performed the best for both Persian and English. This showed that the surrounding words in the context play a significant role in determining the meaning of the word, and that not all of the words in the sentence play a major role. This finding also reduces the computation time needed to produce the words' vectors and perform clustering.

6-Conclusion
In this paper, we studied the performance of various clustering algorithms, from parametric to non-parametric, to induce words' senses automatically. The algorithms were run on Persian and English datasets. Furthermore, two modes, WinContext and SentContext, were used to build the words' vectors. Finally, we utilized two evaluation criteria, namely V-measure and F-measure. There is always a trade-off between these two metrics, and a model evaluated with either of them alone could not beat the corresponding naïve baseline. Therefore, we proposed J-measure, the harmonic mean of V-measure and F-measure, to ease comparison of the models. The results were compared with two basic baselines, 1S1C and MFS, and two SOTA models, CRP and Kmeans-3. By comparing the experimental results, we concluded that the parametric clustering algorithms perform better than the non-parametric clustering algorithms for inducing word senses. Among the parametric clustering algorithms, the Kmeans-silhouette clustering model in WinContext mode performed the best at inducing the senses of both Persian and English words. This result indicated that the surrounding words in the local context are more effective in determining the meaning of words than the other words in the sentence.
Devlin et al. [53] proposed a model for language representation known as the Bidirectional Encoder Representations from Transformers (BERT) model. This model is currently the SOTA model. One direction for future work is to use the BERT embedding model for the WSI task and to compare the results with the Word2Vec-based embedding model.