Overcoming the Link Prediction Limitation in Sparse Networks using Community Detection

Link prediction seeks to detect missing links and links that may be established in the future, given the network structure or node features. Numerous methods have been presented for improving the basic unsupervised neighbourhood-based methods of link prediction. A major issue confronting all these methods is that many of the available networks are sparse. This results in a high volume of computation, longer processing times, greater memory requirements, and poorer results. This research presents a new method for link prediction based on community detection in large-scale sparse networks. The communities in the network are first identified, and link prediction is then performed within each community using neighbourhood-based methods. Next, link prediction is carried out between the clusters in a specified manner so as to make maximal use of the network capacity. The community detection algorithms used are Best partition, Link community, Infomap and Girvan-Newman, and the datasets used in the experiments are Email, HEP, REL, Wikivote, Word and PPI. Three measures are used to evaluate the proposed method: precision, computation time and AUC. The results obtained over the different datasets demonstrate that extra calculations are avoided and precision is increased. Runtime is also reduced considerably. Moreover, in many cases the Best partition community detection method gives better results than the other community detection algorithms.


1-Introduction
As networks grow, link prediction greatly helps trade and communication in many large-scale online commercial and social networks. Besides attempting to find missing links, link prediction also seeks to predict new links that may be established in the future. Predicting this category of links is valuable in a complex network. Moreover, for some networks, such as protein-protein interaction networks, detecting new or missing links in the laboratory is expensive. Correct prediction of links in such networks can play a pivotal role in the treatment of many diseases such as AIDS and cancer. However, these networks are mostly incomplete, low-density, and sparse, and correcting them through practical experimentation, especially for biological networks, incurs high costs. Link prediction can predict and subsequently improve the structure of the networks [1]. Many prediction methods have been presented in an attempt to improve prediction results. Many of the available networks are sparse, which causes a great deal of extra calculation: the zero entries that need to be scored in the associated adjacency matrix far outnumber the existing links, which wastes computation, time, and resources. To the best of our knowledge, this issue has been mentioned implicitly or explicitly in some studies, but an appropriate solution has not been found [2] [3]. This paper seeks to present a new, more accurate approach for link prediction in sparse networks. To address the main pitfalls of sparse networks for link prediction, we reduce the time-consuming computations and improve the precision as well. The extra computations are eliminated by removing unnecessary predictions that have no significant effect on the main results. We achieve this by clustering the nodes and localizing the computations on the dense parts of the network.
After that, we consider some effective strategies for link prediction between clusters. The proposed method can be used both for predicting new links and for finding missing links, especially in sparse networks: the sparser the network, the better the results tend to be. We use the terms clustering and community detection interchangeably throughout this paper. The rest of the paper is organized as follows. In section 2, the related works are reviewed. In section 3, the proposed method and the evaluation are explained. In section 4, results and discussion are reported, and finally, in section 5, the conclusion and future work are discussed.

2-Related Works
In this section, after a short overview of the primary related concepts, we review the related research on link prediction using community detection and on link prediction for sparse networks. Link prediction methods fall into two major categories: unsupervised and supervised. In several unsupervised methods, a score is assigned to each pair of nodes that is not yet linked. The higher the score, the greater the probability that a link will be established. These methods are divided into two broad categories: neighborhood-based and path-based methods [4]- [7]. We use the neighborhood-based methods and refer to them as basic methods, including CN, JC, AA, RA, and PA. The full name and ranking formula for each method are shown in Table 1. New ideas are commonly tested with these basic methods.
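The basic neighborhood-based scores can be illustrated with a minimal sketch. Table 1 is not reproduced in this text, so the formulas below are the commonly used definitions of CN (common neighbours), JC (Jaccard coefficient), AA (Adamic-Adar), RA (resource allocation) and PA (preferential attachment), and should be read as an assumption rather than as the exact formulas of Table 1. The function names `build_adjacency`, `similarity` and `rank_candidates` are hypothetical.

```python
import math
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def similarity(adj, x, y, method):
    """Score an unlinked pair (x, y) with one of the basic methods (assumed standard forms)."""
    nbrs_x, nbrs_y = adj[x], adj[y]
    common = nbrs_x & nbrs_y
    if method == "CN":
        return float(len(common))
    if method == "JC":
        union = nbrs_x | nbrs_y
        return len(common) / len(union) if union else 0.0
    if method == "AA":
        # A common neighbour of two distinct nodes has degree >= 2, so log > 0.
        return sum(1.0 / math.log(len(adj[z])) for z in common)
    if method == "RA":
        return sum(1.0 / len(adj[z]) for z in common)
    if method == "PA":
        return float(len(nbrs_x) * len(nbrs_y))
    raise ValueError(f"unknown method: {method}")


def rank_candidates(adj, method):
    """Score every currently unlinked pair and sort by descending score."""
    scored = [
        ((x, y), similarity(adj, x, y, method))
        for x, y in combinations(sorted(adj), 2)
        if y not in adj[x]
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy = build_adjacency([(1, 2), (1, 3), (2, 3), (3, 4)])
    for name in ("CN", "JC", "AA", "RA", "PA"):
        print(name, rank_candidates(toy, name))
```

Note that `rank_candidates` scores every unlinked pair in the graph, which is exactly the cost that becomes prohibitive on large sparse networks.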

2-1-A Review of Link Prediction using Community Detection
Community structure can be observed in many of the available networks. The notion of community or cluster depends largely on the type of the network or the information it contains [8]. In a metabolic network or a protein-protein interaction network, for instance, a community can be a set of adjacent proteins that perform a biological function inside a cell [9]. In a commercial network, a cluster can be a set of customers with similar purchasing backgrounds or similar tastes [10]. On the web, a cluster can be a set of pages about a certain topic [11]. In [12], the number of links between every two nodes is calculated and normalized by the number of possible links between them. This value is taken as the probability that there is a link between the two nodes. The drawback of this method is that prediction is made among all the nodes, and the network undergoing prediction is not necessarily sparse. Another cluster-related type of link prediction involves the stochastic block model [13], [14]. In this type of model, all the nodes are partitioned into groups, and the probability that two nodes are connected is obtained from their membership in the relevant clusters. The most significant disadvantage of these methods is that they are impractical for large-scale networks due to the high time complexity of obtaining the optimal clustering. In a similar study conducted in 2012 [15], community information has been used differently for prediction, as a characteristic of the nodes. The major drawback of these methods is the high time complexity of obtaining a clustering in large-scale networks and of the computation for both nodes. A method referred to as the spectral algorithm has been presented in [3]. The approach is similar to a semi-local method, which uses neither local information nor general network paths, making it highly time-consuming and infeasible in large networks.
The authors of [2] have proposed a distributed clustering-based method for link prediction, which depends on Google's MapReduce technology. Although the paper's abstract mentions that clustering is performed mainly on dispersed vertices so that they are grouped in an integrated fashion, the paper does not claim that the method applies to sparse graphs.

2-2-A Review of Link Prediction in Sparse Graphs
Although numerous works on link prediction have attempted to improve the precision of the results, sparsity has received little attention in them. However, many real-world networks are sparse, which causes poor prediction results and loss of time. Hence, since 2013 some works have asked how this issue can best be avoided [3]. In a supervised solution, the method in [16] factorizes the incidence matrix rather than the adjacency matrix, demonstrating that incidence matrix factorization (IMF) performs better than adjacency matrix factorization (AMF) on a sparse matrix as well.
It has been mentioned in [17] that the available link prediction algorithms focus on triangular structures and therefore exhibit low efficiency over sparse tree networks. A method based on network degree heterogeneity has been presented in that paper. As the authors state, however, they have examined only tree structures, whereas many complex networks in the real world are sparse and do not necessarily contain tree structures. In [18], it has been assumed that the habits and characteristics of social network users correspond to their social communication on the networks. That is, links are predicted through the notion of aligned social networks. In [19], the focus is mainly on the structure of the network, and the paper models the problem based on intrinsic characteristics of the network. A drawback of these models is the cost of training the model to handle big data. Another interesting study on sparse networks is [20], where the relationship between clustering and the precision of the methods has been investigated based on network structure. Even though all the unsupervised link prediction methods mentioned above attempt to improve the results, none of them has considered reducing the extra computation in sparse networks by splitting the network into separate communities and improving the results in this way. In this article, we try to overcome the poor results of link prediction in sparse networks by dividing networks into multiple communities and concentrating on the inter-community and intra-community computations. As a consequence, we also shorten the execution time of link prediction in sparse networks.

3-Proposed Method
The proposed algorithm involves three major phases, and the validity of each phase affects the final results. The first phase is data preparation and preprocessing, which is explained in the next section. The remaining phases are clustering the network into partitions, performing intra-community and inter-community link prediction, and integrating the results according to specific policies. The phases are described below.

3-1-Data
Since some datasets are directed, we need to convert them to undirected graphs, because the basic link prediction algorithms do not consider the direction of edges [21]. First, the dataset is mapped to a matrix; the matrix is then made symmetric, and the elements on the main diagonal are set to zero. In this research, five datasets have been used for experimentation: Email (the email communications at the Rovira i Virgili University, http://konect.uni-koblenz.de/networks/arenas-email), the collaboration network on high-energy physics, the collaboration network of co-authors on physics-related topics on the arXiv website, the communication network of associated words, and the communication network of human proteins. Table 2 describes the properties of each dataset. The quality and precision of link prediction in this research depend to a large extent on correct cluster detection. The clustering methods used are as follows. Fast unfolding [22] is a link-based community detection algorithm. Link community [23] finds communities that may overlap, that is, communities that may share nodes. Another method used in this research is the InfoMap community detection algorithm [24], [25]. Finally, the Girvan-Newman algorithm uses the edge betweenness feature [26].
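The preprocessing step described above (a symmetric matrix with a zero main diagonal) can be sketched as follows. This is a minimal illustration under the assumption that nodes have already been relabelled as integers 0..n-1; the helper names `to_undirected_matrix` and `is_valid_adjacency` are hypothetical and not taken from the original implementation.

```python
def to_undirected_matrix(n_nodes, directed_edges):
    """Build a symmetric 0/1 adjacency matrix with a zero main diagonal.

    Assumes nodes are labelled 0..n_nodes-1. A directed edge (u, v) is
    treated as an undirected link, and self-loops are dropped.
    """
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for u, v in directed_edges:
        if u == v:
            continue  # keep the main diagonal at zero
        matrix[u][v] = 1
        matrix[v][u] = 1  # enforce symmetry
    return matrix


def is_valid_adjacency(matrix):
    """Check the two properties the basic methods rely on."""
    n = len(matrix)
    symmetric = all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))
    zero_diagonal = all(matrix[i][i] == 0 for i in range(n))
    return symmetric and zero_diagonal


if __name__ == "__main__":
    m = to_undirected_matrix(3, [(0, 1), (1, 0), (2, 2), (1, 2)])
    print(m, is_valid_adjacency(m))
```

A dense list-of-lists matrix is used only for readability; for the large sparse networks discussed here, an adjacency-set or sparse-matrix representation would be the more practical choice.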

3-2-Cluster-Based Sparse Link Prediction (CBSLP)
For ease of reference, the abbreviation CBSLP, which stands for Cluster-Based Sparse Link Prediction, is used hereafter. After pre-processing, the data are first mapped into a graph, and the community detection algorithms mentioned in the previous section are then applied to it (line 9 of Figure 2). Prediction is made within each community. Thereafter, a matrix is defined for the inter-community step, each entry of which holds all the edges between one pair of communities. All the edges are traversed to find the inter-community edges, and each such edge is inserted into the relevant entry of the matrix. Thus, graphs of inter-community edges are finally obtained. Next, each of these graphs is subjected to link prediction, each of the four basic neighborhood algorithms is examined (Table 1), and new links are predicted. Any duplicate edges resulting from the prediction in both steps are eliminated (Figure 2).

3-2-1-Intra-Cluster Link Prediction
Using community detection, we divide the whole graph into several separate subgraphs that can be investigated independently for link prediction. Because the nodes inside a community are closely connected, predictions within it can be made with more confidence, which gives better prediction results. CBSLP operates like a divide-and-conquer method: the communities are found first, and the relations between those communities are examined afterwards. Figure 1(a) shows the obtained communities, two of which are labelled 1 and 2. Edges and vertices located in a single community are separated, and prediction is made within each of the communities, as is clear from Figure 1(b). The edges indicated by dashed lines are very likely to be predicted by the basic methods.
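The intra-cluster step can be sketched as below, under stated assumptions. The community assignment is taken as given (a hypothetical `membership` dictionary mapping each node to a community label, as any of the detection algorithms could produce), and common neighbours (CN) stands in for whichever basic method is used. Only unlinked pairs inside the same community are scored, and neighbours are counted within the community subgraph.

```python
from itertools import combinations


def intra_community_scores(edges, membership):
    """Score unlinked node pairs inside each community with common neighbours.

    `membership` maps node -> community label (assumed to cover every node).
    Edges that cross communities are ignored in this step.
    """
    # Adjacency restricted to intra-community edges, built in one pass.
    adj = {node: set() for node in membership}
    for u, v in edges:
        if u != v and membership[u] == membership[v]:
            adj[u].add(v)
            adj[v].add(u)

    # Group nodes by community label.
    groups = {}
    for node, label in membership.items():
        groups.setdefault(label, []).append(node)

    scores = {}
    for nodes in groups.values():
        for x, y in combinations(sorted(nodes), 2):
            if y not in adj[x]:
                scores[(x, y)] = len(adj[x] & adj[y])
    return scores


if __name__ == "__main__":
    toy_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (5, 6), (6, 7), (4, 5)]
    toy_membership = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B", 7: "B"}
    print(intra_community_scores(toy_edges, toy_membership))
```

In this toy graph only three candidate pairs are scored, whereas scoring the whole graph would examine every unlinked pair, including cross-community pairs such as (1, 7).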

3-2-2-Inter-Cluster Link Prediction
After dividing the main graph into communities and predicting the intra-community links in each community, it is necessary to investigate the probable links between each pair of communities, because there are certainly several edges between communities that have not yet been considered in the calculations.
Here, we generate a graph between every two separate communities from their interconnecting edges, and predict links within each connected pair of communities. The number of communities depends on the community detection algorithm. Some algorithms, like Best partition, automatically determine the appropriate number of communities, while other clustering algorithms need a predefined number of communities into which to break down the network. We use the elbow method to determine the number of communities automatically. To perform the inter-community link prediction, we first collect the common links between every two communities. Then, to increase the accuracy of the computations, we add the links between the nodes located in each community that participate in inter-community relations (see the example in Figure 1). Thus, the inter-community edges are taken into account, the total capacity of the network is used for prediction, and extra calculation is avoided at the same time. Traversing all the common edges between communities to find the inter-community relations results in isolating a new community between each pair of connected communities. Figure 1(c) shows the approach for two different communities. The implementation of the proposed method using inter-community link prediction is also shown in Figure 2.
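One possible reading of the inter-cluster step is sketched below. It is an interpretation, not the implementation of Figure 2: for each pair of communities, the bridging edges are collected, the intra-community edges among the endpoints of those bridges are added, and unlinked cross-community pairs in the resulting small graph are scored with common neighbours. The names `inter_community_graphs` and `score_bridge_graph`, and the choice of CN, are assumptions.

```python
from itertools import combinations


def inter_community_graphs(edges, membership):
    """Map each unordered pair of community labels to its bridge-graph edge set."""
    bridges = {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu != cv:
            key = tuple(sorted((cu, cv)))
            bridges.setdefault(key, set()).add((u, v))

    graphs = {}
    for key, inter_edges in bridges.items():
        # Boundary nodes are the endpoints of the bridging edges.
        boundary = {node for edge in inter_edges for node in edge}
        sub_edges = set(inter_edges)
        # Add intra-community links among boundary nodes.
        for u, v in edges:
            if u != v and u in boundary and v in boundary and membership[u] == membership[v]:
                sub_edges.add((u, v))
        graphs[key] = sub_edges
    return graphs


def score_bridge_graph(sub_edges, membership):
    """Common-neighbour scores for unlinked cross-community pairs of one bridge graph."""
    adj = {}
    for u, v in sub_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {}
    for x, y in combinations(sorted(adj), 2):
        if membership[x] != membership[y] and y not in adj[x]:
            scores[(x, y)] = len(adj[x] & adj[y])
    return scores


if __name__ == "__main__":
    toy_edges = [(1, 2), (2, 3), (4, 5), (1, 4), (2, 5)]
    toy_membership = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
    for pair, sub in inter_community_graphs(toy_edges, toy_membership).items():
        print(pair, score_bridge_graph(sub, toy_membership))
```

Node 3 takes no part in any bridging edge, so it is left out of the bridge graph entirely, which is how this step keeps the number of scored pairs small.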

3-3-Evaluation
Three factors can be used for measuring the success of link prediction in large sparse networks: precision, AUC (Area Under Curve), and runtime. To calculate precision, 10-fold cross validation is performed. For each fold, 10% of the existing links are removed randomly so that the algorithm can try to predict them again. This is done ten times, and each time a different 10% of the links is selected for removal. This ensures that each link is withheld exactly once, so all links are present in the training data and the test data an equal number of times. Another evaluation metric for unsupervised link prediction is AUC. It can be interpreted as the probability that a randomly chosen missing link is given a higher similarity score than a randomly chosen nonexistent link. If, among $n$ independent comparisons, the missing link has a higher score $n'$ times and the two have the same score $n''$ times, the AUC value is calculated as follows [13]: $\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}$. For runtime, the link prediction procedure detailed above is timed from the beginning to the end of its execution, and the mean runtime over the ten iterations is calculated. This measure can be used to assess the speed of the algorithm.
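The evaluation protocol can be illustrated with a short sketch. Two simplifications are assumed here: the folds are produced by a seeded shuffle (the helper `k_fold_edge_splits` is hypothetical), and AUC is computed over all missing/nonexistent score pairs rather than over n sampled comparisons, which is the exhaustive counterpart of the sampled definition.

```python
import random


def k_fold_edge_splits(edges, k=10, seed=0):
    """Yield (training_edges, withheld_edges) so that each edge is withheld exactly once."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        withheld = folds[i]
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield training, withheld


def auc(missing_scores, nonexistent_scores):
    """AUC = (n' + 0.5 * n'') / n over all (missing, nonexistent) comparisons."""
    n = len(missing_scores) * len(nonexistent_scores)
    if n == 0:
        raise ValueError("both score lists must be non-empty")
    higher = sum(1 for m in missing_scores for q in nonexistent_scores if m > q)
    equal = sum(1 for m in missing_scores for q in nonexistent_scores if m == q)
    return (higher + 0.5 * equal) / n


if __name__ == "__main__":
    toy_edges = [(i, i + 1) for i in range(20)]
    for training, withheld in k_fold_edge_splits(toy_edges):
        print(len(training), len(withheld))
    print(auc([3, 2, 1], [2, 0]))
```

A value of 0.5 corresponds to random scoring, so the amount by which the AUC exceeds 0.5 indicates how much better than chance a method performs.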

4-Results and Discussion
In this section, we investigate the results of the proposed method from different viewpoints: the decrease in the number of checked edges, a comparison of the best-performing link prediction functions, and a runtime comparison of CBSLP with the basic methods.

4-1-Number of Edges under Examination
An interesting difference between the proposed method and the basic algorithms such as AA, PA, JC, and CN lies in the number of edges and nodes under examination. Examining fewer of them allows computations to be carried out in shorter times, regardless of the processing hardware used, leading to good results over sparse networks. A summary of the comparison is provided in Table 3. Note that CBSLP attempts to remove or ignore the least important links. The table shows the number of initial zero entries in the similarity matrix that must be calculated by the basic methods and by the proposed method. For example, for the Email dataset, the basic methods perform about 641844 calculations, while the proposed method reduces this to about 163350 in the worst case, roughly one-fourth. There are also some inter-community edges that should be taken into account, but they are few and can be ignored.
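The reduction in the number of examined pairs can be made concrete with a small counting sketch. It is an illustration only and does not reproduce the figures of Table 3: it counts unlinked pairs as n(n-1)/2 minus the number of links, which may differ from the counting convention used for that table. The function names and the toy graph are hypothetical.

```python
def candidate_pairs_whole(n_nodes, n_edges):
    """Unlinked pairs a basic method must score on the full undirected graph."""
    return n_nodes * (n_nodes - 1) // 2 - n_edges


def candidate_pairs_by_community(edges, membership):
    """Unlinked pairs scored when prediction is restricted to each community."""
    sizes = {}
    for label in membership.values():
        sizes[label] = sizes.get(label, 0) + 1
    # Distinct undirected intra-community links.
    intra_links = {
        frozenset((u, v))
        for u, v in edges
        if u != v and membership[u] == membership[v]
    }
    possible = sum(size * (size - 1) // 2 for size in sizes.values())
    return possible - len(intra_links)


if __name__ == "__main__":
    toy_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (5, 6), (6, 7), (4, 5)]
    toy_membership = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B", 7: "B"}
    print(candidate_pairs_whole(7, len(toy_edges)))
    print(candidate_pairs_by_community(toy_edges, toy_membership))
```

Because the number of possible pairs grows quadratically with the number of nodes, splitting a network into several smaller communities lowers the total number of candidate pairs sharply, which is the effect summarized in Table 3.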

4-2-Comparison with Similar Competing Methods
To evaluate CBSLP, its performance is compared with that of the basic methods. Table 4 summarizes the results obtained by the proposed method with each of the community detection methods mentioned above. The column containing the cumulative results reports the overall results obtained from both the intra-community and inter-community phases. The proposed method makes no claim for dense graphs such as HEP or REL, because it may not be appropriate for such a graph structure in a particular application, and may also lead to the elimination of valuable predictions from the graph. In Table 4, BP, LC, and info are the abbreviations of Best partition, Link community, and Infomap, respectively, all of which are the community detection methods mentioned before. The bold numbers show the best result in each column of Table 4. CBSLP achieved better results in sparse networks such as Email, Word, Wiki-Vote, and PPI. A dash (-) in a column means that the corresponding method could not complete the calculations within a reasonable time (72 hours). Another evaluation metric is AUC. The results in Table 5 confirm the findings of the precision metric.

4-3-Runtime Analysis and Comparison
As discussed in the preceding sections, the basic methods have not been successful in link prediction over the Word network, and could not solve it within a reasonable time (72 hours). It is also noticeable that basic methods such as CN could probably not be run over several similar large networks within a reasonable time, while CBSLP successfully computed a sample within an acceptable time in this research. Therefore, this method has improved time as well, as shown in Table 3. The specification of the system used in this research is shown in Table 6. The runtime of each method was calculated for each dataset, and the results can be observed in Table 7. About 0.031 of the links are predicted to occur between communities over a network like Email, which means that about 20% of the links occur between communities rather than within them. Unfortunately, however, not much changes when the inter-community and intra-community links are predicted and evaluated at the same time, as is clear from the cumulative results column of the proposed method in Table 4. This is because two lists with different scores are merged, which causes the scores to drift toward the list with higher precision, so the results do not change and the final result may even worsen. If the results are cumulated correctly, the method should also succeed in denser graphs.

5-Conclusion and Future Works
The proposed method, CBSLP, provides a framework for large sparse graphs, since it prevents extra computation, improves runtime, and saves memory. Because of the details of its strategy, it can also be regarded as a new link prediction method for sparse networks. In the proposed method, clustering was used as a tool not only for improving the prediction results but also for eliminating extra calculation. However, CBSLP is an initial version of the framework, and much remains to be done for it to evolve. To increase the precision of the proposed method, link prediction could also be made using path-based methods. A path-based algorithm recommended for sparse graphs is the SRW method, which would probably improve the results. One can also experiment with newer and better community detection algorithms for higher precision, such as [27] or [28]. Moreover, a mechanism can be sought to use a weighted version of the network graph to improve the results using inter-cluster relations and their outcomes. It is even possible to apply rank aggregation to link prediction lists with different scores to achieve better results. Methods such as those in [15] or [29] can be used to employ cluster information in order to improve the precision of the proposed method.