Community Detection Algorithms using R

Service as a Freelancer

If you want hire me as a freelancer please use the following links.

For Fiverr https://www.fiverr.com/dataanalyst_85
For Upwork https://www.upwork.com/freelancers/~017667e5ef4347dcb4

Dataset Downloading

#Downloading the datasets and storing in variables. 
dolphin <- read.graph(file='https://lipn.univ-paris13.fr/~kanawati/datasets/dolphins.gml', format='gml')
karate <- read.graph(file='https://lipn.univ-paris13.fr/~kanawati/datasets/karate.gml', format='gml')
football <- read.graph(file='https://lipn.univ-paris13.fr/~kanawati/datasets/football.gml', format='gml')
polblogs <- read.graph(file='https://lipn.univ-paris13.fr/~kanawati/datasets/polblogs.gml', format='gml')

Step 1: Community Detection Methods

Here, in step 1, we have require to detect the communities from benchmark graphs (i.e., Karate, football, dolphins and polblogs) using state-of-the-art community detection alogrithms (i.e., walktrap, edge.betweenness, louvain and infomap) provided by the igraph library. It is worth mentioning that there are more than one methods for each algorithm available in the igrpah library e.g., cluster_infomap() & infomap.community() can be used alternatively. So, in the first step, communities are detected and compared with the ground truth as follows:

#Step 1: find communities on karate
k1 = cluster_walktrap(karate)
k2 = cluster_infomap(karate)
k3 = cluster_edge_betweenness(karate)
k4 = cluster_louvain(karate)
#Step 1: find communities on dolphins
d1 = cluster_walktrap(dolphin)
d2 = cluster_infomap(dolphin)
d3 = cluster_edge_betweenness(dolphin)
d4 = cluster_louvain(dolphin)
#Step 1: find communities on football
f1 = cluster_walktrap(football)
f2 = cluster_infomap(football)
f3 = cluster_edge_betweenness(football)
f4 = cluster_louvain(football)
#Step 1: find communities on polblogs
p1 = cluster_walktrap(polblogs)
p2 = cluster_infomap(polblogs)


#comparison between the karate network and the grund truth
compare(k1,V(karate)$value,method = "nmi")

## [1] 0.504178

compare(k2,V(karate)$value,method = "nmi")

## [1] 0.6994882

compare(k3,V(karate)$value,method = "nmi")

## [1] 0.5798278

compare(k4,V(karate)$value,method = "nmi")

## [1] 0.5866348

#comparison between the Football network and the grund truth
compare(f1,V(football)$value,method = "nmi")

## [1] 0.8873604

compare(f2,V(football)$value,method = "nmi")

## [1] 0.9005465

compare(f3,V(football)$value,method = "nmi")

## [1] 0.8788884

compare(f4,V(football)$value,method = "nmi")

## [1] 0.8549734

#comparison between the Dolphin network and the grund truth
compare(d1,V(dolphin)$value,method = "nmi")

## [1] 0.53725

compare(d2,V(dolphin)$value,method = "nmi")

## [1] 0.593211

compare(d3,V(dolphin)$value,method = "nmi")

## [1] 0.5541605

compare(d4,V(dolphin)$value,method = "nmi")

## [1] 0.5108534

#comparison between the Polblogs network and the grund truth
compare(p1,V(polblogs)$value,method = "nmi")

## [1] 0.3759933

compare(p2,V(polblogs)$value,method = "nmi")

## [1] 0.1089254

Step 2: Visualize the communities

Here, in the second step, we have write the following code to visualize the communities detected in the previous section.

#Karate club 
plot(k1,karate)

plot(k2,karate)

plot(k3,karate)

plot(k4,karate)

#football network 
plot(f1,football,vertex.label = NA)

plot(f2,football,vertex.label = NA)

plot(f3,football,vertex.label = NA)

plot(f4,football,vertex.label = NA)

#Dolphin network plot
plot(d1,dolphin,vertex.label = NA)

plot(d2,dolphin,vertex.label = NA)

plot(d3,dolphin,vertex.label = NA)

plot(d4,dolphin,vertex.label = NA)

#Polblogs network plot
plot(p1,polblogs,vertex.label = NA)

plot(p2,polblogs,vertex.label = NA)

Step 3: Compare the Detected Communities

As we can see the visualization of networks in the previous section, different community detection algorithms are producing the different communities. In order to check distance between two community structures, we will compare these communities in this section. The function compare() will be used for this purpose with the method of nmi. This function will return the fractional value between 0 and 1, where, 1 represents the high difference, while, 0 represents the low difference. In other words, value close to zero will tell us both community structure are too much similar, while, value close to 1 will tell us both community structure are too much different i.e., both community detections producing the different results.

#Comparisons between communities computed for karate network
compare(k1,k2,method = "nmi")

## [1] 0.6686405

compare(k1,k3,method = "nmi")

## [1] 0.7311681

compare(k1,k4,method = "nmi")

## [1] 0.7619674

compare(k2,k3,method = "nmi")

## [1] 0.758949

compare(k2,k4,method = "nmi")

## [1] 0.8598774

compare(k3,k4,method = "nmi")

## [1] 0.8320913

#Comparisons between communities computed for football network
compare(f1,f2,method = "nmi")

## [1] 0.9378367

compare(f1,f3,method = "nmi")

## [1] 0.8985247

compare(f1,f4,method = "nmi")

## [1] 0.8966576

compare(f2,f3,method = "nmi")

## [1] 0.9090277

compare(f2,f4,method = "nmi")

## [1] 0.9074813

compare(f3,f4,method = "nmi")

## [1] 0.9492747

#Comparisons between communities computed for Dolphin network
compare(d1,d2,method = "nmi")

## [1] 0.7312703

compare(d1,d3,method = "nmi")

## [1] 0.6607129

compare(d1,d4,method = "nmi")

## [1] 0.7133595

compare(d2,d3,method = "nmi")

## [1] 0.8730413

compare(d2,d4,method = "nmi")

## [1] 0.7425744

compare(d3,d4,method = "nmi")

## [1] 0.7329455

#Comparisons between communities computed for Polblogs network
compare(p1,p2,method = "nmi")

## [1] 0.6823554

The above comparisons are performed between the community detection algoriths on the same network. If we look at the results of karate network for their community structures, we can say that there is a maximum distance between community structure 2 and 4. In addition, also their is a high distance between community structure 3 and 4. On the other hand, there is a less distance between community structure 1 and 2. It means, for the karate network, walktrap and infomap community detection algorithms are producing little bit similar results. In case of football network, there is huge distance between all the community structures, where, the maximum distance is between community structure 3 and 4. In addition, community structures 1, 2 and 3 are also close to the high distance. Moreover, in case of Dolphin network, the community structures 2 and 3 have high distance, while, the community structures 1 and 3 have the low distance. As the polblogs network was directed so, there were not possible to compute the community structures through edge.betweenness() and louvain () algorithms. However, we have computed the community structures through walktrap and infomap algoriths and there were also some warnings about modularity. The distance between community structures was less as compare to the previous networks.

Step 4: Modularity of the Network

This function calculates how modular is a given division of a graph into subgraphs. Furthermore, it tells us how good the division? The modularity function returns the results in the form of fraction value. Where, value close to 1 represents the minimum partitions, while, 0 represens the maximum paritions. In simple words, modularity gives the modularity score of the partitioning. For algorithms that do not result a single partitioning, the highest modularity value is returned.

#Modularity of Karate Network
modularity(k1)

## [1] 0.3532216

modularity(k2)

## [1] 0.4020381

modularity(k3)

## [1] 0.4012985

modularity(k4)

## [1] 0.4188034

#Modularity of Football Network
modularity(f1)

## [1] 0.6038112

modularity(f2)

## [1] 0.5688783

modularity(f3)

## [1] 0.6005129

modularity(f4)

## [1] 0.6020821

#Modularity of Dolphin Network
modularity(d1)

## [1] 0.4888454

modularity(d2)

## [1] 0.5277283

modularity(d3)

## [1] 0.5193821

modularity(d4)

## [1] 0.5185317

#Modularity of Karate Network
modularity(p1)

## [1] 0.4302574

modularity(p2)

## [1] 0.02305624

In case of karate network, high modularity is produced by community structure 4 (i.e., louvain algorithm), while, low modularity is produced by community structure 1 (i.e., walktrap algorithm). For the football network, high modularity is produced by community structure 1 and low modularity is produced by community structure 2 (i.e., infomap algorithm). Likewise, the community structures 3 and 4 are also close to the community structure 1. Similarly, in case of dolphin network, the community structures 3 and 4 producing the high modulrity, while, the community structure 1 produced the low modularity score i.e., there are many partitions as compre to the community structure 3 and 4. For the polblogs network, the community structure 2 (i.e., infomap algorithm) produced the low modularity and the community structure 1 (i.e., walktrap) produced the high modularity. Overrall, walktrap algorithm seen to be worst in paritioning the network, while, there were good results for the louvain algorithm.

Step 5: Apply Cluster Label Propgation Algorithm

This is a fast, nearly linear time algorithm for detecting community structure in networks. In works by labeling the vertices with unique labels and then updating the labels by majority voting in the neighborhood of the vertex. It also returns the value between 0 and 1.

#Apply Cluster label on Karate Network
maxv1 = -1
minv1 = 1000
sum1 = 0
for (i in 1:10) {
   l1 = cluster_label_prop(karate)
   res = compare(l1,V(karate)$value,method = "nmi")
   sum1 = sum1 + res
   maxv1 = ifelse(res>maxv1,res,maxv1)
   minv1 = ifelse(res<minv1,res,minv1)
}
avg1 = sum1/10
avg1

## [1] 0.7207055

maxv1

## [1] 0.8371695

minv1

## [1] 0.5125283

#Apply Cluster label on Football Network
maxv2 = -1
minv2 = 1000
sum2 = 0
for (i in 1:10) {
   l1 = cluster_label_prop(football)
   res = compare(l1,V(football)$value,method = "nmi")
   sum2 = sum2 + res
   maxv2 = ifelse(res>maxv2,res,maxv2)
   minv2 = ifelse(res<minv2,res,minv2)
}
avg2 = sum2/10
avg2

## [1] 0.8943941

maxv2

## [1] 0.9268794

minv2

## [1] 0.8581655

#Apply Cluster label on Dolphin Network
maxv3 = -1
minv3 = 1000
sum3 = 0
for (i in 1:10) {
   l1 = cluster_label_prop(dolphin)
   res = compare(l1,V(dolphin)$value,method = "nmi")
   sum3 = sum3 + res
   maxv3 = ifelse(res>maxv3,res,maxv3)
   minv3 = ifelse(res<minv3,res,minv3)
}
avg3 = sum3/10
avg3

## [1] 0.6834627

maxv3

## [1] 1

minv3

## [1] 0.3855657

#Apply Cluster label on Polblogs Network
maxv4 = -1
minv4 = 1000
sum4 = 0
for (i in 1:10) {
   l1 = cluster_label_prop(polblogs)
   res = compare(l1,V(polblogs)$value,method = "nmi")
   sum4 = sum4 + res
   maxv4 = ifelse(res>maxv4,res,maxv4)
   minv4 = ifelse(res<minv4,res,minv4)
}
avg4 = sum4/10
avg4

## [1] 0.2930181

maxv4

## [1] 0.3010346

minv4

## [1] 0.2874325

This algorithm takes graph as an input and prduced the community. Further, we have compare the community with the ground truth in order to find the perfection. In case of karate network, maximum distance was 1 and minimum distance was 0.58, while, the average distnce was 0.76. It shows that this algorithm producing the worst results for the karate network. For the football network, the average distance was 0.88, maximum distance was 0.92 and minimum distance was 0.83. It again showing the worst results for the community detection. In case of dolphin network, the maximum dstance was 1, the average distance was 0.61 and the minimum distance was 0.44. We can see that LPA on dolphin network prodcing some how better results. However, there were maximum distance produced, but the average result shows that the chance of maximum distance are too low. For the polblogs network, this algorithm produced better results as compare to the previous networks. The average distance was 0.29, the maximum distance was 0.29 and the minimum distance was also the same. An other interesing thing that we can see that this algorithms works well for directed networks.

Step 6: Create Function to Merge Partitions

Merge_Partitions <- function(par1, par2){
   par3 = par1 + par2
   return(par3)
}


#Karate Network
res = rep(0,vcount(karate))
for (i in 1:10) {
   l1 = cluster_label_prop(karate)
   res = Merge_Partitions(res,l1$membership)
}
compare(res,V(karate)$value,method = "nmi")

## [1] 0.5864431

#Football Network
res = rep(0,vcount(football))
for (i in 1:10) {
   l1 = cluster_label_prop(football)
   res = Merge_Partitions(res,l1$membership)
}
compare(res,V(football)$value,method = "nmi")

## [1] 0.9028241

#Dolphin Network
res = rep(0,vcount(dolphin))
for (i in 1:10) {
   l1 = cluster_label_prop(dolphin)
   res = Merge_Partitions(res,l1$membership)
}
compare(res,V(dolphin)$value,method = "nmi")

## [1] 0.2842769

#Polblogs Network
res = rep(0,vcount(polblogs))
for (i in 1:10) {
   l1 = cluster_label_prop(polblogs)
   res = Merge_Partitions(res,l1$membership)
}
compare(res,V(polblogs)$value,method = "nmi")

## [1] 0.2855301

In the previous section, we compare the 10 different results of this algorithm with the ground truth. But, here we have merge the 10 partiions and then compared with the ground truth. In case of karate network the distance was 0.61, while, for the football network the distance was 0.91. Similarly, for the dolphin network the distance was 0.41 and for the polblogs network the distance was 0.28. We can say that again for the directed network this algorithm producing the better results. While, for the football network, it producing the worst results by high distance. So, there was no such effect on the results by merging the partitions.

Step7: Vary N from 10 to 100

#Karate Network
res = rep(0,vcount(karate))
for (i in 1:100) {
   l1 = cluster_label_prop(karate)
   res = Merge_Partitions(res,l1$membership)
}
compare(res,V(karate)$value,method = "nmi")

## [1] 0.5364178

#Football Network
res = rep(0,vcount(football))
for (i in 1:100) {
   l1 = cluster_label_prop(football)
   res = Merge_Partitions(res,l1$membership)
}
compare(res,V(football)$value,method = "nmi")

## [1] 0.8943725

#Dolphin Network
res = rep(0,vcount(dolphin))
for (i in 1:100) {
   l1 = cluster_label_prop(dolphin)
   res = Merge_Partitions(res,l1$membership)
}
compare(res,V(dolphin)$value,method = "nmi")

## [1] 0.3718593

#Polblogs Network
res = rep(0,vcount(polblogs))
for (i in 1:100) {
   l1 = cluster_label_prop(polblogs)
   res = Merge_Partitions(res,l1$membership)
}
compare(res,V(polblogs)$value,method = "nmi")

## [1] 0.2819395

In the previous section, we have mreged the 10 different partions of network and compared with the ground truth. Now, we have presenting the effect of 100 partitions, this change from 10 to 100 will be consider as alpha paramter. In the previous section, we have seen that for the polblogs network the distance was low, while, for the football network the distance was high. Now, look at the results, there is a little bit difference between the 10 variations results and 100 variation results. In the previous section, the distance for the karate network was 0.61, now for the 100 variations the distance is 0.44. Similarly, in the previous section, the distance for the football network was 0.91, however, it reduced to 0.89 in the 100 variations. Likewise, the distance for the dolphin network was 0.41 for the 10 variations, however, for the 100 variations the distance is 0.36. In case of polblogs network, the distance in the 10 variations was 0.28, now in the 100 variations is also 0.28. It can be seen that for the undirected network the change in the alpha paramter produced the better results, while, for the directed networks it remains the same. So, we can conclude that label propagation algorithm is best for undirected networks and more variations can producde the better results.