Clustering based on graph community-detection algorithm

Theorethical background

Inspired by John W. Foreman book "Data Smart".

Clustering - detection groups of similar cases in data. Classical approach is to compute pairwise distances between cases and with various method combine nearest cases. Above mentioned book suppose approach when we considered our data as graph. If distance between two cases is less than threshold value than these two cases considered as connected nodes. Further we could apply to this graph various methods of communities detection. Each discovered community corresponds to cluster. So we have following algorithm:

Compute pairwise distance
Build adjacency matrix - (matrix with 0/1 values - is this cases connected or not)
Build graph with adjacency matrix
Detect community on this graph

Example of computations

library(igraph)
dat = mtcars # we use data about automobile
dat = scale(dat) # scale data
#compute pairwise distance
dist_matrix = as.matrix(dist(dat,method = "euclid")) 
# normalize distance (max = 1)
dist_matrix = dist_matrix/max(dist_matrix,na.rm = TRUE) 
# points with distance less than 30% of maximum considered connected
graph_matrix = dist_matrix<(30/100) 
diag(graph_matrix) = FALSE # we don't need self-connected node
# build graph
graph = graph.adjacency(graph_matrix, mode = "undirected") 
communities = fastgreedy.community (graph) # detect comminities

See results on next slide.

Clustering results

Results of clustering. Different fill colors corresponds to different clusters.

plot(communities,graph, mark.border = NA) # plot results

plot of chunk unnamed-chunk-2

Application

You can try application here: https://gdemin.shinyapps.io/ClusteringApp/

Main features:

You could select from preloaded dataset or upload your own
Option to scale data
Different methods of distance calculation
Threshold value selection
Different algorithm of communities detection
Different graph plotting layouts

Thank you for your attention!