Data Smart, Ch5, Network Graphs and Community Detection

Community Detection using Modularity Maximization

Executive Summary

Chapter 5 of John Foreman’s book Data Smart [http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html] looks at data which can be arranged as a network graph of related data points. It uses a cluster analysis technique called Modularity Maximization to optimize cluster assignments for the graph data.

We can implement the same process succinctly in R, making use of functions in the R igraph and lsa packages. The modularity score returned, 0.549, is marginally better than the value of 0.546 calculated using the Excel-based process and is the same as that calculated in the book using Gephi, the well-known open source graph analysis tool.

Process Outline

The same wine promotion dataset, as used in Chapter 2 to explain and develop the k-means clustering algorithms [http://workinganalytics.com/data-smart-ch2-customer-segmentation-with-r-using-k-medians-clustering-2/], is used here.

The author implements the following steps to perform community detection (cluster analysis) on the data in Excel:

Prepare the data for graph analysis Calculate the cosine similarity in wine promotion purchases for each customer pairing. The customers form the graph vertices and the cosine similarity defines the edge connections in the graph Prune the graph using the r-Neighborhood technique, keeping only edges with similarity scores at or above the 80th percentile Calculate new scores for the inter-customer relationships, taking into account the expectations of their being connected had the graph been connected randomly Calculate graph modularity arising from cluster assignments of the customers. Adjust cluster assignments to maximize graph modularity.

A note on graph modularity

Graph modularity works by assessing a probability-of-connection score for each pair of nodes in a network, taking into account the degree (number of connections) of each node in the graph. When clusters are formed, the graph modularity score is incremented for each pair within a cluster that has a connection between them; the modularity score is penalized for each pair within a cluster that has no connection between them. The process of modularity maximization adjusts the number of clusters and the cluster assignments to find the optimum graph modularity score.

R Code

# make use of the igraph and lsa R packages
library (igraph) # R package for network analysis and visualization
library(lsa) # required for the cosine() function, used to generate a similarity matrix

# read in the wine data, convert the data to a matrix, calculate similarities between vertices
wineData <- read.csv(".\\WineNetwork.csv", header = TRUE, check.names = FALSE)
wineNW <- wineData[, -c(1,2,3,4,5,6,7)] # remove the metadata columns
wineNW[is.na(wineNW)] <- 0 # replace NULL entries with 0
wineNW_mat <- as.matrix(wineNW)# convert the data.frame to a matrix
cosineSim_mat <- cosine(wineNW_mat) # calculate the cosine similarities of all the customers
diag(cosineSim_mat) <- 0 # assign self-similarities *the diagonal in the matrix) to be of value 0

# prune the data using the r-neighborhood approach, create the graph object, plot the graph
cosineSim_mat[(cosineSim_mat < 0.5)] <-0 # prune the graph. Similarities of strength <0.5 are set to 0
cosineSim_mat[(cosineSim_mat >= 0.5)] <- 1
cosineSim_graph <- graph.adjacency(cosineSim_mat, mode = 'undirected', weighted = TRUE) # convert the matrix to an igraph object
plot(cosineSim_graph)

Fig. 1 A view of the pruned graph before modularity maximization is applied

# create a matrix of inter-vertex scores
membershipVector <- numeric(100) # creates a vector of length 100. Each element is set equal to zero. 
modMatrix <- mod.matrix(cosineSim_graph, membership = membershipVector) # generates a matrix of inter-vertex scores

# use the multilevel.community function in igraph to optimize the graph modularity 
wineMod_opt = multilevel.community(cosineSim_graph) # selects optimum cluster assignments 
wineMod_opt

## Graph community structure calculated with the multi level algorithm
## Number of communities (best split): 6 
## Modularity (best split): 0.5489076 
## Membership vector:
##      Adams      Allen   Anderson     Bailey      Baker     Barnes 
##          2          3          1          2          3          3 
##       Bell    Bennett     Brooks      Brown     Butler   Campbell 
##          1          5          5          2          6          1 
##     Carter      Clark    Collins       Cook     Cooper        Cox 
##          2          3          6          1          3          1 
##       Cruz      Davis       Diaz    Edwards      Evans     Fisher 
##          2          6          2          5          3          6 
##     Flores     Foster     Garcia      Gomez   Gonzalez       Gray 
##          1          3          3          3          3          1 
##      Green  Gutierrez       Hall     Harris  Hernandez       Hill 
##          2          5          6          3          2          2 
##     Howard     Hughes    Jackson      James    Jenkins    Johnson 
##          6          2          6          2          1          1 
##      Jones      Kelly       King        Lee      Lewis       Long 
##          5          3          2          3          2          3 
##      Lopez     Martin   Martinez     Miller   Mitchell      Moore 
##          3          6          6          3          6          1 
##    Morales     Morgan     Morris     Murphy      Myers     Nelson 
##          3          5          1          2          2          5 
##     Nguyen      Ortiz     Parker      Perez      Perry   Peterson 
##          3          5          4          2          2          6 
##   Phillips     Powell      Price    Ramirez       Reed      Reyes 
##          1          3          3          3          3          3 
## Richardson     Rivera    Roberts   Robinson  Rodriguez     Rogers 
##          3          2          3          2          1          3 
##       Ross    Russell    Sanchez    Sanders      Scott      Smith 
##          2          1          3          3          3          1 
##    Stewart   Sullivan     Taylor     Thomas   Thompson     Torres 
##          2          5          2          3          3          5 
##     Turner     Walker       Ward     Watson      White   Williams 
##          3          2          3          2          3          3 
##     Wilson       Wood     Wright      Young 
##          5          3          3          3

modularity(cosineSim_graph, membership =wineMod_opt$membership) # calculates final graph modularity

## [1] 0.5489076

for (i in 1:max(wineMod_opt$membership)){
  print(wineMod_opt$names[wineMod_opt$membership == i]) # print list of members belonging to cluster i
} # prints out members by cluster assignment

##  [1] "Anderson"  "Bell"      "Campbell"  "Cook"      "Cox"      
##  [6] "Flores"    "Gray"      "Jenkins"   "Johnson"   "Moore"    
## [11] "Morris"    "Phillips"  "Rodriguez" "Russell"   "Smith"    
##  [1] "Adams"     "Bailey"    "Brown"     "Carter"    "Cruz"     
##  [6] "Diaz"      "Green"     "Hernandez" "Hill"      "Hughes"   
## [11] "James"     "King"      "Lewis"     "Murphy"    "Myers"    
## [16] "Perez"     "Perry"     "Rivera"    "Robinson"  "Ross"     
## [21] "Stewart"   "Taylor"    "Walker"    "Watson"   
##  [1] "Allen"      "Baker"      "Barnes"     "Clark"      "Cooper"    
##  [6] "Evans"      "Foster"     "Garcia"     "Gomez"      "Gonzalez"  
## [11] "Harris"     "Kelly"      "Lee"        "Long"       "Lopez"     
## [16] "Miller"     "Morales"    "Nguyen"     "Powell"     "Price"     
## [21] "Ramirez"    "Reed"       "Reyes"      "Richardson" "Roberts"   
## [26] "Rogers"     "Sanchez"    "Sanders"    "Scott"      "Thomas"    
## [31] "Thompson"   "Turner"     "Ward"       "White"      "Williams"  
## [36] "Wood"       "Wright"     "Young"     
## [1] "Parker"
##  [1] "Bennett"   "Brooks"    "Edwards"   "Gutierrez" "Jones"    
##  [6] "Morgan"    "Nelson"    "Ortiz"     "Sullivan"  "Torres"   
## [11] "Wilson"   
##  [1] "Butler"   "Collins"  "Davis"    "Fisher"   "Hall"     "Howard"  
##  [7] "Jackson"  "Martin"   "Martinez" "Mitchell" "Peterson"

# evaluate number of orders by cluster, for each offer
sumOrdersByCluster <- (aggregate(t(wineNW), by = list(wineMod_opt$membership), sum))
wineConclusion <- cbind(t(sumOrdersByCluster[,-1]), wineData)

# print out the top features each cluster
for (i in 1:max(wineMod_opt$membership)){
  print(i)
  print(head(wineConclusion[order(wineConclusion[,i], decreasing =TRUE),1:12]))
}

## [1] 1
##      1 2 3 4 5 6 Offer #  Campaign   Varietal Minimum Qty (kg)
## V24 12 0 0 0 0 0      24 September Pinot Noir                6
## V26 11 0 3 0 0 1      26   October Pinot Noir              144
## V17  7 0 0 0 0 0      17      July Pinot Noir               12
## V2   5 0 0 0 0 5       2   January Pinot Noir               72
## V12  1 1 0 0 0 3      12       May   Prosecco               72
## V16  1 0 3 1 0 0      16      June     Merlot               72
##     Discount (%)     Origin
## V24           34      Italy
## V26           83  Australia
## V17           47    Germany
## V2            17     France
## V12           83  Australia
## V16           88 California
## [1] 2
##     1  2 3 4  5 6 Offer # Campaign     Varietal Minimum Qty (kg)
## V30 0 15 3 0  1 3      30 December       Malbec                6
## V7  0 14 5 0  0 0       7    March     Prosecco                6
## V29 0 14 0 1  2 0      29 November Pinot Grigio                6
## V18 0 11 1 0  2 0      18     July    Espumante                6
## V8  0  7 2 0 11 0       8    March    Espumante                6
## V13 0  5 0 0  1 0      13      May       Merlot                6
##     Discount (%)       Origin
## V30           54       France
## V7            40    Australia
## V29           87       France
## V18           50       Oregon
## V8            45 South Africa
## V13           43        Chile
## [1] 3
##     1 2  3 4 5 6 Offer # Campaign   Varietal Minimum Qty (kg) Discount (%)
## V22 0 0 14 0 1 6      22   August  Champagne               72           63
## V31 0 0 14 1 1 1      31 December  Champagne               72           89
## V6  0 0 11 0 1 0       6    March   Prosecco              144           86
## V4  0 0 10 0 1 1       4 February  Champagne               72           48
## V9  0 0 10 0 0 0       9    April Chardonnay              144           57
## V14 0 0  9 0 0 0      14     June     Merlot               72           64
##     Origin
## V22 France
## V31 France
## V6   Chile
## V4  France
## V9   Chile
## V14  Chile
## [1] 4
##     1  2  3 4 5 6 Offer # Campaign           Varietal Minimum Qty (kg)
## V11 0  0  5 1 1 6      11      May          Champagne               72
## V16 1  0  3 1 0 0      16     June             Merlot               72
## V20 0  0  5 1 0 0      20   August Cabernet Sauvignon               72
## V29 0 14  0 1 2 0      29 November       Pinot Grigio                6
## V31 0  0 14 1 1 1      31 December          Champagne               72
## V1  0  0  5 0 0 5       1  January             Malbec               72
##     Discount (%)     Origin
## V11           85     France
## V16           88 California
## V20           82      Italy
## V29           87     France
## V31           89     France
## V1            56     France
## [1] 5
##     1  2  3 4  5 6 Offer # Campaign     Varietal Minimum Qty (kg)
## V8  0  7  2 0 11 0       8    March    Espumante                6
## V3  0  0  4 0  2 0       3 February    Espumante              144
## V18 0 11  1 0  2 0      18     July    Espumante                6
## V29 0 14  0 1  2 0      29 November Pinot Grigio                6
## V4  0  0 10 0  1 1       4 February    Champagne               72
## V6  0  0 11 0  1 0       6    March     Prosecco              144
##     Discount (%)       Origin
## V8            45 South Africa
## V3            32       Oregon
## V18           50       Oregon
## V29           87       France
## V4            48       France
## V6            86        Chile
## [1] 6
##     1 2  3 4 5 6 Offer # Campaign           Varietal Minimum Qty (kg)
## V11 0 0  5 1 1 6      11      May          Champagne               72
## V22 0 0 14 0 1 6      22   August          Champagne               72
## V1  0 0  5 0 0 5       1  January             Malbec               72
## V2  5 0  0 0 0 5       2  January         Pinot Noir               72
## V28 0 1  1 0 0 4      28 November Cabernet Sauvignon               12
## V12 1 1  0 0 0 3      12      May           Prosecco               72
##     Discount (%)    Origin
## V11           85    France
## V22           63    France
## V1            56    France
## V2            17    France
## V28           56    France
## V12           83 Australia

Conclusion

The graph analysis package, igraph, is a powerful resource, well capable of implementing and scaling up the modularity maximization process developed by the author in Excel in Data Smart.