Chapter 5 of John Foreman’s book Data Smart [http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html] looks at data which can be arranged as a network graph of related data points. It uses a cluster analysis technique called Modularity Maximization to optimize cluster assignments for the graph data.
We can implement the same process succinctly in R, making use of functions in the R igraph and lsa packages. The modularity score returned, 0.549, is marginally better than the value of 0.546 calculated using the Excel-based process and is the same as that calculated in the book using Gephi, the well-known open source graph analysis tool.
The same wine promotion dataset, as used in Chapter 2 to explain and develop the k-means clustering algorithms [http://workinganalytics.com/data-smart-ch2-customer-segmentation-with-r-using-k-medians-clustering-2/], is used here.
The author implements the following steps to perform community detection (cluster analysis) on the data in Excel:
Prepare the data for graph analysis Calculate the cosine similarity in wine promotion purchases for each customer pairing. The customers form the graph vertices and the cosine similarity defines the edge connections in the graph Prune the graph using the r-Neighborhood technique, keeping only edges with similarity scores at or above the 80th percentile Calculate new scores for the inter-customer relationships, taking into account the expectations of their being connected had the graph been connected randomly Calculate graph modularity arising from cluster assignments of the customers. Adjust cluster assignments to maximize graph modularity.
Graph modularity works by assessing a probability-of-connection score for each pair of nodes in a network, taking into account the degree (number of connections) of each node in the graph. When clusters are formed, the graph modularity score is incremented for each pair within a cluster that has a connection between them; the modularity score is penalized for each pair within a cluster that has no connection between them. The process of modularity maximization adjusts the number of clusters and the cluster assignments to find the optimum graph modularity score.
# make use of the igraph and lsa R packages
library (igraph) # R package for network analysis and visualization
library(lsa) # required for the cosine() function, used to generate a similarity matrix
# read in the wine data, convert the data to a matrix, calculate similarities between vertices
wineData <- read.csv(".\\WineNetwork.csv", header = TRUE, check.names = FALSE)
wineNW <- wineData[, -c(1,2,3,4,5,6,7)] # remove the metadata columns
wineNW[is.na(wineNW)] <- 0 # replace NULL entries with 0
wineNW_mat <- as.matrix(wineNW)# convert the data.frame to a matrix
cosineSim_mat <- cosine(wineNW_mat) # calculate the cosine similarities of all the customers
diag(cosineSim_mat) <- 0 # assign self-similarities *the diagonal in the matrix) to be of value 0
# prune the data using the r-neighborhood approach, create the graph object, plot the graph
cosineSim_mat[(cosineSim_mat < 0.5)] <-0 # prune the graph. Similarities of strength <0.5 are set to 0
cosineSim_mat[(cosineSim_mat >= 0.5)] <- 1
cosineSim_graph <- graph.adjacency(cosineSim_mat, mode = 'undirected', weighted = TRUE) # convert the matrix to an igraph object
plot(cosineSim_graph)
# create a matrix of inter-vertex scores
membershipVector <- numeric(100) # creates a vector of length 100. Each element is set equal to zero.
modMatrix <- mod.matrix(cosineSim_graph, membership = membershipVector) # generates a matrix of inter-vertex scores
# use the multilevel.community function in igraph to optimize the graph modularity
wineMod_opt = multilevel.community(cosineSim_graph) # selects optimum cluster assignments
wineMod_opt
## Graph community structure calculated with the multi level algorithm
## Number of communities (best split): 6
## Modularity (best split): 0.5489076
## Membership vector:
## Adams Allen Anderson Bailey Baker Barnes
## 2 3 1 2 3 3
## Bell Bennett Brooks Brown Butler Campbell
## 1 5 5 2 6 1
## Carter Clark Collins Cook Cooper Cox
## 2 3 6 1 3 1
## Cruz Davis Diaz Edwards Evans Fisher
## 2 6 2 5 3 6
## Flores Foster Garcia Gomez Gonzalez Gray
## 1 3 3 3 3 1
## Green Gutierrez Hall Harris Hernandez Hill
## 2 5 6 3 2 2
## Howard Hughes Jackson James Jenkins Johnson
## 6 2 6 2 1 1
## Jones Kelly King Lee Lewis Long
## 5 3 2 3 2 3
## Lopez Martin Martinez Miller Mitchell Moore
## 3 6 6 3 6 1
## Morales Morgan Morris Murphy Myers Nelson
## 3 5 1 2 2 5
## Nguyen Ortiz Parker Perez Perry Peterson
## 3 5 4 2 2 6
## Phillips Powell Price Ramirez Reed Reyes
## 1 3 3 3 3 3
## Richardson Rivera Roberts Robinson Rodriguez Rogers
## 3 2 3 2 1 3
## Ross Russell Sanchez Sanders Scott Smith
## 2 1 3 3 3 1
## Stewart Sullivan Taylor Thomas Thompson Torres
## 2 5 2 3 3 5
## Turner Walker Ward Watson White Williams
## 3 2 3 2 3 3
## Wilson Wood Wright Young
## 5 3 3 3
modularity(cosineSim_graph, membership =wineMod_opt$membership) # calculates final graph modularity
## [1] 0.5489076
for (i in 1:max(wineMod_opt$membership)){
print(wineMod_opt$names[wineMod_opt$membership == i]) # print list of members belonging to cluster i
} # prints out members by cluster assignment
## [1] "Anderson" "Bell" "Campbell" "Cook" "Cox"
## [6] "Flores" "Gray" "Jenkins" "Johnson" "Moore"
## [11] "Morris" "Phillips" "Rodriguez" "Russell" "Smith"
## [1] "Adams" "Bailey" "Brown" "Carter" "Cruz"
## [6] "Diaz" "Green" "Hernandez" "Hill" "Hughes"
## [11] "James" "King" "Lewis" "Murphy" "Myers"
## [16] "Perez" "Perry" "Rivera" "Robinson" "Ross"
## [21] "Stewart" "Taylor" "Walker" "Watson"
## [1] "Allen" "Baker" "Barnes" "Clark" "Cooper"
## [6] "Evans" "Foster" "Garcia" "Gomez" "Gonzalez"
## [11] "Harris" "Kelly" "Lee" "Long" "Lopez"
## [16] "Miller" "Morales" "Nguyen" "Powell" "Price"
## [21] "Ramirez" "Reed" "Reyes" "Richardson" "Roberts"
## [26] "Rogers" "Sanchez" "Sanders" "Scott" "Thomas"
## [31] "Thompson" "Turner" "Ward" "White" "Williams"
## [36] "Wood" "Wright" "Young"
## [1] "Parker"
## [1] "Bennett" "Brooks" "Edwards" "Gutierrez" "Jones"
## [6] "Morgan" "Nelson" "Ortiz" "Sullivan" "Torres"
## [11] "Wilson"
## [1] "Butler" "Collins" "Davis" "Fisher" "Hall" "Howard"
## [7] "Jackson" "Martin" "Martinez" "Mitchell" "Peterson"
# evaluate number of orders by cluster, for each offer
sumOrdersByCluster <- (aggregate(t(wineNW), by = list(wineMod_opt$membership), sum))
wineConclusion <- cbind(t(sumOrdersByCluster[,-1]), wineData)
# print out the top features each cluster
for (i in 1:max(wineMod_opt$membership)){
print(i)
print(head(wineConclusion[order(wineConclusion[,i], decreasing =TRUE),1:12]))
}
## [1] 1
## 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg)
## V24 12 0 0 0 0 0 24 September Pinot Noir 6
## V26 11 0 3 0 0 1 26 October Pinot Noir 144
## V17 7 0 0 0 0 0 17 July Pinot Noir 12
## V2 5 0 0 0 0 5 2 January Pinot Noir 72
## V12 1 1 0 0 0 3 12 May Prosecco 72
## V16 1 0 3 1 0 0 16 June Merlot 72
## Discount (%) Origin
## V24 34 Italy
## V26 83 Australia
## V17 47 Germany
## V2 17 France
## V12 83 Australia
## V16 88 California
## [1] 2
## 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg)
## V30 0 15 3 0 1 3 30 December Malbec 6
## V7 0 14 5 0 0 0 7 March Prosecco 6
## V29 0 14 0 1 2 0 29 November Pinot Grigio 6
## V18 0 11 1 0 2 0 18 July Espumante 6
## V8 0 7 2 0 11 0 8 March Espumante 6
## V13 0 5 0 0 1 0 13 May Merlot 6
## Discount (%) Origin
## V30 54 France
## V7 40 Australia
## V29 87 France
## V18 50 Oregon
## V8 45 South Africa
## V13 43 Chile
## [1] 3
## 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%)
## V22 0 0 14 0 1 6 22 August Champagne 72 63
## V31 0 0 14 1 1 1 31 December Champagne 72 89
## V6 0 0 11 0 1 0 6 March Prosecco 144 86
## V4 0 0 10 0 1 1 4 February Champagne 72 48
## V9 0 0 10 0 0 0 9 April Chardonnay 144 57
## V14 0 0 9 0 0 0 14 June Merlot 72 64
## Origin
## V22 France
## V31 France
## V6 Chile
## V4 France
## V9 Chile
## V14 Chile
## [1] 4
## 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg)
## V11 0 0 5 1 1 6 11 May Champagne 72
## V16 1 0 3 1 0 0 16 June Merlot 72
## V20 0 0 5 1 0 0 20 August Cabernet Sauvignon 72
## V29 0 14 0 1 2 0 29 November Pinot Grigio 6
## V31 0 0 14 1 1 1 31 December Champagne 72
## V1 0 0 5 0 0 5 1 January Malbec 72
## Discount (%) Origin
## V11 85 France
## V16 88 California
## V20 82 Italy
## V29 87 France
## V31 89 France
## V1 56 France
## [1] 5
## 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg)
## V8 0 7 2 0 11 0 8 March Espumante 6
## V3 0 0 4 0 2 0 3 February Espumante 144
## V18 0 11 1 0 2 0 18 July Espumante 6
## V29 0 14 0 1 2 0 29 November Pinot Grigio 6
## V4 0 0 10 0 1 1 4 February Champagne 72
## V6 0 0 11 0 1 0 6 March Prosecco 144
## Discount (%) Origin
## V8 45 South Africa
## V3 32 Oregon
## V18 50 Oregon
## V29 87 France
## V4 48 France
## V6 86 Chile
## [1] 6
## 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg)
## V11 0 0 5 1 1 6 11 May Champagne 72
## V22 0 0 14 0 1 6 22 August Champagne 72
## V1 0 0 5 0 0 5 1 January Malbec 72
## V2 5 0 0 0 0 5 2 January Pinot Noir 72
## V28 0 1 1 0 0 4 28 November Cabernet Sauvignon 12
## V12 1 1 0 0 0 3 12 May Prosecco 72
## Discount (%) Origin
## V11 85 France
## V22 63 France
## V1 56 France
## V2 17 France
## V28 56 France
## V12 83 Australia
The graph analysis package, igraph, is a powerful resource, well capable of implementing and scaling up the modularity maximization process developed by the author in Excel in Data Smart.