Cluster Analysis in R

Objective

In this clustering example I explore a food wholesaler's customer dataset to segment customers into distinct types and analyze which products each segment spends money on.

Process

Load and explore data. Evaluate whether pre-processing is necessary
Create a distance matrix
Build a dendrogram
Extract clusters from dendrogram
Explore resulting clusters
Conclusion: three of the four customer segments each spend heavily in one category (Milk, Grocery, or Frozen)

Load and explore data. Evaluate whether pre-processing is necessary

A quick look at the data shows spending in dollars across 3 product categories: Milk, Grocery, and Frozen. Each row represents one customer.

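# Load packages used below (dplyr: mutate, count, %>%; dendextend: color_branches)
library(dplyr)
library(dendextend)
# Load Data ("rb" is the open mode of the url() connection, not an argument of readRDS)
ws_customers <- readRDS(url("https://assets.datacamp.com/production/course_5390/datasets/ws_customers.rds", "rb"))
head(ws_customers)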

##    Milk Grocery Frozen
## 1 11103   12469    902
## 2  2013    6550    909
## 3  1897    5234    417
## 4  1304    3643   3045
## 5  3199    6986   1455
## 6  4560    9965    934

Explore data to look for initial insights

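The data is in a tidy format and ready for analysis. I use summary() to view the quartiles of each category and understand how spend is distributed. In every category the maximum is far above the third quartile, which points to outliers, and Grocery has the largest average spend. Because all three columns are measured in dollars, I leave them unscaled for the distance calculation.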

summary(ws_customers)

##       Milk          Grocery          Frozen     
##  Min.   :  333   Min.   : 1330   Min.   :  264  
##  1st Qu.: 1375   1st Qu.: 2743   1st Qu.:  824  
##  Median : 2335   Median : 5332   Median : 1455  
##  Mean   : 4831   Mean   : 7830   Mean   : 2870  
##  3rd Qu.: 5302   3rd Qu.:10790   3rd Qu.: 3046  
##  Max.   :25071   Max.   :26839   Max.   :15601
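To make the outlier check concrete, here is a minimal sketch that applies the common Q3 + 1.5 * IQR rule to each category. It runs on a toy sample built from the six rows printed by head() above, not on the full dataset. The helper name flag_high_outliers is hypothetical, and the 1.5 multiplier is a convention rather than something the original analysis used.

```r
# Toy sample: the six rows printed by head(ws_customers) above
sample_customers <- data.frame(
  Milk    = c(11103, 2013, 1897, 1304, 3199, 4560),
  Grocery = c(12469, 6550, 5234, 3643, 6986, 9965),
  Frozen  = c(902, 909, 417, 3045, 1455, 934)
)

# Hypothetical helper: TRUE for values above Q3 + 1.5 * IQR
flag_high_outliers <- function(x) {
  q3 <- quantile(x, 0.75, names = FALSE)
  x > q3 + 1.5 * IQR(x)
}

# Number of flagged customers per category
outlier_counts <- sapply(sample_customers, function(col) sum(flag_high_outliers(col)))
print(outlier_counts)

# Standard deviation per category, to compare spread across columns
print(sapply(sample_customers, sd))
```

On the full data, the same helper could be applied to ws_customers in place of sample_customers.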

Create a distance matrix and build a dendrogram

I can plot a dendrogram to see which customers group together. I use hierarchical clustering to group similar spending patterns and to differentiate between types of buying behavior.

# Calculate Euclidean distance between customers
customers_spend <- ws_customers
dist_customers <- dist(customers_spend, method = "euclidean")

# Generate a complete linkage analysis
hc_customers <- hclust(dist_customers, method = "complete")

# Plot the dendrogram
plot(hc_customers)
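The sketch below illustrates what dist() and complete linkage compute, again on a toy sample built from the six rows shown by head() earlier. It checks one Euclidean distance by hand, then confirms a property of complete linkage: the height of the final merge equals the largest pairwise distance in the data. The variable names are illustrative only.

```r
# Toy sample: the six rows printed by head(ws_customers) earlier
sample_customers <- data.frame(
  Milk    = c(11103, 2013, 1897, 1304, 3199, 4560),
  Grocery = c(12469, 6550, 5234, 3643, 6986, 9965),
  Frozen  = c(902, 909, 417, 3045, 1455, 934)
)

# Pairwise Euclidean distances between the six customers
d <- dist(sample_customers, method = "euclidean")

# Distance between customers 2 and 3, computed by hand for comparison
m <- as.matrix(sample_customers)
manual_2_3 <- sqrt(sum((m[2, ] - m[3, ])^2))
print(c(from_dist = as.matrix(d)[2, 3], by_hand = manual_2_3))

# Complete linkage: the distance between two clusters is the largest
# pairwise distance between their members
hc <- hclust(d, method = "complete")

# The final merge height equals the largest pairwise distance overall
print(c(top_merge = max(hc$height), max_distance = max(d)))
```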

Extract clusters from dendrogram

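After examining the dendrogram, I cut the tree at a height of 15,000, which isolates 4 distinct groups of customers. The height is a Euclidean distance in dollar units, not a spend amount: with complete linkage, any two customers in the same cluster are at most 15,000 apart.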

# Create a cluster assignment vector at h = 15000
clust_customers <- cutree(hc_customers, h = 15000)

# Generate the segmented customers dataframe
segment_customers <- mutate(customers_spend, cluster = clust_customers)

# Count the number of customers that fall into each cluster
count(segment_customers, cluster)

# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h = 15000)

# Plot the colored dendrogram
plot(dend_colored)
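As a rough illustration of what the height cutoff means, the following sketch cuts a small tree both by height and by a requested number of clusters, then measures the largest pairwise distance inside each height-based cluster. It uses the six rows from head() as a toy sample. The cutoff h_cut = 5000 is a hypothetical value chosen for this small sample, not the 15,000 used on the full data.

```r
# Toy sample: the six rows printed by head(ws_customers) earlier
sample_customers <- data.frame(
  Milk    = c(11103, 2013, 1897, 1304, 3199, 4560),
  Grocery = c(12469, 6550, 5234, 3643, 6986, 9965),
  Frozen  = c(902, 909, 417, 3045, 1455, 934)
)

d  <- dist(sample_customers)  # Euclidean distance is the default
hc <- hclust(d)               # complete linkage is the default

# Cut by height (hypothetical cutoff for this toy sample only)
h_cut <- 5000
clusters_by_h <- cutree(hc, h = h_cut)

# Alternatively, cut by a requested number of clusters
clusters_by_k <- cutree(hc, k = 2)

# Largest pairwise distance inside each height-based cluster
d_mat <- as.matrix(d)
diameters <- sapply(
  split(seq_along(clusters_by_h), clusters_by_h),
  function(idx) max(d_mat[idx, idx])
)

print(table(clusters_by_h))
print(diameters)
```

With complete linkage, every value in diameters stays at or below h_cut, which is why the cutoff can be read as a maximum within-cluster distance.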

Explore resulting clusters

From the per-cluster means computed below, I can infer the following about the 4 customer segments:

1. Customers in cluster 1 spent more on Milk, on average, than those in any other cluster.
2. Customers in cluster 3 spent more on Grocery, on average, than those in any other cluster.
3. Customers in cluster 4 spent more on Frozen goods, on average, than those in any other cluster.
4. Customers in cluster 2 did not show excessive spending in any category.

# Calculate the mean for each category (pass mean directly; funs() is deprecated)
segment_customers %>%
  group_by(cluster) %>%
  summarise_all(mean)

## # A tibble: 4 x 4
##   cluster   Milk Grocery Frozen
##     <int>  <dbl>   <dbl>  <dbl>
## 1       1 16950   12891.   991.
## 2       2  2513.   5229.  1796.
## 3       3 10452.  22551.  1355.
## 4       4  1250.   3917. 10889.
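For readers without dplyr, a similar per-cluster summary can be sketched in base R with aggregate(). This is a swapped-in alternative rather than the method used above, and it runs on a toy sample (the six rows from head(), cut into 2 clusters for illustration) instead of the full segmented data. The column n is an added cluster size, which helps judge how much weight each mean deserves.

```r
# Toy sample: the six rows printed by head(ws_customers) earlier
sample_customers <- data.frame(
  Milk    = c(11103, 2013, 1897, 1304, 3199, 4560),
  Grocery = c(12469, 6550, 5234, 3643, 6986, 9965),
  Frozen  = c(902, 909, 417, 3045, 1455, 934)
)

# Assign each customer to one of 2 clusters (k = 2 is illustrative only)
segmented <- cbind(
  sample_customers,
  cluster = cutree(hclust(dist(sample_customers)), k = 2)
)

# Mean spend per category within each cluster
cluster_means <- aggregate(. ~ cluster, data = segmented, FUN = mean)

# Number of customers in each cluster (clusters are labeled 1, 2, ...)
cluster_means$n <- tabulate(segmented$cluster)

print(cluster_means)
```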

Conclusion

These insights can help the food wholesaler target each customer segment effectively by offering discounts and upselling the right products. For example, customer segment 1 spends the most on Milk, and its average Grocery spend (about 12,891) is far higher than its average Frozen spend (about 991), so it is likely to respond better to Grocery discounts than to Frozen ones.