In this clustering example I explore a food wholesaler's customer dataset to segment customers and analyze which products each segment spends money on.
A quick look at the data shows dollar amounts spent in 3 product categories: Milk, Grocery, and Frozen.
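# Load packages: dplyr for mutate(), count() and %>%; dendextend for color_branches()
library(dplyr)
library(dendextend)
# Load Data (the "rb" open mode belongs to url(), not readRDS())
ws_customers <- readRDS(url("https://assets.datacamp.com/production/course_5390/datasets/ws_customers.rds", "rb"))
head(ws_customers)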
## Milk Grocery Frozen
## 1 11103 12469 902
## 2 2013 6550 909
## 3 1897 5234 417
## 4 1304 3643 3045
## 5 3199 6986 1455
## 6 4560 9965 934
The data is in a tidy format and ready for analysis. I summarize each category by its quartiles to understand how spend is distributed. The maximum in every category lies far above the third quartile, which points to outliers, and Grocery has the largest average spend.
summary(ws_customers)
## Milk Grocery Frozen
## Min. : 333 Min. : 1330 Min. : 264
## 1st Qu.: 1375 1st Qu.: 2743 1st Qu.: 824
## Median : 2335 Median : 5332 Median : 1455
## Mean : 4831 Mean : 7830 Mean : 2870
## 3rd Qu.: 5302 3rd Qu.:10790 3rd Qu.: 3046
## Max. :25071 Max. :26839 Max. :15601
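The outlier observation can be checked more explicitly. The following is a minimal sketch, assuming Tukey's 1.5 * IQR rule as the outlier definition (my assumption, not something the summary itself defines). To stay self-contained it uses only the six rows printed by head() above as toy data; `toy` and `count_outliers` are hypothetical names introduced for illustration.

```r
# Toy data: the six rows printed by head(ws_customers) above
toy <- data.frame(
  Milk    = c(11103, 2013, 1897, 1304, 3199, 4560),
  Grocery = c(12469, 6550, 5234, 3643, 6986, 9965),
  Frozen  = c(902, 909, 417, 3045, 1455, 934)
)

# Count values lying more than 1.5 * IQR outside the quartiles (Tukey's rule)
count_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# One count per spend category
outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)
```

Running the same function on the full ws_customers data frame would give the counts for the whole dataset.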
I can plot a dendrogram to understand which customers belong together. I use hierarchical clustering to group similar spending patterns and to differentiate between types of buying behavior.
# Calculate Euclidean distance between customers
customers_spend <- ws_customers
dist_customers <- dist(customers_spend, method = "euclidean")
# Generate a complete linkage analysis
hc_customers <- hclust(dist_customers, method = "complete")
# Plot the dendrogram
plot(hc_customers)
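To make the linkage step more concrete, here is a small sketch of how complete linkage behaves, again using only the six rows from head() as toy data (`toy`, `d` and `hc` are illustrative names). Complete linkage measures the distance between two clusters by their farthest pair of members, so the first merge happens at the smallest pairwise distance and the last merge at the largest.

```r
# Toy data: the six rows printed by head(ws_customers) above
toy <- data.frame(
  Milk    = c(11103, 2013, 1897, 1304, 3199, 4560),
  Grocery = c(12469, 6550, 5234, 3643, 6986, 9965),
  Frozen  = c(902, 909, 417, 3045, 1455, 934)
)

# Pairwise Euclidean distances between the six customers
d <- dist(toy, method = "euclidean")

# Complete linkage: cluster distance = distance of the farthest pair
hc <- hclust(d, method = "complete")

# Merge heights, from the first (closest) merge to the last
print(hc$height)

# The first merge joins the two closest customers;
# the last merge sits at the largest pairwise distance
print(all.equal(hc$height[1], min(d)))
print(all.equal(max(hc$height), max(d)))
```

All three categories are measured in dollars, so the distances are directly interpretable without rescaling the columns.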
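After examining the dendrogram I can isolate 4 distinct groups of customers by cutting the tree at a height of 15,000. Because the height is a Euclidean distance computed on dollar amounts and the linkage is complete, no two customers in the same cluster are more than $15,000 apart in spend.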
# Create a cluster assignment vector at h = 15000
clust_customers <- cutree(hc_customers, h = 15000)
# Generate the segmented customers dataframe
segment_customers <- mutate(customers_spend, cluster = clust_customers)
# Count the number of customers that fall into each cluster
count(segment_customers, cluster)
# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h = 15000)
# Plot the colored dendrogram
plot(dend_colored)
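Cutting by height and cutting by number of clusters are two views of the same operation. The sketch below illustrates this on the six rows from head(), with a hypothetical cutoff `h_cut` chosen only for illustration (it is not the 15,000 used on the full data): every merge above the cutoff is undone, so the number of clusters is the number of such merges plus one.

```r
# Toy data: the six rows printed by head(ws_customers) above
toy <- data.frame(
  Milk    = c(11103, 2013, 1897, 1304, 3199, 4560),
  Grocery = c(12469, 6550, 5234, 3643, 6986, 9965),
  Frozen  = c(902, 909, 417, 3045, 1455, 934)
)
hc <- hclust(dist(toy), method = "complete")

# Hypothetical cutoff for the toy tree
h_cut <- 5000

# Cut by height
by_height <- cutree(hc, h = h_cut)

# Equivalent cut by cluster count: merges above the cutoff, plus one
k <- sum(hc$height > h_cut) + 1
by_count <- cutree(hc, k = k)

# Cluster sizes, and a check that both cuts agree
print(table(by_height))
print(identical(by_height, by_count))
```

Cutting by height is convenient when the distance has a business meaning, as it does here in dollars; cutting by `k` is convenient when the number of segments is fixed in advance.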
By looking at 4 distinct customer segments I can infer the following insights:
1. Customers in cluster 1 spent more on Milk, on average, than any other cluster.
2. Customers in cluster 3 spent more on Grocery, on average, than any other cluster.
3. Customers in cluster 4 spent more on Frozen goods, on average, than any other cluster.
4. Customers in cluster 2 did not show excessive spending in any category.
# Calculate the mean for each category
# (mean is passed directly; funs() is deprecated as of dplyr 0.8.0)
segment_customers %>%
group_by(cluster) %>%
summarise_all(mean)
## # A tibble: 4 x 4
## cluster Milk Grocery Frozen
## <int> <dbl> <dbl> <dbl>
## 1 1 16950 12891. 991.
## 2 2 2513. 5229. 1796.
## 3 3 10452. 22551. 1355.
## 4 4 1250. 3917. 10889.
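The first three insights can be read off this table programmatically. As a minimal sketch, the block below re-enters the cluster means from the output above (rounded to whole dollars) and finds the cluster with the highest mean in each category; `cluster_means` and `top_cluster` are illustrative names.

```r
# Cluster means copied from the summary output above, rounded to whole dollars
cluster_means <- data.frame(
  cluster = 1:4,
  Milk    = c(16950, 2513, 10452, 1250),
  Grocery = c(12891, 5229, 22551, 3917),
  Frozen  = c(991, 1796, 1355, 10889)
)

# For each category, pick the cluster whose mean spend is highest
top_cluster <- sapply(
  cluster_means[-1],
  function(x) cluster_means$cluster[which.max(x)]
)
print(top_cluster)
```

This reports cluster 1 for Milk, cluster 3 for Grocery, and cluster 4 for Frozen, which matches the insights listed earlier.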
These insights can help the food wholesaler target each customer segment effectively by offering discounts and upselling the right products. For example, customers in segment 1 primarily purchase Milk, but they spend far more on Grocery than on Frozen, so they are more likely to respond to Grocery discounts than to Frozen ones.
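One way to back up this kind of targeting decision is to look at each segment's spend mix rather than raw dollars. The sketch below, which again re-enters the rounded cluster means from the table above, converts each row to shares of that segment's total spend; `cluster_means` and `spend_share` are illustrative names.

```r
# Cluster means from the table above, rounded to whole dollars
cluster_means <- matrix(
  c(16950, 12891,   991,
     2513,  5229,  1796,
    10452, 22551,  1355,
     1250,  3917, 10889),
  nrow = 4, byrow = TRUE,
  dimnames = list(cluster = as.character(1:4),
                  category = c("Milk", "Grocery", "Frozen"))
)

# Share of each category within a segment's total spend (each row sums to 1)
spend_share <- prop.table(cluster_means, margin = 1)
print(round(spend_share, 2))
```

For segment 1, Grocery makes up roughly 42% of average spend while Frozen makes up roughly 3%, which is the gap the Grocery discount example relies on.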