Customer Segmentation Analysis

Author

Emma Simkova

library(tidyverse)
library(cluster)

1 Introduction

This report applies two unsupervised learning techniques, hierarchical and K-means clustering, to identify distinct customer segments in a supermarket and a banking context. By grouping observations based on shared characteristics, we can develop targeted marketing strategies that improve personalization and profitability.

2 Part A: Hierarchical Cluster Analysis

2.1 Data Import and Subset Selection

I begin by importing the supermarket dataset and selecting the two primary variables of interest for our segmentation: annual income and spending score.

# Import dataset
supermarket <- read_csv("supermarket_customers.csv")

# Selecting variables for clustering
supermarket_subset <- supermarket %>%
  select(annual_income, spending_score)

2.2 Data Preparation and Scaling

Before computing the distance matrix, I standardize both variables with the scale() function, which centers each column at a mean of 0 and rescales it to a standard deviation of 1. Annual Income (measured in thousands) and Spending Score (1-100) are recorded in different units, so this step puts them on a common footing.

2.2.1 The Critical Role of Scaling

Scaling matters because Euclidean distance depends on the units of each variable. Annual income is measured in thousands of dollars (e.g., 15 to 137), while the spending score is a relative value from 1 to 100. Unscaled, the variable with the larger numerical spread (here annual_income) has a disproportionate influence on the distance calculation, and that influence would grow further if income were recorded in dollars rather than thousands. Scaling ensures that both variables contribute equally to the distance measurement, allowing the algorithm to identify clusters based on the pattern of the data rather than the magnitude of the units.
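The effect can be illustrated with a minimal sketch. The three customers below are hypothetical values chosen for illustration, not rows from the supermarket file. The sketch shows that the nearest neighbour of customer A flips when income is expressed in dollars instead of thousands, while the scaled distances are unaffected by the change of unit.

```r
# Hypothetical toy customers (not taken from the supermarket dataset)
toy_thousands <- data.frame(
  annual_income  = c(50, 52, 60),   # thousands of dollars
  spending_score = c(10, 90, 12),   # 1-100 score
  row.names = c("A", "B", "C")
)

# Same customers, income expressed in dollars
toy_dollars <- transform(toy_thousands, annual_income = annual_income * 1000)

# Helper: which of B or C is closest to A for a given distance object
nearest_to_A <- function(d) {
  m <- as.matrix(d)
  names(which.min(m["A", c("B", "C")]))
}

# Unscaled distances depend on the unit chosen for income
raw_nearest_thousands <- nearest_to_A(dist(toy_thousands))
raw_nearest_dollars   <- nearest_to_A(dist(toy_dollars))

# Scaled distances are identical whichever unit is used
scaled_thousands <- dist(scale(toy_thousands))
scaled_dollars   <- dist(scale(toy_dollars))
unit_invariant <- isTRUE(all.equal(as.vector(scaled_thousands),
                                   as.vector(scaled_dollars)))
```

Here raw_nearest_thousands and raw_nearest_dollars disagree, whereas unit_invariant is TRUE, which is the property that makes scale() a sensible default before dist().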

2.3 Distance Matrix and Clustering

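I calculate the Euclidean distance on the scaled data and apply hierarchical clustering using Ward’s method (method = "ward.D"), which aims to minimize the within-cluster variance. Note that in R's hclust(), "ward.D" reproduces Ward's criterion only when it is given squared distances; on unsquared Euclidean distances, "ward.D2" is the variant that implements the criterion exactly. All results reported below are based on "ward.D".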

# Scaling the data
supermarket_scaled <- scale(supermarket_subset)

# Computing Euclidean distance matrix
d1 <- dist(supermarket_scaled, method = "euclidean")

# Carrying out hierarchical clustering
h1 <- hclust(d1, method = "ward.D")
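As a side note on the linkage option, the following sketch compares the two Ward variants offered by hclust() on simulated data (the data and object names are illustrative assumptions, not part of the analysis). It checks the documented relationship that "ward.D" applied to squared distances yields the same tree as "ward.D2" applied to unsquared distances, with merge heights related by a square root.

```r
# Simulated two-variable data, standardized as in the report
set.seed(1)
x <- scale(matrix(rnorm(40), ncol = 2))
d <- dist(x, method = "euclidean")

# "ward.D" on squared distances versus "ward.D2" on the distances themselves
h_ward_d_sq <- hclust(d^2, method = "ward.D")
h_ward_d2   <- hclust(d,   method = "ward.D2")

# Same merge order, and heights that differ only by a square root
same_merges  <- identical(h_ward_d_sq$merge, h_ward_d2$merge)
same_heights <- isTRUE(all.equal(sqrt(h_ward_d_sq$height), h_ward_d2$height))

# "ward.D" on unsquared distances (the call used in this report) is a
# different procedure and may or may not give the same partition
h_ward_d <- hclust(d, method = "ward.D")
agreement <- table(cutree(h_ward_d, k = 3), cutree(h_ward_d2, k = 3))
```

The agreement table is a quick way to see whether the choice of variant changes cluster membership for a given dataset.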

2.4 Visualizing the Clusters

The dendrogram helps me visualize the nested groupings, while the heatmap allows me to see the similarity between observations directly.

# Plotting the Dendrogram
plot(h1, hang = -1, main = "Dendrogram of Supermarket Customer Segments")

# Visualizing the Heatmap
heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = "Rowv", main = "Clustering Heatmap (Supermarket)")

The dendrogram shows long vertical branches (large jumps in merge height) above the level where five groups remain, suggesting a 5-cluster solution is appropriate. The heatmap further supports this structure, showing distinct ‘blocks’ of color along the diagonal, which indicates high similarity within groups and low similarity between different groups.
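One way to make a candidate cut visible on a dendrogram is rect.hclust(), which draws a box around each group at a chosen k. The sketch below uses simulated data with five well-separated groups (an assumption made purely for illustration) rather than the supermarket file.

```r
# Simulated data: five groups of 20 points around hypothetical centres
set.seed(42)
centres <- rbind(c(-2, -2), c(-2, 2), c(0, 0), c(2, -2), c(2, 2))
toy <- centres[rep(1:5, each = 20), ] + matrix(rnorm(200, sd = 0.3), ncol = 2)

# Same pipeline as in the report: scale, Euclidean distance, Ward linkage
h_toy <- hclust(dist(scale(toy)), method = "ward.D")

# Draw the dendrogram and outline the five-cluster cut
plot(h_toy, hang = -1, labels = FALSE, main = "Toy dendrogram with k = 5")
boxes <- rect.hclust(h_toy, k = 5, border = "red")

# rect.hclust() invisibly returns the members of each boxed cluster
cluster_sizes <- lengths(boxes)
```

Applied to the report's own tree, the equivalent call would be rect.hclust(h1, k = 5) immediately after plot(h1, ...).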

2.5 Cluster Solution and Quality Assessment

Based on the dendrogram and business requirements, I choose a 5-cluster solution. I assess the quality of this solution using the Silhouette width.

# Creating a 5-cluster solution
clusters_h <- cutree(h1, k = 5)

# Assessing quality with Silhouette Scores
sil_h <- silhouette(clusters_h, d1)
summary(sil_h)
Silhouette of 200 units in 5 clusters from silhouette.default(x = clusters_h, dist = d1) :
 Cluster sizes and average silhouette widths:
       23        21        85        39        32 
0.5161937 0.6353734 0.5625990 0.5138767 0.5526370 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.08419  0.49074  0.59411  0.55381  0.66283  0.74542 

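The mean silhouette width is 0.55, and every cluster averages above 0.5. By the commonly used rule of thumb, values between 0.51 and 0.70 indicate a “reasonable” structure and values above 0.70 a “strong” one, so the five segments are reasonably distinct and the observations within them are well-matched. The minimum of -0.08 shows that at least one customer sits slightly closer to a neighbouring segment than to its own.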
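The silhouette width can also be used to compare candidate values of k instead of assessing a single solution. The following is a minimal sketch on simulated data (the data, the range 2 to 8, and all object names are assumptions for illustration); it cuts one tree at several values of k and records the mean silhouette width for each.

```r
library(cluster)

# Simulated data: five groups of 20 points around hypothetical centres
set.seed(42)
centres <- rbind(c(-2, -2), c(-2, 2), c(0, 0), c(2, -2), c(2, 2))
toy <- centres[rep(1:5, each = 20), ] + matrix(rnorm(200, sd = 0.3), ncol = 2)

d_toy <- dist(scale(toy))
h_toy <- hclust(d_toy, method = "ward.D")

# Mean silhouette width for each candidate number of clusters
candidate_k <- 2:8
avg_sil <- sapply(candidate_k, function(k) {
  sil <- silhouette(cutree(h_toy, k = k), d_toy)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- candidate_k

# Candidate with the highest mean silhouette width
best_k <- as.integer(names(which.max(avg_sil)))
```

On the supermarket data the same loop could be run with h1 and d1 to check whether k = 5 is also favoured numerically, alongside the visual evidence from the dendrogram.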

2.6 Supermarket Profiling

Using the tidyverse workflow, I calculate the means of our variables for each cluster and assign descriptive names based on their behavior.

# Profiling the segments and assigning names
supermarket_profile <- supermarket %>%
  mutate(cluster_id = clusters_h) %>%
  group_by(cluster_id) %>%
  summarise(
    avg_income = round(mean(annual_income), 2),
    avg_spending = round(mean(spending_score), 2),
    count = n()
  ) %>%
  # Segment names are matched to each cluster's mean income and spending
  mutate(segment_name = case_when(
    cluster_id == 1 ~ "Budget Conscious",   # low income, low spending
    cluster_id == 2 ~ "Impulsive Spenders", # low income, high spending
    cluster_id == 3 ~ "Middle Ground",      # average income, average spending
    cluster_id == 4 ~ "Target Group",       # high income, high spending
    cluster_id == 5 ~ "Frugal Affluents"    # high income, low spending
  ))

# Display the table
supermarket_profile
# A tibble: 5 × 5
  cluster_id avg_income avg_spending count segment_name      
       <int>      <dbl>        <dbl> <int> <chr>             
1          1       26.3         20.9    23 Budget Conscious  
2          2       25.1         80.0    21 Impulsive Spenders
3          3       55.8         49.1    85 Middle Ground     
4          4       86.5         82.1    39 Target Group      
5          5       89.4         15.6    32 Frugal Affluents  

2.6.1 Segment Interpretation & Marketing Strategy

Based on the profiling table above, I have identified five distinct consumer segments:

  • Target Group (High Income / High Spending): These are our “MVP” customers. They have high purchasing power and a high propensity to spend. Marketing should focus on premium loyalty programs and exclusive “first-look” offers.

  • Frugal Affluents (High Income / Low Spending): This is our “Untapped Potential.” They have the wealth but are not spending it at our store. We need to investigate if they prefer competitors or if our luxury product range is lacking.

  • Middle Ground (Average Income / Average Spending): The “Standard” customer. Marketing should focus on consistent engagement and volume-based discounts to keep them coming back.

  • Impulsive Spenders (Low Income / High Spending): A high-engagement but high-risk group. They spend significantly despite lower incomes. These are likely younger demographics influenced by trends and flash sales.

  • Budget Conscious (Low Income / Low Spending): These are “Sensible Spenders” who are likely price-sensitive. Marketing efforts here should focus on “Value” ranges and essential-item coupons.
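Because cluster numbers are arbitrary, descriptive labels are easy to attach to the wrong ID. A possible cross-check is to derive an income and spending band for each cluster directly from its means. The sketch below uses the cluster means from the profiling table; the cut points (40 and 70 for income, 40 and 60 for spending) are hypothetical thresholds chosen for illustration.

```r
# Cluster means copied from the profiling table in section 2.6
profile <- data.frame(
  cluster_id   = 1:5,
  avg_income   = c(26.3, 25.1, 55.8, 86.5, 89.4),
  avg_spending = c(20.9, 80.0, 49.1, 82.1, 15.6)
)

# Hypothetical cut points for low / average / high bands
income_band <- cut(profile$avg_income,
                   breaks = c(-Inf, 40, 70, Inf),
                   labels = c("low", "average", "high"))
spending_band <- cut(profile$avg_spending,
                     breaks = c(-Inf, 40, 60, Inf),
                     labels = c("low", "average", "high"))

# Combine the two bands into one description per cluster
profile$band <- paste(income_band, "income /", spending_band, "spending")
profile
```

Each band can then be compared with the income and spending description given for the segment name assigned to that cluster_id.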

3 Part B: K-means Clustering

3.1 Data Preparation and Scaling Requirement

For the banking dataset, I cluster based on age, experience, income, and credit card spending (cc_avg).

Why Scale? Scaling is again critical here. income is recorded in thousands of dollars and reaches values in the hundreds, while cc_avg is mostly in single digits and age and experience span only a few decades. Without scaling, the K-means algorithm would weight income far more heavily than the other variables when defining the centroids.

# Import dataset
bank <- read_csv("bank_personal_loan.csv")

# Selecting variables
bank_subset <- bank %>% 
  select(age, experience, income, cc_avg)

# Scaling the data
bank_scaled <- scale(bank_subset)

3.2 K-means Algorithm and Reproducibility

I use set.seed(101) to ensure that the random initial centroid placement is the same every time I run the code, making my results reproducible.

# Set seed for reproducibility
set.seed(101)

# Running K-means with 3 clusters
kmeans_bank <- kmeans(bank_scaled, centers = 3)

# Assessing quality
d2 <- dist(bank_scaled)
sil_k <- silhouette(kmeans_bank$cluster, d2)
summary(sil_k)
Silhouette of 5000 units in 3 clusters from silhouette.default(x = kmeans_bank$cluster, dist = d2) :
 Cluster sizes and average silhouette widths:
     2029      2168       803 
0.4344352 0.4359975 0.2505801 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1365  0.2776  0.4538  0.4056  0.5614  0.6311 

The average silhouette width for the 3-cluster solution is 0.41. A score between 0.25 and 0.50 suggests a “weak but sensible” structure. Clusters 1 and 2 both average about 0.43, while cluster 3 is the least cohesive at 0.25. While the clusters are not perfectly separated, they provide enough distinction for marketing segmentation and allow us to identify clear differences in customer behavior.
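On the reproducibility point, a fixed seed pins down one random start, and kmeans() also accepts an nstart argument that tries several starts and keeps the best one. The sketch below illustrates both on simulated data; the data, the value nstart = 25, and the object names are assumptions for illustration and are not part of the banking analysis.

```r
# Simulated data: three loose groups in two dimensions
set.seed(7)
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
))

# The same seed gives the same initial centroids, hence the same clusters
set.seed(101)
fit_a <- kmeans(toy, centers = 3)
set.seed(101)
fit_b <- kmeans(toy, centers = 3)
same_result <- identical(fit_a$cluster, fit_b$cluster)

# Several random starts: kmeans() keeps the solution with the lowest
# total within-cluster sum of squares
set.seed(101)
fit_multi <- kmeans(toy, centers = 3, nstart = 25)
no_worse <- fit_multi$tot.withinss <= fit_a$tot.withinss + 1e-8
```

Using nstart does not replace set.seed(); the seed is still what makes the run repeatable, while multiple starts reduce the chance of reporting a poor local optimum.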

3.3 Marketing Insight: Personal Loan Propensity

To answer the specific business question, I profile the clusters and include the personal_loan variable to see which group has the highest uptake rate.

# Profiling the bank clusters with segment names
bank_profile <- bank %>%
  mutate(cluster = factor(kmeans_bank$cluster)) %>%
  group_by(cluster) %>%
  summarise(
    avg_age = round(mean(age), 1),
    avg_income = round(mean(income), 2),
    avg_cc_spend = round(mean(cc_avg), 2),
    loan_uptake_rate = round(mean(personal_loan), 4),
    count = n()
  ) %>%
  # Segment names are matched to each cluster's profile
  mutate(segment_name = case_when(
    cluster == 1 ~ "Emerging Affluents",  # younger, moderate income
    cluster == 2 ~ "Stable Middle-Class", # older, moderate income
    cluster == 3 ~ "High-Value Targets"   # highest income, spending, and uptake
  ))

bank_profile
# A tibble: 3 × 7
  cluster avg_age avg_income avg_cc_spend loan_uptake_rate count segment_name   
  <fct>     <dbl>      <dbl>        <dbl>            <dbl> <int> <chr>          
1 1          35.1       60.3         1.37           0.034   2029 Emerging Afflu…
2 2          55.5       59.1         1.36           0.0392  2168 Stable Middle-…
3 3          43.6      148.          4.96           0.406    803 High-Value Tar…

4 Which cluster is most likely to take out a personal loan?

Based on the loan_uptake_rate calculated above, Cluster 3 is the most likely to take out a personal loan. This is the high-income, high-spending segment: it has the highest average annual income (~$148k) and the highest average credit card spending (~$4.96k/month). Most importantly, its loan uptake rate is roughly 41%, about ten times higher than either of the other clusters (3.4% and 3.9%).
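The comparison between clusters can be checked with a few lines of arithmetic on the loan_uptake_rate column of the profile table in section 3.3. The vector below copies those three rates; the object names are illustrative.

```r
# Loan uptake rates copied from the bank_profile table, named by cluster
uptake <- c("1" = 0.034, "2" = 0.0392, "3" = 0.406)

# Cluster with the highest uptake rate
top_cluster <- names(which.max(uptake))

# Ratio of the highest rate to the second-highest rate
ranked <- sort(uptake, decreasing = TRUE)
ratio_to_next <- unname(ranked[1] / ranked[2])
round(ratio_to_next, 1)   # about 10.4
```

The ratio is taken against the second-highest cluster, so it is the most conservative version of the comparison.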

This segment represents the “High Earners/High Spenders”: it has the highest avg_income and avg_cc_spend. For a Marketing Manager, this group represents the “Gold Mine.” Its members already use credit heavily (high cc_avg), which suggests they are comfortable with credit products, and they have the income to support loan repayments. Future personal loan campaigns, including direct mail and digital advertising, should therefore be concentrated on this cluster to maximize conversion rates. Targeting the two lower-income clusters, where uptake is below 4%, would likely result in a lower ROI.