Customer Segmentation Analysis Using K-Means Clustering

Introduction

Customer segmentation is a fundamental application of unsupervised learning in marketing analytics. This report uses the K-means clustering algorithm to segment the customers of a retail mall on three variables: age, annual income, and spending score. The objective is to derive actionable customer profiles that inform targeted marketing strategies.

Data Preparation

Package Import

The analysis utilizes the following R packages for data manipulation, visualization, and clustering.

# Data manipulation and visualization
library(tidyverse)
# Advanced clustering visualization
library(factoextra)
# Clustering algorithms
library(cluster)

Data Loading and Inspection

The dataset Mall_Customers.csv contains one row per customer. After loading it, an initial inspection with head() and str() shows its structure: 200 observations of 5 variables. Note that the column named Genre records the customer's gender.

# Load the customer dataset
customer_data <- read.csv("Mall_Customers.csv")

# Display the first observations and data structure
head(customer_data)

##   CustomerID  Genre Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

str(customer_data)

## 'data.frame':    200 obs. of  5 variables:
##  $ CustomerID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Genre                 : chr  "Male" "Male" "Female" "Female" ...
##  $ Age                   : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual.Income..k..    : int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending.Score..1.100.: int  39 81 6 77 40 76 6 94 3 72 ...
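
The inspection above shows structure; completeness can be checked explicitly. The following is a minimal sketch, assuming a small stand-in data frame (toy_customers, a hypothetical name) built from the six rows printed by head(). On the real data the same calls would be applied to customer_data.

```r
# Stand-in data: the six rows printed by head(customer_data) above.
# `toy_customers` is a hypothetical name used only for this illustration.
toy_customers <- data.frame(
  CustomerID = 1:6,
  Genre = c("Male", "Male", "Female", "Female", "Female", "Female"),
  Age = c(19, 21, 20, 23, 31, 22),
  Annual.Income..k.. = c(15, 15, 16, 16, 17, 17),
  Spending.Score..1.100. = c(39, 81, 6, 77, 40, 76)
)

# Count missing values per column
missing_counts <- colSums(is.na(toy_customers))
print(missing_counts)

# Count repeated customer IDs (0 means every ID is unique)
n_duplicate_ids <- sum(duplicated(toy_customers$CustomerID))
print(n_duplicate_ids)
```

Any non-zero count would need to be resolved before clustering, since kmeans() does not accept missing values.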

Data Preprocessing and Exploratory Analysis

Prior to clustering, it is essential to preprocess the data and explore the distributions and relationships between key variables.

Feature Selection and Standardization

For K-means clustering, we select the three relevant numerical features and standardize them to have a mean of zero and a standard deviation of one. This prevents variables with larger scales (like income) from dominating the distance calculation.

# Select relevant features for clustering
features_for_clustering <- customer_data %>%
  select(Age, Annual.Income..k.., Spending.Score..1.100.)

# Standardize the features (Z-score normalization)
scaled_features <- scale(features_for_clustering)
colnames(scaled_features) <- c("Scaled_Age", "Scaled_Income", "Scaled_SpendingScore")
head(scaled_features)

##      Scaled_Age Scaled_Income Scaled_SpendingScore
## [1,] -1.4210029     -1.734646           -0.4337131
## [2,] -1.2778288     -1.734646            1.1927111
## [3,] -1.3494159     -1.696572           -1.7116178
## [4,] -1.1346547     -1.696572            1.0378135
## [5,] -0.5619583     -1.658498           -0.3949887
## [6,] -1.2062418     -1.658498            0.9990891
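
To make the standardization concrete, the sketch below checks that scale() matches the textbook z-score, (x - mean) / sd, on a short vector. The vector holds only the six ages shown by head(), so the resulting values differ from the output above, which uses the mean and standard deviation of all 200 customers.

```r
# Illustration only: the first six ages from the dataset
age <- c(19, 21, 20, 23, 31, 22)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (age - mean(age)) / sd(age)

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes
z_scale <- as.numeric(scale(age))

# Both approaches should agree up to floating-point tolerance
print(all.equal(z_manual, z_scale))

# After standardization the mean is 0 and the standard deviation is 1
print(round(c(mean = mean(z_manual), sd = sd(z_manual)), 10))
```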

Exploratory Visualization

Understanding the distribution and pairwise relationships of the original variables provides initial insights.

# Create a paired scatter plot matrix using base R
pairs(features_for_clustering,
      main = "Pairwise Relationships: Age, Income, and Spending Score",
      pch = 19, col = adjustcolor("steelblue", alpha.f = 0.6))

# Calculate and print basic correlations
cor_matrix <- cor(features_for_clustering)
print("Correlation Matrix:")

## [1] "Correlation Matrix:"

print(round(cor_matrix, 2))

##                          Age Annual.Income..k.. Spending.Score..1.100.
## Age                     1.00              -0.01                  -0.33
## Annual.Income..k..     -0.01               1.00                   0.01
## Spending.Score..1.100. -0.33               0.01                   1.00

Clustering Analysis

Determining the Optimal Number of Clusters

A critical step in K-means is selecting the number of clusters (k). The Elbow Method visualizes the total within-cluster sum of squares (WSS) against different k values; the “elbow” point indicates a good balance between model complexity and explanatory power.
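
The WSS curve can also be computed by hand, which shows what the elbow plot is built from. This is a minimal sketch on synthetic data (toy_points is a hypothetical stand-in for scaled_features), not the report's actual computation.

```r
set.seed(123)

# Synthetic stand-in for scaled_features: three well-separated 2-D groups
toy_points <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)

# Fit K-means for k = 1..10 and record the total within-cluster sum of squares
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(toy_points, centers = k, nstart = 25)$tot.withinss
})

# Tabulate the curve; plot(k_values, wss, type = "b") would draw it
print(data.frame(k = k_values, wss = round(wss, 2)))
```

At k = 1 the WSS equals the total sum of squares around the overall mean. The elbow is the value of k after which further increases reduce WSS only marginally.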

# Use the Elbow Method to suggest optimal k
set.seed(123) # Ensure reproducibility
fviz_nbclust(scaled_features, 
             FUNcluster = kmeans, 
             method = "wss", 
             k.max = 10) +
  geom_vline(xintercept = 5, linetype = 2, color = "red") + 
  labs(title = "Elbow Method for Optimal K",
       subtitle = "Total Within-Cluster Sum of Squares vs. Number of Clusters")
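
Reading an elbow is somewhat subjective, so a second criterion can serve as a cross-check. The sketch below computes the average silhouette width for several values of k using silhouette() from the cluster package. It runs on synthetic data (toy_points, a hypothetical stand-in), and it is an optional addition rather than part of the original analysis. fviz_nbclust() also accepts method = "silhouette" for a plotted version.

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Synthetic stand-in for scaled_features: three well-separated 2-D groups
toy_points <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)
toy_dist <- dist(toy_points)  # pairwise Euclidean distances

# Average silhouette width for k = 2..6 (undefined for k = 1)
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  fit <- kmeans(toy_points, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, toy_dist)[, "sil_width"])
})
names(avg_sil) <- k_values
print(round(avg_sil, 3))
```

Values closer to 1 indicate compact, well-separated clusters. A k that scores well on both criteria is a safer choice than one supported by the elbow alone.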

Performing K-Means Clustering

Based on the Elbow Method, we choose k = 5 to perform the final K-means clustering.

# Perform K-means clustering with k=5
final_k <- 5
kmeans_result <- kmeans(scaled_features, 
                        centers = final_k, 
                        nstart = 25) # Use multiple random starts for stability

# Assign cluster labels back to the original data
customer_data$Cluster <- as.factor(kmeans_result$cluster)

# Print the size of each cluster
cluster_sizes <- table(customer_data$Cluster)
print("Number of customers in each cluster:")

## [1] "Number of customers in each cluster:"

print(cluster_sizes)

## 
##  1  2  3  4  5 
## 54 20 47 39 40
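
Beyond the cluster labels, the object returned by kmeans() carries several summary components. The sketch below, run on synthetic data (toy_points is a hypothetical stand-in), shows that the size component matches a table of the labels and that the total sum of squares splits into within-cluster and between-cluster parts.

```r
set.seed(123)

# Synthetic stand-in: two well-separated 2-D groups
toy_points <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2)
)
fit <- kmeans(toy_points, centers = 2, nstart = 25)

# Cluster sizes stored by kmeans() match a table of the labels
print(fit$size)
print(table(fit$cluster))

# Share of the total sum of squares explained by the partition
explained <- fit$betweenss / fit$totss
print(round(explained, 3))
```

A ratio near 1 means most of the variation lies between clusters rather than within them. The same components are available on kmeans_result.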

Visualizing the Clustering Results

Because the clustering uses three features, we plot the customers in the plane of the first two principal components, which fviz_cluster() computes automatically when the data has more than two variables.

# Visualize the clusters
fviz_cluster(kmeans_result, 
             data = scaled_features,
             palette = "Set2", 
             geom = "point",
             ellipse.type = "norm", 
             star.plot = FALSE, 
             repel = TRUE,
             ggtheme = theme_minimal(),
             main = "K-means Clustering Results (k=5)")
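
A two-dimensional projection discards some information, so it helps to know how much variance the first two principal components retain. The following is a minimal sketch using prcomp() on synthetic standardized data (toy_features is a hypothetical stand-in for scaled_features).

```r
set.seed(123)

# Synthetic stand-in with three standardized features
toy_features <- scale(matrix(
  rnorm(300), ncol = 3,
  dimnames = list(NULL, c("age", "income", "score"))
))

# Principal component analysis on the standardized features
pca <- prcomp(toy_features)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Share retained by the two-dimensional view
print(round(sum(var_explained[1:2]), 3))

# Coordinates of the first observations on the first two components
print(head(pca$x[, 1:2]))
```

If the first two components retain only a modest share of the variance, clusters that overlap in the plot may still be well separated in the full feature space.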

Interpretation and Business Implications

Profiling the Clusters

To translate the clustering results into actionable segments, we analyze the average characteristics of customers in each cluster.

# Calculate the mean of original features for each cluster
cluster_profile <- customer_data %>%
  group_by(Cluster) %>%
  summarise(
    Count = n(),
    Percent = round(n() / nrow(customer_data) * 100, 1),
    Avg_Age = round(mean(Age), 1),
    Avg_Annual_Income_k = round(mean(Annual.Income..k..), 1),
    Avg_Spending_Score = round(mean(Spending.Score..1.100.), 1)
  ) %>%
  arrange(desc(Avg_Annual_Income_k)) 

print(cluster_profile)

## # A tibble: 5 × 6
##   Cluster Count Percent Avg_Age Avg_Annual_Income_k Avg_Spending_Score
##   <fct>   <int>   <dbl>   <dbl>               <dbl>              <dbl>
## 1 4          39    19.5    39.9                86.1               19.4
## 2 5          40    20      32.9                86.1               81.5
## 3 3          47    23.5    55.6                54.4               48.9
## 4 1          54    27      25.2                41.1               62.2
## 5 2          20    10      46.2                26.8               18.4
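
The centroids stored in the K-means result are expressed in standardized units, whereas the profile table reports means in original units. The two are equivalent: undoing the z-score on a centroid gives the per-cluster mean. The sketch below demonstrates this on synthetic data (toy, with hypothetical columns age and income), using the centering and scaling attributes that scale() attaches to its result.

```r
set.seed(42)

# Hypothetical two-feature data in original units (age in years, income in k)
toy <- data.frame(
  age = c(rnorm(20, mean = 25, sd = 3), rnorm(20, mean = 55, sd = 3)),
  income = c(rnorm(20, mean = 40, sd = 5), rnorm(20, mean = 85, sd = 5))
)
toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 2, nstart = 25)

# scale() stores the column means and standard deviations as attributes
col_means <- attr(toy_scaled, "scaled:center")
col_sds <- attr(toy_scaled, "scaled:scale")

# Undo the z-score: multiply by the SD, then add the mean, column by column
centers_original <- sweep(fit$centers, 2, col_sds, "*")
centers_original <- sweep(centers_original, 2, col_means, "+")
print(round(centers_original, 1))

# Per-cluster means computed directly in original units, for comparison
cluster_means <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
print(cluster_means)
```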

Discussion and Strategic Recommendations

Based on the profile table above, we can define five distinct customer segments and propose targeted strategies. The cluster numbers below are the labels in the table's Cluster column, not its row positions.

Cluster 4 (High-Income, Low-Spending): With an average income of 86.1k, tied for the highest, but a low spending score (19.4), these are Cautious Affluents. They have the financial capacity but are highly selective or infrequent shoppers. Strategy: Focus on high-quality, investment-piece products (e.g., luxury goods, electronics), and offer premium customer service and extended warranties to build trust and convert their potential into spending.

Cluster 5 (High-Income, High-Spending): This group has the same high average income (86.1k) but exhibits the highest spending score (81.5). They are the ideal VIP Champions. Strategy: Implement exclusive VIP programs with personal shopping assistants, invite them to pre-sale events, and reward them with high-value experiences to solidify their loyalty and encourage advocacy.

Cluster 3 (Older, Moderate-Income, Moderate-Spending): As the oldest segment (avg. 55.6 years) with moderate income (54.4k) and spending (48.9), they represent Established & Practical shoppers. Strategy: Promote comfort, quality, and value-for-money products. Email newsletters and in-store promotions for home goods, healthcare products, and leisure activities are likely to suit this group.

Cluster 1 (Young, Moderate-Income, High-Spending): This is the youngest segment (avg. 25.2 years) and the largest (54 customers, 27%), with a moderate income (41.1k) and a relatively high spending score (62.2). These are Young Trendsetters. Strategy: Engage heavily on TikTok with influencer collaborations. Offer trendy fashion, fast-moving consumer goods, and student/youth discounts to capture their discretionary spending.

Cluster 2 (Mid-Age, Low-Income, Low-Spending): The smallest segment (20 customers, 10%) has the lowest average income (26.8k) and the lowest spending score (18.4). They are Budget-Constrained shoppers. Strategy: Attract them with clear value propositions: promotions on essentials, budget-friendly bundles, and loyalty points programs that reward frequent, small purchases.

Conclusion

This analysis successfully segmented the mall’s customer base into five distinct groups using K-means clustering. The methodology, from data standardization to the Elbow Method and final visualization, provides a reproducible framework for customer segmentation. The derived profiles offer clear, data-driven guidance for developing differentiated marketing and service strategies, ultimately aiming to enhance customer satisfaction and increase revenue.