Customer segmentation is a fundamental application of unsupervised learning in marketing analytics. This report employs the K-means clustering algorithm to segment customers of a retail mall based on three demographic and behavioral variables: age, annual income, and spending score. The objective is to derive actionable customer profiles to inform targeted marketing strategies.
The analysis utilizes the following R packages for data manipulation, visualization, and clustering.
# Data manipulation and visualization
library(tidyverse)
# Advanced clustering visualization
library(factoextra)
# Clustering algorithms
library(cluster)
The dataset Mall_Customers.csv is loaded, containing customer information. An initial inspection is performed to understand its structure and completeness.
# Load the customer dataset
customer_data <- read.csv("Mall_Customers.csv")
# Display the first observations and data structure
head(customer_data)
##   CustomerID  Genre Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76
str(customer_data)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Genre : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual.Income..k.. : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
Prior to clustering, it is essential to preprocess the data and explore the distributions and relationships between key variables.
For K-means clustering, we select the three relevant numerical features and standardize them to have a mean of zero and a standard deviation of one. This prevents variables with larger scales (like income) from dominating the distance calculation.
# Select relevant features for clustering
features_for_clustering <- customer_data %>%
  select(Age, Annual.Income..k.., Spending.Score..1.100.)
# Standardize the features (Z-score normalization)
scaled_features <- scale(features_for_clustering)
colnames(scaled_features) <- c("Scaled_Age", "Scaled_Income", "Scaled_SpendingScore")
head(scaled_features)
##      Scaled_Age Scaled_Income Scaled_SpendingScore
## [1,] -1.4210029     -1.734646           -0.4337131
## [2,] -1.2778288     -1.734646            1.1927111
## [3,] -1.3494159     -1.696572           -1.7116178
## [4,] -1.1346547     -1.696572            1.0378135
## [5,] -0.5619583     -1.658498           -0.3949887
## [6,] -1.2062418     -1.658498            0.9990891
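As a quick check on what scale() does, the sketch below standardizes the six rows shown by head() by hand and compares the result with scale(). The data frame `toy` and its column names are illustrative only. Because the z-scores here use the mean and standard deviation of just six rows, they will not match the values printed above.

```r
# Toy data: the first six customers from the head() output above
toy <- data.frame(
  Age      = c(19, 21, 20, 23, 31, 22),
  Income_k = c(15, 15, 16, 16, 17, 17),
  Spending = c(39, 81, 6, 77, 40, 76)
)

# Manual z-score: subtract the column mean, divide by the sample standard deviation
manual_z <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# scale() applies the same transformation column by column
auto_z <- scale(toy)

# TRUE when both results agree (stored attributes such as the centers are ignored)
same_result <- isTRUE(all.equal(manual_z, auto_z, check.attributes = FALSE))
print(same_result)
```

After scaling, every column has mean zero and standard deviation one, so no single feature dominates the Euclidean distances that K-means relies on.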
Before clustering, a scatter plot matrix and a correlation matrix of the original variables show how the three features relate to one another.
# Create a paired scatter plot matrix using base R
pairs(features_for_clustering,
      main = "Pairwise Relationships: Age, Income, and Spending Score",
      pch = 19, col = adjustcolor("steelblue", alpha.f = 0.6))
# Calculate and print basic correlations
cor_matrix <- cor(features_for_clustering)
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(round(cor_matrix, 2))
##                          Age Annual.Income..k.. Spending.Score..1.100.
## Age                     1.00              -0.01                  -0.33
## Annual.Income..k..     -0.01               1.00                   0.01
## Spending.Score..1.100. -0.33               0.01                   1.00
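For reference, each entry of this matrix is a Pearson coefficient: the covariance of two variables divided by the product of their standard deviations. The sketch below verifies that identity on the six rows shown by head(). The vectors `age` and `spending` are illustrative, and six rows are far too few to reproduce the -0.33 reported for the full dataset.

```r
# Age and spending score of the first six customers (from the head() output)
age      <- c(19, 21, 20, 23, 31, 22)
spending <- c(39, 81, 6, 77, 40, 76)

# Pearson correlation from its definition
r_manual <- cov(age, spending) / (sd(age) * sd(spending))

# The same quantity from the built-in function
r_builtin <- cor(age, spending)

# TRUE when the two computations agree
print(isTRUE(all.equal(r_manual, r_builtin)))
```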
A critical step in K-means is selecting the number of clusters (k). The Elbow Method plots the total within-cluster sum of squares (WSS) against k. WSS generally falls as k grows, so the “elbow”, the point beyond which additional clusters yield only small reductions, marks a reasonable balance between model complexity and explanatory power.
# Use the Elbow Method to suggest optimal k
set.seed(123) # Ensure reproducibility
fviz_nbclust(scaled_features,
             FUNcluster = kmeans,
             method = "wss",
             k.max = 10) +
  geom_vline(xintercept = 5, linetype = 2, color = "red") +
  labs(title = "Elbow Method for Optimal K",
       subtitle = "Total Within-Cluster Sum of Squares vs. Number of Clusters")
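The curve behind an elbow plot can also be computed directly, which makes the size of each drop easy to inspect. The following is a minimal base R sketch on synthetic data, not the mall dataset. The names `toy`, `centers_true`, `wss`, and `wss_drop` are illustrative, and the numbers it prints say nothing about the report's own curve.

```r
set.seed(123)

# Synthetic stand-in for the scaled features: three Gaussian blobs in 3 dimensions
centers_true <- rbind(c(0, 0, 0), c(4, 4, 0), c(0, 4, 4))
toy <- do.call(rbind, lapply(1:3, function(i) {
  noise <- matrix(rnorm(50 * 3, sd = 0.6), ncol = 3)
  noise + matrix(centers_true[i, ], nrow = 50, ncol = 3, byrow = TRUE)
}))

# Total within-cluster sum of squares for k = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Reduction in WSS gained by each additional cluster; the elbow is where this flattens
wss_drop <- -diff(wss)

print(round(wss, 1))
print(round(wss_drop, 1))
```

Reading the drops side by side is often easier than judging the bend of a plotted line by eye, especially when the elbow is not sharp.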
Based on the Elbow Method, we choose k = 5 to perform the final K-means clustering.
# Perform K-means clustering with k=5
final_k <- 5
kmeans_result <- kmeans(scaled_features,
                        centers = final_k,
                        nstart = 25) # Use multiple random starts for stability
# Assign cluster labels back to the original data
customer_data$Cluster <- as.factor(kmeans_result$cluster)
# Print the size of each cluster
cluster_sizes <- table(customer_data$Cluster)
print("Number of customers in each cluster:")
## [1] "Number of customers in each cluster:"
print(cluster_sizes)
##
##  1  2  3  4  5
## 54 20 47 39 40
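The report selects k with the Elbow Method alone. A common complementary check, which the original analysis does not use, is the average silhouette width: values closer to 1 indicate tighter, better-separated clusters. The sketch below uses silhouette() from the cluster package on synthetic data. The objects `toy`, `d`, and `avg_sil` are illustrative, and the results do not describe the mall data.

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Synthetic two-dimensional data with three well-separated groups (not the mall data)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5))
)
d <- dist(toy)  # Euclidean distances between all pairs of points

# Average silhouette width for k = 2..6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

print(round(avg_sil, 3))
```

Applied to scaled_features, the same loop would show whether k = 5 is also favored by a criterion other than WSS.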
Because the data has three dimensions, fviz_cluster() plots the clusters on the first two principal components, which condense the three standardized features into a two-dimensional view.
# Visualize the clusters
fviz_cluster(kmeans_result,
             data = scaled_features,
             palette = "Set2",
             geom = "point",
             ellipse.type = "norm",
             star.plot = FALSE,
             repel = TRUE,
             ggtheme = theme_minimal(),
             main = "K-means Clustering Results (k=5)")
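A two-dimensional projection is only as faithful as the share of variance its axes retain. The sketch below shows how that share can be computed with prcomp(), on the assumption that the plot is based on an ordinary principal component analysis of the supplied data. The matrix `toy` is random synthetic data standing in for scaled_features, so the printed percentages are not those of the report.

```r
set.seed(123)

# Synthetic stand-in for three standardized features (not the mall data)
toy <- scale(matrix(rnorm(200 * 3), ncol = 3))

# Principal component analysis of the already standardized data
pca <- prcomp(toy)

# Share of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Share retained by a plot that uses only the first two components
first_two <- sum(var_explained[1:2])

# Coordinates of every observation on those two components
scores <- pca$x[, 1:2]

print(round(var_explained, 3))
print(round(first_two, 3))
```

If the first two components retain most of the variance, overlaps between clusters in the plot are likely real. If they retain little, clusters that look merged in two dimensions may still be separated along the third.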
To translate the clustering results into actionable segments, we analyze the average characteristics of customers in each cluster.
# Calculate the mean of original features for each cluster
cluster_profile <- customer_data %>%
  group_by(Cluster) %>%
  summarise(
    Count = n(),
    Percent = round(n() / nrow(customer_data) * 100, 1),
    Avg_Age = round(mean(Age), 1),
    Avg_Annual_Income_k = round(mean(Annual.Income..k..), 1),
    Avg_Spending_Score = round(mean(Spending.Score..1.100.), 1)
  ) %>%
  arrange(desc(Avg_Annual_Income_k))
print(cluster_profile)
## # A tibble: 5 × 6
##   Cluster Count Percent Avg_Age Avg_Annual_Income_k Avg_Spending_Score
##   <fct>   <int>   <dbl>   <dbl>               <dbl>              <dbl>
## 1 4          39    19.5    39.9                86.1               19.4
## 2 5          40    20      32.9                86.1               81.5
## 3 3          47    23.5    55.6                54.4               48.9
## 4 1          54    27      25.2                41.1               62.2
## 5 2          20    10      46.2                26.8               18.4
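The same profile can be recovered from the fitted model itself. K-means stores its centroids in z-score units, and multiplying by each feature's standard deviation and adding back its mean converts them to original units. The sketch below checks this against per-cluster means on synthetic data. `toy`, `z`, `km`, and `centers_orig` are illustrative names, and the random values are not the mall data.

```r
set.seed(42)

# Synthetic customers with roughly the same value ranges as the report's features
toy <- data.frame(
  Age      = round(runif(60, 18, 70)),
  Income_k = round(runif(60, 15, 137)),
  Spending = round(runif(60, 1, 99))
)

# Standardize, keeping the column means and standard deviations as attributes
z <- scale(toy)
km <- kmeans(z, centers = 3, nstart = 25)

# Undo the scaling: multiply by the standard deviation, then add the mean
centers_orig <- sweep(km$centers, 2, attr(z, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(z, "scaled:center"), "+")

# Per-cluster means computed directly from the unscaled data
group_means <- as.matrix(
  aggregate(toy, by = list(Cluster = km$cluster), FUN = mean)[, -1]
)

# TRUE when the back-transformed centroids match the per-cluster means
match_ok <- isTRUE(all.equal(unname(centers_orig), unname(group_means)))
print(match_ok)
```

The agreement follows from standardization being a linear transformation, so the mean of the scaled values maps back exactly to the mean of the original values.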
Based on the profile table above, we can define five distinct customer segments and propose targeted strategies.
Cluster 4 (High-Income, Low-Spending): With an average income tied for the highest (86.1k) but the second-lowest spending score (19.4), these are Cautious Affluents. They have the financial capacity but are highly selective or infrequent shoppers. Strategy: Focus on high-quality, investment-piece products (e.g., luxury goods, electronics), and offer premium customer service and extended warranties to build trust and convert their potential into spending.
Cluster 5 (High-Income, High-Spending): This group has the same high average income (86.1k) but exhibits the highest spending score (81.5). They are the ideal VIP Champions. Strategy: Implement exclusive VIP programs with personal shopping assistants, invite them to pre-sale events, and reward them with high-value experiences to solidify their loyalty and encourage advocacy.
Cluster 3 (Older, Moderate-Income, Moderate-Spending): As the oldest segment (Avg. 55.6 years) with moderate income (54.4k) and spending (48.9), they represent Established & Practical shoppers. Strategy: Promote comfort, quality, and value-for-money products. Marketing through email newsletters and in-store promotions for home goods, healthcare products, and leisure activities would be effective.
Cluster 1 (Young, Moderate-Income, High-Spending): The youngest and largest segment (Avg. 25.2 years, 54 customers) with an income of 41.1k and the second-highest spending score (62.2). These are Young Trendsetters. Strategy: Engage heavily on TikTok with influencer collaborations. Offer trendy fashion, fast-moving consumer goods, and student/youth discounts to capture their discretionary spending.
Cluster 2 (Mid-Age, Low-Income, Low-Spending): This smallest segment (20 customers, Avg. 46.2 years) has the lowest average income (26.8k) and the lowest spending score (18.4). They are Budget-Constrained shoppers. Strategy: Attract them with clear value propositions: promotions on essentials, budget-friendly bundles, and loyalty points programs that reward frequent, small purchases.
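To carry these names into downstream reports, the cluster IDs can be mapped to segment labels with a lookup vector. The sketch below reads the mapping off the profile table. Note that K-means numbers its clusters arbitrarily, so this lookup holds only for the run shown in this report and must be rebuilt if the clustering is refitted. `segment_names` and `cluster_ids` are hypothetical names, and the six assignments are made up for illustration.

```r
# Lookup from cluster ID to segment name, read off the profile table above
segment_names <- c(
  "1" = "Young Trendsetters",
  "2" = "Budget-Constrained",
  "3" = "Established & Practical",
  "4" = "Cautious Affluents",
  "5" = "VIP Champions"
)

# Hypothetical cluster assignments for six customers (a factor, like customer_data$Cluster)
cluster_ids <- factor(c(4, 5, 3, 1, 2, 1), levels = 1:5)

# Index by the label, not the integer code, so the lookup does not depend on level order
segments <- unname(segment_names[as.character(cluster_ids)])
print(segments)
```

In the report's own workflow, the same indexing applied to customer_data$Cluster would add a readable segment column alongside the numeric cluster label.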
This analysis successfully segmented the mall’s customer base into five distinct groups using K-means clustering. The methodology, from data standardization to the Elbow Method and final visualization, provides a reproducible framework for customer segmentation. The derived profiles offer clear, data-driven guidance for developing differentiated marketing and service strategies, ultimately aiming to enhance customer satisfaction and increase revenue.