Bernadette Mutsvagiwa
Unsupervised Learning Project - Clustering
Introduction
This study uses a mall customer data set comprising age, annual income, and spending behavior to conduct a comparative segmentation analysis. By implementing K-Means, Hierarchical Clustering, and DBSCAN, we aim to evaluate how different algorithmic approaches group consumers based on their demographic and financial profiles.
Data Loading and Pre-processing
First, we will import the data and verify its completeness. The dataset is from Kaggle. We remove the CustomerID column, since it is just a label and does not help the algorithms find actual patterns in customer behavior.
# Loading required libraries
library(dplyr)
library(ggplot2)
# Setting working directory
setwd("/Users/benna/Documents/Unsupervised Learning Project")
# Loading the dataset
data <- read.csv("Mall_Customers.csv")
# Checking for missing values
sum(is.na(data))
[1] 0
head(data)
  CustomerID Gender Age AnnualIncome SpendingScore
1 1 Male 19 15 39
2 2 Male 21 15 81
3 3 Female 20 16 6
4 4 Female 23 16 77
5 5 Female 31 17 40
6 6 Female 22 17 76
The initial phase involves importing the data set via
read.csv() and performing a data integrity check using
is.na() to identify missing values. We exclude the
CustomerID variable, as it lacks predictive value for segmenting
customer behavior. Finally, a preview of the initial rows confirms the
data was ingested and structured correctly.
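As an extra integrity check (an addition beyond the original workflow), str() and summary() can confirm the column types and value ranges before any transformation:
# Inspecting column types and value ranges (optional sanity check)
str(data)
summary(data)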
Data Scaling
Because Age, Annual Income, and Spending Score are measured in different units, we scale the data set. Centering each feature to a mean of 0 and a standard deviation of 1 prevents variables with larger numerical ranges from disproportionately influencing the clustering distance calculations.
# Loading the dataset
data <- read.csv("/Users/benna/Documents/Unsupervised Learning Project/Mall_Customers.csv")
# Renaming columns for clarity and consistency
colnames(data) <- c("CustomerID", "Gender", "Age", "AnnualIncome", "SpendingScore")
# Removing the CustomerID column as it is not needed for clustering
data <- data[, c("Age", "AnnualIncome", "SpendingScore")]
# Scaling the numerical columns (Age, AnnualIncome, SpendingScore)
data_scaled <- scale(data)
# Checking the scaled data
head(data_scaled)
            Age AnnualIncome SpendingScore
[1,] -1.4210029 -1.734646 -0.4337131
[2,] -1.2778288 -1.734646 1.1927111
[3,] -1.3494159 -1.696572 -1.7116178
[4,] -1.1346547 -1.696572 1.0378135
[5,] -0.5619583 -1.658498 -0.3949887
[6,] -1.2062418 -1.658498 0.9990891
In this step, we perform feature standardization on Age, Annual
Income, and Spending Score to ensure a balanced weight distribution
across the model. By utilizing the scale() function, we
normalize these variables to a mean of 0 and a standard deviation of 1,
preventing the clustering algorithm from being biased toward features
with larger numeric magnitudes.
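As a quick sanity check (an addition, assuming the data_scaled object from the previous step), we can confirm that each scaled column has a mean of approximately 0 and a standard deviation of 1:
# Verifying the scaling: means should be ~0 and standard deviations exactly 1
round(colMeans(data_scaled), 10)
apply(data_scaled, 2, sd)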
# Visualizing the distribution of numerical features before scaling
par(mfrow = c(1, 3))
hist(data$Age, main = "Age Distribution", xlab = "Age", col = "pink")
hist(data$AnnualIncome, main = "Annual Income Distribution", xlab = "Income", col = "turquoise")
hist(data$SpendingScore, main = "Spending Score Distribution", xlab = "Spending Score", col = "red")
Here we generate histograms to visualize the pre-scaling feature space. This exploratory step lets us evaluate the variance and frequency distribution of the demographic and financial attributes before they are transformed for the clustering model.
Clustering - K-Means
The next step executes the K-Means algorithm. Running it across different values of \(k\) allows us to compare clustering configurations and visualize how the data points are distributed as the number of segments changes; here we proceed with \(k = 5\).
# Applying K-Means clustering with 5 clusters (adjust k if necessary)
set.seed(123) # Set a seed for reproducibility
kmeans_result <- kmeans(data_scaled, centers = 5)
# Adding cluster labels to the original dataset
data$cluster <- kmeans_result$cluster
# Visualizing the K-Means clustering results
ggplot(data, aes(x = Age, y = AnnualIncome, color = as.factor(cluster))) +
geom_point() +
labs(title = "K-Means Clustering (5 clusters)", color = "Cluster") + theme_minimal()
To segment the data, we implemented the K-Means algorithm using a
target of 5 clusters. We utilized set.seed(123) to
guarantee consistent results across different runs. The resulting
cluster assignments were then integrated into our primary data set and
visualized via a ggplot2 scatter plot, specifically mapping the
relationship between Age and Annual Income.
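Although we fixed the cluster count at 5, that choice can be sanity-checked with the elbow method. The sketch below is an addition to the original analysis: it recomputes K-Means for \(k = 1\) through 10 and plots the total within-cluster sum of squares, looking for the "bend" where additional clusters stop paying off.
# Elbow method: total within-cluster sum of squares for k = 1 to 10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(data_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow Method")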
Hierarchical Clustering
Next, we use hierarchical clustering to create a dendrogram, which looks like a family tree of our data. This visual tool shows how different customer groups branch off from one another, making it easier to decide exactly how many clusters to use. We begin by generating a distance matrix via the dist() function to quantify the dissimilarity between observations, then apply the hclust() algorithm with Ward's minimum variance method (ward.D2). The resulting dendrogram serves as a diagnostic tool, allowing us to determine the optimal number of clusters by identifying the most appropriate horizontal cut-off point.
# Calculating distance matrix
dist_matrix <- dist(data_scaled)
# Performing hierarchical clustering
hclust_result <- hclust(dist_matrix, method = "ward.D2")
# Visualizing the dendrogram
plot(hclust_result, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "")
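To convert the dendrogram into concrete segment labels, the tree can be cut at a chosen number of clusters. The sketch below is an addition; it assumes k = 5 to mirror the K-Means run, and rect.hclust() outlines those clusters on the dendrogram just plotted.
# Cutting the tree into 5 clusters (k = 5 assumed, mirroring the K-Means run)
hc_clusters <- cutree(hclust_result, k = 5)
table(hc_clusters)  # cluster sizes
rect.hclust(hclust_result, k = 5, border = "red")  # outline clusters on the dendrogram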
Clustering - DBSCAN
In the final stage of our analysis, we implement DBSCAN (Density-Based Spatial Clustering of Applications with Noise). This algorithm identifies clusters based on point density rather than distance from a center, which also lets it flag outliers. Using an epsilon of 0.5 and a minimum-points threshold of 5, the algorithm partitions the data into density-reachable clusters while filtering out noise. Following the computation, we use ggplot2 to map these density-based groupings, providing a contrast to our previous distance-based models.
# Load the dbscan package (install first with install.packages("dbscan") if needed)
library("dbscan")
# Apply DBSCAN with epsilon = 0.5 and minPts = 5
dbscan_result <- dbscan(data_scaled, eps = 0.5, minPts = 5)
# Add DBSCAN labels to the dataset
data$dbscan_cluster <- dbscan_result$cluster
# Visualizing the DBSCAN clustering results
ggplot(data, aes(x = Age, y = AnnualIncome, color = as.factor(dbscan_cluster))) + geom_point() +
labs(title = "DBSCAN Clustering", color = "Cluster") +
theme_minimal()
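The eps value of 0.5 was taken as given; a common diagnostic (an addition, not part of the original analysis) is the k-nearest-neighbor distance plot from the dbscan package, where a reasonable eps sits near the "knee" of the curve. Here k is matched to the minPts value used above (conventions vary between minPts and minPts - 1).
# kNN distance plot to assess the eps choice (k matched to minPts = 5)
kNNdistplot(data_scaled, k = 5)
abline(h = 0.5, lty = 2)  # the eps value used above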
Conclusion
This analysis successfully segmented a mall customer base by leveraging three distinct unsupervised learning techniques: K-Means, Hierarchical Clustering, and DBSCAN. By evaluating the intersection of demographic data (Age) and financial metrics (Annual Income and Spending Score), we transformed raw data into actionable business intelligence.
Initially, K-Means was employed to partition the customers into five distinct segments, providing a clear, distance-based framework for identifying high-value vs. low-spending groups. To validate these groupings, we utilized Hierarchical Clustering; the resulting dendrogram allowed us to visualize the genealogy of the data, confirming the natural hierarchy of customer relationships and providing a structural basis for determining the optimal number of clusters. Finally, DBSCAN was implemented as a density-based check, which was particularly effective at identifying core customer clusters while isolating noise and those unique, outlier customers who do not fit standard patterns.
Together, these methodologies provide a multi-dimensional view of the consumer landscape, enabling the business to move beyond one-size-fits-all advertising. Instead, stakeholders can design data-driven, targeted marketing campaigns and personalized services that resonate with the specific behaviors and economic profiles of each identified segment, ultimately driving higher engagement and resource efficiency.