Bernadette Mutsvagiwa

Unsupervised Learning Project - Clustering

Introduction

This study utilizes a mall customer data set comprised of age, annual income and spending behavior to conduct a comparative segmentation analysis. By implementing K-Means, Hierarchical Clustering, and DBSCAN, we aim to evaluate how different algorithmic approaches group consumers based on their demographic and financial profiles.

Data Loading and Pre-processing

First, we will import the data and verify its completeness. The dataset is from Kaggle. We remove the CustomerID column since it’s just a label and doesn’t help the algorithms find actual patterns in customer behavior.

# Loading required libraries 
library(dplyr)

library(ggplot2)

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
# Setting working directory 
setwd("/Users/benna/Documents/Unsupervised Learning Project")  

# Loading the dataset 
data <- read.csv("Mall_Customers.csv")  

# Checking for missing values 
sum(is.na(data))

[1] 0

Next, we display the first few rows of the data.

head(data)



  CustomerID  Genre Age Annual.Income..k.. Spending.Score..1.100.
1          1   Male  19                 15                     39
2          2   Male  21                 15                     81
3          3 Female  20                 16                      6
4          4 Female  23                 16                     77
5          5 Female  31                 17                     40
6          6 Female  22                 17                     76

The initial phase involves importing the data set via read.csv() and performing a data integrity check using is.na() to identify missing values. We exclude the CustomerID variable, as it lacks predictive value for segmenting customer behavior. Finally, a preview of the initial rows confirms the data was ingested and structured correctly.

Data Scaling

Because features such as Age, Annual Income, and Spending Score are measured in different units, we will apply scaling to the data set. By centering the features around a mean of 0 and a standard deviation of 1, we prevent variables with larger numerical ranges from disproportionately influencing the clustering distance calculations.
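
As a worked illustration (an addition, not part of the original script), standardization subtracts the column mean and divides by the standard deviation; the hypothetical variable age_scaled_manual below applies this to the Age column and is equivalent to what scale() produces:

# (Illustration) Manual z-score for Age, equivalent to the scale() call below 
age_scaled_manual <- (data$Age - mean(data$Age)) / sd(data$Age)  
head(age_scaled_manual)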

# Loading the dataset 
data <- read.csv("/Users/benna/Documents/Unsupervised Learning Project/Mall_Customers.csv")  

# Renaming columns for clarity and consistency 
colnames(data) <- c("CustomerID", "Gender", "Age", "AnnualIncome", "SpendingScore")  

# Removing the CustomerID column as it is not needed for clustering 
data <- data[, c("Age", "AnnualIncome", "SpendingScore")]  

# Scaling only the numerical columns (Age, AnnualIncome, SpendingScore) 
data_scaled <- scale(data)  

# Checking the scaled data 
head(data_scaled)
            Age AnnualIncome SpendingScore
[1,] -1.4210029    -1.734646    -0.4337131
[2,] -1.2778288    -1.734646     1.1927111
[3,] -1.3494159    -1.696572    -1.7116178
[4,] -1.1346547    -1.696572     1.0378135
[5,] -0.5619583    -1.658498    -0.3949887
[6,] -1.2062418    -1.658498     0.9990891

In this step, we perform feature standardization on Age, Annual Income, and Spending Score to ensure a balanced weight distribution across the model. By utilizing the scale() function, we normalize these variables to a mean of 0 and a standard deviation of 1, preventing the clustering algorithm from being biased toward features with larger numeric magnitudes.
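
As a quick sanity check (an optional addition), one can confirm the transformation by inspecting the column means and standard deviations of the scaled matrix, which should be approximately 0 and exactly 1:

# Verifying that each scaled column has mean ~0 and standard deviation 1 
round(colMeans(data_scaled), 10)  
apply(data_scaled, 2, sd)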

# Visualizing the distribution of numerical features before scaling 
par(mfrow = c(1, 3)) 
hist(data$Age, main = "Age Distribution", xlab = "Age", col = "pink") 
hist(data$AnnualIncome, main = "Annual Income Distribution", xlab = "Income", col = "turquoise") 
hist(data$SpendingScore, main = "Spending Score Distribution", xlab = "Spending Score", col = "red")

In this part we generated histograms to visualize the pre-scaling feature space. This exploratory step is critical for evaluating the variance and frequency distribution of the demographic and financial attributes before they are transformed for the clustering model.

Clustering - K Means

The next step involves executing the K-Means algorithm across various values of \(k\). This iterative approach allows us to compare different clustering configurations and visualize how the data points are distributed as the number of segments changes.

# Applying K-Means clustering with 5 clusters (adjust k if necessary) 
set.seed(123)  # Set a seed for reproducibility 
kmeans_result <- kmeans(data_scaled, centers = 5)  

# Adding cluster labels to the original dataset 
data$cluster <- kmeans_result$cluster  

# Visualizing the K-Means clustering results 
ggplot(data, aes(x = Age, y = AnnualIncome, color = as.factor(cluster))) +   
  geom_point() +   
  labs(title = "K-Means Clustering (5 clusters)", color = "Cluster") +   
  theme_minimal()

To segment the data, we implemented the K-Means algorithm using a target of 5 clusters. We utilized set.seed(123) to guarantee consistent results across different runs. The resulting cluster assignments were then integrated into our primary data set and visualized via a ggplot2 scatter plot, specifically mapping the relationship between Age and Annual Income.
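
Although we fixed the number of clusters at 5, a common way to justify this choice is the elbow method. The sketch below (an illustrative addition, not part of the original workflow) plots the total within-cluster sum of squares for \(k\) from 1 to 10; a pronounced bend in the curve suggests a reasonable \(k\):

# Elbow method sketch: total within-cluster sum of squares for k = 1 to 10 
set.seed(123)  
wss <- sapply(1:10, function(k) kmeans(data_scaled, centers = k, nstart = 25)$tot.withinss)  

# Plotting the curve; the "elbow" marks a candidate number of clusters 
plot(1:10, wss, type = "b", xlab = "Number of clusters k", 
     ylab = "Total within-cluster sum of squares", main = "Elbow Method")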

Hierarchical Clustering

Next, we use hierarchical clustering to explore the data’s structure. We begin by generating a distance matrix via the dist() function to quantify the dissimilarity between observations, and then apply the hclust() algorithm using Ward’s minimum variance method (ward.D2). The resulting dendrogram looks like a family tree of our data: it shows how different customer groups branch off from one another, and it serves as a diagnostic tool for determining the optimal number of clusters by identifying the most appropriate horizontal cut-off point.

# Calculating distance matrix 
dist_matrix <- dist(data_scaled)  

# Performing hierarchical clustering 
hclust_result <- hclust(dist_matrix, method = "ward.D2")  

# Visualizing the dendrogram 
plot(hclust_result, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "")
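
To turn a chosen cut-off point into concrete labels, one could cut the tree with cutree(). The sketch below is an illustrative addition that assumes a cut at \(k = 5\), mirroring the K-Means configuration; hc_clusters is a hypothetical variable name:

# Cutting the dendrogram into 5 clusters (chosen to match the K-Means run) 
hc_clusters <- cutree(hclust_result, k = 5)  

# Counting how many customers fall into each hierarchical cluster 
table(hc_clusters)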

Clustering - DBSCAN

In the final stage of our analysis, we implement DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Unlike our previous distance-based models, this algorithm identifies density-reachable clusters based on point density rather than distance from a center. Using an epsilon of 0.5 and a minimum-points threshold of 5, the algorithm partitions the feature space while effectively filtering out outliers. Following the computation, we utilize ggplot2 to map these density-based groupings and any identified noise points.

# Loading the dbscan package 
library("dbscan")

The following object is masked from ‘package:stats’:

    as.dendrogram
# Apply DBSCAN with epsilon = 0.5 and minPts = 5 
dbscan_result <- dbscan(data_scaled, eps = 0.5, minPts = 5)  

# Add DBSCAN labels to the dataset 
data$dbscan_cluster <- dbscan_result$cluster  

# Visualizing the DBSCAN clustering results 
ggplot(data, aes(x = Age, y = AnnualIncome, color = as.factor(dbscan_cluster))) +   
  geom_point() +  
  labs(title = "DBSCAN Clustering", color = "Cluster") +   
  theme_minimal()
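
The choice of eps = 0.5 is an assumption rather than a tuned value; a common diagnostic is the k-nearest-neighbor distance plot provided by the dbscan package. The sketch below (an illustrative addition) draws that plot, marks our chosen eps for reference, and tabulates the resulting cluster sizes, where cluster 0 holds the noise points:

# k-NN distance plot; the "knee" of the curve suggests a reasonable eps 
kNNdistplot(data_scaled, k = 5)  
abline(h = 0.5, lty = 2)  # reference line at the chosen eps = 0.5 

# Tabulating cluster sizes; DBSCAN labels noise points as cluster 0 
table(dbscan_result$cluster)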

Conclusion

This comprehensive analysis successfully segmented a mall customer base by leveraging three distinct unsupervised learning techniques: K-Means, Hierarchical Clustering, and DBSCAN. By evaluating the intersection of demographic data (Age) and financial metrics (Annual Income and Spending Score), we transformed raw data into actionable business intelligence.

Initially, K-Means was employed to partition the customers into five distinct segments, providing a clear, distance-based framework for identifying high-value vs. low-spending groups. To validate these groupings, we utilized Hierarchical Clustering; the resulting dendrogram allowed us to visualize the genealogy of the data, confirming the natural hierarchy of customer relationships and providing a structural basis for determining the optimal number of clusters. Finally, DBSCAN was implemented as a density-based check, which was particularly effective at identifying core customer clusters while isolating noise and those unique, outlier customers who do not fit standard patterns.

Together, these methodologies provide a multi-dimensional view of the consumer landscape, enabling the business to move beyond one-size-fits-all advertising. Instead, stakeholders can now design data-driven, targeted marketing campaigns and personalized services that resonate with the specific behaviors and economic profiles of each identified segment, ultimately driving higher engagement and resource efficiency.