Introduction

In this analysis, we will perform customer segmentation using K-Means clustering, hierarchical clustering, and DBSCAN on a dataset of mall customers. The dataset includes information about customers’ age, annual income, and spending score. We will explore how clustering algorithms group similar customers together based on these features.

Data Loading and Preprocessing

First, we load the data and check for any missing values. The CustomerID column is not relevant for clustering; it is removed in the next section, before scaling.

# Loading required libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# Setting working directory
setwd("/Users/zayne/Desktop")

# Loading the dataset
data <- read.csv("Mall_Customers.csv")

# Checking for missing values
sum(is.na(data))

## [1] 0

# Displaying the first few rows of the data
head(data)

##   CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

Explanation: In this section, we load the dataset using read.csv(), count missing values with sum(is.na(data)), and display the first few rows to confirm that everything loaded correctly. The count of 0 means no values are missing. The CustomerID column is only an identifier and carries no information for clustering, so it is dropped in the next section (together with Gender, which is not numeric) before scaling.
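A single total is enough when the answer is zero, but on a file that does contain gaps a per-column count is more useful. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the mall data). It also counts empty strings separately, since is.na() does not flag them.

```r
# Hypothetical toy data frame standing in for a CSV with gaps
toy <- data.frame(
  Age    = c(19, NA, 20, 23),
  Gender = c("Male", "", "Female", "Female"),
  Income = c(15, 15, NA, NA),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Empty strings are not NA, so count them separately
empty_per_column <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))
print(empty_per_column)
```

Applied to the mall data, the same two calls would take `data` in place of `toy`.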

Data Scaling

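Since the features are measured in different units (Age in years, Annual Income in thousands, and Spending Score on a 1 to 100 scale), we will scale the numerical features to have a mean of 0 and a standard deviation of 1. The clustering algorithms used here rely on distances between points, so without scaling the feature with the largest numeric range would dominate the result.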

# Reloading the dataset (same file as above, this time via an absolute path)
data <- read.csv("/Users/zayne/Desktop/Mall_Customers.csv")

# Renaming columns for clarity and consistency
colnames(data) <- c("CustomerID", "Gender", "Age", "AnnualIncome", "SpendingScore")

# Keeping only the numeric features; this drops CustomerID and Gender
data <- data[, c("Age", "AnnualIncome", "SpendingScore")]

# Scaling only the numerical columns (Age, AnnualIncome, SpendingScore)
data_scaled <- scale(data)

# Checking the scaled data
head(data_scaled)

##             Age AnnualIncome SpendingScore
## [1,] -1.4210029    -1.734646    -0.4337131
## [2,] -1.2778288    -1.734646     1.1927111
## [3,] -1.3494159    -1.696572    -1.7116178
## [4,] -1.1346547    -1.696572     1.0378135
## [5,] -0.5619583    -1.658498    -0.3949887
## [6,] -1.2062418    -1.658498     0.9990891

Explanation: Here, we scale the numerical features (Age, Annual Income, and Spending Score) to ensure each feature has the same importance during clustering. The scale() function standardizes each feature, transforming it to have a mean of 0 and a standard deviation of 1.
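To make the standardization concrete, the sketch below reproduces scale() by hand on a small made-up matrix (the name `x` and its values are illustrative assumptions, not taken from the dataset). It subtracts each column's mean, divides by each column's standard deviation, and checks that the result matches scale().

```r
# Hypothetical toy matrix with two features on very different scales
x <- cbind(Age    = c(19, 21, 20, 23, 31),
           Income = c(15000, 15000, 16000, 16000, 17000))

# Standardize by hand: subtract the column mean, then divide by the column sd
x_manual <- sweep(x, 2, colMeans(x), "-")
x_manual <- sweep(x_manual, 2, apply(x, 2, sd), "/")

# scale() does the same and stores the centers and sds as attributes
x_scaled <- scale(x)

# Compare values only; check.attributes = FALSE ignores the extra attributes
print(all.equal(x_manual, x_scaled, check.attributes = FALSE))

# Each column now has mean ~0 and standard deviation 1
print(round(colMeans(x_scaled), 10))
print(apply(x_scaled, 2, sd))
```

The stored attributes ("scaled:center" and "scaled:scale") are what one would use to map cluster centers back to the original units.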

Visualizing the Data

Before applying the clustering algorithms, let’s visualize the distributions of the numerical features (Age, Annual Income, and Spending Score) to better understand the data.

# Visualizing the distribution of numerical features before scaling
par(mfrow = c(1, 3))
hist(data$Age, main = "Age Distribution", xlab = "Age", col = "lightblue")
hist(data$AnnualIncome, main = "Annual Income Distribution", xlab = "Income", col = "lightgreen")
hist(data$SpendingScore, main = "Spending Score Distribution", xlab = "Spending Score", col = "lightcoral")

Explanation: In this part, we use histograms to visualize the distributions of the numerical features before scaling. This helps us understand the spread and central tendencies of Age, Annual Income, and Spending Score.

K-Means Clustering

Now, we apply K-Means clustering to group customers into clusters. We use k = 5 here (the value can be adjusted) and visualize the result.

# Applying K-Means clustering with 5 clusters (adjust k if necessary)
set.seed(123)  # Set a seed for reproducibility
kmeans_result <- kmeans(data_scaled, centers = 5)

# Adding cluster labels to the original dataset
data$cluster <- kmeans_result$cluster

# Visualizing the K-Means clustering results
ggplot(data, aes(x = Age, y = AnnualIncome, color = as.factor(cluster))) +
  geom_point() +
  labs(title = "K-Means Clustering (5 clusters)", color = "Cluster") +
  theme_minimal()

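Explanation: We apply the K-Means algorithm with k = 5 clusters. Calling set.seed(123) fixes the random choice of initial centers, so the results are reproducible. After clustering, we add the cluster labels to the dataset and use ggplot2 to draw a scatter plot of Age vs Annual Income, colored by cluster. Note that the clustering used all three features, so clusters that are separated mainly by Spending Score may appear to overlap in this two-dimensional view.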
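The choice of k = 5 is not justified by the code above. One common heuristic for picking k is the elbow method: run K-Means for a range of k and plot the total within-cluster sum of squares. The sketch below illustrates it on synthetic data (`toy` is a made-up stand-in for data_scaled; the range 1 to 8 and nstart = 25 are assumptions, not settings from this analysis).

```r
set.seed(123)

# Synthetic stand-in for data_scaled: three well-separated 2-D blobs
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
# nstart = 25 reruns K-Means from 25 random starts and keeps the best fit
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Look for the "elbow" where adding clusters stops reducing WSS much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot (synthetic data)")
```

On the mall data, replacing `toy` with data_scaled would give the corresponding curve; the elbow is a visual judgment rather than a strict rule.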

Hierarchical Clustering

Next, we perform hierarchical clustering and visualize the dendrogram to decide on the optimal number of clusters.

# Calculating distance matrix
dist_matrix <- dist(data_scaled)

# Performing hierarchical clustering
hclust_result <- hclust(dist_matrix, method = "ward.D2")

# Visualizing the dendrogram
plot(hclust_result, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "")

Explanation: In this section, we use hierarchical clustering. First, we calculate the distance matrix using dist(), which by default computes the Euclidean distance between each pair of data points. We then apply the hclust() function with the "ward.D2" method, which merges clusters so as to keep the increase in within-cluster variance small, and visualize the result as a dendrogram. The dendrogram helps determine the number of clusters: cutting the tree at a given height yields one cluster per branch crossed by the cut.
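The dendrogram alone does not assign customers to clusters. A minimal sketch of the cutting step, using cutree() from base R on synthetic data, is shown below (`toy`, `hc`, and k = 2 are illustrative assumptions; for the mall data one would pass hclust_result and whatever k the dendrogram suggests).

```r
set.seed(123)

# Synthetic stand-in for data_scaled: two well-separated 2-D blobs of 20 points
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2)
)

# Same pipeline as above: Euclidean distances, then Ward linkage
hc <- hclust(dist(toy), method = "ward.D2")

# Cut the tree into k clusters and get one label per observation
labels_k <- cutree(hc, k = 2)
print(table(labels_k))

# Draw the dendrogram and outline the k clusters on it
plot(hc, main = "Dendrogram (synthetic data)", xlab = "", sub = "")
rect.hclust(hc, k = 2, border = "red")
```

cutree() also accepts a height via its h argument, which corresponds directly to drawing a horizontal cut on the dendrogram.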

DBSCAN Clustering

Finally, we apply DBSCAN, which is a density-based clustering algorithm, and visualize the results.

# Loading the dbscan package (install once with install.packages("dbscan"))
library(dbscan)

## 
## Attaching package: 'dbscan'

## The following object is masked from 'package:stats':
## 
##     as.dendrogram

# Apply DBSCAN with epsilon = 0.5 and minPts = 5
dbscan_result <- dbscan(data_scaled, eps = 0.5, minPts = 5)

# Add DBSCAN labels to the dataset
data$dbscan_cluster <- dbscan_result$cluster

# Visualizing the DBSCAN clustering results
ggplot(data, aes(x = Age, y = AnnualIncome, color = as.factor(dbscan_cluster))) +
  geom_point() +
  labs(title = "DBSCAN Clustering", color = "Cluster") +
  theme_minimal()

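Explanation: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies clusters based on the density of data points. We apply DBSCAN with parameters eps = 0.5 (the maximum distance between two points for them to be considered neighbors) and minPts = 5 (the minimum number of points, counting the point itself, that must lie within eps for a point to count as a core point of a cluster). Unlike K-Means, DBSCAN does not need the number of clusters in advance, and points that belong to no dense region are labeled as noise, which the dbscan package reports as cluster 0. After running DBSCAN, we add the cluster labels to the dataset and visualize the results using ggplot2.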
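The value eps = 0.5 is taken as given above. A common way to choose it is to look at each point's distance to its k-th nearest neighbor, with k = minPts - 1, and pick eps near the bend ("knee") of the sorted curve. The sketch below computes those distances in base R on synthetic data (`toy`, `min_pts`, and `knn_dist` are illustrative names); the dbscan package also offers kNNdistplot() for the same purpose.

```r
set.seed(123)

# Synthetic stand-in for data_scaled: two 2-D blobs of 50 points each
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2)
)

min_pts <- 5
k <- min_pts - 1  # neighbors other than the point itself

# Full pairwise Euclidean distance matrix (fine for small datasets)
d <- as.matrix(dist(toy))

# For each point, sort its distances; position 1 is the point itself (0),
# so position k + 1 is the distance to the k-th nearest neighbor
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted k-NN distances: a sharp bend suggests a candidate value for eps
plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"),
     main = "k-NN distance plot (synthetic data)")
```

Points to the right of the bend have unusually distant neighbors and would tend to be labeled as noise at that eps.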

Conclusion

In this analysis, we applied three clustering algorithms to segment mall customers based on their age, annual income, and spending score. Each algorithm identified different groups of customers, which can be useful for targeted marketing and personalized services.

K-Means clustering grouped customers into 5 clusters.
Hierarchical clustering produced a dendrogram that can be used to choose the number of clusters.
DBSCAN identified clusters based on customer density.

By using these clustering techniques, businesses can gain valuable insights into customer behavior and tailor their marketing strategies accordingly.

Mall Customer Segmentation Analysis

Nyasha Nyarirangwe

2025-01-30