In this analysis, we will perform customer segmentation using K-Means clustering, hierarchical clustering, and DBSCAN on a dataset of mall customers. The dataset includes information about customers’ age, annual income, and spending score. We will explore how clustering algorithms group similar customers together based on these features.
First, we load the data and check for any missing values. The
CustomerID column, which is not relevant for clustering, is removed
in the next step.
# Loading required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Setting working directory
setwd("/Users/zayne/Desktop")
# Loading the dataset
data <- read.csv("Mall_Customers.csv")
# Checking for missing values
sum(is.na(data))
## [1] 0
# Displaying the first few rows of the data
head(data)
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
Explanation: In this section, we load the dataset
using read.csv() and count missing values with
sum(is.na(data)), which returns 0, so no imputation is
needed. We also display the first few rows of the data to ensure
everything is loaded correctly. The CustomerID column is still present
at this point; it is dropped in the next step.
Since the features are measured in different units (Age, Income, and Spending Score), we scale the numerical features to have a mean of 0 and a standard deviation of 1. All three clustering algorithms used here rely on distances between points, so scaling keeps features with larger ranges from dominating the result.
# Loading the dataset
data <- read.csv("/Users/zayne/Desktop/Mall_Customers.csv")
# Renaming columns for clarity and consistency
colnames(data) <- c("CustomerID", "Gender", "Age", "AnnualIncome", "SpendingScore")
# Keeping only the numeric columns; this drops CustomerID and Gender
data <- data[, c("Age", "AnnualIncome", "SpendingScore")]
# Scaling only the numerical columns (Age, AnnualIncome, SpendingScore)
data_scaled <- scale(data)
# Checking the scaled data
head(data_scaled)
## Age AnnualIncome SpendingScore
## [1,] -1.4210029 -1.734646 -0.4337131
## [2,] -1.2778288 -1.734646 1.1927111
## [3,] -1.3494159 -1.696572 -1.7116178
## [4,] -1.1346547 -1.696572 1.0378135
## [5,] -0.5619583 -1.658498 -0.3949887
## [6,] -1.2062418 -1.658498 0.9990891
Explanation: Here, we scale the numerical features
(Age, Annual Income, and Spending Score) to ensure each feature has the
same importance during clustering. The scale() function
standardizes each feature, transforming it to have a mean of 0 and a
standard deviation of 1.
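To make the standardization concrete, the following minimal sketch reproduces what scale() does on a single column. It uses only the six ages shown in the head(data) output above, so the numbers will differ from the scaled values printed earlier, which are based on the full dataset. The variable names age, age_manual, and age_scaled are introduced here for illustration only.

```r
# Ages of the first six customers, copied from the head(data) output above
age <- c(19, 21, 20, 23, 31, 22)

# Manual standardization: subtract the mean, divide by the sample standard deviation
age_manual <- (age - mean(age)) / sd(age)

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes
age_scaled <- as.numeric(scale(age))

# Compare the two results (all.equal tolerates floating-point rounding)
all.equal(age_manual, age_scaled)
```

The same computation is applied column by column when scale() is called on the whole data frame.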
Before applying the clustering algorithms, let’s visualize the distributions of the numerical features (Age, Annual Income, and Spending Score) to better understand the data.
# Visualizing the distribution of numerical features before scaling
par(mfrow = c(1, 3))
hist(data$Age, main = "Age Distribution", xlab = "Age", col = "lightblue")
hist(data$AnnualIncome, main = "Annual Income Distribution", xlab = "Income", col = "lightgreen")
hist(data$SpendingScore, main = "Spending Score Distribution", xlab = "Spending Score", col = "lightcoral")
Explanation: In this part, we use histograms to visualize the distributions of the numerical features before scaling. This helps us understand the spread and central tendencies of Age, Annual Income, and Spending Score.
Now, we apply K-Means clustering to group customers into clusters. We use k = 5 here; the value can be adjusted and the clustering rerun to compare results.
# Applying K-Means clustering with 5 clusters (adjust k if necessary)
set.seed(123) # Set a seed for reproducibility
kmeans_result <- kmeans(data_scaled, centers = 5)
# Adding cluster labels to the original dataset
data$cluster <- kmeans_result$cluster
# Visualizing the K-Means clustering results
ggplot(data, aes(x = Age, y = AnnualIncome, color = as.factor(cluster))) +
  geom_point() +
  labs(title = "K-Means Clustering (5 clusters)", color = "Cluster") +
  theme_minimal()
Explanation: We apply the K-Means algorithm with
k = 5 clusters. The set.seed(123) ensures that
the results are reproducible. After clustering, we add the cluster
labels to the dataset and use ggplot2 to visualize the
clusters in a scatter plot of Age vs Annual Income.
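One common way to compare different values of k is an elbow plot of the total within-cluster sum of squares. The sketch below is an illustration, not part of the original analysis: the range 1 to 10 and nstart = 25 are assumptions, and the stand-in data at the top exists only so the snippet runs on its own.

```r
# Hypothetical stand-in so the sketch runs on its own; in the analysis,
# data_scaled is the standardized matrix created earlier.
if (!exists("data_scaled")) {
  set.seed(1)
  data_scaled <- scale(matrix(rnorm(600), ncol = 3))
}

set.seed(123)     # reproducible random starts
k_values <- 1:10  # candidate numbers of clusters (an assumed range)

# Total within-cluster sum of squares for each k;
# nstart = 25 reruns K-Means from 25 random starts and keeps the best fit
wss <- sapply(k_values, function(k) {
  kmeans(data_scaled, centers = k, nstart = 25)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow Plot for K-Means")
```

The bend in the curve is a heuristic rather than a strict rule, so it is worth checking whether the resulting clusters are also interpretable.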
Next, we perform hierarchical clustering and visualize the dendrogram to decide on the optimal number of clusters.
# Calculating distance matrix
dist_matrix <- dist(data_scaled)
# Performing hierarchical clustering
hclust_result <- hclust(dist_matrix, method = "ward.D2")
# Visualizing the dendrogram
plot(hclust_result, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "")
Explanation: In this section, we use hierarchical
clustering. First, we calculate the distance matrix using
dist(), which measures the dissimilarity between each pair
of data points. We then apply the hclust() function with
the “ward.D2” method and visualize the result as a dendrogram. The
dendrogram helps determine the number of clusters by cutting the tree at
a specific level.
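Cutting the tree can be sketched with cutree() from base R. The choice of 5 groups below is an assumption made to match the K-Means run, not a value read off the dendrogram, and hc_clusters is a name introduced here for illustration. The stand-in data exists only so the snippet runs on its own.

```r
# Hypothetical stand-in so the sketch runs on its own; in the analysis,
# data_scaled is the standardized matrix created earlier.
if (!exists("data_scaled")) {
  set.seed(1)
  data_scaled <- scale(matrix(rnorm(600), ncol = 3))
}

# Same hierarchical clustering as above
hclust_result <- hclust(dist(data_scaled), method = "ward.D2")

# Cut the tree into 5 groups (assumed value, chosen to match the K-Means run)
hc_clusters <- cutree(hclust_result, k = 5)

# Number of customers assigned to each group
table(hc_clusters)

# Redraw the dendrogram without leaf labels and outline the 5 groups
plot(hclust_result, labels = FALSE,
     main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "")
rect.hclust(hclust_result, k = 5, border = "red")
```

cutree() also accepts a height via its h argument, which corresponds to drawing a horizontal line across the dendrogram.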
Finally, we apply DBSCAN, which is a density-based clustering algorithm, and visualize the results.
# Loading the dbscan package (install it first with install.packages("dbscan") if needed)
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
# Apply DBSCAN with epsilon = 0.5 and minPts = 5
dbscan_result <- dbscan(data_scaled, eps = 0.5, minPts = 5)
# Add DBSCAN labels to the dataset
data$dbscan_cluster <- dbscan_result$cluster
# Visualizing the DBSCAN clustering results
ggplot(data, aes(x = Age, y = AnnualIncome, color = as.factor(dbscan_cluster))) +
  geom_point() +
  labs(title = "DBSCAN Clustering", color = "Cluster") +
  theme_minimal()
Explanation: DBSCAN (Density-Based Spatial
Clustering of Applications with Noise) is a clustering algorithm that
identifies clusters based on the density of data points. We apply DBSCAN
with parameters eps = 0.5 (maximum distance between two
points to be considered neighbors) and minPts = 5 (minimum
number of points, including the point itself, that must lie within eps
for a point to count as a core point). Points that belong to no cluster
are labeled 0 and treated as noise. After running DBSCAN, we add the
cluster labels to the dataset and visualize the results using
ggplot2.
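A common heuristic for checking an eps value is to plot each point's distance to its k-th nearest neighbor, sorted in increasing order, and look for a knee in the curve. The sketch below computes this with base R only. Setting k to minPts - 1 follows the convention that minPts counts the point itself; knn_dist is a name introduced here for illustration, and the stand-in data exists only so the snippet runs on its own.

```r
# Hypothetical stand-in so the sketch runs on its own; in the analysis,
# data_scaled is the standardized matrix created earlier.
if (!exists("data_scaled")) {
  set.seed(1)
  data_scaled <- scale(matrix(rnorm(600), ncol = 3))
}

minPts <- 5      # same value as in the DBSCAN call above
k <- minPts - 1  # neighbors other than the point itself

# Full pairwise Euclidean distance matrix
d <- as.matrix(dist(data_scaled))

# In each sorted row, position 1 is the point's zero distance to itself,
# so position k + 1 is the distance to its k-th nearest neighbor
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted k-NN distances; a knee in this curve suggests a candidate eps
plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance",
     ylab = "Distance to k-th nearest neighbor",
     main = "k-NN Distance Plot")
abline(h = 0.5, lty = 2)  # the eps used above, for comparison
```

The dbscan package also provides kNNdistplot() for the same purpose. Running table(dbscan_result$cluster) after the DBSCAN call shows how many points fall into each cluster and how many are labeled 0 (noise).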
In this analysis, we applied three clustering algorithms to segment mall customers based on their age, annual income, and spending score. Each algorithm identified different groups of customers, which can be useful for targeted marketing and personalized services.
By using these clustering techniques, businesses can gain valuable insights into customer behavior and tailor their marketing strategies accordingly.