Customer Segmentation using K-Means Clustering

Introduction

In today’s competitive retail landscape, understanding customer behavior is key to the success of any shopping mall. This project explores the shopping habits of customers of a small shopping mall, based on a dataset of the mall's shopping records. The data covers customers of different genders and age groups, giving a broad view of shopping patterns in the mall.

This project aims to answer the following questions:
- How can customer segmentation be achieved with a machine learning algorithm (K-Means clustering)?
- Who are the target customers with whom a marketing strategy can start (easy to converse with)?
- How does the marketing strategy work in the real world?

K-Means clustering is one of the simplest and most commonly used clustering algorithms. It tries to find cluster centers that are representative of certain regions of the data. The algorithm alternates between two steps: assigning each data point to the closest cluster center, and then setting each cluster center to the mean of the data points assigned to it. The algorithm finishes when the assignment of points to clusters no longer changes.
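The two alternating steps can be sketched in a few lines of base R. This is a minimal illustration only, not the implementation behind R's built-in `kmeans()`; the helper name `simple_kmeans` and the toy data are assumptions made for this example.

```r
# Minimal K-means sketch (hypothetical helper, for illustration only)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Use k distinct rows chosen at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 1: squared Euclidean distance from every point to every center,
    # then assign each point to its closest center
    dists <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    new_assignment <- max.col(-dists, ties.method = "first")

    # Stop when no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Step 2: move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers, iterations = iter)
}

# Toy data: two well-separated groups of 20 points each (synthetic)
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

In practice the built-in `kmeans()` is preferable, since it offers multiple random starts (`nstart`) and a more refined default algorithm.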

Data Analysis

#>    CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
#> 1           1   Male  19                 15                     39
#> 2           2   Male  21                 15                     81
#> 3           3 Female  20                 16                      6
#> 4           4 Female  23                 16                     77
#> 5           5 Female  31                 17                     40
#> 6           6 Female  22                 17                     76
#> 7           7 Female  35                 18                      6
#> 8           8 Female  23                 18                     94
#> 9           9   Male  64                 19                      3
#> 10         10 Female  30                 19                     72

Information about columns in dataset:

CustomerID: unique ID assigned to the customer
Gender: Gender of the customer
Age: Age of the customer
Annual income: Annual income of the customer, in thousands (column Annual.Income..k..)
Spending score: Score from 1 to 100 assigned by the mall based on customer behavior and spending nature (column Spending.Score..1.100.)

Select relevant columns for clustering:

kmeans_data <- data[, c("Annual.Income..k..", "Spending.Score..1.100.")]

Histogram of Age

Scatter plot of Annual Income vs Spending Score:
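The plotting code is not shown in the report, so the following is only a sketch of how the two plots could be produced with base R graphics. It uses just the ten rows from the data preview above, stored in a hypothetical `sample_rows` data frame; the author may well have used the full dataset and a different plotting package.

```r
# First ten rows of the dataset, copied from the data preview
sample_rows <- data.frame(
  Age = c(19, 21, 20, 23, 31, 22, 35, 23, 64, 30),
  Annual.Income..k.. = c(15, 15, 16, 16, 17, 17, 18, 18, 19, 19),
  Spending.Score..1.100. = c(39, 81, 6, 77, 40, 76, 6, 94, 3, 72)
)

# Distribution of customer ages
hist(sample_rows$Age,
     xlab = "Age", main = "Histogram of Age", col = "lightblue")

# Income against spending score, the two features used for clustering
plot(sample_rows$Annual.Income..k.., sample_rows$Spending.Score..1.100.,
     pch = 19, xlab = "Annual Income (k)", ylab = "Spending Score (1-100)",
     main = "Annual Income vs Spending Score")
```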

Applying K-means

The optimal number of clusters

The next step is to find the appropriate number of clusters. Since the dataset is rather small and straightforward, the optimal number of clusters is chosen using the Elbow method and the Silhouette statistic.
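The plotting calls in this section rely on two vectors, `wss` and `sil_width`, whose computation is not shown. A sketch of one common way to obtain them follows. It assumes the `cluster` package for `silhouette()` and runs on a synthetic stand-in (`demo_data`, five artificial blobs) rather than the mall data, so the numbers it produces say nothing about the real customers.

```r
library(cluster)  # provides silhouette()

# Synthetic stand-in for kmeans_data: five 2-D blobs (illustrative only)
set.seed(123)
blob_centers <- rbind(c(20, 20), c(20, 80), c(55, 50), c(90, 20), c(90, 80))
demo_data <- do.call(rbind, lapply(seq_len(nrow(blob_centers)), function(i) {
  cbind(rnorm(30, blob_centers[i, 1], 5), rnorm(30, blob_centers[i, 2], 5))
}))

# Elbow method: total within-cluster sum of squares for K = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(demo_data, centers = k, nstart = 25)$tot.withinss
})

# Silhouette method: average silhouette width for K = 2..10
# (the silhouette is not defined for a single cluster)
d <- dist(demo_data)
sil_width <- sapply(2:10, function(k) {
  km <- kmeans(demo_data, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
```

With the real data, `demo_data` would be replaced by `kmeans_data`. The elbow is read off as the K after which `wss` stops dropping sharply, and the silhouette choice is the K with the largest average width, that is `(2:10)[which.max(sil_width)]`.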

plot(1:10, wss, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of clusters K", ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Method for Optimal K")

plot(2:10, sil_width, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of clusters K", ylab = "Average Silhouette Width",
     main = "Silhouette Method for Optimal K")

The two methods agree: the average silhouette width is highest at K = 5, and the Elbow method also points to K = 5. Five clusters therefore give a well-structured and meaningful customer segmentation, which is crucial for marketing strategy development and customer targeting.

Applying K-means clustering with the optimal K = 5 based on elbow and silhouette methods

set.seed(123)
kmeans_model <- kmeans(kmeans_data, centers = 5, nstart = 25)

Visualize the clusters:

fviz_cluster(kmeans_model, data = kmeans_data, geom = "point", ellipse.type = "norm") +
  labs(title = "Customer Segmentation using K-Means Clustering")

Red cluster: Customers with average income and low to moderate spending scores
Yellow cluster: Customers who are possibly budget-conscious or infrequent shoppers
Green cluster: Likely impulse buyers or younger customers
Blue cluster: Potential customers for luxury items who make selective purchases
Pink cluster: Ideal high-value customers, good for premium marketing campaigns
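Cluster descriptions like these are easier to check against numbers than against colors in a plot. As a sketch, the mean income, mean spending score, and size of each cluster can be tabulated with `aggregate()`. The example below runs on a synthetic stand-in (`demo_data` and `demo_model` are hypothetical names, and the values are artificial), so it shows the technique only, not the actual cluster profiles.

```r
# Synthetic stand-in with the same column names as kmeans_data (illustrative only)
set.seed(123)
demo_data <- data.frame(
  Annual.Income..k.. = c(rnorm(30, 25, 5), rnorm(30, 25, 5), rnorm(30, 55, 5),
                         rnorm(30, 88, 5), rnorm(30, 88, 5)),
  Spending.Score..1.100. = c(rnorm(30, 20, 5), rnorm(30, 80, 5), rnorm(30, 50, 5),
                             rnorm(30, 17, 5), rnorm(30, 82, 5))
)
demo_model <- kmeans(demo_data, centers = 5, nstart = 25)

# Mean income and spending score per cluster, plus the number of customers in each
profile <- aggregate(demo_data, by = list(cluster = demo_model$cluster), FUN = mean)
profile$size <- as.vector(table(demo_model$cluster))
print(profile)
```

Applied to the real data, `demo_data` and `demo_model` would be replaced by `kmeans_data` and `kmeans_model`. The per-cluster means match `kmeans_model$centers`, so the table mainly adds the cluster sizes and a readable layout.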

Thi Ngoc Dieu Doan
