E-commerce platforms strive to provide personalized experiences to their customers to enhance satisfaction and drive sales. Identifying distinct customer segments based on purchasing behavior, demographics, and website activity is critical for achieving this goal. Clustering, an unsupervised machine learning technique, groups customers with similar characteristics and behaviors. This project uses clustering to segment customers based on their behavior. We use K-Means as the main method, determine the optimal number of clusters using the Silhouette Method, and also run hierarchical clustering and DBSCAN.
Dataset: E-commerce Customer Behavior. This dataset offers an extensive view of consumer activity on an e-commerce platform. I chose it as practice toward my goal of becoming a marketing analyst and strategist. It provides a thorough breakdown of each customer’s interactions and transactions, with each entry corresponding to a distinct customer. The data is designed to help organizations make data-driven decisions that improve the customer experience by enabling a detailed study of consumer preferences, engagement trends, and satisfaction levels.
library(tidyverse) # Data Manipulation & Visualization
library(cluster) # Clustering Algorithms
library(factoextra) # Clustering Visualization
library(NbClust) # Determine Optimal Clusters
library(dbscan) # Density-Based Clustering (DBSCAN)
# Load dataset (replace with actual path if downloaded locally)
data <- read.csv("E-commerce Customer Behavior.csv")
# View structure and first few rows
str(data)
## 'data.frame': 350 obs. of 11 variables:
## $ Customer.ID : int 101 102 103 104 105 106 107 108 109 110 ...
## $ Gender : chr "Female" "Male" "Female" "Male" ...
## $ Age : int 29 34 43 30 27 37 31 35 41 28 ...
## $ City : chr "New York" "Los Angeles" "Chicago" "San Francisco" ...
## $ Membership.Type : chr "Gold" "Silver" "Bronze" "Gold" ...
## $ Total.Spend : num 1120 780 511 1480 720 ...
## $ Items.Purchased : int 14 11 9 19 13 8 15 12 10 21 ...
## $ Average.Rating : num 4.6 4.1 3.4 4.7 4 3.1 4.5 4.2 3.6 4.8 ...
## $ Discount.Applied : logi TRUE FALSE TRUE FALSE TRUE FALSE ...
## $ Days.Since.Last.Purchase: int 25 18 42 12 55 22 28 14 40 9 ...
## $ Satisfaction.Level : chr "Satisfied" "Neutral" "Unsatisfied" "Satisfied" ...
head(data)
## Customer.ID Gender Age City Membership.Type Total.Spend
## 1 101 Female 29 New York Gold 1120.20
## 2 102 Male 34 Los Angeles Silver 780.50
## 3 103 Female 43 Chicago Bronze 510.75
## 4 104 Male 30 San Francisco Gold 1480.30
## 5 105 Male 27 Miami Silver 720.40
## 6 106 Female 37 Houston Bronze 440.80
## Items.Purchased Average.Rating Discount.Applied Days.Since.Last.Purchase
## 1 14 4.6 TRUE 25
## 2 11 4.1 FALSE 18
## 3 9 3.4 TRUE 42
## 4 19 4.7 FALSE 12
## 5 13 4.0 TRUE 55
## 6 8 3.1 FALSE 22
## Satisfaction.Level
## 1 Satisfied
## 2 Neutral
## 3 Unsatisfied
## 4 Satisfied
## 5 Unsatisfied
## 6 Neutral
Since the dataset contains both numeric and categorical variables, it needs preprocessing first (e.g., converting categorical columns to factors and scaling numeric features). Scaling puts all numeric features on a comparable scale so that every feature contributes equally to the distance calculation. For example, after standardizing with scale(), Age, Total.Spend, and Items.Purchased each have mean 0 and standard deviation 1, so most values fall roughly between -3 and 3.
Why is it important to scale? If you don’t scale the data:
- Features with large values (like Total.Spend, which runs from the hundreds into the thousands here) will dominate the clustering process.
- Features with small values (like Average.Rating, which is in the single digits) will have little to no impact.
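The effect of scaling can be checked on a small example. The sketch below copies three columns of the first six customers from the head(data) output above into a hypothetical data frame called toy, standardizes it with scale(), and compares how much of the squared distance between customers 1 and 3 comes from Total.Spend before and after scaling. The helper spend_share() is an illustrative name, not part of any package.

```r
# First six customers, copied from the head(data) output above
toy <- data.frame(
  Age = c(29, 34, 43, 30, 27, 37),
  Total.Spend = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80),
  Average.Rating = c(4.6, 4.1, 3.4, 4.7, 4.0, 3.1)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
scaled_toy <- scale(toy)

round(colMeans(scaled_toy), 10)  # every column mean is 0
apply(scaled_toy, 2, sd)         # every column sd is 1

# Hypothetical helper: share of the squared Euclidean distance between
# customers 1 and 3 that is contributed by Total.Spend
spend_share <- function(m) {
  d2 <- (m[1, ] - m[3, ])^2
  unname(d2["Total.Spend"] / sum(d2))
}

spend_share(as.matrix(toy))  # above 0.99: spend dominates the raw distance
spend_share(scaled_toy)      # roughly 0.2: the three features now share the distance
```

On the raw values, almost the entire distance between the two customers is explained by Total.Spend. After standardizing, Age and Average.Rating carry comparable weight.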
# Convert categorical columns to factors
data <- data %>% mutate(across(where(is.character), as.factor))
# Convert boolean columns to factors
data <- data %>% mutate(across(where(is.logical), as.factor))
# Handle missing values
data <- na.omit(data)
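# Select the numeric features used for clustering (by name rather than position) and standardize them
clustering_data <- data[, c("Age", "Total.Spend", "Items.Purchased",
                            "Average.Rating", "Days.Since.Last.Purchase")]
scaled_data <- scale(clustering_data)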
Age: Understanding age groups allows for targeted marketing strategies (e.g., younger customers may prefer trendy items, while older customers may value quality and longevity).
Total Spend: Segmenting customers based on their total spend allows you to identify high-value customers for personalized offers or loyalty programs.
Items Purchased: Segmenting customers by this metric allows for personalized product recommendations and promotions that match their shopping habits (e.g., frequent buyers could receive bulk discounts).
Average Rating: A higher average rating suggests positive customer experiences, and a lower one suggests dissatisfaction. This can guide targeted retention strategies for satisfied customers and corrective measures for unsatisfied customers.
Days Since Last Purchase: Customers who have not purchased for a long time are candidates for targeted discounts or reminders, while more recent buyers might be prime candidates for loyalty rewards.
Clustering methods are used in machine learning and data mining to group similar data points together into clusters. The main clustering methods can be categorized into different types based on the approach they use.
K-Means: Divides data into a pre-defined number of clusters (K) by iteratively assigning points to the nearest centroid and updating the centroids until convergence.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise is a density-based algorithm that groups together points that are close to each other based on distance and density criteria. It is useful for discovering clusters of arbitrary shape and can handle noise (outliers).
Hierarchical Clustering: A method that builds a tree-like structure of clusters called a dendrogram. It can be agglomerative (starting with single points and merging them) or divisive (starting with one cluster and splitting it), and it does not require the number of clusters to be specified in advance. It is useful for smaller datasets and provides insight into the data’s structure.
Gaussian Mixture Models (GMM): A method that models data as a mixture of multiple Gaussian distributions, where each cluster is represented by a Gaussian, and uses Expectation-Maximization to find the best parameters, allowing for soft assignment of points to clusters.
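Of the methods listed above, K-Means is the simplest to write out by hand. The following is a minimal sketch of its assign-and-update loop on hypothetical toy data (two well-separated groups of 2-D points), using only base R. The starting centroids are picked by hand for a stable demonstration; this is not how kmeans() initializes, and it is not meant as a replacement for it.

```r
set.seed(1)
# Hypothetical toy data: two well-separated groups of 20 points each
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

k <- 2
# Assumption for the demo: start with one point from each half of the data
centroids <- pts[c(1, 21), , drop = FALSE]
assignment <- rep(0L, nrow(pts))

repeat {
  # Assignment step: squared distance from every point to every centroid
  d2 <- sapply(seq_len(k), function(j) {
    rowSums(sweep(pts, 2, centroids[j, ])^2)
  })
  new_assignment <- max.col(-d2, ties.method = "first")  # nearest centroid

  # Stop when no point changes cluster
  if (all(new_assignment == assignment)) break
  assignment <- new_assignment

  # Update step: move each centroid to the mean of its assigned points
  for (j in seq_len(k)) {
    centroids[j, ] <- colMeans(pts[assignment == j, , drop = FALSE])
  }
}

table(assignment)
```

In practice kmeans() with nstart greater than 1 is preferable, because it repeats this loop from several random starts and keeps the best solution.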
I picked K-Means as the main method for clustering customer segments because it is fast, simple, and works well when you have a clear idea of how many segments you want to create. Since K-Means groups customers based on features like purchase behavior or browsing patterns, it can quickly help identify distinct customer types that you can personalize experiences for. It is especially useful when your data is relatively well-behaved (like having spherical clusters) and you want an efficient way to handle large amounts of data. Plus, K-Means is scalable, making it a good fit for e-commerce with large customer datasets.
Why silhouette score instead of elbow point to determine the optimal number of clusters? The silhouette score is often preferred over the elbow method because it directly measures how cohesive and well-separated the clusters are. The elbow method looks at how the within-cluster variance falls as K grows, and reading it can be subjective, especially when the “elbow” is not distinct. The silhouette score, on the other hand, evaluates both the closeness of points within a cluster and their separation from other clusters, and the K with the highest average score can be read off directly. This reduces the risk of choosing too many clusters and favors a number of clusters that represents meaningful groupings.
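To make the definition concrete, the sketch below computes the silhouette width by hand on hypothetical 1-D toy data and compares it with silhouette() from the cluster package. For each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points of the nearest other cluster, and the width is (b - a) / max(a, b). The data and labels are made up for illustration.

```r
library(cluster)  # silhouette()

# Hypothetical toy data with an obvious two-group structure
x <- c(1, 2, 3, 10, 11, 12)
labels <- rep(1:2, each = 3)
d <- as.matrix(dist(x))

# Silhouette width of each point: s = (b - a) / max(a, b)
sil_by_hand <- sapply(seq_along(x), function(i) {
  own <- setdiff(which(labels == labels[i]), i)
  other <- which(labels != labels[i])
  a <- mean(d[i, own])    # cohesion: mean distance within its own cluster
  b <- mean(d[i, other])  # separation: only one other cluster exists here
  (b - a) / max(a, b)
})

sil_pkg <- silhouette(labels, dist(x))[, "sil_width"]

all.equal(unname(sil_by_hand), unname(sil_pkg))  # the two computations agree
sil_by_hand[1]                                   # 0.85 for the first point
mean(sil_by_hand)                                # average silhouette width
```

Values close to 1 mean a point sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong cluster.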
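# Compute the average silhouette width for k = 2 to 6
set.seed(473650) # kmeans() uses random starts, so fix the seed for reproducibility
dist_scaled <- dist(scaled_data) # compute the distance matrix once, not once per k
sil_widths <- map_dbl(2:6, function(k) {
  model <- kmeans(scaled_data, centers = k, nstart = 25)
  mean(silhouette(model$cluster, dist_scaled)[, 3])
})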
# Plot Silhouette scores
plot(2:6, sil_widths, type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters", ylab = "Silhouette Score",
main = "Silhouette Method for Optimal Clusters")
# Determine optimal K (highest silhouette score)
optimal_k <- which.max(sil_widths) + 1
cat("Optimal number of clusters:", optimal_k)
## Optimal number of clusters: 6
# Apply K-Means with optimal clusters
set.seed(473650)
kmeans_result <- kmeans(scaled_data, centers = optimal_k, nstart = 25)
# Add cluster labels to data
data$Cluster <- as.factor(kmeans_result$cluster)
library(factoextra)
fviz_cluster(kmeans_result,
data = scaled_data,
geom = "point",
ellipse.type = "convex",
repel = TRUE)
# Compute dissimilarity matrix
dist_matrix <- dist(scaled_data)
# Apply hierarchical clustering
hclust_result <- hclust(dist_matrix, method = "ward.D2")
# Plot the hierarchical clustering
plot(hclust_result, labels = FALSE, main = "Hierarchical Clustering", hang = -1)
clusters <- cutree(hclust_result, k = 6)
# View the cluster assignments
print(clusters)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 2 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2 3 4
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2 3 4 2 6
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 2 6 1 2
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2 3 4
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2 3 4 2 6
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 2 6 1 2
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2 3 4
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 2 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 2 6
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 1 2 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 2
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 6 1 2 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 2 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2 3
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 4 2 6 1 2 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2 3 4 2 6 1
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 2 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5 6 1 2 3
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 4 2 6 1 2 3 4 5 6 1 2 3 4 2 6 1 2 3 4 5
## 341 342 343 344 345 346 347 348 349 350
## 6 1 2 3 4 2 6 1 2 3
# Redraw the dendrogram and outline the six clusters
plot(hclust_result, labels = FALSE, main = "Hierarchical Clustering", hang = -1)
rect.hclust(hclust_result, k = 6, border = 2:7) # one border color per cluster
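One way to check whether the hierarchical solution and the K-Means solution describe the same groups is a cross-tabulation of the two label vectors. The sketch below shows the idea on hypothetical toy data with three well-separated groups; the object names toy, hc_labels, km_labels, and agreement are illustrative only.

```r
set.seed(42)
# Hypothetical toy data: three well-separated groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.2), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.2), rnorm(30, mean = 4, sd = 0.2))
)

# Same two methods as in the report, applied to the toy data
hc_labels <- cutree(hclust(dist(toy), method = "ward.D2"), k = 3)
km_labels <- kmeans(toy, centers = 3, nstart = 25)$cluster

# Rows are hierarchical clusters, columns are K-Means clusters
agreement <- table(Hierarchical = hc_labels, KMeans = km_labels)
agreement
```

When the two methods agree, each row has a single non-zero cell, even though the cluster numbers themselves may differ between methods. On the customer data, the equivalent call would be table(clusters, kmeans_result$cluster).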
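# Estimate epsilon from the k-nearest-neighbor distance plot (look for the knee);
# the usual choice is k = minPts - 1
kNNdistplot(scaled_data, k = 5)
# Apply DBSCAN (points labeled 0 in the result are noise)
dbscan_result <- dbscan(scaled_data, eps = 0.5, minPts = 6)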
# Add cluster labels
data$DBSCAN_Cluster <- as.factor(dbscan_result$cluster)
# Visualize DBSCAN results
fviz_cluster(dbscan_result, data = scaled_data, geom = "point")
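The density criterion behind the dbscan() call above can be checked by hand. In the dbscan package, a point is a core point when at least minPts points, counting the point itself, lie within distance eps of it. The sketch below applies that rule to hypothetical toy data in base R; the values of eps and min_pts are chosen for the illustration and are not the ones used in the report.

```r
# Hypothetical toy data: a tight group of five points plus one isolated point
pts <- rbind(
  c(0.00, 0.00), c(0.10, 0.00), c(0.00, 0.10), c(0.10, 0.10), c(0.05, 0.05),
  c(3.00, 3.00)
)
eps <- 0.5
min_pts <- 4

# Number of points within eps of each point (the point itself is included)
neighbor_count <- rowSums(as.matrix(dist(pts)) <= eps)

# Core points have at least min_pts neighbors within eps
is_core <- neighbor_count >= min_pts

data.frame(neighbor_count, is_core)
```

The five grouped points each have five neighbors and are core points. The isolated point has only itself within eps, is not reachable from any core point, and would therefore be labeled as noise (cluster 0) by dbscan().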
library(dplyr)
# Function to calculate the mode for categorical columns
get_mode <- function(x) {
uniq_x <- unique(na.omit(x)) # Remove NA values
uniq_x[which.max(tabulate(match(x, uniq_x)))]
}
# Add the K-Means cluster labels to the dataset, kept as a factor
data$Cluster <- as.factor(kmeans_result$cluster)
# Summary statistics by cluster
cluster_summary <- data %>%
  select(-Customer.ID) %>% # an ID is not a feature, so its mean is meaningless
  group_by(Cluster) %>%
  summarise(
    # Mean of each numeric column
    across(where(is.numeric), ~ round(mean(.x, na.rm = TRUE), 2)),
    # Most frequent level of each categorical column
    across(where(is.factor), get_mode)
  )
library(DT)
# View the summarized cluster characteristics
datatable(cluster_summary)
Cluster 1: Gold members who spend a good amount of money and are satisfied. They purchase infrequently, which suggests that their spending is influenced by the discounts applied.
Cluster 2: A distinct segment of Silver members characterized by high spending but fewer items purchased. They are likely value-seekers who prioritize quality over quantity. Their satisfaction level is neutral, and they could be converted into high-value customers with appropriate strategies.
Cluster 3: Customers at risk of churn who purchased and were not satisfied. They have Silver membership and show irregular spending patterns, often linked to sales discounts.
Cluster 4: High-value customers with above-average spending and frequent purchases. They are highly engaged, loyal, and satisfied, and they hold a Gold membership.
Cluster 5: Low-engagement customers with minimal spending and infrequent purchases. This group may include new customers. They have Bronze membership and a Neutral satisfaction level.
Cluster 6: Bronze members who purchased products and were not satisfied. They have reduced their purchasing frequency.
Churn Customers: Implement targeted campaigns to re-engage customers and address potential dissatisfaction. Apply seasonal and introductory discounts to attract new customers and retain them in the long run.
VIP: Provide premium services.
- Offer personalized promotions to encourage higher spending and purchase frequency.
- Offer exclusive discounts to maintain satisfaction.
- Offer loyalty rewards to retain these customers.
Loyal: Provide loyalty rewards and encourage upselling.
- Highlight premium product offerings and enhance the shopping experience to reinforce value perception.
- Run advertisements that show how to use the products and demonstrate their reliability.
- Apply seasonal discounts to attract more purchases.
Occasional shoppers: Enhance services.
- Offer valuable gifts with purchases.
- Address potential dissatisfaction.
- Send personalized offers during specific times (e.g., holidays, sales) to encourage repeat purchases.
In this project, we segmented e-commerce customers using K-Means, hierarchical clustering, and DBSCAN. We determined the optimal number of clusters for K-Means using the Silhouette Method and visualized the clusters. This clustering can help e-commerce businesses tailor their marketing strategies to different customer segments.