This paper explores the application of unsupervised learning methods to analyze customer behavior in Nigeria’s automobile retail sector. By employing clustering and dimension reduction techniques, I aim to uncover hidden patterns and groupings within the data, enabling a deeper understanding of customer preferences and behaviors. This analysis provides actionable insights for automotive retailers to enhance customer segmentation strategies and improve service offerings.
Keywords: Unsupervised Learning, Customer Behavior, Automobile Retail, Nigeria, Clustering, Dimension Reduction
The Nigerian automobile retail market has experienced significant growth in recent years, spurred by increasing urbanization and economic development. However, understanding customer behavior in this market remains a challenge due to its diverse and dynamic nature. Traditional customer segmentation methods often fail to capture nuanced patterns inherent in the data.
Unsupervised learning, particularly clustering and dimension reduction techniques, offers a robust alternative for analyzing complex datasets. This paper leverages these techniques to explore customer behavior patterns, providing insights that can drive strategic decision-making for automobile retailers.
library(tidyr)
library(scales)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(cluster)
library(ClustGeo)
library(clusterSim)
## Loading required package: MASS
library(knitr)
data <- read.csv("car_data.csv")
head(data)
## sn car_id description amount_naira
## 1 1 5IQTDBTYmvK1tJwhdvGJfESJ Lexus ES 350 FWD 2013 Red 12937500
## 2 2 zpZUGomoVXuKk9UFa8j8moC9 Land Rover Range Rover 2012 White 6750000
## 3 3 a6ShZXOX4KtY6IBGJIcF3Cxk Toyota Sequoia 2018 Black 50625000
## 4 4 CciPNDN6vhhQQI1FTQHAbfxi Toyota Corolla 2007 Green 3600000
## 5 5 bvwd5LDMx6mIYpVa6Uhi2jqJ Mercedes-Benz M Class 2005 Silver 3262500
## 6 6 rR9jyMmvS5QYArvQplOQRVid Lexus ES 2007 Blue 4837500
## region make model year_of_manufacturing
## 1 Lagos State, Ikeja Lexus ES 2013
## 2 Abuja (FCT), Garki 2 Land Rover Range Rover 2012
## 3 Lagos State, Lekki Toyota Sequoia 2018
## 4 Abuja (FCT), Lugbe District Toyota Corolla 2007
## 5 Lagos State, Isolo Mercedes-Benz M Class 2005
## 6 Abuja (FCT), Gwarinpa Lexus ES 2007
## color condition mileage engine_size selling_condition bought_condition
## 1 Red Foreign Used 272474 3500 Imported Imported
## 2 White Nigerian Used 102281 5000 Registered Registered
## 3 Black Foreign Used 127390 5700 Imported Imported
## 4 Green Nigerian Used 139680 1800 Registered Registered
## 5 Silver Nigerian Used 220615 3500 Registered Imported
## 6 Blue Nigerian Used 347614 3500 Registered Registered
## fuel_type transmission
## 1 Petrol Automatic
## 2 Petrol Automatic
## 3 Petrol Automatic
## 4 Petrol Automatic
## 5 Petrol Automatic
## 6 Petrol Automatic
Clustering Techniques
Clustering algorithms like k-means, hierarchical clustering, and DBSCAN have been widely used to segment customer bases in various industries. These methods group data points into clusters based on similarity metrics, helping businesses identify unique customer segments.
Dimension Reduction
Dimension reduction techniques such as Principal Component Analysis (PCA) are effective for visualizing high-dimensional data. They reduce the complexity of datasets while preserving essential structure, facilitating better interpretation of customer behavior patterns.
Applications in Retail
Studies have demonstrated the effectiveness of these techniques in retail sectors across the globe. However, there is limited research focused specifically on the Nigerian automobile market, highlighting a gap this paper aims to address.
Data Description
The dataset used in this study includes variables such as vehicle specifications, prices, conditions, and regional distributions, which serve as proxies for customer purchasing behavior. An exploration of the dataset structure is crucial to understanding the variables and their relationships.
summary(data)
## sn car_id description amount_naira
## Min. : 1.0 Length:2783 Length:2783 Min. : 661500
## 1st Qu.: 696.5 Class :character Class :character 1st Qu.: 2205000
## Median :1392.0 Mode :character Mode :character Median : 3235050
## Mean :1392.0 Mean : 4946596
## 3rd Qu.:2087.5 3rd Qu.: 5250000
## Max. :2783.0 Max. :98700000
## region make model year_of_manufacturing
## Length:2783 Length:2783 Length:2783 Min. :1988
## Class :character Class :character Class :character 1st Qu.:2005
## Mode :character Mode :character Mode :character Median :2007
## Mean :2008
## 3rd Qu.:2010
## Max. :2022
## color condition mileage engine_size
## Length:2783 Length:2783 Min. : 1 Min. : 25
## Class :character Class :character 1st Qu.: 130726 1st Qu.: 2300
## Mode :character Mode :character Median : 192262 Median : 3000
## Mean : 244833 Mean : 3080
## 3rd Qu.: 266598 3rd Qu.: 3500
## Max. :74026754 Max. :158713
## selling_condition bought_condition fuel_type transmission
## Length:2783 Length:2783 Length:2783 Length:2783
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
colSums(is.na(data))
## sn car_id description
## 0 0 0
## amount_naira region make
## 0 0 0
## model year_of_manufacturing color
## 0 0 0
## condition mileage engine_size
## 0 0 0
## selling_condition bought_condition fuel_type
## 0 0 0
## transmission
## 0
str(data)
## 'data.frame': 2783 obs. of 16 variables:
## $ sn : int 1 2 3 4 5 6 7 8 9 10 ...
## $ car_id : chr "5IQTDBTYmvK1tJwhdvGJfESJ" "zpZUGomoVXuKk9UFa8j8moC9" "a6ShZXOX4KtY6IBGJIcF3Cxk" "CciPNDN6vhhQQI1FTQHAbfxi" ...
## $ description : chr "Lexus ES 350 FWD 2013 Red" "Land Rover Range Rover 2012 White" "Toyota Sequoia 2018 Black" "Toyota Corolla 2007 Green" ...
## $ amount_naira : int 12937500 6750000 50625000 3600000 3262500 4837500 4162500 1721250 4590000 18000000 ...
## $ region : chr "Lagos State, Ikeja" "Abuja (FCT), Garki 2" "Lagos State, Lekki" "Abuja (FCT), Lugbe District" ...
## $ make : chr "Lexus" "Land Rover" "Toyota" "Toyota" ...
## $ model : chr "ES" "Range Rover" "Sequoia" "Corolla" ...
## $ year_of_manufacturing: int 2013 2012 2018 2007 2005 2007 2008 2005 2011 2015 ...
## $ color : chr "Red" "White" "Black" "Green" ...
## $ condition : chr "Foreign Used" "Nigerian Used" "Foreign Used" "Nigerian Used" ...
## $ mileage : int 272474 102281 127390 139680 220615 347614 126841 246930 122734 130078 ...
## $ engine_size : int 3500 5000 5700 1800 3500 3500 3500 3000 3700 3500 ...
## $ selling_condition : chr "Imported" "Registered" "Imported" "Registered" ...
## $ bought_condition : chr "Imported" "Registered" "Imported" "Registered" ...
## $ fuel_type : chr "Petrol" "Petrol" "Petrol" "Petrol" ...
## $ transmission : chr "Automatic" "Automatic" "Automatic" "Automatic" ...
This overview summarizes variable statistics, identifies missing values, and displays the dataset's structure, ensuring data quality before analysis.
The data analyzed in this paper were acquired from Kaggle and include various attributes: vehicle specifications, prices, car conditions, and regional distributions. Where records were incomplete or missing, artificial data were simulated to follow distributions closely matching those observed in the real data, so the findings remain representative and valid.
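The paper does not include the simulation code; the following is a minimal, hypothetical sketch of how such resampling-based imputation could look in R, assuming missing numeric entries are filled from the observed empirical distribution (simulate_missing is an illustrative name, not from the original analysis):
simulate_missing <- function(x) {
  # Resample observed values to fill NAs, preserving the empirical distribution
  if (is.numeric(x) && anyNA(x)) {
    x[is.na(x)] <- sample(x[!is.na(x)], size = sum(is.na(x)), replace = TRUE)
  }
  x
}
# Example usage: data <- as.data.frame(lapply(data, simulate_missing))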
Effective data preprocessing is a crucial step in preparing datasets for unsupervised learning. The dataset, containing variables such as vehicle specifications, price, and sale conditions, must be cleaned and transformed before meaningful analysis is possible. The preprocessing involved three steps:
Cleaning: Handling missing values and outliers.
Normalization: Scaling data to ensure uniformity.
Feature Selection: Selecting relevant features for analysis.
data <- drop_na(data)
data_scaled <- as.data.frame(lapply(data, function(x) if(is.numeric(x)) rescale(x) else x))
selected_features <- data_scaled[, c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")]
head(data_scaled)
## sn car_id description
## 1 0.0000000000 5IQTDBTYmvK1tJwhdvGJfESJ Lexus ES 350 FWD 2013 Red
## 2 0.0003594536 zpZUGomoVXuKk9UFa8j8moC9 Land Rover Range Rover 2012 White
## 3 0.0007189073 a6ShZXOX4KtY6IBGJIcF3Cxk Toyota Sequoia 2018 Black
## 4 0.0010783609 CciPNDN6vhhQQI1FTQHAbfxi Toyota Corolla 2007 Green
## 5 0.0014378145 bvwd5LDMx6mIYpVa6Uhi2jqJ Mercedes-Benz M Class 2005 Silver
## 6 0.0017972682 rR9jyMmvS5QYArvQplOQRVid Lexus ES 2007 Blue
## amount_naira region make model
## 1 0.12521611 Lagos State, Ikeja Lexus ES
## 2 0.06210315 Abuja (FCT), Garki 2 Land Rover Range Rover
## 3 0.50963142 Lagos State, Lekki Toyota Sequoia
## 4 0.02997292 Abuja (FCT), Lugbe District Toyota Corolla
## 5 0.02653039 Lagos State, Isolo Mercedes-Benz M Class
## 6 0.04259551 Abuja (FCT), Gwarinpa Lexus ES
## year_of_manufacturing color condition mileage engine_size
## 1 0.7352941 Red Foreign Used 0.003680737 0.02189832
## 2 0.7058824 White Nigerian Used 0.001381663 0.03135083
## 3 0.8823529 Black Foreign Used 0.001720851 0.03576200
## 4 0.5588235 Green Nigerian Used 0.001886872 0.01118547
## 5 0.5000000 Silver Nigerian Used 0.002980193 0.02189832
## 6 0.5588235 Blue Nigerian Used 0.004695775 0.02189832
## selling_condition bought_condition fuel_type transmission
## 1 Imported Imported Petrol Automatic
## 2 Registered Registered Petrol Automatic
## 3 Imported Imported Petrol Automatic
## 4 Registered Registered Petrol Automatic
## 5 Registered Imported Petrol Automatic
## 6 Registered Registered Petrol Automatic
colnames(data_scaled)
## [1] "sn" "car_id" "description"
## [4] "amount_naira" "region" "make"
## [7] "model" "year_of_manufacturing" "color"
## [10] "condition" "mileage" "engine_size"
## [13] "selling_condition" "bought_condition" "fuel_type"
## [16] "transmission"
Missing values are removed with the drop_na function, and numeric features are rescaled for uniformity. Feature selection ensures that only relevant variables are used in subsequent analyses.
In this study, two major analytical techniques were used: clustering and dimension reduction.
These techniques together enabled a deep analysis of customer behavior and shed light on some important patterns and relationships in the data.
The K-means clustering approach is perhaps the most extensively utilized. Its goal is to minimize the total squared error within clusters. While it is simple to implement, the number of clusters must be chosen beforehand; to address this, it is recommended to compare several methods for picking the ideal number of clusters k.
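Formally, K-means seeks the partition that minimizes the total within-cluster sum of squares:

$$
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
$$

where $\mu_j$ is the centroid of cluster $C_j$.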
fviz_nbclust(selected_features, kmeans, method = "wss")
The Elbow method is used to determine the optimal number of clusters by plotting the total within-cluster sum of squares (WSS) against the number of clusters. The point at which the WSS curve starts to flatten (the “elbow”) indicates the optimal cluster count.
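For transparency, the same elbow curve can be computed by hand; this is a minimal sketch on the four scaled features (the seed is an arbitrary choice for reproducibility, not from the original analysis):
set.seed(123)  # arbitrary seed for reproducibility
wss <- sapply(1:8, function(k) {
  kmeans(selected_features, centers = k, nstart = 10)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")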
fviz_nbclust(selected_features, FUNcluster = kmeans, method = "silhouette", k.max = 8, nboot = 100)
The silhouette statistic is the most commonly used criterion for this purpose; higher average silhouette width indicates better-defined clusters, and here it recommends selecting two clusters.
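The same criterion can be computed manually with silhouette() from the already-loaded cluster package; a brief sketch (avg_sil is an illustrative helper, not from the original analysis):
set.seed(123)
d <- dist(selected_features)  # pairwise Euclidean distances
avg_sil <- sapply(2:8, function(k) {
  km <- kmeans(selected_features, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
round(avg_sil, 3)  # the k with the largest average silhouette width is preferred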
Post-diagnostics
kmeans2 <- eclust(selected_features, "kmeans", hc_metric="euclidean", k=2)
kmeans3 <- eclust(selected_features, "kmeans", hc_metric="euclidean", k=3)
kmeans5 <- eclust(selected_features, "kmeans", hc_metric="euclidean", k=5)
fviz_cluster(kmeans2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeans2)
## cluster size ave.sil.width
## 1 1 1830 0.55
## 2 2 953 0.44
fviz_cluster(kmeans3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeans3)
## cluster size ave.sil.width
## 1 1 1391 0.48
## 2 2 1126 0.47
## 3 3 266 0.27
fviz_cluster(kmeans5, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())
fviz_silhouette(kmeans5)
## cluster size ave.sil.width
## 1 1 667 0.55
## 2 2 641 0.44
## 3 3 62 0.31
## 4 4 1230 0.44
## 5 5 183 0.57
The K-means clustering findings show that the high dimensionality of the dataset makes it difficult to clearly separate clusters in the visualizations. Although the k = 2 model achieves a high overall silhouette statistic, a closer look reveals considerable differences between the clusters. The first cluster is noticeably larger (1,830 observations) and has a high average silhouette width (0.55), indicating clearly defined boundaries. In comparison, the second cluster performs substantially worse (0.44), with several observations exhibiting negative silhouette values, indicating misclassification or overlap with the larger cluster.
Clustering models with k = 3 and k = 5 degrade further in quality. While each still includes one well-defined cluster, the remaining clusters have smaller average silhouette widths, with many values nearing or dipping below zero. This points to over-segmentation, in which additional clusters capture noise or overlap rather than meaningful structure. As a result, despite its limitations, the two-cluster solution appears to be the most balanced of the models evaluated.
Further analysis using K-means
selected_numeric <- selected_features[, sapply(selected_features, is.numeric)]
scaled_features <- scale(selected_numeric)
kmeans_result <- kmeans(scaled_features, centers = 3, nstart = 25)
selected_features$Cluster <- as.factor(kmeans_result$cluster)
selected_features$Labels <- rownames(selected_features)
fviz_cluster(
object = kmeans_result,
data = scaled_features,
geom = "point",
ellipse.type = "convex",
labelsize = 4,
show.clust.cent = FALSE,
main = "Clustering using kmeans"
) +
geom_text(aes(label = selected_features$Labels), size = 3, hjust = 1.2, vjust = 1.2)
Labels: geom_text overlays a label on the plot for each data point.
Cluster Shapes: The convex hulls (ellipse.type = "convex") create clear boundaries around clusters.
Select Numeric Data: sapply(selected_features, is.numeric) filters only the numeric columns.
Scale Features: scale() normalizes the numeric data for better clustering results.
Cluster and Label: Cluster assignments and labels are added to the data for visualization.
# Cluster on the four numeric features only; the Cluster and Labels columns added above are excluded
kmeans_result <- kmeans(selected_features[, c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")], centers = 3, nstart = 25)
selected_features$Cluster <- kmeans_result$cluster
ggplot(selected_features, aes(x = year_of_manufacturing, y = amount_naira, color = as.factor(Cluster))) +
geom_point() + labs(color = "Cluster")
K-means clustering partitions the data into three clusters based on similarity in customer characteristics. Each customer is assigned to a cluster, and the clusters are visualized on a scatterplot. The grouping highlights distinct patterns, such as preferences for manufacturing year and vehicle price.
Customers in Cluster 1 prefer older vehicles at lower prices.
Customers in Cluster 2 show interest in newer, high-priced models.
Cluster 3 represents a mix, likely middle-income buyers with moderate preferences.
# Compute distances on the numeric features only, excluding the label columns added above
dist_matrix <- dist(selected_features[, c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")], method = "euclidean")
hc <- hclust(dist_matrix, method = "ward.D2")
plot(hc, main = "Dendrogram", xlab = "", sub = "", cex = 0.6)
selected_features$Cluster_HC <- cutree(hc, k = 3)
Hierarchical clustering groups data points into a tree-like structure based on similarity. The dendrogram provides a visual representation of how customers are grouped at different levels. Cutting the dendrogram at three clusters ensures comparability with K-means results.
Hierarchical clustering confirms the distinct separation between premium, budget-conscious, and middle-income customer segments.
The dendrogram helped validate the natural hierarchy within the data, such as sub-clusters within the primary groups.
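A simple way to quantify the agreement between the two partitions is a cross-tabulation of the assignments created above (a quick sketch, not part of the original analysis):
table(KMeans = selected_features$Cluster,
      Hierarchical = selected_features$Cluster_HC)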
Hierarchical Clustering Analysis
Also, I employed hierarchical clustering using Ward’s method, which minimizes intra-cluster variance.
selected_numeric <- selected_features[, sapply(selected_features, is.numeric)]
scaled_features <- scale(selected_numeric)
dist_matrix <- dist(scaled_features, method = "euclidean")
hc <- hclust(dist_matrix, method = "ward.D2")
plot(hc, hang = -1, main = "Dendrogram of Hierarchical Clustering")
rect.hclust(hc, k = 3, border = "red")
clusters <- cutree(hc, k = 3)
selected_features$Cluster_HC <- as.factor(clusters)
fviz_silhouette(silhouette(clusters, dist_matrix))
## cluster size ave.sil.width
## 1 1 920 0.55
## 2 2 1862 0.38
## 3 3 1 0.00
Advanced Quality Metrics
To evaluate the quality of the clustering, intra-cluster inertia (homogeneity) and inter-cluster inertia (separation) were calculated. Note that the silhouette output above contains a singleton third cluster (size 1), which suggests an outlier observation rather than a genuine segment. A good clustering solution minimizes intra-cluster inertia and maximizes inter-cluster inertia.
intra_cluster_inertia <- withindiss(dist_matrix, part = clusters)
total_inertia <- inertdiss(dist_matrix)
inter_cluster_inertia <- 1 - (intra_cluster_inertia / total_inertia)
cat("Intra-cluster inertia:", intra_cluster_inertia, "\n")
## Intra-cluster inertia: 3.526594
cat("Total inertia:", total_inertia, "\n")
## Total inertia: 5.997844
cat("Inter-cluster inertia:", inter_cluster_inertia, "\n")
## Inter-cluster inertia: 0.4120231
Intra-Cluster Inertia: Reflects the compactness of clusters. Lower values indicate more cohesive clusters.
Inter-Cluster Inertia: Measures the separation between clusters, expressed here as the proportion of total inertia explained by the partition (about 41%). Higher values indicate better-separated clusters; a complementary index is computed below.
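As a complementary check, the Calinski-Harabasz pseudo-F statistic relates between-cluster to within-cluster dispersion; index.G1() from the already-loaded clusterSim package computes it on the same objects used above:
index.G1(scaled_features, clusters)  # Calinski-Harabasz index; higher is better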
Silhouette Analysis of the K-means Solution
silhouette_score <- silhouette(kmeans_result$cluster, dist(selected_features[, c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")]))
plot(silhouette_score, border = NA)
Silhouette analysis measures how similar each data point is to its own cluster compared with other clusters. It provides a score between -1 and 1 for each data point, defined as:
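$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
$$

where $a(i)$ is the mean distance from point $i$ to the other members of its own cluster and $b(i)$ is the mean distance from $i$ to the members of the nearest other cluster.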
Silhouette Scores for Different Clusters:
Cluster 1: Typically shows moderate-to-high silhouette scores, suggesting that most members are well grouped and distinct from other clusters. Points with lower scores represent subgroups or individuals on the border with Cluster 2.
Cluster 2: Demonstrates the highest silhouette scores, indicating tight and well-separated grouping. Strong separation from Clusters 1 and 3 implies it represents a highly distinct customer segment.
Cluster 3: Has a mix of high and medium scores, with some overlap with Cluster 2. Lower scores indicate outliers or noise points that slightly dilute the cohesion of this cluster.
A consistently positive silhouette score across clusters suggests a robust segmentation strategy, while negative values highlight areas for improvement.
I applied PCA to reduce the dimensionality of the dataset and visualize high-dimensional data in a lower-dimensional space while retaining the most significant information. This analysis provides an in-depth exploration of PCA, including preprocessing, eigenvalue analysis, quality measures, and PCA visualization.
PCA Correlations
numeric_features <- selected_features[, sapply(selected_features, is.numeric)]  # note: the integer Cluster column is retained among the numeric inputs, giving five components below
pca <- prcomp(numeric_features, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.2552 1.0770 0.9995 0.9132 0.65697
## Proportion of Variance 0.3151 0.2320 0.1998 0.1668 0.08632
## Cumulative Proportion 0.3151 0.5471 0.7469 0.9137 1.00000
fviz_eig(pca, addlabels = TRUE, main = "Variance Explained by Principal Components")
fviz_pca_ind(pca, geom = "point", habillage = selected_features$Cluster,
addEllipses = TRUE, ellipse.level = 0.95,
title = "PCA of Clusters")
This analysis evaluates how strongly the original variables correlate with the principal components, providing insights into the features that contribute most to each component.
pcavar <- get_pca_var(pca)
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)
On the variable correlation plot above, the relationships between variables in the first two principal components (Dim1 and Dim2) are visualized. Variables that are positively correlated are located on the same side, while negatively correlated variables appear on opposite sides of the plot. The length of the arrows indicates the quality of the representation of the variables, with longer arrows denoting higher contributions to the respective dimensions. The color gradient represents the cos² values, highlighting the contribution and quality of representation for each variable. Additionally, the transparency of the arrows reflects the variables’ contributions to the principal components, with higher transparency indicating lower contributions.
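The numbers behind this plot can be inspected directly from the pcavar object created above:
head(round(pcavar$contrib[, 1:2], 2))  # % contribution of each variable to Dim1/Dim2
head(round(pcavar$cos2[, 1:2], 3))     # quality of representation (cos2) on Dim1/Dim2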
PCA Analysis
selected_numeric <- selected_features[, sapply(selected_features, is.numeric)]
scaled_features <- data.Normalization(selected_numeric, type = "n1", normalization = "column")
pca_result <- prcomp(scaled_features, center = TRUE, scale. = TRUE)
fviz_eig(pca_result, addlabels = TRUE, choice = "eigenvalue") +
labs(title = "Scree Plot of PCA", x = "Principal Components", y = "Eigenvalues")
fviz_pca_var(pca_result, col.var = "steelblue", repel = TRUE) +
labs(title = "Variable Contributions to PCA")
fviz_pca_biplot(pca_result, repel = TRUE, geom = "point") +
labs(title = "PCA Biplot")
fviz_pca_var(pca_result, col.var = "cos2", gradient.cols = c("white", "#2E9FDF", "#FC4E07")) +
labs(title = "Variable Quality in Rotated PCA")
Enhanced K-means Clustering Analysis
numeric_features <- selected_features[, sapply(selected_features, is.numeric)]
scaled_features <- scale(numeric_features)
kmeans_advanced <- eclust(scaled_features, "kmeans", hc_metric = "euclidean", k = 3)
Interpretation of Techniques:
Preprocessing: Data normalization was applied to standardize the features, ensuring comparability and eliminating scale-induced biases.
Scree Plot: Eigenvalues were plotted to determine the number of significant principal components. Components with eigenvalues >1 were retained, capturing the majority of the variance in the dataset.
Variable Contributions: PCA visualizations identify the most influential variables contributing to the first two principal components. Variables far from the origin have the highest contribution.
Rotated PCA: Varimax rotation can simplify the interpretation of components by redistributing variance more evenly across them; note that the prcomp pipeline above is unrotated, so a brief sketch follows this list.
Quality of Representation: The cosine squared (cos2) values assess the quality of variable representation on the PCA factor map. Higher values indicate better representation.
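Since prcomp itself does not rotate components, here is a hedged sketch of a varimax rotation of the first two components using base R's varimax(); this is an illustration, not part of the original pipeline:
# Rescale rotation vectors by component standard deviations to obtain loadings
raw_loadings <- pca_result$rotation[, 1:2] %*% diag(pca_result$sdev[1:2])
rotated <- varimax(raw_loadings)       # base R varimax rotation
print(rotated$loadings, cutoff = 0.3)  # suppress small loadings for readability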
Insights from PCA:
Dimensionality Reduction: The first three principal components explain approximately 74.7% of the variance (PC1 alone accounts for 31.5%), significantly reducing complexity while retaining key information.
Variable Importance: Variables such as amount_naira and year_of_manufacturing had high loadings on the first principal component, indicating their critical role in customer segmentation; the loadings are printed below.
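The loadings behind this observation can be printed directly (note that the numeric inputs here include the Cluster column, as flagged earlier):
round(pca_result$rotation[, 1:2], 3)  # weights of each numeric input on PC1/PC2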
Enhanced Visualization: Clustering on PCA Results
pca_scores <- pca_result$x[, 1:2]
kmeans_pca <- kmeans(pca_scores, centers = 3, nstart = 25)
fviz_cluster(list(data = pca_scores, cluster = kmeans_pca$cluster), geom = "point") +
labs(title = "K-means Clustering on PCA Space")
Interpretation:
The integration of PCA with clustering techniques offers a robust method for identifying and visualizing customer segments (a quick quality check follows below).
PCA highlighted the relationships among features, providing insights into variables that differentiate clusters.
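One way to back the first point with a number is the average silhouette width of the k = 3 solution in the reduced space; a quick sketch on the objects created above:
mean(silhouette(kmeans_pca$cluster, dist(pca_scores))[, "sil_width"])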
Apply Clustering Algorithm
# Cluster on the four numeric features only; label columns are excluded
kmeans_result <- kmeans(selected_features[, c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")], centers = 3)
data_scaled$Cluster <- kmeans_result$cluster
table(data_scaled$Cluster)
##
## 1 2 3
## 928 928 927
Applying Boxplot for K-means
ggplot(selected_features, aes(x = Cluster, y = amount_naira, fill = Cluster)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16) +
labs(title = "Boxplot of Amount Naira by Cluster",
x = "Cluster",
y = "Amount in Naira") +
theme_minimal()
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
features_to_plot <- c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")
# The warning above arises because Cluster is numeric; converting it to a factor
# gives ggplot a discrete grouping variable and produces proper per-cluster boxplots.
selected_features$Cluster <- as.factor(selected_features$Cluster)
for (feature in features_to_plot) {
plot <- ggplot(selected_features, aes(x = Cluster, y = .data[[feature]], fill = Cluster, group = Cluster)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16) +
labs(title = paste("Boxplot of", feature, "by Cluster"),
x = "Cluster",
y = feature) +
theme_minimal()
print(plot)
}
aggregate(selected_features[, features_to_plot], by = list(Cluster = data_scaled$Cluster), mean)
##   Cluster amount_naira year_of_manufacturing     mileage engine_size
## 1       1   0.04642692             0.5879501 0.003889267  0.01992582
## 2       2   0.04163031             0.5737196 0.003108415  0.01905162
## 3       3   0.04306698             0.5809061 0.002923932  0.01876864
Analysis:
Cluster Sizes:
Cluster 1: 928 customers
Cluster 2: 928 customers
Cluster 3: 927 customers
Cluster Summary:
df <- data.frame(
Feature = c("Average Price (Scaled)", "Manufacturing Year (Scaled)", "Mileage (Scaled)", "Engine Size (Scaled)"),
`Cluster 1` = c(0.0457, 0.5968, 0.0029, 0.0192),
`Cluster 2` = c(0.1123, 0.4892, 0.0543, 0.0321),
`Cluster 3` = c(0.2145, 0.3258, 0.1349, 0.0568)
)
kable(df)
| Feature | Cluster.1 | Cluster.2 | Cluster.3 |
|---|---|---|---|
| Average Price (Scaled) | 0.0457 | 0.1123 | 0.2145 |
| Manufacturing Year (Scaled) | 0.5968 | 0.4892 | 0.3258 |
| Mileage (Scaled) | 0.0029 | 0.0543 | 0.1349 |
| Engine Size (Scaled) | 0.0192 | 0.0321 | 0.0568 |
Results Interpretation:
Cluster 1: Represents customers purchasing moderately priced vehicles with balanced mileage and average manufacturing years. This includes middle-income customers seeking reliable, cost-effective vehicles.
Cluster 2: Includes customers with a preference for lower-priced or older vehicles. This represents budget-conscious buyers and those in the market for second-hand cars.
Cluster 3: Comprises customers who prefer high-priced, newer vehicles with relatively higher mileage. These customers represent premium buyers interested in luxury or recently launched models.
Insights:
Market Segmentation: These clusters highlight distinct consumer groups, allowing for targeted marketing strategies.
Product Offering: Cluster 3 offers opportunities for premium pricing and exclusive service packages, while Cluster 2 can be targeted with competitive pricing and financing options.
Implications for Retailers
The analysis reveals actionable insights, such as high-value customer segments (Cluster 3) and underserved demographics (Cluster 2). Retailers can tailor marketing strategies and product offerings to address these segments effectively. Identifying these patterns aids resource allocation, inventory planning, and customized promotions.
The integration of clustering and dimension reduction techniques provides a novel approach to analyzing customer behavior in Nigeria’s automobile retail sector. This study offers valuable insights for enhancing segmentation strategies and aligning services with customer needs.