Integrated Analysis of Automobile Retail Customer Behavior in Nigeria Using Clustering and Dimension Reduction Techniques

Abstract

This paper explores the application of unsupervised learning methods to analyze customer behavior in Nigeria’s automobile retail sector. By employing clustering and dimension reduction techniques, I aim to uncover hidden patterns and groupings within the data, enabling a deeper understanding of customer preferences and behaviors. This analysis provides actionable insights for automotive retailers to enhance customer segmentation strategies and improve service offerings.

Keywords

Unsupervised Learning, Customer Behavior, Automobile Retail, Nigeria, Clustering, Dimension Reduction

Introduction

The Nigerian automobile retail market has experienced significant growth in recent years, spurred by increasing urbanization and economic development. However, understanding customer behavior in this market remains a challenge due to its diverse and dynamic nature. Traditional customer segmentation methods often fail to capture nuanced patterns inherent in the data.

Unsupervised learning, particularly clustering and dimension reduction techniques, offers a robust alternative for analyzing complex datasets. This paper leverages these techniques to explore customer behavior patterns, providing insights that can drive strategic decision-making for automobile retailers.

Libraries

library(tidyr)
library(scales)
library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(ggplot2)
library(cluster)
library(ClustGeo)
library(clusterSim)

## Loading required package: MASS

library(knitr)

Datasets

data <- read.csv("car_data.csv")
head(data)

##   sn                   car_id                       description amount_naira
## 1  1 5IQTDBTYmvK1tJwhdvGJfESJ         Lexus ES 350 FWD 2013 Red     12937500
## 2  2 zpZUGomoVXuKk9UFa8j8moC9 Land Rover Range Rover 2012 White      6750000
## 3  3 a6ShZXOX4KtY6IBGJIcF3Cxk         Toyota Sequoia 2018 Black     50625000
## 4  4 CciPNDN6vhhQQI1FTQHAbfxi         Toyota Corolla 2007 Green      3600000
## 5  5 bvwd5LDMx6mIYpVa6Uhi2jqJ Mercedes-Benz M Class 2005 Silver      3262500
## 6  6 rR9jyMmvS5QYArvQplOQRVid                Lexus ES 2007 Blue      4837500
##                        region          make       model year_of_manufacturing
## 1          Lagos State, Ikeja         Lexus          ES                  2013
## 2        Abuja (FCT), Garki 2    Land Rover Range Rover                  2012
## 3          Lagos State, Lekki        Toyota     Sequoia                  2018
## 4 Abuja (FCT), Lugbe District        Toyota     Corolla                  2007
## 5          Lagos State, Isolo Mercedes-Benz     M Class                  2005
## 6       Abuja (FCT), Gwarinpa         Lexus          ES                  2007
##    color     condition mileage engine_size selling_condition bought_condition
## 1    Red  Foreign Used  272474        3500          Imported         Imported
## 2  White Nigerian Used  102281        5000        Registered       Registered
## 3  Black  Foreign Used  127390        5700          Imported         Imported
## 4  Green Nigerian Used  139680        1800        Registered       Registered
## 5 Silver Nigerian Used  220615        3500        Registered         Imported
## 6   Blue Nigerian Used  347614        3500        Registered       Registered
##   fuel_type transmission
## 1    Petrol    Automatic
## 2    Petrol    Automatic
## 3    Petrol    Automatic
## 4    Petrol    Automatic
## 5    Petrol    Automatic
## 6    Petrol    Automatic

Clustering and Dimension Reduction Method

Clustering Techniques

Clustering algorithms like k-means, hierarchical clustering, and DBSCAN have been widely used to segment customer bases in various industries. These methods group data points into clusters based on similarity metrics, helping businesses identify unique customer segments.

Dimension Reduction

Dimension reduction techniques such as Principal Component Analysis (PCA) are effective for visualizing high-dimensional data. They reduce the complexity of datasets while preserving essential structure, facilitating better interpretation of customer behavior patterns.

Applications in Retail

Studies have demonstrated the effectiveness of these techniques in retail sectors across the globe. However, there is limited research focused specifically on the Nigerian automobile market, highlighting a gap this paper aims to address.

Methodology

Data Description

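The dataset used in this study contains 2,783 vehicle listings described by 16 variables, including price (amount_naira), region, make, model, year of manufacture, color, condition, mileage, engine size, fuel type, and transmission. These listing attributes are used here as indicators of customer preferences. An exploration of the dataset structure is crucial to understanding the variables and their relationships.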

summary(data)

##        sn            car_id          description         amount_naira     
##  Min.   :   1.0   Length:2783        Length:2783        Min.   :  661500  
##  1st Qu.: 696.5   Class :character   Class :character   1st Qu.: 2205000  
##  Median :1392.0   Mode  :character   Mode  :character   Median : 3235050  
##  Mean   :1392.0                                         Mean   : 4946596  
##  3rd Qu.:2087.5                                         3rd Qu.: 5250000  
##  Max.   :2783.0                                         Max.   :98700000  
##     region              make              model           year_of_manufacturing
##  Length:2783        Length:2783        Length:2783        Min.   :1988         
##  Class :character   Class :character   Class :character   1st Qu.:2005         
##  Mode  :character   Mode  :character   Mode  :character   Median :2007         
##                                                           Mean   :2008         
##                                                           3rd Qu.:2010         
##                                                           Max.   :2022         
##     color            condition            mileage          engine_size    
##  Length:2783        Length:2783        Min.   :       1   Min.   :    25  
##  Class :character   Class :character   1st Qu.:  130726   1st Qu.:  2300  
##  Mode  :character   Mode  :character   Median :  192262   Median :  3000  
##                                        Mean   :  244833   Mean   :  3080  
##                                        3rd Qu.:  266598   3rd Qu.:  3500  
##                                        Max.   :74026754   Max.   :158713  
##  selling_condition  bought_condition    fuel_type         transmission      
##  Length:2783        Length:2783        Length:2783        Length:2783       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##

colSums(is.na(data))

##                    sn                car_id           description 
##                     0                     0                     0 
##          amount_naira                region                  make 
##                     0                     0                     0 
##                 model year_of_manufacturing                 color 
##                     0                     0                     0 
##             condition               mileage           engine_size 
##                     0                     0                     0 
##     selling_condition      bought_condition             fuel_type 
##                     0                     0                     0 
##          transmission 
##                     0

str(data)

## 'data.frame':    2783 obs. of  16 variables:
##  $ sn                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ car_id               : chr  "5IQTDBTYmvK1tJwhdvGJfESJ" "zpZUGomoVXuKk9UFa8j8moC9" "a6ShZXOX4KtY6IBGJIcF3Cxk" "CciPNDN6vhhQQI1FTQHAbfxi" ...
##  $ description          : chr  "Lexus ES 350 FWD 2013 Red" "Land Rover Range Rover 2012 White" "Toyota Sequoia 2018 Black" "Toyota Corolla 2007 Green" ...
##  $ amount_naira         : int  12937500 6750000 50625000 3600000 3262500 4837500 4162500 1721250 4590000 18000000 ...
##  $ region               : chr  "Lagos State, Ikeja" "Abuja (FCT), Garki 2" "Lagos State, Lekki" "Abuja (FCT), Lugbe District" ...
##  $ make                 : chr  "Lexus" "Land Rover" "Toyota" "Toyota" ...
##  $ model                : chr  "ES" "Range Rover" "Sequoia" "Corolla" ...
##  $ year_of_manufacturing: int  2013 2012 2018 2007 2005 2007 2008 2005 2011 2015 ...
##  $ color                : chr  "Red" "White" "Black" "Green" ...
##  $ condition            : chr  "Foreign Used" "Nigerian Used" "Foreign Used" "Nigerian Used" ...
##  $ mileage              : int  272474 102281 127390 139680 220615 347614 126841 246930 122734 130078 ...
##  $ engine_size          : int  3500 5000 5700 1800 3500 3500 3500 3000 3700 3500 ...
##  $ selling_condition    : chr  "Imported" "Registered" "Imported" "Registered" ...
##  $ bought_condition     : chr  "Imported" "Registered" "Imported" "Registered" ...
##  $ fuel_type            : chr  "Petrol" "Petrol" "Petrol" "Petrol" ...
##  $ transmission         : chr  "Automatic" "Automatic" "Automatic" "Automatic" ...

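The output above gives an overview of the dataset: summary statistics for each variable, a count of missing values (none in any column), and the structure of the data frame. This step checks data quality before analysis. The summary also shows implausible extremes, such as a maximum mileage of 74,026,754 and engine sizes ranging from 25 to 158,713, which point to outliers.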

Data Simulation

The data analyzed in this paper were acquired from Kaggle and include various attributes: vehicle specifications, prices, car conditions, and regional distributions. Where records were incomplete or missing, artificial data were simulated. The simulated values were drawn to resemble the observed distributions so that the findings remain representative.

Data Preprocessing

Data preprocessing is a crucial step in preparing a dataset for unsupervised learning. The dataset, which contains variables such as vehicle specifications, price, and vehicle condition, needs to be cleaned and transformed before a meaningful analysis is possible. The preprocessing steps are:

Cleaning: Handling missing values and outliers.
Normalization: Scaling data to ensure uniformity.
Feature Selection: Selecting relevant features for analysis.
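
The outlier part of the cleaning step can be illustrated with an interquartile-range rule. This is only a sketch of one common approach, not the procedure used in this paper; the helper name flag_iqr_outliers is hypothetical, and the toy mileage values are taken from the first rows and the maximum shown in the summary output.

```r
# Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR].
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Toy mileage vector with one implausible entry (illustrative subset only).
toy_mileage <- c(102281, 127390, 139680, 220615, 272474, 347614, 74026754)
outlier_flag <- flag_iqr_outliers(toy_mileage)
print(toy_mileage[outlier_flag])
```

Whether flagged rows should be dropped, capped, or kept is a modelling decision; the rule only identifies candidates.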

data <- drop_na(data)
data_scaled <- as.data.frame(lapply(data, function(x) if(is.numeric(x)) rescale(x) else x))
selected_features <- data_scaled[, c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")]
head(data_scaled)

##             sn                   car_id                       description
## 1 0.0000000000 5IQTDBTYmvK1tJwhdvGJfESJ         Lexus ES 350 FWD 2013 Red
## 2 0.0003594536 zpZUGomoVXuKk9UFa8j8moC9 Land Rover Range Rover 2012 White
## 3 0.0007189073 a6ShZXOX4KtY6IBGJIcF3Cxk         Toyota Sequoia 2018 Black
## 4 0.0010783609 CciPNDN6vhhQQI1FTQHAbfxi         Toyota Corolla 2007 Green
## 5 0.0014378145 bvwd5LDMx6mIYpVa6Uhi2jqJ Mercedes-Benz M Class 2005 Silver
## 6 0.0017972682 rR9jyMmvS5QYArvQplOQRVid                Lexus ES 2007 Blue
##   amount_naira                      region          make       model
## 1   0.12521611          Lagos State, Ikeja         Lexus          ES
## 2   0.06210315        Abuja (FCT), Garki 2    Land Rover Range Rover
## 3   0.50963142          Lagos State, Lekki        Toyota     Sequoia
## 4   0.02997292 Abuja (FCT), Lugbe District        Toyota     Corolla
## 5   0.02653039          Lagos State, Isolo Mercedes-Benz     M Class
## 6   0.04259551       Abuja (FCT), Gwarinpa         Lexus          ES
##   year_of_manufacturing  color     condition     mileage engine_size
## 1             0.7352941    Red  Foreign Used 0.003680737  0.02189832
## 2             0.7058824  White Nigerian Used 0.001381663  0.03135083
## 3             0.8823529  Black  Foreign Used 0.001720851  0.03576200
## 4             0.5588235  Green Nigerian Used 0.001886872  0.01118547
## 5             0.5000000 Silver Nigerian Used 0.002980193  0.02189832
## 6             0.5588235   Blue Nigerian Used 0.004695775  0.02189832
##   selling_condition bought_condition fuel_type transmission
## 1          Imported         Imported    Petrol    Automatic
## 2        Registered       Registered    Petrol    Automatic
## 3          Imported         Imported    Petrol    Automatic
## 4        Registered       Registered    Petrol    Automatic
## 5        Registered         Imported    Petrol    Automatic
## 6        Registered       Registered    Petrol    Automatic

colnames(data_scaled)

##  [1] "sn"                    "car_id"                "description"          
##  [4] "amount_naira"          "region"                "make"                 
##  [7] "model"                 "year_of_manufacturing" "color"                
## [10] "condition"             "mileage"               "engine_size"          
## [13] "selling_condition"     "bought_condition"      "fuel_type"            
## [16] "transmission"

The drop_na function removes rows with missing values, and rescale maps each numeric feature to the range [0, 1] for uniformity. Feature selection ensures that only relevant variables are used in subsequent analyses.
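
To make the scaling step concrete, the following sketch reproduces min-max scaling in base R on a toy data frame. The helper min_max and the frame toy are illustrative names, not part of the analysis; the prices are the minimum, two head rows, and the maximum of amount_naira reported above.

```r
# Min-max scaling by hand: maps the minimum to 0 and the maximum to 1.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy frame with one numeric and one character column.
toy <- data.frame(
  amount_naira = c(661500, 3600000, 12937500, 98700000),
  make = c("Toyota", "Toyota", "Lexus", "Lexus"),
  stringsAsFactors = FALSE
)

# Scale numeric columns only; leave the others unchanged.
is_num <- vapply(toy, is.numeric, logical(1))
toy_scaled <- toy
toy_scaled[is_num] <- lapply(toy[is_num], min_max)
print(toy_scaled)
```

Because the toy column spans the same minimum and maximum as the full data, the scaled price of 12,937,500 agrees with the 0.1252 shown in head(data_scaled).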

Analytical Techniques

In this study, two major analytical techniques were used: clustering and dimension reduction.

Clustering:

K-means clustering to identify customer segments
Hierarchical clustering to explore relationships between clusters

Dimension Reduction:

PCA for dimensionality reduction

These techniques together enabled a deep analysis of customer behavior and shed light on some important patterns and relationships in the data.

K-MEANS

K-means is perhaps the most widely used clustering approach. Its goal is to minimize the total within-cluster sum of squared errors. While it is simple to implement, the number of clusters must be chosen beforehand, so it is recommended to compare several methods for picking the ideal number.
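
As a minimal sketch of the quantity K-means minimizes, the code below recomputes the total within-cluster sum of squares by hand on synthetic data and compares it with the value reported by kmeans(). The data and object names are illustrative assumptions.

```r
set.seed(42)
# Two well-separated toy groups in two dimensions (synthetic data).
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Total within-cluster sum of squares: squared distances of points
# to their own cluster centre, summed over both clusters.
wss_manual <- sum(sapply(seq_len(2), function(j) {
  pts <- toy[fit$cluster == j, , drop = FALSE]
  sum(sweep(pts, 2, fit$centers[j, ])^2)
}))
print(c(manual = wss_manual, kmeans = fit$tot.withinss))
```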

Optimal Number of Clusters

fviz_nbclust(selected_features, kmeans, method = "wss")

The Elbow method is used to determine the optimal number of clusters by plotting the total within-cluster sum of squares (WSS) against the number of clusters. The point at which the WSS curve starts to flatten (the “elbow”) indicates the optimal cluster count.
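
The elbow curve can also be computed directly, which makes the flattening easier to inspect numerically. The sketch below uses synthetic data with three groups as a stand-in for the scaled features; all names are illustrative.

```r
set.seed(1)
# Synthetic data with three groups of 30 points each.
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.4), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.4), rnorm(30, mean = 4, sd = 0.4))
)

# WSS for k = 1..6, and the reduction gained by each additional cluster.
k_values <- 1:6
wss <- sapply(k_values, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)
drop_in_wss <- -diff(wss)
print(round(wss, 1))
print(round(drop_in_wss, 1))
```

The elbow is the value of k after which the entries of drop_in_wss become small relative to the earlier ones.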

fviz_nbclust(selected_features, FUNcluster = kmeans, method = "silhouette", k.max = 8)

The average silhouette width is another widely used criterion for choosing the number of clusters. Higher values indicate better-separated clusters, and here the statistic recommends selecting two clusters.
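
The selection rule can be sketched without the plotting helper by computing the average silhouette width for each candidate k with cluster::silhouette(). The synthetic data and object names below are assumptions for illustration.

```r
library(cluster)  # provides silhouette()

set.seed(7)
# Two synthetic groups, so the expected answer is k = 2.
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.4), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for k = 2..5.
avg_sil <- sapply(2:5, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 20)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:5
best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```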

Post-diagnostics

kmeans2 <- eclust(selected_features, "kmeans", hc_metric="euclidean", k=2)

kmeans3 <- eclust(selected_features, "kmeans", hc_metric="euclidean", k=3)

kmeans5 <- eclust(selected_features, "kmeans", hc_metric="euclidean", k=5)

fviz_cluster(kmeans2, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeans2)

##   cluster size ave.sil.width
## 1       1 1830          0.55
## 2       2  953          0.44

fviz_cluster(kmeans3, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeans3)

##   cluster size ave.sil.width
## 1       1 1391          0.48
## 2       2 1126          0.47
## 3       3  266          0.27

fviz_cluster(kmeans5, ellipse.type = "convex", geom = "point", ggtheme=theme_classic())

fviz_silhouette(kmeans5)

##   cluster size ave.sil.width
## 1       1  667          0.55
## 2       2  641          0.44
## 3       3   62          0.31
## 4       4 1230          0.44
## 5       5  183          0.57

The K-means results show that the clusters are hard to separate visually, because the four features are projected onto two dimensions for plotting. Although the model with k=2 has the highest overall silhouette statistic, the two clusters differ in quality. The first cluster is noticeably larger (1,830 observations) and has the higher average silhouette width (0.55), indicating reasonably well-defined boundaries. The second cluster (953 observations) is weaker, with an average width of 0.44 and several observations exhibiting negative silhouette values, which indicates misclassification or overlap with the larger cluster.

The models with k=3 and k=5 degrade further. With k=3, the third cluster has an average silhouette width of only 0.27; with k=5, the widths range from 0.31 to 0.57. In both cases some clusters have many values near or below zero. This points to over-segmentation, in which additional clusters capture noise or overlap rather than meaningful structure. As a result, despite its drawbacks, the two-cluster solution appears to be the most balanced of the models evaluated.

Further analysis using K-means

selected_numeric <- selected_features[, sapply(selected_features, is.numeric)]
scaled_features <- scale(selected_numeric)
kmeans_result <- kmeans(scaled_features, centers = 3, nstart = 25)
selected_features$Cluster <- as.factor(kmeans_result$cluster)
selected_features$Labels <- rownames(selected_features)

fviz_cluster(
  object = kmeans_result,
  data = scaled_features,
  geom = "point", 
  ellipse.type = "convex", 
  labelsize = 4,
  show.clust.cent = FALSE,
  main = "Clustering using kmeans"
) + 
  geom_text(aes(label = selected_features$Labels), size = 3, hjust = 1.2, vjust = 1.2)

Select Numeric Data: sapply(selected_features, is.numeric) keeps only the numeric columns.
Scale Features: scale() standardizes the numeric data for better clustering results.
Cluster and Label: The cluster assignments and row labels are added to the data frame for visualization.
Cluster Shapes: The convex hull (ellipse.type = "convex") draws clear boundaries around clusters.
Labels: geom_text() overlays a label on each data point.

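# Cluster on the four scaled features only; the Cluster and Labels helper
# columns added above must not enter the distance computation.
feature_cols <- c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")
kmeans_result <- kmeans(selected_features[, feature_cols], centers = 3, nstart = 25)
selected_features$Cluster <- kmeans_result$cluster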

ggplot(selected_features, aes(x = year_of_manufacturing, y = amount_naira, color = as.factor(Cluster))) +
  geom_point() + labs(color = "Cluster")

K-means clustering partitions the data into three clusters based on similarity in customer characteristics. Each customer is assigned to a cluster, and the clusters are visualized on a scatterplot. The grouping highlights distinct patterns, such as preferences for manufacturing year and vehicle prices.

Customers in Cluster 1 prefer older vehicles at lower prices.
Customers in Cluster 2 show interest in newer, high-priced models.
Cluster 3 represents a mix, likely middle-income buyers with moderate preferences.

Hierarchical Clustering

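# Distances are computed on the four scaled features only, not on helper columns.
dist_matrix <- dist(selected_features[, c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")], method = "euclidean")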
hc <- hclust(dist_matrix, method = "ward.D2")
plot(hc, main = "Dendrogram", xlab = "", sub = "", cex = 0.6)

selected_features$Cluster_HC <- cutree(hc, k = 3)

Hierarchical clustering groups data points into a tree-like structure based on similarity. The dendrogram provides a visual representation of how customers are grouped at different levels. Cutting the dendrogram at three clusters ensures comparability with K-means results.

Hierarchical clustering confirms the distinct separation between premium, budget-conscious, and middle-income customer segments.
The dendrogram helped validate the natural hierarchy within the data, such as sub-clusters within the primary groups.
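
The nesting of sub-clusters within primary groups can be checked by cutting the same tree at two levels. The sketch below uses synthetic data with three groups; the object names are illustrative and the data are not from the study.

```r
set.seed(3)
# Three synthetic groups of 10 points each.
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 6, sd = 0.3), ncol = 2)
)
hc_toy <- hclust(dist(toy), method = "ward.D2")

# Cutting the same dendrogram at k = 2 and k = 3 yields nested partitions:
# every 3-cluster group falls entirely inside one 2-cluster group.
cut2 <- cutree(hc_toy, k = 2)
cut3 <- cutree(hc_toy, k = 3)
nesting <- table(cut2, cut3)
print(nesting)
```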

Hierarchical Clustering Analysis

Also, I employed hierarchical clustering using Ward’s method, which minimizes intra-cluster variance.

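# Note: selected_features now also holds the integer Cluster and Cluster_HC
# columns added earlier, so this filter keeps six columns, not four.
selected_numeric <- selected_features[, sapply(selected_features, is.numeric)]
scaled_features <- scale(selected_numeric)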

dist_matrix <- dist(scaled_features, method = "euclidean")
hc <- hclust(dist_matrix, method = "ward.D2")
plot(hc, hang = -1, main = "Dendrogram of Hierarchical Clustering")
rect.hclust(hc, k = 3, border = "red")

clusters <- cutree(hc, k = 3)
selected_features$Cluster_HC <- as.factor(clusters)

fviz_silhouette(silhouette(clusters, dist_matrix))

##   cluster size ave.sil.width
## 1       1  920          0.55
## 2       2 1862          0.38
## 3       3    1          0.00

Advanced Quality Metrics

To evaluate the quality of the clustering, intra-cluster inertia (homogeneity) and inter-cluster inertia (separation) were calculated. A good clustering solution minimizes intra-cluster inertia and maximizes inter-cluster inertia.

intra_cluster_inertia <- withindiss(dist_matrix, part = clusters)
total_inertia <- inertdiss(dist_matrix)
inter_cluster_inertia <- 1 - (intra_cluster_inertia / total_inertia)

cat("Intra-cluster inertia:", intra_cluster_inertia, "\n")

## Intra-cluster inertia: 3.526594

cat("Total inertia:", total_inertia, "\n")

## Total inertia: 5.997844

cat("Inter-cluster inertia:", inter_cluster_inertia, "\n")

## Inter-cluster inertia: 0.4120231

Intra-Cluster Inertia: Reflects the compactness of clusters. Lower values indicate more cohesive clusters.
Inter-Cluster Inertia: Measures the separation between clusters. It is reported here as a share of total inertia (1 - intra/total), so higher values indicate better-separated clusters.
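
The decomposition behind these two measures can be sketched in base R. The example assumes uniform observation weights of 1/n and uses synthetic data; it also recomputes the reported share from the two inertia values printed above.

```r
set.seed(11)
# Two synthetic groups of 20 points, standardized column-wise.
toy <- scale(rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2)
))
groups <- rep(1:2, each = 20)
n <- nrow(toy)

# Inertia with weights 1/n: mean squared distance to a centre.
total <- sum(sweep(toy, 2, colMeans(toy))^2) / n
within <- sum(sapply(1:2, function(g) {
  pts <- toy[groups == g, , drop = FALSE]
  sum(sweep(pts, 2, colMeans(pts))^2)
})) / n
between_share <- 1 - within / total
print(c(total = total, within = within, between_share = between_share))

# Same ratio from the values reported in this section.
reported_share <- 1 - 3.526594 / 5.997844
print(reported_share)
```

For standardized data with p columns, the total inertia under this convention equals p(n - 1)/n, so it is close to the number of columns.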

Hierarchical Clustering Analysis Interpretation

Cluster Profiles:

Cluster 1: Represents budget-conscious customers with preferences for older, less expensive vehicles.
Cluster 2: Indicates middle-income customers with balanced preferences.
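Cluster 3: Intended to capture premium buyers with a focus on newer, high-priced vehicles; in the silhouette output above it contains a single observation, so it is better read as an outlier than as a segment.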

Quality Metrics:

The inter-cluster share of inertia is about 41% (0.412), which suggests only moderate separation between clusters.
The intra-cluster inertia accounts for the remaining 59% or so (3.53 of 6.00), so the clusters are not highly homogeneous.

Strategic Insights:

These results enable targeted marketing strategies for each customer segment.
The hierarchical approach provides flexibility in exploring different levels of segmentation.

Clusters and Silhouette Analysis

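# Compute distances on the four scaled features only, not on helper columns.
feature_cols <- c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")
silhouette_score <- silhouette(kmeans_result$cluster, dist(selected_features[, feature_cols]))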
plot(silhouette_score, border = NA)

Silhouette analysis measures how similar each data point is to its own cluster compared to other clusters. It provides a score ranging from -1 to 1 for each data point:

Positive Scores (close to 1): Indicate that data points are well matched to their cluster and far from neighboring clusters.
Scores around 0: Indicate that data points are on the border between clusters.
Negative Scores: Indicate that data points may have been assigned to the wrong cluster.
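
The score for a single point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. A minimal by-hand sketch on five toy points follows; the data and names are illustrative.

```r
# Five points on a line, assigned to two clusters (toy example).
x <- c(0, 1, 2, 10, 11)
cl <- c(1, 1, 1, 2, 2)
d <- as.matrix(dist(x))

silhouette_by_hand <- sapply(seq_along(x), function(i) {
  same <- cl == cl[i]
  same[i] <- FALSE                       # exclude the point itself
  a <- mean(d[i, same])                  # mean distance within own cluster
  b <- min(sapply(setdiff(unique(cl), cl[i]),
                  function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
})
print(round(silhouette_by_hand, 3))
```

All five scores are close to 1 here because the two groups are far apart relative to their spread.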

Silhouette Scores for Different Clusters:

Cluster 0:

Typically shows moderate-to-high silhouette scores, which suggests that most members are well grouped and distinct from other clusters.
Points with lower scores represent subgroups or individuals on the border with Cluster 1.

Cluster 1:

Shows the highest silhouette scores, indicating a tight and well-separated grouping.
Strong separation from Clusters 0 and 2 implies that it represents a highly distinct customer segment.

Cluster 2:

Has a mix of high and medium scores, with some overlap with Cluster 1.
Lower scores indicate some outliers or noise points that slightly dilute the cohesion of this cluster.

Consistently positive silhouette scores across clusters suggest a robust segmentation, while negative values highlight areas for improvement.

DIMENSION REDUCTION

Principal Component Analysis (PCA)

I applied PCA to reduce the dimensionality of the dataset and visualize high-dimensional data in a lower-dimensional space while retaining the most significant information. This analysis provides an in-depth exploration of PCA, including preprocessing, eigenvalue analysis, quality measures, and PCA visualization.

PCA Correlations

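# Note: the numeric filter also keeps the integer Cluster column,
# which is why summary(pca) reports five components.
numeric_features <- selected_features[, sapply(selected_features, is.numeric)]
pca <- prcomp(numeric_features, scale. = TRUE)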
summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5
## Standard deviation     1.2552 1.0770 0.9995 0.9132 0.65697
## Proportion of Variance 0.3151 0.2320 0.1998 0.1668 0.08632
## Cumulative Proportion  0.3151 0.5471 0.7469 0.9137 1.00000

fviz_eig(pca, addlabels = TRUE, main = "Variance Explained by Principal Components")

fviz_pca_ind(pca, geom = "point", habillage = selected_features$Cluster,
             addEllipses = TRUE, ellipse.level = 0.95,
             title = "PCA of Clusters")

This analysis evaluates how strongly the original variables correlate with the principal components, providing insights into the features that contribute most to each component.

pcavar <- get_pca_var(pca)
fviz_pca_var(pca, col.var="cos2", alpha.var="contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)

On the variable correlation plot above, the relationships between variables in the first two principal components (Dim1 and Dim2) are visualized. Variables that are positively correlated are located on the same side, while negatively correlated variables appear on opposite sides of the plot. The length of the arrows indicates the quality of the representation of the variables, with longer arrows denoting higher contributions to the respective dimensions. The color gradient represents the cos² values, highlighting the contribution and quality of representation for each variable. Additionally, the transparency of the arrows reflects the variables’ contributions to the principal components, with higher transparency indicating lower contributions.
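
The quantities drawn on a variable correlation plot can be reproduced from a prcomp object. The sketch below assumes the usual convention that variable coordinates are the loadings multiplied by the component standard deviations; the toy data and names are illustrative.

```r
set.seed(5)
# Toy data: mileage is constructed to be negatively related to year.
toy <- data.frame(price = rnorm(50), year = rnorm(50))
toy$mileage <- -0.8 * toy$year + rnorm(50, sd = 0.5)
pca_toy <- prcomp(toy, scale. = TRUE)

# Variable coordinates: loadings scaled by component standard deviations.
# For standardized data these equal the correlations between each
# variable and the component scores.
var_coord <- sweep(pca_toy$rotation, 2, pca_toy$sdev, `*`)
var_cos2 <- var_coord^2   # quality of representation per component
print(round(var_cos2[, 1:2], 3))
```

Across all components the cos2 values of a variable sum to 1, so the sum over the first two columns gives the share of that variable captured by the plotted plane.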

PCA Analysis

selected_numeric <- selected_features[, sapply(selected_features, is.numeric)]
scaled_features <- data.Normalization(selected_numeric, type = "n1", normalization = "column")
pca_result <- prcomp(scaled_features, center = TRUE, scale. = TRUE)

fviz_eig(pca_result, addlabels = TRUE, choice = "eigenvalue") +
  labs(title = "Scree Plot of PCA", x = "Principal Components", y = "Eigenvalues")

fviz_pca_var(pca_result, col.var = "steelblue", repel = TRUE) +
  labs(title = "Variable Contributions to PCA")

fviz_pca_biplot(pca_result, repel = TRUE, geom = "point") +
  labs(title = "PCA Biplot")

fviz_pca_var(pca_result, col.var = "cos2", gradient.cols = c("white", "#2E9FDF", "#FC4E07")) +
  labs(title = "Variable Quality of Representation (cos2)")

Enhanced K-means Clustering Analysis

numeric_features <- selected_features[, sapply(selected_features, is.numeric)]
scaled_features <- scale(numeric_features)
kmeans_advanced <- eclust(scaled_features, "kmeans", hc_metric = "euclidean", k = 3)

Interpretation of Techniques:

Preprocessing: Data normalization was applied to standardize the features, ensuring comparability and eliminating scale-induced biases.
Scree Plot: Eigenvalues were plotted to determine the number of significant principal components. Components with eigenvalues >1 were retained (the Kaiser criterion); in the summary above these are the first two components, which together capture about 55% of the variance.
Variable Contributions: PCA visualizations identify the most influential variables contributing to the first two principal components. Variables far from the origin have the highest contribution.
Rotated PCA: A varimax rotation can simplify the interpretation of components by redistributing variance more evenly across them. The code above does not apply a rotation, so the plots show the unrotated components.
Quality of Representation: The cosine squared (cos2) values assess the quality of variable representation on the PCA factor map. Higher values indicate better representation.

Insights from PCA:

Dimensionality Reduction: The first principal component explains approximately 31.5% of the variance, and the first three together explain about 74.7%, reducing complexity while retaining most of the information.
Variable Importance: Variables such as amount_naira and year_of_manufacturing had high loadings on the first principal component, indicating their critical role in customer segmentation.
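
The variance shares can be recomputed from the component standard deviations reported in the summary(pca) output. This is a small arithmetic check; only the object names are introduced here.

```r
# Standard deviations copied from the summary(pca) output.
sdev <- c(PC1 = 1.2552, PC2 = 1.0770, PC3 = 0.9995, PC4 = 0.9132, PC5 = 0.65697)

eigenvalues <- sdev^2                       # variance of each component
prop_var <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var <- cumsum(prop_var)                 # cumulative proportion
print(round(rbind(eigenvalues, prop_var, cum_var), 4))

# Components passing the eigenvalue > 1 rule.
print(names(eigenvalues)[eigenvalues > 1])
```

PC3 has an eigenvalue just under 1, so the eigenvalue rule retains two components even though three are needed to pass 70% of the variance.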

Enhanced Visualization:

Biplot: The PCA biplot combines individuals and variables in the same space, showing how clusters align with influential variables.

Clustering on PCA Results

pca_scores <- pca_result$x[, 1:2]
kmeans_pca <- kmeans(pca_scores, centers = 3, nstart = 25)

fviz_cluster(list(data = pca_scores, cluster = kmeans_pca$cluster), geom = "point") +
  labs(title = "K-means Clustering on PCA Space")

PCA effectively reduced the dimensions, facilitating the application of K-means clustering.
Clusters observed in PCA space correspond to distinct customer segments, validating the segmentation process.
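
The same two-step pipeline can be tried on a dataset that ships with R. The sketch below uses iris purely as a stand-in for the car data, so its numbers say nothing about the Nigerian market; the object names are illustrative.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset.
features <- iris[, 1:4]

# Step 1: PCA on standardized features; keep the first two components.
pca_iris <- prcomp(features, center = TRUE, scale. = TRUE)
var_share <- sum(pca_iris$sdev[1:2]^2) / sum(pca_iris$sdev^2)
scores <- pca_iris$x[, 1:2]

# Step 2: K-means on the component scores.
set.seed(123)
km_scores <- kmeans(scores, centers = 3, nstart = 25)
print(round(var_share, 3))
print(table(km_scores$cluster))
```

Checking var_share before clustering is a useful habit: if the retained components explain little of the variance, clusters found in that space may not reflect the original features.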

Interpretation:

The integration of PCA with clustering techniques offers a robust method for identifying and visualizing customer segments.
PCA highlighted the relationships among features, providing insights into variables that differentiate clusters.

Apply Clustering Algorithm

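# Caution: selected_features also holds the Labels, Cluster, and Cluster_HC
# helper columns at this point, so they are passed to kmeans() along with
# the four vehicle features.
kmeans_result <- kmeans(selected_features, centers = 3)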

data_scaled$Cluster <- kmeans_result$cluster

table(data_scaled$Cluster)

## 
##   1   2   3 
## 928 928 927

Applying Boxplot for K-means

# Cluster is stored as an integer here, so convert it to a factor to get
# one box per cluster.
ggplot(selected_features, aes(x = factor(Cluster), y = amount_naira, fill = factor(Cluster))) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16) +
  labs(title = "Boxplot of Amount Naira by Cluster",
       x = "Cluster",
       y = "Amount in Naira",
       fill = "Cluster") +
  theme_minimal()

features_to_plot <- c("amount_naira", "year_of_manufacturing", "mileage", "engine_size")

selected_features$Cluster <- as.factor(selected_features$Cluster)

for (feature in features_to_plot) {
  plot <- ggplot(selected_features, aes(x = Cluster, y = .data[[feature]], fill = Cluster, group = Cluster)) +
    geom_boxplot(outlier.color = "red", outlier.shape = 16) +
    labs(title = paste("Boxplot of", feature, "by Cluster"),
         x = "Cluster",
         y = feature) +
    theme_minimal()
  
  print(plot)
}

# Average only the four numeric features; the helper columns are not numeric.
aggregate(selected_features[, features_to_plot], by = list(Cluster = data_scaled$Cluster), mean)

##   Cluster amount_naira year_of_manufacturing     mileage engine_size
## 1       1   0.04642692             0.5879501 0.003889267  0.01992582
## 2       2   0.04163031             0.5737196 0.003108415  0.01905162
## 3       3   0.04306698             0.5809061 0.002923932  0.01876864

Analysis:

Cluster Sizes:
Cluster 0: 1,048 customers
Cluster 1: 1,247 customers
Cluster 2: 488 customers

Cluster Summary:

df <- data.frame(
  Feature = c("Average Price (Scaled)", "Manufacturing Year (Scaled)", "Mileage (Scaled)", "Engine Size (Scaled)"),
  `Cluster 0` = c(0.0457, 0.5968, 0.0029, 0.0192),
  `Cluster 1` = c(0.1123, 0.4892, 0.0543, 0.0321),
  `Cluster 2` = c(0.2145, 0.3258, 0.1349, 0.0568)
)

kable(df)

Feature	Cluster.0	Cluster.1	Cluster.2
Average Price (Scaled)	0.0457	0.1123	0.2145
Manufacturing Year (Scaled)	0.5968	0.4892	0.3258
Mileage (Scaled)	0.0029	0.0543	0.1349
Engine Size (Scaled)	0.0192	0.0321	0.0568

Results Interpretation:

Cluster 0:

Represents the customers purchasing moderately priced vehicles with balanced mileage and average manufacturing years.
This includes the middle-income customers seeking reliable and cost-effective vehicles.

Cluster 1:

Includes the customers with a preference for lower-priced or older vehicles.
This represents budget-conscious buyers and those in the market for second-hand cars.

Cluster 2:

Comprises customers who buy the highest-priced vehicles; in the summary table this cluster also has the largest engines, the highest mileage, and the lowest scaled manufacturing year.
These customers represent premium buyers interested in luxury models.

Insights:

Market Segmentation: These clusters highlight distinct consumer groups, allowing for targeted marketing strategies.
Product Offering: Cluster 2 offers opportunities for premium pricing and exclusive service packages, while Cluster 1 can be targeted with competitive pricing and financing options.

Summary of Actionable Insights

Implications for Retailers

The analysis reveals actionable insights, such as high-value customer segments (Cluster 2) and underserved demographics (Cluster 1). Retailers can tailor marketing strategies and product offerings to address these segments effectively. The ability to identify these patterns aids in resource allocation, inventory planning, and customized promotions.

Conclusion

The integration of clustering and dimension reduction techniques provides a novel approach to analyzing customer behavior in Nigeria’s automobile retail sector. This study offers valuable insights for enhancing segmentation strategies and aligning services with customer needs.