Clustering Analysis of Vehicles: A Mixed Data Approach

Author

Saurabh C Srivastava

Published

February 7, 2025

Objective of the Analysis

The objective of this analysis is to classify vehicles into meaningful groups based on a combination of numerical (e.g., miles per gallon (MPG), weight) and categorical (e.g., transmission type, engine configuration) attributes. Using mixed data clustering techniques, we aim to identify patterns among vehicles, allowing us to explore how fuel efficiency, transmission type, and weight contribute to vehicle categorization. The results provide insights into vehicle performance and help distinguish groups of cars with similar characteristics.

Summary of the Code

1. Load Required Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(easystats)
# Attaching packages: easystats 0.7.3 (red = needs update)
✖ bayestestR  0.15.0   ✖ correlation 0.8.6 
✖ datawizard  1.0.0    ✔ effectsize  1.0.0 
✖ insight     1.0.1    ✖ modelbased  0.8.9 
✔ performance 0.13.0   ✖ parameters  0.24.1
✖ report      0.6.0    ✖ see         0.9.0 

Restart the R-Session and update packages with `easystats::easystats_update()`.
library(NbClust)
library(M3C)
library(mclust)
Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.

Attaching package: 'mclust'

The following object is masked from 'package:purrr':

    map
library(factoextra)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(dbscan)

Attaching package: 'dbscan'

The following object is masked from 'package:stats':

    as.dendrogram
library(ggplot2)
library(psych)

Attaching package: 'psych'

The following object is masked from 'package:mclust':

    sim

The following object is masked from 'package:M3C':

    pca

The following object is masked from 'package:effectsize':

    phi

The following object is masked from 'package:datawizard':

    rescale

The following objects are masked from 'package:ggplot2':

    %+%, alpha
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp

2. Data Preprocessing & Transformation

mtcars2 = mtcars %>% mutate(vs = factor(vs, labels = c("V","S")),
                            am = factor(am, labels = c("automatic", "manual")),
                            cyl = ordered(cyl),
                            carb = ordered(carb)) %>% as.data.frame()
 
str(mtcars2)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Ord.factor w/ 3 levels "4"<"6"<"8": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : Factor w/ 2 levels "V","S": 1 1 2 2 1 2 1 2 2 2 ...
 $ am  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 4 4 1 1 2 1 4 2 2 4 ...

3. Determining the Optimal Number of Clusters

## Option 1: identify the number of clusters with n_clusters() (method agreement)
n_clust <- n_clusters(mtcars2,
                      package = c("easystats", "NbClust", 
                                  "M3C",
                                  "mclust"),
                      include_factors = TRUE,
                      standardize = FALSE  # important
  )
n_clust
# Method Agreement Procedure:

The choice of 2 clusters is supported by 11 (47.83%) methods out of 23 (Elbow, Silhouette, Gap_Maechler2012, kl, Duda, Pseudot2, Ratkowsky, PtBiserial, Frey, Mcclain, SDindex).
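## Option 2: identify the number of clusters with the factoextra package
# Note: as.factor() followed by as.numeric() replaces every value with its
# integer level code (its position among the sorted unique values of the
# column), not the original number. For example, mpg = 21 becomes 16 below.
myxx = mtcars2 |> as.data.frame() |> mutate_all(as.factor) |> mutate_all(as.numeric)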
str(myxx)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  16 16 19 17 13 12 3 20 19 14 ...
 $ cyl : num  2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  13 13 6 16 23 15 23 12 10 14 ...
 $ hp  : num  11 11 6 11 15 9 20 2 7 13 ...
 $ drat: num  16 16 15 5 6 1 7 11 17 17 ...
 $ wt  : num  9 12 7 16 18 19 21 15 13 18 ...
 $ qsec: num  6 10 22 24 10 29 5 27 30 19 ...
 $ vs  : num  1 1 2 2 1 2 1 2 2 2 ...
 $ am  : num  2 2 2 1 1 1 1 1 1 1 ...
 $ gear: num  2 2 2 1 1 1 1 2 2 2 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
psych::headTail(myxx)
               mpg cyl disp  hp drat  wt qsec  vs  am gear carb
Mazda RX4       16   2   13  11   16   9    6   1   2    2    4
Mazda RX4 Wag   16   2   13  11   16  12   10   1   2    2    4
Datsun 710      19   1    6   6   15   7   22   2   2    2    1
Hornet 4 Drive  17   2   16  11    5  16   24   2   1    1    1
...            ... ...  ... ...  ... ...  ... ... ...  ...  ...
Ford Pantera L   8   3   22  21   20  14    1   1   2    3    4
Ferrari Dino    15   2   11  15   10  10    4   1   2    3    5
Maserati Bora    5   3   18  22    9  21    2   1   2    3    6
Volvo 142E      17   1    9  10   19  11   21   2   2    2    2
myx = as.data.frame(myxx)
n_clust <- fviz_nbclust(myxx, kmeans)
n_clust
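
The conversion used for `myxx` can be surprising, so here is a minimal sketch of what `as.factor()` followed by `as.numeric()` does to a numeric column. The vector `x` is a made-up stand-in for a column such as `mpg`; it is not taken from the analysis.

```r
# Toy numeric vector standing in for a column such as mpg (illustrative values)
x <- c(21.0, 22.8, 21.0, 18.7)

# as.factor() sorts the unique values into levels: "18.7", "21", "22.8".
# as.numeric() on the factor then returns each value's level position.
codes <- as.numeric(as.factor(x))
print(codes)

# To get the original numbers back from a factor, go through the level labels
recovered <- as.numeric(as.character(as.factor(x)))
print(recovered)
```

Because the codes are positions among the sorted unique values, distances computed on them reflect rank order rather than the original measurement scale.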

4. Performing K-Means Clustering

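Runs K-Means clustering with 3 clusters. The method agreement procedure in Option 1 favored 2 clusters, so k = 3 is a modeling choice here rather than the consensus of those methods.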

# Run k-means with 3 clusters and attach the cluster labels to the myxx data frame
set.seed(42)
k = stats::kmeans(myxx, 3)
names(k)
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
myxx$cluster = k$cluster
head(myxx)
                  mpg cyl disp hp drat wt qsec vs am gear carb cluster
Mazda RX4          16   2   13 11   16  9    6  1  2    2    4       2
Mazda RX4 Wag      16   2   13 11   16 12   10  1  2    2    4       2
Datsun 710         19   1    6  6   15  7   22  2  2    2    1       3
Hornet 4 Drive     17   2   16 11    5 16   24  2  1    1    1       2
Hornet Sportabout  13   3   23 15    6 18   10  1  1    1    2       1
Valiant            12   2   15  9    1 19   29  2  1    1    1       2
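
K-Means above runs on integer level codes, so every column is treated as numeric. One common mixed-data alternative is Gower dissimilarity combined with partitioning around medoids (PAM). The sketch below is an assumption about how that could look with the `cluster` package, not the method used in this analysis. The name `cars_mixed` is hypothetical, and k = 3 is chosen only to mirror the text.

```r
library(cluster)  # provides daisy() and pam()

# Rebuild the mixed-type data frame from the built-in mtcars data
cars_mixed <- transform(
  mtcars,
  vs   = factor(vs, labels = c("V", "S")),
  am   = factor(am, labels = c("automatic", "manual")),
  cyl  = ordered(cyl),
  carb = ordered(carb)
)

# Gower dissimilarity handles numeric, ordered, and nominal columns together
gower_d <- daisy(cars_mixed, metric = "gower")

# Partition around medoids on the dissimilarity object; k = 3 mirrors the text
pam_fit <- pam(gower_d, k = 3, diss = TRUE)

# Cluster sizes and the average silhouette width of the solution
print(table(pam_fit$clustering))
print(pam_fit$silinfo$avg.width)
```

The average silhouette width gives a rough check on how well separated the clusters are, and it can be compared across different values of k.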

5. Visualizing Clustering Results

Boxplot: Cluster Group vs. Fuel Efficiency

# Create a boxplot showing the distribution of MPG within each Cluster Group.
# Note: mpg in myxx holds integer level codes, not miles per gallon.
g1 = ggplot(myxx, aes(x = factor(cluster), y = mpg)) + 
     geom_boxplot(fill = "skyblue", color = "black") + 
     geom_jitter(width = 0.2, color = "black", size = 1.5, alpha = 0.5) + 
     # Map label only in geom_text(); a global label aesthetic is dropped by the boxplot stat with a warning
     geom_text(aes(label = rownames(myxx)), position = position_jitter(width = 0.1), size = 3) + # Add labels with jitter
     labs(
          title = "Relationship between Cluster Group and Average MPG",
          x = "Cluster Group",
          y = "Average MPG"
     ) +
     theme_bw() +
     theme(plot.title = element_text(hjust = 0.5))
  
g1

Cluster Group, Transmission Type, and Average Weight

# Create a graph showing the relationship between Cluster Group, type of transmission,
# and Average weight; a column chart, dodge style.
# Note: wt and am in myxx are integer level codes (am: 1 = automatic, 2 = manual).
  
# Calculate average weight for each combination of cluster and am (transmission)
avg_weight <- myxx %>%
  group_by(cluster, am) %>% 
  summarize(avg_weight = mean(wt), .groups = "drop")  # drop grouping to avoid the summarise() message
# Create the bar chart
g2 = ggplot(avg_weight, aes(x = factor(cluster), y = avg_weight, fill = factor(am))) + 
     geom_col(position = "dodge") + 
     labs(
          title = "Average Weight by Cluster Group and Transmission Type",
          x = "Cluster Group",
          y = "Average Weight",
          fill = "Transmission",
          caption = "Saurabh's Work"
     ) +
     scale_fill_manual(values = c("skyblue", "orange"), labels = c("Automatic", "Manual")) + 
     theme_bw()  +
     theme(plot.title = element_text(hjust = 0.5))
  
g2
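
Since `myxx` stores integer level codes, the plotted values are in code units rather than miles per gallon or weight in 1000 lb. A minimal sketch of one way to read the clusters in original units is to attach the K-Means labels back to `mtcars` and summarise there. The names `coded`, `km`, `labelled`, and `summary_tbl` are hypothetical, and reusing `set.seed(42)` is only meant to mirror the text.

```r
# Recreate the integer-coded data: each column becomes its factor level codes
coded <- as.data.frame(
  lapply(mtcars, function(col) as.numeric(as.factor(col))),
  row.names = rownames(mtcars)
)

# Fit k-means with 3 clusters on the coded data, as in the text
set.seed(42)
km <- stats::kmeans(coded, 3)

# Attach the cluster labels to the original, uncoded data
labelled <- cbind(mtcars, cluster = factor(km$cluster))

# Mean mpg and weight (1000 lb) by cluster and transmission (am: 0 = automatic, 1 = manual)
summary_tbl <- aggregate(cbind(mpg, wt) ~ cluster + am, data = labelled, FUN = mean)
print(summary_tbl)
```

Summaries computed this way can be plotted with the same `ggplot()` code as above, with axis values in the original units.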

6. Combining the Plots

cowplot::plot_grid(g1, g2)

Conclusion: Mixed Data Clustering for Vehicle Classification

The clustering analysis grouped the 32 vehicles into three clusters using K-Means on all eleven attributes, each converted to integer level codes. The plots then examine those clusters by fuel efficiency (MPG), weight, and transmission type. The results reveal key patterns in vehicle characteristics:

  1. Fuel Efficiency vs. Clusters:

    • Vehicles with higher MPG (fuel-efficient) tend to fall into Cluster 3.

    • Cluster 1 consists of heavier, fuel-inefficient vehicles.

    • Cluster 2 represents cars with moderate fuel efficiency.

  2. Weight and Transmission Influence:

    • Automatic transmission cars are generally heavier, aligning with lower MPG.

    • Manual transmission cars are lighter and often more fuel-efficient.

  3. Clustering Validity:

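    • K-Means grouped vehicles with similar attributes, although distances were computed on integer level codes rather than on the original measurement scales.

    • The visualizations suggest meaningful separation between clusters, highlighting the relationship between vehicle performance, weight, and fuel consumption. No validity index is reported for the k = 3 solution, and the method agreement procedure favored 2 clusters.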