Clustering Analysis of Vehicles: A Mixed Data Approach
Author
Saurabh C Srivastava
Published
February 7, 2025
Objective of the Analysis
The objective of this analysis is to classify vehicles into meaningful groups based on a combination of numerical (e.g., miles per gallon (MPG), weight) and categorical (e.g., transmission type, engine configuration) attributes. Using mixed data clustering techniques, we aim to identify patterns among vehicles, allowing us to explore how fuel efficiency, transmission type, and weight contribute to vehicle categorization. The results provide insights into vehicle performance and help distinguish groups of cars with similar characteristics.
Summary of the Code
1. Load Required Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mclust)
Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.
Attaching package: 'mclust'
The following object is masked from 'package:purrr':
map
library(factoextra)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(dbscan)
Attaching package: 'dbscan'
The following object is masked from 'package:stats':
as.dendrogram
library(ggplot2)
library(psych)
Attaching package: 'psych'
The following object is masked from 'package:mclust':
sim
The following object is masked from 'package:M3C':
pca
The following object is masked from 'package:effectsize':
phi
The following object is masked from 'package:datawizard':
rescale
The following objects are masked from 'package:ggplot2':
%+%, alpha
library(cowplot)
Attaching package: 'cowplot'
The following object is masked from 'package:lubridate':
stamp
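The chunks that follow refer to a data frame named `mtcars2`, whose construction does not appear in this write-up. A minimal sketch, assuming it is the built-in `mtcars` data with its categorical columns stored as factors (the choice of columns is an assumption, not the author's confirmed preprocessing):

```r
# Hypothetical reconstruction of mtcars2: the built-in mtcars data with
# the categorical variables converted to factors (assumed column choice).
mtcars2 <- mtcars
cat_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcars2[cat_cols] <- lapply(mtcars2[cat_cols], factor)

# Inspect the resulting mix of numeric and factor columns
str(mtcars2)
```

With a layout like this, `include_factors = TRUE` in the next chunk would have factor columns to act on.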
## Option 1: Identify the number of clusters
n_clust <- n_clusters(
  mtcars2,
  package = c("easystats", "NbClust", "M3C", "mclust"),
  include_factors = TRUE,
  standardize = FALSE # important
)
n_clust
# Method Agreement Procedure:
The choice of 2 clusters is supported by 11 (47.83%) methods out of 23 (Elbow, Silhouette, Gap_Maechler2012, kl, Duda, Pseudot2, Ratkowsky, PtBiserial, Frey, Mcclain, SDindex).
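The silhouette is one of the criteria counted in the agreement procedure above. As a rough illustration of how such a criterion can be evaluated on mixed data, the sketch below computes Gower dissimilarities with `cluster::daisy()`, runs PAM for several values of k, and compares the average silhouette widths. The data preparation and the range of k are assumptions made for illustration; this is not the procedure that `n_clusters()` runs internally.

```r
library(cluster)

# Assumed mixed-type version of mtcars (factor columns chosen for illustration)
dat <- mtcars
dat$am <- factor(dat$am, labels = c("automatic", "manual"))
dat$vs <- factor(dat$vs, labels = c("v-shaped", "straight"))
dat$cyl <- factor(dat$cyl)

# Gower dissimilarity handles numeric and factor columns together
gower_d <- daisy(dat, metric = "gower")

# Average silhouette width of a PAM partition for each candidate k
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  pam(gower_d, k = k, diss = TRUE)$silinfo$avg.width
})
names(avg_sil) <- ks
print(round(avg_sil, 3))

# The k with the largest average silhouette width
best_k <- ks[which.max(avg_sil)]
best_k
```

A single criterion like this can disagree with the others, which is why the agreement procedure reports the share of methods backing each choice of k.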
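## Option 2: Identify the number of clusters using the R package factoextra
# Note: as.factor() followed by as.numeric() replaces each value with its
# integer level code, not the original value.
myxx = mtcars2 |>
  as.data.frame() |>
  mutate_all(as.factor) |>
  mutate_all(as.numeric)
str(myxx)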
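A note on the conversion used in Option 2: chaining `as.factor()` and `as.numeric()` returns each value's integer level code rather than the value itself. A small illustration with made-up numbers (the vector `x` is hypothetical and not taken from the data):

```r
# Toy vector (made-up values) to show what as.factor() then as.numeric() returns
x <- c(21.0, 22.8, 21.4, 22.8)

# Each value becomes its position among the sorted unique values,
# so the spacing between the original values is lost.
codes <- as.numeric(as.factor(x))
codes

# To recover the original numbers from a factor, convert via character first
recovered <- as.numeric(as.character(as.factor(x)))
recovered
```

This means distance-based methods applied to the converted columns operate on ranks of the unique values, which is worth keeping in mind when reading the plots.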
# Create a boxplot showing the relationship between Cluster Group and Average MPG.
# Note: label is mapped globally, so the boxplot layer drops it with a warning;
# only geom_text() uses it.
g1 = ggplot(myxx, aes(x = factor(cluster), y = mpg, label = rownames(myxx))) +
  geom_boxplot(fill = "skyblue", color = "black") +
  geom_jitter(width = 0.2, color = "black", size = 1.5, alpha = 0.5) +
  geom_text(position = position_jitter(width = 0.1), size = 3) + # Add labels with jitter
  labs(
    title = "Relationship between Cluster Group and Average MPG",
    x = "Cluster Group",
    y = "Average MPG"
  ) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
g1
Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Cluster Group, Transmission Type, and Average Weight
# Create a graph showing the relationship between Cluster Group, type of transmission,
# and Average weight; a column chart, dodge style.
# Calculate average weight for each combination of ClusterGroup and am (transmission)
avg_weight <- myxx %>%
  group_by(cluster, am) %>%
  summarize(avg_weight = mean(wt))
`summarise()` has grouped output by 'cluster'. You can override using the
`.groups` argument.
# Create the bar chart
g2 = ggplot(avg_weight, aes(x = factor(cluster), y = avg_weight, fill = factor(am))) +
  geom_col(position = "dodge") +
  labs(
    title = "Average Weight by Cluster Group and Transmission Type",
    x = "Cluster Group",
    y = "Average Weight",
    fill = "Transmission",
    caption = "Saurabh's Work"
  ) +
  scale_fill_manual(values = c("skyblue", "orange"), labels = c("Automatic", "Manual")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
g2
6. Combining the Plots
cowplot::plot_grid(g1, g2)
Conclusion: Mixed Data Clustering for Vehicle Classification
The clustering analysis successfully categorized vehicles into three distinct groups based on a combination of fuel efficiency (MPG), weight, and transmission type. The results reveal key patterns in vehicle characteristics:
Fuel Efficiency vs. Clusters:
Vehicles with higher MPG (fuel-efficient) tend to fall into Cluster 3.
Cluster 1 consists of heavier, fuel-inefficient vehicles.
Cluster 2 represents cars with moderate fuel efficiency.
Weight and Transmission Influence:
Automatic transmission cars are generally heavier, aligning with lower MPG.
Manual transmission cars are lighter and often more fuel-efficient.
Clustering Validity:
The use of K-Means clustering effectively grouped vehicles with similar attributes.
The visualizations support a meaningful separation between clusters, highlighting the relationship between vehicle performance, weight, and fuel consumption.
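The weight and transmission pattern described above can also be checked against the raw data, independently of any clustering. A minimal sketch using the built-in `mtcars` data, assuming it is the source of `mtcars2` and that `am = 0` denotes automatic and `am = 1` manual transmission:

```r
# Mean weight (in 1000 lbs) and MPG by transmission type in the built-in mtcars data
by_am <- aggregate(cbind(wt, mpg) ~ am, data = mtcars, FUN = mean)
by_am$am <- factor(by_am$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
by_am
```

If the automatic row shows the higher mean weight and the lower mean MPG, the raw data agree with the pattern seen in the cluster plots.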