#Importing libraries
# Load necessary libraries
if (!require(cluster)) install.packages("cluster", dependencies = TRUE)
## Loading required package: cluster
if (!require(factoextra)) install.packages("factoextra", dependencies = TRUE)
## Loading required package: factoextra
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
if (!require(corrplot)) install.packages("corrplot", dependencies = TRUE)
## Loading required package: corrplot
## corrplot 0.92 loaded
if (!require(caret)) install.packages("caret", dependencies = TRUE)
## Loading required package: caret
## Loading required package: lattice
library(cluster)     # For silhouette and clustering functions
library(factoextra)  # For visualizations such as gap statistic plots
library(caret)       # For findCorrelation()
library(corrplot)    # For the correlation plot
#Importing the dataset
cars_df <- read.csv("~/Desktop/Spring 25/MA&R Esin/Theltegos data.csv")
# Set the car model names as row names and remove the 'Name' column
rownames(cars_df) <- cars_df$Name
cars_df$Name <- NULL
head(cars_df)
## Displacement Moment Horsepower Length Width Weight Trunk
## Kia Picanto 1.1. Start 1086 97 65 3535 1595 929 127
## Suzuki Splash 1.0 996 90 65 3715 1680 1050 178
## Renault Clio 1.0 1149 105 75 3986 1719 1155 288
## Dacia Sandero 1.6 1598 128 87 4020 1746 1111 320
## Fiat Grande Punto 1.4 1598 140 88 3986 1719 1215 288
## Peugot 207 1.4 1360 133 88 4030 1748 1214 270
## Speed Acceleration
## Kia Picanto 1.1. Start 154 15.1
## Suzuki Splash 1.0 160 14.7
## Renault Clio 1.0 167 13.4
## Dacia Sandero 1.6 174 11.5
## Fiat Grande Punto 1.4 177 11.9
## Peugot 207 1.4 180 12.7
# Calculate the correlation matrix for your data
corr_matrix <- cor(cars_df)
corrplot(corr_matrix, method = "color", type = "lower",
tl.col = "black", tl.srt = 45, addCoef.col = "black",
number.cex = 0.7, col = colorRampPalette(c("lightblue", "white", "pink"))(200))
## Warning in ind1:ind2: numerical expression has 2 elements: only the first used
The correlation matrix shows that many variables (e.g., Displacement, Horsepower, and Speed) are almost perfectly correlated with one another, while Acceleration is strongly negatively correlated with them. This indicates redundancy in the dataset: clustering on all of them would count essentially the same information several times. To address this, I remove variables so that no remaining pair has an absolute correlation above 0.90.
# Identify columns to drop based on the 0.90 cutoff
cols_to_drop <- findCorrelation(corr_matrix, cutoff = 0.90)
# Print the names of variables that will be dropped
cat("Dropping variables:", paste(colnames(cars_df)[cols_to_drop], collapse = ", "), "\n")
## Dropping variables: Speed, Acceleration, Width, Weight, Displacement, Length
# Create a new data frame with the highly correlated variables removed
cars_df_reduced <- cars_df[, -cols_to_drop]
# Optionally, verify the new correlation matrix
round(cor(cars_df_reduced), 2)
## Moment Horsepower Trunk
## Moment 1.00 0.85 0.69
## Horsepower 0.85 1.00 0.41
## Trunk 0.69 0.41 1.00
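To see which pairs push variables over the cutoff, the pairs whose absolute correlation exceeds it can be listed directly. The sketch below uses base R only. It assumes the built-in `mtcars` data as a stand-in for `cars_df`, and `high_corr_pairs` is a hypothetical helper written for illustration, not part of caret.

```r
# Hypothetical helper: list variable pairs with |r| above a cutoff.
high_corr_pairs <- function(df, cutoff = 0.90) {
  cm <- cor(df)
  # Look only at the upper triangle so each pair is reported once
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = round(cm[idx], 2),
    stringsAsFactors = FALSE
  )
}

# mtcars is a stand-in dataset, not the car data used in this report
pairs_found <- high_corr_pairs(mtcars, cutoff = 0.85)
print(pairs_found)
```

Applied to the report's data, the call would be `high_corr_pairs(cars_df, cutoff = 0.90)`.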
Yes, standardization is essential. The remaining variables (Moment, Horsepower, and Trunk) are measured in different units and on different scales. Without standardization, variables with larger numerical ranges or higher variances would dominate the distance calculations used in clustering. Each variable is therefore converted to a z-score:
\[z=\frac{x - \mu}{\sigma}\]
where μ is the mean and σ is the standard deviation of the variable.
# Standardize the data
cars_scaled <- scale(cars_df_reduced)
head(cars_scaled)
## Moment Horsepower Trunk
## Kia Picanto 1.1. Start -1.1722023 -1.0152681 -1.6085559
## Suzuki Splash 1.0 -1.2426956 -1.0152681 -1.2799724
## Renault Clio 1.0 -1.0916386 -0.8910510 -0.5712628
## Dacia Sandero 1.6 -0.8600178 -0.7419904 -0.3650928
## Fiat Grande Punto 1.4 -0.7391722 -0.7295687 -0.5712628
## Peugot 207 1.4 -0.8096655 -0.7295687 -0.6872335
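As a quick check on what `scale()` does, the z-score formula can be applied by hand and compared with its output. The sketch below uses only the first six rows of Moment and Horsepower from the `head(cars_df)` output as a toy matrix. This is an assumption for illustration: the report scales all rows, so the numbers differ from those printed above.

```r
# First six rows of Moment and Horsepower from the head() output,
# used as a toy matrix (the report scales all rows, so values differ)
x <- cbind(Moment     = c(97, 90, 105, 128, 140, 133),
           Horsepower = c(65, 65, 75, 87, 88, 88))

# Manual z-scores: subtract the column mean, divide by the column sd
z_manual <- sweep(sweep(x, 2, colMeans(x), "-"), 2, apply(x, 2, sd), "/")

# scale() does the same thing (sd() uses the n - 1 denominator)
z_scale <- scale(x)

all.equal(z_manual, z_scale, check.attributes = FALSE)
round(colMeans(z_scale), 10)  # column means are 0 up to rounding
apply(z_scale, 2, sd)         # column standard deviations are 1
```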
I use Euclidean distance for the clustering analysis because it provides a straightforward, “straight-line” measure of dissimilarity between observations. It also pairs naturally with Ward's method, whose criterion (minimizing the increase in within-cluster variance at each merge) is defined in terms of Euclidean distances and tends to give compact, interpretable clusters.
# Compute the Euclidean distance matrix on standardized data
dist_matrix <- dist(cars_scaled, method = "euclidean")
dist_matrix
## Kia Picanto 1.1. Start Suzuki Splash 1.0 Renault Clio 1.0
## Suzuki Splash 1.0 0.33606015
## Renault Clio 1.0 1.04780594 0.73519883
## Dacia Sandero 1.6 1.31085480 1.02865330 0.34405422
## Fiat Grande Punto 1.4 1.15979139 0.91511153 0.38769710
## Peugot 207 1.4 1.03048146 0.78770463 0.34501381
## Renault Clio 1.6 1.06188968 0.80849725 0.33060623
## Porsche Cayman 4.18034829 4.09084787 3.69952810
## Nissan 350Z 3.96538708 3.96779881 3.77292475
## Mercedes c200 CDI 3.02210494 2.82416472 2.22398439
## VW Passat Variant 2.0 3.83834319 3.63463259 3.01258621
## Skoda Octavia 2.0 3.79859954 3.59734709 2.97977935
## Mercedes E280 3.93844367 3.76348912 3.20114673
## Audi A6 2.4 3.31917988 3.08944604 2.43987434
## BMW 525i 3.52095830 3.32616135 2.74257577
## Dacia Sandero 1.6 Fiat Grande Punto 1.4 Peugot 207 1.4
## Suzuki Splash 1.0
## Renault Clio 1.0
## Dacia Sandero 1.6
## Fiat Grande Punto 1.4 0.23929907
## Peugot 207 1.4 0.32628865 0.13571475
## Renault Clio 1.6 0.30796966 0.19337536 0.12283693
## Porsche Cayman 3.40143872 3.35945857 3.43083833
## Nissan 350Z 3.53557406 3.42316479 3.45827881
## Mercedes c200 CDI 1.88294860 1.91842704 2.04395271
## VW Passat Variant 2.0 2.67457773 2.72744097 2.85678546
## Skoda Octavia 2.0 2.64159581 2.69116027 2.82004924
## Mercedes E280 2.86511102 2.89611941 3.00161148
## Audi A6 2.4 2.10359058 2.19242328 2.31051160
## BMW 525i 2.41200527 2.46353246 2.56715386
## Renault Clio 1.6 Porsche Cayman Nissan 350Z
## Suzuki Splash 1.0
## Renault Clio 1.0
## Dacia Sandero 1.6
## Fiat Grande Punto 1.4
## Peugot 207 1.4
## Renault Clio 1.6
## Porsche Cayman 3.40827705
## Nissan 350Z 3.44763229 1.13751175
## Mercedes c200 CDI 2.05479844 2.15203525 2.73748038
## VW Passat Variant 2.0 2.86974087 2.25006602 3.04672372
## Skoda Octavia 2.0 2.83375863 2.22423777 3.00844385
## Mercedes E280 2.97905291 1.22302413 2.21413454
## Audi A6 2.4 2.28125492 2.03551918 2.81454659
## BMW 525i 2.53191955 1.49618305 2.34744706
## Mercedes c200 CDI VW Passat Variant 2.0 Skoda Octavia 2.0
## Suzuki Splash 1.0
## Renault Clio 1.0
## Dacia Sandero 1.6
## Fiat Grande Punto 1.4
## Peugot 207 1.4
## Renault Clio 1.6
## Porsche Cayman
## Nissan 350Z
## Mercedes c200 CDI
## VW Passat Variant 2.0 0.83449538
## Skoda Octavia 2.0 0.79412274 0.05154251
## Mercedes E280 1.26861679 1.18909813 1.17674612
## Audi A6 2.4 0.75901096 1.05162469 1.03955591
## BMW 525i 1.06250727 1.27578472 1.25901571
## Mercedes E280 Audi A6 2.4
## Suzuki Splash 1.0
## Renault Clio 1.0
## Dacia Sandero 1.6
## Fiat Grande Punto 1.4
## Peugot 207 1.4
## Renault Clio 1.6
## Porsche Cayman
## Nissan 350Z
## Mercedes c200 CDI
## VW Passat Variant 2.0
## Skoda Octavia 2.0
## Mercedes E280
## Audi A6 2.4 0.97383790
## BMW 525i 0.54425748 0.57271544
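Any entry of the distance matrix can be reproduced by hand as the square root of the summed squared differences between two standardized rows. The sketch below does this for the Kia Picanto and Suzuki Splash, using the values printed by `head(cars_scaled)`. Because those values are rounded to seven decimals, the result should match the first entry of the matrix (about 0.336) only up to rounding. The names `kia` and `suzuki` are introduced here for illustration.

```r
# Standardized values copied from the head(cars_scaled) output
kia    <- c(Moment = -1.1722023, Horsepower = -1.0152681, Trunk = -1.6085559)
suzuki <- c(Moment = -1.2426956, Horsepower = -1.0152681, Trunk = -1.2799724)

# Euclidean distance: square root of the summed squared differences
d_manual <- sqrt(sum((kia - suzuki)^2))

# dist() gives the same number for a two-row matrix
d_dist <- as.numeric(dist(rbind(kia, suzuki), method = "euclidean"))

c(manual = d_manual, dist = d_dist)
```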
I used Ward’s linkage because, at each merge step, it joins the two clusters whose combination gives the smallest increase in within-cluster variance, which keeps the clusters as compact and homogeneous as possible. Because the data are standardized, every variable contributes equally to that variance. Ward’s method suited this analysis because it tends to produce clusters that are balanced in size and easy to interpret.
I apply agglomerative hierarchical clustering. In this approach each observation starts as its own cluster, and the most similar clusters are merged step by step until all observations are in a single group. I selected it because the result is deterministic and the dendrogram shows the nested structure of the data, which lets me assess the number of clusters visually.
# Perform hierarchical clustering using Ward's method (ward.D2)
hc <- hclust(dist_matrix, method = "ward.D2")
# Plot the dendrogram
plot(hc, main = "Dendrogram of Car Data (Ward's Method)",
xlab = "Car Models", sub = "", cex = 0.8)
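The merge heights on the dendrogram's vertical axis can be read from the `hclust` object. With `ward.D2`, the first merge joins the two closest observations at a height equal to their Euclidean distance. The sketch below checks this on the built-in `mtcars` data, which is a stand-in for the car data used here, so the names and values are for illustration only.

```r
# mtcars is a stand-in dataset, not the car data used in this report
x <- scale(mtcars[, c("hp", "wt", "qsec")])
d <- dist(x, method = "euclidean")
hc_demo <- hclust(d, method = "ward.D2")

# The first merge joins the closest pair, at a height equal to their distance
first_height <- hc_demo$height[1]
min_dist     <- min(d)
c(first_height = first_height, min_dist = min_dist)

# hc_demo$merge records each step; negative entries are single observations
closest_pair <- rownames(x)[-hc_demo$merge[1, ]]
closest_pair
```

The same components (`hc$height`, `hc$merge`) are available on the `hc` object fitted above.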
The dendrogram shows how the car models group by their attributes. Each branch is a cluster, and the vertical axis marks the height (dissimilarity) at which clusters merge. Three major clusters can be identified: lower-end, price-conscious cars (such as the Kia Picanto, Suzuki Splash, and Renault Clio) with lower horsepower and smaller size; sports cars with higher horsepower and faster acceleration (the Porsche Cayman and Nissan 350Z); and larger, luxury sedans (such as the Mercedes, Audi, BMW, and VW models) with higher weight and more trunk space. A finer cut would give four clusters by splitting the sedan group, but the overall shape supports a three-cluster solution that maps well onto practical car classes. Note that only Moment, Horsepower, and Trunk entered the clustering; attributes such as weight, speed, and acceleration are highly correlated with these and are used here only to describe the groups.
In short:
The small city cars cluster together because they share similar attributes: smaller engines, lower horsepower, and lighter weight.
The sports cars sit on a separate branch because of their high horsepower, high top speed, and short acceleration times.
The larger sedans form a distinct group with more powerful engines (though not at sports-car levels), higher weight, and larger dimensions.
# Cut the dendrogram to form 3 clusters (as an example)
clusters_3 <- cutree(hc, k = 3)
table(clusters_3)
## clusters_3
## 1 2 3
## 7 2 6
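The table gives only cluster sizes. To see which observations fall in each cluster and how the groups differ, the labels from `cutree` can be used to split the row names and to average the original variables. The sketch below assumes the built-in `mtcars` data as a stand-in. For the report's data the equivalent calls would use `clusters_3` and `cars_df`.

```r
# mtcars is a stand-in dataset, not the car data used in this report
vars <- c("hp", "wt", "qsec")
x <- scale(mtcars[, vars])
hc_demo <- hclust(dist(x), method = "ward.D2")
groups  <- cutree(hc_demo, k = 3)

# Names of the observations in each cluster
members <- split(names(groups), groups)
members

# Cluster means on the original (unscaled) variables help label each group
profile <- aggregate(mtcars[, vars], by = list(cluster = groups), FUN = mean)
profile
```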
Reference: https://uc-r.github.io/hc_clustering
fviz_nbclust(cars_scaled, FUN = hcut, method = "wss")
The dendrogram shows a distinct grouping at a moderate height, indicating three natural clusters that align with domain knowledge (city cars, sports cars, executive sedans), and the elbow plot shows a clear bend at k = 3.
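The quantity behind an elbow plot can also be computed by hand: for each k, cut the tree and sum the squared deviations of every observation from its cluster mean. The sketch below is a base-R illustration on the built-in `mtcars` data (a stand-in). `total_wss` is a hypothetical helper, and it is not claimed to reproduce the internals of `fviz_nbclust`.

```r
# Hypothetical helper: total within-cluster sum of squares for a labelling
total_wss <- function(x, groups) {
  sum(sapply(split(as.data.frame(x), groups), function(g) {
    # Squared deviations from the cluster's own column means
    sum(scale(as.matrix(g), center = TRUE, scale = FALSE)^2)
  }))
}

# mtcars is a stand-in dataset, not the car data used in this report
x <- scale(mtcars[, c("hp", "wt", "qsec")])
hc_demo <- hclust(dist(x), method = "ward.D2")

ks  <- 1:6
wss <- sapply(ks, function(k) total_wss(x, cutree(hc_demo, k = k)))
data.frame(k = ks, wss = round(wss, 2))
```

The elbow is the value of k after which the drop in `wss` from one row to the next becomes small.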
Beyond visual inspection of the dendrogram, several quantitative methods help determine the number of clusters. Silhouette analysis computes a silhouette width for each observation by comparing its average distance to points in its own cluster with its average distance to points in the nearest other cluster; a larger average silhouette width indicates better-separated clusters. The gap statistic compares the total within-cluster variation for each number of clusters with its expected value under a null reference distribution, and favors the number of clusters with the largest gap. The elbow method plots the within-cluster sum of squares (WCSS) against the number of clusters; the “elbow” is the point beyond which adding another cluster reduces WCSS only slightly.
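A minimal sketch of the silhouette and gap-statistic checks, using functions from the cluster package (`silhouette`, `clusGap`, `maxSE`). It runs on the built-in `mtcars` data as a stand-in, so the suggested k is not a result for the car data. `hc_cut` is a hypothetical wrapper written to match the interface `clusGap` expects, and the choices `K.max = 6`, `B = 50`, and the seed are arbitrary assumptions.

```r
library(cluster)  # silhouette(), clusGap(), maxSE()

# mtcars is a stand-in dataset, not the car data used in this report
x <- scale(mtcars[, c("hp", "wt", "qsec")])
d <- dist(x)
hc_demo <- hclust(d, method = "ward.D2")

# Average silhouette width for k = 2..6 (not defined for k = 1)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc_demo, k = k), d)
  mean(sil[, "sil_width"])
})
data.frame(k = ks, avg_sil = round(avg_sil, 3))

# Hypothetical wrapper: clusGap needs function(x, k) returning list(cluster = ...)
hc_cut <- function(x, k) {
  list(cluster = cutree(hclust(dist(x), method = "ward.D2"), k = k))
}

set.seed(123)  # the reference data sets are simulated
gap <- clusGap(x, FUNcluster = hc_cut, K.max = 6, B = 50)

# Number of clusters suggested by the "firstSEmax" rule
k_gap <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
k_gap
```

For the report's data, the same calls would take `cars_scaled`, `dist_matrix`, and `hc` in place of `x`, `d`, and `hc_demo`.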