Environment Setup

# Import libraries (install any that are missing)
if (!require(cluster)) install.packages("cluster", dependencies = TRUE)
## Loading required package: cluster
if (!require(factoextra)) install.packages("factoextra", dependencies = TRUE)
## Loading required package: factoextra
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
if (!require(corrplot)) install.packages("corrplot", dependencies = TRUE)
## Loading required package: corrplot
## corrplot 0.92 loaded
if (!require(caret)) install.packages("caret", dependencies = TRUE)
## Loading required package: caret
## Loading required package: lattice
library(factoextra)
library(caret)
library(corrplot)
library(cluster)

Importing Dataset

#Importing the dataset
cars_df <- read.csv("~/Desktop/Spring 25/MA&R Esin/Theltegos data.csv")

# Set the car model names as row names and remove the 'Name' column
rownames(cars_df) <- cars_df$Name
cars_df$Name <- NULL

head(cars_df)

##                        Displacement Moment Horsepower Length Width Weight Trunk
## Kia Picanto 1.1. Start         1086     97         65   3535  1595    929   127
## Suzuki Splash 1.0               996     90         65   3715  1680   1050   178
## Renault Clio 1.0               1149    105         75   3986  1719   1155   288
## Dacia Sandero 1.6              1598    128         87   4020  1746   1111   320
## Fiat Grande Punto 1.4          1598    140         88   3986  1719   1215   288
## Peugot 207 1.4                 1360    133         88   4030  1748   1214   270
##                        Speed Acceleration
## Kia Picanto 1.1. Start   154         15.1
## Suzuki Splash 1.0        160         14.7
## Renault Clio 1.0         167         13.4
## Dacia Sandero 1.6        174         11.5
## Fiat Grande Punto 1.4    177         11.9
## Peugot 207 1.4           180         12.7

Collinearity Assessment

# Calculate the correlation matrix for the data
corr_matrix <- cor(cars_df)
corrplot(corr_matrix, method = "color", type = "lower", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         number.cex = 0.7, col = colorRampPalette(c("lightblue", "white", "pink"))(200))

## Warning in ind1:ind2: numerical expression has 2 elements: only the first used

The correlation matrix shows that many variables (e.g., Displacement, Horsepower, and Speed) are almost perfectly correlated with one another, while Acceleration is strongly inversely correlated with them. This indicates redundancy in the dataset. To address this, I removed variables using a pairwise absolute-correlation cutoff of 0.90, so that the remaining variables carry less overlapping information into the clustering analysis.
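To see which pairs drive the redundancy, the pairs above a cutoff can be listed directly. This is only a minimal base-R sketch: the built-in `mtcars` data stands in for `cars_df` (which is read from a local file), and the helper name `high_corr_pairs` and the 0.85 cutoff are my own assumptions for illustration. It is not how `findCorrelation()` decides what to drop.

```r
# Hypothetical helper: list variable pairs whose |r| exceeds a cutoff
high_corr_pairs <- function(df, cutoff = 0.90) {
  cm <- cor(df)
  # Use only the upper triangle so each pair is reported once
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = round(cm[idx], 2)
  )
}

# mtcars is a stand-in dataset, not the data used in this report
pairs_df <- high_corr_pairs(mtcars[, c("disp", "hp", "wt", "qsec", "cyl")],
                            cutoff = 0.85)
print(pairs_df)
```

Listing the pairs first makes it easier to judge whether the variables that get dropped are the ones I would expect.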

# Identify the columns to drop using a 0.90 correlation cutoff
cols_to_drop <- findCorrelation(corr_matrix, cutoff = 0.90)

# Print the names of variables that will be dropped
cat("Dropping variables:", paste(colnames(cars_df)[cols_to_drop], collapse = ", "), "\n")

## Dropping variables: Speed, Acceleration, Width, Weight, Displacement, Length

# Create a new data frame with the highly correlated variables removed
cars_df_reduced <- cars_df[, setdiff(seq_len(ncol(cars_df)), cols_to_drop), drop = FALSE]

round(cor(cars_df_reduced), 2)

##            Moment Horsepower Trunk
## Moment       1.00       0.85  0.69
## Horsepower   0.85       1.00  0.41
## Trunk        0.69       0.41  1.00

Hierarchical Clustering

Standardization

Standardization is essential here. The dataset’s variables (Moment, Horsepower, and Trunk) are measured on different scales. Without standardization, variables with larger numerical ranges or higher variances would disproportionately influence the distance calculations used in clustering. Each variable is therefore converted to a z-score:

\[z=\frac{x - \mu}{\sigma}\]

where μ is the mean and σ is the standard deviation of the variable.
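As a quick check of the formula, the z-scores can be computed by hand and compared with `scale()`. This is a small sketch on a made-up matrix; the values and the column names `power` and `trunk` are illustrative assumptions, not the report's data.

```r
# Toy matrix with two variables on different scales (made-up values)
x <- cbind(power = c(65, 75, 88, 295), trunk = c(127, 288, 270, 410))

# Apply z = (x - mean) / sd column by column
z_manual <- sweep(sweep(x, 2, colMeans(x), "-"), 2, apply(x, 2, sd), "/")

# scale() performs the same centering and scaling (sample sd, n - 1)
z_scale <- scale(x)

all.equal(z_manual, z_scale, check.attributes = FALSE)
round(colMeans(z_manual), 10)  # column means after standardization
apply(z_manual, 2, sd)         # column standard deviations after standardization
```

After the transformation every column has mean 0 and standard deviation 1, so no variable dominates the distances just because of its units.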

# Standardize the data
cars_scaled <- scale(cars_df_reduced)
head(cars_scaled)

##                            Moment Horsepower      Trunk
## Kia Picanto 1.1. Start -1.1722023 -1.0152681 -1.6085559
## Suzuki Splash 1.0      -1.2426956 -1.0152681 -1.2799724
## Renault Clio 1.0       -1.0916386 -0.8910510 -0.5712628
## Dacia Sandero 1.6      -0.8600178 -0.7419904 -0.3650928
## Fiat Grande Punto 1.4  -0.7391722 -0.7295687 -0.5712628
## Peugot 207 1.4         -0.8096655 -0.7295687 -0.6872335

Distance Measure

I use Euclidean distance for the clustering analysis because it provides a straightforward, “straight-line” measure of dissimilarity between observations. It also pairs naturally with Ward's method, which minimizes within-cluster variance and tends to produce compact, interpretable clusters.

# Compute the Euclidean distance matrix on standardized data
dist_matrix <- dist(cars_scaled, method = "euclidean")
dist_matrix

##                       Kia Picanto 1.1. Start Suzuki Splash 1.0 Renault Clio 1.0
## Suzuki Splash 1.0                 0.33606015                                   
## Renault Clio 1.0                  1.04780594        0.73519883                 
## Dacia Sandero 1.6                 1.31085480        1.02865330       0.34405422
## Fiat Grande Punto 1.4             1.15979139        0.91511153       0.38769710
## Peugot 207 1.4                    1.03048146        0.78770463       0.34501381
## Renault Clio 1.6                  1.06188968        0.80849725       0.33060623
## Porsche Cayman                    4.18034829        4.09084787       3.69952810
## Nissan 350Z                       3.96538708        3.96779881       3.77292475
## Mercedes c200 CDI                 3.02210494        2.82416472       2.22398439
## VW Passat Variant 2.0             3.83834319        3.63463259       3.01258621
## Skoda Octavia 2.0                 3.79859954        3.59734709       2.97977935
## Mercedes E280                     3.93844367        3.76348912       3.20114673
## Audi A6 2.4                       3.31917988        3.08944604       2.43987434
## BMW 525i                          3.52095830        3.32616135       2.74257577
##                       Dacia Sandero 1.6 Fiat Grande Punto 1.4 Peugot 207 1.4
## Suzuki Splash 1.0                                                           
## Renault Clio 1.0                                                            
## Dacia Sandero 1.6                                                           
## Fiat Grande Punto 1.4        0.23929907                                     
## Peugot 207 1.4               0.32628865            0.13571475               
## Renault Clio 1.6             0.30796966            0.19337536     0.12283693
## Porsche Cayman               3.40143872            3.35945857     3.43083833
## Nissan 350Z                  3.53557406            3.42316479     3.45827881
## Mercedes c200 CDI            1.88294860            1.91842704     2.04395271
## VW Passat Variant 2.0        2.67457773            2.72744097     2.85678546
## Skoda Octavia 2.0            2.64159581            2.69116027     2.82004924
## Mercedes E280                2.86511102            2.89611941     3.00161148
## Audi A6 2.4                  2.10359058            2.19242328     2.31051160
## BMW 525i                     2.41200527            2.46353246     2.56715386
##                       Renault Clio 1.6 Porsche Cayman Nissan 350Z
## Suzuki Splash 1.0                                                
## Renault Clio 1.0                                                 
## Dacia Sandero 1.6                                                
## Fiat Grande Punto 1.4                                            
## Peugot 207 1.4                                                   
## Renault Clio 1.6                                                 
## Porsche Cayman              3.40827705                           
## Nissan 350Z                 3.44763229     1.13751175            
## Mercedes c200 CDI           2.05479844     2.15203525  2.73748038
## VW Passat Variant 2.0       2.86974087     2.25006602  3.04672372
## Skoda Octavia 2.0           2.83375863     2.22423777  3.00844385
## Mercedes E280               2.97905291     1.22302413  2.21413454
## Audi A6 2.4                 2.28125492     2.03551918  2.81454659
## BMW 525i                    2.53191955     1.49618305  2.34744706
##                       Mercedes c200 CDI VW Passat Variant 2.0 Skoda Octavia 2.0
## Suzuki Splash 1.0                                                              
## Renault Clio 1.0                                                               
## Dacia Sandero 1.6                                                              
## Fiat Grande Punto 1.4                                                          
## Peugot 207 1.4                                                                 
## Renault Clio 1.6                                                               
## Porsche Cayman                                                                 
## Nissan 350Z                                                                    
## Mercedes c200 CDI                                                              
## VW Passat Variant 2.0        0.83449538                                        
## Skoda Octavia 2.0            0.79412274            0.05154251                  
## Mercedes E280                1.26861679            1.18909813        1.17674612
## Audi A6 2.4                  0.75901096            1.05162469        1.03955591
## BMW 525i                     1.06250727            1.27578472        1.25901571
##                       Mercedes E280 Audi A6 2.4
## Suzuki Splash 1.0                              
## Renault Clio 1.0                               
## Dacia Sandero 1.6                              
## Fiat Grande Punto 1.4                          
## Peugot 207 1.4                                 
## Renault Clio 1.6                               
## Porsche Cayman                                 
## Nissan 350Z                                    
## Mercedes c200 CDI                              
## VW Passat Variant 2.0                          
## Skoda Octavia 2.0                              
## Mercedes E280                                  
## Audi A6 2.4              0.97383790            
## BMW 525i                 0.54425748  0.57271544
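To make the distance concrete, one entry of the matrix can be reproduced by hand. The sketch below copies the standardized values for the first two cars from the `head(cars_scaled)` output above; the vector names `kia` and `suzuki` are mine.

```r
# Standardized values copied from the head(cars_scaled) output
kia    <- c(Moment = -1.1722023, Horsepower = -1.0152681, Trunk = -1.6085559)
suzuki <- c(Moment = -1.2426956, Horsepower = -1.0152681, Trunk = -1.2799724)

# Euclidean distance: square root of the summed squared differences
d_manual <- sqrt(sum((kia - suzuki)^2))

# dist() gives the same value for the two-row matrix
d_dist <- as.numeric(dist(rbind(kia, suzuki), method = "euclidean"))

c(manual = d_manual, dist = d_dist)
```

Both values come out at about 0.336, which agrees with the Kia Picanto / Suzuki Splash entry in the printed distance matrix.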

Linkage Method

I used Ward’s linkage because, at each merge step, it joins the pair of clusters that produces the smallest increase in within-cluster variance, which keeps the clusters as compact and homogeneous as possible. It fits well with the standardized data, where every variable contributes equally to the distance calculations. Ward’s method was the most appropriate choice for this analysis because it tends to produce balanced clusters that are easy to interpret.

I apply agglomerative hierarchical clustering. In this approach, each observation begins as its own cluster, and the most similar clusters are merged step by step until all observations belong to a single group. I selected this approach because it produces a reproducible dendrogram that shows the nested structure of the data and lets me visually evaluate how many clusters to keep.

# Perform hierarchical clustering using Ward's method (ward.D2)
hc <- hclust(dist_matrix, method = "ward.D2")

# Plot the dendrogram
plot(hc, main = "Dendrogram of Car Data (Ward's Method)", 
     xlab = "Car Models", sub = "", cex = 0.8)
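The merge-by-merge process can be inspected directly on the `hclust` object. The following is a minimal sketch on five made-up points (the object names `toy` and `hc_toy` are illustrative), not the car data.

```r
# Five made-up points on a line: two tight pairs and one distant point
toy <- matrix(c(0, 0.5, 5, 5.4, 12), ncol = 1,
              dimnames = list(c("a", "b", "c", "d", "e"), "x"))

hc_toy <- hclust(dist(toy), method = "ward.D2")

# n observations need n - 1 merges; negative entries are single
# observations, positive entries refer to clusters formed in earlier rows
hc_toy$merge
hc_toy$height           # height at which each merge happened

cutree(hc_toy, k = 5)   # start: every observation is its own cluster
cutree(hc_toy, k = 2)   # one merge before the end
cutree(hc_toy, k = 1)   # end: a single cluster
```

Reading `merge` and `height` together gives the same information as the dendrogram in tabular form, which helps when checking where a cut at a given height would fall.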

Interpretation

The dendrogram shows how the car models group by their attributes. Each branch represents a cluster, and the vertical axis marks the height (dissimilarity) at which clusters merge. Three major clusters can be identified: one grouping the lower-end, price-conscious cars (such as the Kia Picanto, Suzuki Splash, and Renault Clio) with lower horsepower and smaller size; one grouping the sports cars with higher horsepower and faster acceleration (the Porsche Cayman and Nissan 350Z); and one grouping the larger, luxury sedans (such as the Mercedes, Audi, BMW, and VW models) with greater weight and more trunk space. A finer cut would support four clusters by splitting the sedan group, but the overall shape supports a three-cluster solution that maps well onto familiar car classes.

In short:

The small city cars cluster together because they share similar attributes: smaller engines, lower horsepower, lighter weight, etc.
The sports cars stand out on a separate branch due to their high horsepower, high speed, and low acceleration times.
The larger sedans form a distinct group reflecting more powerful engines (though not at sports-car levels), higher weight, and more spacious dimensions.

Determining the Optimal Number of Clusters

Dendrogram Inspection

# Cut the dendrogram to form 3 clusters 
clusters_3 <- cutree(hc, k = 3)
table(clusters_3)

## clusters_3
## 1 2 3 
## 7 2 6
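Once the tree is cut, the cluster labels can be used to profile each group on the original scale. This is a sketch under stated assumptions: the built-in `mtcars` data and its columns `disp`, `hp`, and `wt` stand in for `cars_df`, and the object names are hypothetical.

```r
# mtcars (built into R) stands in for cars_df, which comes from a local file
vars    <- c("disp", "hp", "wt")
scaled  <- scale(mtcars[, vars])
hc_demo <- hclust(dist(scaled), method = "ward.D2")
groups  <- cutree(hc_demo, k = 3)

# Mean of each variable per cluster, on the original (unscaled) scale
profile <- aggregate(mtcars[, vars], by = list(cluster = groups), FUN = mean)
profile$n <- as.vector(table(groups))
print(profile)

# Which cars landed in each cluster
split(names(groups), groups)
```

Applied to the report's own objects, the same pattern would use `cars_df` and `clusters_3`, and the per-cluster means would show whether the verbal descriptions of the three groups hold up.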

Elbow Method

Reference: https://uc-r.github.io/hc_clustering

# Elbow method: plot total within-cluster sum of squares against k
fviz_nbclust(cars_scaled, FUN = hcut, method = "wss")

The dendrogram shows three groups merging at a moderate height, indicating three natural clusters that align well with domain knowledge (city cars, sports cars, executive sedans), and the elbow plot shows a clear bend at k = 3.
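The quantity behind the elbow plot can also be computed by hand, which makes it clear what is being plotted. The sketch below uses base R only; `mtcars` is a stand-in dataset and `total_wss` is a hypothetical helper of my own, so the numbers are not those of the car data in this report.

```r
# Hypothetical helper: total within-cluster sum of squares for a labelling
total_wss <- function(x, groups) {
  sum(sapply(split(as.data.frame(x), groups), function(g) {
    # Squared deviations of each row from its cluster centroid
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }))
}

# mtcars stands in for the car data; three standardized variables
x_demo  <- scale(mtcars[, c("disp", "hp", "wt")])
hc_demo <- hclust(dist(x_demo), method = "ward.D2")

k_values <- 1:8
wss <- sapply(k_values, function(k) total_wss(x_demo, cutree(hc_demo, k = k)))

data.frame(k = k_values, wss = round(wss, 2))
```

Calling `plot(k_values, wss, type = "b")` draws the curve. Because the cuts of a single tree are nested, the values can only decrease as k grows, so the useful signal is where the decrease flattens out.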

Alternative Approaches

Several quantitative methods can help determine the number of clusters in addition to eyeballing the dendrogram. One is silhouette analysis, which computes a silhouette width for each observation by comparing its average distance to points in its own cluster with its average distance to points in the nearest other cluster. A larger average silhouette width indicates better-separated clusters. The gap statistic is an alternative that compares the total within-cluster variation for each number of clusters with its expected value under a null reference distribution; the preferred number of clusters is the one with the largest gap. The elbow method, used above, plots the within-cluster sum of squares (WCSS) against the number of clusters; the “elbow” is the point beyond which adding another cluster no longer reduces WCSS substantially.
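Both alternatives are available in the `cluster` package that is already loaded. The following is a sketch under stated assumptions rather than part of the analysis: `mtcars` stands in for the car data, `hc_cut` is a hypothetical wrapper I wrote so that `clusGap()` can call Ward clustering, and `B = 50` and the `"firstSEmax"` rule are illustrative choices.

```r
library(cluster)  # silhouette(), clusGap(), maxSE()

# mtcars stands in for the car data; three standardized variables
x_demo  <- scale(mtcars[, c("disp", "hp", "wt")])
d_demo  <- dist(x_demo)
hc_demo <- hclust(d_demo, method = "ward.D2")

# Average silhouette width for k = 2..6 (not defined for k = 1)
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  sil <- silhouette(cutree(hc_demo, k = k), d_demo)
  mean(sil[, "sil_width"])
})
data.frame(k = k_values, avg_sil = round(avg_sil, 3))

# clusGap() needs a function that returns a list with a $cluster component
hc_cut <- function(x, k) {
  list(cluster = cutree(hclust(dist(x), method = "ward.D2"), k = k))
}
set.seed(42)  # the reference distribution is simulated
gap <- clusGap(x_demo, FUNcluster = hc_cut, K.max = 6, B = 50)

# Number of clusters suggested by the chosen gap rule
k_best <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
k_best
```

With only 15 cars in the report's dataset, these criteria can disagree with one another, so I would treat them as a cross-check on the dendrogram rather than a replacement for it.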

Homework Esin

Aritra

2025-03-11