Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto, see https://quarto.org.
When you click the Render button, a document will be generated that includes both content and the output of embedded code. You can embed code like this:
1 + 1
[1] 2
You can add options to executable code like this:
[1] 4
The echo: false option disables the printing of code (only output is displayed).
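In the .qmd source, such a chunk is written with a chunk option; a minimal sketch (assuming, as in the default Quarto template, that the hidden computation producing the output above is 2 * 2):
```{r}
#| echo: false
2 * 2
```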
# Load the relevant packages
library(readxl)
library(rmarkdown)
library(corrplot)
corrplot 0.95 loaded
library(caret)
Loading required package: ggplot2
Loading required package: lattice
# Load an Excel file
data <- read_excel("~/Desktop/Theltegos.xlsx")
# View the first few rows
head(data)
# A tibble: 6 × 10
Name Displacement Moment Horsepower Length Width Weight Trunk Speed
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Kia Picanto 1.… 1086 97 65 3535 1595 929 127 154
2 Suzuki Splash … 996 90 65 3715 1680 1050 178 160
3 Renault Clio 1… 1149 105 75 3986 1719 1155 288 167
4 Dacia Sandero … 1598 128 87 4020 1746 1111 320 174
5 Fiat Grande Pu… 1598 140 88 3986 1719 1215 288 177
6 Peugot 207 1.4 1360 133 88 4030 1748 1214 270 180
# ℹ 1 more variable: Acceleration <dbl>
# View the structure of the dataset
str(data)
tibble [15 × 10] (S3: tbl_df/tbl/data.frame)
$ Name : chr [1:15] "Kia Picanto 1.1. Start" "Suzuki Splash 1.0" "Renault Clio 1.0" "Dacia Sandero 1.6" ...
$ Displacement: num [1:15] 1086 996 1149 1598 1598 ...
$ Moment : num [1:15] 97 90 105 128 140 133 125 340 353 270 ...
$ Horsepower : num [1:15] 65 65 75 87 88 88 95 295 301 136 ...
$ Length : num [1:15] 3535 3715 3986 4020 3986 ...
$ Width : num [1:15] 1595 1680 1719 1746 1719 ...
$ Weight : num [1:15] 929 1050 1155 1111 1215 ...
$ Trunk : num [1:15] 127 178 288 320 288 270 275 410 235 485 ...
$ Speed : num [1:15] 154 160 167 174 177 180 178 275 250 208 ...
$ Acceleration: num [1:15] 15.1 14.7 13.4 11.5 11.9 12.7 11.4 5.4 5.8 10.8 ...
# Exclude the 'Name' column (non-numeric) and calculate the correlation matrix
cor_matrix <- cor(data[, -1], use = "complete.obs") # Exclude missing values
# Print the correlation matrix
print(cor_matrix)
Displacement Moment Horsepower Length Width
Displacement 1.0000000 0.8752790 0.9833009 0.6567364 0.7643470
Moment 0.8752790 1.0000000 0.8466490 0.7665564 0.7657827
Horsepower 0.9833009 0.8466490 1.0000000 0.6082133 0.7319137
Length 0.6567364 0.7665564 0.6082133 1.0000000 0.9116335
Width 0.7643470 0.7657827 0.7319137 0.9116335 1.0000000
Weight 0.7678926 0.8618937 0.7143511 0.9213978 0.8837887
Trunk 0.4698137 0.6908975 0.4079151 0.9342425 0.7833397
Speed 0.9669150 0.8592533 0.9683902 0.7411032 0.8192538
Acceleration -0.9685791 -0.8609295 -0.9614512 -0.7142153 -0.8182177
Weight Trunk Speed Acceleration
Displacement 0.7678926 0.4698137 0.9669150 -0.9685791
Moment 0.8618937 0.6908975 0.8592533 -0.8609295
Horsepower 0.7143511 0.4079151 0.9683902 -0.9614512
Length 0.9213978 0.9342425 0.7411032 -0.7142153
Width 0.8837887 0.7833397 0.8192538 -0.8182177
Weight 1.0000000 0.7854973 0.7783739 -0.7627800
Trunk 0.7854973 1.0000000 0.5789343 -0.5521259
Speed 0.7783739 0.5789343 1.0000000 -0.9709323
Acceleration -0.7627800 -0.5521259 -0.9709323 1.0000000
# Visualize the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 45)
High positive or negative correlations (values close to 1 or -1) indicate strong collinearity. As a threshold, a correlation coefficient above 0.7 or below -0.7 suggests substantial collinearity.
High collinearity is observed between Displacement and Horsepower (0.98), Displacement and Speed (0.97), Length and Weight (0.92), and Speed and Horsepower (0.97); Acceleration is highly negatively correlated with both Displacement (-0.97) and Speed (-0.97).
We therefore remove the highly correlated variables Displacement and Horsepower (both nearly collinear with Speed) and Length (nearly collinear with Weight).
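As a programmatic cross-check, the caret package (already loaded above) provides findCorrelation(), which suggests columns to drop given a correlation matrix and a cutoff; a minimal sketch using the 0.7 threshold from above:
# Flag columns whose pairwise correlations exceed the 0.7 threshold
findCorrelation(cor_matrix, cutoff = 0.7, names = TRUE)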
# Remove highly correlated variables
data_cleaned <- data[, !(colnames(data) %in% c("Displacement", "Horsepower", "Length"))]
# View the cleaned dataset
head(data_cleaned)
# A tibble: 6 × 7
Name Moment Width Weight Trunk Speed Acceleration
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Kia Picanto 1.1. Start 97 1595 929 127 154 15.1
2 Suzuki Splash 1.0 90 1680 1050 178 160 14.7
3 Renault Clio 1.0 105 1719 1155 288 167 13.4
4 Dacia Sandero 1.6 128 1746 1111 320 174 11.5
5 Fiat Grande Punto 1.4 140 1719 1215 288 177 11.9
6 Peugot 207 1.4 133 1748 1214 270 180 12.7
Standardization is necessary because the variables in the dataset are measured on different scales. Hierarchical clustering is sensitive to the scale of the data, so we need to standardize the variables to have a mean of 0 and a standard deviation of 1.
# Remove the 'Name' column (non-numeric) for clustering
data_numeric <- data_cleaned[, -1]
# Standardize the data
data_scaled <- scale(data_numeric)
# View the scaled data
head(data_scaled)
Moment Width Weight Trunk Speed Acceleration
[1,] -1.1722023 -2.3029338 -1.7144479 -1.6085559 -1.3045260 1.5508981
[2,] -1.2426956 -1.1127268 -1.2113399 -1.2799724 -1.1473542 1.4177738
[3,] -1.0916386 -0.5666319 -0.7747586 -0.5712628 -0.9639871 0.9851198
[4,] -0.8600178 -0.1885661 -0.9577070 -0.3650928 -0.7806200 0.3527794
[5,] -0.7391722 -0.5666319 -0.5252836 -0.5712628 -0.7020341 0.4859037
[6,] -0.8096655 -0.1605613 -0.5294415 -0.6872335 -0.6234482 0.7521523
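As a quick sanity check on the standardization, each column of data_scaled should now have (numerically) zero mean and unit standard deviation:
# Column means should be ~0 and standard deviations exactly 1
round(colMeans(data_scaled), 10)
apply(data_scaled, 2, sd)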
I use Euclidean distance as the distance measure. Euclidean distance is the most common choice for continuous numeric data because it calculates the straight-line distance between two points in multidimensional space. It is suitable for this dataset because all variables are numeric and continuous.
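To make the measure concrete, here is a minimal sketch that computes the Euclidean distance between the first two cars by hand; it should match the corresponding entry of the distance matrix computed below:
# Straight-line distance between car 1 (Kia Picanto) and car 2 (Suzuki Splash)
# in the standardized 6-dimensional space
sqrt(sum((data_scaled[1, ] - data_scaled[2, ])^2))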
# Calculate the distance matrix
dist_matrix <- dist(data_scaled, method = "euclidean")
I use Ward's linkage method. Ward's method minimizes the total within-cluster variance, which tends to produce compact and well-separated clusters. It is a good choice for this dataset because it often produces more interpretable and balanced clusters than other linkage methods such as single linkage.
# Perform hierarchical clustering using Ward's method
hc <- hclust(dist_matrix, method = "ward.D2")
I perform agglomerative hierarchical clustering, a bottom-up approach in which each observation starts in its own cluster and pairs of clusters are merged as we move up the hierarchy. This method provides a complete dendrogram, allowing us to visualize the relationships between all observations and decide on the optimal number of clusters.
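To illustrate the contrast with single linkage mentioned above, a short sketch using the dist_matrix object already defined:
# For contrast: single linkage merges on nearest neighbours and tends to
# "chain" observations together rather than form compact groups
hc_single <- hclust(dist_matrix, method = "single")
plot(hc_single, main = "Dendrogram with Single Linkage", xlab = "", sub = "", cex = 0.9)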
# Plot the dendrogram
plot(hc, main = "Dendrogram of Hierarchical Clustering", xlab = "", sub = "", cex = 0.9)
The dendrogram shows the hierarchical relationships between the observations; the height of a branch indicates the distance at which two clusters are merged. Objects such as 1 and 2 (Kia Picanto and Suzuki Splash) are more similar to each other than objects such as 1 and 3 (Kia Picanto and Renault Clio). The dendrogram suggests potential clusters, but we still need to decide on the optimal number of clusters.
To decide on the optimal number of clusters, we examine the dendrogram. The height at which two clusters merge reflects their dissimilarity, so long vertical lines that are not crossed by horizontal bars mark large jumps in dissimilarity and hence natural breaks in the data. Here the two longest uncrossed vertical lines split the dendrogram into 2 distinct branches at a high level of dissimilarity, representing natural groupings. Cutting the tree at a height that yields 2 clusters balances the trade-off between too few clusters (which may oversimplify the data) and too many (which may overfit it).
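Once the number of clusters is chosen, the 2-cluster assignment can be extracted with cutree(); a minimal sketch using the hc object from above:
# Cut the dendrogram into 2 clusters and inspect the group sizes
clusters <- cutree(hc, k = 2)
table(clusters)
# Attach the cluster labels to the cars for later inspection
data_cleaned$Cluster <- clusters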
# Elbow method to determine the optimal number of clusters
wcss <- sapply(1:10, function(k) { sum(kmeans(data_scaled, k)$withinss) })
plot(1:10, wcss, type = "b", pch = 19, frame = FALSE, xlab = "Number of Clusters", ylab = "WCSS", main = "Elbow Method")
We can also compute the within-cluster sum of squares (WCSS) and use the elbow method: for each candidate number of clusters, run k-means and record the total WCSS. WCSS measures the compactness of the clusters (lower values indicate tighter clusters) and necessarily decreases as the number of clusters grows, because each cluster becomes smaller and more homogeneous. Plotting WCSS against the number of clusters, we look for an "elbow" point where the decline levels off. The elbow plot shows a sharp drop in WCSS from 1 to 2 clusters, followed by a much more gradual decline, suggesting that 2 clusters capture the majority of the variance and that additional clusters provide diminishing returns. Therefore, 2 clusters is the optimal choice, balancing simplicity and explanatory power.
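One caveat: kmeans() uses random starting centroids, so the WCSS values, and with them the exact shape of the elbow, can vary between runs. A more reproducible variant of the loop above fixes the seed and uses several restarts per k:
# Fix the seed and use 25 restarts per k so the elbow plot is reproducible
set.seed(123)
wcss <- sapply(1:10, function(k) {
  kmeans(data_scaled, centers = k, nstart = 25)$tot.withinss
})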