Homework

Author

Betty Jiahui Wang

Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Running Code

When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:

1 + 1
[1] 2

You can add options to executable code like this

[1] 4

The echo: false option disables the printing of code (only output is displayed).

# Load the relevant package
library(readxl)
library(rmarkdown)
library(corrplot)
corrplot 0.95 loaded
library(caret)
Loading required package: ggplot2
Loading required package: lattice
# Load an Excel file
data <- read_excel("~/Desktop/Theltegos.xlsx")

# View the first few rows
head(data)
# A tibble: 6 × 10
  Name            Displacement Moment Horsepower Length Width Weight Trunk Speed
  <chr>                  <dbl>  <dbl>      <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
1 Kia Picanto 1.…         1086     97         65   3535  1595    929   127   154
2 Suzuki Splash …          996     90         65   3715  1680   1050   178   160
3 Renault Clio 1…         1149    105         75   3986  1719   1155   288   167
4 Dacia Sandero …         1598    128         87   4020  1746   1111   320   174
5 Fiat Grande Pu…         1598    140         88   3986  1719   1215   288   177
6 Peugot 207 1.4          1360    133         88   4030  1748   1214   270   180
# ℹ 1 more variable: Acceleration <dbl>
# View the structure of the dataset
str(data)
tibble [15 × 10] (S3: tbl_df/tbl/data.frame)
 $ Name        : chr [1:15] "Kia Picanto 1.1. Start" "Suzuki Splash 1.0" "Renault Clio 1.0" "Dacia Sandero 1.6" ...
 $ Displacement: num [1:15] 1086 996 1149 1598 1598 ...
 $ Moment      : num [1:15] 97 90 105 128 140 133 125 340 353 270 ...
 $ Horsepower  : num [1:15] 65 65 75 87 88 88 95 295 301 136 ...
 $ Length      : num [1:15] 3535 3715 3986 4020 3986 ...
 $ Width       : num [1:15] 1595 1680 1719 1746 1719 ...
 $ Weight      : num [1:15] 929 1050 1155 1111 1215 ...
 $ Trunk       : num [1:15] 127 178 288 320 288 270 275 410 235 485 ...
 $ Speed       : num [1:15] 154 160 167 174 177 180 178 275 250 208 ...
 $ Acceleration: num [1:15] 15.1 14.7 13.4 11.5 11.9 12.7 11.4 5.4 5.8 10.8 ...
# Exclude the 'Name' column (non-numeric) and calculate the correlation matrix
cor_matrix <- cor(data[, -1], use = "complete.obs")  # Exclude missing values

# Print the correlation matrix
print(cor_matrix)
             Displacement     Moment Horsepower     Length      Width
Displacement    1.0000000  0.8752790  0.9833009  0.6567364  0.7643470
Moment          0.8752790  1.0000000  0.8466490  0.7665564  0.7657827
Horsepower      0.9833009  0.8466490  1.0000000  0.6082133  0.7319137
Length          0.6567364  0.7665564  0.6082133  1.0000000  0.9116335
Width           0.7643470  0.7657827  0.7319137  0.9116335  1.0000000
Weight          0.7678926  0.8618937  0.7143511  0.9213978  0.8837887
Trunk           0.4698137  0.6908975  0.4079151  0.9342425  0.7833397
Speed           0.9669150  0.8592533  0.9683902  0.7411032  0.8192538
Acceleration   -0.9685791 -0.8609295 -0.9614512 -0.7142153 -0.8182177
                 Weight      Trunk      Speed Acceleration
Displacement  0.7678926  0.4698137  0.9669150   -0.9685791
Moment        0.8618937  0.6908975  0.8592533   -0.8609295
Horsepower    0.7143511  0.4079151  0.9683902   -0.9614512
Length        0.9213978  0.9342425  0.7411032   -0.7142153
Width         0.8837887  0.7833397  0.8192538   -0.8182177
Weight        1.0000000  0.7854973  0.7783739   -0.7627800
Trunk         0.7854973  1.0000000  0.5789343   -0.5521259
Speed         0.7783739  0.5789343  1.0000000   -0.9709323
Acceleration -0.7627800 -0.5521259 -0.9709323    1.0000000
# Visualize the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 45)

High Positive or Negative Correlations: values close to 1 or -1 indicate strong collinearity. Threshold for Collinearity: a correlation coefficient above 0.7 or below -0.7 suggests substantial collinearity.

High Collinearity Observed: Displacement and Horsepower (0.98); Displacement and Speed (0.97); Length and Weight (0.92); Speed and Horsepower (0.97); Acceleration is highly negatively correlated with Displacement (-0.97) and Speed (-0.97).

Remove the highly correlated variables: between Displacement and Horsepower; between Length and Weight.

# Remove highly correlated variables
data_cleaned <- data[, !(colnames(data) %in% c("Displacement", "Horsepower", "Length"))]

# View the cleaned dataset
head(data_cleaned)
# A tibble: 6 × 7
  Name                   Moment Width Weight Trunk Speed Acceleration
  <chr>                   <dbl> <dbl>  <dbl> <dbl> <dbl>        <dbl>
1 Kia Picanto 1.1. Start     97  1595    929   127   154         15.1
2 Suzuki Splash 1.0          90  1680   1050   178   160         14.7
3 Renault Clio 1.0          105  1719   1155   288   167         13.4
4 Dacia Sandero 1.6         128  1746   1111   320   174         11.5
5 Fiat Grande Punto 1.4     140  1719   1215   288   177         11.9
6 Peugot 207 1.4            133  1748   1214   270   180         12.7

Standardization is necessary because the variables in the dataset are measured on different scales. Hierarchical clustering is sensitive to the scale of the data, so we need to standardize the variables to have a mean of 0 and a standard deviation of 1.

# Remove the 'Name' column (non-numeric) for clustering
data_numeric <- data_cleaned[, -1]

# Standardize the data
data_scaled <- scale(data_numeric)

# View the scaled data
head(data_scaled)
         Moment      Width     Weight      Trunk      Speed Acceleration
[1,] -1.1722023 -2.3029338 -1.7144479 -1.6085559 -1.3045260    1.5508981
[2,] -1.2426956 -1.1127268 -1.2113399 -1.2799724 -1.1473542    1.4177738
[3,] -1.0916386 -0.5666319 -0.7747586 -0.5712628 -0.9639871    0.9851198
[4,] -0.8600178 -0.1885661 -0.9577070 -0.3650928 -0.7806200    0.3527794
[5,] -0.7391722 -0.5666319 -0.5252836 -0.5712628 -0.7020341    0.4859037
[6,] -0.8096655 -0.1605613 -0.5294415 -0.6872335 -0.6234482    0.7521523

I use Euclidean distance as the distance measure. Euclidean distance is the most common choice for continuous numeric data because it calculates the straight-line distance between two points in multidimensional space. It is suitable for this dataset because all variables are numeric and continuous.

# Calculate the distance matrix
dist_matrix <- dist(data_scaled, method = "euclidean")

I use Ward’s linkage method. Ward’s method minimizes the total within-cluster variance, which tends to produce compact and well-separated clusters. It is a good choice for this dataset because it often produces more interpretable and balanced clusters compared to other linkage methods such as single linkage.

# Perform hierarchical clustering using Ward's method
hc <- hclust(dist_matrix, method = "ward.D2")

I perform agglomerative hierarchical clustering, which is a bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as we move up the hierarchy. This method is chosen because it provides a complete dendrogram, allowing us to visualize the relationships between all observations and decide on the optimal number of clusters.

# Plot the dendrogram
plot(hc, main = "Dendrogram of Hierarchical Clustering", xlab = "", sub = "", cex = 0.9)

Objects such as 1 and 2, Kia Picanto and Suzuki Splash, are more similar than objects such as 1 and 7, Kia Picanto and Renault Clio. The dendrogram shows the hierarchical relationships between the observations. The height of the branches indicates the distance at which clusters are merged. The dendrogram suggests potential clusters, but we need to decide on the optimal number of clusters.

To decide the optimal number of clusters, we examine the dendrogram. There are 2 longest vertical lines without horizontal cross-bars which indicate 2 natural breaks in the data. The height at which two clusters merge indicates the distance (dissimilarity) between them. Longer vertical lines without horizontal cross-bars suggest natural breaks in the data. Also, look for the point where the vertical lines are the longest. This indicates a significant jump in dissimilarity, suggesting a natural separation into clusters. The dendrogram likely shows 2 distinct branches at a higher level of dissimilarity. These branches represent natural groupings in the data. Cutting the dendrogram at a height that results in 2 clusters balances the trade-off between having too few clusters (which may oversimplify the data) and too many clusters (which may overfit the data).

# Elbow method to determine the optimal number of clusters
wcss <- sapply(1:10, function(k) { sum(kmeans(data_scaled, k)$withinss) })
plot(1:10, wcss, type = "b", pch = 19, frame = FALSE, xlab = "Number of Clusters", ylab = "WCSS", main = "Elbow Method")

We can also calculate Within-Cluster Sum of Squares (WCSS) and use the elbow method. For each possible number of clusters, compute the total WCSS. WCSS measures the compactness of the clusters, that is, lower values indicate tighter clusters. As the number of clusters increases, WCSS decreases because each cluster becomes smaller and more homogeneous. Plot the within-cluster sum of squares (WCSS) against the number of clusters and look for an “elbow” point. The elbow plot likely shows a sharp decline in WCSS from 1 to 2 clusters, followed by a more gradual decline after 2 clusters. This suggests that 2 clusters capture the majority of the variance in the data, and adding more clusters provides diminishing returns. Therefore, 3 clusters is the optimal choice providing a balance between simplicity and explanatory power.