# packages used throughout: readxl (read_excel), stargazer (summary tables),
# dplyr (select), factoextra (get_dist, fviz_cluster)
library(readxl)
library(stargazer)
library(dplyr)
library(factoextra)
# load the data and convert it to a data frame
df <- data.frame(read_excel("Theltegos_data.xlsx"))
# view the structure of the data set
str(df)
## 'data.frame': 15 obs. of 10 variables:
## $ Name : chr "Kia Picanto 1.1. Start" "Suzuki Splash 1.0" "Renault Clio 1.0" "Dacia Sandero 1.6" ...
## $ Displacement: num 1086 996 1149 1598 1598 ...
## $ Moment : num 97 90 105 128 140 133 125 340 353 270 ...
## $ Horsepower : num 65 65 75 87 88 88 95 295 301 136 ...
## $ Length : num 3535 3715 3986 4020 3986 ...
## $ Width : num 1595 1680 1719 1746 1719 ...
## $ Weight : num 929 1050 1155 1111 1215 ...
## $ Trunk : num 127 178 288 320 288 270 275 410 235 485 ...
## $ Speed : num 154 160 167 174 177 180 178 275 250 208 ...
## $ Acceleration: num 15.1 14.7 13.4 11.5 11.9 12.7 11.4 5.4 5.8 10.8 ...
# summary statistics
stargazer(df[2:10], type = "text")
##
## ===============================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------
## Displacement 15 2,000.600 814.260 996 3,498
## Moment 15 213.400 99.300 90 353
## Horsepower 15 146.733 80.504 65 301
## Length 15 4,300.533 440.233 3,535 4,916
## Width 15 1,759.467 71.416 1,595 1,855
## Weight 15 1,341.333 240.505 929 1,660
## Trunk 15 376.667 155.212 127 588
## Speed 15 203.800 38.175 154 275
## Acceleration 15 10.440 3.005 5.400 15.100
## -----------------------------------------------
# scatter plot matrix
pairs(df[2:10])
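The pairs() matrix is quick but spartan; for a richer view (densities on the diagonal, correlation coefficients in the upper panels), the GGally package offers a drop-in alternative. A minimal sketch, assuming GGally is installed:
# scatter plot matrix with correlations and density curves
library(GGally)
ggpairs(df[2:10])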
# Select the numeric columns for correlation matrix
selected_columns <- c("Displacement", "Moment", "Horsepower", "Length", "Width", "Weight", "Trunk", "Speed", "Acceleration")
# Create correlation matrix
cor_matrix <- as.data.frame(cor(df[, selected_columns], use = "complete.obs", method = 'pearson'))
# Print correlation matrix
print(cor_matrix)
## Displacement Moment Horsepower Length Width
## Displacement 1.0000000 0.8752790 0.9833009 0.6567364 0.7643470
## Moment 0.8752790 1.0000000 0.8466490 0.7665564 0.7657827
## Horsepower 0.9833009 0.8466490 1.0000000 0.6082133 0.7319137
## Length 0.6567364 0.7665564 0.6082133 1.0000000 0.9116335
## Width 0.7643470 0.7657827 0.7319137 0.9116335 1.0000000
## Weight 0.7678926 0.8618937 0.7143511 0.9213978 0.8837887
## Trunk 0.4698137 0.6908975 0.4079151 0.9342425 0.7833397
## Speed 0.9669150 0.8592533 0.9683902 0.7411032 0.8192538
## Acceleration -0.9685791 -0.8609295 -0.9614512 -0.7142153 -0.8182177
## Weight Trunk Speed Acceleration
## Displacement 0.7678926 0.4698137 0.9669150 -0.9685791
## Moment 0.8618937 0.6908975 0.8592533 -0.8609295
## Horsepower 0.7143511 0.4079151 0.9683902 -0.9614512
## Length 0.9213978 0.9342425 0.7411032 -0.7142153
## Width 0.8837887 0.7833397 0.8192538 -0.8182177
## Weight 1.0000000 0.7854973 0.7783739 -0.7627800
## Trunk 0.7854973 1.0000000 0.5789343 -0.5521259
## Speed 0.7783739 0.5789343 1.0000000 -0.9709323
## Acceleration -0.7627800 -0.5521259 -0.9709323 1.0000000
The data show a high degree of correlation between variables, which will cause estimation issues in most algorithms. Highly correlated variables are also largely redundant: many of them carry the same or very similar information.
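To make the problem concrete, we can list the variable pairs whose absolute correlation exceeds a cutoff. A minimal sketch in base R; the 0.9 threshold is an arbitrary choice for illustration:
# extract pairs with |r| > 0.9 from the upper triangle of the matrix
cor_mat <- cor(df[, selected_columns], use = "complete.obs")
high_pairs <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high_pairs[, "row"]],
  var2 = colnames(cor_mat)[high_pairs[, "col"]],
  r = round(cor_mat[high_pairs], 3)
)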
There are two potential solutions to this problem: drop the redundant variables outright, or combine them into composite features (for example, via principal components). Here we take the first route.
It is also important to note that we will have to scale the data, because the variables have wildly different standard deviations. We do not want variables measured on a larger scale to be weighted more heavily by the algorithm, which would bias the results.
Now let us think about the meaning of each variable and see which ones we can drop without losing much information.
I fed the list of variables to ChatGPT with the following prompt: “I am working with a data set that has to do with cars. I am going to give you the variable names; can you give me a short explanation of each variable name?” I then did a quick Google search to corroborate any definitions I was not sure about.
Displacement: This usually refers to the total volume of all the cylinders in an engine. It’s often measured in liters (L) or cubic inches (CID) and is an important indicator of the power and fuel efficiency of the engine.
Moment: This could refer to “torque,” which is a measure of the rotational force of the engine. It’s a crucial aspect of engine performance, especially for tasks like towing, acceleration, and climbing hills.
Horsepower: This is a unit of power measurement in the context of car engines. It’s a standard way to quantify how much work an engine can do, or how fast it can do a certain amount of work.
Length: The length of the car. It’s a fundamental dimension of the vehicle and affects things like interior space and parking.
Width: The width of the car, measured typically from the widest points on the exterior. It’s important for determining how much space the car takes up on the road and in parking spots.
Weight: The total mass of the vehicle. Weight affects a car’s handling, fuel efficiency, and overall performance.
Trunk: The capacity of the trunk or cargo space in the vehicle. It tells you how much luggage or cargo the car can carry in its rear storage area.
Speed: This could refer to the top speed of the car, often measured in miles per hour (mph) or kilometers per hour (km/h). It indicates the maximum speed the car can reach under optimal conditions.
Acceleration: How quickly the car can increase its speed from a standstill or from one speed to another. It’s often measured in seconds from 0 to 60 miles per hour (0-60 mph) or 0 to 100 kilometers per hour (0-100 km/h) and is a key indicator of performance.
The exploration of the data has revealed that some variables are largely redundant; for example, Speed and Acceleration correlate at -0.97, and both reflect engine performance already captured by Horsepower. Combining the correlation matrix with the variable definitions above, we can decide which variables to exclude from the analysis. The variables being dropped are Speed, Moment, Displacement, and Trunk.
# drop highly correlated variables
df_subset <- select(df, -Speed,-Moment, -Displacement, -Trunk)
We must normalize the data because of the differences in scale; this ensures the clustering algorithm performs properly.
# exclude the non-numeric Name column from the subset before standardization
df_num <- select(df_subset, -Name)
# Z-score normalization function for a vector
z_score_normalize <- function(x) {
(x - mean(x)) / sd(x)
}
# Apply z-score normalization to a data frame
df_z_norm <- as.data.frame(lapply(df_num, z_score_normalize))
# bring back name column
df_z_norm$Name <- df$Name
# View the normalized data
head(df_z_norm)
## Horsepower Length Width Weight Acceleration
## 1 -1.0152681 -1.7389282 -2.3029338 -1.7144479 1.5508981
## 2 -1.0152681 -1.3300536 -1.1127268 -1.2113399 1.4177738
## 3 -0.8910510 -0.7144703 -0.5666319 -0.7747586 0.9851198
## 4 -0.7419904 -0.6372385 -0.1885661 -0.9577070 0.3527794
## 5 -0.7295687 -0.7144703 -0.5666319 -0.5252836 0.4859037
## 6 -0.7295687 -0.6145232 -0.1605613 -0.5294415 0.7521523
## Name
## 1 Kia Picanto 1.1. Start
## 2 Suzuki Splash 1.0
## 3 Renault Clio 1.0
## 4 Dacia Sandero 1.6
## 5 Fiat Grande Punto 1.4
## 6 Peugot 207 1.4
# summary statistics to verify normalization
stargazer(df_z_norm[2:6], type = "text", title = "After Z-Score normalization")
##
## After Z-Score normalization
## ============================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------
## Length 15 -0.000 1.000 -1.739 1.398
## Width 15 -0.000 1.000 -2.303 1.338
## Weight 15 0.000 1.000 -1.714 1.325
## Acceleration 15 0.000 1.000 -1.677 1.551
## --------------------------------------------
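For reference, base R's scale() performs the same z-score transformation; a quick equivalence check (sketch):
# scale() centers each column and divides by its sample standard deviation
df_z_alt <- as.data.frame(scale(df_num))
all.equal(df_z_alt, df_z_norm[, names(df_num)], check.attributes = FALSE)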
# compute Euclidean distances on the numeric columns only; passing the
# character Name column would trigger an "NAs introduced by coercion" warning
df_z_dist <- get_dist(df_z_norm[, 1:5], method = "euclidean")
df_z_dist
## 1 2 3 4 5 6 7
## 2 1.4918220
## 3 2.5178107 1.1333411
## 4 3.0529521 1.7688265 0.8516634
## 5 2.8342633 1.5866349 0.6364254 0.6515504
## 6 3.0941052 1.6999757 0.6152206 0.6428355 0.5430893
## 7 2.5249160 1.5459277 0.9310962 0.9191106 0.6542106 1.1129506
## 8 6.3096895 5.3764931 4.5321815 3.9235408 4.0293999 4.0750566 4.0889803
## 9 6.8252732 5.8192636 4.9262425 4.4297556 4.4070064 4.4205184 4.5895771
## 10 5.1994221 4.0057463 2.9490545 2.7848038 2.5790067 2.4836548 2.9222096
## 11 5.8736650 4.5963402 3.4899386 3.2158240 3.1650513 2.9695775 3.5328224
## 12 4.8788340 3.7004090 2.6131839 2.2377111 2.1847923 2.1314840 2.4299724
## 13 6.8657169 5.7314076 4.6778436 4.2522961 4.2245349 4.1546271 4.4552004
## 14 6.5203519 5.2546422 4.1430791 3.7040947 3.7899367 3.5981640 4.0905149
## 15 6.7376555 5.5448227 4.4689152 3.9785268 4.0437429 3.9239613 4.2914465
## 8 9 10 11 12 13 14
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9 1.2611945
## 10 3.2617904 3.0538891
## 11 3.2420687 3.0083297 0.8962149
## 12 2.7617508 2.9048227 0.9162256 1.2475238
## 13 2.2545920 1.7479280 2.0993318 1.7372228 2.1426457
## 14 2.7657981 2.6243365 1.8060373 1.0542286 1.7374878 1.2409130
## 15 2.1501752 1.9166628 2.1187037 1.5979644 1.9841882 0.6510805 0.7991366
# use normalized distances for clustering
# Single linkage is used
hclust_z <- hclust(df_z_dist, method = "single")
# plot the dendrogram
plot(hclust_z, hang = -3, main = "Z-score Normalized Cluster Dendrogram", labels = df$Name, cex = 0.5)
# cut the dendrogram so that we have 3 clusters
cut_z <- cutree(hclust_z, k = 3)
df_z_norm$hclust <- cut_z
# re-plot the dendrogram, this time boxing the 3 clusters
plot(hclust_z, labels = df$Name, hang=-3, cex=.5)
rect.hclust(hclust_z , k = 3, border = 2:6)
abline(h = 3, col = 'red')
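Single linkage is prone to chaining, so it is worth checking how other linkage methods behave. One quick comparison is the agglomerative coefficient (values closer to 1 indicate stronger clustering structure); a sketch, assuming the cluster package is installed:
library(cluster)
# compare agglomerative coefficients across linkage methods
sapply(c("single", "complete", "average", "ward"), function(m) {
  agnes(df_z_dist, method = m)$ac
})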
# Set the number of clusters (k)
k <- 3
# Run k-means clustering; note that kmeans() is applied here to the rows of
# the distance matrix (as features) rather than the raw normalized columns,
# and results vary run to run unless a seed is set beforehand
kmeans_result <- kmeans(df_z_dist, centers = k)
# View the clustering results
print(kmeans_result)
## K-means clustering with 3 clusters of sizes 5, 7, 3
##
## Cluster means:
## 1 2 3 4 5 6 7 8
## 1 6.651737 5.545326 4.5496524 4.057643 4.0989242 4.034465 4.303144 1.686352
## 2 2.216553 1.318075 0.9550796 1.126705 0.9865963 1.101168 1.098316 4.619335
## 3 5.317307 4.100832 3.0173923 2.746113 2.6429501 2.528239 2.961668 3.088537
## 9 10 11 12 13 14 15
## 1 1.510024 2.4679505 2.1279628 2.3061790 1.178903 1.486037 1.103411
## 2 5.059662 3.2748425 3.8347456 2.8823410 4.908804 4.442969 4.712724
## 3 2.989014 0.6041468 0.7145796 0.7212498 1.993067 1.532585 1.900285
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2 2 2 2 2 2 2 1 1 3 3 3 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 28.798496 83.127892 6.227211
## (between_SS / total_SS = 80.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Compute the total within-cluster sum of squares for k = 1..10
# (a separate loop variable keeps k <- 3 from above intact)
wss_values <- numeric(10)
for (k_try in 1:10) {
  kmeans_temp <- kmeans(df_z_dist, centers = k_try)
  wss_values[k_try] <- kmeans_temp$tot.withinss
}
# Plot the scree plot
plot(1:10, wss_values, type = "b", pch = 19, frame = FALSE,
xlab = "Number of Clusters (k)", ylab = "Total Within Sum of Squares",
main = "Scree Plot for K-Means Clustering")
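factoextra wraps the same elbow computation in a single call; a sketch using the normalized features directly rather than the distance matrix:
# elbow plot over k = 1..10 on the five normalized columns
fviz_nbclust(df_z_norm[, 1:5], kmeans, method = "wss", k.max = 10)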
# Visualize the clustering results with factoextra
fviz_cluster(kmeans_result, data = df_z_norm[, 1:5], geom = "point", stand = FALSE, main = "K-means Cluster Plot")
# Add cluster assignments as a new column in the data frame
df_z_norm$kclust <- kmeans_result$cluster
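A quick sanity check on k = 3 is the average silhouette width, which measures how well each car sits inside its assigned cluster. A sketch using the cluster package and the distance matrix computed earlier:
library(cluster)
# silhouette widths near 1 indicate well-separated assignments
sil <- silhouette(kmeans_result$cluster, df_z_dist)
mean(sil[, "sil_width"])
plot(sil, main = "Silhouette plot, k = 3")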
# Fit ANOVAs for each response against the k-means assignment (kclust enters
# as a numeric 1-Df term, so this is a per-variable trend test)
aov_kclust <- aov(cbind(Horsepower, Width, Length, Acceleration, Weight) ~ kclust, data = df_z_norm)
# Print the per-response ANOVA summaries
summary(aov_kclust)
## Response Horsepower :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 5.2413 5.2413 7.7794 0.01534 *
## Residuals 13 8.7587 0.6737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 1.728 1.728 1.8305 0.1991
## Residuals 13 12.272 0.944
##
## Response Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 0.3487 0.34871 0.3321 0.5743
## Residuals 13 13.6513 1.05010
##
## Response Acceleration :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 4.1296 4.1296 5.4389 0.03642 *
## Residuals 13 9.8704 0.7593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Weight :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 0.3166 0.31661 0.3008 0.5927
## Residuals 13 13.6834 1.05257
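Because kclust entered the model as a numeric covariate, a stricter multivariate test would code the clusters as a factor (2 Df for 3 clusters) and use manova(). A sketch:
# multivariate test with categorical cluster labels
man_k <- manova(cbind(Horsepower, Width, Length, Acceleration, Weight) ~ factor(kclust), data = df_z_norm)
summary(man_k)      # multivariate test (Pillai's trace by default)
summary.aov(man_k)  # univariate follow-ups per response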
# Repeat the per-variable ANOVAs for the hierarchical clustering assignment
aov_hclust <- aov(cbind(Horsepower, Width, Length, Acceleration, Weight) ~ hclust, data = df_z_norm)
# Print the per-response ANOVA summaries
summary(aov_hclust)
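To see where the two clusterings agree, a quick cross-tabulation of the assignments (sketch):
# rows: hierarchical clusters, columns: k-means clusters
table(hclust = df_z_norm$hclust, kmeans = df_z_norm$kclust)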
# pair each car name with its k-means cluster label
clusters <- cbind(df_z_norm$Name, df_z_norm$kclust)
print(clusters)
## [,1] [,2]
## [1,] "Kia Picanto 1.1. Start" "2"
## [2,] "Suzuki Splash 1.0" "2"
## [3,] "Renault Clio 1.0" "2"
## [4,] "Dacia Sandero 1.6" "2"
## [5,] "Fiat Grande Punto 1.4" "2"
## [6,] "Peugot 207 1.4" "2"
## [7,] "Renault Clio 1.6" "2"
## [8,] "Porsche Cayman" "1"
## [9,] "Nissan 350Z" "1"
## [10,] "Mercedes c200 CDI" "3"
## [11,] "VW Passat Variant 2.0" "3"
## [12,] "Skoda Octavia 2.0" "3"
## [13,] "Mercedes E280" "1"
## [14,] "Audi A6 2.4" "1"
## [15,] "BMW 525i" "1"
# attach the k-means labels to the original data and split by cluster
df$kclust <- df_z_norm$kclust
group_1 <- df[df$kclust == 1,]
group_2 <- df[df$kclust == 2,]
group_3 <- df[df$kclust == 3,]
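The stargazer tables below break down each group; the same per-cluster means can also be computed in one step with dplyr (sketch):
# mean of every numeric attribute by k-means cluster
df %>%
  group_by(kclust) %>%
  summarise(across(where(is.numeric), mean), .groups = "drop")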
stargazer(group_1, type="text", title = "Sports Cars (Group 1)")
##
## Sports Cars (Group 1)
## =============================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------
## Displacement 5 2,954.000 501.970 2,393 3,498
## Moment 5 294.600 53.998 230 353
## Horsepower 5 244.400 52.875 177 301
## Length 5 4,653.000 298.204 4,315 4,916
## Width 5 1,827.800 22.287 1,801 1,855
## Weight 5 1,537.000 122.045 1,340 1,660
## Trunk 5 450.200 132.326 235 546
## Speed 5 250.200 15.897 231 275
## Acceleration 5 6.980 1.410 5.400 8.900
## kclust 5 1.000 0.000 1 1
## ---------------------------------------------
stargazer(group_2, type="text", title = "Compact Cars (Group 2)")
##
## Compact Cars (Group 2)
## ===============================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------
## Displacement 7 1,307.857 240.282 996 1,598
## Moment 7 116.857 19.334 90 140
## Horsepower 7 80.429 12.081 65 95
## Length 7 3,900.286 195.977 3,535 4,030
## Width 7 1,699.143 52.806 1,595 1,748
## Weight 7 1,115.571 100.528 929 1,215
## Trunk 7 249.429 69.670 127 320
## Speed 7 170.000 9.950 154 180
## Acceleration 7 12.957 1.503 11.400 15.100
## kclust 7 2.000 0.000 2 2
## -----------------------------------------------
stargazer(group_3, type="text", title = "Luxury Cars (Group 3)")
##
## Luxury Cars (Group 3)
## ==============================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------------
## Displacement 3 2,028.000 103.923 1,968 2,148
## Moment 3 303.333 28.868 270 320
## Horsepower 3 138.667 2.309 136 140
## Length 3 4,647.000 110.585 4,572 4,774
## Width 3 1,786.333 29.160 1,769 1,820
## Weight 3 1,542.000 101.425 1,425 1,605
## Trunk 3 551.000 57.297 485 588
## Speed 3 205.333 3.786 201 208
## Acceleration 3 10.333 0.569 9.700 10.800
## kclust 3 3.000 0.000 3 3
## ----------------------------------------------
Above is the final list of clusters (column [,2]), based on the k-means assignment. I think the k-means clusters are the better choice because they make more intuitive sense, so we will focus on them in the analysis. The k-means algorithm has effectively grouped the cars into three categories based on the included variables. The ANOVA tables show that Horsepower and Acceleration differ significantly across cluster means. This result makes intuitive sense given the cluster labels of sports, compact, and luxury cars: these groups are differentiated primarily by engine attributes rather than by size alone. For example, some sports cars are coupes (two-door) and are similar in size to compact cars.
The first cluster, identified as Sports Cars, includes models such as the Porsche Cayman, Audi A6 2.4, and BMW 525i. These cars are characterized by powerful performance, with a mean horsepower of 244.4 and a mean acceleration time of 6.98 seconds. They also have the largest engines, with displacement averaging 2,954.0 cc. Known for their dynamic handling and powerful engines, sports cars are ideal for enthusiasts who prioritize performance.
The second cluster, labeled Compact Cars, consists of models like the Kia Picanto, Suzuki Splash, Renault Clio, and Fiat Grande Punto. These cars are characterized by their compact size, efficient engines, and practicality for urban commuting. With smaller displacements averaging 1,307.86 cc, they offer better fuel efficiency. Compact cars provide good value for money and decent comfort, making them popular for daily use.
The third cluster is denoted Luxury Cars and includes models such as the Mercedes C200 CDI and VW Passat Variant. Luxury cars offer upscale interiors and advanced safety features, representing the pinnacle of comfort and technology. They provide generous interior space, with larger dimensions averaging 4,647 mm in length and 1,786.33 mm in width. Luxury cars excel at catering to buyers seeking comfort, prestige, and the latest automotive innovations.
Understanding these clusters can assist buyers in making informed decisions based on their lifestyle, preferences, and priorities when choosing a vehicle. Each cluster represents a different segment of the automotive market. The k-means clusters provide valuable insights for selecting the ideal car to suit individual needs and preferences.