# packages used throughout: readxl (read_excel), stargazer (summary tables),
# dplyr (select), factoextra (get_dist, fviz_cluster)
library(readxl)
library(stargazer)
library(dplyr)
library(factoextra)
# load the data and convert it to a data frame
df <- data.frame(read_excel("Theltegos_data.xlsx"))
# view the structure of the data set
str(df)
## 'data.frame': 15 obs. of 10 variables:
## $ Name : chr "Kia Picanto 1.1. Start" "Suzuki Splash 1.0" "Renault Clio 1.0" "Dacia Sandero 1.6" ...
## $ Displacement: num 1086 996 1149 1598 1598 ...
## $ Moment : num 97 90 105 128 140 133 125 340 353 270 ...
## $ Horsepower : num 65 65 75 87 88 88 95 295 301 136 ...
## $ Length : num 3535 3715 3986 4020 3986 ...
## $ Width : num 1595 1680 1719 1746 1719 ...
## $ Weight : num 929 1050 1155 1111 1215 ...
## $ Trunk : num 127 178 288 320 288 270 275 410 235 485 ...
## $ Speed : num 154 160 167 174 177 180 178 275 250 208 ...
## $ Acceleration: num 15.1 14.7 13.4 11.5 11.9 12.7 11.4 5.4 5.8 10.8 ...
# summary statistics
stargazer(df[2:10], type = "text")
##
## ===============================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------
## Displacement 15 2,000.600 814.260 996 3,498
## Moment 15 213.400 99.300 90 353
## Horsepower 15 146.733 80.504 65 301
## Length 15 4,300.533 440.233 3,535 4,916
## Width 15 1,759.467 71.416 1,595 1,855
## Weight 15 1,341.333 240.505 929 1,660
## Trunk 15 376.667 155.212 127 588
## Speed 15 203.800 38.175 154 275
## Acceleration 15 10.440 3.005 5.400 15.100
## -----------------------------------------------
# scatter plot matrix
pairs(df[2:10])
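The pairs() matrix is quick but spartan; for a richer view (densities on the diagonal, correlation coefficients in the upper panels), the GGally package offers a drop-in alternative. A minimal sketch, assuming GGally is installed:
# scatter plot matrix with correlations and density curves
library(GGally)
ggpairs(df[2:10])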
# Select the numeric columns for correlation matrix
selected_columns <- c("Displacement", "Moment", "Horsepower", "Length", "Width", "Weight", "Trunk", "Speed", "Acceleration")
# Create correlation matrix
cor_matrix <- as.data.frame(cor(df[, selected_columns], use = "complete.obs", method = 'pearson'))
# Print correlation matrix
print(cor_matrix)
## Displacement Moment Horsepower Length Width
## Displacement 1.0000000 0.8752790 0.9833009 0.6567364 0.7643470
## Moment 0.8752790 1.0000000 0.8466490 0.7665564 0.7657827
## Horsepower 0.9833009 0.8466490 1.0000000 0.6082133 0.7319137
## Length 0.6567364 0.7665564 0.6082133 1.0000000 0.9116335
## Width 0.7643470 0.7657827 0.7319137 0.9116335 1.0000000
## Weight 0.7678926 0.8618937 0.7143511 0.9213978 0.8837887
## Trunk 0.4698137 0.6908975 0.4079151 0.9342425 0.7833397
## Speed 0.9669150 0.8592533 0.9683902 0.7411032 0.8192538
## Acceleration -0.9685791 -0.8609295 -0.9614512 -0.7142153 -0.8182177
## Weight Trunk Speed Acceleration
## Displacement 0.7678926 0.4698137 0.9669150 -0.9685791
## Moment 0.8618937 0.6908975 0.8592533 -0.8609295
## Horsepower 0.7143511 0.4079151 0.9683902 -0.9614512
## Length 0.9213978 0.9342425 0.7411032 -0.7142153
## Width 0.8837887 0.7833397 0.8192538 -0.8182177
## Weight 1.0000000 0.7854973 0.7783739 -0.7627800
## Trunk 0.7854973 1.0000000 0.5789343 -0.5521259
## Speed 0.7783739 0.5789343 1.0000000 -0.9709323
## Acceleration -0.7627800 -0.5521259 -0.9709323 1.0000000
The data show a high degree of correlation between variables, which will cause estimation issues in most algorithms. Highly correlated variables are also largely redundant: many of them carry the same or very similar information.
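To make the problem concrete, we can list the variable pairs whose absolute correlation exceeds a cutoff. A minimal sketch in base R; the 0.9 threshold is an arbitrary choice for illustration:
# extract pairs with |r| > 0.9 from the upper triangle of the matrix
cor_mat <- cor(df[, selected_columns], use = "complete.obs")
high_pairs <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high_pairs[, "row"]],
  var2 = colnames(cor_mat)[high_pairs[, "col"]],
  r = round(cor_mat[high_pairs], 3)
)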
There are two potential solutions to this problem: drop the redundant variables outright, or combine them into composite features (for example, via principal components). Here we take the first route.
It is also important to note that we will have to scale the data, because the variables have wildly different standard deviations. We do not want variables measured on a larger scale to be weighted more heavily by the algorithm, which would bias the results.
Now let us think about the meaning of each variable and see which ones we can drop without losing much information.
I fed the list of variables to ChatGPT with the following prompt: “I am working with a data set that has to do with cars. I am going to give you the variable names; can you give me a short explanation of each variable name?” I then did a quick Google search to corroborate any definitions I was not sure about.
Displacement: This usually refers to the total volume of all the cylinders in an engine. It’s often measured in liters (L) or cubic inches (CID) and is an important indicator of the power and fuel efficiency of the engine.
Moment: This could refer to “torque,” which is a measure of the rotational force of the engine. It’s a crucial aspect of engine performance, especially for tasks like towing, acceleration, and climbing hills.
Horsepower: This is a unit of power measurement in the context of car engines. It’s a standard way to quantify how much work an engine can do, or how fast it can do a certain amount of work.
Length: The length of the car. It’s a fundamental dimension of the vehicle and affects things like interior space and parking.
Width: The width of the car, measured typically from the widest points on the exterior. It’s important for determining how much space the car takes up on the road and in parking spots.
Weight: The total mass of the vehicle. Weight affects a car’s handling, fuel efficiency, and overall performance.
Trunk: The capacity of the trunk or cargo space in the vehicle. It tells you how much luggage or cargo the car can carry in its rear storage area.
Speed: This could refer to the top speed of the car, often measured in miles per hour (mph) or kilometers per hour (km/h). It indicates the maximum speed the car can reach under optimal conditions.
Acceleration: How quickly the car can increase its speed from a standstill or from one speed to another. It’s often measured in seconds from 0 to 60 miles per hour (0-60 mph) or 0 to 100 kilometers per hour (0-100 km/h) and is a key indicator of performance.
The exploration of the data has revealed that some variables are largely redundant; for example, Speed and Acceleration correlate at -0.97, and both reflect engine performance already captured by Horsepower. Combining the correlation matrix with the variable definitions above, we can decide which variables to exclude from the analysis. The variables being dropped are Speed, Moment, Displacement, and Trunk.
# drop highly correlated variables
df_subset <- select(df, -Speed,-Moment, -Displacement, -Trunk)
We must normalize the data because of the differences in scale; this ensures the clustering algorithm performs properly.
# exclude the non-numeric Name column from the subset before standardization
df_num <- select(df_subset, -Name)
# Z-score normalization function for a vector
z_score_normalize <- function(x) {
(x - mean(x)) / sd(x)
}
# Apply z-score normalization to a data frame
df_z_norm <- as.data.frame(lapply(df_num, z_score_normalize))
# bring back name column
df_z_norm$Name <- df$Name
# View the normalized data
head(df_z_norm)
## Horsepower Length Width Weight Acceleration
## 1 -1.0152681 -1.7389282 -2.3029338 -1.7144479 1.5508981
## 2 -1.0152681 -1.3300536 -1.1127268 -1.2113399 1.4177738
## 3 -0.8910510 -0.7144703 -0.5666319 -0.7747586 0.9851198
## 4 -0.7419904 -0.6372385 -0.1885661 -0.9577070 0.3527794
## 5 -0.7295687 -0.7144703 -0.5666319 -0.5252836 0.4859037
## 6 -0.7295687 -0.6145232 -0.1605613 -0.5294415 0.7521523
## Name
## 1 Kia Picanto 1.1. Start
## 2 Suzuki Splash 1.0
## 3 Renault Clio 1.0
## 4 Dacia Sandero 1.6
## 5 Fiat Grande Punto 1.4
## 6 Peugot 207 1.4
# summary statistics to verify normalization
stargazer(df_z_norm[2:6], type = "text", title = "After Z-Score normalization")
##
## After Z-Score normalization
## ============================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------
## Length 15 -0.000 1.000 -1.739 1.398
## Width 15 -0.000 1.000 -2.303 1.338
## Weight 15 0.000 1.000 -1.714 1.325
## Acceleration 15 0.000 1.000 -1.677 1.551
## --------------------------------------------
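For reference, base R's scale() performs the same z-score transformation; a quick equivalence check (sketch):
# scale() centers each column and divides by its sample standard deviation
df_z_alt <- as.data.frame(scale(df_num))
all.equal(df_z_alt, df_z_norm[, names(df_num)], check.attributes = FALSE)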
# compute Euclidean distances on the numeric columns only; passing the
# character Name column would trigger an "NAs introduced by coercion" warning
df_z_dist <- get_dist(df_z_norm[, 1:5], method = "euclidean")
df_z_dist
## 1 2 3 4 5 6 7
## 2 1.4918220
## 3 2.5178107 1.1333411
## 4 3.0529521 1.7688265 0.8516634
## 5 2.8342633 1.5866349 0.6364254 0.6515504
## 6 3.0941052 1.6999757 0.6152206 0.6428355 0.5430893
## 7 2.5249160 1.5459277 0.9310962 0.9191106 0.6542106 1.1129506
## 8 6.3096895 5.3764931 4.5321815 3.9235408 4.0293999 4.0750566 4.0889803
## 9 6.8252732 5.8192636 4.9262425 4.4297556 4.4070064 4.4205184 4.5895771
## 10 5.1994221 4.0057463 2.9490545 2.7848038 2.5790067 2.4836548 2.9222096
## 11 5.8736650 4.5963402 3.4899386 3.2158240 3.1650513 2.9695775 3.5328224
## 12 4.8788340 3.7004090 2.6131839 2.2377111 2.1847923 2.1314840 2.4299724
## 13 6.8657169 5.7314076 4.6778436 4.2522961 4.2245349 4.1546271 4.4552004
## 14 6.5203519 5.2546422 4.1430791 3.7040947 3.7899367 3.5981640 4.0905149
## 15 6.7376555 5.5448227 4.4689152 3.9785268 4.0437429 3.9239613 4.2914465
## 8 9 10 11 12 13 14
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9 1.2611945
## 10 3.2617904 3.0538891
## 11 3.2420687 3.0083297 0.8962149
## 12 2.7617508 2.9048227 0.9162256 1.2475238
## 13 2.2545920 1.7479280 2.0993318 1.7372228 2.1426457
## 14 2.7657981 2.6243365 1.8060373 1.0542286 1.7374878 1.2409130
## 15 2.1501752 1.9166628 2.1187037 1.5979644 1.9841882 0.6510805 0.7991366
# use normalized distances for clustering
# Single linkage is used
hclust_z <- hclust(df_z_dist, method = "single")
# plot the dendrogram
plot(hclust_z, hang = -3, main = "Z-score Normalized Cluster Dendrogram", labels = df$Name, cex = 0.5)
# cut the dendrogram so that we have 3 clusters
cut_z <- cutree(hclust_z, k = 3)
df_z_norm$hclust <- cut_z
# re-plot the dendrogram, this time boxing the 3 clusters
plot(hclust_z, labels = df$Name, hang=-3, cex=.5)
rect.hclust(hclust_z , k = 3, border = 2:6)
abline(h = 3, col = 'red')
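Single linkage is prone to chaining, so it is worth checking how other linkage methods behave. One quick comparison is the agglomerative coefficient (values closer to 1 indicate stronger clustering structure); a sketch, assuming the cluster package is installed:
library(cluster)
# compare agglomerative coefficients across linkage methods
sapply(c("single", "complete", "average", "ward"), function(m) {
  agnes(df_z_dist, method = m)$ac
})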
# Set the number of clusters (k)
k <- 3
# Run k-means clustering; note that kmeans() is applied here to the rows of
# the distance matrix (as features) rather than the raw normalized columns,
# and results vary run to run unless a seed is set beforehand
kmeans_result <- kmeans(df_z_dist, centers = k)
# View the clustering results
print(kmeans_result)
## K-means clustering with 3 clusters of sizes 5, 7, 3
##
## Cluster means:
## 1 2 3 4 5 6 7 8
## 1 6.651737 5.545326 4.5496524 4.057643 4.0989242 4.034465 4.303144 1.686352
## 2 2.216553 1.318075 0.9550796 1.126705 0.9865963 1.101168 1.098316 4.619335
## 3 5.317307 4.100832 3.0173923 2.746113 2.6429501 2.528239 2.961668 3.088537
## 9 10 11 12 13 14 15
## 1 1.510024 2.4679505 2.1279628 2.3061790 1.178903 1.486037 1.103411
## 2 5.059662 3.2748425 3.8347456 2.8823410 4.908804 4.442969 4.712724
## 3 2.989014 0.6041468 0.7145796 0.7212498 1.993067 1.532585 1.900285
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2 2 2 2 2 2 2 1 1 3 3 3 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 28.798496 83.127892 6.227211
## (between_SS / total_SS = 80.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Compute the total within-cluster sum of squares for k = 1..10
# (a separate loop variable keeps k <- 3 from above intact)
wss_values <- numeric(10)
for (k_try in 1:10) {
  kmeans_temp <- kmeans(df_z_dist, centers = k_try)
  wss_values[k_try] <- kmeans_temp$tot.withinss
}
# Plot the scree plot
plot(1:10, wss_values, type = "b", pch = 19, frame = FALSE,
xlab = "Number of Clusters (k)", ylab = "Total Within Sum of Squares",
main = "Scree Plot for K-Means Clustering")
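factoextra wraps the same elbow computation in a single call; a sketch using the normalized features directly rather than the distance matrix:
# elbow plot over k = 1..10 on the five normalized columns
fviz_nbclust(df_z_norm[, 1:5], kmeans, method = "wss", k.max = 10)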
# Visualize the clustering results with factoextra
fviz_cluster(kmeans_result, data = df_z_norm[, 1:5], geom = "point", stand = FALSE, main = "K-means Cluster Plot")
# Add cluster assignments as a new column in the data frame
df_z_norm$kclust <- kmeans_result$cluster
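A quick sanity check on k = 3 is the average silhouette width, which measures how well each car sits inside its assigned cluster. A sketch using the cluster package and the distance matrix computed earlier:
library(cluster)
# silhouette widths near 1 indicate well-separated assignments
sil <- silhouette(kmeans_result$cluster, df_z_dist)
mean(sil[, "sil_width"])
plot(sil, main = "Silhouette plot, k = 3")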
# Fit ANOVAs for each response against the k-means assignment (kclust enters
# as a numeric 1-Df term, so this is a per-variable trend test)
aov_kclust <- aov(cbind(Horsepower, Width, Length, Acceleration, Weight) ~ kclust, data = df_z_norm)
# Print the per-response ANOVA summaries
summary(aov_kclust)
## Response Horsepower :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 5.2413 5.2413 7.7794 0.01534 *
## Residuals 13 8.7587 0.6737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 1.728 1.728 1.8305 0.1991
## Residuals 13 12.272 0.944
##
## Response Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 0.3487 0.34871 0.3321 0.5743
## Residuals 13 13.6513 1.05010
##
## Response Acceleration :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 4.1296 4.1296 5.4389 0.03642 *
## Residuals 13 9.8704 0.7593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Weight :
## Df Sum Sq Mean Sq F value Pr(>F)
## kclust 1 0.3166 0.31661 0.3008 0.5927
## Residuals 13 13.6834 1.05257
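Because kclust entered the model as a numeric covariate, a stricter multivariate test would code the clusters as a factor (2 Df for 3 clusters) and use manova(). A sketch:
# multivariate test with categorical cluster labels
man_k <- manova(cbind(Horsepower, Width, Length, Acceleration, Weight) ~ factor(kclust), data = df_z_norm)
summary(man_k)      # multivariate test (Pillai's trace by default)
summary.aov(man_k)  # univariate follow-ups per response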
# Repeat the per-variable ANOVAs for the hierarchical clustering assignment
aov_hclust <- aov(cbind(Horsepower, Width, Length, Acceleration, Weight) ~ hclust, data = df_z_norm)
# Print the per-response ANOVA summaries
summary(aov_hclust)
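To see where the two clusterings agree, a quick cross-tabulation of the assignments (sketch):
# rows: hierarchical clusters, columns: k-means clusters
table(hclust = df_z_norm$hclust, kmeans = df_z_norm$kclust)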
# pair each car name with its k-means cluster label
clusters <- cbind(df_z_norm$Name, df_z_norm$kclust)
print(clusters)
## [,1] [,2]
## [1,] "Kia Picanto 1.1. Start" "2"
## [2,] "Suzuki Splash 1.0" "2"
## [3,] "Renault Clio 1.0" "2"
## [4,] "Dacia Sandero 1.6" "2"
## [5,] "Fiat Grande Punto 1.4" "2"
## [6,] "Peugot 207 1.4" "2"
## [7,] "Renault Clio 1.6" "2"
## [8,] "Porsche Cayman" "1"
## [9,] "Nissan 350Z" "1"
## [10,] "Mercedes c200 CDI" "3"
## [11,] "VW Passat Variant 2.0" "3"
## [12,] "Skoda Octavia 2.0" "3"
## [13,] "Mercedes E280" "1"
## [14,] "Audi A6 2.4" "1"
## [15,] "BMW 525i" "1"
# attach the k-means labels to the original data and split by cluster
df$kclust <- df_z_norm$kclust
group_1 <- df[df$kclust == 1,]
group_2 <- df[df$kclust == 2,]
group_3 <- df[df$kclust == 3,]
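The stargazer tables below break down each group; the same per-cluster means can also be computed in one step with dplyr (sketch):
# mean of every numeric attribute by k-means cluster
df %>%
  group_by(kclust) %>%
  summarise(across(where(is.numeric), mean), .groups = "drop")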
stargazer(group_1, type="text", title = "Sports Cars (Group 1)")
##
## Sports Cars (Group 1)
## =============================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------
## Displacement 5 2,954.000 501.970 2,393 3,498
## Moment 5 294.600 53.998 230 353
## Horsepower 5 244.400 52.875 177 301
## Length 5 4,653.000 298.204 4,315 4,916
## Width 5 1,827.800 22.287 1,801 1,855
## Weight 5 1,537.000 122.045 1,340 1,660
## Trunk 5 450.200 132.326 235 546
## Speed 5 250.200 15.897 231 275
## Acceleration 5 6.980 1.410 5.400 8.900
## kclust 5 1.000 0.000 1 1
## ---------------------------------------------
stargazer(group_2, type="text", title = "Compact Cars (Group 2)")
##
## Compact Cars (Group 2)
## ===============================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------
## Displacement 7 1,307.857 240.282 996 1,598
## Moment 7 116.857 19.334 90 140
## Horsepower 7 80.429 12.081 65 95
## Length 7 3,900.286 195.977 3,535 4,030
## Width 7 1,699.143 52.806 1,595 1,748
## Weight 7 1,115.571 100.528 929 1,215
## Trunk 7 249.429 69.670 127 320
## Speed 7 170.000 9.950 154 180
## Acceleration 7 12.957 1.503 11.400 15.100
## kclust 7 2.000 0.000 2 2
## -----------------------------------------------
stargazer(group_3, type="text", title = "Luxury Cars (Group 3)")
##
## Luxury Cars (Group 3)
## ==============================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------------
## Displacement 3 2,028.000 103.923 1,968 2,148
## Moment 3 303.333 28.868 270 320
## Horsepower 3 138.667 2.309 136 140
## Length 3 4,647.000 110.585 4,572 4,774
## Width 3 1,786.333 29.160 1,769 1,820
## Weight 3 1,542.000 101.425 1,425 1,605
## Trunk 3 551.000 57.297 485 588
## Speed 3 205.333 3.786 201 208
## Acceleration 3 10.333 0.569 9.700 10.800
## kclust 3 3.000 0.000 3 3
## ----------------------------------------------
Above is the final list of clusters (column [,2]), based on the k-means assignment. I think the k-means clusters are the better choice because they make more intuitive sense, so we will focus on them in the analysis. The k-means algorithm has effectively grouped the cars into three categories based on the included variables. The ANOVA tables show that Horsepower and Acceleration differ significantly across cluster means. This result makes intuitive sense given the cluster labels of sports, compact, and luxury cars: these groups are differentiated primarily by engine attributes rather than by size alone. For example, some sports cars are coupes (two-door) and are similar in size to compact cars.
The first cluster, identified as Sports Cars, includes models such as the Porsche Cayman, Audi A6 2.4, and BMW 525i. These cars are characterized by powerful performance, with a mean horsepower of 244.4 and a mean acceleration time of 6.98 seconds. They also have the largest engines, with displacement averaging 2,954.0 cc. Known for their dynamic handling and powerful engines, sports cars are ideal for enthusiasts who prioritize performance.
The second cluster, labeled Compact Cars, consists of models like the Kia Picanto, Suzuki Splash, Renault Clio, and Fiat Grande Punto. These cars are characterized by their compact size, efficient engines, and practicality for urban commuting. With smaller displacements averaging 1,307.86 cc, they offer better fuel efficiency. Compact cars provide good value for money and decent comfort, making them popular for daily use.
The third cluster is denoted Luxury Cars and includes models such as the Mercedes C200 CDI and VW Passat Variant. Luxury cars offer upscale interiors and advanced safety features, representing the pinnacle of comfort and technology. They provide generous interior space, with larger dimensions averaging 4,647 mm in length and 1,786.33 mm in width. Luxury cars excel at catering to buyers seeking comfort, prestige, and the latest automotive innovations.
Understanding these clusters can assist buyers in making informed decisions based on their lifestyle, preferences, and priorities when choosing a vehicle. Each cluster represents a different segment of the automotive market. The k-means clusters provide valuable insights for selecting the ideal car to suit individual needs and preferences.