Assignment 2 Final
1.)
Call:
rpart(formula = formula, data = pl_training, method = "class")
n= 600
CP nsplit rel error xerror xstd
1 0.18791946 0 1.0000000 1.0738255 0.04100739
2 0.02684564 1 0.8120805 0.9295302 0.04097786
3 0.02013423 3 0.7583893 0.8959732 0.04084942
4 0.01174497 4 0.7382550 0.8825503 0.04078504
5 0.01006711 6 0.7147651 0.9194631 0.04094418
6 0.01000000 7 0.7046980 0.9127517 0.04091942
Variable importance
c_diff ftg_diff s_diff st_diff htg_diff wdl_ht f_diff y_diff
24 21 15 11 10 10 6 3
Node number 1: 600 observations, complexity param=0.1879195
predicted class=Away expected loss=0.4966667 P(node) =1
class counts: 302 298
probabilities: 0.503 0.497
left son=2 (370 obs) right son=3 (230 obs)
Primary splits:
ftg_diff < 0.5 to the left, improve=11.668920, (0 missing)
wdl_ft splits as LLR, improve=11.668920, (0 missing)
s_diff < 0.5 to the left, improve=11.591030, (0 missing)
st_diff < -0.5 to the left, improve=10.170630, (0 missing)
c_diff < 0.5 to the left, improve= 6.746946, (0 missing)
Surrogate splits:
htg_diff < 0.5 to the left, agree=0.797, adj=0.470, (0 split)
wdl_ht splits as LLR, agree=0.797, adj=0.470, (0 split)
st_diff < 2.5 to the left, agree=0.758, adj=0.370, (0 split)
s_diff < 2.5 to the left, agree=0.688, adj=0.187, (0 split)
f_diff < -6.5 to the right, agree=0.625, adj=0.022, (0 split)
Node number 2: 370 observations, complexity param=0.02684564
predicted class=Away expected loss=0.4189189 P(node) =0.6166667
class counts: 215 155
probabilities: 0.581 0.419
left son=4 (95 obs) right son=5 (275 obs)
Primary splits:
c_diff < -3.5 to the left, improve=8.971882, (0 missing)
s_diff < -12.5 to the left, improve=7.382894, (0 missing)
st_diff < 0.5 to the left, improve=3.779366, (0 missing)
f_diff < -2.5 to the right, improve=2.493164, (0 missing)
ftg_diff < -0.5 to the left, improve=1.194605, (0 missing)
Surrogate splits:
s_diff < -9.5 to the left, agree=0.819, adj=0.295, (0 split)
st_diff < -10.5 to the left, agree=0.754, adj=0.042, (0 split)
Node number 3: 230 observations
predicted class=Home expected loss=0.3782609 P(node) =0.3833333
class counts: 87 143
probabilities: 0.378 0.622
Node number 4: 95 observations
predicted class=Away expected loss=0.2315789 P(node) =0.1583333
class counts: 73 22
probabilities: 0.768 0.232
Node number 5: 275 observations, complexity param=0.02684564
predicted class=Away expected loss=0.4836364 P(node) =0.4583333
class counts: 142 133
probabilities: 0.516 0.484
left son=10 (213 obs) right son=11 (62 obs)
Primary splits:
s_diff < 5.5 to the left, improve=3.384380, (0 missing)
f_diff < -8.5 to the left, improve=2.087576, (0 missing)
st_diff < 3.5 to the left, improve=1.193904, (0 missing)
ftg_diff < -1.5 to the left, improve=1.070037, (0 missing)
htg_diff < -1.5 to the left, improve=1.045349, (0 missing)
Surrogate splits:
st_diff < 1.5 to the left, agree=0.840, adj=0.290, (0 split)
c_diff < 5.5 to the left, agree=0.833, adj=0.258, (0 split)
Node number 10: 213 observations, complexity param=0.02013423
predicted class=Away expected loss=0.4413146 P(node) =0.355
class counts: 119 94
probabilities: 0.559 0.441
left son=20 (141 obs) right son=21 (72 obs)
Primary splits:
f_diff < -2.5 to the right, improve=2.190665, (0 missing)
c_diff < 5.5 to the right, improve=2.049204, (0 missing)
htg_diff < -2.5 to the right, improve=1.584083, (0 missing)
st_diff < -5.5 to the right, improve=1.415182, (0 missing)
ftg_diff < -3.5 to the right, improve=1.217358, (0 missing)
Surrogate splits:
y_diff < -2.5 to the right, agree=0.667, adj=0.014, (0 split)
Node number 11: 62 observations
predicted class=Home expected loss=0.3709677 P(node) =0.1033333
class counts: 23 39
probabilities: 0.371 0.629
Node number 20: 141 observations
predicted class=Away expected loss=0.3900709 P(node) =0.235
class counts: 86 55
probabilities: 0.610 0.390
Node number 21: 72 observations, complexity param=0.01174497
predicted class=Home expected loss=0.4583333 P(node) =0.12
class counts: 33 39
probabilities: 0.458 0.542
left son=42 (59 obs) right son=43 (13 obs)
Primary splits:
c_diff < -1.5 to the right, improve=1.643090, (0 missing)
f_diff < -8 to the left, improve=1.531250, (0 missing)
st_diff < -2.5 to the right, improve=1.251203, (0 missing)
ftg_diff < -0.5 to the right, improve=1.172078, (0 missing)
wdl_ft splits as LR-, improve=1.172078, (0 missing)
Surrogate splits:
ftg_diff < -3.5 to the right, agree=0.833, adj=0.077, (0 split)
s_diff < -14.5 to the right, agree=0.833, adj=0.077, (0 split)
Node number 42: 59 observations, complexity param=0.01174497
predicted class=Away expected loss=0.4915254 P(node) =0.09833333
class counts: 30 29
probabilities: 0.508 0.492
left son=84 (31 obs) right son=85 (28 obs)
Primary splits:
y_diff < -0.5 to the right, improve=1.4247050, (0 missing)
st_diff < -3.5 to the right, improve=0.7822231, (0 missing)
htg_diff < -1.5 to the left, improve=0.6728441, (0 missing)
s_diff < -8.5 to the left, improve=0.6728441, (0 missing)
f_diff < -6.5 to the left, improve=0.6629540, (0 missing)
Surrogate splits:
s_diff < -3.5 to the left, agree=0.610, adj=0.179, (0 split)
r_diff < 0.5 to the left, agree=0.610, adj=0.179, (0 split)
st_diff < -4.5 to the left, agree=0.593, adj=0.143, (0 split)
htg_diff < -2.5 to the right, agree=0.559, adj=0.071, (0 split)
f_diff < -8 to the right, agree=0.559, adj=0.071, (0 split)
Node number 43: 13 observations
predicted class=Home expected loss=0.2307692 P(node) =0.02166667
class counts: 3 10
probabilities: 0.231 0.769
Node number 84: 31 observations
predicted class=Away expected loss=0.3870968 P(node) =0.05166667
class counts: 19 12
probabilities: 0.613 0.387
Node number 85: 28 observations, complexity param=0.01006711
predicted class=Home expected loss=0.3928571 P(node) =0.04666667
class counts: 11 17
probabilities: 0.393 0.607
left son=170 (7 obs) right son=171 (21 obs)
Primary splits:
c_diff < 3.5 to the right, improve=1.9285710, (0 missing)
f_diff < -4.5 to the left, improve=1.7857140, (0 missing)
st_diff < -2.5 to the right, improve=1.6138270, (0 missing)
s_diff < -4.5 to the right, improve=0.2142857, (0 missing)
y_diff < -1.5 to the left, improve=0.1488095, (0 missing)
Surrogate splits:
f_diff < -9.5 to the left, agree=0.821, adj=0.286, (0 split)
Node number 170: 7 observations
predicted class=Away expected loss=0.2857143 P(node) =0.01166667
class counts: 5 2
probabilities: 0.714 0.286
Node number 171: 21 observations
predicted class=Home expected loss=0.2857143 P(node) =0.035
class counts: 6 15
probabilities: 0.286 0.714
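For context, a minimal sketch of how a tree like this is typically fit and summarised (the formula object itself is not shown in the Call above; the predictors here are inferred from the variable-importance list, and pl_training is assumed to hold the training data):
library(rpart)
library(rattle)   # provides fancyRpartPlot()
# Formula assumed from the predictors appearing in the tree output
formula <- home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff +
  f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht
tree_model <- rpart(formula = formula, data = pl_training, method = "class")
summary(tree_model)   # prints the CP table and node-by-node detail shown above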
fancyRpartPlot(tree_model)
1b.) The classification tree above tells us the following:
i. Rule for predicting if a team is the home team: Based on the decision tree model, one rule for predicting if a team is the home team is as follows: "If the full-time goal difference (ftg_diff) is at least 0.5, i.e. the team scored at least one more goal than its opponent, predict that the team is the home team." It is important to note that the purity of the node where this rule applies is not very high: Node 3 contains 87 "Away" observations and 143 "Home" observations, indicating noticeable impurity.
- Rule for predicting if a team is the away team: From the decision tree analysis, a rule for predicting if a team is the away team is as follows:
"If the full-time goal difference (ftg_diff) is less than 0.5 and the corner difference (c_diff) is less than -3.5, predict that the team is the away team." This rule comes from Node 4, which is reached via the Node 1 split on ftg_diff and the Node 2 split on c_diff. Its purity is relatively high compared to the other leaves: Node 4 contains 73 "Away" observations and only 22 "Home" observations.
- Variable importance for predicting if a team is home or away:
c_diff (Corner Difference): This variable has the highest importance value, 24, suggesting that the difference in corner kicks awarded to the team compared to the opponent is the most important predictor for determining if a team is home or away.
ftg_diff (Full-time Goal Difference): With an importance value of 21, the full-time goal difference, which represents the goals scored by the team minus the goals scored by the opponent, is the second most important predictor.
s_diff (Shot Difference): The shot difference, denoted by s_diff, follows with an importance value of 15. It indicates the difference between the shots made by the team and those made by the opponent.
st_diff (Shot on Target Difference): This variable represents the difference in shots on target made by the team and those made by the opponent. It has an importance value of 11.
htg_diff (Half-time Goal Difference): Half-time goal difference, htg_diff, comes next with an importance value of 10, indicating its significance in predicting whether a team is home or away.
wdl_ht (Half-time Winning, Drawing, Losing): While not a numerical variable like the others, the halftime winning, drawing, or losing status holds importance (importance value of 10) in predicting the team’s status (home or away) at half-time.
f_diff (Foul Difference): Foul difference, representing the difference between fouls made by the team and those made by the opponent, has an importance value of 6.
y_diff (Yellow Card Difference): Yellow card difference, showing the difference in yellow cards shown to the team and those shown to the opponent, follows with an importance value of 3.
r_diff (Red Card Difference): Lastly, the difference in red cards shown to the team and those shown to the opponent, represented by r_diff, has the lowest importance value (about 0.25), small enough that it does not appear in the rounded importance listing above.
These importance values suggest the relative influence of each predictor variable in determining whether a team is playing at home or away.
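For reference, the underlying (unscaled) importance scores can be pulled straight from the fitted rpart object; a minimal sketch, assuming the tree is stored as tree_model:
# Raw importance, accumulated over primary and surrogate splits;
# the printed "Variable importance" above is this vector rescaled to sum to 100.
round(tree_model$variable.importance, 2)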
c.) Accuracy on the training dataset: 65%. Accuracy on the testing dataset: 63.125%. The testing accuracy is only slightly lower than the training accuracy, which suggests at most a mild degree of overfitting: the classification tree generalises reasonably well to unseen matches.
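A sketch of how these accuracies can be computed, assuming the held-out data frame is named pl_testing (a name not shown in the original):
train_pred <- predict(tree_model, newdata = pl_training, type = "class")
test_pred  <- predict(tree_model, newdata = pl_testing, type = "class")
mean(train_pred == pl_training$home_or_away)   # training accuracy, ~0.65
mean(test_pred == pl_testing$home_or_away)     # testing accuracy, ~0.63
table(predicted = test_pred, actual = pl_testing$home_or_away)   # test-set confusion matrix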
3.)
Call:
glm(formula = home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff +
f_diff + c_diff + y_diff + r_diff, family = "binomial", data = pl_training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.012840 0.084802 -0.151 0.87965
ftg_diff 0.059662 0.078488 0.760 0.44717
htg_diff 0.120396 0.107023 1.125 0.26061
s_diff 0.047562 0.018066 2.633 0.00847 **
st_diff -0.023611 0.040479 -0.583 0.55969
f_diff -0.023460 0.018947 -1.238 0.21566
c_diff 0.019839 0.025258 0.785 0.43219
y_diff -0.004931 0.056198 -0.088 0.93007
r_diff 0.241312 0.259056 0.932 0.35159
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 831.75 on 599 degrees of freedom
Residual deviance: 787.27 on 591 degrees of freedom
AIC: 805.27
Number of Fisher Scoring iterations: 4
s_diff (p = 0.00847): The difference in shots made by the team compared to the opponent is statistically significant at the α = 0.05 level, suggesting it has a notable influence on classifying a team as home or away. It is the only predictor significant at that level, so the shot difference (s_diff) appears to be the only important predictor variable in classifying a team as home or away.
Odds ratio = e^0.047562 ≈ 1.049.
Interpretation: for each additional shot made by the team relative to its opponent (a one-unit increase in s_diff), the odds of the team being classified as the home team increase by approximately 4.9%, holding the other predictors constant.
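These quantities can be computed directly from the fitted model; a minimal sketch, assuming the glm object above is stored as logit_model (a name assumed here, not shown in the original):
exp(coef(logit_model))   # odds ratios; exp(0.047562) ≈ 1.049 for s_diff
exp(cbind(OR = coef(logit_model), confint.default(logit_model)))   # with Wald 95% confidence intervals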
3.a)
[1] "Away" "Home"
[1] "Away" "Home"
Call:
glm(formula = home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff +
    f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht,
    family = binomial(link = "logit"), data = pl_training)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.145392   0.188527  -0.771   0.4406
ftg_diff    -0.147175   0.118998  -1.237   0.2162
htg_diff     0.096657   0.184058   0.525   0.5995
s_diff       0.046414   0.018210   2.549   0.0108 *
st_diff     -0.020422   0.040914  -0.499   0.6177
f_diff      -0.019706   0.019094  -1.032   0.3021
c_diff       0.018899   0.025457   0.742   0.4578
y_diff      -0.004911   0.056662  -0.087   0.9309
r_diff       0.226011   0.260462   0.868   0.3855
wdl_ftLose  -0.425153   0.301097  -1.412   0.1579
wdl_ftWin    0.563351   0.300776   1.873   0.0611 .
wdl_htLose   0.091217   0.320569   0.285   0.7760
wdl_htWin    0.178392   0.325737   0.548   0.5839
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 831.75 on 599 degrees of freedom
Residual deviance: 780.45 on 587 degrees of freedom
AIC: 806.45
Number of Fisher Scoring iterations: 4
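Because the first model (without wdl_ft and wdl_ht) is nested in this one, the contribution of the added factors can be assessed with a likelihood-ratio (deviance) test; a sketch, assuming the two fits are stored as logit_model and logit_model2 (names assumed):
# Residual deviance drops from 787.27 to 780.45 (a difference of 6.82 on 4 df, p ≈ 0.15),
# so the half-time/full-time win-draw-lose factors do not add significant explanatory power.
anova(logit_model, logit_model2, test = "Chisq")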
3b.)
2.) Clustering Baseball players
1.)
'data.frame': 82 obs. of 6 variables:
$ playerID : chr "aaronha01" "alomaro01" "aparilu01" "bagweje01" ...
$ hits : int 3771 2724 2677 2314 2583 2048 2150 3060 3010 1779 ...
$ runs : int 2174 1508 1335 1517 1305 1091 1175 1844 1513 861 ...
$ home_runs : int 755 210 83 449 512 389 358 291 118 68 ...
$ rbi : int 2297 1134 791 1529 1636 1376 1430 1175 1014 789 ...
$ stolen_bases: int 240 474 506 202 50 68 30 414 24 51 ...
2.) For hierarchical clustering, the need for scaling depends on the scale of the variables. Since hierarchical clustering relies on distance metrics, variables with larger scales can dominate the clustering process. However, if the variables in the dataset are on similar scales, scaling may not be necessary.
On the other hand, for K-means clustering, scaling is generally recommended. K-means clustering partitions the data based on Euclidean distance, making it sensitive to the scale of the variables. Scaling ensures that all variables contribute equally to the clustering process, preventing biases due to differences in variable scales.
In summary, while hierarchical clustering may not require scaling if the variables are on similar scales, the attributes in this dataset are not on similar scales, so it is beneficial to scale the data before performing K-means clustering (and hierarchical clustering) to ensure sensible results.
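A quick way to see why scaling matters for these data is to compare the spread of the raw attributes; a sketch, using the attribute set selected in 3.a below:
raw_stats <- baseball_data[c("hits", "runs", "home_runs", "rbi", "stolen_bases")]
sapply(raw_stats, sd)      # career hits and RBI vary over a much wider numeric range
sapply(raw_stats, range)   # than home runs or stolen bases, so unscaled Euclidean
                           # distances would be dominated by the high-count statistics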
3.a)
baseball_data <- read.csv("baseball_hof.csv")
sum(is.na(baseball_data))   # check for missing values
[1] 0
attributes <- c("hits", "runs", "home_runs", "rbi", "stolen_bases")
data_for_clustering <- baseball_data[attributes]   # keep only the five career statistics
scaled_data <- scale(data_for_clustering)          # standardise before clustering
num_clusters <- 3
set.seed(123)
kmeans_result <- kmeans(scaled_data, centers = num_clusters)
library(ggplot2)
cluster_plot <- ggplot(data_for_clustering, aes(x = hits, y = runs, color = factor(kmeans_result$cluster))) +
geom_point(size = 3) +
labs(title = "Clustering of Baseball Players",
x = "hits",
y = "runs",
color = "Cluster") +
theme_minimal()
print(cluster_plot)
dist_matrix <- dist(scaled_data) # Compute the distance matrix
hclust_result <- hclust(dist_matrix, method = 'ward.D') # Perform hierarchical clustering
plot(hclust_result, hang = -1, cex = 0.6, main = "Hierarchical Clustering Dendrogram")
3b+c)
The heatmap shows evidence of clustering: light-coloured blocks appear along the diagonal, most clearly near the top, and they become less distinct further down the diagonal.
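For reference, a minimal sketch of one way to draw such a distance heatmap (the original call and its arguments are not shown; this version uses base R's heatmap() on the scaled distance matrix):
heatmap(as.matrix(dist(scaled_data)), symm = TRUE,
        main = "Distances between players (scaled statistics)")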
Silhouette of 82 units in 4 clusters from silhouette.default(x = clusters1, dist = b1) :
Cluster sizes and average silhouette widths:
17 25 30 10
0.3202221 0.2078566 0.2958375 0.4325219
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1771 0.2100 0.3257 0.2907 0.4224 0.5725
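The call above refers to objects named clusters1 and b1; a sketch of how they are presumably constructed (assuming the dendrogram is cut into 4 clusters and distances are computed on the scaled data):
library(cluster)                            # provides silhouette()
clusters1 <- cutree(hclust_result, k = 4)   # cut the dendrogram into 4 clusters
b1 <- dist(scaled_data)                     # distance matrix used for the silhouette
sil_hc <- silhouette(clusters1, b1)
summary(sil_hc)                             # cluster sizes and average widths as above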
4.)
hits runs home_runs rbi stolen_bases
1 2462.586 1387.5517 417.7931 1534.241 95.96552
2 2810.833 1536.0000 146.4444 1034.556 530.83333
3 1833.385 973.0769 246.5385 1076.077 61.23077
4 3232.818 1780.5455 387.3636 1753.727 257.45455
cluster_assignments
1 2 3 4
29 18 13 22
Average silhouette width by cluster: Cluster 1: 0.320; Cluster 2: 0.208; Cluster 3: 0.296; Cluster 4: 0.433.
Interpretation: silhouette widths closer to 1 indicate well-separated clusters. Cluster 4 has the highest average silhouette width, suggesting its members are the most cohesive and best separated from the other clusters. Cluster 2 has the lowest average width, indicating that its members overlap with, or are poorly separated from, neighbouring clusters.
cluster neighbor sil_width
[1,] 4 1 0.384506916
[2,] 2 1 0.312982837
[3,] 2 1 0.418236473
[4,] 1 3 0.562121628
[5,] 1 4 0.566132188
[6,] 3 1 0.095805288
[7,] 1 3 0.147925307
[8,] 2 4 0.131414280
[9,] 2 1 0.188914215
[10,] 3 1 0.545892803
[11,] 4 1 0.313885438
[12,] 2 4 0.427358174
[13,] 3 1 0.468706901
[14,] 2 1 0.398204049
[15,] 3 1 0.298848165
[16,] 4 2 0.250382372
[17,] 3 1 0.553761033
[18,] 2 4 0.041698566
[19,] 1 3 0.295765391
[20,] 1 4 0.327251906
[21,] 3 1 0.470203252
[22,] 1 3 0.465023241
[23,] 1 3 0.411872464
[24,] 1 4 0.147562851
[25,] 2 1 0.292846775
[26,] 3 1 0.459970799
[27,] 4 1 -0.103113559
[28,] 1 2 0.541979678
[29,] 2 4 0.283954071
[30,] 3 1 0.512015506
[31,] 1 2 0.438077081
[32,] 2 4 0.131430652
[33,] 4 1 -0.013639305
[34,] 1 4 0.427821152
[35,] 1 4 0.283669262
[36,] 4 1 0.151655201
[37,] 2 1 0.461654552
[38,] 1 3 0.321096503
[39,] 3 1 0.527598880
[40,] 4 2 0.217706370
[41,] 2 1 0.091862035
[42,] 1 2 0.507613798
[43,] 2 1 0.266623036
[44,] 1 3 0.545351705
[45,] 4 1 0.428181944
[46,] 1 3 0.407883494
[47,] 1 3 0.406757388
[48,] 4 2 0.040201876
[49,] 2 1 0.260067139
[50,] 4 1 0.394498616
[51,] 4 1 0.461390102
[52,] 4 1 0.160488989
[53,] 1 4 0.446538852
[54,] 3 1 0.061403663
[55,] 3 1 0.103752437
[56,] 2 1 0.403265905
[57,] 1 3 0.536215579
[58,] 4 1 0.343375029
[59,] 1 2 0.269724218
[60,] 4 1 0.236744965
[61,] 3 1 0.532695232
[62,] 1 2 0.209376924
[63,] 4 1 0.203585823
[64,] 2 1 -0.056968553
[65,] 1 3 0.526979514
[66,] 4 1 0.005527059
[67,] 2 1 0.228016213
[68,] 2 3 0.310036150
[69,] 1 3 0.107587752
[70,] 4 2 0.295883103
[71,] 1 3 0.391317643
[72,] 3 1 0.235144270
[73,] 1 4 0.560058764
[74,] 1 4 0.540374315
[75,] 1 3 0.193043689
[76,] 4 2 0.256366507
[77,] 4 2 0.018085437
[78,] 1 4 0.479425546
[79,] 1 4 0.155803449
[80,] 4 1 0.364188344
[81,] 4 1 0.488213864
[82,] 4 2 0.147591607
attr(,"Ordered")
[1] FALSE
attr(,"call")
silhouette.default(x = cluster_assignments, dist = dist(data_for_clustering))
attr(,"class")
[1] "silhouette"
[1] 0.3137006
The silhouette analysis provides insight into the quality of the clustering solution. The average silhouette width of approximately 0.314 indicates modest separation: the clusters are distinguishable, but not strongly separated.
Cluster hits runs home_runs rbi stolen_bases
1 1 2462.586 1387.5517 417.7931 1534.241 95.96552
2 2 2810.833 1536.0000 146.4444 1034.556 530.83333
3 3 1833.385 973.0769 246.5385 1076.077 61.23077
4 4 3232.818 1780.5455 387.3636 1753.727 257.45455
[1] 4 2 2 1 1 3 1 2 2 3 1 2 3 2 3 2 3 2 3 1 3 1 1 4 2 3 4 1 2 3 1 2 1 1 1 1 2 1
[39] 3 2 3 1 3 1 4 1 3 2 2 4 4 4 1 3 3 2 1 4 1 4 3 1 4 3 1 1 2 2 1 2 1 3 1 1 3 2
[77] 2 1 4 4 4 2
hits runs home_runs rbi stolen_bases
1 -0.2044605 -0.1624618 0.5660315 0.3798416 -0.4849247
2 0.7521677 0.5537557 -0.9804896 -0.5809428 1.2098245
3 -1.1508405 -1.2010894 -0.5701789 -0.8728960 -0.4647733
4 0.8494748 1.1682310 1.2734836 1.4407847 -0.3236583
[1] 28 22 19 13
[1] 3
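These summaries come directly from the fitted k-means object; a sketch, assuming it is stored as kmeans1 as in the silhouette call below:
aggregate(data_for_clustering,
          by = list(cluster = kmeans1$cluster), FUN = mean)   # cluster profiles on the original scale
kmeans1$centers   # cluster centres on the scaled variables
kmeans1$size      # cluster sizes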
sil_kmeans1 <- silhouette(kmeans1$cluster, b1)
summary(sil_kmeans1)
Silhouette of 82 units in 4 clusters from silhouette.default(x = kmeans1$cluster, dist = b1) :
Cluster sizes and average silhouette widths:
28 22 19 13
0.3322757 0.2178155 0.2687416 0.3421893
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.06303 0.18425 0.28984 0.28842 0.41869 0.54977
5.a)
Based on the overall average silhouette widths (approximately 0.291 for hierarchical clustering versus 0.288 for K-means), hierarchical clustering produced clusters of slightly higher quality, but the difference is not substantial.
5b.) Hierarchical clustering: Cluster 1: size 17, average silhouette width 0.320; Cluster 2: size 25, width 0.208; Cluster 3: size 30, width 0.296; Cluster 4: size 10, width 0.433.
K-means: Cluster 1: size 28, average silhouette width 0.332; Cluster 2: size 22, width 0.218; Cluster 3: size 19, width 0.269; Cluster 4: size 13, width 0.342.
While both algorithms produced four clusters, there are noticeable differences in the cluster profiles:
Cluster sizes: The sizes differ between the two algorithms (and the numeric labels are arbitrary, so "Cluster 1" in one solution need not correspond to "Cluster 1" in the other). For example, the largest hierarchical cluster contains 30 players against 28 for the largest K-means cluster, while the smallest hierarchical cluster (10) is smaller than the smallest K-means cluster (13).
Average silhouette width: The overall average silhouette width of K-means is only slightly lower than that of hierarchical clustering, and the per-cluster widths are broadly similar. Cluster 4 in the hierarchical solution stands out with the highest average width (0.433), suggesting it is more cohesive and better separated than any single K-means cluster.
In conclusion, the two algorithms produce broadly similar clusterings, with comparable silhouette widths but noticeably different cluster sizes and, for some clusters, different degrees of cohesion. These differences arise from the different methodologies and assumptions of the two algorithms.
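Because the numeric labels produced by the two algorithms are arbitrary, a cross-tabulation of the two assignments gives a more direct view of how the partitions overlap; a sketch, assuming clusters1 (hierarchical) and kmeans1 (k-means) from above:
table(hierarchical = clusters1, kmeans = kmeans1$cluster)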