Assignment 2 Final
1.)
Call:
rpart(formula = formula, data = pl_training, method = "class")
n= 600
CP nsplit rel error xerror xstd
1 0.18791946 0 1.0000000 1.0738255 0.04100739
2 0.02684564 1 0.8120805 0.9295302 0.04097786
3 0.02013423 3 0.7583893 0.8959732 0.04084942
4 0.01174497 4 0.7382550 0.8825503 0.04078504
5 0.01006711 6 0.7147651 0.9194631 0.04094418
6 0.01000000 7 0.7046980 0.9127517 0.04091942
Variable importance
c_diff ftg_diff s_diff st_diff htg_diff wdl_ht f_diff y_diff
24 21 15 11 10 10 6 3
Node number 1: 600 observations, complexity param=0.1879195
predicted class=Away expected loss=0.4966667 P(node) =1
class counts: 302 298
probabilities: 0.503 0.497
left son=2 (370 obs) right son=3 (230 obs)
Primary splits:
ftg_diff < 0.5 to the left, improve=11.668920, (0 missing)
wdl_ft splits as LLR, improve=11.668920, (0 missing)
s_diff < 0.5 to the left, improve=11.591030, (0 missing)
st_diff < -0.5 to the left, improve=10.170630, (0 missing)
c_diff < 0.5 to the left, improve= 6.746946, (0 missing)
Surrogate splits:
htg_diff < 0.5 to the left, agree=0.797, adj=0.470, (0 split)
wdl_ht splits as LLR, agree=0.797, adj=0.470, (0 split)
st_diff < 2.5 to the left, agree=0.758, adj=0.370, (0 split)
s_diff < 2.5 to the left, agree=0.688, adj=0.187, (0 split)
f_diff < -6.5 to the right, agree=0.625, adj=0.022, (0 split)
Node number 2: 370 observations, complexity param=0.02684564
predicted class=Away expected loss=0.4189189 P(node) =0.6166667
class counts: 215 155
probabilities: 0.581 0.419
left son=4 (95 obs) right son=5 (275 obs)
Primary splits:
c_diff < -3.5 to the left, improve=8.971882, (0 missing)
s_diff < -12.5 to the left, improve=7.382894, (0 missing)
st_diff < 0.5 to the left, improve=3.779366, (0 missing)
f_diff < -2.5 to the right, improve=2.493164, (0 missing)
ftg_diff < -0.5 to the left, improve=1.194605, (0 missing)
Surrogate splits:
s_diff < -9.5 to the left, agree=0.819, adj=0.295, (0 split)
st_diff < -10.5 to the left, agree=0.754, adj=0.042, (0 split)
Node number 3: 230 observations
predicted class=Home expected loss=0.3782609 P(node) =0.3833333
class counts: 87 143
probabilities: 0.378 0.622
Node number 4: 95 observations
predicted class=Away expected loss=0.2315789 P(node) =0.1583333
class counts: 73 22
probabilities: 0.768 0.232
Node number 5: 275 observations, complexity param=0.02684564
predicted class=Away expected loss=0.4836364 P(node) =0.4583333
class counts: 142 133
probabilities: 0.516 0.484
left son=10 (213 obs) right son=11 (62 obs)
Primary splits:
s_diff < 5.5 to the left, improve=3.384380, (0 missing)
f_diff < -8.5 to the left, improve=2.087576, (0 missing)
st_diff < 3.5 to the left, improve=1.193904, (0 missing)
ftg_diff < -1.5 to the left, improve=1.070037, (0 missing)
htg_diff < -1.5 to the left, improve=1.045349, (0 missing)
Surrogate splits:
st_diff < 1.5 to the left, agree=0.840, adj=0.290, (0 split)
c_diff < 5.5 to the left, agree=0.833, adj=0.258, (0 split)
Node number 10: 213 observations, complexity param=0.02013423
predicted class=Away expected loss=0.4413146 P(node) =0.355
class counts: 119 94
probabilities: 0.559 0.441
left son=20 (141 obs) right son=21 (72 obs)
Primary splits:
f_diff < -2.5 to the right, improve=2.190665, (0 missing)
c_diff < 5.5 to the right, improve=2.049204, (0 missing)
htg_diff < -2.5 to the right, improve=1.584083, (0 missing)
st_diff < -5.5 to the right, improve=1.415182, (0 missing)
ftg_diff < -3.5 to the right, improve=1.217358, (0 missing)
Surrogate splits:
y_diff < -2.5 to the right, agree=0.667, adj=0.014, (0 split)
Node number 11: 62 observations
predicted class=Home expected loss=0.3709677 P(node) =0.1033333
class counts: 23 39
probabilities: 0.371 0.629
Node number 20: 141 observations
predicted class=Away expected loss=0.3900709 P(node) =0.235
class counts: 86 55
probabilities: 0.610 0.390
Node number 21: 72 observations, complexity param=0.01174497
predicted class=Home expected loss=0.4583333 P(node) =0.12
class counts: 33 39
probabilities: 0.458 0.542
left son=42 (59 obs) right son=43 (13 obs)
Primary splits:
c_diff < -1.5 to the right, improve=1.643090, (0 missing)
f_diff < -8 to the left, improve=1.531250, (0 missing)
st_diff < -2.5 to the right, improve=1.251203, (0 missing)
ftg_diff < -0.5 to the right, improve=1.172078, (0 missing)
wdl_ft splits as LR-, improve=1.172078, (0 missing)
Surrogate splits:
ftg_diff < -3.5 to the right, agree=0.833, adj=0.077, (0 split)
s_diff < -14.5 to the right, agree=0.833, adj=0.077, (0 split)
Node number 42: 59 observations, complexity param=0.01174497
predicted class=Away expected loss=0.4915254 P(node) =0.09833333
class counts: 30 29
probabilities: 0.508 0.492
left son=84 (31 obs) right son=85 (28 obs)
Primary splits:
y_diff < -0.5 to the right, improve=1.4247050, (0 missing)
st_diff < -3.5 to the right, improve=0.7822231, (0 missing)
htg_diff < -1.5 to the left, improve=0.6728441, (0 missing)
s_diff < -8.5 to the left, improve=0.6728441, (0 missing)
f_diff < -6.5 to the left, improve=0.6629540, (0 missing)
Surrogate splits:
s_diff < -3.5 to the left, agree=0.610, adj=0.179, (0 split)
r_diff < 0.5 to the left, agree=0.610, adj=0.179, (0 split)
st_diff < -4.5 to the left, agree=0.593, adj=0.143, (0 split)
htg_diff < -2.5 to the right, agree=0.559, adj=0.071, (0 split)
f_diff < -8 to the right, agree=0.559, adj=0.071, (0 split)
Node number 43: 13 observations
predicted class=Home expected loss=0.2307692 P(node) =0.02166667
class counts: 3 10
probabilities: 0.231 0.769
Node number 84: 31 observations
predicted class=Away expected loss=0.3870968 P(node) =0.05166667
class counts: 19 12
probabilities: 0.613 0.387
Node number 85: 28 observations, complexity param=0.01006711
predicted class=Home expected loss=0.3928571 P(node) =0.04666667
class counts: 11 17
probabilities: 0.393 0.607
left son=170 (7 obs) right son=171 (21 obs)
Primary splits:
c_diff < 3.5 to the right, improve=1.9285710, (0 missing)
f_diff < -4.5 to the left, improve=1.7857140, (0 missing)
st_diff < -2.5 to the right, improve=1.6138270, (0 missing)
s_diff < -4.5 to the right, improve=0.2142857, (0 missing)
y_diff < -1.5 to the left, improve=0.1488095, (0 missing)
Surrogate splits:
f_diff < -9.5 to the left, agree=0.821, adj=0.286, (0 split)
Node number 170: 7 observations
predicted class=Away expected loss=0.2857143 P(node) =0.01166667
class counts: 5 2
probabilities: 0.714 0.286
Node number 171: 21 observations
predicted class=Home expected loss=0.2857143 P(node) =0.035
class counts: 6 15
probabilities: 0.286 0.714
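For context, a minimal sketch of how a tree like this is typically fit and summarised (the formula object itself is not shown in the Call above; the predictors here are inferred from the variable-importance list, and pl_training is assumed to hold the training data):
library(rpart)
library(rattle)   # provides fancyRpartPlot()
# Formula assumed from the predictors appearing in the tree output
formula <- home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff +
  f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht
tree_model <- rpart(formula = formula, data = pl_training, method = "class")
summary(tree_model)   # prints the CP table and node-by-node detail shown above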
fancyRpartPlot(tree_model)
1b.) The classification tree above tells us the following:
i. Rule for predicting if a team is the home team: Based on the decision tree model, one rule for predicting if a team is the home team is as follows: "If the full-time goal difference (ftg_diff) is at least 0.5, i.e. the team scored at least one more goal than its opponent, predict that the team is the home team." It is important to note that the purity of the node where this rule applies is not very high: Node 3 contains 87 "Away" observations and 143 "Home" observations, indicating noticeable impurity.
- Rule for predicting if a team is the away team: From the decision tree analysis, a rule for predicting if a team is the away team is as follows:
"If the full-time goal difference (ftg_diff) is less than 0.5 and the corner difference (c_diff) is less than -3.5, predict that the team is the away team." This rule comes from Node 4, which is reached via the Node 1 split on ftg_diff and the Node 2 split on c_diff. Its purity is relatively high compared to the other leaves: Node 4 contains 73 "Away" observations and only 22 "Home" observations.
- Variable importance for predicting if a team is home or away:
c_diff (Corner Difference): This variable has the highest importance value, 24, suggesting that the difference in corner kicks awarded to the team compared to the opponent is the most important predictor for determining if a team is home or away.
ftg_diff (Full-time Goal Difference): With an importance value of 21, the full-time goal difference, which represents the goals scored by the team minus the goals scored by the opponent, is the second most important predictor.
s_diff (Shot Difference): The shot difference, denoted by s_diff, follows with an importance value of 15. It indicates the difference between the shots made by the team and those made by the opponent.
st_diff (Shot on Target Difference): This variable represents the difference in shots on target made by the team and those made by the opponent. It has an importance value of 11.
htg_diff (Half-time Goal Difference): Half-time goal difference, htg_diff, comes next with an importance value of 10, indicating its significance in predicting whether a team is home or away.
wdl_ht (Half-time Winning, Drawing, Losing): While not a numerical variable like the others, the halftime winning, drawing, or losing status holds importance (importance value of 10) in predicting the team’s status (home or away) at half-time.
f_diff (Foul Difference): Foul difference, representing the difference between fouls made by the team and those made by the opponent, has an importance value of 6.
y_diff (Yellow Card Difference): Yellow card difference, showing the difference in yellow cards shown to the team and those shown to the opponent, follows with an importance value of 3.
r_diff (Red Card Difference): Lastly, the difference in red cards shown to the team and those shown to the opponent, represented by r_diff, has the lowest importance value (about 0.25), small enough that it does not appear in the rounded importance listing above.
These importance values suggest the relative influence of each predictor variable in determining whether a team is playing at home or away.
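For reference, the underlying (unscaled) importance scores can be pulled straight from the fitted rpart object; a minimal sketch, assuming the tree is stored as tree_model:
# Raw importance, accumulated over primary and surrogate splits;
# the printed "Variable importance" above is this vector rescaled to sum to 100.
round(tree_model$variable.importance, 2)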
c.) Accuracy on the training dataset: 65%. Accuracy on the testing dataset: 63.125%. The testing accuracy is only slightly lower than the training accuracy, which suggests at most a mild degree of overfitting: the classification tree generalises reasonably well to unseen matches.
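A sketch of how these accuracies can be computed, assuming the held-out data frame is named pl_testing (a name not shown in the original):
train_pred <- predict(tree_model, newdata = pl_training, type = "class")
test_pred  <- predict(tree_model, newdata = pl_testing, type = "class")
mean(train_pred == pl_training$home_or_away)   # training accuracy, ~0.65
mean(test_pred == pl_testing$home_or_away)     # testing accuracy, ~0.63
table(predicted = test_pred, actual = pl_testing$home_or_away)   # test-set confusion matrix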
3.)
Call:
glm(formula = home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff +
f_diff + c_diff + y_diff + r_diff, family = "binomial", data = pl_training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.012840 0.084802 -0.151 0.87965
ftg_diff 0.059662 0.078488 0.760 0.44717
htg_diff 0.120396 0.107023 1.125 0.26061
s_diff 0.047562 0.018066 2.633 0.00847 **
st_diff -0.023611 0.040479 -0.583 0.55969
f_diff -0.023460 0.018947 -1.238 0.21566
c_diff 0.019839 0.025258 0.785 0.43219
y_diff -0.004931 0.056198 -0.088 0.93007
r_diff 0.241312 0.259056 0.932 0.35159
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 831.75 on 599 degrees of freedom
Residual deviance: 787.27 on 591 degrees of freedom
AIC: 805.27
Number of Fisher Scoring iterations: 4
s_diff (p = 0.00847): The difference in shots made by the team compared to the opponent is statistically significant at the α = 0.05 level, suggesting it has a notable influence on classifying a team as home or away. It is the only predictor significant at that level, so the shot difference (s_diff) appears to be the only important predictor variable in classifying a team as home or away.
Odds ratio = e^0.047562 ≈ 1.049.
Interpretation: for each additional shot made by the team relative to its opponent (a one-unit increase in s_diff), the odds of the team being classified as the home team increase by approximately 4.9%, holding the other predictors constant.
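These quantities can be computed directly from the fitted model; a minimal sketch, assuming the glm object above is stored as logit_model (a name assumed here, not shown in the original):
exp(coef(logit_model))   # odds ratios; exp(0.047562) ≈ 1.049 for s_diff
exp(cbind(OR = coef(logit_model), confint.default(logit_model)))   # with Wald 95% confidence intervals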
3.a)
[1] "Away" "Home"
[1] "Away" "Home"
Call:
glm(formula = home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff +
    f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht,
    family = binomial(link = "logit"), data = pl_training)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.145392   0.188527  -0.771   0.4406
ftg_diff    -0.147175   0.118998  -1.237   0.2162
htg_diff     0.096657   0.184058   0.525   0.5995
s_diff       0.046414   0.018210   2.549   0.0108 *
st_diff     -0.020422   0.040914  -0.499   0.6177
f_diff      -0.019706   0.019094  -1.032   0.3021
c_diff       0.018899   0.025457   0.742   0.4578
y_diff      -0.004911   0.056662  -0.087   0.9309
r_diff       0.226011   0.260462   0.868   0.3855
wdl_ftLose  -0.425153   0.301097  -1.412   0.1579
wdl_ftWin    0.563351   0.300776   1.873   0.0611 .
wdl_htLose   0.091217   0.320569   0.285   0.7760
wdl_htWin    0.178392   0.325737   0.548   0.5839
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 831.75 on 599 degrees of freedom
Residual deviance: 780.45 on 587 degrees of freedom
AIC: 806.45
Number of Fisher Scoring iterations: 4
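Because the first model (without wdl_ft and wdl_ht) is nested in this one, the contribution of the added factors can be assessed with a likelihood-ratio (deviance) test; a sketch, assuming the two fits are stored as logit_model and logit_model2 (names assumed):
# Residual deviance drops from 787.27 to 780.45 (a difference of 6.82 on 4 df, p ≈ 0.15),
# so the half-time/full-time win-draw-lose factors do not add significant explanatory power.
anova(logit_model, logit_model2, test = "Chisq")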
3b.)
2.) Clustering Baseball players
1.)
'data.frame': 82 obs. of 6 variables:
$ playerID : chr "aaronha01" "alomaro01" "aparilu01" "bagweje01" ...
$ hits : int 3771 2724 2677 2314 2583 2048 2150 3060 3010 1779 ...
$ runs : int 2174 1508 1335 1517 1305 1091 1175 1844 1513 861 ...
$ home_runs : int 755 210 83 449 512 389 358 291 118 68 ...
$ rbi : int 2297 1134 791 1529 1636 1376 1430 1175 1014 789 ...
$ stolen_bases: int 240 474 506 202 50 68 30 414 24 51 ...
2.) For hierarchical clustering, the need for scaling depends on the scale of the variables. Since hierarchical clustering relies on distance metrics, variables with larger scales can dominate the clustering process. However, if the variables in the dataset are on similar scales, scaling may not be necessary.
On the other hand, for K-means clustering, scaling is generally recommended. K-means clustering partitions the data based on Euclidean distance, making it sensitive to the scale of the variables. Scaling ensures that all variables contribute equally to the clustering process, preventing biases due to differences in variable scales.
In summary, while hierarchical clustering may not require scaling if the variables are on similar scales, the attributes in this dataset are not on similar scales, so it is beneficial to scale the data before performing K-means clustering (and hierarchical clustering) to ensure sensible results.
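A quick way to see why scaling matters for these data is to compare the spread of the raw attributes; a sketch, using the attribute set selected in 3.a below:
raw_stats <- baseball_data[c("hits", "runs", "home_runs", "rbi", "stolen_bases")]
sapply(raw_stats, sd)      # career hits and RBI vary over a much wider numeric range
sapply(raw_stats, range)   # than home runs or stolen bases, so unscaled Euclidean
                           # distances would be dominated by the high-count statistics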
3.a)
baseball_data <- read.csv("baseball_hof.csv")
sum(is.na(baseball_data))   # check for missing values
[1] 0
attributes <- c("hits", "runs", "home_runs", "rbi", "stolen_bases")
data_for_clustering <- baseball_data[attributes]   # keep only the five career statistics
scaled_data <- scale(data_for_clustering)          # standardise before clustering
num_clusters <- 3
set.seed(123)
kmeans_result <- kmeans(scaled_data, centers = num_clusters)
library(ggplot2)
cluster_plot <- ggplot(data_for_clustering, aes(x = hits, y = runs, color = factor(kmeans_result$cluster))) +
geom_point(size = 3) +
labs(title = "Clustering of Baseball Players",
x = "hits",
y = "runs",
color = "Cluster") +
theme_minimal()
print(cluster_plot)
dist_matrix <- dist(scaled_data) # Compute the distance matrix
hclust_result <- hclust(dist_matrix, method = 'ward.D') # Perform hierarchical clustering
plot(hclust_result, hang = -1, cex = 0.6, main = "Hierarchical Clustering Dendrogram")
3b+c)
The heatmap shows evidence of clustering: light-coloured blocks appear along the diagonal, most clearly near the top, and they become less distinct further down the diagonal.
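For reference, a minimal sketch of one way to draw such a distance heatmap (the original call and its arguments are not shown; this version uses base R's heatmap() on the scaled distance matrix):
heatmap(as.matrix(dist(scaled_data)), symm = TRUE,
        main = "Distances between players (scaled statistics)")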
Silhouette of 82 units in 4 clusters from silhouette.default(x = clusters1, dist = b1) :
Cluster sizes and average silhouette widths:
17 25 30 10
0.3202221 0.2078566 0.2958375 0.4325219
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1771 0.2100 0.3257 0.2907 0.4224 0.5725
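The call above refers to objects named clusters1 and b1; a sketch of how they are presumably constructed (assuming the dendrogram is cut into 4 clusters and distances are computed on the scaled data):
library(cluster)                            # provides silhouette()
clusters1 <- cutree(hclust_result, k = 4)   # cut the dendrogram into 4 clusters
b1 <- dist(scaled_data)                     # distance matrix used for the silhouette
sil_hc <- silhouette(clusters1, b1)
summary(sil_hc)                             # cluster sizes and average widths as above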
4.)
hits runs home_runs rbi stolen_bases
1 2462.586 1387.5517 417.7931 1534.241 95.96552
2 2810.833 1536.0000 146.4444 1034.556 530.83333
3 1833.385 973.0769 246.5385 1076.077 61.23077
4 3232.818 1780.5455 387.3636 1753.727 257.45455
cluster_assignments
1 2 3 4
29 18 13 22
Average silhouette width by cluster: Cluster 1: 0.320; Cluster 2: 0.208; Cluster 3: 0.296; Cluster 4: 0.433.
Interpretation: silhouette widths closer to 1 indicate well-separated clusters. Cluster 4 has the highest average silhouette width, suggesting its members are the most cohesive and best separated from the other clusters. Cluster 2 has the lowest average width, indicating that its members overlap with, or are poorly separated from, neighbouring clusters.
cluster neighbor sil_width
[1,] 4 1 0.384506916
[2,] 2 1 0.312982837
[3,] 2 1 0.418236473
[4,] 1 3 0.562121628
[5,] 1 4 0.566132188
[6,] 3 1 0.095805288
[7,] 1 3 0.147925307
[8,] 2 4 0.131414280
[9,] 2 1 0.188914215
[10,] 3 1 0.545892803
[11,] 4 1 0.313885438
[12,] 2 4 0.427358174
[13,] 3 1 0.468706901
[14,] 2 1 0.398204049
[15,] 3 1 0.298848165
[16,] 4 2 0.250382372
[17,] 3 1 0.553761033
[18,] 2 4 0.041698566
[19,] 1 3 0.295765391
[20,] 1 4 0.327251906
[21,] 3 1 0.470203252
[22,] 1 3 0.465023241
[23,] 1 3 0.411872464
[24,] 1 4 0.147562851
[25,] 2 1 0.292846775
[26,] 3 1 0.459970799
[27,] 4 1 -0.103113559
[28,] 1 2 0.541979678
[29,] 2 4 0.283954071
[30,] 3 1 0.512015506
[31,] 1 2 0.438077081
[32,] 2 4 0.131430652
[33,] 4 1 -0.013639305
[34,] 1 4 0.427821152
[35,] 1 4 0.283669262
[36,] 4 1 0.151655201
[37,] 2 1 0.461654552
[38,] 1 3 0.321096503
[39,] 3 1 0.527598880
[40,] 4 2 0.217706370
[41,] 2 1 0.091862035
[42,] 1 2 0.507613798
[43,] 2 1 0.266623036
[44,] 1 3 0.545351705
[45,] 4 1 0.428181944
[46,] 1 3 0.407883494
[47,] 1 3 0.406757388
[48,] 4 2 0.040201876
[49,] 2 1 0.260067139
[50,] 4 1 0.394498616
[51,] 4 1 0.461390102
[52,] 4 1 0.160488989
[53,] 1 4 0.446538852
[54,] 3 1 0.061403663
[55,] 3 1 0.103752437
[56,] 2 1 0.403265905
[57,] 1 3 0.536215579
[58,] 4 1 0.343375029
[59,] 1 2 0.269724218
[60,] 4 1 0.236744965
[61,] 3 1 0.532695232
[62,] 1 2 0.209376924
[63,] 4 1 0.203585823
[64,] 2 1 -0.056968553
[65,] 1 3 0.526979514
[66,] 4 1 0.005527059
[67,] 2 1 0.228016213
[68,] 2 3 0.310036150
[69,] 1 3 0.107587752
[70,] 4 2 0.295883103
[71,] 1 3 0.391317643
[72,] 3 1 0.235144270
[73,] 1 4 0.560058764
[74,] 1 4 0.540374315
[75,] 1 3 0.193043689
[76,] 4 2 0.256366507
[77,] 4 2 0.018085437
[78,] 1 4 0.479425546
[79,] 1 4 0.155803449
[80,] 4 1 0.364188344
[81,] 4 1 0.488213864
[82,] 4 2 0.147591607
attr(,"Ordered")
[1] FALSE
attr(,"call")
silhouette.default(x = cluster_assignments, dist = dist(data_for_clustering))
attr(,"class")
[1] "silhouette"
[1] 0.3137006
The silhouette analysis provides insight into the quality of the clustering solution. The average silhouette width of approximately 0.314 indicates modest separation: the clusters are distinguishable, but not strongly separated.
Cluster hits runs home_runs rbi stolen_bases
1 1 2462.586 1387.5517 417.7931 1534.241 95.96552
2 2 2810.833 1536.0000 146.4444 1034.556 530.83333
3 3 1833.385 973.0769 246.5385 1076.077 61.23077
4 4 3232.818 1780.5455 387.3636 1753.727 257.45455
[1] 4 2 2 1 1 3 1 2 2 3 1 2 3 2 3 2 3 2 3 1 3 1 1 4 2 3 4 1 2 3 1 2 1 1 1 1 2 1
[39] 3 2 3 1 3 1 4 1 3 2 2 4 4 4 1 3 3 2 1 4 1 4 3 1 4 3 1 1 2 2 1 2 1 3 1 1 3 2
[77] 2 1 4 4 4 2
hits runs home_runs rbi stolen_bases
1 -0.2044605 -0.1624618 0.5660315 0.3798416 -0.4849247
2 0.7521677 0.5537557 -0.9804896 -0.5809428 1.2098245
3 -1.1508405 -1.2010894 -0.5701789 -0.8728960 -0.4647733
4 0.8494748 1.1682310 1.2734836 1.4407847 -0.3236583
[1] 28 22 19 13
[1] 3
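These summaries come directly from the fitted k-means object; a sketch, assuming it is stored as kmeans1 as in the silhouette call below:
aggregate(data_for_clustering,
          by = list(cluster = kmeans1$cluster), FUN = mean)   # cluster profiles on the original scale
kmeans1$centers   # cluster centres on the scaled variables
kmeans1$size      # cluster sizes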
sil_kmeans1 <- silhouette(kmeans1$cluster, b1)
summary(sil_kmeans1)
Silhouette of 82 units in 4 clusters from silhouette.default(x = kmeans1$cluster, dist = b1) :
Cluster sizes and average silhouette widths:
28 22 19 13
0.3322757 0.2178155 0.2687416 0.3421893
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.06303 0.18425 0.28984 0.28842 0.41869 0.54977
5.a)
Based on the overall average silhouette widths (approximately 0.291 for hierarchical clustering versus 0.288 for K-means), hierarchical clustering produced clusters of slightly higher quality, but the difference is not substantial.
5b.) Hierarchical clustering: Cluster 1: size 17, average silhouette width 0.320; Cluster 2: size 25, width 0.208; Cluster 3: size 30, width 0.296; Cluster 4: size 10, width 0.433.
K-means: Cluster 1: size 28, average silhouette width 0.332; Cluster 2: size 22, width 0.218; Cluster 3: size 19, width 0.269; Cluster 4: size 13, width 0.342.
While both algorithms produced four clusters, there are noticeable differences in the cluster profiles:
Cluster sizes: The sizes differ between the two algorithms (and the numeric labels are arbitrary, so "Cluster 1" in one solution need not correspond to "Cluster 1" in the other). For example, the largest hierarchical cluster contains 30 players against 28 for the largest K-means cluster, while the smallest hierarchical cluster (10) is smaller than the smallest K-means cluster (13).
Average silhouette width: The overall average silhouette width of K-means is only slightly lower than that of hierarchical clustering, and the per-cluster widths are broadly similar. Cluster 4 in the hierarchical solution stands out with the highest average width (0.433), suggesting it is more cohesive and better separated than any single K-means cluster.
In conclusion, the two algorithms produce broadly similar clusterings, with comparable silhouette widths but noticeably different cluster sizes and, for some clusters, different degrees of cohesion. These differences arise from the different methodologies and assumptions of the two algorithms.
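Because the numeric labels produced by the two algorithms are arbitrary, a cross-tabulation of the two assignments gives a more direct view of how the partitions overlap; a sketch, assuming clusters1 (hierarchical) and kmeans1 (k-means) from above:
table(hierarchical = clusters1, kmeans = kmeans1$cluster)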