Classification Assignment

Author

Jeff Lynskey

  1. Import the possession_training.csv and possession_testing.csv datasets into R.
Rows: 3,500
Columns: 6
$ outcome          <fct> No Shot, No Shot, Shot, No Shot, No Shot, No Shot, No…
$ duration_sec     <dbl> 2, 3, 4, 5, 5, 16, 17, 24, 24, 33, 36, 36, 40, 43, 45…
$ half             <chr> "1st Half", "1st Half", "1st Half", "1st Half", "1st …
$ num_passes       <dbl> 0, 1, 0, 1, 2, 5, 2, 4, 4, 4, 9, 4, 11, 10, 9, 2, 0, …
$ starting_area    <chr> "Defence", "Midfield", "Attack", "Attack", "Attack", …
$ method_gained_by <chr> "KickoutOwn", "ThrowIn", "KickoutOpp", "KickoutOpp", …
tibble [3,500 × 6] (S3: tbl_df/tbl/data.frame)
 $ outcome         : Factor w/ 2 levels "No Shot","Shot": 1 1 2 1 1 1 1 2 2 1 ...
 $ duration_sec    : num [1:3500] 2 3 4 5 5 16 17 24 24 33 ...
 $ half            : chr [1:3500] "1st Half" "1st Half" "1st Half" "1st Half" ...
 $ num_passes      : num [1:3500] 0 1 0 1 2 5 2 4 4 4 ...
 $ starting_area   : chr [1:3500] "Defence" "Midfield" "Attack" "Attack" ...
 $ method_gained_by: chr [1:3500] "KickoutOwn" "ThrowIn" "KickoutOpp" "KickoutOpp" ...

Classify Outcome of a Possession in Gaelic Football 

  1. Classification Tree Method:
  1. Create and visualise a classification tree model that will allow you to predict if a possession will end in a shot.

   starting_area     duration_sec       num_passes method_gained_by 
      112.065708        93.585281        32.699048         5.020654 
Call:
rpart(formula = outcome ~ duration_sec + half + num_passes + 
    starting_area + method_gained_by, data = training, method = "class")
  n= 3500 

          CP nsplit rel error    xerror       xstd
1 0.07517241      0 1.0000000 1.0000000 0.02009828
2 0.05172414      1 0.9248276 0.9248276 0.01983530
3 0.01896552      2 0.8731034 0.8731034 0.01960453
4 0.01206897      4 0.8351724 0.8379310 0.01942358
5 0.01000000      6 0.8110345 0.8193103 0.01931969

Variable importance
   starting_area     duration_sec       num_passes method_gained_by 
              46               38               13                2 

Node number 1: 3500 observations,    complexity param=0.07517241
  predicted class=Shot     expected loss=0.4142857  P(node) =1
    class counts:  1450  2050
   probabilities: 0.414 0.586 
  left son=2 (733 obs) right son=3 (2767 obs)
  Primary splits:
      duration_sec     < 11.5 to the left,  improve=47.5108100, (0 missing)
      num_passes       < 3.5  to the left,  improve=34.3025500, (0 missing)
      starting_area    splits as  RLR,      improve=26.4026000, (0 missing)
      method_gained_by splits as  RLLL,     improve=11.6237300, (0 missing)
      half             splits as  LR,       improve= 0.2413971, (0 missing)
  Surrogate splits:
      num_passes    < 1.5  to the left,  agree=0.884, adj=0.446, (0 split)
      starting_area splits as  LRR,      agree=0.812, adj=0.101, (0 split)

Node number 2: 733 observations,    complexity param=0.05172414
  predicted class=No Shot  expected loss=0.425648  P(node) =0.2094286
    class counts:   421   312
   probabilities: 0.574 0.426 
  left son=4 (206 obs) right son=5 (527 obs)
  Primary splits:
      starting_area    splits as  RLR,      improve=79.4072000, (0 missing)
      method_gained_by splits as  RLRL,     improve=28.5039100, (0 missing)
      duration_sec     < 3.5  to the left,  improve= 8.9659710, (0 missing)
      half             splits as  LR,       improve= 0.3425773, (0 missing)
      num_passes       < 0.5  to the right, improve= 0.2714718, (0 missing)

Node number 3: 2767 observations,    complexity param=0.01896552
  predicted class=Shot     expected loss=0.3718829  P(node) =0.7905714
    class counts:  1029  1738
   probabilities: 0.372 0.628 
  left son=6 (1740 obs) right son=7 (1027 obs)
  Primary splits:
      starting_area    splits as  RLR,      improve=15.5776800, (0 missing)
      duration_sec     < 21.5 to the left,  improve= 7.4927620, (0 missing)
      num_passes       < 3.5  to the left,  improve= 6.5723780, (0 missing)
      method_gained_by splits as  RLLL,     improve= 5.4089490, (0 missing)
      half             splits as  LR,       improve= 0.1080947, (0 missing)
  Surrogate splits:
      method_gained_by splits as  RLRL,     agree=0.748, adj=0.322, (0 split)
      duration_sec     < 14.5 to the right, agree=0.668, adj=0.106, (0 split)
      num_passes       < 3.5  to the right, agree=0.654, adj=0.068, (0 split)

Node number 4: 206 observations
  predicted class=No Shot  expected loss=0.05339806  P(node) =0.05885714
    class counts:   195    11
   probabilities: 0.947 0.053 

Node number 5: 527 observations,    complexity param=0.01206897
  predicted class=Shot     expected loss=0.4288425  P(node) =0.1505714
    class counts:   226   301
   probabilities: 0.429 0.571 
  left son=10 (420 obs) right son=11 (107 obs)
  Primary splits:
      starting_area    splits as  R-L,      improve=12.2843700, (0 missing)
      method_gained_by splits as  RLRR,     improve=10.0651800, (0 missing)
      duration_sec     < 3.5  to the left,  improve= 9.3185150, (0 missing)
      half             splits as  LR,       improve= 1.5458770, (0 missing)
      num_passes       < 1.5  to the left,  improve= 0.1479558, (0 missing)

Node number 6: 1740 observations,    complexity param=0.01896552
  predicted class=Shot     expected loss=0.4126437  P(node) =0.4971429
    class counts:   718  1022
   probabilities: 0.413 0.587 
  left son=12 (537 obs) right son=13 (1203 obs)
  Primary splits:
      duration_sec     < 21.5 to the left,  improve=29.82675000, (0 missing)
      num_passes       < 3.5  to the left,  improve=16.45991000, (0 missing)
      method_gained_by splits as  LR-R,     improve= 0.35440280, (0 missing)
      half             splits as  LR,       improve= 0.04169312, (0 missing)
  Surrogate splits:
      num_passes < 4.5  to the left,  agree=0.799, adj=0.35, (0 split)

Node number 7: 1027 observations
  predicted class=Shot     expected loss=0.3028238  P(node) =0.2934286
    class counts:   311   716
   probabilities: 0.303 0.697 

Node number 10: 420 observations,    complexity param=0.01206897
  predicted class=Shot     expected loss=0.4833333  P(node) =0.12
    class counts:   203   217
   probabilities: 0.483 0.517 
  left son=20 (53 obs) right son=21 (367 obs)
  Primary splits:
      duration_sec     < 3.5  to the left,  improve=14.594390, (0 missing)
      method_gained_by splits as  RLRL,     improve= 6.720283, (0 missing)
      num_passes       < 1.5  to the left,  improve= 1.729591, (0 missing)
      half             splits as  LR,       improve= 1.459269, (0 missing)

Node number 11: 107 observations
  predicted class=Shot     expected loss=0.2149533  P(node) =0.03057143
    class counts:    23    84
   probabilities: 0.215 0.785 

Node number 12: 537 observations
  predicted class=No Shot  expected loss=0.4487896  P(node) =0.1534286
    class counts:   296   241
   probabilities: 0.551 0.449 

Node number 13: 1203 observations
  predicted class=Shot     expected loss=0.3507897  P(node) =0.3437143
    class counts:   422   781
   probabilities: 0.351 0.649 

Node number 20: 53 observations
  predicted class=No Shot  expected loss=0.1698113  P(node) =0.01514286
    class counts:    44     9
   probabilities: 0.830 0.170 

Node number 21: 367 observations
  predicted class=Shot     expected loss=0.4332425  P(node) =0.1048571
    class counts:   159   208
   probabilities: 0.433 0.567 
         Actual
Predicted No Shot Shot
  No Shot     313  160
  Shot        593 1077
[1] 0.6486234
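The accuracy printed above can be reproduced from the testing confusion matrix, and the training accuracy, which is not printed, can be recovered from the rpart summary. A minimal sketch, with the counts typed in by hand from the output above (the object names are illustrative, not necessarily those used in the report):

```r
# Testing confusion matrix copied from the output above
# (rows = predicted, columns = actual; filled column by column).
conf_test <- matrix(
  c(313, 593, 160, 1077),
  nrow = 2,
  dimnames = list(Predicted = c("No Shot", "Shot"),
                  Actual    = c("No Shot", "Shot"))
)

# Accuracy = correctly classified possessions / all possessions.
test_accuracy <- sum(diag(conf_test)) / sum(conf_test)

# Training accuracy from the rpart summary: the root node misclassifies
# 1450 of 3500 possessions, and the final tree (6 splits) has a relative
# error of 0.8110345 of that root error.
n_train        <- 3500
root_errors    <- 1450
rel_error      <- 0.8110345
train_errors   <- round(rel_error * root_errors)
train_accuracy <- 1 - train_errors / n_train

round(c(training = train_accuracy, testing = test_accuracy), 4)
# training 0.6640, testing 0.6486
```

The two figures are close, which is the comparison that matters when judging whether the tree overfits.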
  1. Interpret the classification tree:
  2. Clearly state one rule for predicting if a possession will end in a shot. Your answer should also address how pure the node is.

Rule for predicting a Shot

From the classification tree, we can see that:

If we follow the right-hand side of the tree, where the possession lasts at least 12 seconds (duration_sec >= 11.5) and does not start in Defence, the tree predicts a Shot.

This rule leads to terminal node 7, which contains 1,027 training possessions.

Of those, 716 ended in a shot and 311 did not.

This gives a class purity of 69.7% for the “Shot” outcome.

The node is therefore only moderately pure: roughly 7 in 10 possessions with these characteristics ended in a shot, so the rule is useful but far from certain.
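Node purity is the share of a terminal node's possessions that belong to its predicted class. A minimal sketch, with the class counts of all seven terminal nodes typed in from the rpart summary above (the data frame and its column names are illustrative):

```r
# Class counts of the terminal nodes, copied from the rpart summary.
leaves <- data.frame(
  node      = c(4, 7, 11, 12, 13, 20, 21),
  predicted = c("No Shot", "Shot", "Shot", "No Shot", "Shot", "No Shot", "Shot"),
  no_shot   = c(195, 311, 23, 296, 422, 44, 159),
  shot      = c(11, 716, 84, 241, 781, 9, 208)
)

# Purity = share of the node that belongs to its majority (predicted) class.
leaves$n      <- leaves$no_shot + leaves$shot
leaves$purity <- pmax(leaves$no_shot, leaves$shot) / leaves$n

# Most pure nodes first.
leaves[order(-leaves$purity), ]
```

Sorting by purity makes it easy to see which rules are reliable (node 4) and which are little better than a coin toss (nodes 12 and 21).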

3b.ii – Rule for Predicting No Shot

From the classification tree, we can identify the following rule:

If a possession lasts less than 12 seconds (duration_sec < 11.5) and starts in the Defence area, it is very likely to end in No Shot.

This rule leads to terminal node 4, which contains 206 training possessions.

Of those, 195 ended in No Shot and only 11 ended in a Shot.

The purity of the node is 94.7% for the “No Shot” class. This makes it a highly reliable rule: a short possession beginning in the defensive area rarely results in a shot.

Classification Tree Accuracy Assessment

The classification tree model achieved the following accuracy:

Training dataset accuracy: 66.4% (1,176 of 3,500 possessions misclassified, from the relative error of 0.811 on a root error of 1,450/3,500 in the rpart summary)

Testing dataset accuracy: 64.9% (1,390 of 2,143 possessions classified correctly)

The testing confusion matrix shows that the model is much better at identifying “Shot” outcomes (1,077 of 1,237 correct) than “No Shot” outcomes (313 of 906 correct), so it tends to over-predict shots.

3c.i – Is the Model Overfitting?

The difference in accuracy between training and testing is only about 1.5 percentage points, which suggests that the model is not overfitting the training dataset to any meaningful degree.

Overfitting occurs when the model learns patterns that are too specific to the training data and do not generalize well.

In this case, the model performs almost as well on new, unseen possessions as on the data it was trained on. Its limitation is modest accuracy overall rather than poor generalization.

Conclusion: No, the classification tree does not appear to overfit the training data. The cross-validated error in the rpart summary (xerror 0.819) is also close to the training relative error (0.811), and pruning is not required for this assignment.

  1. Prepare the response variable
  2. Clearly state the regression equation

Call:
glm(formula = outcome ~ duration_sec + half + num_passes + starting_area + 
    method_gained_by, family = binomial, data = training)

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 1.10081    0.22272   4.943 7.71e-07 ***
duration_sec                0.01865    0.00372   5.015 5.31e-07 ***
half2nd Half                0.02988    0.07088   0.422   0.6733    
num_passes                  0.04062    0.01767   2.299   0.0215 *  
starting_areaDefence       -1.51232    0.21721  -6.963 3.34e-12 ***
starting_areaMidfield      -0.85584    0.21380  -4.003 6.25e-05 ***
method_gained_byKickoutOwn -0.29302    0.12448  -2.354   0.0186 *  
method_gained_byThrowIn    -0.24436    0.26838  -0.910   0.3626    
method_gained_byTurnover   -0.19801    0.12488  -1.586   0.1128    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4748.7  on 3499  degrees of freedom
Residual deviance: 4549.4  on 3491  degrees of freedom
AIC: 4567.4

Number of Fisher Scoring iterations: 4
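Reading the coefficients off the summary above, the fitted equation on the log-odds scale is: log(p / (1 - p)) = 1.10081 + 0.01865 × duration_sec + 0.02988 × [2nd Half] + 0.04062 × num_passes - 1.51232 × [Defence] - 0.85584 × [Midfield] - 0.29302 × [KickoutOwn] - 0.24436 × [ThrowIn] - 0.19801 × [Turnover], where p is the probability of a Shot and each bracketed term is 1 when it applies and 0 otherwise. The baseline levels (1st Half, Attack, KickoutOpp) are inferred from which levels have no coefficient. The sketch below evaluates the equation; the function name and the example possession are hypothetical.

```r
# Coefficients copied from the glm summary above.
coefs <- c(
  intercept   = 1.10081,
  duration    = 0.01865,
  second_half = 0.02988,
  passes      = 0.04062,
  defence     = -1.51232,
  midfield    = -0.85584,
  kickout_own = -0.29302,
  throw_in    = -0.24436,
  turnover    = -0.19801
)

# Hypothetical helper: predicted probability of a Shot for one possession.
shot_probability <- function(duration_sec, num_passes,
                             half = "1st Half",
                             starting_area = "Attack",
                             method_gained_by = "KickoutOpp") {
  log_odds <- coefs[["intercept"]] +
    coefs[["duration"]]    * duration_sec +
    coefs[["passes"]]      * num_passes +
    coefs[["second_half"]] * (half == "2nd Half") +
    coefs[["defence"]]     * (starting_area == "Defence") +
    coefs[["midfield"]]    * (starting_area == "Midfield") +
    coefs[["kickout_own"]] * (method_gained_by == "KickoutOwn") +
    coefs[["throw_in"]]    * (method_gained_by == "ThrowIn") +
    coefs[["turnover"]]    * (method_gained_by == "Turnover")
  plogis(log_odds)  # inverse logit: exp(x) / (1 + exp(x))
}

# Example: a 20-second, 4-pass possession starting in Defence after a turnover.
shot_probability(20, 4, starting_area = "Defence", method_gained_by = "Turnover")
```

With all inputs at their baseline and zero duration and passes, the function returns about 0.75, which is the probability implied by the intercept alone.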
  1. Significant predictors (based on p-values) From summary(logit_model):

Significant predictors are those with p-values < 0.05. These indicate that the variable contributes meaningfully to the prediction of a shot.

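Variable                     p-value   Interpretation
duration_sec                 < 0.001   Longer possessions increase shot likelihood
num_passes                   0.022     More passes increase shot likelihood
starting_areaDefence         < 0.001   Much less likely to shoot when starting in defence (vs. Attack)
starting_areaMidfield        < 0.001   Less likely to shoot when starting in midfield (vs. Attack)
method_gained_byKickoutOwn   0.019     Less likely to shoot after winning own kickout (vs. opposition kickout)

half (p = 0.673), method_gained_byThrowIn (p = 0.363) and method_gained_byTurnover (p = 0.113) are not significant at the 0.05 level.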

  1. Impact on odds (odds ratios)
               (Intercept)               duration_sec 
                 3.0066033                  1.0188284 
              half2nd Half                 num_passes 
                 1.0303339                  1.0414578 
      starting_areaDefence      starting_areaMidfield 
                 0.2203983                  0.4249263 
method_gained_byKickoutOwn    method_gained_byThrowIn 
                 0.7460099                  0.7832034 
  method_gained_byTurnover 
                 0.8203633 

Variable                Odds Ratio   Interpretation
duration_sec            1.019        Each additional second increases odds of a shot by 1.9%
num_passes              1.041        Each additional pass increases odds of a shot by 4.1%
starting_areaMidfield   0.425        Starting in midfield reduces odds of a shot by 57.5% compared with starting in attack
starting_areaDefence    0.220        Starting in defence reduces odds of a shot by 78% compared with starting in attack
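The odds ratios above are the exponentiated coefficients, and the percentage change in the odds is (odds ratio - 1) × 100. A minimal sketch using four coefficients copied from the glm summary (the object names are illustrative):

```r
# Log-odds coefficients copied from the glm summary.
log_odds <- c(
  duration_sec          = 0.01865,
  num_passes            = 0.04062,
  starting_areaDefence  = -1.51232,
  starting_areaMidfield = -0.85584
)

odds_ratio <- exp(log_odds)
pct_change <- (odds_ratio - 1) * 100

round(data.frame(odds_ratio, pct_change), 3)

# Effects multiply on the odds scale: ten extra seconds of possession
# scale the odds by exp(10 * coefficient), not by 10 * 1.9%.
ten_second_ratio <- exp(10 * log_odds[["duration_sec"]])
ten_second_ratio
```

The last line shows why per-unit odds ratios should be compounded rather than added when describing larger changes in duration_sec or num_passes.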

  1. Fully assess the accuracy of the logistic regression model using both the training and the testing datasets.
         Actual
Predicted No Shot Shot
  No Shot     531  332
  Shot        919 1718
Training Accuracy: 64.26 %
         Actual
Predicted No Shot Shot
  No Shot     323  206
  Shot        583 1031
Testing Accuracy: 63.18 %

Accuracy Assessment of the Logistic Regression Model

The logistic regression model was evaluated using both the training and testing datasets. The predicted probabilities were converted to class labels using a cutoff of 0.5.

Training accuracy: 64.26%

Testing accuracy: 63.18%

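The confusion matrices show that the model performs similarly on both datasets, with only a small drop of about 1.1 percentage points in testing accuracy. In both cases it predicts “Shot” far more often than “No Shot”, so most of its errors are No Shot possessions classified as shots.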

Conclusion: The model does not appear to be overfitting. The small gap between training and testing accuracy suggests that the logistic regression model generalizes well to unseen data.
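The accuracy figures above, along with sensitivity and specificity, can be recomputed from the two confusion matrices. A minimal sketch with the counts typed in from the output; the helper names are hypothetical, and whether the report used `>` or `>=` at the 0.5 cutoff is an assumption:

```r
# Hypothetical helper: turn predicted probabilities into class labels.
classify <- function(prob, cutoff = 0.5) {
  factor(ifelse(prob > cutoff, "Shot", "No Shot"),
         levels = c("No Shot", "Shot"))
}
classify(c(0.31, 0.50, 0.74))   # No Shot, No Shot, Shot

# Confusion matrices copied from the output (rows = predicted, columns = actual).
make_cm <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(Predicted = c("No Shot", "Shot"),
                         Actual    = c("No Shot", "Shot")))
}
cm_train <- make_cm(c(531, 332, 919, 1718))
cm_test  <- make_cm(c(323, 206, 583, 1031))

# Accuracy, sensitivity (share of actual Shots found) and
# specificity (share of actual No Shots found).
assess <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["Shot", "Shot"] / sum(cm[, "Shot"]),
    specificity = cm["No Shot", "No Shot"] / sum(cm[, "No Shot"]))
}

round(rbind(training = assess(cm_train), testing = assess(cm_test)), 4)
```

Reporting sensitivity and specificity alongside accuracy shows where the errors fall: the model finds most shots but only about a third of the No Shot possessions.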

4a. Which model is more accurate?

Model                 Training Accuracy   Testing Accuracy
Classification Tree   66.4%               64.9%
Logistic Regression   64.26%              63.18%

Interpretation:

The classification tree was slightly more accurate on both datasets. Its training figure is derived from the rpart summary (relative error 0.811 on a root error of 1,450/3,500), and its testing figure comes from the testing confusion matrix.

Logistic regression was about 2 percentage points less accurate than the tree on both datasets, but it showed a slightly smaller gap between training and testing accuracy (about 1.1 points, against about 1.5 points for the tree).

Conclusion: The classification tree is marginally more accurate on unseen data (64.9% against 63.18%), although the difference is small and both models generalize similarly well.

4b. Comparison of Important Predictors

Classification Tree (based on variable importance scores):

Variable           Importance
starting_area      112.07
duration_sec       93.59
num_passes         32.70
method_gained_by   5.02

Logistic Regression (based on p-values and odds ratios):

Significant predictors:

duration_sec – longer possessions increase odds of a shot

num_passes – more passes increase odds of a shot

starting_area – starting in Defence or Midfield lowers shot probability

Interpretation:

Both models identified duration_sec, starting_area, and num_passes as the most important predictors.

The classification tree focuses on simple if-then rules, ideal for visual interpretation.

The logistic regression model provides statistical precision, including effect sizes and significance levels.

Conclusion: The two models agree on what’s important but present the information differently — the tree offers intuitive decision paths, while logistic regression offers quantified insights into how each variable affects the odds of a shot.

Most important predictors

The classification tree uses starting_area and duration_sec as the two most important variables for predicting whether a possession will end in a shot or not.

starting_area had the highest importance score (112.07), meaning it contributed most to splitting the data into purer groups.

duration_sec was also highly important (93.59) and provides the very first split in the tree.

num_passes played a moderate role (32.70). It never appears as a primary split in the fitted tree and earns its score as a surrogate for the splits on duration_sec and starting_area.

method_gained_by had minimal impact (5.02).

These values indicate that field position and how long the team held the ball are the best indicators of shot likelihood in this dataset.
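The raw importance scores are easier to compare as shares of the total, which is how the rpart summary reports them (46, 38, 13 and 2). A minimal sketch with the scores typed in from the output above:

```r
# Raw variable importance scores copied from the rpart output.
importance <- c(
  starting_area    = 112.065708,
  duration_sec     = 93.585281,
  num_passes       = 32.699048,
  method_gained_by = 5.020654
)

# Rescale so the scores sum to 100, then round to whole percentages.
importance_pct <- round(100 * importance / sum(importance))
importance_pct
# starting_area 46, duration_sec 38, num_passes 13, method_gained_by 2
```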

Basketball Assignment

  1. Import the college_basketball_players.csv file into R.

Rows: 2,876
Columns: 12
$ team        <chr> "Youngstown State Penguins", "Youngstown State Penguins", …
$ player      <chr> "Darius Quisenberry", "Naz Bohannon", "Michael Akuchie", "…
$ pos         <chr> "G", "F", "F", "G", "G", "G", "G", "F", "G", "G", "F", "G"…
$ fg_pct      <dbl> 0.420, 0.521, 0.409, 0.405, 0.412, 0.357, 0.387, 0.561, 0.…
$ two_p_pct   <dbl> 0.482, 0.525, 0.581, 0.483, 0.485, 0.426, 0.438, 0.582, 0.…
$ three_p_pct <dbl> 0.315, 0.250, 0.295, 0.213, 0.330, 0.328, 0.338, 0.000, 0.…
$ ft_pct      <dbl> 0.780, 0.570, 0.712, 0.653, 0.750, 0.846, 0.787, 0.350, 0.…
$ trb         <dbl> 110, 263, 185, 107, 92, 58, 71, 94, 34, 26, 195, 98, 219, …
$ ast         <dbl> 124, 71, 19, 24, 35, 32, 18, 9, 34, 4, 42, 44, 90, 25, 121…
$ stl         <dbl> 44, 20, 28, 15, 12, 15, 23, 8, 7, 3, 32, 16, 17, 17, 19, 3…
$ blk         <dbl> 2, 6, 22, 4, 6, 0, 16, 13, 1, 1, 20, 0, 44, 5, 5, 6, 2, 2,…
$ tov         <dbl> 79, 73, 27, 38, 33, 20, 23, 20, 26, 5, 51, 48, 58, 20, 53,…

Show actual column names

colnames(players)

[1] 0
         fg_pct   two_p_pct three_p_pct      ft_pct         trb        ast
[1,] -0.1195441 -0.04501742  0.11164631  0.60835344  0.26239810  2.4465644
[2,]  1.2271573  0.43803808 -0.42080843 -1.01149068  3.02109532  0.8722245
[3,] -0.2662146  1.06713362 -0.05218592  0.08383249  1.61470066 -0.6724109
[4,] -0.3195493 -0.03378357 -0.72389806 -0.37126658  0.20830600 -0.5238883
[5,] -0.2262135 -0.01131588  0.23452048  0.37694714 -0.06215451 -0.1971385
[6,] -0.9595658 -0.67411296  0.21813726  1.11744730 -0.67519833 -0.2862521
             stl        blk         tov
[1,]  1.87840853 -0.6141150  1.88560837
[2,] -0.01304374 -0.2482917  1.60529394
[3,]  0.61744035  1.2150014 -0.54378335
[4,] -0.40709630 -0.4312034 -0.02987356
[5,] -0.64352783 -0.2482917 -0.26346892
[6,] -0.40709630 -0.7970266 -0.87081685
  1. Does the data need to be scaled before computing the distance matrix for hierarchical clustering or before being entered into the K-means clustering algorithm? Explain your answer.

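Explanation: Yes, the data needs to be scaled for both methods.

Scaling is essential because the counting statistics (trb, ast, stl, blk, tov) range into the hundreds, while the shooting percentages lie between 0 and 1. Hierarchical clustering and K-means both rely on Euclidean distance, so without scaling the count variables would dominate the distances and the percentages would contribute almost nothing.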
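The effect of scaling can be seen on a small example. The sketch below uses the fg_pct and trb values of the first six players from the glimpse output; because it standardises over six rows only, the scaled values differ from those in the report, which were scaled over all 2,876 players.

```r
# First six players' field goal percentage and total rebounds.
players_small <- data.frame(
  fg_pct = c(0.420, 0.521, 0.409, 0.405, 0.412, 0.357),
  trb    = c(110, 263, 185, 107, 92, 58)
)

# Unscaled: the distance between players 1 and 2 is driven almost
# entirely by the 153-rebound difference.
dist_raw <- dist(players_small)[1]
dist_raw

# Scaled: each column now has mean 0 and standard deviation 1,
# so both attributes contribute on a comparable footing.
players_scaled <- scale(players_small)
round(colMeans(players_scaled), 10)
apply(players_scaled, 2, sd)
dist_scaled <- dist(players_scaled)[1]
dist_scaled
```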

  1. Hierarchical Clustering:
  1. Create a suitable distance matrix containing the Euclidean distance between all pairs of players.

  1. Create a 4-cluster solution and assess the quality of this solution using silhouette scores.
Average silhouette width: 0.094
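The steps above (distance matrix, hierarchical clustering, 4-cluster cut, silhouette) can be sketched as follows. This is an illustration on simulated data, not the report's own code: `toy_scaled` is a stand-in for the scaled player matrix, and complete linkage is an assumption based on the discussion later in this report.

```r
library(cluster)  # silhouette(); a recommended package normally installed with R

set.seed(101)
# Toy stand-in for the scaled player matrix: 60 "players", 3 attributes.
toy_scaled <- scale(matrix(rnorm(180), ncol = 3))

# Euclidean distance between every pair of rows.
dist_matrix <- dist(toy_scaled, method = "euclidean")

# Agglomerative clustering, then cut the dendrogram into 4 clusters.
hc          <- hclust(dist_matrix, method = "complete")
hc_clusters <- cutree(hc, k = 4)

# Silhouette width per observation and the overall average.
sil     <- silhouette(hc_clusters, dist_matrix)
avg_sil <- mean(sil[, "sil_width"])

table(hc_clusters)
avg_sil
```

An average silhouette width near 0, as in the report's 0.094, means most observations sit about as close to a neighbouring cluster as to their own.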

3e. Use tables and suitable graphs to profile the clusters, making sure to include answers to the following questions:

i. How do the clusters differ on their average performance for the 9 attributes included in the cluster analysis?

ii. How does the distribution of player position differ between clusters?

# A tibble: 4 × 10
  cluster fg_pct two_p_pct three_p_pct ft_pct   trb   ast   stl   blk   tov
  <fct>    <dbl>     <dbl>       <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1        0.433     0.481       0.333  0.733 118.  115.  38.0   6.06  73.5
2 2        0.431     0.494       0.299  0.703  89.8  35.6 18.7   7.79  35.5
3 3        0.329     0.323       0.320  0.634  41.0  15.9  9.72  3.30  15.2
4 4        0.500     0.540       0.269  0.686 228.   45.7 26.2  38.9   59.8
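A table of per-cluster means like the one above can be produced with `group_by()` and `summarise(across(...))`. The sketch below uses a small made-up data frame (`toy`) rather than the players data, and passes `na.rm = TRUE` through an anonymous function, which is the form current dplyr versions expect.

```r
library(dplyr)

# Toy stand-in for the players data with a cluster label attached.
toy <- tibble(
  cluster = factor(c(1, 1, 2, 2)),
  fg_pct  = c(0.42, 0.44, 0.50, NA),
  trb     = c(110, 126, 220, 236)
)

# Mean of every attribute within each cluster, ignoring missing values.
cluster_means <- toy |>
  group_by(cluster) |>
  summarise(across(fg_pct:trb, \(x) mean(x, na.rm = TRUE)))

cluster_means
```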

   
       C    F    G
  1    0    8  230
  2   55  791 1532
  3    2   34  105
  4   16   87   16
   
             C          F          G
  1 0.00000000 0.03361345 0.96638655
  2 0.02312868 0.33263246 0.64423886
  3 0.01418440 0.24113475 0.74468085
  4 0.13445378 0.73109244 0.13445378
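The second table is the first one converted to row proportions. A minimal sketch, with the counts typed in from the output above (the object name is illustrative):

```r
# Position counts per hierarchical cluster, copied from the output.
position_counts <- matrix(
  c( 0,   8,  230,
    55, 791, 1532,
     2,  34,  105,
    16,  87,   16),
  nrow = 4, byrow = TRUE,
  dimnames = list(cluster = as.character(1:4), pos = c("C", "F", "G"))
)

# margin = 1 divides each row by its row total, so every row sums to 1.
position_props <- prop.table(position_counts, margin = 1)
round(position_props, 3)

# Cluster sizes, for context when reading the proportions.
cluster_sizes <- rowSums(position_counts)
cluster_sizes
```

The cluster sizes matter here: cluster 2 holds 2,378 of the 2,876 players, so its proportions are close to the overall position mix.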

Quarto Report Write-up (Section 3e)

i. How do the clusters differ on performance?

The average values for each of the 9 performance metrics differ notably across clusters. For example:

Cluster 1 has the highest average ast (115), stl (38.0) and tov (73.5), along with the best three_p_pct (0.333) and ft_pct (0.733), indicating high-usage playmakers.

Cluster 2, the largest group (2,378 of 2,876 players), has middling values on every attribute, which points to more “average” role players.

Cluster 3 has the lowest averages on almost every attribute, including fg_pct (0.329) and trb (41.0), suggesting low-output players.

Cluster 4 has the highest fg_pct (0.500), two_p_pct (0.540), trb (228) and blk (38.9) but the lowest three_p_pct (0.269), indicating strong inside scorers and rebounders.

A grouped bar chart was used to visualize these attribute differences across clusters.

ii. How do positions differ by cluster?

The position distribution varies across clusters:

Cluster 1 is almost entirely guards (96.6%), consistent with its high assists and steals.

Cluster 2 is about 64% guards and 33% forwards, close to the overall position mix, again supporting its “generalist” profile.

Cluster 3 is mostly guards (74.5%) with some forwards (24.1%).

Cluster 4 is dominated by forwards (73.1%) and has the largest share of centers (13.4%), consistent with its high rebounds and blocks.

A stacked bar chart visualized the proportion of each position per cluster.

  1. K-means Clustering:
  1. Carry out a K-means clustering that will produce 4 clusters. Remember to use set.seed(101) to ensure that your results are reproducible

   1    2    3    4 
 626 1080  742  428 
# A tibble: 4 × 10
  cluster avg_fg_pct avg_two_p_pct avg_three_p_pct avg_ft_pct avg_trb avg_ast
  <fct>        <dbl>         <dbl>           <dbl>      <dbl>   <dbl>   <dbl>
1 1            0.474         0.557           0.249      0.606    70.6    17.1
2 2            0.374         0.424           0.319      0.729    65.0    29.5
3 3            0.425         0.476           0.337      0.755   111.     82.9
4 4            0.507         0.554           0.273      0.677   183.     36.6
# ℹ 3 more variables: avg_stl <dbl>, avg_blk <dbl>, avg_tov <dbl>
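A 4-cluster K-means run of the kind described above can be sketched as follows. The data are simulated (`toy_scaled` stands in for the scaled player matrix), and `nstart = 25` is an assumption; the report does not show which settings were used.

```r
set.seed(101)
# Toy stand-in for the scaled player matrix: 200 "players", 9 attributes.
toy_scaled <- scale(matrix(rnorm(200 * 9), ncol = 9))

# Reset the seed immediately before kmeans() so the random starting
# centres, and therefore the cluster labels, are reproducible.
set.seed(101)
kmeans_result <- kmeans(toy_scaled, centers = 4, nstart = 25)

table(kmeans_result$cluster)      # cluster sizes
round(kmeans_result$centers, 2)   # cluster centres on the scaled variables
```

Because K-means starts from random centres, running it without a fixed seed can return the same groups under different cluster numbers, which would make the written interpretation point at the wrong rows.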
  1. Assess the quality of the clustering solution by calculating silhouette scores.
Silhouette of 2876 units in 4 clusters from silhouette.default(x = kmeans_result$cluster, dist = dist_matrix) :
 Cluster sizes and average silhouette widths:
       626       1080        742        428 
0.09580838 0.23013642 0.23190000 0.12343252 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.11241  0.09224  0.18691  0.18547  0.28534  0.44773 
[1] 0.1854737

The average silhouette width was approximately 0.19, which falls below 0.25, a commonly used rule-of-thumb value beneath which a clustering is usually regarded as having no substantial structure. This indicates that the clusters overlap considerably, and many players are not clearly matched to their assigned clusters.

The silhouette plot visually confirms this, showing a large number of narrow bars and values close to zero, which suggests:

Substantial overlap between clusters,

Low cohesion within clusters,

Potential misclassification of certain observations.

While the 4-cluster K-means model still allows for some broad pattern identification, the low silhouette score indicates that this clustering solution should be interpreted with caution.
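The overall average of 0.185 is the size-weighted mean of the per-cluster averages, which makes it easy to see which clusters pull it down. A minimal sketch using the figures from the silhouette summary above:

```r
# Cluster sizes and per-cluster average silhouette widths from the summary.
cluster_size <- c(626, 1080, 742, 428)
cluster_sil  <- c(0.09580838, 0.23013642, 0.23190000, 0.12343252)

# Overall average = mean of the cluster averages, weighted by cluster size.
overall_sil <- weighted.mean(cluster_sil, w = cluster_size)
round(overall_sil, 4)   # 0.1855
```

Clusters 2 and 3 are the best defined (about 0.23 each), while clusters 1 and 4 (about 0.10 and 0.12) are the least cohesive.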

  1. Use tables and suitable graphs to profile the clusters, making sure to include answers to the following questions:
  2. How do the clusters differ on their average performance for the 9 attributes included in the cluster analysis?
# A tibble: 4 × 10
  cluster avg_fg_pct avg_two_p_pct avg_three_p_pct avg_ft_pct avg_trb avg_ast
  <fct>        <dbl>         <dbl>           <dbl>      <dbl>   <dbl>   <dbl>
1 1            0.474         0.557           0.249      0.606    70.6    17.1
2 2            0.374         0.424           0.319      0.729    65.0    29.5
3 3            0.425         0.476           0.337      0.755   111.     82.9
4 4            0.507         0.554           0.273      0.677   183.     36.6
# ℹ 3 more variables: avg_stl <dbl>, avg_blk <dbl>, avg_tov <dbl>

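This table allows us to identify cluster-specific characteristics. For instance:

Cluster 1 has the highest two-point percentage (0.557) but the lowest three-point (0.249) and free-throw (0.606) percentages and the fewest assists (17.1), which suggests interior finishers.

Cluster 2 has the lowest field goal percentage (0.374) and the fewest rebounds (65.0), which suggests lower-output perimeter players.

Cluster 3 has the most assists (82.9) and the best three-point (0.337) and free-throw (0.755) percentages, consistent with playmaking, shooting-oriented guards.

Cluster 4 has the highest field goal percentage (0.507) and by far the most rebounds (183), which suggests forwards or centers.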

  1. How does the distribution of player position differ between clusters?
   
      C   F   G
  1  31 342 253
  2   2 223 855
  3   0  49 693
  4  40 306  82

Interpretation

Cluster 1 is mostly Forwards (342 of 626, about 55%) with a large minority of Guards (about 40%), matching its interior scoring profile.

Cluster 2 is mostly Guards (855 of 1,080, about 79%) and contains almost no Centers.

Cluster 3 is almost entirely Guards (693 of 742, about 93%), supported by its high assists and strong three-point and free-throw shooting.

Cluster 4 is dominated by Forwards (306 of 428, about 71%) and contains the most Centers (40), aligning with its strong rebounding.

  1. Compare and contrast the clusters produced by Hierarchical Clustering and K-means. For example:
  1. Which algorithm produced the highest quality clusters?

To assess cluster quality, the average silhouette width was calculated for both clustering methods:

K-means (4 clusters): average silhouette width = 0.19

Hierarchical clustering (4 clusters): average silhouette width = 0.094

These values indicate that both clustering solutions are relatively weak, but K-means produced a noticeably higher silhouette score, suggesting it provided slightly better-defined clusters in terms of cohesion and separation.

Neither solution reached even the 0.25 level usually taken as the minimum for a weak cluster structure, but K-means was the more effective of the two for this dataset.

b. Did both algorithms produce clusters with a similar profile? Are there any noticeable differences?

While both K-means and Hierarchical Clustering grouped players based on the same 9 performance attributes, their resulting clusters showed some overlap in broad player types but also key differences in clarity and structure.

Similarities: Both algorithms identified a cluster made up largely of Guards, characterised by:

High assists

Above-average steals

Both also included a cluster of Forwards and Centers, typically showing:

High rebounds

High blocks

Fewer assists than the guard cluster and a lower three-point percentage, despite the highest field goal percentage

These similarities suggest that the main performance styles (e.g., playmakers vs post players) were robust enough to emerge under both methods.

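Differences: K-means clustering produced more evenly sized clusters (428 to 1,080 players) with more distinct statistical profiles. Each cluster had a clearer role, such as:

Playmaking guards (cluster 3: highest assists, three-point % and free-throw %)

Low-usage guards (cluster 2: lowest field goal % and rebounds)

Interior finishers (cluster 1: highest two-point %, lowest three-point % and free-throw %)

Rebounding big men (cluster 4: highest field goal % and rebounds)

Hierarchical clustering resulted in a much less balanced solution, with:

One very large cluster (cluster 2) holding 2,378 of the 2,876 players, with middling averages on every attribute

Three small clusters (238, 141 and 119 players) that capture the more extreme profiles

A mixed position distribution within the large cluster (about 64% guards, 33% forwards and 2% centers)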

This difference likely stems from how the methods operate:

K-means optimises groupings based on centroid distances, allowing for tighter, more differentiated clusters.

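Hierarchical clustering is greedy: once two groups have been merged they are never split again. Especially with complete linkage, where the merge order depends on the most distant pair of players in each group, outlying players can end up in a few small clusters while most of the data falls into one large, less coherent group.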