Classifying NBA Players and Clustering Footballers

Author

Michael Harding

Data Sourced From the 2017/18 NBA Season

PART A - CLASSIFICATION

Introduction

In September 2018 ESPN.com published a series of articles ranking the top players in the NBA for the upcoming 2018-19 season. One of the articles is available at https://www.espn.com/nba/story/_/id/24668720/nbarank-2018-19-1-10-best-players-season. It is assumed that data from the 2017/18 season played a large role in ESPN’s rankings. From that 2017/18 dataset I will use classification analysis to build a model that determines whether or not a player is a Top 50 player. I will use 11 attributes from the dataset:

• fg: average field goals made per game

• fgp: field goal percentage, the share of field goal attempts that were successful

• thr: average three-pointers made per game

• thrp: three-point percentage

• efg: effective field goal percentage, which adjusts field goal percentage to account for three-point field goals being worth three points while other field goals are worth two

• trb: average total rebounds per game

• ast: average assists per game

• stl: average steals per game

• blk: average blocks per game

• tov: average turnovers per game

• pf: average personal fouls per game

CLASSIFICATION TREE

This classification tree was built using the rpart package in R to predict whether an NBA player would be classified as a Top 50 player (Y) or outside the Top 50 (N). The tree was trained on 130 players, of whom 99 (76%) were not in the Top 50 and 31 (24%) were in the Top 50.

What Rules Can We See?

The clearest and most direct rule produced by the classification tree is as follows:

If a player makes 7.1 or more field goals per game (fg ≥ 7.1), then they are predicted to be a Top 50 NBA player.

This rule operates at the very first split of the tree (Node 1). Any player making 7.1 or more field goals per game immediately arrives at Node 3 and is classified as Top 50. This is the simplest and most powerful prediction the tree makes, requiring only a single decision based on a single variable.

Leaf node 3 is 81% pure — 17 out of 21 players meeting this condition are genuinely Top 50 players. The remaining 4 players represent false alarms. Despite these 4 misclassifications, Node 3 remains the tree’s most important positive prediction — reached with a single decision, covering 16% of all players in the dataset, and producing an 81% success rate.

The most reliable rule for predicting that a player falls outside the Top 50 requires three conditions to be met simultaneously:

If a player makes fewer than 7.1 field goals per game (fg < 7.1), and makes fewer than 2.25 three-pointers per game (thr < 2.25), and grabs fewer than 6.35 rebounds per game (trb < 6.35), then they are predicted to be outside the Top 50.

This path leads to Node 8, the largest and purest leaf node in the entire tree, covering 56% of all players in the training dataset (73 players). A player following this path scores modestly, makes few three-pointers, and does not contribute significantly on the boards.

Node 8 is 99% pure: 72 out of 73 players are correctly classified as not Top 50. This makes it by far the most confident and most reliable prediction the tree produces. Only a single player out of 73 was misclassified at this node. The logic is straightforward: a player who scores little, makes few threes, and does not rebound simply does not possess the statistical profile of a Top 50 NBA player. This node’s near-perfect purity and large coverage (56% of all players) make it the single most dependable rule in the classification tree.
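The two rules above can be written as a small R function. This is only a sketch of those two paths, not the fitted rpart object: apply_tree_rules is a hypothetical helper, and players who fall on the tree's other branches return NA because those splits are not fully described here. The three example players are taken from the training output shown later in this report.

```r
# Hypothetical helper encoding only the two rules described above.
# fg  = field goals made per game
# thr = three-pointers made per game
# trb = rebounds per game
apply_tree_rules <- function(fg, thr, trb) {
  ifelse(fg >= 7.1, "Y",
         ifelse(thr < 2.25 & trb < 6.35, "N", NA_character_))
}

# Three players from the training output shown later in this report
players <- data.frame(
  player = c("Blake Griffin", "Andre Iguodala", "Aaron Gordon"),
  fg  = c(7.5, 2.3, 6.5),
  thr = c(1.9, 0.5, 2.0),
  trb = c(7.4, 3.8, 7.9)
)
players$rule_prediction <- apply_tree_rules(players$fg, players$thr, players$trb)
print(players)

# Purity of the two leaves, from the counts quoted above
node3_purity <- 17 / 21   # Top 50 players among those with fg >= 7.1
node8_purity <- 72 / 73   # non-Top 50 players in the largest leaf
round(c(node3 = node3_purity, node8 = node8_purity), 3)
```

Blake Griffin lands in Node 3 and Andre Iguodala in Node 8, in line with the leaf probabilities printed for them in the training output; Aaron Gordon follows a branch this sketch does not cover.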

Most Important Variables for Predicting Top 50 Status

nba_model_tree$variable.importance
        fg        thr        tov         pf        blk        trb        ast 
16.3355244  6.1996468  5.4451748  3.5572075  2.8196389  2.7911430  1.8660357 
      thrp        stl        efg        fgp 
 1.3760600  1.0881536  0.8043276  0.8043276 

The variable importance scores are retrieved directly from the model using the code above “nba_model_tree$variable.importance”, which asks R to display how much each variable contributed to cleaning up the mix of Top 50 and non-Top 50 players at each split. These scores are derived from the Gini Impurity Reduction at each split — measuring how much more pure the resulting nodes were compared to the node before the split. The higher the score, the more that variable reduced the mix of classes across the tree.

Field goals made per game (fg) is overwhelmingly the most important variable with a score of 16.34, more than two and a half times the next highest score. This score came almost entirely from a single split at the root node, where splitting all 130 players on fg < 7.1 reduced impurity by 16.34 units in one step. Before the split the dataset was moderately mixed (76% N, 24% Y). After the split the tree produced Node 2 (87% N) and Node 3 (81% Y), a dramatic improvement in purity that no other variable came close to replicating.
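The figure of 16.34 can be reproduced by hand from the node counts quoted in this report. The sketch below assumes that rpart scores a split as the parent node's size-weighted Gini impurity minus the size-weighted impurity of its two children. The Node 2 counts (95 N, 14 Y) are inferred by subtracting Node 3 from the 130 training players, and gini is an illustrative helper rather than part of the original code.

```r
# Gini impurity of a node given its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

root  <- c(N = 99, Y = 31)   # all 130 training players
node3 <- c(N = 4,  Y = 17)   # fg >= 7.1 (21 players)
node2 <- root - node3        # fg <  7.1 (109 players: 95 N, 14 Y)

# Size-weighted impurity before and after the split
before <- sum(root) * gini(root)
after  <- sum(node2) * gini(node2) + sum(node3) * gini(node3)

improvement <- before - after
round(improvement, 4)   # 16.3355, matching the importance score for fg
```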

Three-pointers made per game (thr) is the second most important variable, with a score of 6.20. It forms the second primary split of the tree at Node 2, separating lower-scoring players into those who make a high volume of threes and those who do not. This reflects the modern NBA’s shift toward perimeter play: players who make more threes tend to be higher-usage offensive contributors, which correlates with Top 50 status. Its importance score is notably lower than that of fg (6.20 vs 16.34), which suggests that while relevant, it is a secondary rather than primary driver.

Turnovers (5.45) and personal fouls (3.56) are notable because they rank 3rd and 4th on the importance scale yet do not appear in the plotted tree. My understanding is that fg did such a powerful job at the root node that rpart never needed them as primary splitters. At each node, rpart tests every available variable and picks the one that reduces impurity the most. It also records surrogate variables whose splits closely mimic the chosen one, and these surrogates contribute to the importance scores even though they are not drawn on the tree. I believe this tells us that NBA players who score a lot also tend to turn the ball over more and pick up more fouls, simply because they have the ball in their hands more often. tov and pf are therefore indirect signals of the same thing fg measures directly. They just measure it less cleanly, which is why they appear as surrogates rather than primary splits.

Blocks (2.82) and rebounds (2.79) operate in the deeper branches of the tree, specifically Nodes 9 and 4, where they work together to identify players who do not fit the high-scoring profile of Node 3. A player who rebounds well (trb ≥ 6.35) and also blocks shots (blk ≥ 0.95) is classified as likely Top 50 at Node 19, capturing players who dominate the boards defensively. Their lower importance scores reflect the fact that they apply to only a minority of players in the dataset.

Confusion Matrix on Training and Testing Data

Using the predict function in R we can see below the results for our training data and testing data. Just below I have the full output from the code.

          player pos  fg   fgp thr  thrp   efg trb ast stl blk tov  pf  pts
1   Aaron Gordon  PF 6.5 0.434 2.0 0.336 0.500 7.9 2.3 1.0 0.8 1.8 1.9 17.6
2     Al Horford   C 5.1 0.489 1.3 0.429 0.553 7.4 4.7 0.6 1.1 1.8 1.9 12.9
3 Andre Iguodala  SF 2.3 0.463 0.5 0.282 0.514 3.8 3.3 0.8 0.6 1.0 1.5  6.0
4 Andrew Wiggins  SF 6.9 0.438 1.4 0.331 0.481 4.4 2.0 1.1 0.6 1.7 2.0 17.7
5  Austin Rivers  SG 5.6 0.424 2.2 0.378 0.508 2.4 4.0 1.2 0.3 1.8 2.5 15.1
6  Blake Griffin  PF 7.5 0.438 1.9 0.345 0.493 7.4 5.8 0.7 0.3 2.8 2.4 21.4
  top_50         N          Y nba_tree_prediction
1      Y 0.8666667 0.13333333                   N
2      Y 0.4444444 0.55555556                   Y
3      N 0.9863014 0.01369863                   N
4      N 0.9863014 0.01369863                   N
5      N 0.9863014 0.01369863                   N
6      Y 0.1904762 0.80952381                   Y
      Predicted
Actual  N  Y
     N 91  8
     Y  9 22
[1] 0.8692308
            player pos   fg   fgp thr  thrp   efg  trb ast stl blk tov  pf  pts
1  Al-Farouq Aminu  PF  3.3 0.395 1.8 0.369 0.503  7.6 1.2 1.1 0.6 1.1 2.0  9.3
2     Allen Crabbe  SG  4.5 0.407 2.7 0.378 0.529  4.3 1.6 0.6 0.5 1.0 2.2 13.2
3   Andre Drummond   C  6.0 0.529 0.0 0.000 0.529 16.0 3.0 1.5 1.6 2.6 3.2 15.0
4    Anthony Davis  PF 10.4 0.534 0.7 0.340 0.552 11.1 2.3 1.5 2.6 2.2 2.1 28.1
5 Anthony Tolliver  PF  2.8 0.464 2.0 0.436 0.632  3.1 1.1 0.4 0.3 0.7 1.8  8.9
6      Ben Simmons  PG  6.7 0.545 0.0 0.000 0.545  8.1 8.2 1.7 0.9 3.4 2.6 15.8
  top_50         N          Y nba_testing_tree_prediction
1      N 0.8666667 0.13333333                           N
2      N 0.5000000 0.50000000                           N
3      Y 0.4444444 0.55555556                           Y
4      Y 0.1904762 0.80952381                           Y
5      N 0.9863014 0.01369863                           N
6      Y 0.8666667 0.13333333                           N
      Predicted
Actual  N  Y
     N 48  4
     Y  3 11
[1] 0.8939394

Accuracy of the Classification Tree

Below are the results from the accuracy testing on the training data.

Training Confusion Matrix (Accuracy: 86.9%)

      Predicted
Actual  N  Y
     N 91  8
     Y  9 22

Training Dataset Performance (130 players)

On the training dataset, the classification tree achieved an overall accuracy of 86.9% ((91 + 22) / 130), correctly classifying 113 out of 130 players. The model is notably stronger at identifying non-Top 50 players (specificity 91.9%) than it is at identifying genuine Top 50 players (sensitivity 71.0%). The false alarm rate of 8.1% indicates the model is appropriately conservative: it rarely labels an average player as elite. However, the missed Top 50 rate of 29% means nearly 1 in 3 genuine Top 50 players were incorrectly classified as ordinary, which represents the model’s primary weakness.

Metric                              Calculation       Result
Overall Accuracy                    (91 + 22) / 130   86.9%
Error Rate                          (8 + 9) / 130     13.1%
Sensitivity (catching Top 50)       22 / 31           71.0%
Specificity (catching non-Top 50)   91 / 99           91.9%
False Alarm Rate                    8 / 99            8.1%
Missed Top 50 Rate                  9 / 31            29.0%
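The figures in the table above follow directly from the four cells of the confusion matrix. As a minimal sketch (the helper name classification_metrics is illustrative and not part of the original analysis), they can be recomputed in R like this:

```r
# Confusion matrix laid out as in the R output: rows = actual, columns = predicted
train_cm <- matrix(c(91,  8,
                      9, 22),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Actual = c("N", "Y"), Predicted = c("N", "Y")))

# Illustrative helper: returns the six metrics as proportions
classification_metrics <- function(cm) {
  tn <- cm["N", "N"]; fp <- cm["N", "Y"]
  fn <- cm["Y", "N"]; tp <- cm["Y", "Y"]
  c(accuracy    = (tn + tp) / sum(cm),
    error_rate  = (fp + fn) / sum(cm),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    false_alarm = fp / (tn + fp),
    missed      = fn / (tp + fn))
}

train_metrics <- classification_metrics(train_cm)
round(100 * train_metrics, 1)
```

The same helper could be applied to the testing matrix by swapping in its four counts (48, 4, 3, 11).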

Testing Dataset Performance (66 players)

Testing Confusion Matrix (Accuracy: 89.4%)

      Predicted
Actual  N  Y
     N 48  4
     Y  3 11

On the testing dataset, the model achieved an overall accuracy of 89.4% ((48 + 11) / 66), slightly higher than the training accuracy of 86.9%. All key metrics either improved or held steady when applied to unseen data, which is encouraging and somewhat unusual, because my research suggests many classification models deteriorate when applied to new data.

Metric                              Calculation      Result
Overall Accuracy                    (48 + 11) / 66   89.4%
Error Rate                          (4 + 3) / 66     10.6%
Sensitivity (catching Top 50)       11 / 14          78.6%
Specificity (catching non-Top 50)   48 / 52          92.3%
False Alarm Rate                    4 / 52           7.7%
Missed Top 50 Rate                  3 / 14           21.4%

Training vs Testing Comparison

Metric               Training   Testing   Change   Direction
Overall Accuracy     86.9%      89.4%     +2.5%    Improved
Error Rate           13.1%      10.6%     -2.5%    Improved
Sensitivity          71.0%      78.6%     +7.6%    Improved
Specificity          91.9%      92.3%     +0.4%    Stable
False Alarm Rate     8.1%       7.7%      -0.4%    Stable
Missed Top 50 Rate   29.0%      21.4%     -7.6%    Improved

Overfitting Assessment

Overfitting occurs when a classification model learns the specific patterns of the training data so precisely that it fails to generalise to new, unseen data. In the context of classification trees, overfitting typically manifests as a tree that grows too deep, creating highly specific branching rules that apply only to the training cases rather than capturing broader, transferable patterns. The classic symptom is a significant drop in accuracy from training data to testing data.

Evidence Against Overfitting in This Model

Based on the comparison of training and testing performance, the classification tree shows no evidence of overfitting. The following four points all support this conclusion:

1 — Overall Accuracy Improved on Testing Data

Accuracy increased from 86.9% on the training data to 89.4% on the testing data, a gain of 2.5 percentage points. In a textbook overfitting scenario, accuracy would be expected to drop, often substantially, when moving from training to testing data, as the model would have memorised the training cases rather than learned general patterns. An increase in accuracy on unseen data is a good sign of successful generalisation.

2 — Sensitivity Improved Substantially

Sensitivity — the model’s ability to correctly identify genuine Top 50 players — improved from 71.0% on training data to 78.6% on testing data, an improvement of 7.6 percentage points. Sensitivity is typically the hardest metric to maintain on new data because Top 50 players are already a minority class (only 31 out of 130 in training). The improvement on testing data suggests the model learned genuine, transferable patterns about what distinguishes elite players rather than memorising specific training cases.

3 — Specificity Held Steady

Specificity held almost perfectly steady at 91.9% (training) versus 92.3% (testing). The model’s ability to identify non-Top 50 players was virtually identical across both datasets, confirming that the tree’s conservative decision rules are stable and robust rather than data-specific.

4 — Tree Structure is Appropriately Shallow

The tree contains only 4 levels and 5 leaf nodes across 130 training players. Overfitted trees in classification studies typically grow very deep, with 8 to 10 or more levels, in the extreme writing essentially a unique rule for each training observation. Such trees need to be pruned back (for example with rpart's prune function) or capped in depth when fitted (the maxdepth setting in rpart.control). I did not feel the need to do either for this tree.

An Important Caveat

While the model shows no signs of overfitting, it is important to note a limitation that affects the strength of this conclusion. The testing dataset contains only 66 players, approximately half the size of the training set, and includes only 14 genuine Top 50 players. With such a small number of positive cases, a shift of just two or three classifications could substantially alter the sensitivity score, and the observed improvement should therefore be interpreted with cautious optimism.
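To put a rough number on that uncertainty, one option (not part of the original analysis) is an exact binomial confidence interval for the testing sensitivity, using base R's binom.test on the 11 of 14 Top 50 players the tree identified. The sketch also shows how far sensitivity would move if two classifications changed in either direction.

```r
# Testing set: 11 of the 14 genuine Top 50 players were predicted as Top 50
hits  <- 11
total <- 14

# Exact (Clopper-Pearson) 95% confidence interval for sensitivity
ci <- binom.test(hits, total)$conf.int
round(100 * ci, 1)

# Sensitivity with two fewer hits, as observed, and with two more hits
shifted <- (hits + c(-2, 0, 2)) / total
round(100 * shifted, 1)   # 64.3 78.6 92.9
```

An interval this wide is what one would expect from only 14 positive cases, which is why the improvement in sensitivity should not be over-read.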

Conclusion

The classification tree shows no signs of overfitting. Accuracy improved by 2.5 percentage points, sensitivity improved by 7.6 percentage points, and specificity held steady when the model was applied to completely new player data. The tree’s shallow structure produced a model that learned genuine patterns about what makes an NBA player a Top 50 performer rather than memorising the specific characteristics of players in the training set. In the context of classification studies, this represents a model that generalises well.

BINARY LOGISTIC REGRESSION

Below I fit a binary logistic regression model that allows me to classify a player as being in the Top 50 or outside the Top 50. I use the same 11 attributes as in the classification tree.


Call:
glm(formula = top_50_binary ~ fg + fgp + thr + thrp + efg + trb + 
    ast + stl + blk + tov + pf, family = binomial(link = "logit"), 
    data = nba)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -22.1497     6.9686  -3.178  0.00148 **
fg            1.1146     0.4702   2.370  0.01778 * 
fgp           8.0569    37.1402   0.217  0.82826   
thr           1.8565     1.3790   1.346  0.17823   
thrp         -3.9736     5.4205  -0.733  0.46351   
efg          13.3517    37.2101   0.359  0.71973   
trb           0.1160     0.1746   0.665  0.50625   
ast           0.3559     0.3236   1.100  0.27137   
stl           1.2273     0.8778   1.398  0.16208   
blk           1.4373     0.9467   1.518  0.12897   
tov          -1.2971     0.9667  -1.342  0.17967   
pf            0.2191     0.8587   0.255  0.79862   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.82  on 129  degrees of freedom
Residual deviance:  66.52  on 118  degrees of freedom
AIC: 90.52

Number of Fisher Scoring iterations: 7

The Regression Equation

From the Estimate column, the full regression equation is:

ln(π/(1-π)) = -22.1497 + 1.1146(fg) + 8.0569(fgp) + 1.8565(thr) - 3.9736(thrp) + 13.3517(efg) + 0.1160(trb) + 0.3559(ast) + 1.2273(stl) + 1.4373(blk) - 1.2971(tov) + 0.2191(pf)
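As a worked example, the equation can be evaluated by hand for a single player. The sketch below plugs in Aaron Gordon's figures from the first row of the training data and converts the log odds to a probability with plogis, the inverse logit in base R. The coefficients are the rounded values from the summary, so the result should land close to, but not exactly on, the model's own fitted value for observation 1 (0.4599).

```r
# Coefficients copied (rounded) from the model summary
coefs <- c(intercept = -22.1497, fg = 1.1146, fgp = 8.0569, thr = 1.8565,
           thrp = -3.9736, efg = 13.3517, trb = 0.1160, ast = 0.3559,
           stl = 1.2273, blk = 1.4373, tov = -1.2971, pf = 0.2191)

# Aaron Gordon, row 1 of the training data
gordon <- c(fg = 6.5, fgp = 0.434, thr = 2.0, thrp = 0.336, efg = 0.500,
            trb = 7.9, ast = 2.3, stl = 1.0, blk = 0.8, tov = 1.8, pf = 1.9)

# Log odds = intercept + sum of coefficient * value, matched by name
log_odds <- coefs[["intercept"]] + sum(coefs[names(gordon)] * gordon)

# Inverse logit turns log odds into a probability
prob <- plogis(log_odds)
round(prob, 3)   # about 0.46, so predicted 'N' at the 0.5 threshold
```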

Important Predictor Variables

Looking at the Pr(>|z|) column, only one variable is statistically significant:

Variable      Coefficient   P-value   Significant?
(Intercept)   -22.1497      0.00148   ** (p < 0.01)
fg            1.1146        0.01778   * (p < 0.05)
fgp           8.0569        0.82826   Not significant
thr           1.8565        0.17823   Not significant
thrp          -3.9736       0.46351   Not significant
efg           13.3517       0.71973   Not significant
trb           0.1160        0.50625   Not significant
ast           0.3559        0.27137   Not significant
stl           1.2273        0.16208   Not significant
blk           1.4373        0.12897   Not significant
tov           -1.2971       0.17967   Not significant
pf            0.2191        0.79862   Not significant

fg is the only statistically significant predictor variable (p = 0.01778), significant at the 0.05 level as indicated by the single asterisk (*).

This means:

• H₀ is rejected for fg: field goals made per game is a statistically significant predictor of Top 50 status

• H₀ is not rejected for any other variable: none of the remaining 10 variables is a statistically significant predictor of Top 50 status at the 0.05 level
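The p-value for fg can be checked from the Estimate and Std. Error columns. The sketch below assumes the usual Wald test reported by summary() for a glm, where z is the estimate divided by its standard error and the p-value is two-sided under a standard normal distribution.

```r
# Estimate and standard error for fg, from the coefficient table
estimate  <- 1.1146
std_error <- 0.4702

# Wald z statistic and two-sided p-value
z <- estimate / std_error
p <- 2 * pnorm(-abs(z))

round(c(z = z, p = p), 4)
p < 0.05   # TRUE, so H0 (coefficient = 0) is rejected at the 5% level
```

Both values agree with the z value (2.370) and Pr(>|z|) (0.01778) printed in the model summary, up to rounding of the inputs.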

Impact of Field Goals (fg) on the Odds of Being Top 50

Since fg is the only significant predictor, this is the only variable requiring odds interpretation.

Calculating e^(coefficient): e^(1.1146) = 3.048

Interpretation

For every 1 additional field goal made per game, the odds of a player being in the Top 50 are multiplied by 3.048. This represents a (3.048 - 1) × 100% = 204.8% increase in the odds of being Top 50 for each extra field goal made per game.

A 204.8% increase looks very large for only one more field goal per game, so I will try to explain it more carefully below.

What the Equation Actually Says

The logistic regression equation is:

ln(π/(1-π)) = -22.1497 + 1.1146(fg) + 8.0569(fgp) + 1.8565(thr) - 3.9736(thrp) + 13.3517(efg) + 0.1160(trb) + 0.3559(ast) + 1.2273(stl) + 1.4373(blk) - 1.2971(tov) + 0.2191(pf)

The coefficient of 1.1146 for fg operates on the log odds scale — not on probability directly. To get back to something interpretable we need to “undo” the log by raising e to the power of the coefficient:

e^1.1146 = 3.048

This means for every 1 extra field goal per game, the odds of being Top 50 are multiplied by 3.048.

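The key point is that a 204.8% increase in odds is not a 204.8% increase in probability, which is why the figure can feel misleading. Consider two players: (1) a player sitting right at the boundary of the Top 50, with a 50% probability, and (2) a player who is already unlikely to be Top 50, with a 5% probability. Rounding the odds ratio to 3, the borderline player's odds go from 1.0 to 3.0, a 200% increase in odds, but the probability only goes from 50% to 75%, a much more modest real-world change. For the unlikely player the probability rises from 5% to only about 13.7%: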

Scenario         Context             Probability   Odds
Before (+0 fg)   Borderline player   50%           50/50 = 1.0
After (+1 fg)    Borderline player   ~75%          75/25 = 3.0
Before (+0 fg)   Unlikely Top 50     5%            5/95 = 0.053
After (+1 fg)    Unlikely Top 50     ~13.7%        13.7/86.3 = 0.159
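The conversions in the table can be sketched in R. The helper shift_probability is illustrative: it converts a starting probability to odds, multiplies by the odds ratio, and converts back. Using the unrounded odds ratio e^1.1146 the results come out at roughly 75.3% and 13.8%, slightly above the table's figures, which appear to use an odds ratio rounded to 3.

```r
# Move a starting probability by an odds ratio and convert back to a probability
shift_probability <- function(p, odds_ratio) {
  odds_before <- p / (1 - p)
  odds_after  <- odds_before * odds_ratio
  odds_after / (1 + odds_after)
}

or_fg <- exp(1.1146)   # odds ratio for one extra field goal per game

start <- c(borderline = 0.50, unlikely = 0.05)
after <- shift_probability(start, or_fg)
round(100 * after, 1)
```

The same odds ratio therefore produces a gain of about 25 percentage points for the borderline player but under 9 for the unlikely one, which is the sense in which the change in probability depends on where a player starts.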

The range of fg in the data is relatively narrow. Most NBA players in the dataset score between 4 and 11 field goals per game. A single field goal difference within that range genuinely does separate players meaningfully. The difference between a role player at 5 fg/game and a star at 6 fg/game is substantial in the NBA context.

Accuracy of the Logistic Regression

Below is the full output of the predict function for testing the accuracy of the regression model on the training data and then on the test data.

# Training

nba_lr_pi <- predict(nba_model_lr, newdata = nba, type = 'response')
nba_lr_pi
           1            2            3            4            5            6 
0.4599416579 0.1820078525 0.0012366819 0.1741950503 0.2128502134 0.3976645664 
           7            8            9           10           11           12 
0.0245794905 0.1247773384 0.1856238606 0.9267390359 0.8145195575 0.0010800405 
          13           14           15           16           17           18 
0.0283716985 0.9740791005 0.0089945363 0.0006888773 0.0878055802 0.0059471071 
          19           20           21           22           23           24 
0.1465141230 0.0005480866 0.0008129682 0.3655091871 0.0004113700 0.9475663487 
          25           26           27           28           29           30 
0.0438808583 0.1105088588 0.0163852790 0.1173217851 0.6487898179 0.0265809967 
          31           32           33           34           35           36 
0.0032394913 0.1088670802 0.6679476683 0.0004922754 0.0223952423 0.0020281020 
          37           38           39           40           41           42 
0.0159027521 0.1989870848 0.4851525982 0.5006803608 0.0178080156 0.0005443639 
          43           44           45           46           47           48 
0.0035395651 0.7877015129 0.9786979197 0.1056035797 0.1151072226 0.4283527782 
          49           50           51           52           53           54 
0.0032383997 0.0156324643 0.0278852521 0.0047489513 0.0001868174 0.0741625528 
          55           56           57           58           59           60 
0.0026601294 0.0057055879 0.0209667214 0.0007908298 0.7168701124 0.2803992483 
          61           62           63           64           65           66 
0.4819925826 0.0480076792 0.0158953915 0.0033117671 0.1560712547 0.0208520471 
          67           68           69           70           71           72 
0.0008015509 0.1029854906 0.0417532415 0.0105088306 0.8654943729 0.0248265793 
          73           74           75           76           77           78 
0.1143963686 0.3764162775 0.6411242328 0.9692846330 0.9368680176 0.1921332845 
          79           80           81           82           83           84 
0.7205551733 0.9876184333 0.8035268665 0.0010684280 0.1856950795 0.9971415476 
          85           86           87           88           89           90 
0.0338161935 0.0037664408 0.1097815959 0.0210622171 0.0018434742 0.0010659519 
          91           92           93           94           95           96 
0.0179753510 0.0079890972 0.0047241373 0.0692563328 0.0120294491 0.7238012217 
          97           98           99          100          101          102 
0.5726153203 0.7046034084 0.0013718327 0.0053339326 0.0058677082 0.0069313340 
         103          104          105          106          107          108 
0.3954606376 0.0063973581 0.0802543914 0.0073879452 0.6862022739 0.7805753777 
         109          110          111          112          113          114 
0.0180131787 0.1525789688 0.0310047379 0.9992340352 0.6642354155 0.0009409720 
         115          116          117          118          119          120 
0.3311013642 0.1040394740 0.0519944908 0.0215420523 0.7473026374 0.2062874189 
         121          122          123          124          125          126 
0.0060753674 0.0358669692 0.0001449726 0.6012787622 0.9766457945 0.0923942397 
         127          128          129          130 
0.3114077964 0.0279277196 0.0100016022 0.0193133119 
# Attach the predicted probabilities and classify at a 0.5 threshold
nba_lr_final <- nba %>%
  mutate(pi = nba_lr_pi) %>%
  mutate(nba_lr_prediction = case_when(pi > 0.5  ~ 'Y',
                                       pi <= 0.5 ~ 'N'))
nba_lr_final
# A tibble: 130 × 19
   player      pos      fg   fgp   thr  thrp   efg   trb   ast   stl   blk   tov
   <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron Gord… PF      6.5 0.434   2   0.336 0.5     7.9   2.3   1     0.8   1.8
 2 Al Horford  C       5.1 0.489   1.3 0.429 0.553   7.4   4.7   0.6   1.1   1.8
 3 Andre Iguo… SF      2.3 0.463   0.5 0.282 0.514   3.8   3.3   0.8   0.6   1  
 4 Andrew Wig… SF      6.9 0.438   1.4 0.331 0.481   4.4   2     1.1   0.6   1.7
 5 Austin Riv… SG      5.6 0.424   2.2 0.378 0.508   2.4   4     1.2   0.3   1.8
 6 Blake Grif… PF      7.5 0.438   1.9 0.345 0.493   7.4   5.8   0.7   0.3   2.8
 7 Bogdan Bog… SG      4.4 0.446   1.7 0.392 0.529   2.9   3.3   0.9   0.2   1.6
 8 Brook Lopez C       5   0.465   1.5 0.345 0.536   4     1.7   0.4   1.3   1.3
 9 Carmelo An… PF      6.1 0.404   2.2 0.357 0.476   5.8   1.3   0.6   0.6   1.3
10 Chris Paul  PG      6.3 0.46    2.5 0.38  0.55    5.4   7.9   1.7   0.2   2.2
# ℹ 120 more rows
# ℹ 7 more variables: pf <dbl>, pts <dbl>, top_50 <chr>, outcome <fct>,
#   top_50_binary <dbl>, pi <dbl>, nba_lr_prediction <chr>
nba_lr_table <- table(nba_lr_final$outcome, nba_lr_final$nba_lr_prediction, dnn = c('Actual', 'Predicted'))
nba_lr_table
      Predicted
Actual  N  Y
     N 94  5
     Y  8 23
nba_lr_acc <- sum(diag(nba_lr_table))/sum(nba_lr_table)
nba_lr_acc
[1] 0.9
# Testing

nba_testing_lr_pi <- predict(nba_model_lr, newdata = nba_testing, type = 'response')
nba_testing_lr_pi
           1            2            3            4            5            6 
1.529092e-02 1.717828e-01 5.637821e-01 9.982634e-01 2.384550e-02 5.947992e-01 
           7            8            9           10           11           12 
2.325429e-02 7.062467e-02 8.844303e-01 2.153027e-02 3.741159e-04 1.256936e-01 
          13           14           15           16           17           18 
8.651253e-03 9.251300e-01 3.653700e-03 3.835995e-01 2.081906e-02 2.673102e-03 
          19           20           21           22           23           24 
1.147546e-03 8.143558e-02 1.602339e-01 1.522657e-01 4.236112e-01 3.358061e-05 
          25           26           27           28           29           30 
5.117240e-03 2.670005e-03 8.092632e-03 6.512846e-03 1.326002e-02 4.751111e-03 
          31           32           33           34           35           36 
1.522784e-03 1.635866e-01 9.354800e-04 9.972787e-01 4.574917e-04 5.465900e-02 
          37           38           39           40           41           42 
4.526071e-02 4.491164e-02 3.017851e-02 4.301401e-02 6.735361e-03 8.606577e-01 
          43           44           45           46           47           48 
2.937204e-02 5.369557e-04 9.729418e-01 1.050883e-02 9.972073e-01 3.181084e-02 
          49           50           51           52           53           54 
1.054865e-02 3.715806e-02 3.970610e-01 5.981091e-03 6.752153e-03 1.429672e-02 
          55           56           57           58           59           60 
3.322969e-03 9.377083e-01 7.713782e-03 1.067118e-01 9.697771e-04 7.818339e-04 
          61           62           63           64           65           66 
7.033658e-02 3.440215e-01 2.830105e-03 3.393204e-03 1.180089e-01 7.282448e-03 
# Attach the predicted probabilities and classify at a 0.5 threshold
nba_testing_lr_final <- nba_testing %>%
  mutate(pi = nba_testing_lr_pi) %>%
  mutate(nba_testing_lr_prediction = case_when(pi > 0.5  ~ 'Y',
                                               pi <= 0.5 ~ 'N'))
nba_testing_lr_final
# A tibble: 66 × 19
   player      pos      fg   fgp   thr  thrp   efg   trb   ast   stl   blk   tov
   <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Al-Farouq … PF      3.3 0.395   1.8 0.369 0.503   7.6   1.2   1.1   0.6   1.1
 2 Allen Crab… SG      4.5 0.407   2.7 0.378 0.529   4.3   1.6   0.6   0.5   1  
 3 Andre Drum… C       6   0.529   0   0     0.529  16     3     1.5   1.6   2.6
 4 Anthony Da… PF     10.4 0.534   0.7 0.34  0.552  11.1   2.3   1.5   2.6   2.2
 5 Anthony To… PF      2.8 0.464   2   0.436 0.632   3.1   1.1   0.4   0.3   0.7
 6 Ben Simmons PG      6.7 0.545   0   0     0.545   8.1   8.2   1.7   0.9   3.4
 7 Bobby Port… PF      5.2 0.471   1.1 0.359 0.52    6.8   1.7   0.7   0.3   1.4
 8 Bojan Bogd… SF      5.1 0.474   1.9 0.402 0.565   3.4   1.5   0.7   0.1   1.3
 9 Bradley Be… SG      8.3 0.46    2.4 0.375 0.527   4.4   4.5   1.2   0.4   2.6
10 Brandon In… SF      6.1 0.47    0.7 0.39  0.497   5.3   3.9   0.8   0.7   2.5
# ℹ 56 more rows
# ℹ 7 more variables: pf <dbl>, pts <dbl>, top_50 <chr>, outcome <fct>,
#   top_50_binary <dbl>, pi <dbl>, nba_testing_lr_prediction <chr>
nba_testing_lr_table <- table(nba_testing_lr_final$outcome, nba_testing_lr_final$nba_testing_lr_prediction, dnn = c("Actual", "Predicted"))
nba_testing_lr_table
      Predicted
Actual  N  Y
     N 52  0
     Y  4 10
nba_testing_lr_acc <- sum(diag(nba_testing_lr_table))/sum(nba_testing_lr_table)
nba_testing_lr_acc
[1] 0.9393939

Training Confusion Matrix

Training Confusion Matrix (Accuracy: 90.0%)

      Predicted
Actual  N  Y
     N 94  5
     Y  8 23

The model performed strongly on the training data, correctly classifying 117 out of 130 players at an overall accuracy of 90.0%. Specificity was particularly strong at 94.9%, meaning the model correctly identified the vast majority of non-Top 50 players. Sensitivity of 74.2% indicates the model correctly identified roughly three quarters of genuine Top 50 players, missing 8 out of 31. The false alarm rate of 5.1% confirms the model is appropriately conservative — it rarely labels an average player as elite.

Testing Confusion Matrix

The model was then applied to the unseen testing dataset of 66 players, producing the following confusion matrix.

Testing Confusion Matrix (Accuracy: 93.9%)

      Predicted
Actual  N  Y
     N 52  0
     Y  4 10

The standout finding on the testing dataset is a perfect specificity of 100% — the model correctly identified every single non-Top 50 player in the testing set, producing zero false alarms. Overall accuracy improved further to 93.9%, and the error rate fell to just 6.1%. The only area of slight concern is sensitivity, which dropped marginally from 74.2% to 71.4%, meaning the model missed 4 out of 14 genuine Top 50 players on unseen data.

Training vs Testing Comparison

Metric                              Training   Testing   Change   Direction
Overall Accuracy                    90.0%      93.9%     +3.9%    Improved
Error Rate                          10.0%      6.1%      -3.9%    Improved
Sensitivity (catching Top 50)       74.2%      71.4%     -2.8%    Slight drop
Specificity (catching non-Top 50)   94.9%      100%      +5.1%    Improved
False Alarm Rate                    5.1%       0.0%      -5.1%    Improved
Missed Top 50 Rate                  25.8%      28.6%     +2.8%    Slight rise

Conclusion

The logistic regression model shows no signs of overfitting. Overall accuracy increased by 3.9 percentage points on unseen data, which is the opposite of the accuracy drop that characterises an overfitted model. Four out of six metrics improved on the testing dataset, and the remaining two showed only marginal changes of less than 3 percentage points. The model has clearly learned genuine and transferable patterns about what distinguishes Top 50 NBA players rather than memorising specific players in the training set.

The logistic regression model demonstrates strong and stable performance across both datasets. With a testing accuracy of 93.9%, a false alarm rate of 0%, and no evidence of overfitting, it represents a reliable tool for classifying NBA players as Top 50 or otherwise. Its primary limitation is a missed Top 50 rate of 28.6% on testing data, which reflects the inherent difficulty of capturing elite player quality through statistics alone rather than any fundamental flaw in the model itself.

Comparing and Contrasting the Classification Tree and the Binary Logistic Regression Models

Which Model is More Accurate?

Metric                 Classification Tree   Logistic Regression   Better Model
Training Accuracy      86.9%                 90.0%                 LR
Testing Accuracy       89.4%                 93.9%                 LR
Training Error Rate    13.1%                 10.0%                 LR
Testing Error Rate     10.6%                 6.1%                  LR
Training Sensitivity   71.0%                 74.2%                 LR
Testing Sensitivity    78.6%                 71.4%                 Tree
Training Specificity   91.9%                 94.9%                 LR
Testing Specificity    92.3%                 100.0%                LR
Training False Alarms  8.1%                  5.1%                  LR
Testing False Alarms   7.7%                  0.0%                  LR
Overfitting            None                  None                  Draw

Based on the most important measure (testing accuracy) logistic regression is the superior model:

Logistic Regression: 93.9% vs Classification Tree: 89.4%

This 4.5 percentage point difference on completely unseen data is the primary basis for recommending logistic regression as the better model. As the classification notes state clearly, the best model should be selected based on performance on the testing dataset, since this represents how the model would perform in real world conditions on new players. Importantly, both models improved from training to testing data — an unusual and positive result confirming that neither model overfitted the training data. Both learned genuine, transferable patterns about what distinguishes Top 50 NBA players.

Variable Importance Comparison

Both models agree on one finding: fg is the dominant predictor of Top 50 status. The classification tree assigned fg an importance score of 16.34, nearly three times higher than any other variable, and used it as the very first and most powerful split at the root node. Logistic regression confirmed this independently, with fg the only statistically significant predictor (p = 0.018). This agreement across two completely different analytical methods is compelling evidence that field goals made per game is a genuinely robust predictor of elite NBA status rather than an artefact of one particular modelling approach. Both models also agree on the direction of the relationship: more field goals made increases the probability of Top 50 status. The logistic regression quantifies this precisely: each additional field goal per game multiplies the odds of being Top 50 by 3.048, a 204.8% increase in odds, holding the other variables constant.
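The odds-ratio interpretation can be checked with a few lines of arithmetic. The sketch below starts from the reported odds ratio of 3.048. The starting probability of 0.20 is a hypothetical value chosen only to show that a change in odds is not the same as a change in probability.

```r
odds_ratio   <- 3.048                   # reported odds ratio for fg
beta_fg      <- log(odds_ratio)         # implied coefficient on the log-odds scale
pct_increase <- (odds_ratio - 1) * 100  # percentage increase in odds per extra field goal
two_fg       <- odds_ratio^2            # odds multiplier for two extra field goals per game

# HYPOTHETICAL player with a 20% starting probability of being Top 50
p0    <- 0.20
odds0 <- p0 / (1 - p0)                  # 0.25
odds1 <- odds0 * odds_ratio             # odds after one extra field goal per game
p1    <- odds1 / (1 + odds1)            # back to a probability, roughly 0.43

c(pct_increase = pct_increase, two_fg = two_fg, p1 = p1)
```

The odds roughly triple, but the probability for this hypothetical player rises from 0.20 to about 0.43 rather than tripling, which is why the 204.8% figure should always be described as a change in odds.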

The classification tree identified thr (average three-pointers made per game), trb (total rebound percentage) and blk (average blocks per game) as meaningful secondary predictors, using them to build deeper branches. In the logistic regression, however, none of these variables reached statistical significance: thr (p = 0.178), trb (p = 0.506) and blk (p = 0.129) all fell short of the 0.05 threshold.

It is also worth considering which model is easier to apply in the real world. The classification tree produces simple IF/THEN rules that any coach or scout could understand without statistical knowledge. For example, "if a player makes 7.1 or more field goals per game, predict Top 50" is immediately actionable. The logistic regression equation, with 11 variables and a log-odds transformation, requires a considerably higher level of statistical understanding to interpret and communicate to non-technical stakeholders.

However, for more statistically oriented organisations, logistic regression produces an exact probability for every player, for example a 73% chance of being Top 50. This is richer and more actionable than the classification tree, which simply outputs Y or N. A scout could use probability scores from logistic regression to rank borderline players against each other, prioritising those with higher probabilities even if all fall below the 0.5 threshold.
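A minimal sketch of this ranking idea follows. The player names and log-odds values are hypothetical, not output from the fitted model. The sketch only shows how log-odds convert to probabilities and how players who are all classified N can still be ordered.

```r
# HYPOTHETICAL log-odds (linear predictor values) for four borderline players
log_odds <- c(player_a = -0.20, player_b = -1.50, player_c = -0.05, player_d = -2.30)

prob      <- plogis(log_odds)              # inverse logit: 1 / (1 + exp(-x))
predicted <- ifelse(prob >= 0.5, "Y", "N") # hard classification at the 0.5 threshold
ranking   <- sort(prob, decreasing = TRUE) # most likely Top 50 candidate first

predicted
round(ranking, 3)
```

All four hypothetical players are classified N, yet the probabilities still separate the near misses from the clear non-candidates, which is information a Y/N label alone cannot provide.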

Final Recommendation

Logistic regression is the recommended model for predicting Top 50 NBA status based on superior testing accuracy (93.9%), a perfect false alarm rate (0.0%), and stronger overall generalisation to new data. However, the classification tree remains a genuinely valuable complementary tool, particularly for communicating findings to non-technical stakeholders such as coaches and scouts, and for its superior ability to identify genuine Top 50 players on new data (78.6% sensitivity versus 71.4%). In an ideal analytical workflow both models would be used together: logistic regression for precise probability-based player rankings and the classification tree for clear, visual decision rules that can be directly communicated and acted upon.

PART B - CLUSTERING

Data Sourced From 1000 FIFA players from 2018/19

Introduction

This report applies hierarchical clustering to a dataset of 1,000 FIFA football players from the 2018/19 season. The goal is to identify natural groupings of players based on six performance attributes: acceleration, ball control, dribbling, shot power, short passing, and sprint speed.

Rows: 993
Columns: 10
$ name          <chr> "Neymar", "L. Messi", "L. Suarez", "Cristiano Ronaldo", …
$ age           <dbl> 25, 30, 30, 32, 28, 26, 26, 27, 23, 29, 26, 26, 27, 28, …
$ value         <dbl> 1.23e+08, 1.05e+08, 9.70e+07, 9.55e+07, 9.20e+07, 9.05e+…
$ wage          <dbl> 280000, 565000, 510000, 565000, 355000, 295000, 285000, …
$ acceleration  <dbl> 94, 92, 88, 89, 79, 93, 76, 60, 88, 78, 87, 77, 93, 88, …
$ ball_control  <dbl> 95, 95, 91, 93, 89, 92, 87, 89, 93, 85, 86, 92, 87, 87, …
$ dribbling     <dbl> 96, 97, 86, 91, 85, 93, 85, 79, 92, 84, 87, 90, 89, 90, …
$ shot_power    <dbl> 80, 85, 87, 94, 88, 79, 85, 87, 82, 88, 81, 75, 91, 84, …
$ short_passing <dbl> 81, 88, 83, 83, 83, 86, 90, 90, 83, 75, 79, 91, 86, 81, …
$ sprint_speed  <dbl> 90, 87, 77, 91, 83, 87, 75, 52, 84, 80, 86, 68, 95, 84, …

The dataset contains 993 players after removing 7 players who had no club, transfer value or wage recorded in the source data. After selecting the relevant columns, each player is described by six numeric performance attributes plus personal attributes (name, age, value, wage).

Do We Need to Scale the Data?

Before computing a distance matrix, it is important to consider whether the variables are on comparable scales.

In this case, scaling is not necessary. All six performance attributes (acceleration, ball control, dribbling, shot power, short passing, sprint speed) are measured on the same 1–100 scale.
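One way to check the scaling decision is to inspect the range and spread of each attribute before computing distances. The sketch below uses a hypothetical stand-in data frame (`toy`) rather than the report's `noname` object, and shows `scale()` as the usual alternative if the spreads turned out to differ substantially.

```r
set.seed(1)
# HYPOTHETICAL stand-in for the report's data: 50 players, six ratings on a 1-100 scale
toy <- as.data.frame(matrix(sample(1:100, 300, replace = TRUE), ncol = 6))
names(toy) <- c("acceleration", "ball_control", "dribbling",
                "shot_power", "short_passing", "sprint_speed")

ranges <- sapply(toy, range)  # row 1 = minimum, row 2 = maximum of each attribute
sds    <- sapply(toy, sd)     # spread of each attribute

ranges
round(sds, 1)

# Alternative if the spreads differ a lot: standardise to mean 0 and sd 1
toy_scaled <- scale(toy)
```

A shared 1–100 scale keeps the units comparable, but attributes with a wider spread still contribute more to Euclidean distance, so comparing standard deviations is a reasonable extra check.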

Hierarchical Clustering

Distance Matrix

A Euclidean distance matrix is computed across the six performance attributes for all 993 players.

d1 <- dist(noname, method = "euclidean")

Fitting the Hierarchical Clustering Model

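A Ward-type linkage is used (method = "ward.D" in hclust). Ward's approach merges clusters so as to minimise the increase in total within-cluster variance at each step, which tends to produce compact, well-separated clusters. Note that when hclust is given unsquared Euclidean distances, as it is here, the "ward.D2" option is the one that implements Ward's criterion exactly; "ward.D" is a closely related variant.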

h1 <- hclust(d1, method = "ward.D")

Dendrogram

Dendrogram of hierarchical clustering of the 993 FIFA players using Ward’s method.

With 993 players, the dendrogram is naturally dense and it is not possible to read individual player labels. However, the overall clustering structure is still visible. Rather than focusing on the fine detail at the leaves, attention should be directed to the upper portion of the tree; the large jumps in height toward the top indicate the most meaningful splits. The biggest height increase before the final merges suggests that a 4-cluster solution is a natural and sensible choice, which is highlighted by the coloured rectangles above.

Heatmap

Heatmap of the Euclidean distance matrix, ordered by the hierarchical clustering dendrogram.

Although with 993 players the heatmap is more compressed than in smaller datasets and the block structure may not appear as sharply defined, there is still meaningful evidence of clustering. The darker regions along the diagonal, where distances between players are small, indicate groups of similar players being placed together by the clustering algorithm.

Creating the Four Clusters and Cluster Sizes

Number of players assigned to each cluster.
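The step of cutting the tree into four groups and counting members can be sketched as follows. The data here are hypothetical random values, and the object names ending in `_toy` are illustrative. The report's own `clusters1` object is assumed, not confirmed, to have been produced by a similar `cutree()` call on `h1`.

```r
set.seed(1)
# HYPOTHETICAL data: 60 observations on six attributes (not the FIFA data)
toy   <- matrix(rnorm(60 * 6, mean = 70, sd = 10), ncol = 6)
d_toy <- dist(toy, method = "euclidean")
h_toy <- hclust(d_toy, method = "ward.D")

clusters_toy <- cutree(h_toy, k = 4)  # integer label 1..4 for every observation
sizes_toy    <- table(clusters_toy)   # number of observations in each cluster
sizes_toy
```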

Assessing Cluster Quality with Silhouette Analysis

The silhouette coefficient measures how well each observation fits within its assigned cluster relative to neighbouring clusters. Values range from −1 to +1. As a rule of thumb, values from 0.51 to 0.70 suggest a reasonable cluster structure, and values from 0.71 to 1 suggest that a strong structure has been found.
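For reference, the standard definition of the silhouette width of observation i is given below. Here a(i) is the mean distance from i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster. The symbols are the conventional ones and do not appear in the report itself.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

A value near +1 means the observation is much closer to its own cluster than to the nearest alternative, a value near 0 means it sits on the boundary between two clusters, and a negative value means it is on average closer to a neighbouring cluster. The cluster-level and overall figures reported are averages of s(i).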

sil1 <- silhouette(clusters1, d1)
summary(sil1)
Silhouette of 993 units in 4 clusters from silhouette.default(x = clusters1, dist = d1) :
 Cluster sizes and average silhouette widths:
      349       330       107       207 
0.2872864 0.2273908 0.6947205 0.1536821 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.2801  0.1331  0.2991  0.2834  0.4099  0.7959 

Silhouette analysis was used to assess the quality of the four-cluster solution across all 993 players. The overall mean silhouette width of 0.283 falls below the commonly accepted threshold of 0.5, suggesting the solution is weak overall. However, this masks considerable variation across clusters. C3 (n = 107) is by far the best-defined cluster, with a mean silhouette width of 0.695, indicating that these players are genuinely distinct and well-separated from the rest. In contrast, C4 (n = 207) has a poor mean silhouette width of 0.154, meaning these players have relatively weak affinity to their assigned cluster and sit closer to neighbouring clusters than is ideal. This is further evidenced by the minimum individual silhouette width of -0.280, confirming that some players have been misclassified and would fit better in a different cluster. C1 (n = 349) and C2 (n = 330) both return weak scores of 0.287 and 0.227 respectively, reflecting an overlap in their skill profiles.
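The per-cluster averages and the negative widths discussed above can be pulled directly from a silhouette object. The sketch below uses hypothetical toy data and illustrative object names. It assumes the `cluster` package, which provides `silhouette()`, is installed.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# HYPOTHETICAL toy data: 60 observations on two variables (not the FIFA data)
toy    <- matrix(rnorm(120, mean = 60, sd = 15), ncol = 2)
d_toy  <- dist(toy)
cl_toy <- cutree(hclust(d_toy, method = "ward.D"), k = 4)

sil_toy <- silhouette(cl_toy, d_toy)
sil_sum <- summary(sil_toy)

sil_sum$avg.width        # overall mean silhouette width
sil_sum$clus.avg.widths  # mean silhouette width per cluster

# Observations with negative widths are, on average, closer to their neighbouring cluster
neg <- which(sil_toy[, "sil_width"] < 0)
sil_toy[neg, , drop = FALSE]  # columns: cluster, neighbor, sil_width
```

Listing the negative-width rows shows which cluster each poorly fitting observation was assigned to and which neighbouring cluster it sits closest to.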

Mean performance scores (out of 100) for each cluster across the six attributes.
Cluster | Acceleration | Sprint Speed | Ball Control | Dribbling | Short Passing | Shot Power
C1 | 83.1 | 82.7 | 80.3 | 80.0 | 76.3 | 76.1
C2 | 69.4 | 68.8 | 80.7 | 77.1 | 80.5 | 78.0
C3 | 48.5 | 49.2 | 23.7 | 16.1 | 33.0 | 25.1
C4 | 57.0 | 60.8 | 66.4 | 55.8 | 70.3 | 63.3
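A profile table of this kind is a set of per-cluster means. The sketch below shows one way to compute it with `aggregate()`, using hypothetical toy data with only two attributes. The report's own code for this step is not shown, so this is an illustration, not a reproduction.

```r
set.seed(1)
# HYPOTHETICAL toy data: 40 players, two attributes (not the FIFA data)
toy <- data.frame(
  acceleration = round(runif(40, 40, 95)),
  dribbling    = round(runif(40, 10, 95))
)

# Cluster on the two attributes, then attach the labels as a new column
toy$cluster <- cutree(hclust(dist(toy), method = "ward.D"), k = 4)

# Mean of each attribute within each cluster, one row per cluster
profile <- aggregate(cbind(acceleration, dribbling) ~ cluster, data = toy, FUN = mean)
profile
```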

Interpretation

The table above reveals four player types:

C1 – Score highest on physical attributes, with the strongest acceleration (83.1) and sprint speed (82.7) of any cluster, while remaining competitive across technical attributes. These are athletic, high-intensity players who combine pace with solid technical ability — profiles consistent with attacking midfielders, wide attackers, or pacey forwards.

C2 – The most technically refined cluster. They outperform C1 on short passing (80.5 vs 76.3) and shot power (78.0 vs 76.1), and closely match C1 on ball control (80.7 vs 80.3) and dribbling (77.1 vs 80.0), but are notably slower on acceleration (69.4 vs 83.1) and sprint speed (68.8 vs 82.7). These players are best characterised as technically gifted but physically limited, a profile consistent with playmakers, deep-lying midfielders, or technical forwards such as number 10s.

C3 – Display the most distinctive profile. Very low ball control (23.7), dribbling (16.1), and shot power (25.1), with relatively higher sprint speed (49.2) and acceleration (48.5). This profile is the hallmark of a goalkeeper, who retains reasonable athletic attributes but lacks outfield technical skill.

C4 – While scoring at a moderate level across all attributes, the profile is not without shape. Their strongest attribute is short passing (70.3), complemented by reasonable ball control (66.4), yet they are notably slow, ranking lowest among the three outfield clusters on acceleration (57.0) and sprint speed (60.8); only the goalkeeper cluster (C3) scores lower. This combination of limited pace, moderate technical ability, and a passing emphasis is consistent with a centre-back or deep-lying defensive midfielder profile: players who are not expected to contribute athletically or in advanced areas, but who are relied upon to distribute, hold position, and provide defensive structure. The lower overall attribute ceiling compared to C2 suggests these are competent but less refined versions of a similar positional type.

The line chart below provides a strong visual summary:

C1 and C2 both sit at a high level on technical attributes but diverge sharply on pace, with C1 elevated and C2 dipping considerably; C3 forms a distinctive spike-and-collapse shape reflecting its unique goalkeeper profile; and C4 mirrors the general shape of C2 but sits at a consistently lower level, reinforcing its identity as a positionally-oriented defensive type footballer.

Mean performance profile for each cluster across the six skill attributes.

Age, Club Value and Wage

The table below presents means for age, club transfer value, and wage for each cluster. These demographic and financial variables were not used in the clustering but serve as useful external validation of the cluster labels.

Average age, transfer value and wage by cluster.
Cluster | Total Players | Avg Age | Avg Value (€M) | Avg Wage (€K)
C1 | 349 | 26.1 | 20.52 | 75.1
C2 | 330 | 27.6 | 19.55 | 77.2
C3 | 107 | 29.1 | 14.35 | 51.8
C4 | 207 | 28.0 | 13.45 | 58.5

Cluster ages fall within a narrow band, ranging from 26.1 to 29.1 years. C3 records the highest average age (29.1), consistent with the real-world pattern that goalkeeping careers tend to extend further into a player’s thirties. C1 is the youngest cluster on average (26.1), which may reflect the premium placed on young, athletic talent in the transfer market.

Contrary to what might be expected, the transfer values across clusters are relatively similar, ranging from €13.5M (C4) to €20.5M (C1). C1 commands the highest average valuation (€20.5M), followed closely by C2 (€19.6M), reflecting the market value placed on both athletic and technically refined outfield players. C3 and C4 carry lower average valuations (€14.4M and €13.5M respectively), which is consistent with goalkeepers and defensively-oriented players typically attracting a lower market premium relative to their attacking counterparts.

The wage distribution closely mirrors the value pattern, with C1 and C2 earning higher average wages (€75.1K and €77.2K respectively) than C3 and C4 (€51.8K and €58.5K). The gaps are moderate rather than dramatic, suggesting broadly comparable earning levels across clusters.

In combination, these external variables provide reasonable face validity for the cluster solution, with C1 and C2 attracting marginally higher market valuations and wages, broadly consistent with their elite outfield profiles.

Summary

The four-cluster hierarchical solution reveals interpretable and practically meaningful player types. The clearest finding is the complete separation of goalkeepers from outfield players. Among outfield players, the distinction between elite athletes and technical playmakers is intuitive but statistically weak, suggesting that a three-cluster solution on outfield-only data would be a productive next step for refining the analysis.

K-MEANS CLUSTERING

K-means clustering was applied to the same 993 players and six attributes used in the hierarchical analysis, again with no scaling required as all attributes share a common 1–100 scale. set.seed(101) ensures reproducibility.
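A minimal sketch of a seeded K-means fit is given below. The data are hypothetical and the helper `run_kmeans` is illustrative. The report's own `kmeans()` call is not shown, so arguments such as `nstart` are not known and the defaults are used here.

```r
# HYPOTHETICAL toy data: 80 observations on six attributes (not the FIFA data)
set.seed(1)
toy <- matrix(rnorm(80 * 6, mean = 70, sd = 10), ncol = 6)

# Fixing the seed immediately before kmeans() makes the random starting centres repeatable
run_kmeans <- function(x, k, seed) {
  set.seed(seed)
  kmeans(x, centers = k)
}

km_a <- run_kmeans(toy, k = 4, seed = 101)
km_b <- run_kmeans(toy, k = 4, seed = 101)

km_a$size                              # number of observations per cluster
km_a$centers                           # cluster means on each attribute
identical(km_a$cluster, km_b$cluster)  # same seed, same assignment
```

Because the result depends on the random starting centres, `kmeans()` also accepts an `nstart` argument that tries several random starts and keeps the best one, which is a common way to make the solution less dependent on any single seed.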

Assessing Cluster Quality with Silhouette Analysis

sil_kmeans1 <- silhouette(kmeans1$cluster, d1)
summary(sil_kmeans1)
Silhouette of 993 units in 4 clusters from silhouette.default(x = kmeans1$cluster, dist = d1) :
 Cluster sizes and average silhouette widths:
      173       429       285       106 
0.1746332 0.3826574 0.2179209 0.6883496 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1098  0.1825  0.3321  0.3318  0.4813  0.7898 

The overall mean silhouette width for the K-means solution is 0.332, again falling below the commonly accepted threshold of 0.5. As with the hierarchical solution, scores vary considerably across clusters. The ‘goalkeeper’ cluster (identifiable by its very low technical attribute scores) again produces the highest silhouette width, confirming that goalkeepers are distinctly separated from outfield players regardless of the algorithm used. Clusters representing outfield player types return weaker scores, reflecting the inherent overlap between positions on a 1–100 attribute scale.

Profiling the K-Means Clusters

Mean performance scores (out of 100) for each K-means cluster across the six attributes.
Cluster | Acceleration | Sprint Speed | Ball Control | Dribbling | Short Passing | Shot Power
C1 | 58.2 | 62.1 | 64.5 | 53.3 | 68.7 | 60.3
C2 | 82.1 | 81.3 | 81.3 | 81.1 | 77.9 | 77.1
C3 | 65.0 | 65.0 | 78.5 | 73.5 | 79.0 | 76.9
C4 | 48.3 | 49.2 | 23.4 | 15.9 | 32.8 | 25.1

The table above reveals four player types:

C1 - scores at a moderate level across all attributes, with no single standout quality. Their strongest attribute is short passing (68.7), complemented by reasonable ball control (64.5), yet they are notably slow — ranking second lowest on acceleration (58.2) and sprint speed (62.1) among the clusters. This combination of limited pace, moderate technical ability, and a passing emphasis is consistent with a centre-back or deep-lying defensive midfielder profile — players relied upon to distribute and hold position rather than contribute athletically in advanced areas.

C2 - scores highest on physical attributes, with the strongest acceleration (82.1) and sprint speed (81.3) of any cluster, while remaining highly competitive across technical attributes. Ball control (81.3) and dribbling (81.1) are both elite, and short passing (77.9) and shot power (77.1) are strong. These are high-intensity players who combine pace with excellent technical ability — profiles consistent with wide attackers, pacey forwards, or attacking midfielders.

C3 - are a technically strong but slower outfield cluster. They lead all clusters on short passing (79.0) and post strong ball control (78.5), shot power (76.9), and dribbling (73.5), although C2 scores higher on ball control (81.3) and dribbling (81.1). They are notably slower than C2: acceleration (65.0) and sprint speed (65.0) sit well below C2’s 82.1 and 81.3. These players are best characterised as technically capable but physically limited, a profile consistent with deep-lying midfielders, number 10s, or technical forwards who create rather than run in behind.

C4 - display the most distinctive profile of any cluster. Ball control (23.4), dribbling (15.9), and shot power (25.1) are dramatically lower than all other clusters, while acceleration (48.3) and sprint speed (49.2) are modest but not negligible. This spike-and-collapse shape is the hallmark of a goalkeeper, who retains reasonable athletic attributes but lacks the outfield technical skills that define every other cluster.

The line chart below provides a strong visual summary: C2 and C3 both sit at a high level on technical attributes but diverge sharply on pace, with C2 elevated and C3 dipping considerably; C4 forms a distinctive spike-and-collapse shape reflecting its unique goalkeeper profile; and C1 mirrors the general shape of C3 but sits at a consistently lower level across all attributes, reinforcing its identity as a positionally-oriented defensive type.

Age, Club Value and Wage

The table below presents means for age, club transfer value, and wage for each cluster. These demographic and financial variables were not used in the clustering but serve as useful external validation of the cluster labels.

Average age, transfer value and wage by K-means cluster.
Cluster | Total Players | Avg Age | Avg Value (€M) | Avg Wage (€K)
C1 | 173 | 27.7 | 13.39 | 58.9
C2 | 429 | 26.2 | 22.07 | 81.1
C3 | 285 | 28.1 | 16.20 | 66.1
C4 | 106 | 29.0 | 14.47 | 52.0

The demographic and financial profile of the K-means clusters closely echoes the hierarchical findings. The goalkeeper cluster again records the highest average age, consistent with the known longevity of goalkeeping careers. The clusters capturing elite athletic and technically refined outfield players command higher average transfer values and wages than the defensive and goalkeeper clusters, reflecting the premium the transfer market places on outfield creativity and athleticism. I believe that these patterns provide external validation that the K-means algorithm has recovered meaningful real-world groupings.

Comparing Hierarchical Clustering and K-Means

K-means produced a higher overall mean silhouette width than hierarchical clustering (0.332 vs 0.283), suggesting it found a marginally better-fitting 4-cluster partition of this data. However, the difference between the two solutions is small, and both share the same core weakness, namely weak silhouette scores for the outfield player clusters, which reflects genuine overlap in the attribute profiles of outfield positions rather than a failure of either algorithm. In both cases the goalkeeper cluster is far and away the best-separated group, with silhouette widths above 0.6.

Both algorithms recovered the same four player types (goalkeepers, elite athletic outfield players, technical playmakers, and defensively-oriented players), confirming that these groupings reflect genuine structure in the data rather than an artefact of the method used.

The cluster numbering differs between methods, as K-means cluster numbers depend on the random starting centres, but the underlying profiles are closely matched: hierarchical C1 (athletic outfield, n=349) corresponds to K-means C2 (n=429); hierarchical C2 (technical playmakers, n=330) matches K-means C3 (n=285); hierarchical C3 (goalkeepers, n=107) matches K-means C4 (n=106); and hierarchical C4 (defensive/passing, n=207) matches K-means C1 (n=173).
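A cross-tabulation of the two label vectors is a direct way to confirm which cluster in one solution corresponds to which in the other, instead of matching on profiles alone. The sketch below uses hypothetical toy data with two well-separated groups; with the report's objects the equivalent call would presumably be `table(clusters1, kmeans1$cluster)`, which is an assumption about how those objects were built.

```r
set.seed(101)
# HYPOTHETICAL toy data: two clearly separated groups of 30 observations each
toy <- rbind(matrix(rnorm(60, mean = 20, sd = 2), ncol = 2),
             matrix(rnorm(60, mean = 80, sd = 2), ncol = 2))

hier <- cutree(hclust(dist(toy), method = "ward.D"), k = 2)
km   <- kmeans(toy, centers = 2, nstart = 5)$cluster

# Rows are hierarchical labels, columns are K-means labels
xtab <- table(hierarchical = hier, kmeans = km)
xtab
```

Each row of the table has its count concentrated in one column, which identifies the matching pair of labels even when the numbering differs. Off-diagonal counts in a real comparison show exactly how many players moved between groups.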

The most meaningful difference between the two results lies in cluster sizes for the outfield player types. The goalkeeper clusters are strikingly consistent, with 107 players in hierarchical versus 106 in K-means, providing strong mutual validation that this group is robustly identified by both algorithms. The outfield clusters tell a different story. The athletic outfield cluster grew from 349 (hierarchical) to 429 (K-means), while the defensive/passing cluster shrank from 207 to 173, and the technical playmaker cluster contracted from 330 to 285. This suggests K-means assigned a portion of the defensively-oriented and technical players to the athletic cluster. The two algorithms optimise different criteria and so draw the boundaries between overlapping outfield groups in different places; K-means results can also vary with the random starting point.

Conclusion

Both algorithms agree on the fundamental structure of this player dataset: four meaningful player types exist, with goalkeepers forming the clearest and most distinct group. The weak overall silhouette scores for outfield clusters are consistent across both methods, pointing to the underlying challenge of separating playing positions based solely on six shared attributes. For a more granular segmentation of outfield players, applying either algorithm to outfield players only would be a productive next step.