Sports Analytics & Insight: Classification and Clustering Analysis of NBA and FIFA Data

Introduction

This report evaluates the application of statistical modelling and machine learning techniques to elite sports datasets to enhance performance analysis. Part A uses supervised learning, namely binary logistic regression and classification trees, to predict whether NBA players appear in the ESPN “Top 50” ranking based on their 2017–18 season performance. Part B applies unsupervised learning, using hierarchical and K-means clustering to analyse 1,000 professional football players from a FIFA dataset for natural groupings and skill specialisations. By combining these methods to find key performance indicators and distinct player archetypes, the report offers data-driven insights for professional sports management.

Part A - NBA Players

Classification Tree Method

Creating Classification Tree

One rule for predicting that a player is in the Top 50 (“Y”) is that the player averages 7.1 or more field goals per game (fg). Taking the “no” branch from the root node leads to leaf node Box 3, which has a purity of 81% (probability 0.81): 17 of the 21 training-set players in this leaf are Top 50 players. This high purity suggests that high-volume field goal production is a strong single predictor of elite status.

In contrast, if a player averages fewer than 7.1 field goals per game, fewer than 2.3 three-pointers per game (thr; the exact split is 2.25) and less than 6.4 total rebound percentage (trb; the exact split is 6.35), they are predicted to be outside the Top 50 (“N”). This path leads to leaf node Box 8, which has a very high purity of 99%: 72 of the 73 training-set players in this leaf are outside the Top 50, so the rule acts as a very effective filter for non-elite players.
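The two rules above are part of a larger tree. As a rough sketch, the full set of splits can be written as nested conditions in Python. The thresholds and class counts are copied from the rpart summary printed in the next section; the function name `predict_top_50` and the example stat lines are hypothetical and only serve as illustration.

```python
def predict_top_50(fg, thr, trb, blk):
    """Return (predicted class, share of Top 50 players in the leaf reached)."""
    if fg >= 7.1:
        return "Y", 17 / 21   # node 3: 4 N, 17 Y
    if thr >= 2.25:
        return "N", 6 / 12    # node 5: 6 N, 6 Y (evenly split leaf, labelled N)
    if trb < 6.35:
        return "N", 1 / 73    # node 8: 72 N, 1 Y
    if blk < 0.95:
        return "N", 2 / 15    # node 18: 13 N, 2 Y
    return "Y", 5 / 9         # node 19: 4 N, 5 Y


# Invented stat lines, not players from the dataset.
print(predict_top_50(fg=8.0, thr=1.5, trb=5.0, blk=0.4))  # reaches node 3
print(predict_top_50(fg=5.0, thr=1.0, trb=4.0, blk=0.3))  # reaches node 8
```

Written this way, it is easy to see that only two of the five leaves predict “Y”, and that the second of them (node 19) is far less pure than the first.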

Predicting Important Variables

Call:
rpart(formula = top_50 ~ pos + fg + fgp + thr + thrp + efg + 
    trb + ast + stl + blk + tov + pf, data = nba_training, method = "class")
  n= 130 

          CP nsplit rel error    xerror      xstd
1 0.41935484      0 1.0000000 1.0000000 0.1567347
2 0.01075269      1 0.5806452 0.8387097 0.1471194
3 0.01000000      4 0.5483871 0.9032258 0.1511979

Variable importance
  fg  thr  tov   pf  blk  trb  pos  ast thrp  stl  fgp 
  37   14   12    8    6    6    4    4    3    2    2 

Node number 1: 130 observations,    complexity param=0.4193548
  predicted class=N  expected loss=0.2384615  P(node) =1
    class counts:    99    31
   probabilities: 0.762 0.238 
  left son=2 (109 obs) right son=3 (21 obs)
  Primary splits:
      fg  < 7.1    to the left,  improve=16.335520, (0 missing)
      thr < 2.25   to the left,  improve=10.069930, (0 missing)
      tov < 1.45   to the left,  improve= 8.330769, (0 missing)
      stl < 1.35   to the left,  improve= 6.179021, (0 missing)
      trb < 7.3    to the left,  improve= 5.553524, (0 missing)
  Surrogate splits:
      tov < 2.75   to the left,  agree=0.892, adj=0.333, (0 split)
      pf  < 3.05   to the left,  agree=0.869, adj=0.190, (0 split)
      thr < 2.6    to the left,  agree=0.854, adj=0.095, (0 split)
      ast < 8.5    to the left,  agree=0.854, adj=0.095, (0 split)
      stl < 1.75   to the left,  agree=0.846, adj=0.048, (0 split)

Node number 2: 109 observations,    complexity param=0.01075269
  predicted class=N  expected loss=0.1284404  P(node) =0.8384615
    class counts:    95    14
   probabilities: 0.872 0.128 
  left son=4 (97 obs) right son=5 (12 obs)
  Primary splits:
      thr < 2.25   to the left,  improve=3.723257, (0 missing)
      fg  < 4.65   to the left,  improve=3.478123, (0 missing)
      trb < 7.3    to the left,  improve=2.893895, (0 missing)
      efg < 0.5335 to the left,  improve=2.570815, (0 missing)
      tov < 1.45   to the left,  improve=2.270472, (0 missing)
  Surrogate splits:
      ast < 6.75   to the left,  agree=0.899, adj=0.083, (0 split)
      stl < 1.65   to the left,  agree=0.899, adj=0.083, (0 split)

Node number 3: 21 observations
  predicted class=Y  expected loss=0.1904762  P(node) =0.1615385
    class counts:     4    17
   probabilities: 0.190 0.810 

Node number 4: 97 observations,    complexity param=0.01075269
  predicted class=N  expected loss=0.08247423  P(node) =0.7461538
    class counts:    89     8
   probabilities: 0.918 0.082 
  left son=8 (73 obs) right son=9 (24 obs)
  Primary splits:
      trb < 6.35   to the left,  improve=2.791143, (0 missing)
      blk < 0.95   to the left,  improve=2.233258, (0 missing)
      fg  < 4.65   to the left,  improve=1.802364, (0 missing)
      pos splits as  RLLLL,      improve=1.542761, (0 missing)
      fgp < 0.464  to the left,  improve=1.435421, (0 missing)
  Surrogate splits:
      pos  splits as  RLLLL,      agree=0.845, adj=0.375, (0 split)
      thrp < 0.2565 to the right, agree=0.835, adj=0.333, (0 split)
      blk  < 0.75   to the left,  agree=0.825, adj=0.292, (0 split)
      thr  < 0.25   to the right, agree=0.814, adj=0.250, (0 split)
      fgp  < 0.549  to the left,  agree=0.804, adj=0.208, (0 split)

Node number 5: 12 observations
  predicted class=N  expected loss=0.5  P(node) =0.09230769
    class counts:     6     6
   probabilities: 0.500 0.500 

Node number 8: 73 observations
  predicted class=N  expected loss=0.01369863  P(node) =0.5615385
    class counts:    72     1
   probabilities: 0.986 0.014 

Node number 9: 24 observations,    complexity param=0.01075269
  predicted class=N  expected loss=0.2916667  P(node) =0.1846154
    class counts:    17     7
   probabilities: 0.708 0.292 
  left son=18 (15 obs) right son=19 (9 obs)
  Primary splits:
      blk < 0.95   to the left,  improve=2.0055560, (0 missing)
      stl < 0.9    to the left,  improve=1.0416670, (0 missing)
      fgp < 0.474  to the left,  improve=0.9388889, (0 missing)
      efg < 0.51   to the left,  improve=0.9388889, (0 missing)
      thr < 0.85   to the left,  improve=0.7500000, (0 missing)
  Surrogate splits:
      pos  splits as  RLLL-,      agree=0.792, adj=0.444, (0 split)
      thrp < 0.1715 to the right, agree=0.708, adj=0.222, (0 split)
      pf   < 2.65   to the left,  agree=0.708, adj=0.222, (0 split)
      fgp  < 0.474  to the left,  agree=0.667, adj=0.111, (0 split)
      thr  < 0.05   to the right, agree=0.667, adj=0.111, (0 split)

Node number 18: 15 observations
  predicted class=N  expected loss=0.1333333  P(node) =0.1153846
    class counts:    13     2
   probabilities: 0.867 0.133 

Node number 19: 9 observations
  predicted class=Y  expected loss=0.4444444  P(node) =0.06923077
    class counts:     4     5
   probabilities: 0.444 0.556 

The most important metrics for determining an NBA player’s Top 50 ranking are average field goals per game (fg) at 37%, three-pointers made (thr) at 14%, and turnovers (tov) at 12%.

Field goals (fg) emerged as the dominant predictor, accounting for 37% of total variable importance, more than twice the share of the next most important variable. This priority is also reflected in the classification tree structure, where fg was selected for the split at the root node, confirming it as the best single measure for separating the player groups. Even though field goal production is the model’s primary driver, three-pointers and turnovers remain notable secondary factors in the final classification.

Model Accuracy

Training set (n = 130):
      Predicted
Actual  N  Y
     N 91  8
     Y  9 22

Test set (n = 66):
      Predicted
Actual  N  Y
     N 48  4
     Y  3 11

The classification tree showed strong predictive power, correctly classifying 113 of the 130 players in the training dataset (86.9% accuracy). When applied to the unseen test sample, performance remained strong, with 59 of 66 players classified correctly, giving a slightly higher test accuracy of 89.4%.

Based on these results, there is no evidence that the model is overfitting the training set. Accuracy is stable, and marginally higher, on the test data, indicating that the model has identified general patterns applicable to the wider population of NBA players rather than memorising the specifics of the training set. It is unusual for a model to be more accurate on a test set than on its training set; with only 66 test players, the 2.5 percentage point difference is most likely sampling variation rather than a genuine improvement.
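The accuracy figures can be reproduced directly from the two confusion matrices. The short Python sketch below does this and adds a simple binomial standard error as a rough guide to sampling noise. The standard error is an approximation added here for illustration and is not part of the original R output.

```python
import math


def accuracy(matrix):
    """matrix[i][j] = players with actual class i predicted as class j (order N, Y)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total


tree_train = [[91, 8], [9, 22]]
tree_test = [[48, 4], [3, 11]]

for name, matrix in [("training", tree_train), ("test", tree_test)]:
    correct, total, acc = accuracy(matrix)
    se = math.sqrt(acc * (1 - acc) / total)  # simple binomial approximation
    print(f"{name}: {correct}/{total} = {acc:.1%} (standard error about {se:.1%})")
```

The standard errors come out at roughly three to four percentage points, which is larger than the gap between the training and test accuracies.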

Binary Logistic Regression Model

Creating Regression Model

[1] "N" "Y"
[1] "N" "Y"

Call:
glm(formula = top_50 ~ pos + fg + fgp + thr + thrp + efg + trb + 
    ast + stl + blk + tov + pf, family = binomial(link = "logit"), 
    data = nba_training)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -18.6844     7.3431  -2.544   0.0109 *
posPF        -2.1358     1.3850  -1.542   0.1230  
posPG        -1.8758     2.0854  -0.900   0.3684  
posSF        -0.6826     1.6757  -0.407   0.6837  
posSG        -0.3048     1.7937  -0.170   0.8651  
fg            1.1205     0.4981   2.250   0.0245 *
fgp          16.4680    41.3104   0.399   0.6902  
thr           2.3870     1.5595   1.531   0.1259  
thrp         -4.5043     5.9837  -0.753   0.4516  
efg          -0.5173    41.2340  -0.013   0.9900  
trb           0.1575     0.2460   0.640   0.5219  
ast           0.5552     0.4049   1.371   0.1703  
stl           1.3078     1.1259   1.162   0.2454  
blk           1.3646     0.9495   1.437   0.1507  
tov          -1.7518     1.1043  -1.586   0.1127  
pf            0.2102     0.9898   0.212   0.8318  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.818  on 129  degrees of freedom
Residual deviance:  61.859  on 114  degrees of freedom
AIC: 93.859

Number of Fisher Scoring iterations: 7

Regression Equation

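$$
\begin{aligned}
\ln\left(\frac{\pi}{1-\pi}\right) ={} & -18.6844 - 2.1358\,\mathrm{posPF} - 1.8758\,\mathrm{posPG} - 0.6826\,\mathrm{posSF} - 0.3048\,\mathrm{posSG} \\
& + 1.1205\,\mathrm{fg} + 16.4680\,\mathrm{fgp} + 2.3870\,\mathrm{thr} - 4.5043\,\mathrm{thrp} - 0.5173\,\mathrm{efg} \\
& + 0.1575\,\mathrm{trb} + 0.5552\,\mathrm{ast} + 1.3078\,\mathrm{stl} + 1.3646\,\mathrm{blk} - 1.7518\,\mathrm{tov} + 0.2102\,\mathrm{pf}
\end{aligned}
$$

Here π is the probability that a player is in the Top 50.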
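To show how the fitted equation turns a stat line into a probability, the sketch below applies the coefficients from the glm output in Python. The function name `top_50_probability` and the example player are hypothetical. It also assumes that the position level without its own coefficient acts as the baseline, so a player at that position has all four position dummies set to 0.

```python
import math

# Coefficients copied from the glm summary above.
COEF = {
    "(Intercept)": -18.6844, "posPF": -2.1358, "posPG": -1.8758,
    "posSF": -0.6826, "posSG": -0.3048, "fg": 1.1205, "fgp": 16.4680,
    "thr": 2.3870, "thrp": -4.5043, "efg": -0.5173, "trb": 0.1575,
    "ast": 0.5552, "stl": 1.3078, "blk": 1.3646, "tov": -1.7518, "pf": 0.2102,
}


def top_50_probability(player):
    """player maps predictor names to values; omitted position dummies count as 0."""
    log_odds = COEF["(Intercept)"]
    for name, value in player.items():
        log_odds += COEF[name] * value
    return 1 / (1 + math.exp(-log_odds))  # inverse of the logit link


# Hypothetical small forward; the stat line is invented for illustration.
player = {"posSF": 1, "fg": 8.0, "fgp": 0.48, "thr": 1.5, "thrp": 0.36,
          "efg": 0.53, "trb": 6.0, "ast": 4.0, "stl": 1.2, "blk": 0.5,
          "tov": 2.5, "pf": 2.2}
print(round(top_50_probability(player), 3))
```

A player would be classified as “Y” when this probability exceeds the chosen cut-off. A cut-off of 0.5 is the usual default, although the report does not state which value was used.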

Important Predictor Variables

The statistical significance of each predictor variable was assessed using the logistic regression model’s p-values. The only statistically significant predictor of Top 50 status was average field goals made per game (fg), with a p-value of 0.0245. As this is below the standard 0.05 level of significance, which is also indicated by the single asterisk (*) next to the coefficient, the null hypothesis that the coefficient for field goals is zero is rejected. A high number of field goals made is therefore identified by this model as an important variable in predicting a Top 50 player.

However, none of the positional variables or other performance measures was statistically significant. The closest were turnovers (tov), with a p-value of 0.1127, and the Power Forward position (posPF), with a p-value of 0.1230. Because these values are higher than 0.05, the null hypothesis that each coefficient is zero cannot be rejected, so there is no evidence that these characteristics contribute significantly to the prediction of a player’s classification within this specific model.

Impact of Variable on odds of Player Being in Top 50

Average field goals made per game (fg) was the only significant predictor variable, with a p-value of 0.0245. Its coefficient of 1.1205 gives an odds ratio of e^(1.1205) = 3.066. This means that, holding the other variables constant, each 1-unit increase in average field goals made per game multiplies the odds of a player ranking inside the Top 50 by 3.066, an increase of 206.6%.
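The conversion from coefficient to odds ratio is a one-line calculation. The Python sketch below reproduces it and also shows, as an added illustration, how the effect compounds for a 2-unit increase. The variable names are chosen here for illustration.

```python
import math

beta_fg = 1.1205  # fg coefficient from the glm output

odds_ratio = math.exp(beta_fg)
pct_change = (odds_ratio - 1) * 100
print(f"odds ratio = {odds_ratio:.3f}, change in odds = {pct_change:.1f}%")
# prints: odds ratio = 3.066, change in odds = 206.6%

# Odds ratios multiply, so two extra field goals per game scale the odds
# by the square of the one-unit ratio rather than by twice the ratio.
print(f"two-unit odds ratio = {math.exp(2 * beta_fg):.2f}")
```

Note that the multiplier applies to the odds, not to the probability. The change in probability depends on the player’s starting point.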

Check Accuracy of Model

Training set (n = 130):
      Predicted
Actual  N  Y
     N 94  5
     Y  6 25

Test set (n = 66):
      Predicted
Actual  N  Y
     N 50  2
     Y  4 10

The binary logistic regression model showed strong predictive ability in the training dataset, correctly classifying 119 of 130 players, an accuracy of 91.5%. When applied to the unseen testing sample, performance remained strong, with 60 of 66 players classified correctly and a similarly high testing accuracy of 90.9%.

These results demonstrate that the training set is not overfit by the model. The sustained accuracy between the training and test datasets illustrates that the model has successfully identified general performance patterns applicable to the larger population of NBA players rather than merely memorising the specifics of the training set.

Comparison Between Models

Which is More Accurate?

The binary logistic regression model is the more accurate of the two in predicting NBA Top 50 status. It correctly classified 119 of 130 players in the training dataset (91.5% accuracy) and 60 of 66 players in the unseen testing sample (90.9% accuracy). In comparison, the classification tree’s accuracy was lower, at 86.9% during training and 89.4% during testing, although the gap on the test set amounts to a single player (60 versus 59 correct).

The logistic regression model’s consistent accuracy between the training and test datasets shows that it found broad performance patterns relevant to the wider population of NBA players instead of merely memorising training-specific patterns. Although both models avoided overfitting, the logistic regression model yielded the more accurate classification for this dataset.
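Overall accuracy hides how each model treats the two classes. As a supplementary sketch, the Python snippet below derives sensitivity (the share of actual Top 50 players found) and specificity (the share of other players correctly excluded) from the two test-set confusion matrices reported above. The helper name `rates` is hypothetical.

```python
def rates(matrix):
    """matrix rows are actual N, Y; columns are predicted N, Y."""
    (tn, fp), (fn, tp) = matrix
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


test_sets = {
    "classification tree": [[48, 4], [3, 11]],
    "logistic regression": [[50, 2], [4, 10]],
}

for model, matrix in test_sets.items():
    summary = ", ".join(f"{key} {value:.1%}" for key, value in rates(matrix).items())
    print(f"{model}: {summary}")
```

On the test set the tree identifies 11 of the 14 Top 50 players and the logistic regression 10 of 14, while the logistic regression makes fewer false positives (2 against 4). The accuracy advantage of the logistic model therefore comes from the “N” class.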

Comparison of Important Variables

According to both methods, average field goals per game (fg) is the primary factor in predicting NBA Top 50 status. In the classification tree, fg formed the first split, at 7.1, and accounted for 37% of total variable importance. Similarly, fg was the only variable in the logistic regression model that met the significance threshold, while other metrics, such as turnovers, did not.

Part B - FIFA Soccer Players

Does the Data Need to be Scaled?

No, the data does not need to be scaled before computing the distance matrix for hierarchical clustering or before running the K-means clustering algorithm. Cluster analysis requires all variables to be on a comparable scale so that no single attribute dominates the Euclidean distance calculation. This is already the case, as all six performance attributes are measured on the same 1 to 100 scale.
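A small Python sketch illustrates why a shared scale matters for Euclidean distance. The two players and the wage figures below are invented for illustration; wage is not one of the six clustering attributes and is used here only as an example of a variable on a very different scale.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical players: (acceleration, dribbling), both on the 1 to 100 scale.
p1, p2 = (80, 78), (60, 55)
print(round(euclidean(p1, p2), 1))  # 30.5: both attributes contribute

# Appending a hypothetical weekly wage in euros swamps the two ratings.
p1_wage, p2_wage = p1 + (80000,), p2 + (52000,)
print(round(euclidean(p1_wage, p2_wage), 1))  # 28000.0: wage alone decides
```

Because the six attributes in this analysis share one scale, the second situation does not arise and the raw values can be used directly.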

Hierarchical Clustering

Visualising the results using a dendrogram

Visualising the results using a heatmap

The heatmap provides visual evidence of clustering in the FIFA dataset. It shows one large block of high similarity and several smaller, distinct square blocks further down the diagonal. This suggests that the 1,000 players are not just a random spread of statistics but naturally group into specific player profiles, such as fast attackers, technical midfielders or goalkeeper specialists. Players are placed into these clusters based on their similarities across the six performance attributes.

Creating a 4-cluster solution

fifa_clusters
  1   2   3   4 
492 193 107 208 
Silhouette of 1000 units in 4 clusters from silhouette.default(x = fifa_clusters, dist = fifa_dist) :
 Cluster sizes and average silhouette widths:
       492        193        107        208 
0.30592133 0.28830631 0.69464229 0.08230691 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3653  0.1461  0.3160  0.2976  0.4576  0.7958 

The four-cluster solution is assessed using silhouette widths, which measure how well each player fits their assigned cluster compared with the other clusters. The overall average silhouette width was 0.2976. As this falls within the 0.25 to 0.5 range, the clustering structure is weak and could be artificial.

However, the individual clusters differ considerably in cohesion. Cluster 3 produced the strongest score, with an average silhouette width of 0.6946, indicating that its players are well matched to the group and share a similar profile.

In contrast, Cluster 4 produced the lowest average silhouette width of 0.0823, which is below the 0.25 threshold and indicates that no substantial structure has been found for this group. The players in Cluster 4 therefore cannot be grouped effectively using the six selected performance attributes.

Despite this variation between clusters, the mean silhouette width of 0.2976 indicates that the 4-cluster solution is weak overall.
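For readers unfamiliar with the measure, the silhouette width of one observation is (b − a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The Python sketch below applies this to toy one-dimensional data. The ratings and group names are invented and are not taken from the FIFA dataset.

```python
def silhouette(point, own_cluster, other_clusters):
    """own_cluster must exclude the point itself; distance is the absolute difference."""
    a = sum(abs(point - q) for q in own_cluster) / len(own_cluster)
    b = min(sum(abs(point - q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)


# Toy dribbling ratings for three hypothetical groups.
keepers = [14, 16, 18]
balanced = [54, 58, 62]
elite = [64, 66, 68]

# A keeper sits far from both outfield groups: width close to 1.
print(round(silhouette(14, [16, 18], [balanced, elite]), 2))  # 0.93

# The top "balanced" player is closer to the elite group: negative width.
print(round(silhouette(62, [54, 58], [keepers, elite]), 2))  # -0.33
```

The same pattern appears in the real output: the goalkeeper cluster scores highly because it is far from everything else, while overlapping outfield groups pull the average down.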

Profiling the Clusters using Table and Graphs

Table Showcasing Clusters for the Six Various Attributes

# A tibble: 4 × 7
  Cluster mean_acceleration mean_ball_control mean_dribbling mean_shot_power
  <fct>               <dbl>             <dbl>          <dbl>           <dbl>
1 1                    80.8              81.1           80.4            77.0
2 2                    65.2              79.0           73.9            77.2
3 3                    48.5              23.7           16.1            25.1
4 4                    57.0              66.3           55.8            63.3
# ℹ 2 more variables: mean_short_passing <dbl>, mean_sprint_speed <dbl>

The table above displays the performance profiles of the four hierarchical clusters, revealing player archetypes that can be identified by both their physical and technical abilities.

  • Cluster 1 contains the elite players who excel in speed and dribbling ability. Their average scores for acceleration (81), ball control (81), dribbling (80) and sprint speed (80) are the highest of the four clusters in both pace and individual skill.

  • Cluster 2 groups together the elite players who thrive on passing ability and shot power, and can be considered the technical powerhouses. Although this cluster’s averages for the attributes listed above are below those of cluster 1, it has the highest averages for short passing (79) and, marginally, shot power (77.2 against 77.0 for cluster 1). This suggests that the cluster may represent elite midfielders or central playmakers.

  • Cluster 3 is the most distinct group and appears to contain the goalkeepers in the sample. It has by far the lowest scores in all technical outfield attributes, including dribbling (16) and ball control (24). Its average silhouette width of 0.69 indicates a cohesive, well-separated group, which reflects the fundamental difference between the skillset required of a goalkeeper and that of an outfield player in soccer.

  • Cluster 4 contains the remaining balanced outfield players, who do not excel in any single attribute. They are the middle-ground group, with higher attribute averages than the goalkeepers but lower averages than the elite groups in cluster 1 and cluster 2. However, this cluster does possess solid distribution skills, evidenced by a short passing average of 70.

Table Showcasing Clusters for Age, Club Value and Wage

# A tibble: 4 × 4
  Cluster mean_age mean_value mean_wage
  <fct>      <dbl>      <dbl>     <dbl>
1 1           26.3  21355081.    79226.
2 2           28.1  16098446.    65772.
3 3           29.1  14350935.    51766.
4 4           28.0  13387500     58260.

The table above shows the age and financial profiles of the four clusters, which suggest an association between performance profile and market worth.

  • Cluster 1 (Youth Elite Stars) has the highest financial figures in the sample, with an average club market value of €21,355,081 and an average wage of €79,226. It is also the youngest cluster, with an average age of 26 years. This cluster represents the most sought-after and most valuable players in the dataset.

  • Cluster 2 (Established Technicals) has the second-highest financial standing, with an average club value of €16,098,446 and an average wage of €65,772. Its players are slightly older, at an average of 28 years. This group represents the established veterans in the dataset, whose high technical ability maintains their market worth.

  • Cluster 3 (Goalkeepers) contains the oldest players on average, at 29 years old. These players have the lowest average wage, at €51,766, but their average market value of €14,350,935 is the third highest, above that of cluster 4.

  • Cluster 4 (Balanced Group) has the lowest average club value, at €13,387,500. Although this group matches the established technical group in cluster 2 with an average age of 28 years, it earns a more moderate average wage of €58,260. This cluster also lacks the market premium associated with the dataset’s elite performers.

Graph Illustrating Performance Profiles by Clusters

The line graph above gives a visual representation of the player profiles found within the FIFA dataset across the four clusters.

The elite outfield players can be seen in cluster 1 and cluster 2. These clusters have the highest average scores across the six attributes, with cluster 1 leading in sprint speed and acceleration, while cluster 2 leads in technical proficiency through short passing and shot power.

Cluster 3, containing the goalkeepers from the FIFA dataset, is visibly an outlier. Its line sits well below the other clusters, illustrating the drop-off in the technical skills required of outfield players.

Cluster 4, which represents the middle-ground players, sits between the elite outfield players and the goalkeepers. These players exceed the goalkeepers on these attributes but are well below the elite players, indicating a group of balanced or defensive players who lack elite capabilities but remain competent in distribution attributes such as short passing.

K-means Clustering

Clustering Solutions by Calculating Silhouette Scores

Silhouette of 1000 units in 4 clusters from silhouette.default(x = fifa_kmeans$cluster, dist = fifa_dist) :
 Cluster sizes and average silhouette widths:
      174       431       289       106 
0.1777123 0.3802751 0.2177741 0.6883372 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1060  0.1807  0.3313  0.3307  0.4788  0.7898 

The quality of the clustering solution was assessed using silhouette widths, which measure how well each player fits their assigned cluster compared with the other clusters. The K-means algorithm produced an average silhouette width of 0.3307. This falls within the 0.25 to 0.5 range, which suggests that the structure discovered is weak and could be artificial.

Despite this low overall mean, the method did produce one highly distinct group: Cluster 4, the goalkeepers, obtained a high average silhouette width of 0.688. It was the only group to achieve a high score. The method struggled to find substantial natural structure among the outfield players based on the six performance attributes selected, which suggests significant overlap in their statistical profiles.

Profiling the Clusters using Table and Graphs

Table Showcasing Clusters for the Six Various Attributes

# A tibble: 4 × 7
  Cluster mean_acceleration mean_ball_control mean_dribbling mean_shot_power
  <fct>               <dbl>             <dbl>          <dbl>           <dbl>
1 1                    58.2              64.5           53.3            60.3
2 2                    82.1              81.3           81.1            77.1
3 3                    65.0              78.6           73.6            76.9
4 4                    48.3              23.4           15.9            25.1
# ℹ 2 more variables: mean_short_passing <dbl>, mean_sprint_speed <dbl>

The K-means clustering algorithm extracted four distinct player profiles from the sample of FIFA soccer players, clustered on their physical and technical attributes. The table above shows that cluster 2 and cluster 3 contain the most elite players in the sample.

  • Cluster 1 (Balanced Group) contains the players who sit in the middle ground. These players have higher attribute scores than the goalkeepers but are less technically or physically gifted than the elite players. This group’s highest attribute is short passing (68.7), while all other attributes remain moderate, marking these out as the average players in the dataset.

  • Cluster 2 (Elite Dribbling and Sprint Speed) contains the top-level players, who hold the highest average scores for acceleration (82), ball control (81), dribbling (81) and sprint speed (81). This group could consist of elite wingers or attacking players who excel through explosiveness and individual technical ability.

  • Cluster 3 (Technical Playmakers) contains the remaining elite players, who excel in technical distribution. These players have the highest average score for short passing (79) and high shot power (77). This group could consist of top-level midfielders or central playmakers who benefit their teams through ball movement and technical precision.

  • Cluster 4 (Goalkeepers) contains the players with the lowest average scores across all six attributes, because these attributes measure skills required of outfield players. The low scores are illustrated by the average dribbling score of 16 and the ball control score of 23.

Table Showcasing Clusters for Age, Club Value and Wage

# A tibble: 4 × 4
  Cluster mean_age mean_value mean_wage
  <fct>      <dbl>      <dbl>     <dbl>
1 1           27.7  13317241.    58540.
2 2           26.2  21964037.    80708.
3 3           28.1  15971626.    65225.
4 4           29.0  14475000     51972.

The financial and age profiles of the four K-means clusters suggest a relationship between technical specialisation and market worth.

  • Cluster 2 (High-Value Elite) consists of the players with the highest average club value (€21,964,037) and the highest average wage (€80,708). It is also the youngest group, with an average age of 26 years. This cluster represents the most in-demand and expensive players.

  • Cluster 3 (Established Technicals) also contains elite players and forms the second most valuable tier, with an average club value of €15,971,626 and an average wage of €65,225. These players are slightly older, at an average of 28 years, marking them out as veterans whose high technical ability commands their market value and wages.

  • Cluster 4 (Goalkeepers) is the oldest group of players (29 years old) and maintains a reasonable market value of €14,475,000. However, this cluster earns the lowest average wage (€51,972).

  • Cluster 1 (Budget Outfield Players) contains the players with the lowest average club value (€13,317,241). Despite this, these players earn more in wages than the goalkeepers (€58,540). Their low market value is consistent with balanced, mid-tier players who lack the elite proficiency of the other outfield groups.

Graph Illustrating Performance Profiles by Clusters

The line graph above visualises the four player profiles found within the FIFA dataset.

The top-tier outfield players can be seen in cluster 2 and cluster 3. These clusters occupy the upper part of the graph, with cluster 2 leading in sprint speed and acceleration, while cluster 3 leads in average short passing.

Cluster 4 is visibly the outlier because of its lack of technical proficiency. Its line is below all other groups, reflecting weaker scores in the outfield technical skills that the other groups excel in and identifying it as the goalkeeper group.

Cluster 1 represents the average players, whose line sits between cluster 4 (goalkeepers) and clusters 2 and 3 (elite players). These players do not excel in any particular attribute but are consistent across all of them.

Comparison Between Models

Highest Quality Clusters

K-means clustering produced the higher quality clusters for this FIFA dataset, achieving an overall mean silhouette width of 0.3307 compared with the 0.2976 produced by the hierarchical approach. Although K-means scores higher, both results fall within the 0.25 to 0.5 range, which the sources categorise as a clustering structure that is weak and could be artificial.
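Both overall means can be recovered from the per-cluster output as size-weighted averages of the cluster silhouette widths. The Python sketch below does this and, as an added illustration, repeats the calculation without the goalkeeper cluster. The function name `mean_silhouette` is hypothetical; the sizes and widths are copied from the two silhouette summaries above.

```python
def mean_silhouette(sizes, widths):
    """Size-weighted mean of the per-cluster average silhouette widths."""
    return sum(n * w for n, w in zip(sizes, widths)) / sum(sizes)


h_sizes, h_widths = [492, 193, 107, 208], [0.30592133, 0.28830631, 0.69464229, 0.08230691]
k_sizes, k_widths = [174, 431, 289, 106], [0.1777123, 0.3802751, 0.2177741, 0.6883372]

hierarchical = mean_silhouette(h_sizes, h_widths)
kmeans = mean_silhouette(k_sizes, k_widths)
print(f"hierarchical {hierarchical:.4f}, k-means {kmeans:.4f}")
# prints: hierarchical 0.2976, k-means 0.3307

# Outfield clusters only: drop the goalkeepers (hierarchical cluster 3, K-means cluster 4).
h_outfield = mean_silhouette(h_sizes[:2] + h_sizes[3:], h_widths[:2] + h_widths[3:])
k_outfield = mean_silhouette(k_sizes[:3], k_widths[:3])
print(f"outfield only: hierarchical {h_outfield:.3f}, k-means {k_outfield:.3f}")
```

Without the goalkeepers, the means fall to about 0.25 and 0.29, which shows how much of the overall score in both solutions depends on that one well-separated group.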

Did Both Algorithms Produce Clusters With a Similar Profile?

Overall, the two algorithms corroborate each other by independently finding nearly identical player profiles. Both methods isolated a highly distinct goalkeeper group (average silhouette width above 0.68) and clearly differentiated the elite outfield specialisations.

Visual evidence from the line graphs for both methods supports this specialisation, showing one elite cluster leading in physical speed (acceleration and sprint speed) while another crosses over to lead in technical precision (short passing and shot power).
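The correspondence between the two solutions can also be checked numerically by pairing each hierarchical cluster with the K-means cluster whose centroid is closest. The Python sketch below does this using only the four attributes visible in the printed tables (acceleration, ball control, dribbling and shot power); short passing and sprint speed are truncated in the output and are therefore left out.

```python
import math

# Cluster means copied from the two profile tables, in the order:
# acceleration, ball control, dribbling, shot power.
hierarchical = {1: (80.8, 81.1, 80.4, 77.0), 2: (65.2, 79.0, 73.9, 77.2),
                3: (48.5, 23.7, 16.1, 25.1), 4: (57.0, 66.3, 55.8, 63.3)}
kmeans = {1: (58.2, 64.5, 53.3, 60.3), 2: (82.1, 81.3, 81.1, 77.1),
          3: (65.0, 78.6, 73.6, 76.9), 4: (48.3, 23.4, 15.9, 25.1)}

matches = {}
for h, h_means in hierarchical.items():
    # Nearest K-means centroid by Euclidean distance.
    closest = min(kmeans, key=lambda k: math.dist(h_means, kmeans[k]))
    matches[h] = closest
    print(f"hierarchical {h} -> k-means {closest} "
          f"(distance {math.dist(h_means, kmeans[closest]):.1f})")
```

Each hierarchical cluster pairs with a different K-means cluster (1 with 2, 2 with 3, 3 with 4, and 4 with 1), so the two methods differ mainly in how the clusters are numbered.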

There were, however, small differences in the group sizes: the goalkeeper group shrank from 107 players in the hierarchical solution to 106 in the K-means output. There were also slight differences in the centroid means, where K-means’ iterative process of reassigning observations to the closest cluster produced higher peak scores, such as a mean acceleration of 82.1 for the elite speed cluster compared with 80.8 in the hierarchical solution. Nevertheless, the high overall degree of similarity between the two outputs verifies that the identified player categories are consistent across both methods.
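The iterative reassignment mentioned above can be illustrated with a minimal one-dimensional K-means loop in Python. This is a teaching sketch on invented ratings, with hand-picked starting centroids; it is not the routine used to produce the results in this report.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning values to the nearest centroid and updating centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups


# Invented ratings forming three obvious groups.
ratings = [14, 16, 18, 54, 56, 58, 78, 80, 82]
centroids, groups = kmeans_1d(ratings, centroids=[10.0, 50.0, 90.0])
print(centroids)  # [16.0, 56.0, 80.0]
print(groups)
```

Because each centroid is recomputed as the mean of the points currently assigned to it, cluster means can drift towards denser regions of the data, which is consistent with the slightly higher peak scores seen in the K-means table.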