Classifying NBA Players and Clustering Soccer Players - Sports Analytics Module

Author

Abby Tarrant

Part A – Classify NBA Players

1. Import Data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rpart)
library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.6.2 Copyright (c) 2006-2023 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(cluster)

nba_train <- read_csv("nba_training.csv") %>% na.omit()
Rows: 130 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): player, pos, top_50
dbl (12): fg, fgp, thr, thrp, efg, trb, ast, stl, blk, tov, pf, pts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nba_test  <- read_csv("nba_testing.csv") %>% na.omit() 
Rows: 66 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): player, pos, top_50
dbl (12): fg, fgp, thr, thrp, efg, trb, ast, stl, blk, tov, pf, pts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Classification Tree Method

2a. Create and Visualise Model

nba_train_clean <- nba_train %>% select(-player, -pts)
nba_test_clean  <- nba_test %>% select(-player, -pts)

nba_train_clean$top_50 <- factor(nba_train_clean$top_50, levels = c("N", "Y"))
nba_test_clean$top_50  <- factor(nba_test_clean$top_50, levels = c("N", "Y"))

nba_tree <- rpart(top_50 ~ ., data = nba_train_clean, method = 'class')

fancyRpartPlot(nba_tree)

2b. Interpretation of the Classification Tree

2b(i) Rule for Predicting Top 50

A rule for predicting that a player is in the Top 50 is as follows:

If a player has field goals (fg) greater than 7.1, then the player is classified as being in the Top 50.

The probability values at this leaf node are approximately 0.19 (Not Top 50) and 0.81 (Top 50), meaning that 81% of players in this node are Top 50 players. This indicates that the node is highly pure.


2b(ii) Rule for Predicting NOT Top 50

A rule for predicting that a player is NOT in the Top 50 is:

If a player has total rebound percentage (trb) less than 6.4, then the player is classified as NOT being in the Top 50.

The probability values at this leaf node are approximately 0.99 (Not Top 50) and 0.01 (Top 50), indicating that 99% of players in this node are not Top 50 players. This shows a very high level of node purity.


2b(iii) Variable Importance

nba_tree$variable.importance
        fg        thr        tov         pf        blk        trb        pos 
16.3355244  6.1996468  5.4451748  3.5572075  2.8196389  2.7911430  1.9380366 
       ast       thrp        stl        fgp 
 1.8660357  1.3760600  1.0881536  0.8043276 

The most important variables for predicting whether a player is in the Top 50 are:

  • Field goals (fg)
  • Three-point shots (thr)
  • Total rebound percentage (trb)
  • Blocks (blk)

These variables contribute the most to improving the predictive accuracy of the classification tree.

2c. Accuracy Assessment

Training Accuracy

train_tree_pred <- predict(nba_tree, newdata = nba_train_clean, type = 'class')
train_tree_tab <- table(nba_train_clean$top_50, train_tree_pred)
train_tree_tab
   train_tree_pred
     N  Y
  N 91  8
  Y  9 22
sum(diag(train_tree_tab)) / sum(train_tree_tab)
[1] 0.8692308

Testing Accuracy

test_tree_pred <- predict(nba_tree, newdata = nba_test_clean, type = 'class')
test_tree_tab <- table(nba_test_clean$top_50, test_tree_pred)
test_tree_tab
   test_tree_pred
     N  Y
  N 48  4
  Y  3 11
sum(diag(test_tree_tab)) / sum(test_tree_tab)
[1] 0.8939394

2c(i) Overfitting

The classification tree model achieved an accuracy of 86.9% on the training dataset and 89.4% on the testing dataset.

Since the testing accuracy is slightly higher than the training accuracy, this suggests that the model is not overfitting.

This indicates that the model generalises well to new, unseen data.

3. Binary Logistic Regression Method

3a Create Model

nba_lr <- glm(top_50 ~ ., data = nba_train_clean, family = binomial(link = 'logit'))

summary(nba_lr)

Call:
glm(formula = top_50 ~ ., family = binomial(link = "logit"), 
    data = nba_train_clean)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -18.6844     7.3431  -2.544   0.0109 *
posPF        -2.1358     1.3850  -1.542   0.1230  
posPG        -1.8758     2.0854  -0.900   0.3684  
posSF        -0.6826     1.6757  -0.407   0.6837  
posSG        -0.3048     1.7937  -0.170   0.8651  
fg            1.1205     0.4981   2.250   0.0245 *
fgp          16.4680    41.3104   0.399   0.6902  
thr           2.3870     1.5595   1.531   0.1259  
thrp         -4.5043     5.9837  -0.753   0.4516  
efg          -0.5173    41.2340  -0.013   0.9900  
trb           0.1575     0.2460   0.640   0.5219  
ast           0.5552     0.4049   1.371   0.1703  
stl           1.3078     1.1259   1.162   0.2454  
blk           1.3646     0.9495   1.437   0.1507  
tov          -1.7518     1.1043  -1.586   0.1127  
pf            0.2102     0.9898   0.212   0.8318  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.818  on 129  degrees of freedom
Residual deviance:  61.859  on 114  degrees of freedom
AIC: 93.859

Number of Fisher Scoring iterations: 7

3a(i) Target Variable

The response variable was converted into a factor with levels ordered as:

  • N = Not Top 50
  • Y = Top 50

3a(ii) Regression Equation

The logistic regression model is:

log(p / (1 − p)) = β0 + β1X1 + β2X2 + … + βnXn

where p represents the probability of a player being in the Top 50.


3a(iii) Significant Variables

Variables with p-values less than 0.05 are considered statistically significant.

From the model output, the only significant variable is:

  • Field goals per game (fg) (p = 0.0245)

This indicates that field goals per game is a key predictor of whether a player is in the Top 50.


3a(iv) Impact on Odds

exp(coef(nba_lr))
 (Intercept)        posPF        posPG        posSF        posSG           fg 
7.681944e-09 1.181444e-01 1.532270e-01 5.052774e-01 7.372898e-01 3.066453e+00 
         fgp          thr         thrp          efg          trb          ast 
1.419001e+07 1.088058e+01 1.106101e-02 5.961075e-01 1.170595e+00 1.742356e+00 
         stl          blk          tov           pf 
3.697919e+00 3.914198e+00 1.734596e-01 1.233917e+00 

The impact of each variable is calculated using exp(β).

For the significant variable:

Field goals (fg): exp(β) = 3.07

This means that a 1-unit increase in field goals per game multiplies the odds of a player being in the Top 50 by approximately 3.07, representing an increase of approximately 207% in the odds.

3b. Accuracy Assessment

Training Accuracy

train_lr_pi <- predict(nba_lr, newdata = nba_train_clean, type = 'response')
train_lr_pred <- ifelse(train_lr_pi >= 0.5, "Y", "N")

train_lr_tab <- table(nba_train_clean$top_50, train_lr_pred)
train_lr_tab
   train_lr_pred
     N  Y
  N 94  5
  Y  6 25
sum(diag(train_lr_tab)) / sum(train_lr_tab)
[1] 0.9153846

Testing Accuracy

test_lr_pi <- predict(nba_lr, newdata = nba_test_clean, type = 'response')
test_lr_pred <- ifelse(test_lr_pi >= 0.5, "Y", "N")

test_lr_tab <- table(nba_test_clean$top_50, test_lr_pred)
test_lr_tab
   test_lr_pred
     N  Y
  N 50  2
  Y  4 10
sum(diag(test_lr_tab)) / sum(test_lr_tab)
[1] 0.9090909

The logistic regression model achieved an accuracy of approximately 91% on the testing dataset.

4. Model Comparison

4a. Accuracy Comparison

The logistic regression model achieved higher accuracy on the testing dataset compared to the classification tree. Therefore, the logistic regression model is more accurate. This suggests that the logistic regression model provides better predictive performance on unseen data.


4b. Variable Comparison

The classification tree identifies important variables based on how the data is split, highlighting variables such as field goals (fg), three-point shots (thr), and rebounds (trb). In contrast, logistic regression identifies important variables using statistical significance, with field goals (fg) being the only significant predictor. Both models highlight the importance of scoring ability, but logistic regression provides clearer statistical evidence of the relationship between field goals and the probability of being in the Top 50.

Part B - Clustering Soccer Players

1. Import Data

fifa <- read_csv("fifa_dataset.csv") %>% na.omit()
Rows: 1000 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): name, nationality, club
dbl (39): age, overall, potential, value, wage, acceleration, aggression, ag...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2 Scaling Justification

Clustering is based on distance measures, and therefore scaling is required before computing the distance matrix.

If the data is not scaled, variables with larger values or variation may dominate the Euclidean distance calculation.

By standardising the variables, each attribute contributes equally to the clustering process, ensuring a fair and accurate clustering outcome.

3 Hierarchical Clustering

3a. Data Preparation and Distance Matrix

fifa_sub <- fifa %>% 
  select(acceleration, ball_control, dribbling, 
         shot_power, short_passing, sprint_speed)

fifa_scaled <- scale(fifa_sub)

d_fifa <- dist(fifa_scaled)

3b. Hierarchical Clustering Model

h_fifa <- hclust(d_fifa, method = 'ward.D')

3c. Visualisation

plot(h_fifa, 
     hang = -1, 
     labels = FALSE, 
     main = "FIFA Player Dendrogram (4-Cluster Solution)",
     xlab = "1,000 Players",
     sub = "")

rect.hclust(h_fifa, k = 4, border = "red")

heatmap(as.matrix(d_fifa), 
        Rowv = as.dendrogram(h_fifa), 
        Colv = 'Rowv', 
        labRow = FALSE, 
        labCol = FALSE,
        main = "FIFA Distance Heatmap")

3c(i) Heatmap Interpretation

The heatmap shows variation in distances between players, with lighter colours representing smaller distances and darker colours representing larger distances.

There are visible blocks of similar colour along the diagonal, which suggests that groups of similar players exist within the dataset.

This provides evidence of clustering structure, although the clusters are not perfectly distinct.

3d. 4-Cluster Solution and Quality Assessment

h_clusters <- cutree(h_fifa, k = 4)

sil_h <- silhouette(h_clusters, d_fifa)
summary(sil_h)
Silhouette of 993 units in 4 clusters from silhouette.default(x = h_clusters, dist = d_fifa) :
 Cluster sizes and average silhouette widths:
       191        388        307        107 
0.34729982 0.23452189 0.07238905 0.69104038 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.48091  0.09872  0.28615  0.25528  0.41572  0.79122 

The mean silhouette score for hierarchical clustering is approximately 0.255.

A silhouette score between 0.25 and 0.5 indicates moderate clustering structure.

This suggests that the hierarchical clustering provides relatively weak to moderate cluster separation.

3e. Profiling the Clusters (Hierarchial)

fifa_profile_h <- fifa %>%
  mutate(Cluster_H = factor(h_clusters))

fifa_profile_h %>%
  group_by(Cluster_H) %>%
  summarise(across(c(acceleration, ball_control, dribbling, shot_power, 
                     short_passing, sprint_speed, age, value, wage), mean))
# A tibble: 4 × 10
  Cluster_H acceleration ball_control dribbling shot_power short_passing
  <fct>            <dbl>        <dbl>     <dbl>      <dbl>         <dbl>
1 1                 87.4         82.5      83.8       76.4          77.7
2 2                 74.8         80.2      77.8       76.9          78.7
3 3                 58.6         70.1      60.9       68.3          72.9
4 4                 48.5         23.7      16.1       25.1          33.0
# ℹ 4 more variables: sprint_speed <dbl>, age <dbl>, value <dbl>, wage <dbl>

4. K-Means Clustering

4a. Model

set.seed(101)
k_fifa <- kmeans(fifa_scaled, centers = 4)

4b. Quality Assessment

sil_k <- silhouette(k_fifa$cluster, d_fifa)
summary(sil_k)
Silhouette of 993 units in 4 clusters from silhouette.default(x = k_fifa$cluster, dist = d_fifa) :
 Cluster sizes and average silhouette widths:
      123       391       371       108 
0.1702185 0.4066718 0.2184586 0.6530206 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1588  0.1932  0.3289  0.3339  0.4943  0.7656 

The mean silhouette score for K-means clustering is approximately 0.334.

This indicates a moderate clustering structure, with better separation between clusters compared to hierarchical clustering.

4c. Profiling the Clusters (K-Means)

fifa_profile <- fifa %>%
  mutate(Cluster = factor(k_fifa$cluster))

fifa_profile %>%
  group_by(Cluster) %>%
  summarise(across(c(acceleration, ball_control, dribbling, shot_power, 
                     short_passing, sprint_speed, age, value, wage), mean))
# A tibble: 4 × 10
  Cluster acceleration ball_control dribbling shot_power short_passing
  <fct>          <dbl>        <dbl>     <dbl>      <dbl>         <dbl>
1 1               49.3         67.0      55.1       63.1          71.1
2 2               82.8         81.3      81.3       77.3          77.7
3 3               68.0         76.2      70.9       73.8          77.0
4 4               48.7         23.9      16.3       25.3          33.1
# ℹ 4 more variables: sprint_speed <dbl>, age <dbl>, value <dbl>, wage <dbl>

Cluster 1 consists of players with relatively low acceleration and moderate technical ability, suggesting average players or those with lower physical performance.

Cluster 2 contains players with very high acceleration, ball control, and dribbling, indicating elite attacking players or highly skilled forwards.

Cluster 3 represents players with balanced performance across all attributes, suggesting well-rounded players who are competent in multiple areas.

Cluster 4 consists of players with very low values across all performance attributes, indicating low-performing players with limited technical ability.

In terms of personal attributes such as age, value, and wage, higher-performing clusters (such as Cluster 2) are likely associated with higher market value and wages, while lower-performing clusters (such as Cluster 4) are associated with lower values.

5. Comparison of Clustering Methods

5a. Best Method

The K-means clustering method produced a higher mean silhouette score (0.334) compared to hierarchical clustering (0.255).

This indicates that K-means provides better cluster quality and clearer separation between groups.

5b. Comparison

Both hierarchical clustering and K-means produce broadly similar groupings of players.

However, K-means produces more distinct and interpretable clusters based on player performance attributes.

Overall, K-means clustering is preferred as it provides stronger clustering structure and better separation of player types.