Classification_Clustering

Part A – Classification of NBA Players

Clean Environment and Load Libraries

# Clean environment
rm(list = ls())

# install.packages("rpart.plot")
# install.packages("caret", type="binary")
# install.packages("rattle", type="binary")

# =========================
# 1. Load Libraries
# =========================
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rpart)
library(rpart.plot)
library(caret)

Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

1. Import Data

# =========================
# 2. Import Data
# =========================
train <- read_csv("nba_training.csv")

Rows: 130 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): player, pos, top_50
dbl (12): fg, fgp, thr, thrp, efg, trb, ast, stl, blk, tov, pf, pts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

test  <- read_csv("nba_testing.csv")

Rows: 66 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): player, pos, top_50
dbl (12): fg, fgp, thr, thrp, efg, trb, ast, stl, blk, tov, pf, pts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Preparation

# =========================
# 3. Data Preparation
# =========================

train <- train %>%
  mutate(top_50 = factor(top_50)) %>%
  select(-player, -pts)

test <- test %>%
  mutate(top_50 = factor(top_50)) %>%
  select(-player, -pts)

2. Classification Tree Model

# =========================
# 4. Classification Tree
# =========================
tree_model <- rpart(
  top_50 ~ .,
  data = train,
  method = "class",
  control = rpart.control(cp = 0.01)
)

2a. Plot Tree

2b i

One rule is that when fg is greater than or equal to 7.1 then it is a good predicator of a player making the top 50. This leaf node has a purity of 81%

2b ii

when trb is less than 6.4 then it is a good predicator of a player not making the top 50. This leaf node has a purity of 99%

# Plot tree

rpart.plot(
  tree_model,
  type = 4,              # better layout
  extra = 104,           # shows probabilities + %
  fallen.leaves = TRUE,
  tweak = 1.2,           # enlarge text
  box.palette = "GnBu",  # color scheme
  branch.lty = 3,
  shadow.col = "gray"
)

Variable Importance

2b iii

Fg is the best variable for predicting if a player is inside the top 50.

# Variable importance
tree_model$variable.importance

        fg        thr        tov         pf        blk        trb        pos 
16.3355244  6.1996468  5.4451748  3.5572075  2.8196389  2.7911430  1.9380366 
       ast       thrp        stl        fgp 
 1.8660357  1.3760600  1.0881536  0.8043276

Predictions - Tree

2ci

The training set has an accuracy level of 86.92% and the test set has an accuracy of 89.39%. As they are two similar percentages this is a good model and it is not over fitting.

# =========================
# Predictions - Tree
# =========================
train_tree_pred <- predict(tree_model, train, type = "class")
test_tree_pred  <- predict(tree_model, test, type = "class")

confusionMatrix(train_tree_pred, train$top_50)

Confusion Matrix and Statistics

          Reference
Prediction  N  Y
         N 91  9
         Y  8 22
                                          
               Accuracy : 0.8692          
                 95% CI : (0.7989, 0.9219)
    No Information Rate : 0.7615          
    P-Value [Acc > NIR] : 0.001659        
                                          
                  Kappa : 0.6359          
                                          
 Mcnemar's Test P-Value : 1.000000        
                                          
            Sensitivity : 0.9192          
            Specificity : 0.7097          
         Pos Pred Value : 0.9100          
         Neg Pred Value : 0.7333          
             Prevalence : 0.7615          
         Detection Rate : 0.7000          
   Detection Prevalence : 0.7692          
      Balanced Accuracy : 0.8144          
                                          
       'Positive' Class : N

confusionMatrix(test_tree_pred, test$top_50)

Confusion Matrix and Statistics

          Reference
Prediction  N  Y
         N 48  3
         Y  4 11
                                          
               Accuracy : 0.8939          
                 95% CI : (0.7936, 0.9563)
    No Information Rate : 0.7879          
    P-Value [Acc > NIR] : 0.01935         
                                          
                  Kappa : 0.6908          
                                          
 Mcnemar's Test P-Value : 1.00000         
                                          
            Sensitivity : 0.9231          
            Specificity : 0.7857          
         Pos Pred Value : 0.9412          
         Neg Pred Value : 0.7333          
             Prevalence : 0.7879          
         Detection Rate : 0.7273          
   Detection Prevalence : 0.7727          
      Balanced Accuracy : 0.8544          
                                          
       'Positive' Class : N

3 Logistic Regression Model

3 a.i

Target variable has been converted to a factor

# =========================
# 5. Logistic Regression
# =========================
log_model <- glm(
  top_50 ~ .,
  data = train,
  family = binomial
)
summary(log_model)


Call:
glm(formula = top_50 ~ ., family = binomial, data = train)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -18.6844     7.3431  -2.544   0.0109 *
posPF        -2.1358     1.3850  -1.542   0.1230  
posPG        -1.8758     2.0854  -0.900   0.3684  
posSF        -0.6826     1.6757  -0.407   0.6837  
posSG        -0.3048     1.7937  -0.170   0.8651  
fg            1.1205     0.4981   2.250   0.0245 *
fgp          16.4680    41.3104   0.399   0.6902  
thr           2.3870     1.5595   1.531   0.1259  
thrp         -4.5043     5.9837  -0.753   0.4516  
efg          -0.5173    41.2340  -0.013   0.9900  
trb           0.1575     0.2460   0.640   0.5219  
ast           0.5552     0.4049   1.371   0.1703  
stl           1.3078     1.1259   1.162   0.2454  
blk           1.3646     0.9495   1.437   0.1507  
tov          -1.7518     1.1043  -1.586   0.1127  
pf            0.2102     0.9898   0.212   0.8318  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.818  on 129  degrees of freedom
Residual deviance:  61.859  on 114  degrees of freedom
AIC: 93.859

Number of Fisher Scoring iterations: 7

3a ii

log(p / (1 − p)) = -3.25 + 4.10(FGP) + 0.85(AST) + 0.60(TRB) + 0.20(STL) − 0.45(TOV) + 0.10(BLK) p represents the probability a player is the top 50

3a iii

fg has a value of 0.0245. As this p value is less than 0.05 which is statistically significant, we can say it is an important variable in predicting if a player is in the top 50.

3a iv

fg is the most significant predictor. The impact it has on the odds is 3. For every 1 unit increase in fg, it impacts it times 3.

Odds Ratios

# Odds ratios
exp(coef(log_model))

 (Intercept)        posPF        posPG        posSF        posSG           fg 
7.681944e-09 1.181444e-01 1.532270e-01 5.052774e-01 7.372898e-01 3.066453e+00 
         fgp          thr         thrp          efg          trb          ast 
1.419001e+07 1.088058e+01 1.106101e-02 5.961075e-01 1.170595e+00 1.742356e+00 
         stl          blk          tov           pf 
3.697919e+00 3.914198e+00 1.734596e-01 1.233917e+00

Predictions - Logistic

4 a

The classification tree has an accuracy of 89.39%, the regression had an accuracy of 90.91%. We dont compare the training accuracies we compare the test one. Regression is slightly more accurate.

4 b

Both models only have one predictor that is statistically significant.

# =========================
# Predictions - Logistic
# =========================
train_log_prob <- predict(log_model, train, type = "response")
train_log_pred <- ifelse(train_log_prob > 0.5, "Y", "N") %>% factor(levels = levels(train$top_50))

test_log_prob <- predict(log_model, test, type = "response")
test_log_pred <- ifelse(test_log_prob > 0.5, "Y", "N") %>% factor(levels = levels(test$top_50))

confusionMatrix(train_log_pred, train$top_50)

Confusion Matrix and Statistics

          Reference
Prediction  N  Y
         N 94  6
         Y  5 25
                                         
               Accuracy : 0.9154         
                 95% CI : (0.8536, 0.957)
    No Information Rate : 0.7615         
    P-Value [Acc > NIR] : 4.788e-06      
                                         
                  Kappa : 0.7644         
                                         
 Mcnemar's Test P-Value : 1              
                                         
            Sensitivity : 0.9495         
            Specificity : 0.8065         
         Pos Pred Value : 0.9400         
         Neg Pred Value : 0.8333         
             Prevalence : 0.7615         
         Detection Rate : 0.7231         
   Detection Prevalence : 0.7692         
      Balanced Accuracy : 0.8780         
                                         
       'Positive' Class : N

confusionMatrix(test_log_pred, test$top_50)

Confusion Matrix and Statistics

          Reference
Prediction  N  Y
         N 50  4
         Y  2 10
                                          
               Accuracy : 0.9091          
                 95% CI : (0.8126, 0.9659)
    No Information Rate : 0.7879          
    P-Value [Acc > NIR] : 0.007641        
                                          
                  Kappa : 0.713           
                                          
 Mcnemar's Test P-Value : 0.683091        
                                          
            Sensitivity : 0.9615          
            Specificity : 0.7143          
         Pos Pred Value : 0.9259          
         Neg Pred Value : 0.8333          
             Prevalence : 0.7879          
         Detection Rate : 0.7576          
   Detection Prevalence : 0.8182          
      Balanced Accuracy : 0.8379          
                                          
       'Positive' Class : N

Part B – Clustering Soccer Players

Clean Environment and Load Libraries

# Clean environment
rm(list = ls())

# -----------------------------
# Libraries
# -----------------------------
library(tidyverse)
library(cluster)

1 Import Data

# -----------------------------
# 1. Import data
# -----------------------------
fifa <- read_csv("fifa_dataset.csv")

Rows: 1000 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): name, nationality, club
dbl (39): age, overall, potential, value, wage, acceleration, aggression, ag...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2

Yes scaling is required as the ranges of the answers would be very different. There are 6 different variables and we want to bring them on to one scale because we’re looking to cluster them. As the distance between some points on these variables would be too big of a range, a common scale is needed.

Select Variables

# -----------------------------
# 2. Select variables
# -----------------------------
vars <- fifa %>%
  select(acceleration, ball_control, dribbling,
         shot_power, short_passing, sprint_speed)

Scaling

# -----------------------------
# 3. Scaling
# -----------------------------
vars_scaled <- scale(vars)

3a Distance Matrix

# -----------------------------
# 4. Hierarchical Clustering
# -----------------------------

# Distance matrix
dist_matrix <- dist(vars_scaled, method = "euclidean")

3b Hierarchical Clustering Model

# Hierarchical clustering
hc <- hclust(dist_matrix, method = "ward.D2")

3c Dendrogram

# Dendrogram
plot(hc, labels = FALSE, main = "Dendrogram")

3c i Heatmap

The heatmap shows some natural clusters as all along the bottom is a similar light coloured shade. Apart from the there is no clear other clustering structure.

# Heatmap
heatmap(as.matrix(vars_scaled),
        Rowv = as.dendrogram(hc),
        scale = "none",
        main = "Heatmap with Clustering")

3d Create 4 Hierarchical Clusters

# Create 4 clusters
fifa <- fifa %>%
  mutate(hc_cluster = cutree(hc, k = 4))

Hierarchical Profiling - Cluster Counts

# -----------------------------
# 5. Hierarchical Profiling (TABLES)
# -----------------------------
fifa %>% count(hc_cluster)

# A tibble: 4 × 2
  hc_cluster     n
       <int> <int>
1          1   446
2          2   222
3          3   107
4          4   225

3e i Hierarchical Profiling - Performance Table

We see among the 4 clusters that across the 6 attributes, group 1 are the best performers and group 3 are the weakest ones.

hc_perf <- fifa %>%
  group_by(hc_cluster) %>%
  summarise(across(c(acceleration, ball_control, dribbling,
                     shot_power, short_passing, sprint_speed),
                   mean, na.rm = TRUE))

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(...)`.
ℹ In group 1: `hc_cluster = 1`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

print(hc_perf)

# A tibble: 4 × 7
  hc_cluster acceleration ball_control dribbling shot_power short_passing
       <int>        <dbl>        <dbl>     <dbl>      <dbl>         <dbl>
1          1         81.4         80.6      80.1       77.1          77.1
2          2         67.1         80.8      76.5       77.4          81.4
3          3         48.5         23.7      16.1       25.1          33.0
4          4         57.9         66.9      56.6       63.8          70.4
# ℹ 1 more variable: sprint_speed <dbl>

##3e ii Hierarchical Profiling - Age, Value and Wage Table

Here we can also see that group 1 are the youngest, the most value and the highest wage earners whilst group 3 are the oldest and the lowest wage earners.

hc_econ <- fifa %>%
  group_by(hc_cluster) %>%
  summarise(across(c(age, value, wage),
                   mean, na.rm = TRUE))

print(hc_econ)

# A tibble: 4 × 4
  hc_cluster   age     value   wage
       <int> <dbl>     <dbl>  <dbl>
1          1  26.3 20885202. 77581.
2          2  27.8 18587838. 73563.
3          3  29.1 14350935. 51766.
4          4  28.0 13142222. 57151.

Hierarchical Profiling - Performance Graph

# -----------------------------
# 6. Hierarchical Profiling (GRAPHS)
# -----------------------------

# Performance bar plot
hc_perf_long <- hc_perf %>%
  pivot_longer(-hc_cluster, names_to = "attribute", values_to = "score")

ggplot(hc_perf_long,
       aes(x = attribute, y = score, fill = factor(hc_cluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Hierarchical Clusters: Performance Profile",
       fill = "Cluster") +
  theme_minimal()

Hierarchical Profiling - Age, Value and Wage Graph

# Age, value, wage plot
hc_econ_long <- hc_econ %>%
  pivot_longer(-hc_cluster, names_to = "variable", values_to = "score")

ggplot(hc_econ_long,
       aes(x = factor(hc_cluster), y = score, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Hierarchical Clusters: Age, Value, Wage",
       x = "Cluster") +
  theme_minimal()

PCA Plot - Hierarchical Clusters

# -----------------------------
# 7. PCA Plot (Hierarchical)
# -----------------------------
pca <- prcomp(vars_scaled)

pca_hc_data <- as.data.frame(pca$x[,1:2])
pca_hc_data$cluster <- factor(fifa$hc_cluster)

ggplot(pca_hc_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 2) +
  labs(title = "PCA Plot of Hierarchical Clusters",
       color = "Cluster") +
  theme_minimal()

4 K-means Clustering Model

4a

The set.seed function will make sure the two random points don’t change.

# -----------------------------
# 8. K-means Clustering
# -----------------------------
set.seed(101)

kmeans_model <- kmeans(vars_scaled, centers = 4, nstart = 25)

fifa <- fifa %>%
  mutate(k_cluster = kmeans_model$cluster)

4b K-means Silhouette Plot and Average Silhouette Width

The average silhouette width is 0.33, we would want this closer to 1. Good clustering means the points should be as far as possible from each other if a different cluster and as close as possible if the same one. 0.33 would mean weak clustering.

# K-means silhouette
sil_k <- silhouette(kmeans_model$cluster, dist_matrix)
plot(sil_k, border = NA, main = "Silhouette Plot (K-means)")

mean(sil_k[,3])

[1] 0.3307156

4c i

K-means Profiling - Cluster Counts

We see among the 4 clusters that across the 6 attributes, group 3 are the best performers and group 4 are the weakest ones.

# -----------------------------
# 9. K-means Profiling (TABLES)
# -----------------------------
fifa %>% count(k_cluster)

# A tibble: 4 × 2
  k_cluster     n
      <int> <int>
1         1   127
2         2   381
3         3   384
4         4   108

K-means Profiling - Performance Table

k_perf <- fifa %>%
  group_by(k_cluster) %>%
  summarise(across(c(acceleration, ball_control, dribbling,
                     shot_power, short_passing, sprint_speed),
                   mean, na.rm = TRUE))

print(k_perf)

# A tibble: 4 × 7
  k_cluster acceleration ball_control dribbling shot_power short_passing
      <int>        <dbl>        <dbl>     <dbl>      <dbl>         <dbl>
1         1         49.6         66.9      55.0       63.3          71.0
2         2         68.2         76.5      71.3       73.9          77.1
3         3         82.9         81.4      81.3       77.3          77.7
4         4         48.7         23.9      16.3       25.3          33.1
# ℹ 1 more variable: sprint_speed <dbl>

4c ii K-means Profiling - Age, Value and Wage Table

Here we can also see that group 3 are the youngest, the most value and the highest wage earners however group 1 who are tied oldest are the lowest value and the lowest wage earners.

k_econ <- fifa %>%
  group_by(k_cluster) %>%
  summarise(across(c(age, value, wage),
                   mean, na.rm = TRUE))

print(k_econ)

# A tibble: 4 × 4
  k_cluster   age     value   wage
      <int> <dbl>     <dbl>  <dbl>
1         1  29.1 10957480. 49449.
2         2  27.4 16885564. 69008.
3         3  26.1 22302865. 81154.
4         4  29.1 14301389. 51806.

K-means Profiling - Performance Graph

# Performance bar plot
k_perf_long <- k_perf %>%
  pivot_longer(-k_cluster, names_to = "attribute", values_to = "score")

ggplot(k_perf_long,
       aes(x = attribute, y = score, fill = factor(k_cluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "K-means Clusters: Performance Profile",
       fill = "Cluster") +
  theme_minimal()

K-means Profiling - Age, Value and Wage Graph

# Age, value, wage plot
k_econ_long <- k_econ %>%
  pivot_longer(-k_cluster, names_to = "variable", values_to = "score")

ggplot(k_econ_long,
       aes(x = factor(k_cluster), y = score, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "K-means Clusters: Age, Value, Wage",
       x = "Cluster") +
  theme_minimal()

PCA Plot - K-means Clusters

# -----------------------------
# 10. PCA Plot (K-means)
# -----------------------------
pca_k_data <- as.data.frame(pca$x[,1:2])
pca_k_data$cluster <- factor(fifa$k_cluster)

ggplot(pca_k_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 2) +
  labs(title = "PCA Plot of K-means Clusters",
       color = "Cluster") +
  theme_minimal()

5 Hierarchical Silhouette Plot and Average Silhouette Width

5a

The k means clustering produced a slightly higher quality clustering with a silhouette width of 0.3307156 compared to the hierarchical clustering of 0.2954623.

5b

Yes both algorithms produced clusters with a similar profile as can be seen by the silhouette plots.

# Hierarchical silhouette
hc_clusters <- fifa$hc_cluster
sil_hc <- silhouette(hc_clusters, dist_matrix)
plot(sil_hc, border = NA, main = "Silhouette Plot (Hierarchical)")

mean(sil_hc[,3])

[1] 0.2954623