Classification & Clustering Assignment

Part A: Classify NBA Players

Question 1: Importing the Data

# Load all relevant packages for NBA Dataset 
library(tidyverse)
library(rpart)
library(rattle)

nba_training <- read_csv("nba_training.csv")
nba_testing <- read_csv("nba_testing.csv")

# Remove player names, and average points scored per game (pts)
nba_training <- nba_training %>% select(-player, -pts)
nba_testing <- nba_testing %>% select(-player, -pts)

Question 2: Classification Tree

2a

nba_model_tree <- rpart(top_50 ~ ., data = nba_training, method = 'class')

fancyRpartPlot(nba_model_tree)

2b

(i)

If a player averages 7.1 or more field goals per game, they are predicted to be in the Top 50. This leaf (node 3) has a purity of 81%, meaning 81% of players meeting this rule actually were in the Top 50.

(ii)

If a player averages less than 7.1 field goals per game AND less than 2.3 three-pointers AND less than 6.4 total rebounds, they are predicted to be outside the Top 50. This leaf (node 8) has a purity of 99%.

(iii)

nba_model_tree$variable.importance
summary(nba_model_tree)

The most important variable for predicting whether a player is in the Top 50 is fg (average field goals per game), accounting for 37% of all improvements made to the model. The next most important variables are thr (three-pointers per game) at 14% and tov (turnovers per game) at 12%, which together with fg account for over 60% of the model’s predictive power.

2c

# Training Accuracy  

nba_training_tree_prob <- predict(nba_model_tree, newdata = nba_training, type = "prob")
nba_training_tree_prediction <- predict(nba_model_tree, newdata = nba_training, type = "class")
nba_training_tree_final <- cbind(nba_training, nba_training_tree_prob, nba_training_tree_prediction)
nba_training_tree_tab <- table(nba_training_tree_final$top_50, 
                               nba_training_tree_final$nba_training_tree_prediction,
                               dnn = c("Actual", "Predicted"))

nba_training_tree_tab

nba_training_tree_acc <- sum(diag(nba_training_tree_tab)) / sum(nba_training_tree_tab)

nba_training_tree_acc

# Testing Accuracy 

nba_testing_tree_prob <- predict(nba_model_tree, newdata = nba_testing, type = "prob")
nba_testing_tree_prediction <- predict(nba_model_tree, newdata = nba_testing, type = "class")
nba_testing_tree_final <- cbind(nba_testing, nba_testing_tree_prob, nba_testing_tree_prediction)
nba_testing_tree_tab <- table(nba_testing_tree_final$top_50, 
                              nba_testing_tree_final$nba_testing_tree_prediction,
                              dnn = c("Actual", "Predicted"))

nba_testing_tree_tab

nba_testing_tree_acc <- sum(diag(nba_testing_tree_tab)) / sum(nba_testing_tree_tab)

nba_testing_tree_acc

(i)

The classification tree achieved an accuracy of 86.92% on the training dataset and 89.39% on the testing dataset.This suggests that the model is not overfitting, because its performance on the testing data is slightly better than its performance on the training data.Since the two accuracy values are close, the model appears to generalise well to new data.

Question 3 Binary Logistic Regression Method

3a

# Firstly we need to set predictors (i):

nba_training$top_50 <- factor(nba_training$top_50, levels = c("N", "Y"))
nba_testing$top_50  <- factor(nba_testing$top_50, levels = c("N", "Y"))

nba_model_lr <- glm(top_50 ~ ., data = nba_training, family = binomial(link = "logit"))

summary(nba_model_lr)


Call:
glm(formula = top_50 ~ ., family = binomial(link = "logit"), 
    data = nba_training)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -18.6844     7.3431  -2.544   0.0109 *
posPF        -2.1358     1.3850  -1.542   0.1230  
posPG        -1.8758     2.0854  -0.900   0.3684  
posSF        -0.6826     1.6757  -0.407   0.6837  
posSG        -0.3048     1.7937  -0.170   0.8651  
fg            1.1205     0.4981   2.250   0.0245 *
fgp          16.4680    41.3104   0.399   0.6902  
thr           2.3870     1.5595   1.531   0.1259  
thrp         -4.5043     5.9837  -0.753   0.4516  
efg          -0.5173    41.2340  -0.013   0.9900  
trb           0.1575     0.2460   0.640   0.5219  
ast           0.5552     0.4049   1.371   0.1703  
stl           1.3078     1.1259   1.162   0.2454  
blk           1.3646     0.9495   1.437   0.1507  
tov          -1.7518     1.1043  -1.586   0.1127  
pf            0.2102     0.9898   0.212   0.8318  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.818  on 129  degrees of freedom
Residual deviance:  61.859  on 114  degrees of freedom
AIC: 93.859

Number of Fisher Scoring iterations: 7

(ii)

regression equation = log(p / (1 − p)) = -18.6844 − 2.1358(PF) − 1.8758(PG) − 0.6826(SF) − 0.3048(SG) + 1.1205(fg) + 16.4680(fgp) + 2.3870(thr) − 4.5043(thrp) − 0.5173(efg) + 0.1575(trb) + 0.5552(ast) + 1.3078(stl) + 1.3646(blk) − 1.7518(tov) + 0.2102(pf)

(iii)

The only statistically significant predictor variable is average field goals made per game (fg), with a p-value of 0.0245, which is less than the 5% significance level. This indicates that fg is an important variable in predicting whether a player is in the Top 50.All other variables have p-values greater than 0.05, indicating that they are not statistically significant predictors in this model.

(iv)

To do this we use e1.1205 or exp in R, shown below. The answer is 3.066387. A one-unit increase in field goals per game (fg) increases the odds of a player being in the Top 50 by a factor of approximately 3.07,holding all other variables constant.

exp(1.1205)

[1] 3.066387

3b

On the training dataset, the model correctly classified 119 out of 130 players, resulting in an accuracy of 91.54%. On the testing dataset, the model correctly classified 60 out of 66 players, resulting in an accuracy of 90.91%.

# Training Data:

nba_training_lr_pi <- predict(nba_model_lr, newdata = nba_training, type = "response")

nba_training_lr_final <- nba_training %>%
  mutate(pi = nba_training_lr_pi) %>%
  mutate(nba_training_lr_prediction = case_when(
    pi > 0.5  ~ "Y",
    pi <= 0.5 ~ "N"
  ))

nba_training_lr_final$nba_training_lr_prediction <- factor(
  nba_training_lr_final$nba_training_lr_prediction,
  levels = c("N", "Y")
)

nba_training_lr_tab <- table(
  nba_training_lr_final$top_50,
  nba_training_lr_final$nba_training_lr_prediction,
  dnn = c("Actual", "Predicted")
)

nba_training_lr_tab

nba_training_lr_acc <- sum(diag(nba_training_lr_tab)) / sum(nba_training_lr_tab)
nba_training_lr_acc

# Testing Data:

nba_testing_lr_pi <- predict(nba_model_lr, newdata = nba_testing, type = "response")

nba_testing_lr_final <- nba_testing %>%
  mutate(pi = nba_testing_lr_pi) %>%
  mutate(nba_testing_lr_prediction = case_when(
    pi > 0.5  ~ "Y",
    pi <= 0.5 ~ "N"
  ))

nba_testing_lr_final$nba_testing_lr_prediction <- factor(
  nba_testing_lr_final$nba_testing_lr_prediction,
  levels = c("N", "Y")
)

nba_testing_lr_tab <- table(
  nba_testing_lr_final$top_50,
  nba_testing_lr_final$nba_testing_lr_prediction,
  dnn = c("Actual", "Predicted")
)

nba_testing_lr_tab

nba_testing_lr_acc <- sum(diag(nba_testing_lr_tab)) / sum(nba_testing_lr_tab)
nba_testing_lr_acc

Question 4 Compare and Contrast both Classification and Regression Methods

4a

Classification Tree

Training accuracy: 86.92%

Testing accuracy: 89.39%

Logistic Regression

Training accuracy: 91.54%

Testing accuracy: 90.91%

The binary logistic regression model is slightly more accurate than the classification tree model.The logistic regression model achieved a testing accuracy of 90.91%, compared to 89.39% for the classification tree.

4b

The classification tree identified several variables as important, with field goals (fg) being the most important predictor, followed by free throws (thr) and turnovers (tov). The tree model uses these variables to create decision rules that classify players into Top 50 or not.

In contrast, the logistic regression model found that only field goals (fg) was a statistically significant predictor at the 5% significance level, with all other variables having p-values greater than 0.05.

Part B: Clustering Soccer Players

Question 1: Importing the Dataset

library(cluster)
library(kableExtra)
library(ggplot2)

fifa_data <- read_csv("fifa_dataset.csv")

Question 2: Does the data need to be scaled before computing the distance matrix for hierarchical clustering or before being entered into the K-means clustering algorithm? Explain your answer.

fifa_data_clean <- fifa_data %>%
  select(name, acceleration, ball_control, dribbling, shot_power, short_passing, sprint_speed)

Scaling the data for either method is not needed as it is presented on a comparable scale already (i.e, 0-100).

Question 3: Hierarchical Clustering

3a

fifa_data_final <- fifa_data_clean %>%
  select(-c(name))

d1 <- dist(fifa_data_final)

3b + c

h1 <- hclust(d1, method = 'ward.D')

plot(h1, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

3c (i)

The dataset likely contains clusters but they are not well-defined or highly distinct from viewing this heatmap. There are some light grouped color patterns (e.g., top left) suggesting that certain observations are more similar within groups but overall there are not many lightly cluster blocks.

3d

clusters <- cutree(h1, k=4)
sil1 <- silhouette(clusters, d1)
summary(sil1)

Overall the score is weak (0.297) suggesting the structure of these clusters are weak, and could be possibly be artificial but cluster 3 returns a score of 0.694, meaning a strong cluster could be found here.

3e

fifa_clus <- cbind(fifa_data_clean, clusters)

fifa_clus <- fifa_clus %>%
  mutate(Cluster = case_when(
    clusters == 1 ~ "C1",
    clusters == 2 ~ "C2",
    clusters == 3 ~ "C3",
    clusters == 4 ~ "C4"
  ))

fifa_clus_means <- fifa_clus %>%
  group_by(Cluster) %>%
  summarise(acceleration = mean(acceleration),
            ball_control = mean(ball_control),
            dribbling = mean(dribbling),
            shot_power= mean(shot_power),
            short_passing = mean(short_passing),
            sprint_speed = mean(sprint_speed))

kable(fifa_clus_means)

Cluster	acceleration	ball_control	dribbling	shot_power	short_passing	sprint_speed
C1	80.83537	81.07520	80.42276	76.96748	77.89431	80.10366
C2	65.15026	79.01554	73.89637	77.23834	79.44041	65.20725
C3	48.49533	23.71028	16.11215	25.05607	33.04673	49.23364
C4	57.01923	66.33173	55.82212	63.28846	70.32692	60.78365

C1 is clearly the highest-performing cluster across almost all attributes, with the highest scores for acceleration (~81), ball control (~81), dribbling (~80), and sprint speed (~80). C2 stands out for having high ball control (~79) and short passing (~79) but notably lower acceleration and sprint speed (~65), suggesting these are technically skilled but slower players. C4 sits in the middle across all attributes, representing average performers. C3 is the lowest-performing cluster by a large margin, particularly in ball control (~24) and dribbling (~16).

fifa_clus_ii <- cbind(fifa_data, clusters)
fifa_clus_ii <- fifa_clus_ii %>%
  mutate(Cluster = case_when(
    clusters == 1 ~ "C1",
    clusters == 2 ~ "C2",
    clusters == 3 ~ "C3",
    clusters == 4 ~ "C4"
  ))

fifa_clus_means_ii <- fifa_clus_ii %>%
  group_by(Cluster) %>%
  summarise(age = mean(age),
            value = mean(value),
            wage = mean(wage)) 

kable(fifa_clus_means_ii)

Cluster	age	value	wage
C1	26.31301	21355081	79225.61
C2	28.06736	16098446	65772.02
C3	29.10280	14350935	51766.36
C4	28.02885	13387500	58259.62

C1 is the youngest cluster on average (~26 years) and commands by far the highest club value (~€21.4M) and wage (~€79,226), consistent with their elite performance profile. C2 and C4 are similar in age (~28 years) but differ in value and wage, with C2 earning more. C3 is the oldest cluster (~29 years) and has the lowest wage (~€51,766), but interestingly is not the lowest in club value — C4 holds that position (~€13.4M). This suggests C3 may contain experienced but declining players, while C4 contains younger but lower-value performers.

fifa_clus_long <- fifa_clus_means %>%
  pivot_longer(cols = -Cluster, names_to = "Attribute", values_to = "Mean_Score")

ggplot(fifa_clus_long, aes(x = Attribute, y = Mean_Score, fill = Cluster)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Hierarchical Cluster Profiles – Performance Attributes",
       x = "Attribute", y = "Mean Score") +
  theme_minimal()

# Graph ii – age, value, wage
fifa_clus_ii_long <- fifa_clus_means_ii %>%
  pivot_longer(cols = -Cluster, names_to = "Variable", values_to = "Mean_Value")

ggplot(fifa_clus_ii_long, aes(x = Cluster, y = Mean_Value, fill = Cluster)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ Variable, scales = "free_y") +
  labs(title = "Hierarchical Cluster Profiles – Age, Value & Wage") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()

Question 4: K-means

4a

set.seed(101)
km1 <- kmeans(fifa_data_final, centers = 4)

km1$cluster

   [1] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 4 3 4 2 2 4 2 3 2 4 2 2 2 2 3 2 2 3 2 2
  [38] 3 3 2 1 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 3 2 2 3 2 3 1 2 2 3 2 2 2 1 4 2 1 2
  [75] 2 2 2 2 3 1 4 1 1 3 1 2 3 4 2 2 2 3 2 2 1 2 2 2 2 2 2 2 1 1 2 4 2 2 2 2 3
 [112] 2 2 1 1 2 2 2 1 2 3 2 2 2 3 3 2 2 4 3 1 3 3 2 1 4 3 3 2 2 2 2 2 2 3 3 3 2
 [149] 3 3 3 2 3 1 2 2 3 3 4 3 4 2 2 1 2 2 3 3 4 2 3 3 3 3 3 2 4 4 4 2 2 2 2 3 2
 [186] 4 4 2 2 3 2 2 1 2 2 2 2 2 2 1 3 2 2 2 2 2 3 2 2 3 3 1 2 2 3 2 2 3 3 2 2 2
 [223] 1 2 4 2 3 2 3 2 2 4 2 3 2 2 3 3 2 2 3 2 2 4 2 2 2 2 3 3 1 1 1 4 3 3 2 2 3
 [260] 2 2 2 3 1 2 2 3 3 3 2 2 2 3 2 2 2 2 3 1 1 3 3 1 3 2 2 2 3 3 2 2 3 2 2 3 3
 [297] 4 1 2 2 2 1 2 2 3 4 2 2 2 3 2 2 2 2 2 2 2 4 4 1 3 2 4 2 3 3 2 2 1 3 3 2 2
 [334] 2 2 2 3 4 3 3 2 2 3 2 2 1 1 3 2 2 3 2 2 2 1 2 2 2 3 3 2 2 1 3 3 2 2 1 3 1
 [371] 3 2 3 2 2 1 2 2 3 2 2 2 3 1 3 2 2 1 2 3 2 3 3 3 2 2 2 2 3 1 3 2 2 2 2 2 2
 [408] 2 2 3 2 2 1 2 1 3 2 3 4 2 3 2 2 2 3 2 2 2 2 2 2 4 1 1 4 3 3 2 3 3 3 2 2 2
 [445] 3 1 1 3 2 2 2 2 4 2 2 2 2 3 1 2 3 2 3 1 1 1 2 2 3 2 2 2 2 2 3 3 2 2 4 4 4
 [482] 2 2 3 4 4 2 3 2 2 3 2 3 2 2 2 3 3 2 2 3 3 2 3 3 3 2 2 2 3 3 3 3 3 2 1 4 4
 [519] 4 3 3 3 2 2 3 1 3 1 2 2 2 1 1 2 3 1 1 1 1 2 2 2 2 3 2 2 2 2 3 3 3 2 2 1 3
 [556] 2 2 4 2 2 2 2 2 3 2 2 2 2 1 2 1 2 2 3 3 2 2 3 2 2 3 3 2 2 2 3 4 4 3 1 3 3
 [593] 4 2 2 2 3 2 1 3 3 1 1 2 3 2 2 2 2 2 3 3 2 3 3 3 3 3 2 3 3 1 1 2 2 4 1 2 2
 [630] 3 3 1 2 3 2 2 1 3 3 2 2 2 2 2 1 1 3 1 2 2 2 1 4 1 1 1 2 3 3 2 3 3 1 3 3 3
 [667] 1 1 1 1 2 2 2 1 1 3 4 2 1 2 3 4 4 4 3 3 3 4 4 2 2 2 2 3 1 1 2 3 1 1 2 2 2
 [704] 2 2 4 4 3 3 3 2 2 2 3 1 2 2 2 2 3 3 3 2 2 2 2 3 3 3 2 3 3 3 3 2 3 2 2 2 1
 [741] 1 3 1 3 1 1 3 1 1 4 3 1 1 1 4 4 4 2 3 1 2 1 1 1 3 1 1 4 4 2 2 2 2 2 2 2 2
 [778] 2 2 3 2 3 1 1 1 3 3 3 3 3 1 1 2 1 3 1 3 1 1 4 4 3 3 3 3 3 3 1 2 3 1 4 4 2
 [815] 2 3 3 3 2 1 1 2 2 1 2 4 1 4 3 4 3 2 2 4 4 2 2 2 3 3 1 1 2 3 1 1 1 3 1 1 1
 [852] 2 4 1 4 3 4 4 4 4 4 2 2 3 3 3 1 3 1 2 2 4 4 2 2 3 4 4 2 2 2 2 3 3 3 1 1 4
 [889] 4 4 4 4 3 3 1 2 3 2 1 3 1 3 1 1 4 3 3 1 3 1 1 1 3 2 3 4 4 3 4 4 4 4 3 3 1
 [926] 1 4 4 4 4 3 3 1 1 3 1 3 2 2 1 3 1 1 1 3 4 3 1 1 1 1 3 3 2 3 4 3 1 3 4 2 3
 [963] 3 3 3 2 2 3 1 3 4 3 4 1 3 3 1 1 1 3 4 1 3 4 1 1 1 1 1 4 4 4 1 2 3 3 3 3 2
[1000] 1

km1$centers

  acceleration ball_control dribbling shot_power short_passing sprint_speed
1     58.16667     64.53448  53.31034   60.32759      68.72989     62.16092
2     82.05568     81.30858  81.06497   77.12297      77.85151     81.35731
3     65.00346     78.59862  73.58478   76.90657      78.97924     65.09689
4     48.31132     23.44340  15.88679   25.06604      32.84906     49.17925

km1$size

[1] 174 431 289 106

km1$iter

[1] 4

4b

sil_km <- silhouette(km1$cluster, d1)
summary(sil_km)

Silhouette of 1000 units in 4 clusters from silhouette.default(x = km1$cluster, dist = d1) :
 Cluster sizes and average silhouette widths:
      174       431       289       106 
0.1777123 0.3802751 0.2177741 0.6883372 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1060  0.1807  0.3313  0.3307  0.4788  0.7898

The overall average silhouette for the K-means solution is 0.3307, which indicates a weak to moderate clustering structure. This is similar to the hierarchical result (0.297), though slightly better overall.

Looking at the individual clusters, the quality varies considerably. C4 is by far the strongest cluster with a silhouette width of 0.688, indicating that its 106 players are well-matched to their own cluster and clearly separated from others. C2 is moderate at 0.380, representing 431 players with reasonable but not strong separation. C3 is weak at 0.218 and C1 is the weakest at 0.178, with its 174 players being poorly separated from neighbouring clusters. The minimum silhouette width of -0.106 also indicates that at least some players may have been placed in the wrong cluster.

Overall, the solution is acceptable but not highly distinct, with one strong cluster (C4) driving much of the quality.

4c

(i)

fifa_km_means_tidy <- as_tibble(km1$centers) %>%
  mutate(Cluster = c("C1", "C2", "C3", "C4")) %>%
  pivot_longer(cols = c(acceleration, ball_control, dribbling, 
                        shot_power, short_passing, sprint_speed),
               names_to = "Attribute", values_to = "Mean_Score")

ggplot(fifa_km_means_tidy, aes(x = Attribute, y = Mean_Score, 
                                group = Cluster, colour = Cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Score") +
  ggtitle("K-means Cluster Profiles – Performance Attributes")

This line graph reveals a clear hierarchy across the four clusters. C2 is the highest-performing cluster across almost all attributes, scoring above 80 in acceleration, ball control, dribbling and sprint speed, suggesting these are elite all-round players. C3 is close behind, performing strongly across all attributes and particularly high in shot power and short passing (~79), indicating technically well-rounded players. C1 sits in the middle, with moderate scores across all attributes ranging from roughly 53 to 70, representing average performers. C4 is clearly the lowest-performing cluster, with notably weak dribbling (~17) and ball control (~23)

(ii)

fifa_km_clus_ii <- cbind(fifa_data, km_cluster = km1$cluster) %>%
  mutate(Cluster = case_when(
    km_cluster == 1 ~ "C1",
    km_cluster == 2 ~ "C2",
    km_cluster == 3 ~ "C3",
    km_cluster == 4 ~ "C4"
  ))

fifa_km_means_ii <- fifa_km_clus_ii %>%
  group_by(Cluster) %>%
  summarise(age   = mean(age),
            value = mean(value),
            wage  = mean(wage))

kable(fifa_km_means_ii)

Cluster	age	value	wage
C1	27.72989	13317241	58540.23
C2	26.16705	21964037	80707.66
C3	28.11765	15971626	65224.91
C4	29.03774	14475000	51971.70

C2 players are the most valuable, with the highest average club value (~€21.96M) and wage (~€80,708), and are also the youngest on average (~26.2 years), consistent with their elite performance profile. C3 players earn the second highest wage (~€65,225) and are relatively young (~28.1 years). C1 sits in the middle for both value and wage. C4 is the oldest cluster (~29 years) and commands the lowest wage (~€51,972) and second lowest club value (~€14.5M), consistent with their weaker performance scores.

fifa_km_ii_long <- fifa_km_means_ii %>%
  pivot_longer(cols = -Cluster, names_to = "Variable", values_to = "Mean_Value")

ggplot(fifa_km_ii_long, aes(x = Cluster, y = Mean_Value, fill = Cluster)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ Variable, scales = "free_y") +
  labs(title = "K-means Cluster Profiles – Age, Value & Wage") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()

Question 5: Comparing the two methods

5a

The K-means algorithm produced slightly higher quality clusters overall. The average silhouette width for K-means was 0.3307 compared to 0.297 for hierarchical clustering, indicating that K-means achieved marginally better defined and more separated clusters. Both solutions are considered weak to moderate in quality, however both algorithms agreed that one cluster stood out as particularly strong, cluster 3 in hierarchical (0.694) and cluster C4 in K-means (0.688).

5b

Both algorithms produced broadly similar cluster profiles. Both identified the same four distinct player types in terms of various attributes.