# Load all relevant packages for NBA Dataset
library(tidyverse)
library(rpart)
library(rattle)
nba_training <- read_csv("nba_training.csv")
nba_testing <- read_csv("nba_testing.csv")
# Remove player names, and average points scored per game (pts)
nba_training <- nba_training %>% select(-player, -pts)
nba_testing <- nba_testing %>% select(-player, -pts)Classification & Clustering Assignment
Part A: Classify NBA Players
Question 1: Importing the Data
Question 2: Classification Tree
2a
nba_model_tree <- rpart(top_50 ~ ., data = nba_training, method = 'class')
fancyRpartPlot(nba_model_tree)2b
(i)
If a player averages 7.1 or more field goals per game, they are predicted to be in the Top 50. This leaf (node 3) has a purity of 81%, meaning 81% of players meeting this rule actually were in the Top 50.
(ii)
If a player averages less than 7.1 field goals per game AND less than 2.3 three-pointers AND less than 6.4 total rebounds, they are predicted to be outside the Top 50. This leaf (node 8) has a purity of 99%.
(iii)
nba_model_tree$variable.importance
summary(nba_model_tree)The most important variable for predicting whether a player is in the Top 50 is fg (average field goals per game), accounting for 37% of all improvements made to the model. The next most important variables are thr (three-pointers per game) at 14% and tov (turnovers per game) at 12%, which together with fg account for over 60% of the model’s predictive power.
2c
# Training Accuracy
nba_training_tree_prob <- predict(nba_model_tree, newdata = nba_training, type = "prob")
nba_training_tree_prediction <- predict(nba_model_tree, newdata = nba_training, type = "class")
nba_training_tree_final <- cbind(nba_training, nba_training_tree_prob, nba_training_tree_prediction)
nba_training_tree_tab <- table(nba_training_tree_final$top_50,
nba_training_tree_final$nba_training_tree_prediction,
dnn = c("Actual", "Predicted"))
nba_training_tree_tab
nba_training_tree_acc <- sum(diag(nba_training_tree_tab)) / sum(nba_training_tree_tab)
nba_training_tree_acc
# Testing Accuracy
nba_testing_tree_prob <- predict(nba_model_tree, newdata = nba_testing, type = "prob")
nba_testing_tree_prediction <- predict(nba_model_tree, newdata = nba_testing, type = "class")
nba_testing_tree_final <- cbind(nba_testing, nba_testing_tree_prob, nba_testing_tree_prediction)
nba_testing_tree_tab <- table(nba_testing_tree_final$top_50,
nba_testing_tree_final$nba_testing_tree_prediction,
dnn = c("Actual", "Predicted"))
nba_testing_tree_tab
nba_testing_tree_acc <- sum(diag(nba_testing_tree_tab)) / sum(nba_testing_tree_tab)
nba_testing_tree_acc(i)
The classification tree achieved an accuracy of 86.92% on the training dataset and 89.39% on the testing dataset.This suggests that the model is not overfitting, because its performance on the testing data is slightly better than its performance on the training data.Since the two accuracy values are close, the model appears to generalise well to new data.
Question 3 Binary Logistic Regression Method
3a
# Firstly we need to set predictors (i):
nba_training$top_50 <- factor(nba_training$top_50, levels = c("N", "Y"))
nba_testing$top_50 <- factor(nba_testing$top_50, levels = c("N", "Y"))
nba_model_lr <- glm(top_50 ~ ., data = nba_training, family = binomial(link = "logit"))
summary(nba_model_lr)
Call:
glm(formula = top_50 ~ ., family = binomial(link = "logit"),
data = nba_training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.6844 7.3431 -2.544 0.0109 *
posPF -2.1358 1.3850 -1.542 0.1230
posPG -1.8758 2.0854 -0.900 0.3684
posSF -0.6826 1.6757 -0.407 0.6837
posSG -0.3048 1.7937 -0.170 0.8651
fg 1.1205 0.4981 2.250 0.0245 *
fgp 16.4680 41.3104 0.399 0.6902
thr 2.3870 1.5595 1.531 0.1259
thrp -4.5043 5.9837 -0.753 0.4516
efg -0.5173 41.2340 -0.013 0.9900
trb 0.1575 0.2460 0.640 0.5219
ast 0.5552 0.4049 1.371 0.1703
stl 1.3078 1.1259 1.162 0.2454
blk 1.3646 0.9495 1.437 0.1507
tov -1.7518 1.1043 -1.586 0.1127
pf 0.2102 0.9898 0.212 0.8318
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 142.818 on 129 degrees of freedom
Residual deviance: 61.859 on 114 degrees of freedom
AIC: 93.859
Number of Fisher Scoring iterations: 7
(ii)
regression equation = log(p / (1 − p)) = -18.6844 − 2.1358(PF) − 1.8758(PG) − 0.6826(SF) − 0.3048(SG) + 1.1205(fg) + 16.4680(fgp) + 2.3870(thr) − 4.5043(thrp) − 0.5173(efg) + 0.1575(trb) + 0.5552(ast) + 1.3078(stl) + 1.3646(blk) − 1.7518(tov) + 0.2102(pf)
(iii)
The only statistically significant predictor variable is average field goals made per game (fg), with a p-value of 0.0245, which is less than the 5% significance level. This indicates that fg is an important variable in predicting whether a player is in the Top 50.All other variables have p-values greater than 0.05, indicating that they are not statistically significant predictors in this model.
(iv)
To do this we use e1.1205 or exp in R, shown below. The answer is 3.066387. A one-unit increase in field goals per game (fg) increases the odds of a player being in the Top 50 by a factor of approximately 3.07,holding all other variables constant.
exp(1.1205)[1] 3.066387
3b
On the training dataset, the model correctly classified 119 out of 130 players, resulting in an accuracy of 91.54%. On the testing dataset, the model correctly classified 60 out of 66 players, resulting in an accuracy of 90.91%.
# Training Data:
nba_training_lr_pi <- predict(nba_model_lr, newdata = nba_training, type = "response")
nba_training_lr_final <- nba_training %>%
mutate(pi = nba_training_lr_pi) %>%
mutate(nba_training_lr_prediction = case_when(
pi > 0.5 ~ "Y",
pi <= 0.5 ~ "N"
))
nba_training_lr_final$nba_training_lr_prediction <- factor(
nba_training_lr_final$nba_training_lr_prediction,
levels = c("N", "Y")
)
nba_training_lr_tab <- table(
nba_training_lr_final$top_50,
nba_training_lr_final$nba_training_lr_prediction,
dnn = c("Actual", "Predicted")
)
nba_training_lr_tab
nba_training_lr_acc <- sum(diag(nba_training_lr_tab)) / sum(nba_training_lr_tab)
nba_training_lr_acc
# Testing Data:
nba_testing_lr_pi <- predict(nba_model_lr, newdata = nba_testing, type = "response")
nba_testing_lr_final <- nba_testing %>%
mutate(pi = nba_testing_lr_pi) %>%
mutate(nba_testing_lr_prediction = case_when(
pi > 0.5 ~ "Y",
pi <= 0.5 ~ "N"
))
nba_testing_lr_final$nba_testing_lr_prediction <- factor(
nba_testing_lr_final$nba_testing_lr_prediction,
levels = c("N", "Y")
)
nba_testing_lr_tab <- table(
nba_testing_lr_final$top_50,
nba_testing_lr_final$nba_testing_lr_prediction,
dnn = c("Actual", "Predicted")
)
nba_testing_lr_tab
nba_testing_lr_acc <- sum(diag(nba_testing_lr_tab)) / sum(nba_testing_lr_tab)
nba_testing_lr_accQuestion 4 Compare and Contrast both Classification and Regression Methods
4a
Classification Tree
Training accuracy: 86.92%
Testing accuracy: 89.39%
Logistic Regression
Training accuracy: 91.54%
Testing accuracy: 90.91%
The binary logistic regression model is slightly more accurate than the classification tree model.The logistic regression model achieved a testing accuracy of 90.91%, compared to 89.39% for the classification tree.
4b
The classification tree identified several variables as important, with field goals (fg) being the most important predictor, followed by free throws (thr) and turnovers (tov). The tree model uses these variables to create decision rules that classify players into Top 50 or not.
In contrast, the logistic regression model found that only field goals (fg) was a statistically significant predictor at the 5% significance level, with all other variables having p-values greater than 0.05.
Part B: Clustering Soccer Players
Question 1: Importing the Dataset
library(cluster)
library(kableExtra)
library(ggplot2)
fifa_data <- read_csv("fifa_dataset.csv")Question 2: Does the data need to be scaled before computing the distance matrix for hierarchical clustering or before being entered into the K-means clustering algorithm? Explain your answer.
fifa_data_clean <- fifa_data %>%
select(name, acceleration, ball_control, dribbling, shot_power, short_passing, sprint_speed)Scaling the data for either method is not needed as it is presented on a comparable scale already (i.e, 0-100).
Question 3: Hierarchical Clustering
3a
fifa_data_final <- fifa_data_clean %>%
select(-c(name))
d1 <- dist(fifa_data_final)3b + c
h1 <- hclust(d1, method = 'ward.D')
plot(h1, hang = -1)heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv')3c (i)
The dataset likely contains clusters but they are not well-defined or highly distinct from viewing this heatmap. There are some light grouped color patterns (e.g., top left) suggesting that certain observations are more similar within groups but overall there are not many lightly cluster blocks.
3d
clusters <- cutree(h1, k=4)
sil1 <- silhouette(clusters, d1)
summary(sil1)Overall the score is weak (0.297) suggesting the structure of these clusters are weak, and could be possibly be artificial but cluster 3 returns a score of 0.694, meaning a strong cluster could be found here.
3e
fifa_clus <- cbind(fifa_data_clean, clusters)
fifa_clus <- fifa_clus %>%
mutate(Cluster = case_when(
clusters == 1 ~ "C1",
clusters == 2 ~ "C2",
clusters == 3 ~ "C3",
clusters == 4 ~ "C4"
))
fifa_clus_means <- fifa_clus %>%
group_by(Cluster) %>%
summarise(acceleration = mean(acceleration),
ball_control = mean(ball_control),
dribbling = mean(dribbling),
shot_power= mean(shot_power),
short_passing = mean(short_passing),
sprint_speed = mean(sprint_speed))
kable(fifa_clus_means)| Cluster | acceleration | ball_control | dribbling | shot_power | short_passing | sprint_speed |
|---|---|---|---|---|---|---|
| C1 | 80.83537 | 81.07520 | 80.42276 | 76.96748 | 77.89431 | 80.10366 |
| C2 | 65.15026 | 79.01554 | 73.89637 | 77.23834 | 79.44041 | 65.20725 |
| C3 | 48.49533 | 23.71028 | 16.11215 | 25.05607 | 33.04673 | 49.23364 |
| C4 | 57.01923 | 66.33173 | 55.82212 | 63.28846 | 70.32692 | 60.78365 |
C1 is clearly the highest-performing cluster across almost all attributes, with the highest scores for acceleration (~81), ball control (~81), dribbling (~80), and sprint speed (~80). C2 stands out for having high ball control (~79) and short passing (~79) but notably lower acceleration and sprint speed (~65), suggesting these are technically skilled but slower players. C4 sits in the middle across all attributes, representing average performers. C3 is the lowest-performing cluster by a large margin, particularly in ball control (~24) and dribbling (~16).
fifa_clus_ii <- cbind(fifa_data, clusters)
fifa_clus_ii <- fifa_clus_ii %>%
mutate(Cluster = case_when(
clusters == 1 ~ "C1",
clusters == 2 ~ "C2",
clusters == 3 ~ "C3",
clusters == 4 ~ "C4"
))
fifa_clus_means_ii <- fifa_clus_ii %>%
group_by(Cluster) %>%
summarise(age = mean(age),
value = mean(value),
wage = mean(wage))
kable(fifa_clus_means_ii)| Cluster | age | value | wage |
|---|---|---|---|
| C1 | 26.31301 | 21355081 | 79225.61 |
| C2 | 28.06736 | 16098446 | 65772.02 |
| C3 | 29.10280 | 14350935 | 51766.36 |
| C4 | 28.02885 | 13387500 | 58259.62 |
C1 is the youngest cluster on average (~26 years) and commands by far the highest club value (~€21.4M) and wage (~€79,226), consistent with their elite performance profile. C2 and C4 are similar in age (~28 years) but differ in value and wage, with C2 earning more. C3 is the oldest cluster (~29 years) and has the lowest wage (~€51,766), but interestingly is not the lowest in club value — C4 holds that position (~€13.4M). This suggests C3 may contain experienced but declining players, while C4 contains younger but lower-value performers.
fifa_clus_long <- fifa_clus_means %>%
pivot_longer(cols = -Cluster, names_to = "Attribute", values_to = "Mean_Score")
ggplot(fifa_clus_long, aes(x = Attribute, y = Mean_Score, fill = Cluster)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Hierarchical Cluster Profiles – Performance Attributes",
x = "Attribute", y = "Mean Score") +
theme_minimal()# Graph ii – age, value, wage
fifa_clus_ii_long <- fifa_clus_means_ii %>%
pivot_longer(cols = -Cluster, names_to = "Variable", values_to = "Mean_Value")
ggplot(fifa_clus_ii_long, aes(x = Cluster, y = Mean_Value, fill = Cluster)) +
geom_bar(stat = "identity") +
facet_wrap(~ Variable, scales = "free_y") +
labs(title = "Hierarchical Cluster Profiles – Age, Value & Wage") +
scale_y_continuous(labels = scales::comma) +
theme_minimal()Question 4: K-means
4a
set.seed(101)
km1 <- kmeans(fifa_data_final, centers = 4)
km1$cluster [1] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 4 3 4 2 2 4 2 3 2 4 2 2 2 2 3 2 2 3 2 2
[38] 3 3 2 1 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 3 2 2 3 2 3 1 2 2 3 2 2 2 1 4 2 1 2
[75] 2 2 2 2 3 1 4 1 1 3 1 2 3 4 2 2 2 3 2 2 1 2 2 2 2 2 2 2 1 1 2 4 2 2 2 2 3
[112] 2 2 1 1 2 2 2 1 2 3 2 2 2 3 3 2 2 4 3 1 3 3 2 1 4 3 3 2 2 2 2 2 2 3 3 3 2
[149] 3 3 3 2 3 1 2 2 3 3 4 3 4 2 2 1 2 2 3 3 4 2 3 3 3 3 3 2 4 4 4 2 2 2 2 3 2
[186] 4 4 2 2 3 2 2 1 2 2 2 2 2 2 1 3 2 2 2 2 2 3 2 2 3 3 1 2 2 3 2 2 3 3 2 2 2
[223] 1 2 4 2 3 2 3 2 2 4 2 3 2 2 3 3 2 2 3 2 2 4 2 2 2 2 3 3 1 1 1 4 3 3 2 2 3
[260] 2 2 2 3 1 2 2 3 3 3 2 2 2 3 2 2 2 2 3 1 1 3 3 1 3 2 2 2 3 3 2 2 3 2 2 3 3
[297] 4 1 2 2 2 1 2 2 3 4 2 2 2 3 2 2 2 2 2 2 2 4 4 1 3 2 4 2 3 3 2 2 1 3 3 2 2
[334] 2 2 2 3 4 3 3 2 2 3 2 2 1 1 3 2 2 3 2 2 2 1 2 2 2 3 3 2 2 1 3 3 2 2 1 3 1
[371] 3 2 3 2 2 1 2 2 3 2 2 2 3 1 3 2 2 1 2 3 2 3 3 3 2 2 2 2 3 1 3 2 2 2 2 2 2
[408] 2 2 3 2 2 1 2 1 3 2 3 4 2 3 2 2 2 3 2 2 2 2 2 2 4 1 1 4 3 3 2 3 3 3 2 2 2
[445] 3 1 1 3 2 2 2 2 4 2 2 2 2 3 1 2 3 2 3 1 1 1 2 2 3 2 2 2 2 2 3 3 2 2 4 4 4
[482] 2 2 3 4 4 2 3 2 2 3 2 3 2 2 2 3 3 2 2 3 3 2 3 3 3 2 2 2 3 3 3 3 3 2 1 4 4
[519] 4 3 3 3 2 2 3 1 3 1 2 2 2 1 1 2 3 1 1 1 1 2 2 2 2 3 2 2 2 2 3 3 3 2 2 1 3
[556] 2 2 4 2 2 2 2 2 3 2 2 2 2 1 2 1 2 2 3 3 2 2 3 2 2 3 3 2 2 2 3 4 4 3 1 3 3
[593] 4 2 2 2 3 2 1 3 3 1 1 2 3 2 2 2 2 2 3 3 2 3 3 3 3 3 2 3 3 1 1 2 2 4 1 2 2
[630] 3 3 1 2 3 2 2 1 3 3 2 2 2 2 2 1 1 3 1 2 2 2 1 4 1 1 1 2 3 3 2 3 3 1 3 3 3
[667] 1 1 1 1 2 2 2 1 1 3 4 2 1 2 3 4 4 4 3 3 3 4 4 2 2 2 2 3 1 1 2 3 1 1 2 2 2
[704] 2 2 4 4 3 3 3 2 2 2 3 1 2 2 2 2 3 3 3 2 2 2 2 3 3 3 2 3 3 3 3 2 3 2 2 2 1
[741] 1 3 1 3 1 1 3 1 1 4 3 1 1 1 4 4 4 2 3 1 2 1 1 1 3 1 1 4 4 2 2 2 2 2 2 2 2
[778] 2 2 3 2 3 1 1 1 3 3 3 3 3 1 1 2 1 3 1 3 1 1 4 4 3 3 3 3 3 3 1 2 3 1 4 4 2
[815] 2 3 3 3 2 1 1 2 2 1 2 4 1 4 3 4 3 2 2 4 4 2 2 2 3 3 1 1 2 3 1 1 1 3 1 1 1
[852] 2 4 1 4 3 4 4 4 4 4 2 2 3 3 3 1 3 1 2 2 4 4 2 2 3 4 4 2 2 2 2 3 3 3 1 1 4
[889] 4 4 4 4 3 3 1 2 3 2 1 3 1 3 1 1 4 3 3 1 3 1 1 1 3 2 3 4 4 3 4 4 4 4 3 3 1
[926] 1 4 4 4 4 3 3 1 1 3 1 3 2 2 1 3 1 1 1 3 4 3 1 1 1 1 3 3 2 3 4 3 1 3 4 2 3
[963] 3 3 3 2 2 3 1 3 4 3 4 1 3 3 1 1 1 3 4 1 3 4 1 1 1 1 1 4 4 4 1 2 3 3 3 3 2
[1000] 1
km1$centers acceleration ball_control dribbling shot_power short_passing sprint_speed
1 58.16667 64.53448 53.31034 60.32759 68.72989 62.16092
2 82.05568 81.30858 81.06497 77.12297 77.85151 81.35731
3 65.00346 78.59862 73.58478 76.90657 78.97924 65.09689
4 48.31132 23.44340 15.88679 25.06604 32.84906 49.17925
km1$size[1] 174 431 289 106
km1$iter[1] 4
4b
sil_km <- silhouette(km1$cluster, d1)
summary(sil_km)Silhouette of 1000 units in 4 clusters from silhouette.default(x = km1$cluster, dist = d1) :
Cluster sizes and average silhouette widths:
174 431 289 106
0.1777123 0.3802751 0.2177741 0.6883372
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1060 0.1807 0.3313 0.3307 0.4788 0.7898
The overall average silhouette for the K-means solution is 0.3307, which indicates a weak to moderate clustering structure. This is similar to the hierarchical result (0.297), though slightly better overall.
Looking at the individual clusters, the quality varies considerably. C4 is by far the strongest cluster with a silhouette width of 0.688, indicating that its 106 players are well-matched to their own cluster and clearly separated from others. C2 is moderate at 0.380, representing 431 players with reasonable but not strong separation. C3 is weak at 0.218 and C1 is the weakest at 0.178, with its 174 players being poorly separated from neighbouring clusters. The minimum silhouette width of -0.106 also indicates that at least some players may have been placed in the wrong cluster.
Overall, the solution is acceptable but not highly distinct, with one strong cluster (C4) driving much of the quality.
4c
(i)
fifa_km_means_tidy <- as_tibble(km1$centers) %>%
mutate(Cluster = c("C1", "C2", "C3", "C4")) %>%
pivot_longer(cols = c(acceleration, ball_control, dribbling,
shot_power, short_passing, sprint_speed),
names_to = "Attribute", values_to = "Mean_Score")
ggplot(fifa_km_means_tidy, aes(x = Attribute, y = Mean_Score,
group = Cluster, colour = Cluster)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
ylab("Mean Score") +
ggtitle("K-means Cluster Profiles – Performance Attributes")This line graph reveals a clear hierarchy across the four clusters. C2 is the highest-performing cluster across almost all attributes, scoring above 80 in acceleration, ball control, dribbling and sprint speed, suggesting these are elite all-round players. C3 is close behind, performing strongly across all attributes and particularly high in shot power and short passing (~79), indicating technically well-rounded players. C1 sits in the middle, with moderate scores across all attributes ranging from roughly 53 to 70, representing average performers. C4 is clearly the lowest-performing cluster, with notably weak dribbling (~17) and ball control (~23)
(ii)
fifa_km_clus_ii <- cbind(fifa_data, km_cluster = km1$cluster) %>%
mutate(Cluster = case_when(
km_cluster == 1 ~ "C1",
km_cluster == 2 ~ "C2",
km_cluster == 3 ~ "C3",
km_cluster == 4 ~ "C4"
))
fifa_km_means_ii <- fifa_km_clus_ii %>%
group_by(Cluster) %>%
summarise(age = mean(age),
value = mean(value),
wage = mean(wage))
kable(fifa_km_means_ii)| Cluster | age | value | wage |
|---|---|---|---|
| C1 | 27.72989 | 13317241 | 58540.23 |
| C2 | 26.16705 | 21964037 | 80707.66 |
| C3 | 28.11765 | 15971626 | 65224.91 |
| C4 | 29.03774 | 14475000 | 51971.70 |
C2 players are the most valuable, with the highest average club value (~€21.96M) and wage (~€80,708), and are also the youngest on average (~26.2 years), consistent with their elite performance profile. C3 players earn the second highest wage (~€65,225) and are relatively young (~28.1 years). C1 sits in the middle for both value and wage. C4 is the oldest cluster (~29 years) and commands the lowest wage (~€51,972) and second lowest club value (~€14.5M), consistent with their weaker performance scores.
fifa_km_ii_long <- fifa_km_means_ii %>%
pivot_longer(cols = -Cluster, names_to = "Variable", values_to = "Mean_Value")
ggplot(fifa_km_ii_long, aes(x = Cluster, y = Mean_Value, fill = Cluster)) +
geom_bar(stat = "identity") +
facet_wrap(~ Variable, scales = "free_y") +
labs(title = "K-means Cluster Profiles – Age, Value & Wage") +
scale_y_continuous(labels = scales::comma) +
theme_minimal()Question 5: Comparing the two methods
5a
The K-means algorithm produced slightly higher quality clusters overall. The average silhouette width for K-means was 0.3307 compared to 0.297 for hierarchical clustering, indicating that K-means achieved marginally better defined and more separated clusters. Both solutions are considered weak to moderate in quality, however both algorithms agreed that one cluster stood out as particularly strong, cluster 3 in hierarchical (0.694) and cluster C4 in K-means (0.688).
5b
Both algorithms produced broadly similar cluster profiles. Both identified the same four distinct player types in terms of various attributes.