Classifying NBA Players and Clustering Soccer Players - Sports Analytics Module
Author
Abby Tarrant
Part A – Classify NBA Players
1. Import Data
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rpart)
library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.6.2 Copyright (c) 2006-2023 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
Rows: 130 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): player, pos, top_50
dbl (12): fg, fgp, thr, thrp, efg, trb, ast, stl, blk, tov, pf, pts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 66 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): player, pos, top_50
dbl (12): fg, fgp, thr, thrp, efg, trb, ast, stl, blk, tov, pf, pts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A rule for predicting that a player is in the Top 50 is as follows:
If a player has field goals (fg) greater than 7.1, then the player is classified as being in the Top 50.
The probability values at this leaf node are approximately 0.19 (Not Top 50) and 0.81 (Top 50), meaning that 81% of players in this node are Top 50 players. The node is therefore fairly pure, although roughly one in five players reaching it is not in the Top 50.
2b(ii) Rule for Predicting NOT Top 50
A rule for predicting that a player is NOT in the Top 50 is:
If a player has total rebounds (trb) less than 6.4, then the player is classified as NOT being in the Top 50.
The probability values at this leaf node are approximately 0.99 (Not Top 50) and 0.01 (Top 50), indicating that 99% of players in this node are not Top 50 players. This shows a very high level of node purity.
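Node purity can be put on a numeric scale with the Gini impurity, 1 - sum(p^2), which is the default splitting criterion for classification trees in rpart. The following is a minimal sketch that uses only the approximate leaf probabilities quoted above; the helper name gini is hypothetical and is not part of the original analysis.

```r
# Gini impurity of a node from its class proportions
# (0 = perfectly pure, 0.5 = maximally mixed for two classes).
gini <- function(p) 1 - sum(p^2)

# Approximate class proportions read from the two leaf nodes described above.
top50_leaf     <- c(not_top50 = 0.19, top50 = 0.81)
not_top50_leaf <- c(not_top50 = 0.99, top50 = 0.01)

gini_top50     <- gini(top50_leaf)      # about 0.308
gini_not_top50 <- gini(not_top50_leaf)  # about 0.020

print(round(c(top50 = gini_top50, not_top50 = gini_not_top50), 3))
```

On this measure the Not Top 50 leaf is much purer than the Top 50 leaf.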
The impact of each variable is calculated using exp(β), the odds ratio associated with a 1-unit increase in that variable.
For the significant variable:
Field goals (fg): exp(β) = 3.07
This means that a 1-unit increase in field goals per game multiplies the odds of a player being in the Top 50 by approximately 3.07, representing an increase of approximately 207% in the odds.
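The arithmetic behind this interpretation can be sketched as follows. Only the odds ratio of 3.07 comes from the model output above; the 20% baseline probability is a hypothetical value chosen for illustration.

```r
odds_ratio <- 3.07                    # exp(beta) for fg, as reported above
beta_fg    <- log(odds_ratio)         # implied coefficient, about 1.12
pct_change <- (odds_ratio - 1) * 100  # about a 207% increase in the odds

# Hypothetical baseline: a player with a 20% chance of being in the Top 50.
p_before    <- 0.20
odds_before <- p_before / (1 - p_before)      # 0.25
odds_after  <- odds_before * odds_ratio       # 0.7675
p_after     <- odds_after / (1 + odds_after)  # about 0.434

print(round(c(beta = beta_fg, pct_change = pct_change, p_after = p_after), 3))
```

Note that the odds are multiplied by 3.07, not the probability: in this hypothetical case the probability rises from 0.20 to about 0.43.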
The logistic regression model achieved an accuracy of approximately 91% on the testing dataset.
4. Model Comparison
4a. Accuracy Comparison
The logistic regression model achieved higher accuracy on the testing dataset (approximately 91%) than the classification tree, which suggests that it provides better predictive performance on unseen data.
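For reference, test-set accuracy is the share of predictions that match the actual labels, which can be read off the diagonal of a confusion matrix. The sketch below uses hypothetical labels for illustration only (the "Yes"/"No" coding is an assumption), not the NBA testing dataset.

```r
# Hypothetical labels for illustration only (not the NBA test set).
actual    <- c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No")
predicted <- c("Yes", "No", "No", "No",  "No", "No", "Yes", "Yes", "No", "No")

# Rows are predictions, columns are actual classes.
conf_mat <- table(Predicted = predicted, Actual = actual)

# Accuracy = correct predictions (diagonal) divided by all predictions.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

The same calculation can be applied to each model's predictions on the testing dataset so that the two accuracies are compared on an identical basis.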
4b. Variable Comparison
The classification tree identifies important variables based on how the data is split, highlighting variables such as field goals (fg), three-point shots (thr), and rebounds (trb). In contrast, logistic regression identifies important variables using statistical significance, with field goals (fg) being the only significant predictor. Both models highlight the importance of scoring ability, but logistic regression provides clearer statistical evidence of the relationship between field goals and the probability of being in the Top 50.
Part B - Clustering Soccer Players
1. Import Data
fifa <- read_csv("fifa_dataset.csv") %>% na.omit()
Rows: 1000 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): name, nationality, club
dbl (39): age, overall, potential, value, wage, acceleration, aggression, ag...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2. Scaling Justification
Clustering is based on distance measures, and therefore scaling is required before computing the distance matrix.
If the data is not scaled, variables with larger values or variation may dominate the Euclidean distance calculation.
By standardising the variables, each attribute is placed on a comparable scale, so no single attribute dominates the distance calculation simply because of its units.
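A small sketch of this effect, using three hypothetical players (the values are invented for illustration and are not taken from the FIFA dataset). Wage is measured in currency units while the other two attributes are ratings out of 100.

```r
# Hypothetical players: A and B have similar skills, C is much weaker,
# but A and C have similar wages.
players <- data.frame(
  wage         = c(5000, 250000, 6000),
  acceleration = c(90, 88, 40),
  dribbling    = c(85, 87, 35)
)
rownames(players) <- c("A", "B", "C")

d_raw    <- dist(players)         # Euclidean distance on raw values
d_scaled <- dist(scale(players))  # Euclidean distance on z-scores

print(round(as.matrix(d_raw), 1))
print(round(as.matrix(d_scaled), 2))
```

On the raw values, player A appears closest to player C because wage swamps the ratings. After scaling, A is closer to B, the player with the similar skill profile.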
plot(h_fifa, hang = -1, labels = FALSE, main = "FIFA Player Dendrogram (4-Cluster Solution)", xlab = "993 Players", sub = "")
rect.hclust(h_fifa, k = 4, border = "red")
The heatmap shows variation in distances between players, with lighter colours representing smaller distances and darker colours representing larger distances.
There are visible blocks of similar colour along the diagonal, which suggests that groups of similar players exist within the dataset.
This provides evidence of clustering structure, although the clusters are not perfectly distinct.
3d. 4-Cluster Solution and Quality Assessment
h_clusters <- cutree(h_fifa, k = 4)
sil_h <- silhouette(h_clusters, d_fifa)
summary(sil_h)
Silhouette of 993 units in 4 clusters from silhouette.default(x = h_clusters, dist = d_fifa) :
Cluster sizes and average silhouette widths:
191 388 307 107
0.34729982 0.23452189 0.07238905 0.69104038
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.48091 0.09872 0.28615 0.25528 0.41572 0.79122
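The mean silhouette width for hierarchical clustering is approximately 0.255.
This sits at the very bottom of the 0.25 to 0.5 range that indicates moderate clustering structure, so overall cluster separation is weak to moderate.
Quality also varies considerably by cluster: cluster 4 (0.69) is well separated, cluster 3 (0.07) is barely distinguishable from its neighbours, and the negative minimum width (-0.48) suggests that some players may be assigned to the wrong cluster.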
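The overall mean is the size-weighted average of the per-cluster silhouette widths, which can be checked directly from the summary output above. A minimal sketch using only the reported numbers:

```r
# Cluster sizes and average silhouette widths copied from summary(sil_h).
sizes  <- c(191, 388, 307, 107)
widths <- c(0.34729982, 0.23452189, 0.07238905, 0.69104038)

# Size-weighted average of the cluster widths.
overall <- sum(sizes * widths) / sum(sizes)

print(sum(sizes))         # total number of players clustered
print(round(overall, 5))  # should agree with the reported mean of 0.25528
```

Because the two largest clusters (388 and 307 players) have the lowest widths, they pull the overall mean down even though the smallest cluster is well separated.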
Cluster 1 consists of players with relatively low acceleration and moderate technical ability, suggesting average players or those with lower physical performance.
Cluster 2 contains players with very high acceleration, ball control, and dribbling, indicating elite attacking players or highly skilled forwards.
Cluster 3 represents players with balanced performance across all attributes, suggesting well-rounded players who are competent in multiple areas.
Cluster 4 consists of players with very low values across all performance attributes, indicating low-performing players with limited technical ability.
In terms of personal attributes such as age, value, and wage, higher-performing clusters (such as Cluster 2) are likely associated with higher market value and wages, while lower-performing clusters (such as Cluster 4) are associated with lower values.
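The association with personal attributes can be checked rather than assumed by averaging age, value, and wage within each cluster. The sketch below shows the approach on hypothetical data; the data frame profile and all of its values are invented for illustration, and in the real analysis the cluster labels would come from the fitted clustering.

```r
# Hypothetical data: cluster labels plus the personal attributes named above.
profile <- data.frame(
  cluster = c(1, 1, 2, 2, 3, 3, 4, 4),
  age     = c(29, 31, 24, 26, 27, 28, 20, 21),
  value   = c(900000, 1100000, 20000000, 30000000, 4000000, 6000000, 150000, 250000),
  wage    = c(8000, 12000, 90000, 110000, 25000, 35000, 1000, 3000)
)

# Mean of each personal attribute within each cluster.
cluster_means <- aggregate(cbind(age, value, wage) ~ cluster,
                           data = profile, FUN = mean)
print(cluster_means)
```

A table of this kind makes it possible to state the differences in value and wage between clusters with actual figures.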
5. Comparison of Clustering Methods
5a. Best Method
The K-means clustering method produced a higher mean silhouette score (0.334) than hierarchical clustering (0.255).
This indicates that K-means provides better cluster quality and clearer separation between groups.
5b. Comparison
Both hierarchical clustering and K-means produce broadly similar groupings of players.
However, K-means produces more distinct and interpretable clusters based on player performance attributes.
Overall, K-means clustering is preferred as it provides stronger clustering structure and better separation of player types.