Code
library(tidyverse)
library(rpart)
library(rattle)
library(TTR)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(tidyr)
library(gt)
library(scales)
library(janitor)Sports Analytics & Insight — Classification & Clustering Assignment, Part A
In September 2018, ESPN.com published a list of the top 50 NBA players for the 2018-19 season. It is assumed that performance data from the 2017-18 season was a major input into ESPN’s rankings. The aim of Part A is to use classification techniques on 2017-18 player statistics to predict whether or not a player ended up on ESPN’s Top 50 list.
Two classification methods are applied:
rpart package.glm function.The variable pts (average points scored per game) is excluded from both models per the assignment brief, and the variable player is excluded as it is purely an identifier. The two models are then compared on accuracy and the predictors they each consider important.
library(tidyverse)
library(rpart)
library(rattle)
library(TTR)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(tidyr)
library(gt)
library(scales)
library(janitor)The training and testing datasets are imported using read_csv() from the tidyverse. A quick class-balance check on the target variable (top_50) is included to give context for the accuracy results later on.
nba_training <- read_csv("nba_training.csv") #Importing all data sets into R
nba_testing <- read_csv("nba_testing.csv")
nba_training %>% count(top_50) #assessing the balance/imbalance of the top50# A tibble: 2 × 2
top_50 n
<chr> <int>
1 N 99
2 Y 31
The training set contains 130 players and the testing set contains 66 players. The target top_50 is stored as “Y” for Top 50 players and “N” otherwise. As expected from the brief (only 45 of ESPN’s Top 50 met the 1500-minutes-played requirement), the data is imbalanced, with far more non-Top-50 players than Top-50 players in both sets.
Position (pos) is a categorical variable, so it is converted to a factor before modelling. The classification tree is then built using all available predictors except player and pts.
nba_training$pos <- as.factor(nba_training$pos) #changing position into a categorical variable
nba_testing$pos <- as.factor(nba_testing$pos)
nba_model_tree <- rpart(top_50 ~ pos + fg + fgp + thr + thrp + efg + #Building the tree using all the predictors
trb + ast + stl + blk + tov + pf,
data = nba_training, method = 'class') #classification not regressionfancyRpartPlot(nba_model_tree) #Visualization of the treeThe rule for the tree model predicting if the player will be in the Top 50 based on his regular season statistics is if the player simply has ‘fg’ ≥ 7.1, this leaf has a purity of 81% which is the highest of any leaf that predicts a player being “Top 50”.
Now on the other hand the rule for finding if a player is “not top 50” material, we follow the following rule: fg < 7.1, thr < 2.3 and trb < 6.4, this path tells us with 99% purity that the player will not be within the top 50 if applied over different set of data.
The variable.importance element of the fitted rpart object gives a numeric score for each variable, reflecting how much it contributed to the splits in the tree.
nba_model_tree$variable.importance #assess importance of each variable fg thr tov pf blk trb pos
16.3355244 6.1996468 5.4451748 3.5572075 2.8196389 2.7911430 1.9380366
ast thrp stl fgp
1.8660357 1.3760600 1.0881536 0.8043276
summary(nba_model_tree) #show the importance differences in % formCall:
rpart(formula = top_50 ~ pos + fg + fgp + thr + thrp + efg +
trb + ast + stl + blk + tov + pf, data = nba_training, method = "class")
n= 130
CP nsplit rel error xerror xstd
1 0.41935484 0 1.0000000 1.0000000 0.1567347
2 0.01075269 1 0.5806452 0.7096774 0.1379081
3 0.01000000 4 0.5483871 0.7741935 0.1427004
Variable importance
fg thr tov pf blk trb pos ast thrp stl fgp
37 14 12 8 6 6 4 4 3 2 2
Node number 1: 130 observations, complexity param=0.4193548
predicted class=N expected loss=0.2384615 P(node) =1
class counts: 99 31
probabilities: 0.762 0.238
left son=2 (109 obs) right son=3 (21 obs)
Primary splits:
fg < 7.1 to the left, improve=16.335520, (0 missing)
thr < 2.25 to the left, improve=10.069930, (0 missing)
tov < 1.45 to the left, improve= 8.330769, (0 missing)
stl < 1.35 to the left, improve= 6.179021, (0 missing)
trb < 7.3 to the left, improve= 5.553524, (0 missing)
Surrogate splits:
tov < 2.75 to the left, agree=0.892, adj=0.333, (0 split)
pf < 3.05 to the left, agree=0.869, adj=0.190, (0 split)
thr < 2.6 to the left, agree=0.854, adj=0.095, (0 split)
ast < 8.5 to the left, agree=0.854, adj=0.095, (0 split)
stl < 1.75 to the left, agree=0.846, adj=0.048, (0 split)
Node number 2: 109 observations, complexity param=0.01075269
predicted class=N expected loss=0.1284404 P(node) =0.8384615
class counts: 95 14
probabilities: 0.872 0.128
left son=4 (97 obs) right son=5 (12 obs)
Primary splits:
thr < 2.25 to the left, improve=3.723257, (0 missing)
fg < 4.65 to the left, improve=3.478123, (0 missing)
trb < 7.3 to the left, improve=2.893895, (0 missing)
efg < 0.5335 to the left, improve=2.570815, (0 missing)
tov < 1.45 to the left, improve=2.270472, (0 missing)
Surrogate splits:
ast < 6.75 to the left, agree=0.899, adj=0.083, (0 split)
stl < 1.65 to the left, agree=0.899, adj=0.083, (0 split)
Node number 3: 21 observations
predicted class=Y expected loss=0.1904762 P(node) =0.1615385
class counts: 4 17
probabilities: 0.190 0.810
Node number 4: 97 observations, complexity param=0.01075269
predicted class=N expected loss=0.08247423 P(node) =0.7461538
class counts: 89 8
probabilities: 0.918 0.082
left son=8 (73 obs) right son=9 (24 obs)
Primary splits:
trb < 6.35 to the left, improve=2.791143, (0 missing)
blk < 0.95 to the left, improve=2.233258, (0 missing)
fg < 4.65 to the left, improve=1.802364, (0 missing)
pos splits as RLLLL, improve=1.542761, (0 missing)
fgp < 0.464 to the left, improve=1.435421, (0 missing)
Surrogate splits:
pos splits as RLLLL, agree=0.845, adj=0.375, (0 split)
thrp < 0.2565 to the right, agree=0.835, adj=0.333, (0 split)
blk < 0.75 to the left, agree=0.825, adj=0.292, (0 split)
thr < 0.25 to the right, agree=0.814, adj=0.250, (0 split)
fgp < 0.549 to the left, agree=0.804, adj=0.208, (0 split)
Node number 5: 12 observations
predicted class=N expected loss=0.5 P(node) =0.09230769
class counts: 6 6
probabilities: 0.500 0.500
Node number 8: 73 observations
predicted class=N expected loss=0.01369863 P(node) =0.5615385
class counts: 72 1
probabilities: 0.986 0.014
Node number 9: 24 observations, complexity param=0.01075269
predicted class=N expected loss=0.2916667 P(node) =0.1846154
class counts: 17 7
probabilities: 0.708 0.292
left son=18 (15 obs) right son=19 (9 obs)
Primary splits:
blk < 0.95 to the left, improve=2.0055560, (0 missing)
stl < 0.9 to the left, improve=1.0416670, (0 missing)
fgp < 0.474 to the left, improve=0.9388889, (0 missing)
efg < 0.51 to the left, improve=0.9388889, (0 missing)
thr < 0.85 to the left, improve=0.7500000, (0 missing)
Surrogate splits:
pos splits as RLLL-, agree=0.792, adj=0.444, (0 split)
thrp < 0.1715 to the right, agree=0.708, adj=0.222, (0 split)
pf < 2.65 to the left, agree=0.708, adj=0.222, (0 split)
fgp < 0.474 to the left, agree=0.667, adj=0.111, (0 split)
thr < 0.05 to the right, agree=0.667, adj=0.111, (0 split)
Node number 18: 15 observations
predicted class=N expected loss=0.1333333 P(node) =0.1153846
class counts: 13 2
probabilities: 0.867 0.133
Node number 19: 9 observations
predicted class=Y expected loss=0.4444444 P(node) =0.06923077
class counts: 4 5
probabilities: 0.444 0.556
The most important variables for predicting whether a player is in ESPN’s Top 50 are the ones at the top of this list. The top three are:
[1] "fg" "thr" "tov"
In other words, fg, thr, and tov are the predictors the tree relied on most heavily when separating Top 50 from non-Top-50 players.
# --- TRAINING DATA ---
nba_training_tree_prediction <- predict(nba_model_tree, newdata = nba_training, type = 'class')
nba_training_tree_tab <- table(nba_training$top_50,
nba_training_tree_prediction,
dnn = c('Actual', 'Predicted'))
nba_training_tree_tab Predicted
Actual N Y
N 91 8
Y 9 22
nba_training_tree_acc <- sum(diag(nba_training_tree_tab)) / sum(nba_training_tree_tab)
nba_training_tree_acc[1] 0.8692308
# --- TESTING ---
nba_testing_tree_prediction <- predict(nba_model_tree, newdata = nba_testing, type = 'class')
nba_testing_tree_tab <- table(nba_testing$top_50,
nba_testing_tree_prediction,
dnn = c('Actual', 'Predicted'))
nba_testing_tree_tab Predicted
Actual N Y
N 48 4
Y 3 11
nba_testing_tree_acc <- sum(diag(nba_testing_tree_tab)) / sum(nba_testing_tree_tab)
nba_testing_tree_acc[1] 0.8939394
The classification tree is 86.9% accurate on the training data and 89.4% accurate on the testing data.
A model is said to be overfitting when it performs much better on the training data than on the testing data — it has memorised the training set rather than learnt rules that generalise to unseen players.
In this case, the gap between training accuracy (86.9%) and testing accuracy (89.4%) is -2.5 percentage points. The gap is not significant enough in order to label the classification tree as overfitting on the training data set, rather the tree generalises well considering the insiginificant degree of the gap between the training and testing tree accuracy.
The target top_50 is stored as text “N” / “Y”. For logistic regression, it must be converted to a factor with the reference level first (“N” — i.e. not in the Top 50) and the event level second (“Y” — in the Top 50). This ordering ensures the model estimates the log-odds of being in the Top 50. Readable labels are also applied so all the confusion matrixes are easier to interpret.
nba_training$top_50 <- factor(nba_training$top_50, #converting to a factor N=reference Y=event
levels = c("N", "Y"), #Top 50 stored as Y N in the csv file
labels = c("Not Top 50", "Top 50"))
nba_testing$top_50 <- factor(nba_testing$top_50,
levels = c("N", "Y"),
labels = c("Not Top 50", "Top 50")) #labels allow the matrix to be more digestible
levels(nba_training$top_50) #should print: "Not Top 50" "Top 50"[1] "Not Top 50" "Top 50"
levels(nba_testing$top_50)[1] "Not Top 50" "Top 50"
#regression model
nba_model_lr <- glm(top_50 ~ pos + fg + fgp + thr + thrp + efg +
trb + ast + stl + blk + tov + pf,
data = nba_training,
family = binomial(link = 'logit'))
summary(nba_model_lr)
Call:
glm(formula = top_50 ~ pos + fg + fgp + thr + thrp + efg + trb +
ast + stl + blk + tov + pf, family = binomial(link = "logit"),
data = nba_training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.6844 7.3431 -2.544 0.0109 *
posPF -2.1358 1.3850 -1.542 0.1230
posPG -1.8758 2.0854 -0.900 0.3684
posSF -0.6826 1.6757 -0.407 0.6837
posSG -0.3048 1.7937 -0.170 0.8651
fg 1.1205 0.4981 2.250 0.0245 *
fgp 16.4680 41.3104 0.399 0.6902
thr 2.3870 1.5595 1.531 0.1259
thrp -4.5043 5.9837 -0.753 0.4516
efg -0.5173 41.2340 -0.013 0.9900
trb 0.1575 0.2460 0.640 0.5219
ast 0.5552 0.4049 1.371 0.1703
stl 1.3078 1.1259 1.162 0.2454
blk 1.3646 0.9495 1.437 0.1507
tov -1.7518 1.1043 -1.586 0.1127
pf 0.2102 0.9898 0.212 0.8318
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 142.818 on 129 degrees of freedom
Residual deviance: 61.859 on 114 degrees of freedom
AIC: 93.859
Number of Fisher Scoring iterations: 7
Using the estimated coefficients in the summary() output above, the fitted regression equation is of the form:
\[ \log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1(\text{pos}) + \beta_2(\text{fg}) + \beta_3(\text{fgp}) + \beta_4(\text{thr}) + \beta_5(\text{thrp}) + \beta_6(\text{efg}) + \beta_7(\text{trb}) + \beta_8(\text{ast}) + \beta_9(\text{stl}) + \beta_{10}(\text{blk}) + \beta_{11}(\text{tov}) + \beta_{12}(\text{pf}) \]
where π is the probability that a player is in the Top 50. To produce the actual numerical equation, substitute each β with the corresponding Estimate from the summary() table above.
\[ \begin{aligned} \log\!\left(\frac{\pi}{1 - \pi}\right) =\ & -18.6844 - 2.1358\,(\text{posPF}) - 1.8758\,(\text{posPG}) - 0.6826\,(\text{posSF}) \\ & - 0.3048\,(\text{posSG}) + 1.1205\,(\text{fg}) + 16.4680\,(\text{fgp}) + 2.3870\,(\text{thr}) \\ & - 4.5043\,(\text{thrp}) - 0.5173\,(\text{efg}) + 0.1575\,(\text{trb}) + 0.5552\,(\text{ast}) \\ & + 1.3078\,(\text{stl}) + 1.3646\,(\text{blk}) - 1.7518\,(\text{tov}) + 0.2102\,(\text{pf}) \end{aligned} \]
where π is the probability that a player is in the Top 50.
The predictors that are statistically significant at conventional levels are those with p-values (the Pr(>|z|) column) below 0.05.
# Extract the coefficient table
lr_summary <- summary(nba_model_lr)$coefficients
round(lr_summary, 4) Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.6844 7.3431 -2.5445 0.0109
posPF -2.1358 1.3850 -1.5421 0.1230
posPG -1.8758 2.0854 -0.8995 0.3684
posSF -0.6826 1.6757 -0.4074 0.6837
posSG -0.3048 1.7937 -0.1699 0.8651
fg 1.1205 0.4981 2.2498 0.0245
fgp 16.4680 41.3104 0.3986 0.6902
thr 2.3870 1.5595 1.5306 0.1259
thrp -4.5043 5.9837 -0.7528 0.4516
efg -0.5173 41.2340 -0.0125 0.9900
trb 0.1575 0.2460 0.6404 0.5219
ast 0.5552 0.4049 1.3712 0.1703
stl 1.3078 1.1259 1.1616 0.2454
blk 1.3646 0.9495 1.4372 0.1507
tov -1.7518 1.1043 -1.5864 0.1127
pf 0.2102 0.9898 0.2124 0.8318
# Identify variables significant at the 5% level (excluding the intercept)
sig_vars <- rownames(lr_summary)[lr_summary[, 4] < 0.05 & rownames(lr_summary) != "(Intercept)"]
sig_vars[1] "fg"
The variables identified as significant at the 5% level are: fg.
The only variable showing any level of significance is “fg” with a coefficient of 0.0245. Following the standard significance level of p < 0.05
Exponentiating each coefficient converts it from the log-odds scale to an odds ratio — the multiplicative change in the odds of being Top 50 for a one-unit increase in the predictor.
exp(coef(nba_model_lr)) (Intercept) posPF posPG posSF posSG fg
7.681944e-09 1.181444e-01 1.532270e-01 5.052774e-01 7.372898e-01 3.066453e+00
fgp thr thrp efg trb ast
1.419001e+07 1.088058e+01 1.106101e-02 5.961075e-01 1.170595e+00 1.742356e+00
stl blk tov pf
3.697919e+00 3.914198e+00 1.734596e-01 1.233917e+00
For each significant predictor, the odds ratio can be read as follows:
(Odds_Ratio−1) × 100% = (3.07−1) × 100% = 207%
The odds ratio for
fgis 3.07, meaning each additional field goal made per game multiplies the odds of being in the Top 50 by 3.07. This is equivalent to a 207% increase in the odds.
#training
nba_training_lr_pi <- predict(nba_model_lr, newdata = nba_training, type = 'response')
nba_training_lr_final <- nba_training %>%
mutate(pi = nba_training_lr_pi) %>%
mutate(nba_training_lr_prediction = case_when(pi > 0.5 ~ 'Top 50',
pi <= 0.5 ~ 'Not Top 50'))
nba_training_lr_final$nba_training_lr_prediction <- factor(
nba_training_lr_final$nba_training_lr_prediction,
levels = c("Not Top 50", "Top 50")
)
nba_training_lr_tab <- table(nba_training_lr_final$top_50,
nba_training_lr_final$nba_training_lr_prediction,
dnn = c('Actual', 'Predicted'))
nba_training_lr_tab Predicted
Actual Not Top 50 Top 50
Not Top 50 94 5
Top 50 6 25
nba_training_lr_acc <- sum(diag(nba_training_lr_tab)) / sum(nba_training_lr_tab)
nba_training_lr_acc[1] 0.9153846
#testing
nba_testing_lr_pi <- predict(nba_model_lr, newdata = nba_testing, type = 'response')
nba_testing_lr_final <- nba_testing %>%
mutate(pi = nba_testing_lr_pi) %>%
mutate(nba_testing_lr_prediction = case_when(pi > 0.5 ~ 'Top 50',
pi <= 0.5 ~ 'Not Top 50'))
nba_testing_lr_final$nba_testing_lr_prediction <- factor(
nba_testing_lr_final$nba_testing_lr_prediction,
levels = c("Not Top 50", "Top 50")
)
nba_testing_lr_tab <- table(nba_testing_lr_final$top_50,
nba_testing_lr_final$nba_testing_lr_prediction,
dnn = c('Actual', 'Predicted'))
nba_testing_lr_tab Predicted
Actual Not Top 50 Top 50
Not Top 50 50 2
Top 50 4 10
nba_testing_lr_acc <- sum(diag(nba_testing_lr_tab)) / sum(nba_testing_lr_tab)
nba_testing_lr_acc[1] 0.9090909
The logistic regression is 91.5% accurate on the training data and 90.9% accurate on the testing data. Similarly to the tree, a small distance between the training and testing data suggests the model generalises well, whereas a larger gap would suggest that the model is overfitting.
nba_training_tree_acc #Tree[1] 0.8692308
nba_testing_tree_acc[1] 0.8939394
nba_training_lr_acc #logistic regression[1] 0.9153846
nba_testing_lr_acc[1] 0.9090909
# A tibble: 2 × 3
Model TrainingAccuracy TestingAccuracy
<chr> <dbl> <dbl>
1 Classification Tree 0.869 0.894
2 Logistic Regression 0.915 0.909
On the testing data, the Logistic Regression model achieved the higher accuracy (90.9% vs 89.4%).
The classification tree relied most heavily on the predictors at the top of nba_model_tree$variable.importance, while the logistic regression flagged the variables in sig_vars as statistically significant.
The two lists both independently agreed on significance of a single variable which was ‘fg’. The reason this variable is high in popularity in terms of significance is likely due to being highly correlated to a variable (pts) specifically not used in this experiment due to it being the most primary stat on which player value is assessed and would hence make prediction for top 50 quite straight forward. ‘fg’ is correlated to ‘pts’ as they refer to how many shots a player makes on average per game. The reason they aren’t directly correlated with ‘pts’ is due to the fact that players are able to score from methods different to field goals (free throws). Which usually correspond for a smaller portion of a player’s total points but still plays a part nevertheless.