Part A — Classification of NBA Players (2017–18)
1 Introduction
This report analyses NBA player statistics from the 2017–18 season to predict whether a player appears in ESPN’s Top 50 Players list for 2018–19. Two supervised learning classification methods are applied:
Classification Tree
Binary Logistic Regression
The variable pts is excluded from modelling, as instructed.
Both models are evaluated on training and testing datasets, and their results are compared.
2 Importing Data
(I exclude the variable pts from all models.)
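The exclusion can be sketched in R as follows. This is a minimal illustration, assuming a data frame called nba_train with an outcome column top_50 and a vector called predictors (all three names appear in the model call in Section 4.1). The toy rows and the use of setdiff() are assumptions made so the snippet runs on its own; they are not the report's data or necessarily its method.

```r
# Toy stand-in for the training data; the values are invented for illustration.
nba_train <- data.frame(
  pos    = factor(c("C", "PG", "SF", "SG")),
  pts    = c(10.2, 25.1, 14.8, 8.3),
  fg     = c(4.1, 9.0, 5.6, 3.2),
  ast    = c(1.2, 7.5, 2.9, 1.8),
  top_50 = factor(c("N", "Y", "N", "N"))
)

# Every column except the outcome and the excluded pts is treated as a predictor.
predictors <- setdiff(names(nba_train), c("pts", "top_50"))

# Same subsetting pattern as the data argument of the glm() call in Section 4.1.
model_data <- nba_train[, c(predictors, "top_50")]
print(names(model_data))
```

Building the predictor list once and reusing it keeps pts out of both models without deleting the column from the raw data.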
3 Classification Tree Method
3.1 Build and Visualise the Tree
3.2 Interpretation of the Tree
3.2.1 Rule predicting a player is in the Top 50
A representative rule is read by following one path from the root of the tree to a terminal node labelled “Y”.
The variable importance output in Section 3.2.3 is dominated by field goals made (fg) and does not list efg, so the path to a Top‑50 leaf most plausibly starts with a split on fg: a player whose fg exceeds the node threshold, and who also satisfies the later splits on that path, is classified as a Top‑50 player.
At the terminal node, the predicted class is “Y”, and the class probability is the proportion of training players in that leaf who are actually Top‑50 players. A value close to 1 indicates a highly pure node.
In practical terms, this suggests that high-volume scorers are the players most likely to be among ESPN’s Top‑50.
3.2.2 Rule predicting a player is not in the Top 50
A corresponding rule from the tree is:
If a player’s fg is below the splitting threshold and they also fall on the unfavourable side of the later cut-offs on that path, the model predicts they are not in the Top‑50.
In the resulting leaf node, the probability of class “N” is the proportion of training players in that leaf who are not Top‑50 players; a value close to 1 indicates a very pure non‑Top‑50 node.
This suggests that players with low scoring volume are rarely considered Top‑50 players.
3.2.3 Variable Importance
        fg        thr        tov         pf        blk        trb        pos
16.3355244  6.1996468  5.4451748  3.5572075  2.8196389  2.7911430  1.9380366
       ast       thrp        stl        fgp
 1.8660357  1.3760600  1.0881536  0.8043276
3.2.4 Most important variables in the tree
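Based on the variable importance output, the most influential predictors are:
fg (field goals made per game), importance 16.34
thr (three-point field goals made per game), importance 6.20
tov (turnovers per game), importance 5.45
pf (personal fouls per game), importance 3.56
These variables receive the highest importance scores because the splits that use them produce the largest total reductions in impurity across the tree. fg alone scores more than twice as high as any other variable, so the model relies mainly on scoring volume rather than shooting efficiency: fgp ranks last and efg does not appear in the output at all. The high ranking of tov and pf may reflect that players who handle the ball often and play heavy minutes accumulate more turnovers and fouls, rather than any direct benefit of either.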
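The idea of a reduction in impurity can be made concrete with a small base R sketch. It assumes the tree was grown using Gini impurity, which is a common default for classification trees but is not confirmed by the report. The functions gini() and gini_reduction() and the toy data are hypothetical and only illustrate the calculation for a single split.

```r
# Gini impurity of a vector of class labels: 1 minus the sum of squared class shares.
gini <- function(labels) {
  shares <- table(labels) / length(labels)
  1 - sum(shares^2)
}

# Weighted decrease in Gini impurity when splitting on x < threshold.
gini_reduction <- function(x, labels, threshold) {
  left  <- labels[x <  threshold]
  right <- labels[x >= threshold]
  n <- length(labels)
  gini(labels) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
}

# Invented toy data: field goals per game and Top-50 status.
fg     <- c(2.1, 3.0, 3.8, 4.5, 6.2, 7.0, 8.1, 9.3)
top_50 <- c("N", "N", "N", "N", "N", "Y", "Y", "Y")

# A threshold that separates the classes perfectly removes all impurity.
print(gini_reduction(fg, top_50, threshold = 6.5))
# A poorly placed threshold removes much less.
print(gini_reduction(fg, top_50, threshold = 3.5))
```

A variable that offers splits like the first one, at several nodes, accumulates a large importance score; a variable that never offers a useful split does not appear at all.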
3.3 Training and Testing Accuracy
Training accuracy:
[1] 0.8692308
Testing accuracy:
[1] 0.8939394
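Each accuracy figure is the share of players whose predicted class matches their actual class. A minimal sketch of that calculation is given here; the helper accuracy() and the label vectors are invented for illustration and do not reproduce the report's predictions.

```r
# Share of predictions that match the observed class.
accuracy <- function(predicted, actual) {
  mean(predicted == actual)
}

# Invented labels standing in for model predictions and true classes.
actual    <- factor(c("N", "N", "Y", "N", "Y", "N", "N", "Y", "N", "N"),
                    levels = c("N", "Y"))
predicted <- factor(c("N", "N", "Y", "N", "N", "N", "Y", "Y", "N", "N"),
                    levels = c("N", "Y"))

# The confusion matrix shows where the errors fall.
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

acc <- accuracy(predicted, actual)
print(acc)
```

Looking at the confusion matrix alongside the accuracy is useful here because Top‑50 players are a minority of the sample, so a model can score well overall while still missing many of them.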
3.4 Overfitting assessment
The results do not follow the pattern that signals overfitting (high training accuracy with clearly lower testing accuracy):
Training accuracy: 0.869
Testing accuracy: 0.894
The classification tree therefore shows no evidence of overfitting.
Decision trees can grow complex structures that capture noise, which would appear as a training accuracy well above the testing accuracy. Here the testing accuracy is slightly higher than the training accuracy, so the tree generalises to new data at least as well as it fits the training data.
The small difference in favour of the test set is most likely due to sampling variation in a small test set rather than a real advantage.
4 Binary Logistic Regression
4.1 Build the Model
Call:
glm(formula = top_50 ~ ., family = binomial(link = "logit"),
data = nba_train[, c(predictors, "top_50")])
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.6844     7.3431  -2.544   0.0109 *
posPF        -2.1358     1.3850  -1.542   0.1230
posPG        -1.8758     2.0854  -0.900   0.3684
posSF        -0.6826     1.6757  -0.407   0.6837
posSG        -0.3048     1.7937  -0.170   0.8651
fg            1.1205     0.4981   2.250   0.0245 *
fgp          16.4680    41.3104   0.399   0.6902
thr           2.3870     1.5595   1.531   0.1259
thrp         -4.5043     5.9837  -0.753   0.4516
efg          -0.5173    41.2340  -0.013   0.9900
trb           0.1575     0.2460   0.640   0.5219
ast           0.5552     0.4049   1.371   0.1703
stl           1.3078     1.1259   1.162   0.2454
blk           1.3646     0.9495   1.437   0.1507
tov          -1.7518     1.1043  -1.586   0.1127
pf            0.2102     0.9898   0.212   0.8318
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 142.818 on 129 degrees of freedom
Residual deviance: 61.859 on 114 degrees of freedom
AIC: 93.859
Number of Fisher Scoring iterations: 7
4.2 Regression Equation
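logit(p) = -18.6844 - 2.1358·posPF - 1.8758·posPG - 0.6826·posSF - 0.3048·posSG + 1.1205·fg + 16.4680·fgp + 2.3870·thr - 4.5043·thrp - 0.5173·efg + 0.1575·trb + 0.5552·ast + 1.3078·stl + 1.3646·blk - 1.7518·tov + 0.2102·pf
Here p is the predicted probability that a player is in the Top‑50 and logit(p) = log(p / (1 - p)). The terms posPF, posPG, posSF and posSG are 0/1 indicators for position, with centre (C) presumably serving as the reference level because it has no coefficient of its own.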
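To show how the equation turns a stat line into a probability, the sketch below plugs a made-up player into it. The coefficients are copied from Section 4.1; the player's statistics are invented, and the code assumes that the percentage variables (fgp, thrp, efg) are stored as proportions between 0 and 1 and that the modelled outcome is top_50 = "Y". Neither assumption is confirmed by the report.

```r
# Coefficients copied from the model summary in Section 4.1.
beta <- c(
  "(Intercept)" = -18.6844,
  posPF = -2.1358, posPG = -1.8758, posSF = -0.6826, posSG = -0.3048,
  fg = 1.1205, fgp = 16.4680, thr = 2.3870, thrp = -4.5043, efg = -0.5173,
  trb = 0.1575, ast = 0.5552, stl = 1.3078, blk = 1.3646, tov = -1.7518, pf = 0.2102
)

# Hypothetical point guard; these statistics are invented for illustration.
player <- c(
  "(Intercept)" = 1,
  posPF = 0, posPG = 1, posSF = 0, posSG = 0,
  fg = 8.0, fgp = 0.47, thr = 2.5, thrp = 0.38, efg = 0.55,
  trb = 5.0, ast = 7.0, stl = 1.5, blk = 0.4, tov = 3.0, pf = 2.2
)

# Linear predictor: the log-odds of Top-50 membership.
log_odds <- sum(beta * player[names(beta)])

# The inverse logit maps log-odds to a probability between 0 and 1.
p <- plogis(log_odds)
print(c(log_odds = log_odds, probability = p))
```

With a fitted model object, predict() with type = "response" performs the same calculation for every row of a data frame.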
4.3 Significant predictors
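Using the conventional threshold of p-value < 0.05, the only significant predictor in this model is:
fg (positive, p = 0.0245)
The intercept is also significant (p = 0.0109), but it is not a predictor.
No other variable reaches significance. ast, stl and blk have positive coefficients and tov has a negative coefficient, which matches basketball intuition, but their p-values range from 0.11 to 0.25, so these effects cannot be distinguished from zero in this sample.
efg is far from significant (p = 0.99) and its estimate is slightly negative. The very large standard errors of efg and fgp (about 41 each) suggest that the two shooting percentages are strongly collinear, so the model cannot separate their individual effects.
None of the position indicators is significant (all p-values exceed 0.12).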
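The significance check can be reproduced directly from the Pr(>|z|) column of the summary in Section 4.1. In the sketch below the p-values are typed in by hand so that it runs on its own; with a fitted model object (called model here as an assumption) the same vector could be taken from coef(summary(model))[, "Pr(>|z|)"].

```r
# p-values copied from the Pr(>|z|) column of the summary in Section 4.1.
p_values <- c(
  "(Intercept)" = 0.0109,
  posPF = 0.1230, posPG = 0.3684, posSF = 0.6837, posSG = 0.8651,
  fg = 0.0245, fgp = 0.6902, thr = 0.1259, thrp = 0.4516, efg = 0.9900,
  trb = 0.5219, ast = 0.1703, stl = 0.2454, blk = 0.1507, tov = 0.1127, pf = 0.8318
)

# Conventional 5% significance level.
alpha <- 0.05

# Names of the terms whose p-value falls below the threshold.
significant <- names(p_values)[p_values < alpha]
print(significant)
```

Only the intercept and fg pass the filter, which is what the asterisks in the summary output indicate.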
4.4 Odds Ratios
               OddsRatio        2.5 %       97.5 %
(Intercept) 7.681944e-09 7.151273e-16 4.024106e-03
posPF       1.181444e-01 6.536706e-03 1.667385e+00
posPG       1.532270e-01 2.070285e-03 8.393602e+00
posSF       5.052774e-01 1.717099e-02 1.375743e+01
posSG       7.372898e-01 1.963500e-02 2.580101e+01
fg          3.066453e+00 1.281986e+00 9.278973e+00
fgp         1.419001e+07 9.619407e-28 6.351427e+43
thr         1.088058e+01 5.760262e-01 2.826874e+02
thrp        1.106101e-02 4.817062e-08 5.464476e+02
efg         5.961075e-01 5.789918e-38 1.269382e+34
trb         1.170595e+00 7.160503e-01 1.922549e+00
ast         1.742356e+00 8.021087e-01 4.036959e+00
stl         3.697919e+00 4.357065e-01 3.907697e+01
blk         3.914198e+00 6.615859e-01 3.184309e+01
tov         1.734596e-01 1.636787e-02 1.344715e+00
pf          1.233917e+00 1.733626e-01 9.078909e+00
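Each odds ratio in the table is the exponential of the corresponding coefficient in Section 4.1. The sketch below checks this for four predictors and adds simple Wald 95% intervals as an illustration. The intervals in the table are not symmetric around the estimate on the log scale, so they were probably obtained by another method such as profile likelihood (for example via confint()); this is an inference rather than something the report states, and the Wald limits computed here will differ somewhat from the table.

```r
# Estimates and standard errors copied from Section 4.1 for four predictors.
est <- c(fg = 1.1205, ast = 0.5552, stl = 1.3078, tov = -1.7518)
se  <- c(fg = 0.4981, ast = 0.4049, stl = 1.1259, tov = 1.1043)

# An odds ratio is the exponential of a log-odds coefficient.
odds_ratio <- exp(est)

# Wald 95% interval: exponentiate the estimate plus or minus about 1.96 standard errors.
z <- qnorm(0.975)
wald <- cbind(
  OddsRatio = odds_ratio,
  lower     = exp(est - z * se),
  upper     = exp(est + z * se)
)
print(round(wald, 3))
```

An interval that contains 1 means the data are compatible with no effect on the odds, which is the odds-ratio counterpart of a p-value above 0.05.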
4.5 Odds ratio interpretations
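Using the odds ratios (OR) and 95% confidence intervals from Section 4.4:
fg:
Each additional field goal per game multiplies the odds of being Top‑50 by about 3.07 (95% CI 1.28 to 9.28), holding the other variables constant. The interval excludes 1, so this is the only clearly positive effect.
efg:
The OR of 0.60 has a confidence interval spanning from about 6 × 10^-38 to 1 × 10^34, so the model gives no usable information about the direction or size of the effect of shooting efficiency.
ast:
A one‑unit increase in assists multiplies the odds of Top‑50 status by about 1.74 (95% CI 0.80 to 4.04). The direction points to playmaking relevance, but the interval includes 1.
stl:
A one‑unit increase in steals multiplies the odds by about 3.70 (95% CI 0.44 to 39.08), consistent with defensive value, although the interval is wide and includes 1.
tov:
A one‑unit increase in turnovers multiplies the odds by about 0.17 (95% CI 0.02 to 1.34), a reduction of roughly 83%. The interval includes 1, so this effect is also not statistically significant.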
4.6 Logistic Regression Accuracy Interpretation
Training accuracy:
[1] 0.9153846
Testing accuracy:
[1] 0.9090909
Interpretation:
The logistic model demonstrates good generalisability: the testing accuracy (0.909) is within one percentage point of the training accuracy (0.915).
This suggests the logistic regression does not suffer from strong overfitting and captures stable, linear relationships between player performance statistics and Top‑50 status.
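A logistic regression outputs probabilities, so an accuracy figure requires a cut-off that converts each probability into a class. The sketch below shows one way to do this on simulated data. The 0.5 cut-off, the helper classify(), the single predictor and the simulated data frame are all assumptions made for illustration; the report does not state which cut-off was used.

```r
set.seed(1)

# Simulated stand-in data: one predictor and a binary outcome (not the report's data).
n  <- 200
fg <- runif(n, min = 1, max = 10)
top_50 <- factor(ifelse(rbinom(n, 1, plogis(-6 + 1.0 * fg)) == 1, "Y", "N"),
                 levels = c("N", "Y"))
sim <- data.frame(fg = fg, top_50 = top_50)

# Hold out the last 50 rows as a test set.
train <- sim[1:150, ]
test  <- sim[151:200, ]

model <- glm(top_50 ~ fg, family = binomial(link = "logit"), data = train)

# Predicted probability of "Y", converted to a class with a 0.5 cut-off.
classify <- function(model, data, cutoff = 0.5) {
  prob <- predict(model, newdata = data, type = "response")
  factor(ifelse(prob > cutoff, "Y", "N"), levels = c("N", "Y"))
}

train_acc <- mean(classify(model, train) == train$top_50)
test_acc  <- mean(classify(model, test)  == test$top_50)
print(c(training = train_acc, testing = test_acc))
```

Comparing the two numbers printed at the end is the same check applied above: a testing accuracy close to the training accuracy points to a model that generalises well.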
5 Comparison of Tree vs Logistic Regression
5.0.1 Accuracy comparison
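Logistic regression achieves slightly higher test accuracy than the classification tree (0.909 versus 0.894).
It also fits the training data better (0.915 versus 0.869), and neither model shows a drop in accuracy from training to testing.
On these results logistic regression is the more accurate predictive model, although the margin on the test set is small.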
5.0.2 Variable importance comparison
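Both models highlight fg (field goals made) as the key predictor: it has by far the highest importance score in the tree and is the only significant predictor in the logistic regression.
Beyond fg the two models diverge:
The tree also ranks thr, tov and pf highly, and it can capture interactions, e.g., combinations of thresholds on several statistics along one path.
The logistic model identifies predictors based on statistical significance and linear relationships, and finds no other variable significant at the 0.05 level.
efg contributes to neither model: it is absent from the tree’s importance output and has a p-value of 0.99 in the regression.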
5.0.3 Model behaviour differences
The classification tree is easier to interpret visually and captures nonlinear patterns, but can overfit.
Logistic regression provides stable, statistically‑grounded estimates and clearer effect-size interpretations (odds ratios).
6 Conclusion
The classification analysis of NBA players from the 2017–18 season highlights several insights into the performance attributes associated with ESPN Top‑50 status. Across both modelling approaches, scoring volume, measured by field goals made (fg), emerges as the strongest predictor of elite player ranking. Effective Field Goal Percentage plays no detectable role in either model. Assists, steals and blocks have coefficients in the expected positive direction in the logistic regression, and turnovers a negative one, but none of these is statistically significant in this sample.
The classification tree provides an interpretable, rule‑based structure that identifies combinations of performance thresholds associated with Top‑50 selection. Its testing accuracy (0.894) is slightly above its training accuracy (0.869), so there is no indication of overfitting. The logistic regression model is more accurate on both datasets (0.915 training, 0.909 testing), with similarly stable performance. Its single significant predictor, fg, is also the variable that dominates the tree’s importance ranking.
Overall, while both models offer valuable perspectives, the logistic regression model proves slightly more accurate for prediction. Both approaches indicate that top‑ranked NBA players are distinguished mainly by how much they score, with the other statistics adding comparatively little once field goals are accounted for. These findings illustrate how statistical modelling can quantify expert assessments such as ESPN’s rankings, while also revealing the measurable performance factors underlying elite status.