Part A — Classification of NBA Players (2017–18)

Author

Dylan Lynch

Published

April 24, 2026

1 Introduction

This report analyses NBA player statistics from the 2017–18 season to predict whether a player appears in ESPN’s Top 50 Players list for 2018–19. Two supervised learning classification methods are applied:

Classification Tree
Binary Logistic Regression

The variable pts is excluded from modelling, as instructed.

Both models are evaluated on training and testing datasets, and their results are compared.

2 Importing Data

The variable pts is dropped from the data before any model is fitted.
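A minimal sketch of how pts can be left out, assuming the data sit in a data frame. The data frame `nba` and its values are hypothetical stand-ins; only the names `predictors` and `top_50` are taken from the glm() call shown in Section 4.1.

```r
# Hypothetical stand-in for the imported data; the real file is not shown here.
nba <- data.frame(
  pos    = factor(c("C", "PF", "PG", "SF", "SG", "PG")),
  fg     = c(6.1, 3.2, 8.4, 4.0, 5.5, 7.3),
  ast    = c(1.5, 1.1, 7.9, 2.0, 3.1, 6.2),
  pts    = c(15.2, 8.1, 24.3, 10.4, 14.8, 20.1),
  top_50 = factor(c("N", "N", "Y", "N", "N", "Y"))
)

# Every column except the response and the excluded variable is a predictor.
predictors <- setdiff(names(nba), c("pts", "top_50"))

# Same subsetting pattern as the glm() call in Section 4.1.
model_data <- nba[, c(predictors, "top_50")]
print(names(model_data))
```

Building the predictor list once and reusing it for both models keeps the exclusion consistent between the tree and the logistic regression.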

2.1 Overfitting assessment

Overfitting is assessed by comparing accuracy on the training and testing datasets. The warning sign is a training accuracy that is clearly higher than the testing accuracy, for example 0.90 on training data against 0.70 on testing data.

A gap of that kind means the model fits the training data well but generalises poorly to new data. Classification trees are prone to this because they can keep splitting until the structure captures noise specific to the training sample.
When the two accuracies are close, or the testing accuracy is the higher of the two, there is no such evidence.
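The comparison of training and testing accuracy can be sketched as follows. The helper names `accuracy` and `accuracy_gap` and the toy labels are hypothetical and are not taken from the report's own code.

```r
# Proportion of cases where the predicted class matches the observed class.
accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# Positive gap = the model does better on training data than on test data.
accuracy_gap <- function(train_acc, test_acc) train_acc - test_acc

# Toy labels (hypothetical, for illustration only).
actual_train <- c("Y", "N", "N", "Y", "N", "N", "Y", "N")
pred_train   <- c("Y", "N", "N", "Y", "N", "N", "Y", "Y")
actual_test  <- c("N", "Y", "N", "N")
pred_test    <- c("N", "N", "N", "Y")

train_acc <- accuracy(pred_train, actual_train)  # 7 of 8 correct
test_acc  <- accuracy(pred_test, actual_test)    # 2 of 4 correct
gap <- accuracy_gap(train_acc, test_acc)
print(c(train = train_acc, test = test_acc, gap = gap))
```

A large positive gap, as in this toy case, is the pattern that would point to overfitting; a gap near zero or below zero would not.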

3 Classification Tree Method

3.1 Build and Visualise the Tree
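A minimal sketch of fitting and plotting a classification tree, assuming the rpart package (the importance output in Section 3.2.3 resembles rpart's, but the report does not name the package). The data are simulated and the rule that generates `top_50` is invented, so the resulting tree is not the report's tree.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)
n <- 200
# Simulated stand-in data; column names mirror the report, values do not.
sim <- data.frame(
  fg  = runif(n, 1, 10),
  ast = runif(n, 0, 10),
  tov = runif(n, 0, 5)
)
sim$top_50 <- factor(ifelse(sim$fg > 7 & sim$tov < 3.5, "Y", "N"))

# Classification tree; method = "class" uses Gini-based splits by default.
tree_fit <- rpart(top_50 ~ ., data = sim, method = "class")

# Base-graphics rendering of the fitted tree.
plot(tree_fit, margin = 0.1)
text(tree_fit, use.n = TRUE)

# Importance scores, the same kind of output as in Section 3.2.3.
print(tree_fit$variable.importance)
```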

3.2 Interpretation of the Tree

3.2.1 Rule predicting a player is in the Top 50

A representative rule of the kind the tree produces is:

If a player has a high Effective Field Goal Percentage (EFG) and their assist rate (AST) exceeds the corresponding node threshold, the model classifies them as a Top‑50 player.

At the terminal node for such a rule the predicted class is “Y”, and the class probability shown in the node is the share of training players in that leaf who are actually Top‑50 players. A value of 0.90, for example, would mean that 90% of the players in the leaf are Top‑50, which indicates a highly pure node. The variable names in this rule are illustrative: the importance output in Section 3.2.3 indicates that the fitted tree relies mainly on fg, thr and tov, so the exact split variables and thresholds should be read from the fitted tree.
In practical terms, a rule of this form says that players who are simultaneously efficient scorers and strong playmakers are highly likely to be among ESPN’s Top‑50.

3.2.2 Rule predicting a player is not in the Top 50

A corresponding rule for the negative class is:

If a player’s EFG is below the splitting threshold and their defensive statistics (such as steals or blocks) are also below the corresponding cut-offs, the model predicts they are not in the Top‑50.

In the resulting leaf node, the probability of class “N” is the share of training players in that leaf who are not Top‑50; a value above 0.85 would indicate a very pure non‑Top‑50 node.
A rule of this form says that players who are less efficient shooters and provide limited defensive impact are rarely considered Top‑50 players.

3.2.3 Variable Importance

        fg        thr        tov         pf        blk        trb        pos 
16.3355244  6.1996468  5.4451748  3.5572075  2.8196389  2.7911430  1.9380366 
       ast       thrp        stl        fgp 
 1.8660357  1.3760600  1.0881536  0.8043276

3.2.4 Most important variables in the tree

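Based on the variable importance output in Section 3.2.3, the most influential predictors are:

fg (field goals), with a score of 16.34
thr (three-pointers), 6.20
tov (turnovers), 5.45
pf (personal fouls), 3.56

fg alone scores more than twice as much as any other variable. blk, trb, pos and ast follow with scores between 1.9 and 2.8, and thrp, stl and fgp contribute least. efg does not appear in the output at all, so it was not credited with any impurity reduction in the fitted tree.

A variable scores highly when the splits it is involved in produce the largest reductions in impurity across the tree. The tree therefore relies mainly on scoring volume (fg, thr) and ball handling (tov) rather than on shooting percentages.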
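The impurity reduction behind an importance score can be sketched as below, assuming the Gini index as the impurity measure. The function names, the toy node and the threshold of 7 are hypothetical; a package such as rpart also scales by node size and credits surrogate splits, so this shows only the core quantity.

```r
# Gini impurity of a vector of class labels: 1 - sum of squared class shares.
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Impurity decrease from splitting `labels` by a logical condition,
# weighting each child node by its share of the parent node.
gini_decrease <- function(labels, goes_left) {
  n <- length(labels)
  left  <- labels[goes_left]
  right <- labels[!goes_left]
  gini(labels) -
    (length(left) / n) * gini(left) -
    (length(right) / n) * gini(right)
}

# Toy node (hypothetical values): 4 Top-50 players among 10.
top_50 <- c("Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "N")
fg     <- c(9.1, 8.4, 7.9, 6.2, 7.5, 4.0, 3.8, 5.1, 2.9, 4.4)

parent_impurity <- gini(top_50)             # 1 - (0.4^2 + 0.6^2)
decrease <- gini_decrease(top_50, fg >= 7)  # split on an illustrative threshold
print(c(parent = parent_impurity, decrease = decrease))
```

The larger the decrease, the purer the two child nodes are relative to the parent, which is why variables that split early and cleanly accumulate high importance.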

3.3 Training and Testing Accuracy

[1] 0.8692308

[1] 0.8939394

3.4 Overfitting assessment

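The training accuracy is 0.869 and the testing accuracy is 0.894 (Section 3.3). Overfitting would show up as a training accuracy clearly above the testing accuracy, which is not the case here: the tree is slightly more accurate on the testing data.

The classification tree therefore shows no evidence of overfitting on this split. The difference of about 0.025 is small enough to be due to chance in how the data were divided, so the two accuracies are best read as roughly equal.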

4 Binary Logistic Regression

4.1 Build the Model


Call:
glm(formula = top_50 ~ ., family = binomial(link = "logit"), 
    data = nba_train[, c(predictors, "top_50")])

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -18.6844     7.3431  -2.544   0.0109 *
posPF        -2.1358     1.3850  -1.542   0.1230  
posPG        -1.8758     2.0854  -0.900   0.3684  
posSF        -0.6826     1.6757  -0.407   0.6837  
posSG        -0.3048     1.7937  -0.170   0.8651  
fg            1.1205     0.4981   2.250   0.0245 *
fgp          16.4680    41.3104   0.399   0.6902  
thr           2.3870     1.5595   1.531   0.1259  
thrp         -4.5043     5.9837  -0.753   0.4516  
efg          -0.5173    41.2340  -0.013   0.9900  
trb           0.1575     0.2460   0.640   0.5219  
ast           0.5552     0.4049   1.371   0.1703  
stl           1.3078     1.1259   1.162   0.2454  
blk           1.3646     0.9495   1.437   0.1507  
tov          -1.7518     1.1043  -1.586   0.1127  
pf            0.2102     0.9898   0.212   0.8318  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.818  on 129  degrees of freedom
Residual deviance:  61.859  on 114  degrees of freedom
AIC: 93.859

Number of Fisher Scoring iterations: 7

4.2 Regression Equation

logit(p) = -18.6844 + (-2.1358)·posPF + (-1.8758)·posPG + (-0.6826)·posSF + (-0.3048)·posSG + 1.1205·FG + 16.4680·FGP + 2.3870·THR + (-4.5043)·THRP + (-0.5173)·EFG + 0.1575·TRB + 0.5552·AST + 1.3078·STL + 1.3646·BLK + (-1.7518)·TOV + 0.2102·PF
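The equation can be turned into a predicted probability with the inverse logit. In this sketch the coefficients are copied from the summary in Section 4.1, while the player is invented for illustration; it also assumes that the percentage variables are stored as proportions between 0 and 1, which the report does not state.

```r
# Coefficients copied from the model summary in Section 4.1.
beta <- c(
  "(Intercept)" = -18.6844, posPF = -2.1358, posPG = -1.8758,
  posSF = -0.6826, posSG = -0.3048, fg = 1.1205, fgp = 16.4680,
  thr = 2.3870, thrp = -4.5043, efg = -0.5173, trb = 0.1575,
  ast = 0.5552, stl = 1.3078, blk = 1.3646, tov = -1.7518, pf = 0.2102
)

# Hypothetical shooting guard; every value below is invented.
# Percentages are assumed to be proportions (0 to 1).
player <- c(
  "(Intercept)" = 1, posPF = 0, posPG = 0, posSF = 0, posSG = 1,
  fg = 8, fgp = 0.48, thr = 2, thrp = 0.37, efg = 0.55, trb = 5,
  ast = 5, stl = 1.2, blk = 0.5, tov = 2.5, pf = 2
)

log_odds <- sum(beta * player[names(beta)])  # linear predictor, logit(p)
p_top_50 <- plogis(log_odds)                 # inverse logit: 1 / (1 + exp(-x))
predicted_class <- if (p_top_50 > 0.5) "Y" else "N"
print(c(log_odds = log_odds, probability = p_top_50))
```

The 0.5 cut-off used for `predicted_class` is a common default and is an assumption here, since the report does not say which threshold was used.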

4.3 Significant predictors

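In the model summary in Section 4.1, only one predictor is significant at the 5% level:

fg (positive, p = 0.0245)

A variable is considered significant if its p-value is below 0.05. The intercept also meets this threshold (p = 0.0109), but it is not a predictor.
The result indicates that, holding the other variables fixed, a higher fg increases the probability of being Top‑50. The coefficients for ast, stl and blk are positive and the coefficient for tov is negative, as expected, but none of them is significant (p between 0.11 and 0.25).
The very large standard errors for fgp and efg (about 41 each) suggest that these two shooting percentages are highly collinear, so their individual effects cannot be separated.
None of the position indicators is significant (p between 0.12 and 0.87); each compares a position with the reference level, which is not listed in the output.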
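The z values and p-values in the summary follow directly from the estimates and standard errors. A short sketch for a subset of the coefficients, with numbers copied from Section 4.1:

```r
# Estimates and standard errors copied from the summary in Section 4.1.
est <- c(fg = 1.1205, thr = 2.3870, ast = 0.5552, stl = 1.3078,
         tov = -1.7518, efg = -0.5173)
se  <- c(fg = 0.4981, thr = 1.5595, ast = 0.4049, stl = 1.1259,
         tov = 1.1043, efg = 41.2340)

z <- est / se            # Wald z statistic
p <- 2 * pnorm(-abs(z))  # two-sided p-value from the standard normal

significant <- names(p)[p < 0.05]
print(round(cbind(z = z, p = p), 4))
print(significant)
```

Because the inputs are rounded to four decimals, the recomputed values can differ from the printed summary in the last digit.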

4.4 Odds Ratios

               OddsRatio        2.5 %       97.5 %
(Intercept) 7.681944e-09 7.151273e-16 4.024106e-03
posPF       1.181444e-01 6.536706e-03 1.667385e+00
posPG       1.532270e-01 2.070285e-03 8.393602e+00
posSF       5.052774e-01 1.717099e-02 1.375743e+01
posSG       7.372898e-01 1.963500e-02 2.580101e+01
fg          3.066453e+00 1.281986e+00 9.278973e+00
fgp         1.419001e+07 9.619407e-28 6.351427e+43
thr         1.088058e+01 5.760262e-01 2.826874e+02
thrp        1.106101e-02 4.817062e-08 5.464476e+02
efg         5.961075e-01 5.789918e-38 1.269382e+34
trb         1.170595e+00 7.160503e-01 1.922549e+00
ast         1.742356e+00 8.021087e-01 4.036959e+00
stl         3.697919e+00 4.357065e-01 3.907697e+01
blk         3.914198e+00 6.615859e-01 3.184309e+01
tov         1.734596e-01 1.636787e-02 1.344715e+00
pf          1.233917e+00 1.733626e-01 9.078909e+00

4.5 Odds ratio interpretations

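The odds ratios in Section 4.4 are interpreted as follows, holding the other predictors fixed:

FG:
Each 1‑unit increase in fg multiplies the odds of being Top‑50 by 3.07 (95% CI 1.28 to 9.28).
This is the only interval that excludes 1.
EFG:
The odds ratio is 0.60, but the interval runs from about 6 × 10^-38 to 1 × 10^34, so the data say nothing reliable about the direction or size of the effect.
AST:
A one‑unit increase in assists multiplies the odds of Top‑50 status by 1.74 (95% CI 0.80 to 4.04), which points to playmaking relevance but is not significant.
STL:
A one‑unit increase in steals multiplies the odds by 3.70 (95% CI 0.44 to 39.08), consistent with defensive value but very imprecise.
TOV:
A one‑unit increase in turnovers multiplies the odds by 0.17 (95% CI 0.02 to 1.34).
The estimate suggests that turnovers sharply reduce the likelihood of being Top‑50, although the interval includes 1.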
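An odds ratio is the exponential of the corresponding log-odds coefficient. The sketch below recomputes a few of them from the coefficients in Section 4.1 and shows how to rescale the effect to a step other than one unit; the half-unit step is only an illustration.

```r
# Log-odds coefficients copied from the summary in Section 4.1.
beta <- c(fg = 1.1205, ast = 0.5552, stl = 1.3078, tov = -1.7518)

odds_ratio <- exp(beta)               # multiplicative change in odds per unit
pct_change <- (odds_ratio - 1) * 100  # the same effect as a percentage

# Effect of a smaller step, e.g. half a unit: exp(0.5 * beta), not 0.5 * OR.
half_unit_or <- exp(0.5 * beta)

print(round(cbind(odds_ratio, pct_change, half_unit_or), 3))
```

Rescaling on the log-odds scale matters most for variables with a narrow range, where a full one-unit change may lie outside the observed data.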

4.6 Logistic Regression Accuracy Interpretation

The two values printed below are the training and testing accuracy of the logistic model:

Training accuracy: 0.915
Testing accuracy: 0.909

Interpretation:

The logistic model demonstrates good generalisability, with testing accuracy within 0.01 of training accuracy.
This suggests the logistic regression does not suffer from strong overfitting and captures stable, linear relationships between player performance statistics and the log-odds of Top‑50 status.

[1] 0.9153846

[1] 0.9090909

5 Comparison of Tree vs Logistic Regression

5.0.1 Accuracy comparison

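Logistic regression achieves the higher test accuracy (0.909 against 0.894 for the classification tree).
It also fits the training data better (0.915 against 0.869), so the tree is not ahead on either dataset.

On accuracy, logistic regression is the more reliable predictive model here, although the margin on the testing data is small.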
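The four accuracies printed in Sections 3.3 and 4.6 can be compared side by side. In this sketch the sample sizes are assumptions: 130 training rows follows from the 129 null-deviance degrees of freedom in Section 4.1, while 66 testing rows is inferred from the printed fractions and is not stated in the report.

```r
# Accuracies as printed in Sections 3.3 and 4.6.
acc <- rbind(
  tree     = c(train = 0.8692308, test = 0.8939394),
  logistic = c(train = 0.9153846, test = 0.9090909)
)

gap <- acc[, "train"] - acc[, "test"]  # positive = better on training data

# Assumed sample sizes (see the note above the code).
n_train <- 130
n_test  <- 66
correct <- round(cbind(train = acc[, "train"] * n_train,
                       test  = acc[, "test"]  * n_test))
print(gap)
print(correct)
```

Under these assumptions the two models differ by a single correctly classified player on the testing data, which is why the comparison should be read with caution.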

5.0.2 Variable importance comparison

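Both models highlight fg as the key predictor: it has the highest importance score in the tree and is the only significant coefficient in the logistic model.

thr and tov rank next in the tree and have coefficients in the expected direction in the logistic model, but neither is significant there (p ≈ 0.13 and 0.11).
efg plays no role in either model: it is absent from the tree's importance output and has a p-value of 0.99 in the regression.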

However:

The tree also emphasises interactions, e.g., combinations of efficiency and defence.
The logistic model identifies predictors based on statistical significance and linear relationships.

5.0.3 Model behaviour differences

The classification tree is easier to interpret visually and captures nonlinear patterns, but can overfit.
Logistic regression provides stable, statistically‑grounded estimates and clearer effect-size interpretations (odds ratios).

6 Conclusion

The classification analysis of NBA players from the 2017–18 season highlights several insights into the performance attributes associated with ESPN Top‑50 status. Across both modelling approaches, scoring volume, measured by field goals (fg), emerges as the strongest predictor of elite ranking: it has the highest importance score in the classification tree and is the only predictor significant at the 5% level in the logistic regression. Three-pointers and turnovers rank next in the tree, while assists, steals and blocks have positive but non-significant coefficients in the logistic model. Shooting percentages, including Effective Field Goal Percentage, add little once these variables are accounted for.

The classification tree provides an interpretable, rule‑based structure that identifies combinations of performance thresholds associated with Top‑50 selection. Its testing accuracy (0.894) is slightly above its training accuracy (0.869), so there is no evidence of overfitting on this split. The logistic regression model is more accurate on both datasets (0.915 training, 0.909 testing) and equally stable, suggesting good generalisation to unseen cases. Its one statistically significant predictor, fg, is also the variable the tree ranks highest.

Overall, while both models offer valuable perspectives, the logistic regression model is the more accurate of the two, though only narrowly on the testing data. Both approaches indicate that top‑ranked NBA players are above all high-volume scorers, with playmaking and defensive statistics pointing in the expected direction without reaching significance. These findings illustrate how statistical modelling can quantify expert assessments such as ESPN’s rankings, while also revealing the measurable performance factors underlying elite status.