DAT-4313 - Data Viz in Model Development

NBA - Model Analysis

Author

Pamela Carta

NBA with depvar=GP

DATA

Partition 60% Train / 40% Test

The data for this analysis come from the NBA.game and NBA.external datasets in {datasetsICR}. GP (games played) was chosen as the dependent variable.

Training set number of observations: 319
Test dataset number of observations: 211
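
A 60/40 partition like the one above can be sketched as a seeded random split of row indices. This is a minimal illustration in plain Python, not the routine used for this report (which is not shown). The function name `split_indices` and the seed are hypothetical, and the exact counts a real routine returns (319/211 here) depend on its rounding or stratification rules.

```python
import random

def split_indices(n_rows, train_share=0.6, seed=4313):
    """Return (train, test) index lists from a seeded random shuffle."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_train = round(n_rows * train_share)
    return sorted(idx[:n_train]), sorted(idx[n_train:])

n_total = 319 + 211                    # rows reported for train + test
train_idx, test_idx = split_indices(n_total)
print(len(train_idx), len(test_idx))   # sizes from this simple rounding rule
print(round(319 / n_total, 3))         # training share actually reported
```

The reported split corresponds to a training share of 319/530, which is close to, but not exactly, 60%.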

EDA

Descriptive Statistics

        vars   n    mean     sd min    max  range    se
PLAYER     1 319     NaN     NA Inf   -Inf   -Inf    NA
FORWARD    2 319     NaN     NA Inf   -Inf   -Inf    NA
CENTER     3 319     NaN     NA Inf   -Inf   -Inf    NA
GUARD      4 319     NaN     NA Inf   -Inf   -Inf    NA
ROOKIE     5 319     NaN     NA Inf   -Inf   -Inf    NA
TEAM       6 319     NaN     NA Inf   -Inf   -Inf    NA
AGE        7 319   26.16   4.27  19   39.0   20.0  0.24
GP         8 319   49.33  25.76   1   82.0   81.0  1.44
W          9 319   25.46  16.08   0   60.0   60.0  0.90
MIN       10 319 1098.37 814.22   1 2848.0 2847.0 45.59
PTS       11 319   19.39   6.42   0   42.8   42.8  0.36
FGM       12 319    7.27   2.48   0   17.1   17.1  0.14
FGA       13 319   16.54   4.93   0   52.7   52.7  0.28
FTM       14 319    2.82   1.74   0    9.1    9.1  0.10
FTA       15 319    3.85   2.36   0   18.2   18.2  0.13
REB       16 319    8.81   4.94   0   52.7   52.7  0.28
AST       17 319    4.24   2.84   0   17.1   17.1  0.16
TOV       18 319    2.43   1.19   0    7.2    7.2  0.07
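
The `se` column in the table is consistent with the standard error of the mean, sd / sqrt(n). The short check below (plain Python, using three rows copied from the table) is only a sanity check of that relationship; the helper name `standard_error` is hypothetical.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

n = 319
# (sd, reported se) copied from the descriptive statistics table
reported = {"AGE": (4.27, 0.24), "GP": (25.76, 1.44), "MIN": (814.22, 45.59)}

for name, (sd, se) in reported.items():
    # Print the recomputed value next to the reported one
    print(name, round(standard_error(sd, n), 2), se)
```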

Boxplots – All Numeric

             GP         AGE          W         MIN        PTS         FGM
GP   1.00000000  0.13765105 0.86655773  0.89265179 0.25903290  0.19635788
AGE  0.13765105  1.00000000 0.18477492  0.12510942 0.03021520 -0.01014518
W    0.86655773  0.18477492 1.00000000  0.77468986 0.24506678  0.18734788
MIN  0.89265179  0.12510942 0.77468986  1.00000000 0.38448583  0.30370830
PTS  0.25903290  0.03021520 0.24506678  0.38448583 1.00000000  0.95622579
FGM  0.19635788 -0.01014518 0.18734788  0.30370830 0.95622579  1.00000000
FGA  0.06449060 -0.04464991 0.06282458  0.23477089 0.71465962  0.69429774
FTM  0.23858521  0.06112941 0.20450642  0.33695833 0.55812674  0.36540135
FTA  0.15912557  0.01157608 0.13534824  0.25532348 0.52877561  0.37860456
REB -0.01152221 -0.02644730 0.01359890 -0.02261317 0.05682883  0.09234671
AST  0.13889650  0.12368162 0.14670535  0.23834552 0.24852086  0.19270196
TOV  0.08590149  0.01571774 0.06185555  0.15733926 0.22376608  0.19650020
            FGA        FTM        FTA         REB        AST        TOV
GP   0.06449060 0.23858521 0.15912557 -0.01152221  0.1388965 0.08590149
AGE -0.04464991 0.06112941 0.01157608 -0.02644730  0.1236816 0.01571774
W    0.06282458 0.20450642 0.13534824  0.01359890  0.1467054 0.06185555
MIN  0.23477089 0.33695833 0.25532348 -0.02261317  0.2383455 0.15733926
PTS  0.71465962 0.55812674 0.52877561  0.05682883  0.2485209 0.22376608
FGM  0.69429774 0.36540135 0.37860456  0.09234671  0.1927020 0.19650020
FGA  1.00000000 0.35195174 0.35811986  0.13470671  0.2437060 0.30590371
FTM  0.35195174 1.00000000 0.93579234  0.23083014  0.2592657 0.32252288
FTA  0.35811986 0.93579234 1.00000000  0.33123676  0.1924632 0.39954490
REB  0.13470671 0.23083014 0.33123676  1.00000000 -0.1688349 0.13735957
AST  0.24370604 0.25926573 0.19246317 -0.16883490  1.0000000 0.39752185
TOV  0.30590371 0.32252288 0.39954490  0.13735957  0.3975218 1.00000000

The boxplots show that AST has the most outliers of any numeric variable.

Histograms

Most of the numeric variables are positively skewed (right-skewed).

Scatterplots (depvar ~ all x)

The scatterplots of the dependent variable against each continuous candidate predictor show that games played is strongly and positively correlated with both wins and minutes played. According to the correlation matrix, the correlation between games played and minutes played is 0.89, and the correlation between games played and wins is 0.88.

Correlation Matrix

MODEL

Linear Regression Model

Estimate the following model:
\(GP = AGE + AGE^2 + MIN + PTS + FGA + TOV + FTM + FGM + W\)

Estimate Coefficients and show coefficients table

	Estimate	Standard Error	t value	Pr(>|t|)	Signif.
(Intercept)	58.482	17.523	3.337	0.0009	***
AGE	-2.802	1.288	-2.176	0.0303	*
I(AGE^2)	0.048	0.023	2.058	0.0404	*
MIN	0.017	0.001	17.076	0.0000	***
PTS	-0.543	0.357	-1.519	0.1299
FGA	-0.500	0.124	-4.019	0.0001	***
TOV	-0.031	0.481	-0.065	0.9483
FTM	0.969	0.445	2.178	0.0301	*
FGM	1.182	0.817	1.446	0.1492
W	0.735	0.049	14.864	0.0000	***
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 8.671 on 309 degrees of freedom
Multiple R-squared: 0.8899, Adjusted R-squared: 0.8867
F-statistic: 277.4 on 9 and 309 DF, p-value: < 0.0001
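
The fit statistics above are linked by standard formulas, so they can be cross-checked from R-squared alone. The sketch below (plain Python, with hypothetical helper names) assumes n = 319 training rows and k = 9 predictors, which gives n - k - 1 = 309 residual degrees of freedom. Small differences from the reported values come from using the rounded R-squared.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors and n rows."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def f_statistic(r2, n, k):
    """Overall F statistic on (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

n, k, r2 = 319, 9, 0.8899          # values taken from the model summary
adj = adjusted_r2(r2, n, k)
f_stat = f_statistic(r2, n, k)
print(round(adj, 4), round(f_stat, 1))
```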

Because the dependent variable is the number of games played, fitting the model to a logarithmic transformation (log(GP)) may not be the most appropriate approach. I therefore fitted a linear specification to GP on its original scale. The predictors were age (with a squared term), minutes played, points, field goals attempted, turnovers, free throws made, field goals made, and wins.

The model predicts GP well from the selected predictors, with MIN and W contributing most strongly (they have by far the largest t values). The quadratic age term is significant, so the relationship between age and games played is non-linear. Because the AGE coefficient is negative and the AGE^2 coefficient is positive, the fitted curve is U-shaped: holding the other predictors fixed, predicted games played falls with age until roughly age 29 and rises after that. Approximately 88.99% of the variability in GP is explained by the model’s predictors, indicating a strong fit to the training data.
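
As a worked check on the shape of the age effect, the turning point of the fitted quadratic can be computed from the rounded coefficients in the table (-2.802 for AGE and 0.048 for AGE^2), holding the other predictors fixed:

```latex
\[
\frac{\partial \widehat{GP}}{\partial AGE} = -2.802 + 2(0.048)\,AGE = 0
\quad\Longrightarrow\quad
AGE^{*} = \frac{2.802}{2 \times 0.048} \approx 29.2
\]
```

Because the coefficient on AGE^2 is positive, this turning point is a minimum rather than a maximum. The value is approximate: the AGE^2 coefficient is rounded to three decimals, and small changes in it shift the result by a few tenths of a year.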

Coefficient Magnitude Plot

The coefficient plot shows that the largest positive coefficients are those for wins, field goals made, and free throws made, although the FGM coefficient is not statistically significant (p = 0.1492). The largest negative coefficient is that for age.

Check for predictor independence

Using Variance Inflation Factors (VIF)

       AGE   I(AGE^2)        MIN        PTS        FGA        TOV        FTM 
128.044508 127.380905   2.927133  22.249295   1.590280   1.377569   2.521044 
       FGM          W 
 17.430393   2.673223
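
A VIF is defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below (plain Python) flags the reported VIFs above 10 and converts each back to its implied R_j^2. The cutoff of 10 is a common rule of thumb and an assumption here, not something the report states.

```python
# VIF values copied from the output above
vif = {
    "AGE": 128.044508, "I(AGE^2)": 127.380905, "MIN": 2.927133,
    "PTS": 22.249295, "FGA": 1.590280, "TOV": 1.377569,
    "FTM": 2.521044, "FGM": 17.430393, "W": 2.673223,
}
THRESHOLD = 10.0   # assumed rule-of-thumb cutoff

def implied_r2(v):
    """R-squared of predictor j on the others, recovered from its VIF."""
    return 1 - 1 / v

flagged = {name: round(implied_r2(v), 3)
           for name, v in vif.items() if v > THRESHOLD}
print(flagged)
```

High VIFs for AGE and I(AGE^2) are expected whenever a variable and its square enter the same model. The high values for PTS and FGM match their pairwise correlation of about 0.96 in the correlation output.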

Residual Analysis

Residual Range

     0%     25%     50%     75%    100% 
-18.740  -6.475  -0.590   5.725  23.450

Residual Plots

We are looking for:
- Random distribution of residuals vs. fitted values
- Normally distributed residuals: a Normal Q-Q plot with values along the line
- Homoskedasticity: a horizontal Scale-Location line and no residual pattern
- Minimal influential observations, that is, those outside the borders of Cook’s distance

Plot Fitted Value by Actual Value

The plot of residuals against fitted values shows that the residuals are not randomly scattered. They are not evenly dispersed around the zero line, which points to patterns or systematic deviations in the model’s predictions across the range of fitted values.

Plot Residuals by Fitted Values

Performance Evaluation

Use Model to Score `test` dataset (Display First 10 values - depvar and fitted values only)

PLAYER	GP	fit	lwr	upr
Alan Williams	5	10.2	-7.6	27.9
Alec Burks	64	48.7	31.5	65.9
Alex Poythress	21	23.6	6.4	40.7
Alize Johnson	14	23.1	5.7	40.5
Allen Crabbe	43	42.7	25.5	59.9
Allonzo Trier	64	48.2	30.9	65.4
Amile Jefferson	12	23.4	5.9	40.8
Amir Johnson	51	44.7	27.5	62.0
Andrew Harrison	17	21.2	3.9	38.5
Andrew Wiggins	73	76.0	58.7	93.3
n: 10
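
The `lwr` and `upr` columns appear to be interval bounds around each fitted value; the report does not state their type or confidence level. As a quick check on the ten displayed rows, the plain-Python sketch below counts how many actual GP values fall inside their interval. It also lists rows whose bounds leave the 0 to 82 range, where 82 is the maximum GP in the descriptive statistics.

```python
# (PLAYER, GP, fit, lwr, upr) copied from the table above
rows = [
    ("Alan Williams", 5, 10.2, -7.6, 27.9),
    ("Alec Burks", 64, 48.7, 31.5, 65.9),
    ("Alex Poythress", 21, 23.6, 6.4, 40.7),
    ("Alize Johnson", 14, 23.1, 5.7, 40.5),
    ("Allen Crabbe", 43, 42.7, 25.5, 59.9),
    ("Allonzo Trier", 64, 48.2, 30.9, 65.4),
    ("Amile Jefferson", 12, 23.4, 5.9, 40.8),
    ("Amir Johnson", 51, 44.7, 27.5, 62.0),
    ("Andrew Harrison", 17, 21.2, 3.9, 38.5),
    ("Andrew Wiggins", 73, 76.0, 58.7, 93.3),
]

# Rows whose actual GP lies within [lwr, upr]
inside = [name for name, gp, fit, lwr, upr in rows if lwr <= gp <= upr]
# Rows whose interval extends below 0 games or above 82 games
out_of_range = [name for name, gp, fit, lwr, upr in rows if lwr < 0 or upr > 82]

print(len(inside), "of", len(rows), "actual values fall inside their interval")
print("Bounds outside 0-82:", out_of_range)
```

Intervals that extend below zero or above 82 games are a side effect of fitting an unbounded linear model to a bounded outcome.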

Plot Actual vs Fitted (`test`)

There is a positive relationship between the actual and fitted values in the test set.

Performance Metrics

Metric	Value
MAE	7.6006
RMSE	9.3526
MAPE	0.7820
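
For reference, the three metrics can be written out as below (plain Python, hypothetical function names). They are applied only to the ten test rows displayed earlier, so the results will not match the full-test values in the table. MAPE is computed as a fraction, which is how the reported 0.7820 appears to be expressed.

```python
import math

# Actual GP and fitted values for the ten displayed test rows
actual = [5, 64, 21, 14, 43, 64, 12, 51, 17, 73]
fitted = [10.2, 48.7, 23.6, 23.1, 42.7, 48.2, 23.4, 44.7, 21.2, 76.0]

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(y, yhat)) / len(y))

def mape(y, yhat):
    """Mean absolute percentage error, as a fraction (requires y != 0)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(y, yhat)) / len(y)

print(round(mae(actual, fitted), 2),
      round(rmse(actual, fitted), 3),
      round(mape(actual, fitted), 4))
```

MAPE divides each error by the actual value, so players with very few games dominate it. In this subset, the single row with GP = 5 contributes an absolute percentage error above 100%.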

Model Fit by Age

`ggplot2` explore (Intro Section)

This section simply replicates the first section of Chapter 6 (for reference).


Call:
lm(formula = GP ~ W, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.755 -10.085  -1.967   8.881  40.365 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 13.57449    1.30223   10.42 <0.0000000000000002 ***
W            1.40401    0.04326   32.46 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.41 on 317 degrees of freedom
Multiple R-squared:  0.7687,    Adjusted R-squared:  0.768 
F-statistic:  1053 on 1 and 317 DF,  p-value: < 0.00000000000000022
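
In a one-predictor regression, R-squared is the squared Pearson correlation and the F statistic is the squared t value of the slope. The plain-Python sketch below uses the numbers from the output above to check both identities and to score a hypothetical player with 30 wins. The function name `predict_gp` and the 30-win input are illustrative assumptions.

```python
import math

# Coefficients and statistics copied from the lm() summary above
intercept, slope = 13.57449, 1.40401
r_squared, t_slope = 0.7687, 32.46

def predict_gp(wins):
    """Fitted games played from the simple model GP ~ W."""
    return intercept + slope * wins

r = math.sqrt(r_squared)     # positive root, because the slope is positive
t_squared = t_slope ** 2     # should be close to the reported F of 1053

print(round(r, 2), round(t_squared, 1), round(predict_gp(30), 1))
```

The implied correlation of about 0.88 agrees with the GP-wins correlation quoted in the scatterplot discussion.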

SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

The final model shows a strong, statistically significant relationship between games played and both wins and minutes played, and it explains about 89% of the variability in GP. Even so, it is only a moderately satisfactory model. The residuals are not randomly scattered around zero, several predictors have very high variance inflation factors (AGE, AGE^2, PTS, and FGM), and the test-set errors are sizeable (MAE 7.60, RMSE 9.35, MAPE 0.78). The model provides a foundational understanding but does not capture the full complexity of the data. Adding further variables, reassessing the model assumptions, and exploring alternative modeling techniques could improve its accuracy and reliability as a predictive tool.