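Variable | vars | n | mean | sd | min | max | range | se |
|---|---|---|---|---|---|---|---|---|
PLAYER | 1 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
FORWARD | 2 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
CENTER | 3 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
GUARD | 4 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
ROOKIE | 5 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
TEAM | 6 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
AGE | 7 | 319 | 26.16 | 4.27 | 19 | 39.0 | 20.0 | 0.24 |
GP | 8 | 319 | 49.33 | 25.76 | 1 | 82.0 | 81.0 | 1.44 |
W | 9 | 319 | 25.46 | 16.08 | 0 | 60.0 | 60.0 | 0.90 |
MIN | 10 | 319 | 1098.37 | 814.22 | 1 | 2848.0 | 2847.0 | 45.59 |
PTS | 11 | 319 | 19.39 | 6.42 | 0 | 42.8 | 42.8 | 0.36 |
FGM | 12 | 319 | 7.27 | 2.48 | 0 | 17.1 | 17.1 | 0.14 |
FGA | 13 | 319 | 16.54 | 4.93 | 0 | 52.7 | 52.7 | 0.28 |
FTM | 14 | 319 | 2.82 | 1.74 | 0 | 9.1 | 9.1 | 0.10 |
FTA | 15 | 319 | 3.85 | 2.36 | 0 | 18.2 | 18.2 | 0.13 |
REB | 16 | 319 | 8.81 | 4.94 | 0 | 52.7 | 52.7 | 0.28 |
AST | 17 | 319 | 4.24 | 2.84 | 0 | 17.1 | 17.1 | 0.16 |
TOV | 18 | 319 | 2.43 | 1.19 | 0 | 7.2 | 7.2 | 0.07 |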
DAT-4313 - Data Viz in Model Development
NBA - Model Analysis
NBA with depvar=GP
DATA
Partition 60% Train / 40% Test
The data used in this analysis come from the NBA.game and NBA.external datasets {datasetsICR}. GP (games played) was chosen as the dependent variable.
Training set number of observations: 319
Test dataset number of observations: 211
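The report does not show how the partition was drawn. As a minimal sketch in Python, assuming a plain random shuffle of row indices (the helper name `train_test_split_indices` and the seed are illustrative assumptions, not taken from the original analysis), a 60/40 split of the 530 rows could look like this:

```python
import random


def train_test_split_indices(n_rows, train_frac=0.6, seed=42):
    """Return (train_idx, test_idx) from a random shuffle of row indices.

    Hypothetical helper: the report's actual partitioning method and seed
    are not shown, so both are assumptions here.
    """
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_train = round(n_rows * train_frac)
    # Sort only for readability; membership is what matters.
    return sorted(idx[:n_train]), sorted(idx[n_train:])


# 319 training rows + 211 test rows = 530 rows in total.
train_idx, test_idx = train_test_split_indices(319 + 211)
print(len(train_idx), len(test_idx))
```

A plain 60% cut of 530 rows gives 318 and 212 rows. The report's counts (319 and 211) differ by one row, so its split was evidently drawn by a slightly different rule.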
EDA
Descriptive Statistics
Boxplots – All Numeric
GP AGE W MIN PTS FGM
GP 1.00000000 0.13765105 0.86655773 0.89265179 0.25903290 0.19635788
AGE 0.13765105 1.00000000 0.18477492 0.12510942 0.03021520 -0.01014518
W 0.86655773 0.18477492 1.00000000 0.77468986 0.24506678 0.18734788
MIN 0.89265179 0.12510942 0.77468986 1.00000000 0.38448583 0.30370830
PTS 0.25903290 0.03021520 0.24506678 0.38448583 1.00000000 0.95622579
FGM 0.19635788 -0.01014518 0.18734788 0.30370830 0.95622579 1.00000000
FGA 0.06449060 -0.04464991 0.06282458 0.23477089 0.71465962 0.69429774
FTM 0.23858521 0.06112941 0.20450642 0.33695833 0.55812674 0.36540135
FTA 0.15912557 0.01157608 0.13534824 0.25532348 0.52877561 0.37860456
REB -0.01152221 -0.02644730 0.01359890 -0.02261317 0.05682883 0.09234671
AST 0.13889650 0.12368162 0.14670535 0.23834552 0.24852086 0.19270196
TOV 0.08590149 0.01571774 0.06185555 0.15733926 0.22376608 0.19650020
FGA FTM FTA REB AST TOV
GP 0.06449060 0.23858521 0.15912557 -0.01152221 0.1388965 0.08590149
AGE -0.04464991 0.06112941 0.01157608 -0.02644730 0.1236816 0.01571774
W 0.06282458 0.20450642 0.13534824 0.01359890 0.1467054 0.06185555
MIN 0.23477089 0.33695833 0.25532348 -0.02261317 0.2383455 0.15733926
PTS 0.71465962 0.55812674 0.52877561 0.05682883 0.2485209 0.22376608
FGM 0.69429774 0.36540135 0.37860456 0.09234671 0.1927020 0.19650020
FGA 1.00000000 0.35195174 0.35811986 0.13470671 0.2437060 0.30590371
FTM 0.35195174 1.00000000 0.93579234 0.23083014 0.2592657 0.32252288
FTA 0.35811986 0.93579234 1.00000000 0.33123676 0.1924632 0.39954490
REB 0.13470671 0.23083014 0.33123676 1.00000000 -0.1688349 0.13735957
AST 0.24370604 0.25926573 0.19246317 -0.16883490 1.0000000 0.39752185
TOV 0.30590371 0.32252288 0.39954490 0.13735957 0.3975218 1.00000000
The boxplots showed that AST had the most outliers.
Histograms
The histograms showed that most of the numeric variables were positively skewed.
Scatterplots (depvar ~ all x)
The scatterplots of the dependent variable against each continuous candidate predictor showed a positive relationship between games played and both wins and minutes played. According to the correlation matrix, the correlation between games played and minutes played was 0.89, and the correlation between games played and wins was 0.87.
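For reference, the sketch below shows how a Pearson correlation coefficient is computed, assuming the matrix was produced with the default Pearson method (the report does not say). The function name `pearson_r` and the five GP/MIN pairs are made up for illustration and are not rows from the dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only (hypothetical players, not from NBA.game).
games_played = [10, 25, 48, 66, 80]
minutes = [95, 410, 1020, 1750, 2600]
print(round(pearson_r(games_played, minutes), 3))
```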
Correlation Matrix
MODEL
Linear Regression Model
Estimate the following model:
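\(GP = \beta_0 + \beta_1 AGE + \beta_2 AGE^2 + \beta_3 MIN + \beta_4 PTS + \beta_5 FGA + \beta_6 TOV + \beta_7 FTM + \beta_8 FGM + \beta_9 W + \varepsilon\)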
Estimate Coefficients and show coefficients table
Term | Estimate | Standard Error | t value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
(Intercept) | 58.482 | 17.523 | 3.337 | 0.0009 | *** |
AGE | -2.802 | 1.288 | -2.176 | 0.0303 | * |
I(AGE^2) | 0.048 | 0.023 | 2.058 | 0.0404 | * |
MIN | 0.017 | 0.001 | 17.076 | 0.0000 | *** |
PTS | -0.543 | 0.357 | -1.519 | 0.1299 | |
FGA | -0.500 | 0.124 | -4.019 | 0.0001 | *** |
TOV | -0.031 | 0.481 | -0.065 | 0.9483 | |
FTM | 0.969 | 0.445 | 2.178 | 0.0301 | * |
FGM | 1.182 | 0.817 | 1.446 | 0.1492 | |
W | 0.735 | 0.049 | 14.864 | 0.0000 | *** |
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05
Residual standard error: 8.671 on 309 degrees of freedom
Multiple R-squared: 0.8899, Adjusted R-squared: 0.8867
F-statistic: 277.4 on 9 and 309 DF, p-value: 0.0000
Because the dependent variable is games played (which ranges from 1 to 82 in the training data), fitting the model to a logarithmic transformation, log(GP), may not be the most appropriate approach. I therefore revised the modeling strategy to use a linear specification on untransformed GP. The predictors were age, age squared, minutes played, points, field goals attempted, turnovers, free throws made, field goals made, and wins.
The model predicts GP well from the selected predictors, with particularly strong contributions from MIN and W. The quadratic age term is significant, so the relationship between age and games played is non-linear. Because the AGE coefficient is negative (-2.802) and the AGE² coefficient is positive (0.048), the fitted curve is U-shaped: holding the other predictors fixed, predicted games played fall with age until about age 29 and rise thereafter. Approximately 88.99% of the variability in GP in the training data is explained by the model’s predictors (adjusted R² = 0.8867), indicating a strong fit.
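The shape implied by the AGE and AGE² estimates in the coefficient table can be checked directly. The sketch below is in Python and uses the rounded coefficients as printed, so the turning point is approximate; the function name `age_effect` is illustrative. Both age terms are only marginally significant (p ≈ 0.03 and 0.04), so the curve itself is estimated with some uncertainty.

```python
# Rounded estimates copied from the coefficient table.
b_age = -2.802
b_age_sq = 0.048


def age_effect(age):
    """Contribution of the two age terms to predicted GP, other predictors fixed."""
    return b_age * age + b_age_sq * age ** 2


# Vertex of the parabola b_age * a + b_age_sq * a^2.
turning_age = -b_age / (2 * b_age_sq)

# A positive squared-term coefficient means the vertex is a minimum.
shape = "minimum" if b_age_sq > 0 else "maximum"
print(round(turning_age, 1), shape)

# Age contribution at the youngest, turning-point, and oldest ages in the data.
for age in (19, 29, 39):
    print(age, round(age_effect(age), 2))
```

Across the observed age range (19 to 39) the age contribution varies by only about five games, which is small next to the effects of MIN and W.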
Coefficient Magnitude Plot
The coefficient plot highlighted wins, field goals made, and free throws made as the predictors with the largest positive coefficients, and age as the predictor with the largest negative coefficient. Of these, FGM is not statistically significant in the coefficients table (p = 0.1492).
Check for predictor independence
Using Variance Inflation Factors (VIF)
Predictor | VIF |
|---|---|
AGE | 128.044508 |
I(AGE^2) | 127.380905 |
MIN | 2.927133 |
PTS | 22.249295 |
FGA | 1.590280 |
TOV | 1.377569 |
FTM | 2.521044 |
FGM | 17.430393 |
W | 2.673223 |
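A VIF is defined as 1 / (1 - R²), where R² comes from regressing one predictor on all the others, so each value can be converted back into the share of that predictor's variance explained by the rest. The sketch below does this in Python for the reported values. The cutoff of 10 is a common rule of thumb and an assumption here, not a threshold stated in the report.

```python
# VIF values as reported above.
vif = {
    "AGE": 128.044508,
    "I(AGE^2)": 127.380905,
    "MIN": 2.927133,
    "PTS": 22.249295,
    "FGA": 1.590280,
    "TOV": 1.377569,
    "FTM": 2.521044,
    "FGM": 17.430393,
    "W": 2.673223,
}

VIF_CUTOFF = 10  # assumed rule-of-thumb threshold


def implied_r2(vif_value):
    """Invert VIF = 1 / (1 - R^2) to recover R^2."""
    return 1 - 1 / vif_value


flagged = [name for name, value in vif.items() if value > VIF_CUTOFF]

for name, value in vif.items():
    print(f"{name:9s} VIF={value:8.2f}  implied R^2={implied_r2(value):.3f}")
print("Above cutoff:", flagged)
```

The high values for AGE and I(AGE^2) are expected, because a variable is always strongly correlated with its own square. The high values for PTS and FGM are consistent with their pairwise correlation of 0.956 in the correlation matrix.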
Residual Analysis
Residual Range
0% 25% 50% 75% 100%
-18.740 -6.475 -0.590 5.725 23.450
Residual Plots
We are looking for:
- Random distribution of residuals vs. fitted values
- Normally distributed residuals: a Normal Q-Q plot with values along the line
- Homoskedasticity: a horizontal Scale-Location line and no residual pattern
- Minimal influential observations, that is, few points outside the borders of Cook’s distance
Plot Fitted Value by Actual Value
The plot of residuals against fitted values showed that the residuals were not randomly scattered. They are not evenly dispersed around the zero line, which points to patterns or systematic deviations in the model’s predictions across the range of fitted values.
Plot Residuals by Fitted Values
Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
PLAYER | GP | fit | lwr | upr |
|---|---|---|---|---|
Alan Williams | 5 | 10.2 | -7.6 | 27.9 |
Alec Burks | 64 | 48.7 | 31.5 | 65.9 |
Alex Poythress | 21 | 23.6 | 6.4 | 40.7 |
Alize Johnson | 14 | 23.1 | 5.7 | 40.5 |
Allen Crabbe | 43 | 42.7 | 25.5 | 59.9 |
Allonzo Trier | 64 | 48.2 | 30.9 | 65.4 |
Amile Jefferson | 12 | 23.4 | 5.9 | 40.8 |
Amir Johnson | 51 | 44.7 | 27.5 | 62.0 |
Andrew Harrison | 17 | 21.2 | 3.9 | 38.5 |
Andrew Wiggins | 73 | 76.0 | 58.7 | 93.3 |
n: 10
Plot Actual vs Fitted (test)
There is a positive relationship between the actual and fitted GP values in the test set.
Performance Metrics
Metric | Value |
|---|---|
MAE | 7.6006 |
RMSE | 9.3526 |
MAPE | 0.7820 |
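The sketch below shows how these three metrics are typically computed, written in Python and assuming MAPE is reported as a fraction rather than a percentage (the report does not say). It uses only the ten scored rows displayed earlier, so its results will not match the full test-set values in the table.

```python
import math

# Actual GP and fitted values for the ten test rows shown in the scoring table.
actual = [5, 64, 21, 14, 43, 64, 12, 51, 17, 73]
fitted = [10.2, 48.7, 23.6, 23.1, 42.7, 48.2, 23.4, 44.7, 21.2, 76.0]

errors = [a - f for a, f in zip(actual, fitted)]
n = len(errors)

# Mean absolute error: average size of a miss, in games.
mae = sum(abs(e) for e in errors) / n

# Root mean squared error: like MAE, but penalizes large misses more.
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# Mean absolute percentage error, as a fraction of the actual value.
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n

print(f"MAE={mae:.4f} RMSE={rmse:.4f} MAPE={mape:.4f}")
```

Because MAPE divides by the actual value, players with very few games dominate it: the first row alone (5 games played, 10.2 fitted) contributes an error of 104%.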
Model Fit by Age
ggplot2 explore (Intro Section)
This is just a replication of the first section of Chapter 6 (for reference).
Call:
lm(formula = GP ~ W, data = train)
Residuals:
Min 1Q Median 3Q Max
-18.755 -10.085 -1.967 8.881 40.365
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.57449 1.30223 10.42 <0.0000000000000002 ***
W 1.40401 0.04326 32.46 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.41 on 317 degrees of freedom
Multiple R-squared: 0.7687, Adjusted R-squared: 0.768
F-statistic: 1053 on 1 and 317 DF, p-value: < 0.00000000000000022
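As a small worked example of reading this output, the Python sketch below plugs win totals into the fitted line GP = 13.57449 + 1.40401 × W. It also checks that, with a single predictor, the F-statistic equals the squared t value of the slope. The function name `predict_gp` is illustrative.

```python
# Estimates copied from the simple regression output above.
intercept = 13.57449
slope_w = 1.40401
t_value_w = 32.46
f_statistic = 1053


def predict_gp(wins):
    """Predicted games played from wins under the simple model GP ~ W."""
    return intercept + slope_w * wins


# 0 and 60 are the minimum and maximum of W in the descriptive statistics.
for wins in (0, 30, 60):
    print(wins, round(predict_gp(wins), 1))

# With one predictor, F equals t squared (up to rounding in the printed values).
print(round(t_value_w ** 2, 1), f_statistic)
```

At the maximum observed win total of 60, the line predicts about 97.8 games, more than the observed GP maximum of 82. A straight line in W therefore overshoots at the top of the range.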
SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
The final model shows a strong, statistically significant relationship between games played and its strongest predictors, wins and minutes played, and it explains about 89% of the variance in GP on the training data. Even so, it is only a moderately satisfactory model rather than a truly effective one. On the test set its errors are sizable (MAE 7.60, RMSE 9.35, MAPE 0.78), and the residual analysis shows patterns that a well-specified model should not leave behind. The model therefore provides a foundational understanding but does not capture the full complexity of the data. Further refinement, including additional variables, a reassessment of the model assumptions, and alternative modeling techniques, could improve its accuracy and reliability as a predictive tool.