Can We Predict an NBA Player’s Salary Using His Statistics from the Prior Year?
Discussion Prompt:
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
In the previous week’s discussion, we used a multiple regression model to try to predict NBA players’ salaries without making any transformations to the data. That analysis and discussion can be reviewed here. As a refresher, our overall goal is to build a regression model that predicts a player’s salary for the coming year. For this analysis we will again use multiple predictors, and this time we’ll also incorporate at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term.
We’ll use the same two datasets as before: one contains player statistics, and the other contains player salary information for the 2017-18 season. We’ll join these two datasets together to perform our analysis. The player statistics dataset can be downloaded here and the NBA salary dataset can be downloaded here.
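The join step itself looks roughly like the sketch below. The frame names nba_stats and nba_salaries are hypothetical stand-ins for the two downloaded tables, and I’m assuming they share a Player key:
library(dplyr)
# Hypothetical sketch: combine player statistics with salary data by player name.
nba_data <- nba_stats %>%
  inner_join(nba_salaries, by = "Player")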
In an effort to be concise, I won’t repeat the data cleaning that went into getting this dataset ready for analysis. If you’d like to review those steps or learn more about which variables are included, please see the previous discussion. We’ll begin this discussion right where we left off.
In Part I of this analysis, you’ll remember we built a model using a single factor. That model had a residual standard error of $6.283M and an adjusted R-squared value of 0.3718. In Part II of our analysis, we extended this to a multiple regression model and used backward elimination to find the optimal set of variables. We’ll re-run that model here to serve as a baseline.
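As a refresher, backward elimination starts from a full model and repeatedly removes the weakest predictor. One way to automate it in R is the built-in step() function; a minimal sketch, assuming the cleaned nba_data_for_model frame defined just below (this won’t necessarily reproduce the exact variable set chosen manually in Part II):
# Sketch: fit a model with all remaining predictors, then let step()
# drop terms backward by AIC until no removal improves the fit.
full_model <- lm(salary_2017_in_millions ~ ., data = nba_data_for_model)
backward_model <- step(full_model, direction = "backward", trace = FALSE)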
nba_data_for_model <- nba_data %>%
  dplyr::select(-`Year`, -`Player`, -`PTS`)
final_model <- lm(salary_2017_in_millions ~ Pos + Tm + G + MP + FG + FGA + `FG%` + `2PA` + FTA + AST, data = nba_data_for_model)
summary(final_model)
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Tm + G + MP + FG +
## FGA + `FG%` + `2PA` + FTA + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8021 -2.9458 -0.3667 2.9623 15.7687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.659795 4.093191 4.559 7.37e-06 ***
## PosPF -0.883689 0.925798 -0.955 0.340554
## PosPG -4.181362 1.217512 -3.434 0.000673 ***
## PosSF -2.533525 1.035919 -2.446 0.015005 *
## PosSG -3.220868 1.085514 -2.967 0.003236 **
## TmBOS -6.404589 2.174593 -2.945 0.003468 **
## TmBRK -6.402664 2.262544 -2.830 0.004956 **
## TmCHI -3.804716 2.069279 -1.839 0.066905 .
## TmCHO -3.850031 2.133619 -1.804 0.072114 .
## TmCLE 0.057248 2.225874 0.026 0.979498
## TmDAL -6.424537 2.071344 -3.102 0.002099 **
## TmDEN -8.148359 2.148543 -3.793 0.000179 ***
## TmDET -5.616280 2.267940 -2.476 0.013797 *
## TmGSW -5.030329 2.090181 -2.407 0.016675 *
## TmHOU -7.425419 2.198156 -3.378 0.000822 ***
## TmIND -4.432995 2.245703 -1.974 0.049256 *
## TmLAC -0.505987 2.237237 -0.226 0.821219
## TmLAL -7.370429 2.440239 -3.020 0.002731 **
## TmMEM -3.170121 2.242578 -1.414 0.158465
## TmMIA -3.572091 2.156996 -1.656 0.098708 .
## TmMIL -3.985195 2.144569 -1.858 0.064063 .
## TmMIN -9.062082 2.254292 -4.020 7.29e-05 ***
## TmNOP -2.501026 2.255038 -1.109 0.268240
## TmNYK -5.799560 2.194682 -2.643 0.008640 **
## TmOKC -0.906151 2.172580 -0.417 0.676900
## TmORL -5.363804 2.126292 -2.523 0.012141 *
## TmPHI -9.211694 2.311452 -3.985 8.38e-05 ***
## TmPHO -5.362271 2.144749 -2.500 0.012921 *
## TmPOR -2.686588 2.119968 -1.267 0.205992
## TmSAC -8.264730 2.088801 -3.957 9.39e-05 ***
## TmSAS -4.000362 2.127648 -1.880 0.061006 .
## TmTOR -1.346284 2.110044 -0.638 0.523914
## TmTOT -6.427731 1.721244 -3.734 0.000223 ***
## TmUTA -3.606637 2.089225 -1.726 0.085273 .
## TmWAS -3.919862 2.173399 -1.804 0.072255 .
## G -0.156140 0.028049 -5.567 5.56e-08 ***
## MP 0.004016 0.001137 3.532 0.000474 ***
## FG 0.089992 0.017466 5.152 4.55e-07 ***
## FGA -0.024965 0.007625 -3.274 0.001178 **
## `FG%` -11.915701 7.496500 -1.590 0.112950
## `2PA` -0.015743 0.003517 -4.477 1.06e-05 ***
## FTA 0.015450 0.003770 4.098 5.30e-05 ***
## AST 0.009082 0.003570 2.544 0.011433 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.825 on 315 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.62
## F-statistic: 14.87 on 42 and 315 DF, p-value: < 2.2e-16
The current model is at least directionally useful, explaining ~62% of the variation in the dependent variable. However, let’s see if we can identify some transformations that would help this model.
First, let’s visualize the relationships between the remaining variables in the current model and our dependent variable:
library(psych)
psych::pairs.panels(nba_data_new_model %>% select(-Age, -Pos, -Tm),
                    method = "pearson", # correlation method
                    hist.col = "#00AFBB",
                    density = TRUE, # show density plots
                    ellipses = TRUE
)
Looking at the relationships between our independent variables and our dependent variable, the majority appear fairly linear; nothing stands out as an obvious candidate for a quadratic transformation. The residuals from our previous model (see the prior discussion) don’t point to one either; if anything, the residual plot suggests that a natural log transformation of the dependent variable (salary) might be helpful. Another approach is powerTransform from the car library, which estimates what transformation, if any, a column would benefit from. Working through each of the columns with this function, games played (G) comes the closest to benefiting from a quadratic transformation:
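The estimate shown below comes from a call along these lines (powerTransform() fits a Box-Cox-style power parameter to a single column):
library(car)
# Estimate the power that best normalizes games played (G).
car::powerTransform(nba_data_new_model$G)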
## Estimated transformation parameter
## nba_data_new_model$G
## 2.387038
Here we see the estimated transformation parameter is 2.387038, very close to 2. We’ll round down and perform a quadratic transformation on this column by raising it to the second power. Additionally, we’ll use the ‘golden_years’ column as our dichotomous variable. This is a true/false (1/0) flag indicating whether the player is in his golden years, defined here as being between 27 and 31 years old. For our dichotomous vs. quantitative interaction term, we’ll use ‘golden_years’ and our quadratically transformed G (games played) column. Let’s re-run the model with these adjustments:
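The golden_years flag already exists in the cleaned data, but for clarity, a hypothetical sketch of how such a flag could be constructed from the Age column (inclusive bounds assumed):
# Hypothetical reconstruction: flag players in their 'golden years' (ages 27-31).
nba_data <- nba_data %>%
  dplyr::mutate(golden_years = as.integer(Age >= 27 & Age <= 31))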
nba_data_for_model <- nba_data %>%
  dplyr::select(-`Year`, -`Player`, -`PTS`)
final_model <- lm(salary_2017_in_millions ~ Pos + Tm + I(G^2) + MP + golden_years + golden_years:I(G^2) + FG + FGA + `FG%` + `2PA` + FTA + AST, data = nba_data_for_model)
summary(final_model)
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Tm + I(G^2) + MP +
## golden_years + golden_years:I(G^2) + FG + FGA + `FG%` + `2PA` +
## FTA + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0278 -3.0244 -0.2898 2.8071 14.0732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.454e+01 3.940e+00 3.691 0.000263 ***
## PosPF -6.777e-01 9.172e-01 -0.739 0.460528
## PosPG -3.697e+00 1.216e+00 -3.039 0.002570 **
## PosSF -1.976e+00 1.033e+00 -1.913 0.056686 .
## PosSG -2.856e+00 1.090e+00 -2.620 0.009211 **
## TmBOS -6.355e+00 2.150e+00 -2.956 0.003355 **
## TmBRK -6.515e+00 2.235e+00 -2.915 0.003812 **
## TmCHI -4.196e+00 2.056e+00 -2.041 0.042102 *
## TmCHO -3.923e+00 2.108e+00 -1.861 0.063629 .
## TmCLE -1.562e-01 2.200e+00 -0.071 0.943451
## TmDAL -7.238e+00 2.069e+00 -3.498 0.000537 ***
## TmDEN -8.224e+00 2.132e+00 -3.857 0.000139 ***
## TmDET -5.288e+00 2.240e+00 -2.361 0.018860 *
## TmGSW -5.153e+00 2.066e+00 -2.494 0.013139 *
## TmHOU -7.561e+00 2.172e+00 -3.481 0.000571 ***
## TmIND -4.967e+00 2.223e+00 -2.234 0.026198 *
## TmLAC -8.781e-01 2.211e+00 -0.397 0.691569
## TmLAL -7.581e+00 2.421e+00 -3.132 0.001902 **
## TmMEM -3.514e+00 2.248e+00 -1.563 0.118973
## TmMIA -4.258e+00 2.148e+00 -1.982 0.048335 *
## TmMIL -4.256e+00 2.140e+00 -1.989 0.047608 *
## TmMIN -8.718e+00 2.262e+00 -3.854 0.000141 ***
## TmNOP -3.306e+00 2.241e+00 -1.475 0.141166
## TmNYK -6.009e+00 2.175e+00 -2.763 0.006057 **
## TmOKC -9.360e-01 2.154e+00 -0.435 0.664189
## TmORL -5.450e+00 2.119e+00 -2.572 0.010571 *
## TmPHI -9.075e+00 2.305e+00 -3.937 0.000102 ***
## TmPHO -5.219e+00 2.123e+00 -2.459 0.014491 *
## TmPOR -2.200e+00 2.117e+00 -1.039 0.299489
## TmSAC -8.707e+00 2.068e+00 -4.209 3.35e-05 ***
## TmSAS -4.243e+00 2.105e+00 -2.016 0.044683 *
## TmTOR -1.384e+00 2.094e+00 -0.661 0.509207
## TmTOT -6.799e+00 1.707e+00 -3.984 8.44e-05 ***
## TmUTA -3.469e+00 2.070e+00 -1.676 0.094788 .
## TmWAS -4.060e+00 2.147e+00 -1.891 0.059596 .
## I(G^2) -1.614e-03 2.901e-04 -5.564 5.67e-08 ***
## MP 4.533e-03 1.162e-03 3.900 0.000118 ***
## golden_years 1.856e+00 1.586e+00 1.171 0.242594
## FG 8.969e-02 1.741e-02 5.151 4.60e-07 ***
## FGA -2.585e-02 7.626e-03 -3.389 0.000790 ***
## `FG%` -1.167e+01 7.446e+00 -1.568 0.117928
## `2PA` -1.460e-02 3.510e-03 -4.160 4.11e-05 ***
## FTA 1.534e-02 3.731e-03 4.111 5.04e-05 ***
## AST 8.240e-03 3.539e-03 2.328 0.020534 *
## I(G^2):golden_years -1.311e-04 3.245e-04 -0.404 0.686351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.764 on 313 degrees of freedom
## Multiple R-squared: 0.6752, Adjusted R-squared: 0.6295
## F-statistic: 14.79 on 44 and 313 DF, p-value: < 2.2e-16
Here, we see a slight increase in our adjusted R-squared value (0.6295 vs. 0.62) as well as a decrease in our residual standard error (4.764 vs. 4.825), so the transformations did help, but only minimally. The first four coefficients after the intercept come from the Pos (position) column, a categorical variable; R dummy-codes each level against a reference level (center, C, in this case), which is why each position gets its own row in the output. We can see that, while the power forward (PF) level is not individually significant, the column as a whole is significant to the model. The same is true for the Tm (team) column, which is also categorical. In fact, all the variables in the model are significant except the interaction term we added (golden_years vs. games played), the golden_years variable itself, and field goal percentage (FG%).
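Those whole-column significance claims can be checked with a Type II ANOVA, which tests each term as a block rather than one dummy-coded level at a time; a minimal sketch using the car package:
# Type II ANOVA: tests Pos and Tm as whole factors instead of level by level.
car::Anova(final_model, type = "II")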
Now, let’s look at the residuals from our model:
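The four diagnostic plots discussed below can be reproduced with R’s built-in plot method for lm objects; note that the panel arrangement may differ from the positions described here:
# Draw the standard diagnostic plots for the fitted model in a 2x2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
par(mfrow = c(2, 2))
plot(final_model)
par(mfrow = c(1, 1)) # reset the plotting layout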
Let’s start with the Residuals vs Fitted plot in the top-left. This plot shows whether our residuals have any non-linear patterns. Looking at the plot, the red line is not perfectly straight; it almost seems to follow a very gentle sigmoid curve. As the curvature is fairly faint, I think it’s safe to say that it’s appropriate to treat the relationships between our predictor variables and our outcome variable as linear.
Moving to the Normal Q-Q plot in the bottom-left, this plot shows whether our residuals are normally distributed. For the most part, our residuals follow the diagonal line; however, we do see some deviation in the top-right of the chart, though it is not severe.
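If we want a numeric check to go with the Q-Q plot, a Shapiro-Wilk test on the residuals is one option:
# Shapiro-Wilk test of residual normality (null hypothesis: normally distributed).
shapiro.test(residuals(final_model))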
Turning our attention to the Scale-Location plot in the top-right, this plot helps us check the assumption of equal variance (homoscedasticity). If the residuals had equal variance, we would see a horizontal line with equally spread points. In our plot, however, the points with fitted values between roughly 0 and 10 show less spread than the rest, which is a sign of heteroscedasticity. The difference in variation isn’t dramatic, but it is there. This plot is probably right on the edge of what we’d deem acceptable, but as the heteroscedasticity is not overwhelming, I’ll consider this assumption met.
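As with normality, this eyeball judgment can be supplemented with a formal test; a sketch using the lmtest package’s Breusch-Pagan test:
# Breusch-Pagan test for heteroscedasticity (null hypothesis: equal variance).
library(lmtest)
bptest(final_model)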
Lastly, let’s look at the Residuals vs Leverage plot in the bottom-right. This plot helps us determine whether influential outliers are pulling the regression line in one direction or another. Here we aren’t looking for patterns; we’re looking for outlying values in the upper-right or lower-right corners, where cases can be heavily influential on the least-squares line. Specifically, we’re looking for points that fall outside the red dashed lines marking Cook’s distance; such points would be influential to the regression results. In our plot, we can’t even see the Cook’s distance lines because all the cases sit well inside them, so we can have some comfort that our outliers, if any, are not having a large effect on the model.
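If we want to double-check that numerically, the distances can be pulled out directly; a minimal sketch (a common rule of thumb flags values above 4/n):
# Rank observations by Cook's distance; with n = 358 here, values well
# below 4/n (about 0.011) are unremarkable.
cooks <- cooks.distance(final_model)
head(sort(cooks, decreasing = TRUE), 5)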
All in all, we’ve built a serviceable multiple regression model whose predictions could certainly be used directionally. Based on the residual analysis above, we could continue to cautiously use this linear model if we wanted. While we might squeeze a little more accuracy out of the model with further transformations, I’d expect the gains to be minimal. The most likely next steps for improving accuracy would be gathering additional data, such as 3-point field goals made and 3-point percentage, and trying a different modeling method, such as XGBoost or random forest regression.
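As a purely hypothetical illustration of that last idea, a random forest regression on the same cleaned frame might look like the sketch below (the randomForest package requires character columns to be converted to factors; everything beyond nba_data_for_model is illustrative):
# Hypothetical sketch: random forest as a non-linear alternative to lm.
library(randomForest)
library(dplyr)
rf_data <- nba_data_for_model %>%
  mutate(across(where(is.character), as.factor)) # randomForest needs factors
rf_model <- randomForest(
  x = dplyr::select(rf_data, -salary_2017_in_millions),
  y = rf_data$salary_2017_in_millions,
  ntree = 500
)
rf_model # prints MSE and % variance explained for comparison with the lm fit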