We wish to use numerous predictors to find a model that accurately predicts a player's salary. Model selection includes searching for adequate linear models and comparing our best linear model against some common machine learning models: decision trees, random forests, and generalized linear model boosting.
Using NBA statistics scraped from ESPN's website, we start with over 300 player observations. Each player has various statistics, such as season total rebounds, assists, and points per game, for a total of 29 predictors plus salary as the response.
We will clean the data, select appropriate predictors for our final models using a model selection method learned in class, and then compare root mean squared error (RMSE) and mean absolute error (MAE) values to determine which model best predicts salary.
As we explore the data, you'll notice we run models for each level of the Position factor. Position is categorical but not dichotomous, which led us to create separate linear models for each level of position. We omitted Team, since it has an abundance of levels and would only make the analysis convoluted. We initially considered encoding these categorical variables as dummy variables, but after testing that route, we chose to stick with our decision to split our models by position.
Our predictions were performed on NBA power forwards. The training data come from the 2017-2018 season (last season), and the test data from the season before (2016-2017). We chose this split to ensure enough observations for model adequacy.
To begin the process of model selection and analysis, we import the data and run some brief exploration.
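The import itself is not shown in this output; a minimal sketch of that step, where the file name is an assumption (only the test-file path appears later in the report) and the car package is loaded for the boxCox() call used in the diagnostics below:
library(car)  # provides boxCox(), used during model diagnostics later on
nba <- read.csv("train.csv")  # file name assumed; adjust the path as needed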
## Data exploration:
colnames(nba)
## [1] "Name" "Position"
## [3] "Salary" "Allstar"
## [5] "Age" "Height"
## [7] "Weight" "Team"
## [9] "Games.Played" "Wins"
## [11] "Losses" "Minutes_Played"
## [13] "Points_Per_Game" "Field_Goals_Made"
## [15] "Field_Goals_Attempted" "Field_Goal_Percent"
## [17] "Three_Point_Made" "Three_Point_Attempted"
## [19] "Three_Pointed_Percent" "Free_Throw_Made"
## [21] "Free_Throw_Attempted" "Free_Throw_Percent"
## [23] "Offensive_Rebound" "Defensive_Rebound"
## [25] "Rebound" "Assist"
## [27] "Turnover" "Steal"
## [29] "Block" "Fouls"
dim(nba)
## [1] 377 30
Our initial data set has 29 predictors, one response (Salary), and 377 observations to work with.
The next step is to subset our data so that identifier variables and highly correlated variables are omitted. The Team and Name variables have a vast number of levels and act as identifiers, so we omit them from the analysis. Additionally, when we check the pairs plots, we notice that some variables have extremely high correlations. Further investigation shows that these pairs essentially carry the same information, such as Field_Goals_Made and Field_Goals_Attempted, so we omit such pairs and keep the corresponding percentage predictor (here, Field_Goal_Percent) as a replacement.
## Need to drop "Team" and "Name" as these are categorical
## identifiers with too many factors.
nba <- subset(nba, select=-c(1,8))
## Create training set, will subset more later
training <- nba
## Check for any correlations with pairs
pairs(training[1:10])
pairs(training[11:20])
pairs(training[19:length(colnames(training))])
With our cleaning suggestions above, we subset our training set and look at our resulting predictors.
#create winning percentage:
Win_Percent <- training$Wins/training$Games.Played
training <- cbind(training,Win_Percent)
## Clean data with suggestions above: drop Wins, Losses, the Made/Attempted
## count pairs, and the total Rebound column (keep the percentage versions)
training <- subset(training, select=-c(8,9,12,13,15,16,18,19,23))
colnames(training)
## [1] "Position" "Salary"
## [3] "Allstar" "Age"
## [5] "Height" "Weight"
## [7] "Games.Played" "Minutes_Played"
## [9] "Points_Per_Game" "Field_Goal_Percent"
## [11] "Three_Pointed_Percent" "Free_Throw_Percent"
## [13] "Offensive_Rebound" "Defensive_Rebound"
## [15] "Assist" "Turnover"
## [17] "Steal" "Block"
## [19] "Fouls" "Win_Percent"
Notice that we kept three of the percentage variables in place of the Attempted/Made pairs, which significantly reduced the dimension of our data. We also check the correlation matrix to make sure there are no remaining collinearity issues. The correlation matrix excludes Salary (the response) and Position, since the latter is categorical. Additionally, Wins, Losses, and Games.Played are collinear, so we converted Wins and Losses into a winning percentage (Win_Percent) and kept Games.Played in the final data set, since games played can indicate injury or other factors beyond wins and losses.
#obtain correlation matrix & filter unique values over .5
cor_matrix <- cor(training[,-c(1,2)])
unique(cor_matrix[which(cor_matrix > .5 & cor_matrix < 1)])
## [1] 0.8412125 0.5513916 0.5254138 0.5734968 0.6465623 0.5101416 0.5232971
## [8] 0.8671634 0.6832604 0.6408716 0.7493260 0.7493126 0.6717767 0.6028492
## [15] 0.6451691 0.8096894 0.6343131 0.5212930 0.5416109 0.5044241 0.7212068
## [22] 0.6460036 0.5413094 0.5538429 0.6130032 0.6742349 0.8149281 0.6661381
## [29] 0.6445339 0.5573802 0.5238833 0.5093900
Only a few of the remaining variables have high pairwise correlations, which is understandable; however, we will leave them in our training data.
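The filtered values above are printed without the variable names they belong to; a short sketch (using the same cor_matrix) to list the correlated pairs by name:
## List variable pairs with correlation above .5 (upper triangle only)
high <- which(cor_matrix > .5 & cor_matrix < 1, arr.ind = TRUE)
high <- high[high[, 1] < high[, 2], , drop = FALSE]  # keep each pair once
data.frame(var1 = rownames(cor_matrix)[high[, 1]],
           var2 = colnames(cor_matrix)[high[, 2]],
           cor  = cor_matrix[high])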
Next, we needed to handle our categorical predictor, Position. We decided to break our data into its levels and run separate models for each position. Below, you'll notice the Forward (F) and Guard (G) levels of Position have very few observations, so we reclassified these players into the SF and SG levels after looking over their statistics.
## Data is now clean. Need to subset into Positions and check obs
table(training$Position)
##
## C F G PF PG SF SG
## 67 3 1 75 85 65 81
#need to add forward to Small Forward, and Guard to Small Guard
training$Position[training$Position == 'F'] = 'SF'
training$Position[training$Position == 'G'] = 'SG'
training$Position <- factor(training$Position)
table(training$Position) #fixed
##
## C PF PG SF SG
## 67 75 85 68 82
We subset the training data further to create five training sets, one for each position.
## Need to create data frames for each position
center<-pwforward<-ptguard<-smforward<-smguard <- data.frame(NULL)
for (i in 1:length(training$Position)) {
if(training[i,]$Position == 'C') {center <- rbind(center, subset(training[i,], select = -c(1)))}
else if(training[i,]$Position == 'PF') {pwforward <- rbind(pwforward,subset(training[i,], select = -c(1)))}
else if(training[i,]$Position == 'PG') {ptguard <- rbind(ptguard,subset(training[i,], select = -c(1)))}
else if(training[i,]$Position == 'SF') {smforward <- rbind(smforward,subset(training[i,], select = -c(1)))}
else if(training[i,]$Position == 'SG') {smguard <- rbind(smguard,subset(training[i,], select = -c(1)))}
}
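An equivalent, more concise way to perform this split, sketched with base R's split() (the results should match the loop above apart from row names):
## Split the cleaned training data into one data frame per position
pos_sets  <- split(subset(training, select = -Position), training$Position)
center    <- pos_sets$C
pwforward <- pos_sets$PF
ptguard   <- pos_sets$PG
smforward <- pos_sets$SF
smguard   <- pos_sets$SG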
Now that we have our training data, we are able to build linear models. We need a model that is adequate and has a moderately high adjusted \(R^2\) value. This will be difficult since our response is continuous and spans a very wide range, but with the right model selection method and transformations we can hopefully land on a valid model for our predictions.
We begin with backward stepwise selection via step(), using an AIC-type criterion with penalty k = 4, which roughly corresponds to testing each term at \(\alpha = .05\). We applied this selection technique to the training set for each position. To keep things concise, we include the selection details only for the position that yielded the best results, Power Forward; we ultimately used this model to run our predictions.
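As a quick sanity check on that penalty choice: with penalty k, step() keeps a term only if dropping it increases the deviance by more than k, and the chi-squared critical value on one degree of freedom at the .05 level is about 3.84, which rounds up to k = 4.
## Penalty corresponding to a .05-level likelihood ratio test on 1 df
qchisq(0.95, df = 1)  # ~3.84, rounded up to k = 4 in step() below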
## Power Forward:
modelpf <- step(lm(Salary~., data=pwforward),
direction = "backward",
k = 4, trace = F) #k=4 for 95%
summary(modelpf) #Model Summary
##
## Call:
## lm(formula = Salary ~ Allstar + Age + Minutes_Played + Field_Goal_Percent +
## Block, data = pwforward)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8044235 -1684601 -429818 1515285 10703650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6751122 3149646 -2.143 0.035604 *
## Allstar 1652508 271904 6.078 5.96e-08 ***
## Age 372846 91464 4.076 0.000121 ***
## Minutes_Played 257500 50908 5.058 3.34e-06 ***
## Field_Goal_Percent -100010 51171 -1.954 0.054706 .
## Block 2901175 1250858 2.319 0.023343 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3176000 on 69 degrees of freedom
## Multiple R-squared: 0.7482, Adjusted R-squared: 0.73
## F-statistic: 41.01 on 5 and 69 DF, p-value: < 2.2e-16
extractAIC(modelpf)[2] #AIC
## [1] 2251.435
par(mfrow=c(2,2));plot(modelpf);par(mfrow=c(1,1)) #Plot Diagnostics
Our initial model gives an adjusted \(R^2\) of .73, which is fairly high, especially considering we are attempting to predict a continuous response spanning a range of roughly 30 million dollars. However, the residual diagnostics show a few possible outliers and an issue with heteroskedasticity. Heteroskedasticity is a major issue in regression and needs to be addressed, since it violates the constant-variance assumption on our model's residuals. We will apply a transformation, but first we want to address the outliers and check whether any are influential.
std.res <- rstudent(modelpf)
outliers <- std.res[which(abs(std.res) >= 3)]  # flag large residuals in either direction
hats <- hatvalues(modelpf)
leverage <- (length(modelpf$coefficients)+1)/length(pwforward$Salary)
lvg_points <- hats[which(hats >=leverage)]
lvg_points[which(names(lvg_points) %in% names(outliers))]
## named numeric(0)
## Note, no influential points! Yay!
After analyzing the outliers and leverage points, we find that none are influential, so we leave the data alone for now.
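As a cross-check, Cook's distance combines residual size and leverage into a single influence measure; a quick sketch using the common 4/n rule of thumb:
## Cook's distance for each power forward; flag values above 4/n
cooks <- cooks.distance(modelpf)
which(cooks > 4 / nrow(pwforward))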
To address the heteroskedasticity, we turned to a Box-Cox analysis for an appropriate power transformation. The estimated lambda is approximately .45. Applying this transformation, we see an improvement in our diagnostics.
## Box Cox Transformation
boxCox(modelpf, lambda = seq(-1, 1, .05)) #lambda~.45
modelpfTrans <- lm(Salary^(.45)~Allstar + Age + Minutes_Played + Field_Goal_Percent +
Block, data=pwforward)
summary(modelpfTrans);extractAIC(modelpfTrans)[2]
##
## Call:
## lm(formula = Salary^(0.45) ~ Allstar + Age + Minutes_Played +
## Field_Goal_Percent + Block, data = pwforward)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.40 -140.70 18.72 181.97 608.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -307.693 288.968 -1.065 0.2907
## Allstar 65.946 24.946 2.644 0.0101 *
## Age 38.689 8.391 4.611 1.79e-05 ***
## Minutes_Played 27.185 4.671 5.820 1.68e-07 ***
## Field_Goal_Percent -9.470 4.695 -2.017 0.0476 *
## Block 236.438 114.761 2.060 0.0431 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 291.4 on 69 degrees of freedom
## Multiple R-squared: 0.6761, Adjusted R-squared: 0.6526
## F-statistic: 28.81 on 5 and 69 DF, p-value: 1.172e-15
## [1] 856.9617
par(mfrow=c(2,2))
plot(modelpfTrans)
par(mfrow=c(1,1))
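For a more formal check on the variance issue, a score test for non-constant error variance can be run on both fits (a sketch, assuming the car package, which also supplies boxCox(), is loaded):
## Score test for non-constant variance, before and after the transformation;
## a larger p-value indicates weaker evidence of heteroskedasticity
ncvTest(modelpf)
ncvTest(modelpfTrans)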
There is still a hint of heteroskedasticity, but to visualize the residuals on a better scale, we can look at the studentized residuals:
## Look at Studentized residuals for more precision
std.res <- rstudent(modelpfTrans)
hist(std.res, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(std.res),max(std.res),length=50)
yfit<-dnorm(xfit)
lines(xfit, yfit,col="red")
The studentized residuals appear approximately standard normal, so we conclude the model assumptions are reasonable and choose the transformed model for the predictions in the next section. Our final model for the Power Forward position is given by
\[ \begin{aligned} (Y_{ij})^{.45} &= \hat{\beta}_{0} + \hat{\beta}_{\text{allstar}} x_{\text{allstar}} + \hat{\beta}_{\text{age}} x_{\text{age}} + \hat{\beta}_{\text{min.played}} x_{\text{min.played}} + \hat{\beta}_{\text{fg.percent}} x_{\text{fg.percent}} + \hat{\beta}_{\text{blocks}} x_{\text{blocks}}\\ \\ (Y_{ij})^{.45} &= -307.693 + 65.946 x_{\text{allstar}} + 38.689x_{\text{age}} + 27.185x_{\text{min.played}} - 9.470 x_{\text{fg.percent}} + 236.438 x_{\text{blocks}} \end{aligned} \]
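Since the model is fit on the \(\text{Salary}^{0.45}\) scale, predictions must be raised to the power \(1/0.45\) to return to dollars. A small sketch with made-up predictor values (the numbers in new_pf are purely illustrative):
## Illustrative prediction for a hypothetical power forward
new_pf <- data.frame(Allstar = 0, Age = 25, Minutes_Played = 28,
                     Field_Goal_Percent = 48, Block = 0.8)
pred_trans  <- predict(modelpfTrans, newdata = new_pf)  # on the Salary^.45 scale
pred_salary <- pred_trans^(1 / 0.45)                    # back-transformed to dollars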
Now we are ready to apply our model to the test data. First, however, we need to apply the same cleaning methodology we used on the training data to the test set. The following simply prepares the test data:
## Import & Clean Test Data
testing <- read.csv("/Users/seankrinik/Documents/CSULB/STAT510/Project/NBA/test.csv")
colnames(testing);dim(testing);#str(testing)
## [1] "X" "Name"
## [3] "Position" "Salary"
## [5] "Allstar" "Age"
## [7] "Height" "Weight"
## [9] "Team" "Games.Played"
## [11] "Wins" "Losses"
## [13] "Minutes_Played" "Points_Per_Game"
## [15] "Field_Goals_Made" "Field_Goals_Attempted"
## [17] "Field_Goal_Percent" "Three_Point_Made"
## [19] "Three_Point_Attempted" "Three_Pointed_Percent"
## [21] "Free_Throw_Made" "Free_Throw_Attempted"
## [23] "Free_Throw_Percent" "Offensive_Rebound"
## [25] "Defensive_Rebound" "Rebound"
## [27] "Assist" "Turnover"
## [29] "Steal" "Block"
## [31] "Fouls"
## [1] 359 31
#create winning percentage:
Win_Percent <- testing$Wins/testing$Games.Played
testing <- cbind(testing,Win_Percent)
## Clean data with suggestions above
testing <- subset(testing, select=-c(1,2,9,11,12,15,16,18,19,21,22,26))
## Data is now clean. Need to subset into Positions and check obs
table(testing$Position)
##
## C F G PF PG SF SG
## 65 2 4 68 70 71 79
#need to add forward to Small Forward, and Guard to Small Guard
testing$Position[testing$Position == 'F'] = 'SF'
testing$Position[testing$Position == 'G'] = 'SG'
testing$Position <- factor(testing$Position)
table(testing$Position) #fixed
##
## C PF PG SF SG
## 65 68 70 73 83
pwf_test <- data.frame()
for (i in 1:length(testing$Position)) {
if(testing[i,]$Position == 'PF'){
pwf_test <- rbind(pwf_test, subset(testing[i,], select = -c(1)))
}
}
Now our test data aligns with our training data. We apply our final (transformed) linear model to the test data, and we also apply the initial, untransformed linear model to note the improvement.
models <- data.frame()
#transformed model predictions
pwf_test$Salary<- (pwf_test$Salary)^(.45)
pwf_pred <- predict(modelpfTrans, pwf_test)
rmse <- sqrt(mean(((pwf_test$Salary)^(1/.45) - (pwf_pred)^(1/.45))^2))  # errors on the original dollar scale
mae <- mean(abs((pwf_test$Salary)^(1/.45) - (pwf_pred)^(1/.45)))
models <- cbind(rbind(rmse,mae))
#initial model predictions
pwf_test$Salary<- (pwf_test$Salary)^(1/.45)  # undo the transformation so Salary is back on the dollar scale
pwf_pred <- predict(modelpf, pwf_test)
rmse <- sqrt(mean((pwf_test$Salary - pwf_pred)^2))
mae <- mean(abs(pwf_test$Salary - pwf_pred))
models <- cbind(models,rbind(rmse,mae))
As stated in the Methodology section, the metrics used to compare model accuracy are root mean squared error (RMSE) and mean absolute error (MAE), two commonly used error metrics in machine learning. Looking at the RMSE and MAE values for our linear models, the transformed model outperforms the initial model.
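For reference, with \(y_i\) the observed salaries and \(\hat{y}_i\) the predicted salaries on the original dollar scale,
\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right| \]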
models
## [,1] [,2]
## rmse 1944491 4271573
## mae 2961069 3092818
Now that we have our linear model predictions, let’s compare our results with some basic regression-based machine learning models. We chose to use decision trees, random forest (untuned), and generalized linear boosting to give our linear models a variety of benchmarks.
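The fits below rely on a few packages loaded elsewhere in the report; a minimal set, given the functions called here, would be:
library(caret)    # train() for the decision tree and random forest
library(rattle)   # fancyRpartPlot() for plotting the fitted tree
library(mboost)   # glmboost() for boosting with linear base learners
## the rpart and randomForest packages must also be installed for
## caret's "rpart2" and "rf" methods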
## Decision Tree
modTree <- train(Salary~., method="rpart2", maxdepth = 10, data=pwforward)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
fancyRpartPlot(modTree$finalModel) #plot tree
pred_tree <- predict(modTree$finalModel, pwf_test)
rmse <- sqrt(mean((pwf_test$Salary - pred_tree)^2))
mae <- mean(abs(pwf_test$Salary - pred_tree))
models <- cbind(models,rbind(rmse,mae))
#Random Forest
modRF <- train(Salary~., method="rf", data=pwforward)
predrf <- predict(modRF$finalModel, pwf_test)
rmse <- sqrt(mean((pwf_test$Salary - predrf)^2))
mae <- mean(abs(pwf_test$Salary - predrf))
models <- cbind(models,rbind(rmse,mae))
#GLMBoost
modGBM <- glmboost(Salary~., data = pwforward)
predgbm <- predict(modGBM, pwf_test)
rmse <- sqrt(mean((pwf_test$Salary - predgbm)^2))
mae <- mean(abs(pwf_test$Salary - predgbm))
models <- cbind(models,rbind(rmse,mae))
colnames(models)<-c("Linear Transformed",
"Linear",
"Decision Tree",
"Random Forest",
"GLM Boosting")
Looking at our end results, we can see our transformed linear model obtains the best error metrics, meaning it predicts salary most accurately.
models
## Linear Transformed Linear Decision Tree Random Forest GLM Boosting
## rmse 1944491 4271573 6103907 4461921 4335331
## mae 2961069 3092818 4285621 3193547 3092389
After running the model selection and predictions, we can draw several inferences about our data.
First, overall, our linear models were weak. The average adjusted \(R^2\) across the position-level initial models was 57.6% explained variance, and 50.5% for the transformed models. Many factors could contribute to these low \(R^2\) values, but we believe the main issues are a lack of predictors and the fact that our response is continuous with a very wide range. Regarding the lack of predictors, many of the variables we used in model selection are skewed since they are discrete counts. There are also plenty of other player features and game statistics that could explain more of the variance; for example, number of foul-outs, in-the-paint shooting percentage, ethnicity, sprinting speed, and so on.
Second, the position of a player is a significant factor. Why? Look at the variability of our models across the five levels: each yields different predictors and different \(R^2\) values, and some require different transformations to achieve model adequacy. This raises the question: what if team is also a significant predictor? Are there other categorical variables lurking out there, leaving variance unexplained? Most likely.
Lastly, choosing a multiple linear regression model made validating the model assumptions difficult. There are certainly transformations and more advanced techniques to address the heteroskedasticity we encountered, but a simple power transformation via Box-Cox produced a decent model despite its modest \(R^2\). The difference in prediction accuracy between the models is largely due to the difference in model adequacy. Our transformation "worked", but this hints that even better models exist.
With more time, we could collect more data or narrow our predictors down to the essential factors, look for better transformations, or conclude that linear regression may not be the best modeling method. Machine learning models are usually strong at classification, but many also offer accurate regression-based prediction algorithms, along with tuning parameters to achieve even better accuracy. Regarding possible predictors, collecting more data on the players, such as socioeconomic status, years in the NBA, annual endorsements, and education level, might help explain more of what currently lands in the error term of our model.