Can We Predict an NBA Player’s Salary from the Points Scored in the Prior Year?
To conduct our analysis, we’ll use two datasets. One data set will include player statistics. The other dataset will include information about the salary of the players from 2017 to 2018. We’ll join these two datasets together to perform our analysis. Our overall goal will be to run a simple linear regression model using the prior year’s points scored during the season to see if we can predict a player’s salary for the coming year. The player statistics dataset can be downloaded here and the NBA Salary dataset can be downloaded here.
nba_stats <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Week 11 - Linear Regression Using R/Discussion/Seasons_Stats.csv')
nba_salary <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Week 11 - Linear Regression Using R/Discussion/NBA_season1718_salary.csv')The nba_stats dataset has data from 1955 through 2017. We’ll limit this data down to 2016 to conduct our analysis (since we’ll be looking to see if we can predict salary in 2017). Additionally, some players have have multiple lines per year since they were traded to different teams throughout the year. We’ll aggregate these rows into a single row.
nba_stats_2016 <- nba_stats %>%
dplyr::select(`Year`, `Pos`, `Player`, `PTS`) %>%
dplyr::filter(`Year` == 2016) %>%
dplyr::group_by(`Player`, `Pos`) %>%
dplyr::summarize(PTS = sum(`PTS`))
head(nba_stats_2016)## # A tibble: 6 x 3
## # Groups: Player [6]
## Player Pos PTS
## <chr> <chr> <dbl>
## 1 Aaron Brooks PG 491
## 2 Aaron Gordon PF 719
## 3 Aaron Harrison SG 18
## 4 Adreian Payne PF 132
## 5 Al-Farouq Aminu SF 839
## 6 Al Horford C 1249
Next we’ll select the columns we’ll need in our nba_salary dataset and aggregate the salary values for the players that transitioned teams mid-year (meanining they have multiple rows of data). Additionally, we’ll divide the salary column by a million to make the numbers easier to view when graphing/plotting.
nba_salary_2017 <- nba_salary %>%
dplyr::select(`Player`, `season17_18`) %>%
dplyr::group_by(`Player`) %>%
dplyr::summarize(salary_2017_in_millions = sum(`season17_18`)/1000000)
head(nba_salary_2017)## # A tibble: 6 x 2
## Player salary_2017_in_millions
## <chr> <dbl>
## 1 A.J. Hammons 1.31
## 2 Aaron Brooks 2.12
## 3 Aaron Gordon 5.50
## 4 Aaron Gray 0.452
## 5 Abdel Nader 1.17
## 6 Al-Farouq Aminu 7.32
Before joining these datasets together, lets take a look at the number of rows in each dataset:
## [1] 535
## [1] 480
We can see the number of rows are different - which we would expect. There could be players that played in 2016 that didn’t play in 2017 as well as players who came into the league in 2017 that were not present in the 2016 dataset. Next, we’ll join these two datasets together so we have both points and salary in the same dataset.
nba_data <- nba_stats_2016 %>%
dplyr::left_join(nba_salary_2017, by = "Player") %>%
dplyr::select(-Pos) %>%
dplyr::arrange(PTS)
nba_data## # A tibble: 480 x 3
## # Groups: Player [476]
## Player PTS salary_2017_in_millions
## <chr> <dbl> <dbl>
## 1 J.J. O'Brien 0 NA
## 2 Nate Robinson 0 NA
## 3 Sam Dekker 0 1.79
## 4 Bruno Caboclo 3 2.45
## 5 Joe Harris 3 1.52
## 6 Chuck Hayes 4 NA
## 7 Rakeem Christmas 4 0.173
## 8 Branden Dawson 5 NA
## 9 Coty Clarke 6 NA
## 10 Duje Dukan 6 NA
## # ... with 470 more rows
Before moving on, let’s quickly look and see how many null values we have in the salary column:
## Player PTS salary_2017_in_millions
## 0 0 133
Since our analysis is specifically looking at predicting the salary of those players with stats in 2016, we’ll go ahead and remove any rows of the dataset where the salary is null. Additionally, there are cases where a player has very few points and is still paid a salary (they were injured, etc.). For our purposes, we’ll consider these as outliers and remove anyone with less than 25 points them from our analysis (there are 82 games in an NBA season, so scoring 25 points should be realistic for most players - even bench sitters).
## # A tibble: 336 x 3
## # Groups: Player [334]
## Player PTS salary_2017_in_millions
## <chr> <dbl> <dbl>
## 1 Alan Williams 29 6
## 2 Pat Connaughton 36 1.47
## 3 Nikola Pekovic 54 11.6
## 4 Spencer Dinwiddie 58 1.52
## 5 Udonis Haslem 59 2.33
## 6 Briante Weber 62 0.183
## 7 Caron Butler 63 0.517
## 8 Jarell Eddie 63 0.183
## 9 Lucas Nogueira 65 2.95
## 10 Jeremy Evans 71 0.05
## # ... with 326 more rows
Now that we’ve got our data straightened out, let’s see if we can identify a relationship between points scored and the salary of a player the next year. We’ll start this analysis by plotting these two values against each other on a scatter plot.
ggplot(nba_data) +
aes(x = PTS, y = salary_2017_in_millions) +
geom_point()+
geom_smooth(method = lm, se = FALSE) +
labs(title = "NBA 2017 Salary vs 2016 Season Point Total") +
ylab("Salary ($M)") +
xlab("Season Point Total") +
theme(
plot.title = element_text(hjust = 0.45)
)Looking at the above plot, it does look like there is a medium to medium-loose relationship between the salary of a player and the points they generated the previous year. Having looked at the data visually, let’s now see if we can build a model using the previous season’s point total to predict a player’s salary in the current year. We’ll use R’s lm function to build a linear model. In the call to lm, we’ll first identify our response variable, then we’ll identify the explanatory variable.
##
## Call:
## lm(formula = salary_2017_in_millions ~ PTS, data = nba_data)
##
## Coefficients:
## (Intercept) PTS
## 1.67756 0.01023
In running the above, we get the following model:
\(y={0.01026}x+{1.64386}\)
In english this means, every player starts with a baseline of around $1.64M (1.64 * 1,000,000) and then for each point they score, they make an additional 10,260 dollars (0.01026 * 1,000,000). So using the above model, a player who scores 1,000 points in a season could be expected to earn 10,260,000 + 1,643,860 = 11,903,860. But can our model be relied upon? Well lets look at some of the model diagnostics:
##
## Call:
## lm(formula = salary_2017_in_millions ~ PTS, data = nba_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.777 -4.253 -1.077 3.986 20.171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6775619 0.6236362 2.69 0.0075 **
## PTS 0.0102325 0.0007249 14.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.283 on 334 degrees of freedom
## Multiple R-squared: 0.3737, Adjusted R-squared: 0.3718
## F-statistic: 199.3 on 1 and 334 DF, p-value: < 2.2e-16
Looking at the first line displaying the residuals, we can see the median is very close to 0 which is promising. Additionally the 1st quartile and 3rd quartile are both very close (abs value) and so also suggest that variation within the residuals is fairly evenly distributed. Looking at the max and min, we see the same thing. We’ll investigate the residuals shortly with some plots to see if we can get a better view of this.
The next section of the output are the coefficients. For a good model, we’d expect the standard error to be at least 5-10 times smlaler than the corresponding coefficient. For points (PTS), the standard error is 0.0007312 and the coefficient is 0.0102648 making the error 14 (0.0102648/0.0007312) times smaller than the coefficient. Having a large ratio gives us some assurance that there is relatively low variability in our slope estimate. The standard error for our slope estimate is 0.6309739 which is a little more than 1/3 the value of our intercept. This suggests that the estimate for the y-intercept can vary quite a bit.
The last column shows the probability that the coefficient is NOT relevant to the model. For our model, point (PTS) p-value is incredibly small, meaning there is an incredibly small chance that points is NOT relevant to determining stopping distance. The p-value for our intercept is 0.00959, which is less than the gererally accepted value of 0.05. This indicates that there also a very small chance that this intercept value is NOT relevant to our model.
Lastly, we’ll look at the the Adjusted R-Squared value of 0.3706. This value tells us that when using a players total points from the previous year (only this value), we are only able to explain about 37% of the variation in the player’s salary. This value indicates to us that while this model could possibly be helpful (directional), we need to add some additional factors to our dataset in order to explain some of the remaining variation and make it more accurate.
Now, based on the Adjusted R-Squared Value alone, we probably wouldn’t use this model in a real-life scenario, however, let’s walk through some of the assumptions we’d need to meet if we were going to. We’ll start by looking at the residuals to determine if they are nearly normal:
hist(model$residuals, main = "Histogram of Residuals from NBA Salary Prediction Model",
xlab = "Residuals")In looking at the histogram of the residuals, we can see that they are not quite nearly normal. The values toward the lower end of the plot are definitely smaller than on the right hand side. Additionally, the center of the chart seems to be significantly higher than the right and left sides for the next nearest bars, which also make this not quite normal.
Next, let’s check to see if the variability in our residuals in nearly constant.
plot(model$residuals ~ model$fitted.values,
main = "Scatter Plot of Residuals and Fitted Values",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, lty = 3)Interestingly, for 2/3 of the plotted values, the variability in the residuals look fairly constant, however, in the first section of the plot we can see that there is definitely a lack of varation at the lower values.
Let’s also look at the quantile-versus-quantile plot (Q-Q plot).
While not as pronounced, you can see that the two tails of this plot show some variation away from “normal”. Looking at these tests, we can say that only using points (PTS) as a predictor for our model would be insufficient to build an accurate model to predict the salary for a player. While the model in its current state might be directional, it wouldn’t be much more than that (we could not rely on this model to accurately predict a player’s salary). To make the model more accurate, we’d need to explore adding some additional factors to the model, for example, shot percentage or position. It makes sense that some players, while very valuable to the team, might not be big scorers (example: Dennis Rodman). We can explore this more with multi-factor regression. Based on our initial analysis, I am hopeful that with a combination of a few more factors, we could create a fairly accurate linear model for a player’s salary.