Professor Smalley
The purpose of this paper is to analyze player statistics from the 2017-18 NBA season and their relationship to players' salaries. Our dataset contains 559 observations and 28 variables, 4 of which are character variables. The central question of this analysis is whether player statistics affect players' salaries. The response variable is Salary. The explanatory variables are a dummy variable coded 1 for players from the USA and 0 for players from outside the USA, together with Guaranteed, Age, Player_Efficiency_Rating, True_Shooting_Percentage, Three_Point_Field_Goal_Percentage, Free_Throw_Percentage, Offensive_Rebound_Percentage, Defensive_Rebound_Percentage, Total_Rebound_Percentage, Assist_Percentage, Steal_Percentage, Block_Percentage, Turnover_Percentage, Usage_Percentage, Offensive_Win_Shares, Defensive_Win_Shares, Win_Shares, Win_Shares_Per_48_Minutes, Offense_Box_Plus_Minus, Defense_Box_Plus_Minus, Box_Plus_Minus, and Value_Over_Replacement_Player.
Specifically, this paper examines which variables are significantly associated with salary and what the test statistics imply. The dataset has many variables, including some character and dummy variables. We chose to focus on four main variables: Salary, Age, Position, and Player Efficiency Rating (PER). The central question of this project is: “Does PER have a statistically significant effect on the salary an NBA player earned during the 2017-18 season?” The dataset includes NBA salaries and statistics for various players during the 2017-18 season. Finally, we explored a model restricted to the variables that have a significant effect on the response variable, Salary.
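The dummy coding described in the introduction (USA as 1, all other countries as 0) can be sketched as follows. This is a minimal illustration, assuming the dummy is derived from a character Country column; the small `players` data frame is a made-up stand-in and not the actual dataset.

```r
# Hypothetical stand-in for the Country column of the dataset
players <- data.frame(
  Country = c("USA", "USA", "Serbia", "Ukraine", "Spain"),
  stringsAsFactors = FALSE
)

# 1 for players from the USA, 0 for players from any other country
players$Country_Dummy <- as.integer(players$Country == "USA")

# Cross-tabulate to confirm that every non-USA country maps to 0
print(table(players$Country, players$Country_Dummy))
```

Storing the dummy as an integer keeps it usable directly in `lm()`, where its coefficient is the estimated salary difference between the two groups with the other predictors held fixed.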
library(tidyverse)
## ── Attaching packages ──────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(readxl)
bb <- read_excel("basketball.xlsx")
## New names:
## * `` -> ...1
str(bb$Country)
## chr [1:559] "USA" "USA" "USA" "USA" "USA" "Serbia" "Ukraine" "USA" "Spain" ...
str(bb$Position)
## chr [1:559] "Point Guard" "Power Forward" "Power Forward" "Small Forward" ...
modbb2 <- lm(Salary~Age+Guaranteed+Player_Efficiency_Rating+True_Shooting_Percentage+Three_Point_Field_Goal_Percentage+Free_Throw_Percentage+Offensive_Rebound_Percentage+Defensive_Rebound_Percentage+Total_Rebound_Percentage+Assist_Percentage+Steal_Percentage+Block_Percentage+Turnover_Percentage+Usage_Percentage+Offensive_Win_Shares+Defensive_Win_Shares+Win_Shares+Win_Shares_Per_48_Minutes+Offense_Box_Plus_Minus+Defense_Box_Plus_Minus+Box_Plus_Minus+Value_Over_Replacement_Player, data=bb)
modbb2
##
## Call:
## lm(formula = Salary ~ Age + Guaranteed + Player_Efficiency_Rating +
## True_Shooting_Percentage + Three_Point_Field_Goal_Percentage +
## Free_Throw_Percentage + Offensive_Rebound_Percentage + Defensive_Rebound_Percentage +
## Total_Rebound_Percentage + Assist_Percentage + Steal_Percentage +
## Block_Percentage + Turnover_Percentage + Usage_Percentage +
## Offensive_Win_Shares + Defensive_Win_Shares + Win_Shares +
## Win_Shares_Per_48_Minutes + Offense_Box_Plus_Minus + Defense_Box_Plus_Minus +
## Box_Plus_Minus + Value_Over_Replacement_Player, data = bb)
##
## Coefficients:
## (Intercept) Age
## -7.300e+06 4.011e+05
## Guaranteed Player_Efficiency_Rating
## 1.774e-01 1.384e+05
## True_Shooting_Percentage Three_Point_Field_Goal_Percentage
## -2.283e+06 -1.546e+04
## Free_Throw_Percentage Offensive_Rebound_Percentage
## -3.037e+03 -1.045e+06
## Defensive_Rebound_Percentage Total_Rebound_Percentage
## -8.493e+05 1.911e+06
## Assist_Percentage Steal_Percentage
## -5.664e+04 -2.103e+05
## Block_Percentage Turnover_Percentage
## -1.739e+05 1.448e+04
## Usage_Percentage Offensive_Win_Shares
## 5.667e+04 6.708e+04
## Defensive_Win_Shares Win_Shares
## 4.394e+05 2.403e+05
## Win_Shares_Per_48_Minutes Offense_Box_Plus_Minus
## -1.330e+07 -2.668e+06
## Defense_Box_Plus_Minus Box_Plus_Minus
## -2.804e+06 2.954e+06
## Value_Over_Replacement_Player
## -1.550e+05
summary(modbb2)
##
## Call:
## lm(formula = Salary ~ Age + Guaranteed + Player_Efficiency_Rating +
## True_Shooting_Percentage + Three_Point_Field_Goal_Percentage +
## Free_Throw_Percentage + Offensive_Rebound_Percentage + Defensive_Rebound_Percentage +
## Total_Rebound_Percentage + Assist_Percentage + Steal_Percentage +
## Block_Percentage + Turnover_Percentage + Usage_Percentage +
## Offensive_Win_Shares + Defensive_Win_Shares + Win_Shares +
## Win_Shares_Per_48_Minutes + Offense_Box_Plus_Minus + Defense_Box_Plus_Minus +
## Box_Plus_Minus + Value_Over_Replacement_Player, data = bb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22609397 -1963642 -455650 1548429 14898466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.300e+06 2.849e+06 -2.562 0.0107 *
## Age 4.011e+05 3.873e+04 10.356 <2e-16 ***
## Guaranteed 1.774e-01 7.314e-03 24.255 <2e-16 ***
## Player_Efficiency_Rating 1.384e+05 1.887e+05 0.734 0.4635
## True_Shooting_Percentage -2.283e+06 3.018e+06 -0.756 0.4498
## Three_Point_Field_Goal_Percentage -1.546e+04 1.491e+04 -1.037 0.3002
## Free_Throw_Percentage -3.037e+03 6.738e+03 -0.451 0.6524
## Offensive_Rebound_Percentage -1.045e+06 6.002e+05 -1.741 0.0823 .
## Defensive_Rebound_Percentage -8.493e+05 5.921e+05 -1.434 0.1520
## Total_Rebound_Percentage 1.911e+06 1.186e+06 1.611 0.1078
## Assist_Percentage -5.664e+04 2.986e+04 -1.897 0.0584 .
## Steal_Percentage -2.103e+05 2.870e+05 -0.733 0.4640
## Block_Percentage -1.739e+05 2.128e+05 -0.817 0.4143
## Turnover_Percentage 1.448e+04 3.311e+04 0.437 0.6620
## Usage_Percentage 5.667e+04 7.048e+04 0.804 0.4217
## Offensive_Win_Shares 6.708e+04 3.083e+06 0.022 0.9827
## Defensive_Win_Shares 4.394e+05 3.083e+06 0.143 0.8867
## Win_Shares 2.403e+05 3.086e+06 0.078 0.9380
## Win_Shares_Per_48_Minutes -1.330e+07 6.777e+06 -1.962 0.0503 .
## Offense_Box_Plus_Minus -2.668e+06 3.304e+06 -0.807 0.4199
## Defense_Box_Plus_Minus -2.804e+06 3.257e+06 -0.861 0.3896
## Box_Plus_Minus 2.954e+06 3.268e+06 0.904 0.3665
## Value_Over_Replacement_Player -1.550e+05 3.718e+05 -0.417 0.6770
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3721000 on 536 degrees of freedom
## Multiple R-squared: 0.7453, Adjusted R-squared: 0.7349
## F-statistic: 71.31 on 22 and 536 DF, p-value: < 2.2e-16
anova(modbb2)
## Analysis of Variance Table
##
## Response: Salary
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 3.1939e+15 3.1939e+15 230.6436 < 2.2e-16
## Guaranteed 1 1.7116e+16 1.7116e+16 1235.9803 < 2.2e-16
## Player_Efficiency_Rating 1 2.1420e+14 2.1420e+14 15.4684 9.492e-05
## True_Shooting_Percentage 1 5.9988e+12 5.9988e+12 0.4332 0.510708
## Three_Point_Field_Goal_Percentage 1 4.1604e+13 4.1604e+13 3.0044 0.083614
## Free_Throw_Percentage 1 9.5711e+11 9.5711e+11 0.0691 0.792728
## Offensive_Rebound_Percentage 1 3.6157e+11 3.6157e+11 0.0261 0.871692
## Defensive_Rebound_Percentage 1 3.4278e+14 3.4278e+14 24.7534 8.790e-07
## Total_Rebound_Percentage 1 6.2076e+13 6.2076e+13 4.4827 0.034699
## Assist_Percentage 1 2.1969e+13 2.1969e+13 1.5865 0.208376
## Steal_Percentage 1 5.4734e+10 5.4734e+10 0.0040 0.949894
## Block_Percentage 1 5.8237e+12 5.8237e+12 0.4206 0.516938
## Turnover_Percentage 1 5.7151e+12 5.7151e+12 0.4127 0.520871
## Usage_Percentage 1 1.3529e+14 1.3529e+14 9.7696 0.001870
## Offensive_Win_Shares 1 3.2995e+14 3.2995e+14 23.8270 1.393e-06
## Defensive_Win_Shares 1 1.4424e+14 1.4424e+14 10.4164 0.001325
## Win_Shares 1 1.5392e+10 1.5392e+10 0.0011 0.973417
## Win_Shares_Per_48_Minutes 1 4.2972e+13 4.2972e+13 3.1032 0.078708
## Offense_Box_Plus_Minus 1 4.1327e+13 4.1327e+13 2.9844 0.084649
## Defense_Box_Plus_Minus 1 5.8863e+12 5.8863e+12 0.4251 0.514697
## Box_Plus_Minus 1 1.1415e+13 1.1415e+13 0.8243 0.364325
## Value_Over_Replacement_Player 1 2.4060e+12 2.4060e+12 0.1737 0.676970
## Residuals 536 7.4224e+15 1.3848e+13
##
## Age ***
## Guaranteed ***
## Player_Efficiency_Rating ***
## True_Shooting_Percentage
## Three_Point_Field_Goal_Percentage .
## Free_Throw_Percentage
## Offensive_Rebound_Percentage
## Defensive_Rebound_Percentage ***
## Total_Rebound_Percentage *
## Assist_Percentage
## Steal_Percentage
## Block_Percentage
## Turnover_Percentage
## Usage_Percentage **
## Offensive_Win_Shares ***
## Defensive_Win_Shares **
## Win_Shares
## Win_Shares_Per_48_Minutes .
## Offense_Box_Plus_Minus .
## Defense_Box_Plus_Minus
## Box_Plus_Minus
## Value_Over_Replacement_Player
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
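The summary and ANOVA tables above flag different sets of variables because they test different hypotheses: `summary()` tests each coefficient given all other terms, while `anova()` on a single `lm` fit reports sequential sums of squares that depend on the order of the terms. The sketch below illustrates this with the built-in `mtcars` data, which is only a stand-in for the NBA data; the model and variable names are chosen for illustration.

```r
# mtcars ships with R and is used here only as a stand-in dataset
fit_a <- lm(mpg ~ wt + hp, data = mtcars)
fit_b <- lm(mpg ~ hp + wt, data = mtcars)  # same model, terms in the other order

# Coefficient t-tests do not depend on the order of the terms
t_a <- coef(summary(fit_a))
t_b <- coef(summary(fit_b))

# Sequential F-tests do: each term is tested given only the terms before it
seq_a <- anova(fit_a)
seq_b <- anova(fit_b)
print(seq_a)
print(seq_b)

# drop1() tests each term given all the others, so F equals the squared t value
marginal <- drop1(fit_a, test = "F")
print(marginal)
```

When the two kinds of test disagree, as they do for Player_Efficiency_Rating above, the marginal tests are usually the ones that answer whether a variable matters after controlling for the rest.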
positionvector <- factor(bb$Position)
ggplot(bb, aes(x=Age+Guaranteed+Player_Efficiency_Rating+True_Shooting_Percentage+Three_Point_Field_Goal_Percentage+Free_Throw_Percentage+Offensive_Rebound_Percentage+Defensive_Rebound_Percentage+Total_Rebound_Percentage+Assist_Percentage+Steal_Percentage+Block_Percentage+Turnover_Percentage+Usage_Percentage+Offensive_Win_Shares+Defensive_Win_Shares+Win_Shares+Win_Shares_Per_48_Minutes+Offense_Box_Plus_Minus+Defense_Box_Plus_Minus+Box_Plus_Minus+Value_Over_Replacement_Player+Country_Dummy, y=Salary, color = positionvector))+
geom_jitter()+
geom_smooth(col = "orange")+ # loess curve (default method) over all players
geom_smooth(method = "lm", se = FALSE) # least-squares line for each position
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This graphic plots Salary against the arithmetic sum of all the explanatory variables in the dataset. The straight least-squares lines are drawn separately for each position, and the orange curve is a loess smooth over all players. The regression results shown above indicate that only Age and Guaranteed are statistically significant at the 0.05 level. The ANOVA results shown above indicate that Age, Guaranteed, Player Efficiency Rating, Defensive Rebound Percentage, Total Rebound Percentage, Usage Percentage, Offensive Win Shares, and Defensive Win Shares are significant at the 0.05 level. Because anova() reports sequential sums of squares, each of those variables explains a significant additional share of the variation in Salary given the variables entered before it, so the result depends on the order of the terms. In the residuals versus fitted values plot, no non-linear relationship is apparent. In the normal Q-Q plot, the residuals appear approximately normally distributed. Outliers are not influential in this linear model because no Cook's distance contours are visible in the residuals versus leverage plot.
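The residual diagnostics discussed above can be produced with base R's `plot()` method for `lm` objects. The following is a minimal sketch on the built-in `mtcars` data, used as a stand-in; the 4/n cutoff for Cook's distance is a common rule of thumb rather than a strict threshold, and it is an assumption here, not something taken from the analysis.

```r
# Stand-in model; in the report the fitted object would be modbb2
fit <- lm(mpg ~ wt + hp, data = mtcars)

op <- par(mfrow = c(1, 3))
plot(fit, which = 1)  # residuals vs fitted: curvature suggests non-linearity
plot(fit, which = 2)  # normal Q-Q: points off the line suggest non-normal residuals
plot(fit, which = 5)  # residuals vs leverage, with Cook's distance contours
par(op)

# Numeric check of influence: Cook's distance for every observation
cd <- cooks.distance(fit)
flagged <- names(cd)[cd > 4 / nrow(mtcars)]  # rule-of-thumb cutoff (assumption)
print(flagged)
```

Checking the Cook's distance values numerically complements the plot, since contours that fall outside the plotted range are simply not drawn.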
Countrydummy <- factor(bb$Country_Dummy)
ggplot(bb, aes(x=Age+Guaranteed+Player_Efficiency_Rating+True_Shooting_Percentage+Three_Point_Field_Goal_Percentage+Free_Throw_Percentage+Offensive_Rebound_Percentage+Defensive_Rebound_Percentage+Total_Rebound_Percentage+Assist_Percentage+Steal_Percentage+Block_Percentage+Turnover_Percentage+Usage_Percentage+Offensive_Win_Shares+Defensive_Win_Shares+Win_Shares+Win_Shares_Per_48_Minutes+Offense_Box_Plus_Minus+Defense_Box_Plus_Minus+Box_Plus_Minus+Value_Over_Replacement_Player+Country_Dummy, y=Salary, color = Countrydummy))+
geom_jitter()+
geom_smooth(col = "orange")+ # loess curve (default method) over all players
geom_smooth(method = "lm", se = FALSE) # least-squares line for each country group
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The graph above repeats the plot with points and lines colored by the country dummy variable, in which players from the USA are coded 1 and players from other countries are coded 0. In the graph, players from the USA tend to have a higher salary trend than players from outside the USA. To include only the most important variables, those with p-values under 0.1, we use the model shown below.
ggplot(bb, aes(x=Age+Guaranteed+Player_Efficiency_Rating+Defensive_Rebound_Percentage+Total_Rebound_Percentage+Usage_Percentage+Offensive_Win_Shares+Defensive_Win_Shares+Win_Shares_Per_48_Minutes+Offense_Box_Plus_Minus+Country_Dummy, y=Salary, color = positionvector))+
geom_jitter()+
geom_smooth(col = "orange")+ # loess curve (default method) over all players
geom_smooth(method = "lm", se = FALSE) # least-squares line for each position
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
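The plots above place the sum of the raw predictor values on the x-axis, which is not the same quantity as the model's prediction. One alternative way to visualize a model with many predictors is to plot observed values against fitted values. The sketch below does this on the built-in `mtcars` data as a stand-in; the names `plot_data`, `observed`, `fitted`, and `group` are introduced only for illustration.

```r
library(ggplot2)

# Stand-in model with several predictors; cyl plays the role of Position
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

plot_data <- data.frame(
  observed = mtcars$mpg,
  fitted = fitted(fit),
  group = factor(mtcars$cyl)
)

# Points near the dashed 45-degree line are well predicted by the model
p <- ggplot(plot_data, aes(x = fitted, y = observed, color = group)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")
print(p)
```

With this layout the x-axis is on the same scale as the response, so the vertical distance from the dashed line is the residual for each observation.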

mod01 <- lm(Salary~Age+Guaranteed+Player_Efficiency_Rating+Defensive_Rebound_Percentage+Total_Rebound_Percentage+Usage_Percentage+Offensive_Win_Shares+Defensive_Win_Shares+Win_Shares_Per_48_Minutes+Offense_Box_Plus_Minus+Country_Dummy, data=bb)
mod01
##
## Call:
## lm(formula = Salary ~ Age + Guaranteed + Player_Efficiency_Rating +
## Defensive_Rebound_Percentage + Total_Rebound_Percentage +
## Usage_Percentage + Offensive_Win_Shares + Defensive_Win_Shares +
## Win_Shares_Per_48_Minutes + Offense_Box_Plus_Minus + Country_Dummy,
## data = bb)
##
## Coefficients:
## (Intercept) Age
## -1.059e+07 3.957e+05
## Guaranteed Player_Efficiency_Rating
## 1.773e-01 3.433e+04
## Defensive_Rebound_Percentage Total_Rebound_Percentage
## 1.555e+05 -7.066e+04
## Usage_Percentage Offensive_Win_Shares
## 6.217e+04 2.799e+05
## Defensive_Win_Shares Win_Shares_Per_48_Minutes
## 7.295e+05 -6.058e+06
## Offense_Box_Plus_Minus Country_Dummy
## 1.449e+05 4.083e+05
summary(mod01)
##
## Call:
## lm(formula = Salary ~ Age + Guaranteed + Player_Efficiency_Rating +
## Defensive_Rebound_Percentage + Total_Rebound_Percentage +
## Usage_Percentage + Offensive_Win_Shares + Defensive_Win_Shares +
## Win_Shares_Per_48_Minutes + Offense_Box_Plus_Minus + Country_Dummy,
## data = bb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21770341 -1972639 -415723 1452912 14192657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.059e+07 1.412e+06 -7.501 2.58e-13 ***
## Age 3.957e+05 3.806e+04 10.398 < 2e-16 ***
## Guaranteed 1.773e-01 7.031e-03 25.210 < 2e-16 ***
## Player_Efficiency_Rating 3.433e+04 1.108e+05 0.310 0.75673
## Defensive_Rebound_Percentage 1.555e+05 5.935e+04 2.620 0.00904 **
## Total_Rebound_Percentage -7.066e+04 9.217e+04 -0.767 0.44368
## Usage_Percentage 6.217e+04 5.139e+04 1.210 0.22690
## Offensive_Win_Shares 2.799e+05 1.355e+05 2.066 0.03934 *
## Defensive_Win_Shares 7.295e+05 2.187e+05 3.335 0.00091 ***
## Win_Shares_Per_48_Minutes -6.058e+06 5.272e+06 -1.149 0.25108
## Offense_Box_Plus_Minus 1.449e+05 1.363e+05 1.064 0.28796
## Country_Dummy 4.083e+05 3.868e+05 1.056 0.29166
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3713000 on 547 degrees of freedom
## Multiple R-squared: 0.7413, Adjusted R-squared: 0.7361
## F-statistic: 142.5 on 11 and 547 DF, p-value: < 2.2e-16
anova(mod01)
## Analysis of Variance Table
##
## Response: Salary
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 3.1939e+15 3.1939e+15 231.6943 < 2.2e-16 ***
## Guaranteed 1 1.7116e+16 1.7116e+16 1241.6107 < 2.2e-16 ***
## Player_Efficiency_Rating 1 2.1420e+14 2.1420e+14 15.5388 9.132e-05 ***
## Defensive_Rebound_Percentage 1 3.3309e+14 3.3309e+14 24.1632 1.172e-06 ***
## Total_Rebound_Percentage 1 2.8744e+13 2.8744e+13 2.0852 0.1493083
## Usage_Percentage 1 1.4503e+14 1.4503e+14 10.5207 0.0012524 **
## Offensive_Win_Shares 1 3.7489e+14 3.7489e+14 27.1958 2.612e-07 ***
## Defensive_Win_Shares 1 1.5712e+14 1.5712e+14 11.3981 0.0007874 ***
## Win_Shares_Per_48_Minutes 1 1.2867e+13 1.2867e+13 0.9334 0.3344077
## Offense_Box_Plus_Minus 1 1.5766e+13 1.5766e+13 1.1437 0.2853338
## Country_Dummy 1 1.5358e+13 1.5358e+13 1.1141 0.2916590
## Residuals 547 7.5404e+15 1.3785e+13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With this set of variables, the ANOVA table shows that Age, Guaranteed, Player Efficiency Rating, Defensive Rebound Percentage, Usage Percentage, Offensive Win Shares, and Defensive Win Shares are statistically significant at the 0.05 level, and the overall F-test for the model has a p-value below 2.2e-16. In the coefficient t-tests, Age, Guaranteed, Defensive Rebound Percentage, Offensive Win Shares, and Defensive Win Shares are significant, while Player Efficiency Rating is not (p = 0.757) once the other variables are held fixed. The number of variables with significant coefficients increased from two in the full model (Age and Guaranteed) to five in the reduced model. We found this interesting because the adjusted R-squared remained almost the same (0.7349 versus 0.7361) even though the set of variables changed substantially.
We conclude that future research would benefit from including more players, which could yield more interesting implications. The analysis could also start from assumptions about how NBA players' salaries are decided. In other words, given some underlying assumptions about how salaries are set, we could include the variables that are likely to have significant effects on them.
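A near-identical adjusted R-squared in a larger and a smaller model can be checked more formally with a partial F-test, which asks whether the dropped terms jointly explain additional variation. The sketch below uses the built-in `mtcars` data as a stand-in. It assumes the smaller model's terms are a subset of the larger model's, which does not hold exactly for mod01 and modbb2, because mod01 includes Country_Dummy and modbb2 does not.

```r
# Stand-in models: reduced is nested inside full
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Adjusted R-squared penalizes extra terms, so it can stay flat as terms are dropped
adj_full <- summary(full)$adj.r.squared
adj_reduced <- summary(reduced)$adj.r.squared
print(c(full = adj_full, reduced = adj_reduced))

# Partial F-test: a large p-value means the extra terms add little
comparison <- anova(reduced, full)
p_value <- comparison[2, "Pr(>F)"]
print(comparison)
```

To apply the same comparison to the NBA models, Country_Dummy would need to be added to the full model or removed from the reduced one so that the two are nested.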