MATH 239

Professor Smalley

    The purpose of this paper is to analyze player statistics from the 2017-18 NBA season and their correlation with players' salaries. Our dataset contains 559 observations and 28 variables, 4 of which are character variables. The central question of this analysis is whether player statistics affect players' salaries. The response variable is Salary. The explanatory variables are a country dummy variable (Country_Dummy, coded 1 for players from the USA and 0 for players from outside the USA), Guaranteed, Age, Player_Efficiency_Rating, True_Shooting_Percentage, Three_Point_Field_Goal_Percentage, Free_Throw_Percentage, Offensive_Rebound_Percentage, Defensive_Rebound_Percentage, Total_Rebound_Percentage, Assist_Percentage, Steal_Percentage, Block_Percentage, Turnover_Percentage, Usage_Percentage, Offensive_Win_Shares, Defensive_Win_Shares, Win_Shares, Win_Shares_Per_48_Minutes, Offense_Box_Plus_Minus, Defense_Box_Plus_Minus, Box_Plus_Minus, and Value_Over_Replacement_Player.
    Specifically, this paper examines which variables have a significant relationship with salary and what the test statistics imply. The dataset has many variables, including some character and dummy variables. We chose to highlight four main variables: Salary, Age, Position, and Player Efficiency Rating (PER). The central question of this project is “Does PER have a statistically significant effect on the salary an NBA player earns during the 2017-18 season?” The dataset includes NBA salaries and statistics for players during the 2017-18 season. Finally, we explored a model containing the variables that have a significant effect on the response variable, Salary.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(readxl)
bb <- read_excel("basketball.xlsx")
## New names:
## * `` -> ...1
str(bb$Country)
##  chr [1:559] "USA" "USA" "USA" "USA" "USA" "Serbia" "Ukraine" "USA" "Spain" ...
str(bb$Position)
##  chr [1:559] "Point Guard" "Power Forward" "Power Forward" "Small Forward" ...
# Full model: Salary regressed on the 22 numeric predictors
modbb2 <- lm(
  Salary ~ Age + Guaranteed + Player_Efficiency_Rating +
    True_Shooting_Percentage + Three_Point_Field_Goal_Percentage +
    Free_Throw_Percentage + Offensive_Rebound_Percentage +
    Defensive_Rebound_Percentage + Total_Rebound_Percentage +
    Assist_Percentage + Steal_Percentage + Block_Percentage +
    Turnover_Percentage + Usage_Percentage + Offensive_Win_Shares +
    Defensive_Win_Shares + Win_Shares + Win_Shares_Per_48_Minutes +
    Offense_Box_Plus_Minus + Defense_Box_Plus_Minus + Box_Plus_Minus +
    Value_Over_Replacement_Player,
  data = bb
)
modbb2
## 
## Call:
## lm(formula = Salary ~ Age + Guaranteed + Player_Efficiency_Rating + 
##     True_Shooting_Percentage + Three_Point_Field_Goal_Percentage + 
##     Free_Throw_Percentage + Offensive_Rebound_Percentage + Defensive_Rebound_Percentage + 
##     Total_Rebound_Percentage + Assist_Percentage + Steal_Percentage + 
##     Block_Percentage + Turnover_Percentage + Usage_Percentage + 
##     Offensive_Win_Shares + Defensive_Win_Shares + Win_Shares + 
##     Win_Shares_Per_48_Minutes + Offense_Box_Plus_Minus + Defense_Box_Plus_Minus + 
##     Box_Plus_Minus + Value_Over_Replacement_Player, data = bb)
## 
## Coefficients:
##                       (Intercept)                                Age  
##                        -7.300e+06                          4.011e+05  
##                        Guaranteed           Player_Efficiency_Rating  
##                         1.774e-01                          1.384e+05  
##          True_Shooting_Percentage  Three_Point_Field_Goal_Percentage  
##                        -2.283e+06                         -1.546e+04  
##             Free_Throw_Percentage       Offensive_Rebound_Percentage  
##                        -3.037e+03                         -1.045e+06  
##      Defensive_Rebound_Percentage           Total_Rebound_Percentage  
##                        -8.493e+05                          1.911e+06  
##                 Assist_Percentage                   Steal_Percentage  
##                        -5.664e+04                         -2.103e+05  
##                  Block_Percentage                Turnover_Percentage  
##                        -1.739e+05                          1.448e+04  
##                  Usage_Percentage               Offensive_Win_Shares  
##                         5.667e+04                          6.708e+04  
##              Defensive_Win_Shares                         Win_Shares  
##                         4.394e+05                          2.403e+05  
##         Win_Shares_Per_48_Minutes             Offense_Box_Plus_Minus  
##                        -1.330e+07                         -2.668e+06  
##            Defense_Box_Plus_Minus                     Box_Plus_Minus  
##                        -2.804e+06                          2.954e+06  
##     Value_Over_Replacement_Player  
##                        -1.550e+05
summary(modbb2)
## 
## Call:
## lm(formula = Salary ~ Age + Guaranteed + Player_Efficiency_Rating + 
##     True_Shooting_Percentage + Three_Point_Field_Goal_Percentage + 
##     Free_Throw_Percentage + Offensive_Rebound_Percentage + Defensive_Rebound_Percentage + 
##     Total_Rebound_Percentage + Assist_Percentage + Steal_Percentage + 
##     Block_Percentage + Turnover_Percentage + Usage_Percentage + 
##     Offensive_Win_Shares + Defensive_Win_Shares + Win_Shares + 
##     Win_Shares_Per_48_Minutes + Offense_Box_Plus_Minus + Defense_Box_Plus_Minus + 
##     Box_Plus_Minus + Value_Over_Replacement_Player, data = bb)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -22609397  -1963642   -455650   1548429  14898466 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -7.300e+06  2.849e+06  -2.562   0.0107 *  
## Age                                4.011e+05  3.873e+04  10.356   <2e-16 ***
## Guaranteed                         1.774e-01  7.314e-03  24.255   <2e-16 ***
## Player_Efficiency_Rating           1.384e+05  1.887e+05   0.734   0.4635    
## True_Shooting_Percentage          -2.283e+06  3.018e+06  -0.756   0.4498    
## Three_Point_Field_Goal_Percentage -1.546e+04  1.491e+04  -1.037   0.3002    
## Free_Throw_Percentage             -3.037e+03  6.738e+03  -0.451   0.6524    
## Offensive_Rebound_Percentage      -1.045e+06  6.002e+05  -1.741   0.0823 .  
## Defensive_Rebound_Percentage      -8.493e+05  5.921e+05  -1.434   0.1520    
## Total_Rebound_Percentage           1.911e+06  1.186e+06   1.611   0.1078    
## Assist_Percentage                 -5.664e+04  2.986e+04  -1.897   0.0584 .  
## Steal_Percentage                  -2.103e+05  2.870e+05  -0.733   0.4640    
## Block_Percentage                  -1.739e+05  2.128e+05  -0.817   0.4143    
## Turnover_Percentage                1.448e+04  3.311e+04   0.437   0.6620    
## Usage_Percentage                   5.667e+04  7.048e+04   0.804   0.4217    
## Offensive_Win_Shares               6.708e+04  3.083e+06   0.022   0.9827    
## Defensive_Win_Shares               4.394e+05  3.083e+06   0.143   0.8867    
## Win_Shares                         2.403e+05  3.086e+06   0.078   0.9380    
## Win_Shares_Per_48_Minutes         -1.330e+07  6.777e+06  -1.962   0.0503 .  
## Offense_Box_Plus_Minus            -2.668e+06  3.304e+06  -0.807   0.4199    
## Defense_Box_Plus_Minus            -2.804e+06  3.257e+06  -0.861   0.3896    
## Box_Plus_Minus                     2.954e+06  3.268e+06   0.904   0.3665    
## Value_Over_Replacement_Player     -1.550e+05  3.718e+05  -0.417   0.6770    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3721000 on 536 degrees of freedom
## Multiple R-squared:  0.7453, Adjusted R-squared:  0.7349 
## F-statistic: 71.31 on 22 and 536 DF,  p-value: < 2.2e-16
anova(modbb2)
## Analysis of Variance Table
## 
## Response: Salary
##                                    Df     Sum Sq    Mean Sq   F value    Pr(>F)
## Age                                 1 3.1939e+15 3.1939e+15  230.6436 < 2.2e-16
## Guaranteed                          1 1.7116e+16 1.7116e+16 1235.9803 < 2.2e-16
## Player_Efficiency_Rating            1 2.1420e+14 2.1420e+14   15.4684 9.492e-05
## True_Shooting_Percentage            1 5.9988e+12 5.9988e+12    0.4332  0.510708
## Three_Point_Field_Goal_Percentage   1 4.1604e+13 4.1604e+13    3.0044  0.083614
## Free_Throw_Percentage               1 9.5711e+11 9.5711e+11    0.0691  0.792728
## Offensive_Rebound_Percentage        1 3.6157e+11 3.6157e+11    0.0261  0.871692
## Defensive_Rebound_Percentage        1 3.4278e+14 3.4278e+14   24.7534 8.790e-07
## Total_Rebound_Percentage            1 6.2076e+13 6.2076e+13    4.4827  0.034699
## Assist_Percentage                   1 2.1969e+13 2.1969e+13    1.5865  0.208376
## Steal_Percentage                    1 5.4734e+10 5.4734e+10    0.0040  0.949894
## Block_Percentage                    1 5.8237e+12 5.8237e+12    0.4206  0.516938
## Turnover_Percentage                 1 5.7151e+12 5.7151e+12    0.4127  0.520871
## Usage_Percentage                    1 1.3529e+14 1.3529e+14    9.7696  0.001870
## Offensive_Win_Shares                1 3.2995e+14 3.2995e+14   23.8270 1.393e-06
## Defensive_Win_Shares                1 1.4424e+14 1.4424e+14   10.4164  0.001325
## Win_Shares                          1 1.5392e+10 1.5392e+10    0.0011  0.973417
## Win_Shares_Per_48_Minutes           1 4.2972e+13 4.2972e+13    3.1032  0.078708
## Offense_Box_Plus_Minus              1 4.1327e+13 4.1327e+13    2.9844  0.084649
## Defense_Box_Plus_Minus              1 5.8863e+12 5.8863e+12    0.4251  0.514697
## Box_Plus_Minus                      1 1.1415e+13 1.1415e+13    0.8243  0.364325
## Value_Over_Replacement_Player       1 2.4060e+12 2.4060e+12    0.1737  0.676970
## Residuals                         536 7.4224e+15 1.3848e+13                    
##                                      
## Age                               ***
## Guaranteed                        ***
## Player_Efficiency_Rating          ***
## True_Shooting_Percentage             
## Three_Point_Field_Goal_Percentage .  
## Free_Throw_Percentage                
## Offensive_Rebound_Percentage         
## Defensive_Rebound_Percentage      ***
## Total_Rebound_Percentage          *  
## Assist_Percentage                    
## Steal_Percentage                     
## Block_Percentage                     
## Turnover_Percentage                  
## Usage_Percentage                  ** 
## Offensive_Win_Shares              ***
## Defensive_Win_Shares              ** 
## Win_Shares                           
## Win_Shares_Per_48_Minutes         .  
## Offense_Box_Plus_Minus            .  
## Defense_Box_Plus_Minus               
## Box_Plus_Minus                       
## Value_Over_Replacement_Player        
## Residuals                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
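
The anova() table above reports sequential sums of squares, so each F-test depends on the terms listed before it. The following is a minimal sketch of that order dependence. It uses the built-in mtcars data as a stand-in (an assumption, since basketball.xlsx is not reproduced here), and the model names fit_a and fit_b are illustrative only.

```r
# Sketch on the built-in mtcars data (a stand-in, not the NBA dataset)
fit_a <- lm(mpg ~ wt + hp, data = mtcars)  # wt entered first
fit_b <- lm(mpg ~ hp + wt, data = mtcars)  # hp entered first

tab_a <- anova(fit_a)
tab_b <- anova(fit_b)

# Sequential sum of squares for hp: entered after wt versus entered first
ss_hp_after_wt <- tab_a["hp", "Sum Sq"]
ss_hp_first    <- tab_b["hp", "Sum Sq"]
print(c(after_wt = ss_hp_after_wt, first = ss_hp_first))

# drop1() tests each term given all the others, so it does not depend on order
print(drop1(fit_a, test = "F"))
```

The same variable can look strong when entered early and weak when entered late, which is one reason the ANOVA table and the coefficient t-tests can disagree for a variable such as Player_Efficiency_Rating.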
positionvector <- factor(bb$Position)

ggplot(bb, aes(x=Age+Guaranteed+Player_Efficiency_Rating+True_Shooting_Percentage+Three_Point_Field_Goal_Percentage+Free_Throw_Percentage+Offensive_Rebound_Percentage+Defensive_Rebound_Percentage+Total_Rebound_Percentage+Assist_Percentage+Steal_Percentage+Block_Percentage+Turnover_Percentage+Usage_Percentage+Offensive_Win_Shares+Defensive_Win_Shares+Win_Shares+Win_Shares_Per_48_Minutes+Offense_Box_Plus_Minus+Defense_Box_Plus_Minus+Box_Plus_Minus+Value_Over_Replacement_Player+Country_Dummy, y=Salary, color = positionvector))+
  geom_jitter()+
  geom_smooth(col = "orange")+ # loess smoother (geom_smooth default)
  geom_smooth(method = "lm", se = FALSE) # least-squares line for each color group
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

     This graphic plots Salary against the arithmetic sum of all the predictor variables. The straight regression lines are drawn separately for each position, and the orange curve is a loess smoother (the default method of geom_smooth()), not a least-squares line. The regression results shown above indicate that only Age and Guaranteed are statistically significant at the 0.05 level. The ANOVA results shown above indicate that Age, Guaranteed, Player Efficiency Rating, Defensive Rebound Percentage, and Offensive Win Shares are significant at the 0.001 level. Because anova() reports sequential sums of squares, each of these tests measures what a variable adds beyond the variables listed before it, so the results depend on the order of the terms. In the residuals versus fitted values plot, no non-linear relationship is apparent. In the normal Q-Q plot, the residuals appear approximately normally distributed. Outliers are not influential in this linear model because the Cook's distance contours do not appear within the range of the residuals versus leverage plot.
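
The diagnostic plots discussed in this paragraph can be produced with the standard plot() method for lm objects. The following is a minimal sketch on the built-in mtcars data, used as a stand-in for the NBA dataset. The 0.5 cutoff for Cook's distance is a common rule of thumb and an assumption here, not a value taken from this analysis.

```r
# Sketch on the built-in mtcars data (a stand-in, not the NBA dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standard lm diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage (with Cook's distance contours)
par(mfrow = c(2, 2))
plot(fit)

# Numeric check of influence: Cook's distance for each observation
cd <- cooks.distance(fit)
influential <- names(cd)[cd > 0.5]  # 0.5 is a rule-of-thumb cutoff (assumption)
print(influential)
```

For the models in this paper, the same calls would be applied to modbb2 or mod01 in place of fit.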
Countrydummy <- factor(bb$Country_Dummy)

ggplot(bb, aes(x=Age+Guaranteed+Player_Efficiency_Rating+True_Shooting_Percentage+Three_Point_Field_Goal_Percentage+Free_Throw_Percentage+Offensive_Rebound_Percentage+Defensive_Rebound_Percentage+Total_Rebound_Percentage+Assist_Percentage+Steal_Percentage+Block_Percentage+Turnover_Percentage+Usage_Percentage+Offensive_Win_Shares+Defensive_Win_Shares+Win_Shares+Win_Shares_Per_48_Minutes+Offense_Box_Plus_Minus+Defense_Box_Plus_Minus+Box_Plus_Minus+Value_Over_Replacement_Player+Country_Dummy, y=Salary, color = Countrydummy))+
  geom_jitter()+
  geom_smooth(col = "orange")+ # loess smoother (geom_smooth default)
  geom_smooth(method = "lm", se = FALSE) # least-squares line for each color group
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

    The graph above shows the same plot colored by the country dummy variable, in which players from the USA are coded as 1 and players from other countries as 0. In the graph, players from the USA appear to have a higher salary trend than players from outside the USA. The reduced model keeps the variables with ANOVA p-values under 0.1 (except Three_Point_Field_Goal_Percentage) and adds Country_Dummy. It is shown as follows:
    
ggplot(bb, aes(x=Age+Guaranteed+Player_Efficiency_Rating+Defensive_Rebound_Percentage+Total_Rebound_Percentage+Usage_Percentage+Offensive_Win_Shares+Defensive_Win_Shares+Win_Shares_Per_48_Minutes+Offense_Box_Plus_Minus+Country_Dummy, y=Salary, color = positionvector))+
  geom_jitter()+
  geom_smooth(col = "orange")+ # loess smoother (geom_smooth default)
  geom_smooth(method = "lm", se = FALSE) # least-squares line for each color group
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
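
The three plots above put the unweighted sum of the predictor columns on the horizontal axis, which mixes units and is dominated by the column with the largest scale. One possible alternative is to plot the response against the model's fitted values, which are the intercept plus the coefficient-weighted sum of the predictors. The following is a minimal sketch on the built-in mtcars data, used as a stand-in for the NBA dataset. The column names raw_sum and fitted_value are illustrative.

```r
# Sketch on the built-in mtcars data (a stand-in, not the NBA dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)

plot_data <- mtcars
plot_data$raw_sum      <- plot_data$wt + plot_data$hp  # unweighted sum, mixes units
plot_data$fitted_value <- fitted(fit)                  # intercept + weighted sum

# The raw sum is dominated by whichever column has the largest scale (hp here)
print(cor(plot_data$raw_sum, plot_data$hp))

# Observed response against fitted values; points near the dashed
# 45-degree line are well predicted by the model
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(plot_data, aes(x = fitted_value, y = mpg, color = factor(cyl))) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed")
  print(p)
}
```

With the NBA data, the analogous plot would use fitted(modbb2) or fitted(mod01) on the horizontal axis and Salary on the vertical axis.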

# Reduced model: selected predictors plus the country dummy
mod01 <- lm(
  Salary ~ Age + Guaranteed + Player_Efficiency_Rating +
    Defensive_Rebound_Percentage + Total_Rebound_Percentage +
    Usage_Percentage + Offensive_Win_Shares + Defensive_Win_Shares +
    Win_Shares_Per_48_Minutes + Offense_Box_Plus_Minus + Country_Dummy,
  data = bb
)
mod01
## 
## Call:
## lm(formula = Salary ~ Age + Guaranteed + Player_Efficiency_Rating + 
##     Defensive_Rebound_Percentage + Total_Rebound_Percentage + 
##     Usage_Percentage + Offensive_Win_Shares + Defensive_Win_Shares + 
##     Win_Shares_Per_48_Minutes + Offense_Box_Plus_Minus + Country_Dummy, 
##     data = bb)
## 
## Coefficients:
##                  (Intercept)                           Age  
##                   -1.059e+07                     3.957e+05  
##                   Guaranteed      Player_Efficiency_Rating  
##                    1.773e-01                     3.433e+04  
## Defensive_Rebound_Percentage      Total_Rebound_Percentage  
##                    1.555e+05                    -7.066e+04  
##             Usage_Percentage          Offensive_Win_Shares  
##                    6.217e+04                     2.799e+05  
##         Defensive_Win_Shares     Win_Shares_Per_48_Minutes  
##                    7.295e+05                    -6.058e+06  
##       Offense_Box_Plus_Minus                 Country_Dummy  
##                    1.449e+05                     4.083e+05
summary(mod01)
## 
## Call:
## lm(formula = Salary ~ Age + Guaranteed + Player_Efficiency_Rating + 
##     Defensive_Rebound_Percentage + Total_Rebound_Percentage + 
##     Usage_Percentage + Offensive_Win_Shares + Defensive_Win_Shares + 
##     Win_Shares_Per_48_Minutes + Offense_Box_Plus_Minus + Country_Dummy, 
##     data = bb)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -21770341  -1972639   -415723   1452912  14192657 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -1.059e+07  1.412e+06  -7.501 2.58e-13 ***
## Age                           3.957e+05  3.806e+04  10.398  < 2e-16 ***
## Guaranteed                    1.773e-01  7.031e-03  25.210  < 2e-16 ***
## Player_Efficiency_Rating      3.433e+04  1.108e+05   0.310  0.75673    
## Defensive_Rebound_Percentage  1.555e+05  5.935e+04   2.620  0.00904 ** 
## Total_Rebound_Percentage     -7.066e+04  9.217e+04  -0.767  0.44368    
## Usage_Percentage              6.217e+04  5.139e+04   1.210  0.22690    
## Offensive_Win_Shares          2.799e+05  1.355e+05   2.066  0.03934 *  
## Defensive_Win_Shares          7.295e+05  2.187e+05   3.335  0.00091 ***
## Win_Shares_Per_48_Minutes    -6.058e+06  5.272e+06  -1.149  0.25108    
## Offense_Box_Plus_Minus        1.449e+05  1.363e+05   1.064  0.28796    
## Country_Dummy                 4.083e+05  3.868e+05   1.056  0.29166    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3713000 on 547 degrees of freedom
## Multiple R-squared:  0.7413, Adjusted R-squared:  0.7361 
## F-statistic: 142.5 on 11 and 547 DF,  p-value: < 2.2e-16
anova(mod01)
## Analysis of Variance Table
## 
## Response: Salary
##                               Df     Sum Sq    Mean Sq   F value    Pr(>F)    
## Age                            1 3.1939e+15 3.1939e+15  231.6943 < 2.2e-16 ***
## Guaranteed                     1 1.7116e+16 1.7116e+16 1241.6107 < 2.2e-16 ***
## Player_Efficiency_Rating       1 2.1420e+14 2.1420e+14   15.5388 9.132e-05 ***
## Defensive_Rebound_Percentage   1 3.3309e+14 3.3309e+14   24.1632 1.172e-06 ***
## Total_Rebound_Percentage       1 2.8744e+13 2.8744e+13    2.0852 0.1493083    
## Usage_Percentage               1 1.4503e+14 1.4503e+14   10.5207 0.0012524 ** 
## Offensive_Win_Shares           1 3.7489e+14 3.7489e+14   27.1958 2.612e-07 ***
## Defensive_Win_Shares           1 1.5712e+14 1.5712e+14   11.3981 0.0007874 ***
## Win_Shares_Per_48_Minutes      1 1.2867e+13 1.2867e+13    0.9334 0.3344077    
## Offense_Box_Plus_Minus         1 1.5766e+13 1.5766e+13    1.1437 0.2853338    
## Country_Dummy                  1 1.5358e+13 1.5358e+13    1.1141 0.2916590    
## Residuals                    547 7.5404e+15 1.3785e+13                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
     With this set of variables, Age, Guaranteed, Player Efficiency Rating, Defensive Rebound Percentage, Offensive Win Shares, and Defensive Win Shares are statistically significant at the 0.001 level in the ANOVA table, and the overall F-test has a p-value below 2.2e-16. In the coefficient table, Age, Guaranteed, Defensive Rebound Percentage, Offensive Win Shares, and Defensive Win Shares are significant at the 0.05 level, while Player Efficiency Rating is not (p = 0.757) once the other variables are in the model. The number of significant coefficients increased from two in the full model to five in the reduced model, which we found interesting because the adjusted R-squared remained almost the same (0.7349 versus 0.7361) even though the set of variables changed substantially. We conclude that, in future research, including more players in the analysis could yield more interesting implications. The analysis could also start from assumptions about how NBA players' salaries are decided. In other words, given some underlying assumptions about how salaries are decided, we could include the variables that are likely to have significant effects on salaries.
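
The comparison of the two models by adjusted R-squared can be collected in one small table. The following is a minimal sketch on the built-in mtcars data, used as a stand-in for the NBA dataset. The names fit_full and fit_reduced are illustrative, and AIC is added as one further criterion that the analysis above does not use.

```r
# Sketch on the built-in mtcars data (a stand-in, not the NBA dataset)
fit_full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit_reduced <- lm(mpg ~ wt + hp, data = mtcars)

comparison <- data.frame(
  model  = c("full", "reduced"),
  adj_r2 = c(summary(fit_full)$adj.r.squared,
             summary(fit_reduced)$adj.r.squared),
  aic    = c(AIC(fit_full), AIC(fit_reduced))
)
print(comparison)

# The reduced model is nested in the full one here, so a partial F-test
# can check whether the dropped terms jointly matter
print(anova(fit_reduced, fit_full))
```

The partial F-test applies only to nested models. In this paper mod01 contains Country_Dummy and modbb2 does not, so the dummy would have to be added to the full model before the two could be compared that way.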
     

Appendix

    The box plot below, grouped by position, has several implications. The salaries of Center players are clustered within a single box, whereas in the other positions, outliers (that is, star players) receive much larger salaries than the Center players.
ggplot(bb, aes(y=Salary, x=positionvector, fill=positionvector))+
  geom_boxplot()
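
The outliers visible in a box plot can also be listed numerically for each group. The following is a minimal sketch with hypothetical salaries (in millions) that are not taken from the NBA dataset. boxplot.stats() applies a 1.5 * IQR whisker rule similar to the default used by geom_boxplot(), although the hinges may be computed slightly differently.

```r
# Hypothetical salaries (in millions) for two positions; not from the NBA dataset
salaries <- data.frame(
  position = rep(c("Center", "Point Guard"), each = 5),
  salary   = c(2, 3, 4, 5, 6,    # Center: no extreme values
               1, 2, 3, 4, 30)   # Point Guard: one very large salary
)

# Values beyond the 1.5 * IQR whiskers, listed separately for each position
outliers_by_position <- lapply(
  split(salaries$salary, salaries$position),
  function(x) boxplot.stats(x)$out
)
print(outliers_by_position)
```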

    The coding of the position factor and the country dummy variable (USA = 1, others = 0) is shown below.
contrasts(positionvector)
##                Point Guard Power Forward Shooting Guard Small Forward
## Center                   0             0              0             0
## Point Guard              1             0              0             0
## Power Forward            0             1              0             0
## Shooting Guard           0             0              1             0
## Small Forward            0             0              0             1
str(bb$Country)
##  chr [1:559] "USA" "USA" "USA" "USA" "USA" "Serbia" "Ukraine" "USA" "Spain" ...
str(bb$Country_Dummy)
##  num [1:559] 1 1 1 1 1 0 0 1 0 0 ...
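
The dataset already contains Country_Dummy, and the code that created it is not shown. The following is a minimal sketch of one way such a dummy could be derived from a country column, and of how factor() chooses the baseline level seen in the contrasts above. The vectors country and position are short illustrative examples, not the full columns.

```r
# Illustrative country values; the real column is bb$Country
country <- c("USA", "USA", "Serbia", "Ukraine", "USA", "Spain")

# 1 for players from the USA, 0 for everyone else
country_dummy <- as.numeric(country == "USA")
print(country_dummy)

# factor() orders levels alphabetically, so the first level is the baseline
# in the default treatment contrasts
position <- factor(c("Point Guard", "Center", "Small Forward"))
print(levels(position)[1])
print(contrasts(position))
```

This is why Center has a row of zeros in the contrasts table: it is the first level alphabetically and therefore the reference category.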