Can We Predict an NBA Player’s Salary Using His Statistics from the Prior Year?
In the previous week’s discussion, we used a simple linear regression model with a single independent variable, total points from the previous year, to try to predict NBA players’ salaries. That analysis and discussion can be reviewed here. As a refresher, our overall goal is to see whether a regression model can predict a player’s salary for the coming year. For this analysis, we will use multiple factors and fit a multi-factor linear regression model.
We’ll use the same two datasets that we used previously. One dataset contains player statistics and the other contains player salaries for the 2017-2018 season. We’ll join these two datasets together to perform our analysis. The player statistics dataset can be downloaded here and the NBA salary dataset can be downloaded here.
# dplyr provides %>% and bind_rows(); ggplot2 is used for the plots further down
library(dplyr)
library(ggplot2)
nba_stats <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Week 11 - Linear Regression Using R/Discussion/Seasons_Stats.csv')
nba_salary <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Week 11 - Linear Regression Using R/Discussion/NBA_season1718_salary.csv')
The nba_stats dataset has data from 1955 through 2017. We’ll limit this data to 2016 to conduct our analysis (since we want to see whether we can predict salary in 2017). Additionally, some players have multiple rows per year because they were traded to different teams during the season. First, we’ll remove the duplicates so each player has one row; then we’ll circle back, pull the season-total rows (Tm == 'TOT') for those players, and append them to our dataset.
nba_stats_2016_raw <- nba_stats %>%
dplyr::filter(`Year` == 2016) %>%
dplyr::select(-`X1`, -`USG%`, -`OWS`, -`DWS`, -`WS`, -`WS/48`, -`blanl`, -`blank2`, -`OBPM`, -`DBPM`, -`VORP`, -`PF`)
#removing duplicates
nba_stats_2016 <- nba_stats_2016_raw[!duplicated(nba_stats_2016_raw$Player), ]
dupes <- nba_stats_2016_raw %>%
dplyr::filter(`Tm` == 'TOT')
nba_stats_2016 <- bind_rows(nba_stats_2016, dupes)
#nrow(nba_stats_2016)
head(nba_stats_2016)
## # A tibble: 6 x 41
## Year Player Pos Age Tm G GS MP PER `TS%` `3PAr` FTr
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 2016 Quinc~ PF 25 SAC 59 NA 876 14.7 0.629 NA 0.318
## 2 2016 Jorda~ SG 21 MEM 2 FALSE 15 17.3 0.427 NA 0.833
## 3 2016 Steve~ C 22 OKC 80 NA 2014 15.5 0.621 FALSE 0.46
## 4 2016 Arron~ SG 30 NYK 71 NA 2371 10.9 0.531 NA 0.164
## 5 2016 Alexi~ C 27 NOP 59 NA 861 13.8 0.514 NA 0.197
## 6 2016 Cole ~ C 27 LAC 60 NA 800 21.3 0.626 FALSE 0.373
## # ... with 29 more variables: `ORB%` <lgl>, `DRB%` <lgl>, `TRB%` <lgl>,
## # `AST%` <lgl>, `STL%` <lgl>, `BLK%` <lgl>, `TOV%` <lgl>, BPM <lgl>,
## # FG <dbl>, FGA <dbl>, `FG%` <dbl>, `3P` <lgl>, `3PA` <lgl>, `3P%` <lgl>,
## # `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>,
## # `FT%` <dbl>, ORB <lgl>, DRB <lgl>, TRB <dbl>, AST <dbl>, STL <lgl>,
## # BLK <lgl>, TOV <lgl>, PTS <dbl>
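One thing to watch with the append step: if a traded player’s TOT row already survives the de-duplication, binding the TOT rows back on would list that player twice. A minimal sketch of an alternative, using made-up players and base R only, keeps the TOT row for traded players and the single row for everyone else. The data frame and its values are hypothetical; they only mirror the Player, Tm and PTS columns.

```r
# Hypothetical toy data: traded players have one row per team plus a TOT row
season_stats <- data.frame(
  Player = c("A", "A", "A", "B", "C", "C", "C"),
  Tm     = c("TOT", "SAC", "MEM", "OKC", "TOT", "NYK", "LAC"),
  PTS    = c(300, 120, 180, 900, 50, 20, 30),
  stringsAsFactors = FALSE
)

# Players with a TOT row were traded mid-season
traded <- unique(season_stats$Player[season_stats$Tm == "TOT"])

# Keep the TOT row for traded players and the only row for everyone else
keep <- season_stats$Tm == "TOT" | !(season_stats$Player %in% traded)
one_row_per_player <- season_stats[keep, ]

# Sanity check: no player appears more than once
stopifnot(!any(duplicated(one_row_per_player$Player)))
one_row_per_player
```

A check like the final stopifnot() is cheap to run on the real data before joining, since duplicated players would be counted twice in the regression.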
Next, we’ll select the columns we need from our nba_salary dataset and aggregate the salary values for players who changed teams mid-year (meaning they have multiple rows of data). Additionally, we’ll divide the salary column by a million to make the numbers easier to read when plotting.
nba_salary_2017 <- nba_salary %>%
dplyr::select(`Player`, `season17_18`) %>%
dplyr::group_by(`Player`) %>%
dplyr::summarize(salary_2017_in_millions = sum(`season17_18`)/1000000)
head(nba_salary_2017)
## # A tibble: 6 x 2
## Player salary_2017_in_millions
## <chr> <dbl>
## 1 A.J. Hammons 1.31
## 2 Aaron Brooks 2.12
## 3 Aaron Gordon 5.50
## 4 Aaron Gray 0.452
## 5 Abdel Nader 1.17
## 6 Al-Farouq Aminu 7.32
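To make the aggregation concrete, here is a small base R sketch of the same idea on made-up numbers: salaries are summed per player and then rescaled to millions. The player labels and dollar amounts are hypothetical; only the season17_18 column name comes from the dataset.

```r
# Hypothetical salaries: player A was paid by two teams in the same season
salary <- data.frame(
  Player      = c("A", "A", "B"),
  season17_18 = c(1500000, 500000, 7320000),
  stringsAsFactors = FALSE
)

# Sum salary per player, then express it in millions of dollars
salary_by_player <- aggregate(season17_18 ~ Player, data = salary, FUN = sum)
salary_by_player$salary_in_millions <- salary_by_player$season17_18 / 1e6

salary_by_player
```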
Next, we’ll join these two datasets together so we have both points and salary in the same dataset.
nba_data <- nba_stats_2016 %>%
dplyr::left_join(nba_salary_2017, by = "Player") %>%
dplyr::arrange(PTS)
head(nba_data)
## # A tibble: 6 x 42
## Year Player Pos Age Tm G GS MP PER `TS%` `3PAr` FTr
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 2016 Sam D~ SF 21 HOU 3 FALSE 6 10.8 NA NA NA
## 2 2016 J.J. ~ SF 23 UTA 2 FALSE 6 1.3 0 FALSE 0
## 3 2016 Nate ~ PG 31 NOP 2 TRUE 23 2.6 0 TRUE 0
## 4 2016 Bruno~ SF 20 TOR 6 TRUE 43 -7.7 0.125 NA 0
## 5 2016 Joe H~ SG 24 CLE 5 FALSE 15 3.4 0.375 TRUE 0
## 6 2016 Rakee~ PF 24 IND 1 FALSE 6 32 1 FALSE 0
## # ... with 30 more variables: `ORB%` <lgl>, `DRB%` <lgl>, `TRB%` <lgl>,
## # `AST%` <lgl>, `STL%` <lgl>, `BLK%` <lgl>, `TOV%` <lgl>, BPM <lgl>,
## # FG <dbl>, FGA <dbl>, `FG%` <dbl>, `3P` <lgl>, `3PA` <lgl>, `3P%` <lgl>,
## # `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>,
## # `FT%` <dbl>, ORB <lgl>, DRB <lgl>, TRB <dbl>, AST <dbl>, STL <lgl>,
## # BLK <lgl>, TOV <lgl>, PTS <dbl>, salary_2017_in_millions <dbl>
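A left join keeps every row from the statistics table, so any player whose name has no match in the salary table ends up with a missing salary. The sketch below shows that behavior on hypothetical data, using base R’s merge() with all.x = TRUE as a stand-in for dplyr::left_join().

```r
# Hypothetical tables: player B has stats but no salary record
player_stats <- data.frame(
  Player = c("A", "B", "C"),
  PTS    = c(10, 200, 35),
  stringsAsFactors = FALSE
)
player_salary <- data.frame(
  Player   = c("A", "C"),
  salary_m = c(1.2, 3.4),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every stats row, filling unmatched salaries with NA
joined <- merge(player_stats, player_salary, by = "Player", all.x = TRUE)

# Players that failed to match are worth inspecting (name spelling, retirements, etc.)
unmatched <- joined$Player[is.na(joined$salary_m)]
unmatched
```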
Before moving on, let’s quickly look and see how many null values we have in each of our columns:
## Year Player Pos
## 0 0 0
## Age Tm G
## 0 0 0
## GS MP PER
## 383 0 0
## TS% 3PAr FTr
## 1 479 1
## ORB% DRB% TRB%
## 502 520 521
## AST% STL% BLK%
## 511 478 470
## TOV% BPM FG
## 512 515 0
## FGA FG% 3P
## 0 1 394
## 3PA 3P% 2P
## 449 466 0
## 2PA 2P% eFG%
## 0 3 1
## FT FTA FT%
## 0 0 15
## ORB DRB TRB
## 494 512 0
## AST STL BLK
## 0 486 458
## TOV PTS salary_2017_in_millions
## 501 0 157
HOLY! That’s a lot of nulls. Some of these columns are almost completely empty. Let’s remove these, as they will not be helpful to our analysis:
nba_data <- nba_data %>%
dplyr::select(-`GS`, -`3PAr`, -`ORB%`, -`DRB%`, -`TRB%`, -`AST%`, -`STL%`, -`BLK%`, -`TOV%`, -`BPM`, -`3P`, -`3PA`, -`3P%`, -`ORB`, -`DRB`, -`STL`, -`BLK`, -`TOV`)
colSums(is.na(nba_data))
## Year Player Pos
## 0 0 0
## Age Tm G
## 0 0 0
## MP PER TS%
## 0 0 1
## FTr FG FGA
## 1 0 0
## FG% 2P 2PA
## 1 0 0
## 2P% eFG% FT
## 3 1 0
## FTA FT% TRB
## 0 15 0
## AST PTS salary_2017_in_millions
## 0 0 157
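Instead of listing the sparse columns by name, they could also be dropped by the share of missing values. This is only a sketch on a hypothetical data frame, and the 50% cutoff is an assumption rather than something taken from the analysis.

```r
# Hypothetical data frame: column b is mostly missing
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 1),
  c = c(1, NA, 3, 4)
)

# Share of missing values per column
na_share <- colMeans(is.na(df))

# Keep only columns where at most half of the values are missing (assumed cutoff)
kept <- df[, na_share <= 0.5, drop = FALSE]

na_share
names(kept)
```

With a threshold like this, the salary column would be kept, which matches the approach here of handling its missing values by filtering rows rather than dropping the column.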
Looking above, the null values in our columns look much better, except for the salary column. Let’s clean that up in our next step.
Since our analysis specifically looks at predicting the salary of players with stats in 2016, we’ll remove any rows where the salary is null. Additionally, there are cases where a player has very few points and is still paid a salary (they were injured, etc.). For our purposes, we’ll consider these outliers and remove anyone with 25 or fewer points from our analysis (there are 82 games in an NBA season, so scoring more than 25 points should be realistic for most players, even bench sitters).
nba_data <- nba_data %>%
dplyr::filter(!is.na(salary_2017_in_millions)) %>%
dplyr::filter(PTS > 25) %>%
dplyr::mutate(Pos = ifelse(Pos == 'PF-C', 'PF', Pos))
nba_data
## # A tibble: 358 x 24
## Year Player Pos Age Tm G MP PER `TS%` FTr FG FGA
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016 Alan ~ PF 23 PHO 10 68 21.1 0.481 0.583 10 24
## 2 2016 Brian~ PG 23 TOT 7 169 8.9 0.38 0.103 14 39
## 3 2016 Brian~ PG 23 TOT 7 169 8.9 0.38 0.103 14 39
## 4 2016 Pat C~ SG 23 POR 34 143 4.6 0.352 0.102 13 49
## 5 2016 Nikol~ C 30 MIN 12 156 6.5 0.459 0.4 19 50
## 6 2016 Spenc~ PG 22 DET 12 159 8.9 0.423 0.611 19 54
## 7 2016 Udoni~ PF 35 MIA 37 260 9.1 0.434 0.262 23 61
## 8 2016 Caron~ SF 35 SAC 17 176 11.3 0.49 0.203 25 59
## 9 2016 Jarel~ SF 24 WAS 26 147 11 0.46 0.123 20 65
## 10 2016 Lucas~ C 23 TOR 29 225 15.6 0.642 0.341 28 44
## # ... with 348 more rows, and 12 more variables: `FG%` <dbl>, `2P` <dbl>,
## # `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
## # TRB <dbl>, AST <dbl>, PTS <dbl>, salary_2017_in_millions <dbl>
Now that we’ve got our data straightened out, let’s see if we can identify some relationship between some of the factors in our dataset and a player’s salary.
First, let’s take a look at the distribution of salary across the different positions in the NBA. Before plotting, let’s see how many players of each position we have in the dataset:
## # A tibble: 5 x 2
## Pos n
## <chr> <int>
## 1 C 68
## 2 PF 70
## 3 PG 75
## 4 SF 72
## 5 SG 73
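The code behind the counts is not shown, but a table like this can be produced in a couple of lines. The sketch below uses base R on a hypothetical vector of positions and repeats the PF-C to PF recoding from the filtering step; on the real data, something like dplyr::count(nba_data, Pos) would presumably give the same shape of result.

```r
# Hypothetical positions, including one hybrid PF-C label
pos <- c("C", "PF", "PF-C", "PG", "SG", "PG")

# Fold the hybrid label into PF, as in the filtering step
pos <- ifelse(pos == "PF-C", "PF", pos)

# Count players per position
pos_counts <- as.data.frame(table(Pos = pos), responseName = "n")
pos_counts
```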
Looking at the above counts, it looks like there is a pretty even breakout between each position. Equipped with this information, let’s now plot this with the salary information.
ggplot(nba_data) +
aes(x = salary_2017_in_millions, y = reorder(Pos, salary_2017_in_millions, FUN = median)) +
geom_violin() +
geom_boxplot(width = 0.15) +
coord_flip() +
labs(title = "Salary Violin Plots by Position") +
ylab("Position") +
xlab("Salary ($M)") +
theme(
plot.title = element_text(hjust = 0.45)
)
One of the most interesting things we can see in the above plot is that point guards (PG) have the lowest median salary of any position in the NBA, but they also have the widest range, the highest salary in the dataset, and several other outliers. Centers (C) have the highest median salary as well as a wider IQR than any other position.
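Readings like these can be checked numerically rather than by eye. The sketch below computes the median, IQR and range per position with tapply(); the salaries are made up and chosen only so that the toy data shows the same pattern as the plot.

```r
# Hypothetical salaries in $M, not taken from the dataset
toy <- data.frame(
  Pos      = c("C", "C", "C", "PG", "PG", "PG", "PG", "PG"),
  salary_m = c(2, 10, 18, 1, 2, 3, 4, 30),
  stringsAsFactors = FALSE
)

# Summary statistics per position
med <- tapply(toy$salary_m, toy$Pos, median)
iqr <- tapply(toy$salary_m, toy$Pos, IQR)
rng <- tapply(toy$salary_m, toy$Pos, function(x) diff(range(x)))

# One row per position: PG has the lower median but the wider range
cbind(median = med, IQR = iqr, range = rng)
```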
Now let’s turn our attention to age to see if it plays a factor in salary:
ggplot(nba_data) +
aes(x = as.factor(Age), y = salary_2017_in_millions ) +
geom_boxplot() +
labs(title = "Salary Boxplots by Age") +
xlab("Age") +
ylab("Salary ($M)") +
theme(
plot.title = element_text(hjust = 0.45)
)
Looking at the boxplots above, we can see that age is definitely a factor in salary. In general, salary increases starting at age 22 and then decreases after age 31. This makes sense: players can enter the league at age 19, and teams generally aren’t willing to risk a large salary on an untested rookie. After a few years of solid performance, we’d expect salaries to increase. Additionally, players generally start to see some decline in their athletic abilities in their early 30s given the demands of the game.
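A rise-then-fall pattern like this is worth keeping in mind for the regression, because a single linear Age term can only capture a steady increase or decrease. The sketch below illustrates the point on fabricated data with a peak at age 27. The quadratic term is an assumption added for illustration and is not part of the model built in this post.

```r
# Fabricated salaries that peak at age 27, plus a small deterministic wiggle
toy <- data.frame(Age = 20:34)
toy$salary_m <- 12 - (toy$Age - 27)^2 / 5 + 0.2 * rep(c(1, -1, 0), 5)

# A straight line in Age versus a curve with an added squared term
linear_fit    <- lm(salary_m ~ Age, data = toy)
quadratic_fit <- lm(salary_m ~ Age + I(Age^2), data = toy)

# Share of variance explained by each fit
r2 <- c(
  linear    = summary(linear_fit)$r.squared,
  quadratic = summary(quadratic_fit)$r.squared
)
r2
```

On data shaped like this, the linear fit explains almost nothing even though age clearly matters, so a weak linear Age coefficient would not by itself mean age is irrelevant.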
Next, we’ll take a look at the relationship between team (Tm) and salary:
ggplot(nba_data) +
aes(x = reorder(Tm,salary_2017_in_millions, FUN = median) , y = salary_2017_in_millions ) +
geom_boxplot() +
labs(title = "Salary Boxplots by Team") +
xlab("Team") +
ylab("Salary ($M)") +
theme(
plot.title = element_text(hjust = 0.45)
)
These results are pretty surprising. There is a huge discrepancy between teams’ salaries. For example, it looks like every player on the Cleveland Cavaliers makes more than every player on the Philadelphia 76ers. It does look like team will be a factor in the salary puzzle.
Now that we’ve reviewed all of the categorical data, let’s use a pair plot to investigate relationships between the remaining variables:
nba_data_numeric <- nba_data %>%
dplyr::select(-`Year`, -`Player`, -`Pos`, -`Tm`, -`Age`)
pairs(nba_data_numeric, gap = 0.5, main = "NBA Numerical Data Relationships")
Looking at the above pair plot, we can see a couple of things. First, salary has a moderate to moderately weak relationship with Minutes Played (MP), Player Efficiency Rating (PER), True Shooting Percentage (TS%), Field Goals (FG), Field Goal Attempts (FGA), Field Goal % (FG%), 2 Point Field Goals (2P), 2 Point Field Goal Attempts (2PA), Free Throws (FT), Free Throw Attempts (FTA), Total Rebounds (TRB) and Points (PTS). Second, some of our columns are highly correlated with one another, such as Free Throws and Free Throw Attempts. We’ll have to watch out for this collinearity when we build our model.
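Collinear pairs can also be listed directly from a correlation matrix instead of being read off the pair plot. This is a sketch on hypothetical numbers; the 0.9 cutoff is an assumption, and on the real data cor() would likely need use = "pairwise.complete.obs" because a few columns still contain missing values.

```r
# Hypothetical stats: FT tracks FTA closely, AST does not
toy <- data.frame(
  FT  = c(8, 15, 24, 30, 41),
  FTA = c(10, 20, 30, 40, 50),
  AST = c(5, 1, 4, 2, 3)
)

# Pairwise Pearson correlations
cors <- cor(toy)

# Flag each pair once (upper triangle) when |r| exceeds the assumed 0.9 cutoff
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
pairs_flagged <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high],
  stringsAsFactors = FALSE
)
pairs_flagged
```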
Let’s now turn our attention to model building. We’ll start by building a multi-factor regression model and use backward elimination to remove insignificant factors. Based on our analysis, we’ve got a good idea of what factors will be relevant to the model. Before building, you’ll remember in the last discussion we built a model using a single factor. The output from that model is below:
In reviewing the results, you can see we had a residual standard error of $6.283M and an adjusted R-squared value of 0.3718. We’ll be looking to improve those metrics with this multi-factor regression model. Since points (PTS) is strongly correlated with 2PA and 2P%, we’ll remove it in this model run.
nba_data_for_model <- nba_data %>%
dplyr::select(-`Year`, -`Player`, -`PTS`)
model1 <- lm(salary_2017_in_millions ~ ., data = nba_data_for_model)
summary(model1)
##
## Call:
## lm(formula = salary_2017_in_millions ~ ., data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7411 -2.8801 -0.2109 2.9241 15.2311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.897e+01 6.648e+00 2.854 0.004619 **
## PosPF -1.087e+00 9.983e-01 -1.089 0.277000
## PosPG -4.221e+00 1.493e+00 -2.828 0.004998 **
## PosSF -2.795e+00 1.293e+00 -2.161 0.031444 *
## PosSG -3.401e+00 1.419e+00 -2.397 0.017128 *
## Age -2.667e-02 7.653e-02 -0.348 0.727746
## TmBOS -6.258e+00 2.234e+00 -2.802 0.005411 **
## TmBRK -6.391e+00 2.336e+00 -2.735 0.006600 **
## TmCHI -3.697e+00 2.148e+00 -1.721 0.086222 .
## TmCHO -4.059e+00 2.194e+00 -1.850 0.065312 .
## TmCLE 1.626e-01 2.285e+00 0.071 0.943323
## TmDAL -6.356e+00 2.124e+00 -2.992 0.002997 **
## TmDEN -8.187e+00 2.196e+00 -3.729 0.000229 ***
## TmDET -5.588e+00 2.331e+00 -2.397 0.017131 *
## TmGSW -4.907e+00 2.172e+00 -2.259 0.024577 *
## TmHOU -7.030e+00 2.271e+00 -3.096 0.002146 **
## TmIND -4.258e+00 2.304e+00 -1.849 0.065497 .
## TmLAC 1.644e-02 2.340e+00 0.007 0.994397
## TmLAL -7.255e+00 2.517e+00 -2.882 0.004227 **
## TmMEM -2.996e+00 2.333e+00 -1.284 0.200043
## TmMIA -3.531e+00 2.208e+00 -1.599 0.110806
## TmMIL -3.935e+00 2.252e+00 -1.747 0.081576 .
## TmMIN -9.291e+00 2.343e+00 -3.965 9.16e-05 ***
## TmNOP -2.447e+00 2.300e+00 -1.064 0.288290
## TmNYK -5.780e+00 2.255e+00 -2.564 0.010841 *
## TmOKC -7.945e-01 2.249e+00 -0.353 0.724105
## TmORL -5.344e+00 2.177e+00 -2.454 0.014670 *
## TmPHI -9.273e+00 2.400e+00 -3.863 0.000137 ***
## TmPHO -5.368e+00 2.202e+00 -2.438 0.015334 *
## TmPOR -2.377e+00 2.199e+00 -1.081 0.280632
## TmSAC -8.272e+00 2.141e+00 -3.864 0.000136 ***
## TmSAS -3.960e+00 2.253e+00 -1.758 0.079788 .
## TmTOR -1.458e+00 2.186e+00 -0.667 0.505337
## TmTOT -6.363e+00 1.766e+00 -3.602 0.000368 ***
## TmUTA -3.625e+00 2.151e+00 -1.685 0.092959 .
## TmWAS -3.818e+00 2.237e+00 -1.707 0.088891 .
## G -1.599e-01 3.012e-02 -5.309 2.12e-07 ***
## MP 4.589e-03 1.705e-03 2.691 0.007512 **
## PER 6.036e-02 1.919e-01 0.315 0.753338
## `TS%` 3.209e+01 4.079e+01 0.787 0.432111
## FTr -2.643e+00 5.734e+00 -0.461 0.645228
## FG 9.269e-02 4.489e-02 2.065 0.039811 *
## FGA -2.884e-02 1.896e-02 -1.521 0.129174
## `FG%` -2.411e+01 2.321e+01 -1.039 0.299624
## `2P` -6.144e-03 5.333e-02 -0.115 0.908357
## `2PA` -1.066e-02 2.344e-02 -0.455 0.649473
## `2P%` 2.430e+00 1.492e+01 0.163 0.870752
## `eFG%` -1.901e+01 4.110e+01 -0.462 0.644100
## FT -1.657e-03 2.582e-02 -0.064 0.948891
## FTA 1.591e-02 2.106e-02 0.755 0.450601
## `FT%` -3.709e+00 4.968e+00 -0.746 0.455942
## TRB 1.213e-05 4.116e-03 0.003 0.997651
## AST 8.234e-03 4.044e-03 2.036 0.042594 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.884 on 305 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6106
## F-statistic: 11.77 on 52 and 305 DF, p-value: < 2.2e-16
Looking at the above model, we’ve already improved significantly on our simple linear regression model. Using backward elimination, total rebounds (TRB) has the highest p-value, so we’ll remove that factor and rerun the model:
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Age + Tm + G + MP +
## PER + `TS%` + FTr + FG + FGA + `FG%` + `2P` + `2PA` + `2P%` +
## `eFG%` + FT + FTA + `FT%` + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7419 -2.8801 -0.2106 2.9238 15.2299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.975637 6.484227 2.926 0.003686 **
## PosPF -1.087502 0.990855 -1.098 0.273269
## PosPG -4.222675 1.330548 -3.174 0.001659 **
## PosSF -2.796642 1.217967 -2.296 0.022343 *
## PosSG -3.403225 1.257968 -2.705 0.007206 **
## Age -0.026675 0.076332 -0.349 0.726989
## TmBOS -6.258631 2.229770 -2.807 0.005324 **
## TmBRK -6.390893 2.330441 -2.742 0.006459 **
## TmCHI -3.696411 2.143591 -1.724 0.085645 .
## TmCHO -4.058861 2.190527 -1.853 0.064857 .
## TmCLE 0.162783 2.280516 0.071 0.943142
## TmDAL -6.355670 2.120635 -2.997 0.002949 **
## TmDEN -8.186940 2.192114 -3.735 0.000224 ***
## TmDET -5.588048 2.326892 -2.402 0.016924 *
## TmGSW -4.906909 2.163765 -2.268 0.024041 *
## TmHOU -7.030738 2.263621 -3.106 0.002074 **
## TmIND -4.258209 2.299730 -1.852 0.065045 .
## TmLAC 0.015882 2.327986 0.007 0.994561
## TmLAL -7.254382 2.509703 -2.891 0.004121 **
## TmMEM -2.996187 2.329266 -1.286 0.199303
## TmMIA -3.531344 2.202067 -1.604 0.109823
## TmMIL -3.935047 2.244650 -1.753 0.080590 .
## TmMIN -9.291086 2.335093 -3.979 8.65e-05 ***
## TmNOP -2.446562 2.296170 -1.065 0.287491
## TmNYK -5.779795 2.250935 -2.568 0.010711 *
## TmOKC -0.794081 2.241156 -0.354 0.723345
## TmORL -5.344031 2.170376 -2.462 0.014357 *
## TmPHI -9.273574 2.395564 -3.871 0.000132 ***
## TmPHO -5.368108 2.197721 -2.443 0.015148 *
## TmPOR -2.376698 2.193558 -1.083 0.279444
## TmSAC -8.271569 2.136835 -3.871 0.000133 ***
## TmSAS -3.960339 2.248331 -1.761 0.079160 .
## TmTOR -1.457843 2.179375 -0.669 0.504046
## TmTOT -6.363345 1.762842 -3.610 0.000358 ***
## TmUTA -3.624868 2.144496 -1.690 0.091987 .
## TmWAS -3.818186 2.230605 -1.712 0.087960 .
## G -0.159912 0.029976 -5.335 1.86e-07 ***
## MP 0.004592 0.001418 3.239 0.001331 **
## PER 0.060559 0.179309 0.338 0.735792
## `TS%` 32.073489 40.407896 0.794 0.427960
## FTr -2.642786 5.724125 -0.462 0.644630
## FG 0.092689 0.044818 2.068 0.039468 *
## FGA -0.028850 0.018792 -1.535 0.125771
## `FG%` -24.122789 22.889563 -1.054 0.292771
## `2P` -0.006132 0.053089 -0.116 0.908121
## `2PA` -0.010667 0.023392 -0.456 0.648716
## `2P%` 2.430311 14.898810 0.163 0.870531
## `eFG%` -18.995560 40.820113 -0.465 0.642013
## FT -0.001672 0.025267 -0.066 0.947293
## FTA 0.015925 0.020347 0.783 0.434414
## `FT%` -3.707572 4.942419 -0.750 0.453739
## AST 0.008233 0.004002 2.057 0.040539 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.876 on 306 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6119
## F-statistic: 12.04 on 51 and 306 DF, p-value: < 2.2e-16
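A note on reading these coefficient tables: Pos and Tm show up as rows like PosPF or TmBOS, with no row for centers or for the first team alphabetically. That is because lm() expands a categorical predictor into indicator columns measured against a reference level, which is absorbed into the intercept. The toy example below, with made-up salaries, shows the expansion.

```r
# Made-up salaries in $M for three positions
toy <- data.frame(
  salary_m = c(5, 7, 2, 3, 4, 6),
  Pos      = factor(c("C", "C", "PG", "PG", "SG", "SG"))
)

# The design matrix has an intercept plus one indicator per non-reference level
design <- model.matrix(salary_m ~ Pos, data = toy)
colnames(design)

# The intercept is the mean for C; the other coefficients are differences from C
fit <- lm(salary_m ~ Pos, data = toy)
coef(fit)
```

Read this way, a coefficient such as PosPG is the estimated salary difference between point guards and centers with the other predictors held fixed, not a salary level.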
With the removal of total rebounds, we’ve seen a small decrease in residual standard error and a slight increase in the adjusted R-squared value. Next, we’ll rerun the model after removing free throws (FT):
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Age + Tm + G + MP +
## PER + `TS%` + FTr + FG + FGA + `FG%` + `2P` + `2PA` + `2P%` +
## `eFG%` + FTA + `FT%` + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7700 -2.8748 -0.2025 2.9617 15.2331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.128738 6.047428 3.163 0.001717 **
## PosPF -1.089418 0.988824 -1.102 0.271442
## PosPG -4.216763 1.325390 -3.182 0.001615 **
## PosSF -2.796102 1.215963 -2.299 0.022148 *
## PosSG -3.400814 1.255399 -2.709 0.007129 **
## Age -0.026640 0.076206 -0.350 0.726898
## TmBOS -6.265059 2.224036 -2.817 0.005162 **
## TmBRK -6.402272 2.320314 -2.759 0.006142 **
## TmCHI -3.706725 2.134445 -1.737 0.083457 .
## TmCHO -4.062399 2.186321 -1.858 0.064112 .
## TmCLE 0.161126 2.276677 0.071 0.943625
## TmDAL -6.361807 2.115168 -3.008 0.002850 **
## TmDEN -8.189813 2.188127 -3.743 0.000217 ***
## TmDET -5.574481 2.314076 -2.409 0.016587 *
## TmGSW -4.909443 2.159915 -2.273 0.023718 *
## TmHOU -7.035439 2.258834 -3.115 0.002016 **
## TmIND -4.262703 2.294996 -1.857 0.064213 .
## TmLAC 0.009719 2.322346 0.004 0.996664
## TmLAL -7.263961 2.501458 -2.904 0.003953 **
## TmMEM -3.002419 2.323583 -1.292 0.197277
## TmMIA -3.535789 2.197470 -1.609 0.108638
## TmMIL -3.946765 2.234021 -1.767 0.078278 .
## TmMIN -9.294075 2.330867 -3.987 8.36e-05 ***
## TmNOP -2.452416 2.290741 -1.071 0.285200
## TmNYK -5.786573 2.244953 -2.578 0.010415 *
## TmOKC -0.809383 2.225573 -0.364 0.716352
## TmORL -5.344645 2.166835 -2.467 0.014187 *
## TmPHI -9.266850 2.389523 -3.878 0.000129 ***
## TmPHO -5.371581 2.193529 -2.449 0.014891 *
## TmPOR -2.384480 2.186848 -1.090 0.276404
## TmSAC -8.266228 2.131844 -3.878 0.000129 ***
## TmSAS -3.969662 2.240270 -1.772 0.077394 .
## TmTOR -1.473302 2.163297 -0.681 0.496357
## TmTOT -6.366577 1.759305 -3.619 0.000346 ***
## TmUTA -3.632750 2.137709 -1.699 0.090263 .
## TmWAS -3.819446 2.226904 -1.715 0.087329 .
## G -0.159763 0.029842 -5.354 1.69e-07 ***
## MP 0.004614 0.001376 3.352 0.000903 ***
## PER 0.062145 0.177411 0.350 0.726363
## `TS%` 30.274543 29.842731 1.014 0.311158
## FTr -2.464744 5.043929 -0.489 0.625434
## FG 0.092880 0.044653 2.080 0.038350 *
## FGA -0.029092 0.018403 -1.581 0.114956
## `FG%` -24.601710 21.679586 -1.135 0.257350
## `2P` -0.006279 0.052957 -0.119 0.905697
## `2PA` -0.010457 0.023138 -0.452 0.651642
## `2P%` 2.501004 14.836330 0.169 0.866244
## `eFG%` -17.151034 29.768226 -0.576 0.564934
## FTA 0.014668 0.007261 2.020 0.044228 *
## `FT%` -3.717076 4.932313 -0.754 0.451657
## AST 0.008183 0.003925 2.085 0.037909 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.868 on 307 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6131
## F-statistic: 12.32 on 50 and 307 DF, p-value: < 2.2e-16
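Notice that the multiple R-squared has stayed at 0.6673 across the three models so far while the adjusted R-squared keeps creeping up. That follows from the standard adjustment formula, which penalizes the number of estimated coefficients: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below plugs in the reported values, with n = 358 rows and p = 52, 51 and 50 coefficients (read off the F-statistic lines). Because the inputs are rounded, the results agree with the reported values only to within rounding.

```r
# Standard adjusted R-squared formula
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n  <- 358             # rows in nba_data after filtering
r2 <- 0.6673          # multiple R-squared reported for all three models
p  <- c(52, 51, 50)   # coefficients excluding the intercept, per model

computed <- adjusted_r2(r2, n, p)
reported <- c(0.6106, 0.6119, 0.6131)

# Side-by-side comparison of computed and reported values
data.frame(p = p, computed = computed, reported = reported)
```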
We are still seeing improvements to the model, so we’ll continue with eliminations, removing two-point field goals (2P) next:
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Age + Tm + G + MP +
## PER + `TS%` + FTr + FG + FGA + `FG%` + `2PA` + `2P%` + `eFG%` +
## FTA + `FT%` + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7633 -2.8927 -0.1809 2.9474 15.2428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.149238 6.035273 3.173 0.001662 **
## PosPF -1.088422 0.987205 -1.103 0.271093
## PosPG -4.219524 1.323063 -3.189 0.001573 **
## PosSF -2.808121 1.209789 -2.321 0.020930 *
## PosSG -3.412391 1.249591 -2.731 0.006682 **
## Age -0.026843 0.076065 -0.353 0.724409
## TmBOS -6.282007 2.215883 -2.835 0.004886 **
## TmBRK -6.404695 2.316508 -2.765 0.006039 **
## TmCHI -3.694913 2.128704 -1.736 0.083607 .
## TmCHO -4.063115 2.182810 -1.861 0.063638 .
## TmCLE 0.140416 2.266331 0.062 0.950637
## TmDAL -6.352523 2.110332 -3.010 0.002827 **
## TmDEN -8.191665 2.184566 -3.750 0.000211 ***
## TmDET -5.572941 2.310333 -2.412 0.016442 *
## TmGSW -4.882205 2.144223 -2.277 0.023477 *
## TmHOU -7.042464 2.254440 -3.124 0.001955 **
## TmIND -4.258284 2.291017 -1.859 0.064025 .
## TmLAC 0.017230 2.317763 0.007 0.994073
## TmLAL -7.270515 2.496841 -2.912 0.003855 **
## TmMEM -2.991559 2.318058 -1.291 0.197829
## TmMIA -3.541115 2.193491 -1.614 0.107470
## TmMIL -3.948531 2.230393 -1.770 0.077661 .
## TmMIN -9.307269 2.324480 -4.004 7.81e-05 ***
## TmNOP -2.450573 2.287019 -1.072 0.284777
## TmNYK -5.779309 2.240523 -2.579 0.010359 *
## TmOKC -0.828563 2.216131 -0.374 0.708752
## TmORL -5.346376 2.163315 -2.471 0.014000 *
## TmPHI -9.288735 2.378567 -3.905 0.000116 ***
## TmPHO -5.367927 2.189799 -2.451 0.014788 *
## TmPOR -2.368885 2.179393 -1.087 0.277910
## TmSAC -8.262130 2.128150 -3.882 0.000127 ***
## TmSAS -3.958700 2.234776 -1.771 0.077482 .
## TmTOR -1.460788 2.157260 -0.677 0.498819
## TmTOT -6.363357 1.756278 -3.623 0.000340 ***
## TmUTA -3.624112 2.133045 -1.699 0.090323 .
## TmWAS -3.828000 2.222169 -1.723 0.085957 .
## G -0.160006 0.029724 -5.383 1.45e-07 ***
## MP 0.004582 0.001348 3.398 0.000768 ***
## PER 0.057857 0.173407 0.334 0.738873
## `TS%` 31.124277 28.923001 1.076 0.282721
## FTr -2.481232 5.033935 -0.493 0.622434
## FG 0.088167 0.020312 4.341 1.93e-05 ***
## FGA -0.027126 0.007968 -3.404 0.000751 ***
## `FG%` -24.039099 21.120033 -1.138 0.255916
## `2PA` -0.013116 0.005672 -2.313 0.021406 *
## `2P%` 1.499948 12.180136 0.123 0.902071
## `eFG%` -17.372429 29.662015 -0.586 0.558520
## FTA 0.014561 0.007193 2.024 0.043789 *
## `FT%` -3.794066 4.881552 -0.777 0.437622
## AST 0.008225 0.003902 2.108 0.035860 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.86 on 308 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6144
## F-statistic: 12.61 on 49 and 308 DF, p-value: < 2.2e-16
Two-point percentage (2P%) now has the highest p-value among the player statistics, so we’ll remove it next:
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Age + Tm + G + MP +
## PER + `TS%` + FTr + FG + FGA + `FG%` + `2PA` + `eFG%` + FTA +
## `FT%` + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7724 -2.9077 -0.1963 3.0091 15.2468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.237843 5.982673 3.216 0.001440 **
## PosPF -1.065036 0.967222 -1.101 0.271698
## PosPG -4.199051 1.310483 -3.204 0.001496 **
## PosSF -2.769554 1.166687 -2.374 0.018214 *
## PosSG -3.389594 1.233832 -2.747 0.006363 **
## Age -0.026739 0.075939 -0.352 0.724995
## TmBOS -6.308937 2.201549 -2.866 0.004447 **
## TmBRK -6.448532 2.285343 -2.822 0.005087 **
## TmCHI -3.744525 2.086900 -1.794 0.073743 .
## TmCHO -4.098013 2.160884 -1.896 0.058833 .
## TmCLE 0.109122 2.248447 0.049 0.961324
## TmDAL -6.357322 2.106607 -3.018 0.002758 **
## TmDEN -8.216854 2.171500 -3.784 0.000185 ***
## TmDET -5.589226 2.302867 -2.427 0.015793 *
## TmGSW -4.920410 2.118277 -2.323 0.020837 *
## TmHOU -7.065979 2.242756 -3.151 0.001789 **
## TmIND -4.292849 2.270132 -1.891 0.059558 .
## TmLAC -0.008961 2.304303 -0.004 0.996900
## TmLAL -7.308872 2.473386 -2.955 0.003367 **
## TmMEM -3.027874 2.295557 -1.319 0.188141
## TmMIA -3.563001 2.182793 -1.632 0.103632
## TmMIL -4.002225 2.183872 -1.833 0.067820 .
## TmMIN -9.332995 2.311381 -4.038 6.81e-05 ***
## TmNOP -2.480412 2.270520 -1.092 0.275490
## TmNYK -5.813322 2.219887 -2.619 0.009260 **
## TmOKC -0.849232 2.206241 -0.385 0.700560
## TmORL -5.375653 2.146781 -2.504 0.012793 *
## TmPHI -9.312596 2.366881 -3.935 0.000103 ***
## TmPHO -5.401032 2.169769 -2.489 0.013328 *
## TmPOR -2.400754 2.160522 -1.111 0.267350
## TmSAC -8.284555 2.116963 -3.913 0.000112 ***
## TmSAS -4.001309 2.204308 -1.815 0.070459 .
## TmTOR -1.495658 2.135185 -0.700 0.484154
## TmTOT -6.388921 1.741185 -3.669 0.000286 ***
## TmUTA -3.653216 2.116531 -1.726 0.085340 .
## TmWAS -3.869519 2.192941 -1.765 0.078630 .
## G -0.159696 0.029570 -5.401 1.33e-07 ***
## MP 0.004577 0.001346 3.401 0.000759 ***
## PER 0.056932 0.172968 0.329 0.742269
## `TS%` 31.296961 28.842914 1.085 0.278731
## FTr -2.559017 4.986184 -0.513 0.608163
## FG 0.088416 0.020178 4.382 1.61e-05 ***
## FGA -0.027145 0.007954 -3.413 0.000728 ***
## `FG%` -22.506673 17.037540 -1.321 0.187477
## `2PA` -0.013283 0.005497 -2.416 0.016261 *
## `eFG%` -17.582608 29.565643 -0.595 0.552481
## FTA 0.014648 0.007146 2.050 0.041235 *
## `FT%` -3.808184 4.872422 -0.782 0.435060
## AST 0.008234 0.003896 2.114 0.035356 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.853 on 309 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6156
## F-statistic: 12.91 on 48 and 309 DF, p-value: < 2.2e-16
Player Efficiency Rating (PER) is the next factor to go:
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Age + Tm + G + MP +
## `TS%` + FTr + FG + FGA + `FG%` + `2PA` + `eFG%` + FTA + `FT%` +
## AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8643 -2.8448 -0.1932 3.0061 15.3444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.373377 5.959896 3.251 0.001278 **
## PosPF -1.103418 0.958785 -1.151 0.250681
## PosPG -4.291503 1.278185 -3.357 0.000885 ***
## PosSF -2.887009 1.109175 -2.603 0.009690 **
## PosSG -3.522264 1.164458 -3.025 0.002696 **
## Age -0.029649 0.075314 -0.394 0.694090
## TmBOS -6.314430 2.198318 -2.872 0.004355 **
## TmBRK -6.416837 2.280028 -2.814 0.005200 **
## TmCHI -3.714137 2.081856 -1.784 0.075393 .
## TmCHO -4.045062 2.151786 -1.880 0.061064 .
## TmCLE 0.133036 2.244038 0.059 0.952764
## TmDAL -6.323684 2.101098 -3.010 0.002830 **
## TmDEN -8.197184 2.167554 -3.782 0.000187 ***
## TmDET -5.587450 2.299546 -2.430 0.015674 *
## TmGSW -4.925851 2.115164 -2.329 0.020511 *
## TmHOU -7.040701 2.238214 -3.146 0.001818 **
## TmIND -4.279371 2.266496 -1.888 0.059946 .
## TmLAC 0.069401 2.288672 0.030 0.975828
## TmLAL -7.307460 2.469822 -2.959 0.003327 **
## TmMEM -2.947358 2.279201 -1.293 0.196921
## TmMIA -3.537102 2.178235 -1.624 0.105427
## TmMIL -4.022309 2.179878 -1.845 0.065962 .
## TmMIN -9.418114 2.293563 -4.106 5.15e-05 ***
## TmNOP -2.449118 2.265263 -1.081 0.280465
## TmNYK -5.751309 2.208694 -2.604 0.009660 **
## TmOKC -0.863564 2.202637 -0.392 0.695284
## TmORL -5.373432 2.143681 -2.507 0.012700 *
## TmPHI -9.307974 2.363433 -3.938 0.000101 ***
## TmPHO -5.350176 2.161145 -2.476 0.013835 *
## TmPOR -2.395692 2.157358 -1.110 0.267655
## TmSAC -8.332619 2.108881 -3.951 9.63e-05 ***
## TmSAS -3.861378 2.159809 -1.788 0.074780 .
## TmTOR -1.446744 2.126941 -0.680 0.496886
## TmTOT -6.358422 1.736215 -3.662 0.000294 ***
## TmUTA -3.606409 2.108709 -1.710 0.088221 .
## TmWAS -3.891765 2.188744 -1.778 0.076371 .
## G -0.161542 0.028991 -5.572 5.47e-08 ***
## MP 0.004383 0.001208 3.627 0.000335 ***
## `TS%` 34.769289 26.805746 1.297 0.195567
## FTr -2.593037 4.977938 -0.521 0.602804
## FG 0.090826 0.018776 4.837 2.08e-06 ***
## FGA -0.027538 0.007852 -3.507 0.000520 ***
## `FG%` -22.054223 16.957555 -1.301 0.194377
## `2PA` -0.013457 0.005464 -2.463 0.014329 *
## `eFG%` -19.738184 28.789698 -0.686 0.493478
## FTA 0.015008 0.007052 2.128 0.034100 *
## `FT%` -4.031305 4.818092 -0.837 0.403405
## AST 0.008654 0.003675 2.355 0.019158 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.846 on 310 degrees of freedom
## Multiple R-squared: 0.6672, Adjusted R-squared: 0.6167
## F-statistic: 13.22 on 47 and 310 DF, p-value: < 2.2e-16
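The drop-and-refit routine we have been repeating by hand can also be written as a loop. The sketch below is one way to do it and is not the exact procedure used here: it relies on drop1() with an F test, which tests whole terms (so a factor like Tm is kept or dropped as a unit) instead of reading individual coefficient p-values from summary(). The helper name backward_eliminate and the 0.05 cutoff are assumptions, and the built-in mtcars data stands in for nba_data_for_model so the example runs on its own.

```r
# Hypothetical helper: repeatedly drop the least significant term and refit
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    tests      <- drop1(model, test = "F")
    p_values   <- tests[["Pr(>F)"]][-1]   # first row is "<none>"
    term_names <- rownames(tests)[-1]
    # Stop when nothing is left to drop or every term clears the cutoff
    if (length(p_values) == 0 || max(p_values) <= alpha) break
    worst <- term_names[which.max(p_values)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Stand-in data: start from the full model on mtcars
full_model    <- lm(mpg ~ ., data = mtcars)
reduced_model <- backward_eliminate(full_model)

formula(reduced_model)
summary(reduced_model)$adj.r.squared
```

Automating the loop saves typing, but stepping through by hand as we do here makes it easier to notice when a dropped factor changes the other coefficients noticeably.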
We’ve now come to the point where we will remove Age, which I had originally thought would be a helpful indicator to the model:
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Tm + G + MP + `TS%` +
## FTr + FG + FGA + `FG%` + `2PA` + `eFG%` + FTA + `FT%` + AST,
## data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9012 -2.8248 -0.1794 3.0236 15.4394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.566720 5.588983 3.322 0.001000 **
## PosPF -1.061774 0.951636 -1.116 0.265397
## PosPG -4.190041 1.250230 -3.351 0.000903 ***
## PosSF -2.810366 1.090472 -2.577 0.010421 *
## PosSG -3.454598 1.150137 -3.004 0.002884 **
## TmBOS -6.268907 2.192290 -2.860 0.004530 **
## TmBRK -6.399142 2.276485 -2.811 0.005253 **
## TmCHI -3.738799 2.078085 -1.799 0.072963 .
## TmCHO -3.998885 2.145666 -1.864 0.063306 .
## TmCLE 0.106429 2.239971 0.048 0.962134
## TmDAL -6.415502 2.085275 -3.077 0.002280 **
## TmDEN -8.125158 2.156882 -3.767 0.000198 ***
## TmDET -5.535999 2.292708 -2.415 0.016329 *
## TmGSW -4.952196 2.111231 -2.346 0.019622 *
## TmHOU -7.072912 2.233678 -3.166 0.001696 **
## TmIND -4.256594 2.262677 -1.881 0.060876 .
## TmLAC -0.024564 2.273098 -0.011 0.991385
## TmLAL -7.255242 2.462905 -2.946 0.003464 **
## TmMEM -3.079225 2.251388 -1.368 0.172393
## TmMIA -3.579052 2.172670 -1.647 0.100505
## TmMIL -3.929935 2.164266 -1.816 0.070360 .
## TmMIN -9.308636 2.273547 -4.094 5.40e-05 ***
## TmNOP -2.445056 2.262160 -1.081 0.280601
## TmNYK -5.753550 2.205685 -2.609 0.009533 **
## TmOKC -0.839353 2.198785 -0.382 0.702919
## TmORL -5.308780 2.134476 -2.487 0.013401 *
## TmPHI -9.181521 2.338321 -3.927 0.000106 ***
## TmPHO -5.342968 2.158130 -2.476 0.013828 *
## TmPOR -2.324455 2.146833 -1.083 0.279764
## TmSAC -8.344454 2.105800 -3.963 9.20e-05 ***
## TmSAS -3.966484 2.140330 -1.853 0.064799 .
## TmTOR -1.405214 2.121436 -0.662 0.508213
## TmTOT -6.407384 1.729400 -3.705 0.000250 ***
## TmUTA -3.544161 2.099914 -1.688 0.092459 .
## TmWAS -3.887985 2.185748 -1.779 0.076251 .
## G -0.161268 0.028944 -5.572 5.47e-08 ***
## MP 0.004331 0.001199 3.611 0.000355 ***
## `TS%` 34.096828 26.714898 1.276 0.202794
## FTr -2.464821 4.960519 -0.497 0.619619
## FG 0.090574 0.018740 4.833 2.11e-06 ***
## FGA -0.027274 0.007813 -3.491 0.000551 ***
## `FG%` -21.423964 16.858863 -1.271 0.204755
## `2PA` -0.013434 0.005456 -2.462 0.014358 *
## `eFG%` -19.632916 28.749319 -0.683 0.495178
## FTA 0.014939 0.007040 2.122 0.034623 *
## `FT%` -4.075498 4.810236 -0.847 0.397504
## AST 0.008435 0.003628 2.325 0.020711 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.839 on 311 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.6178
## F-statistic: 13.54 on 46 and 311 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Tm + G + MP + `TS%` +
## FG + FGA + `FG%` + `2PA` + `eFG%` + FTA + `FT%` + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7725 -2.7710 -0.2483 2.9208 15.4736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.918309 5.427942 3.301 0.001075 **
## PosPF -0.999265 0.942145 -1.061 0.289680
## PosPG -4.179786 1.248550 -3.348 0.000915 ***
## PosSF -2.745907 1.081420 -2.539 0.011597 *
## PosSG -3.416748 1.146226 -2.981 0.003100 **
## TmBOS -6.244440 2.189090 -2.853 0.004627 **
## TmBRK -6.395766 2.273726 -2.813 0.005221 **
## TmCHI -3.732267 2.075534 -1.798 0.073109 .
## TmCHO -3.960610 2.141693 -1.849 0.065362 .
## TmCLE 0.118214 2.237140 0.053 0.957892
## TmDAL -6.457485 2.081046 -3.103 0.002091 **
## TmDEN -8.101066 2.153733 -3.761 0.000202 ***
## TmDET -5.587081 2.287636 -2.442 0.015149 *
## TmGSW -4.914952 2.107352 -2.332 0.020321 *
## TmHOU -7.068386 2.230962 -3.168 0.001685 **
## TmIND -4.219731 2.258729 -1.868 0.062673 .
## TmLAC -0.128312 2.260754 -0.057 0.954776
## TmLAL -7.254854 2.459931 -2.949 0.003427 **
## TmMEM -3.102576 2.248179 -1.380 0.168563
## TmMIA -3.543214 2.168850 -1.634 0.103334
## TmMIL -3.854960 2.156392 -1.788 0.074797 .
## TmMIN -9.293542 2.270599 -4.093 5.43e-05 ***
## TmNOP -2.450586 2.259401 -1.085 0.278928
## TmNYK -5.757565 2.203006 -2.614 0.009396 **
## TmOKC -0.762667 2.190713 -0.348 0.727973
## TmORL -5.335097 2.131242 -2.503 0.012816 *
## TmPHI -9.143565 2.334250 -3.917 0.000110 ***
## TmPHO -5.371814 2.154744 -2.493 0.013184 *
## TmPOR -2.381222 2.141202 -1.112 0.266953
## TmSAC -8.273354 2.098396 -3.943 9.95e-05 ***
## TmSAS -3.993014 2.137080 -1.868 0.062637 .
## TmTOR -1.371237 2.117773 -0.647 0.517791
## TmTOT -6.428177 1.726806 -3.723 0.000234 ***
## TmUTA -3.556630 2.097228 -1.696 0.090908 .
## TmWAS -3.919168 2.182208 -1.796 0.073468 .
## G -0.158550 0.028388 -5.585 5.09e-08 ***
## MP 0.004177 0.001157 3.609 0.000358 ***
## `TS%` 27.726810 23.409647 1.184 0.237149
## FG 0.090788 0.018712 4.852 1.94e-06 ***
## FGA -0.026974 0.007780 -3.467 0.000600 ***
## `FG%` -24.701646 15.496045 -1.594 0.111935
## `2PA` -0.012669 0.005229 -2.423 0.015960 *
## `eFG%` -11.624879 23.778501 -0.489 0.625269
## FTA 0.012347 0.004722 2.615 0.009360 **
## `FT%` -3.111552 4.396377 -0.708 0.479627
## AST 0.008631 0.003602 2.396 0.017159 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.833 on 312 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6187
## F-statistic: 13.87 on 45 and 312 DF, p-value: < 2.2e-16
In this last run, our adjusted R-squared rose slightly (from 0.6178 to 0.6187) and our residual standard error went down (from 4.839 to 4.833). Let’s keep going with our eliminations:
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Tm + G + MP + `TS%` +
## FG + FGA + `FG%` + `2PA` + FTA + `FT%` + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8446 -2.7867 -0.2265 2.9412 15.5120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.367171 5.303123 3.275 0.001175 **
## PosPF -1.049536 0.935378 -1.122 0.262703
## PosPG -4.290358 1.226400 -3.498 0.000536 ***
## PosSF -2.833499 1.065178 -2.660 0.008214 **
## PosSG -3.529062 1.121601 -3.146 0.001812 **
## TmBOS -6.186426 2.183212 -2.834 0.004901 **
## TmBRK -6.361102 2.269856 -2.802 0.005388 **
## TmCHI -3.750866 2.072661 -1.810 0.071303 .
## TmCHO -3.940468 2.138692 -1.842 0.066353 .
## TmCLE 0.149522 2.233504 0.067 0.946668
## TmDAL -6.383992 2.073084 -3.079 0.002258 **
## TmDEN -8.081361 2.150737 -3.757 0.000205 ***
## TmDET -5.598814 2.284728 -2.451 0.014811 *
## TmGSW -4.859742 2.101765 -2.312 0.021414 *
## TmHOU -7.098690 2.227388 -3.187 0.001583 **
## TmIND -4.211452 2.255918 -1.867 0.062858 .
## TmLAC -0.194220 2.253986 -0.086 0.931388
## TmLAL -7.216472 2.455687 -2.939 0.003541 **
## TmMEM -3.093364 2.245365 -1.378 0.169291
## TmMIA -3.494986 2.163970 -1.615 0.107301
## TmMIL -3.843599 2.153644 -1.785 0.075279 .
## TmMIN -9.189014 2.257760 -4.070 5.96e-05 ***
## TmNOP -2.454769 2.256637 -1.088 0.277520
## TmNYK -5.714177 2.198540 -2.599 0.009790 **
## TmOKC -0.707466 2.185140 -0.324 0.746334
## TmORL -5.310440 2.128053 -2.495 0.013095 *
## TmPHI -9.092178 2.329046 -3.904 0.000116 ***
## TmPHO -5.316163 2.149118 -2.474 0.013904 *
## TmPOR -2.362740 2.138264 -1.105 0.270017
## TmSAC -8.268715 2.095822 -3.945 9.84e-05 ***
## TmSAS -3.925513 2.130021 -1.843 0.066282 .
## TmTOR -1.361017 2.115094 -0.643 0.520385
## TmTOT -6.406202 1.724121 -3.716 0.000240 ***
## TmUTA -3.533948 2.094164 -1.688 0.092499 .
## TmWAS -3.919957 2.179553 -1.799 0.073059 .
## G -0.159850 0.028229 -5.663 3.37e-08 ***
## MP 0.004149 0.001154 3.594 0.000378 ***
## `TS%` 18.862452 14.788941 1.275 0.203098
## FG 0.087816 0.017676 4.968 1.11e-06 ***
## FGA -0.026641 0.007741 -3.442 0.000657 ***
## `FG%` -27.293938 14.542887 -1.877 0.061477 .
## `2PA` -0.011624 0.004766 -2.439 0.015284 *
## FTA 0.013505 0.004080 3.310 0.001043 **
## `FT%` -2.082230 3.854731 -0.540 0.589460
## AST 0.008802 0.003580 2.458 0.014497 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.827 on 313 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6196
## F-statistic: 14.22 on 44 and 313 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Tm + G + MP + `TS%` +
## FG + FGA + `FG%` + `2PA` + FTA + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8548 -2.7673 -0.2873 3.0274 15.5516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.020890 4.675694 3.426 0.000693 ***
## PosPF -1.026501 0.933351 -1.100 0.272261
## PosPG -4.326583 1.223183 -3.537 0.000465 ***
## PosSF -2.817146 1.063546 -2.649 0.008486 **
## PosSG -3.544884 1.119954 -3.165 0.001702 **
## TmBOS -6.196685 2.180666 -2.842 0.004782 **
## TmBRK -6.288926 2.263363 -2.779 0.005789 **
## TmCHI -3.718131 2.069437 -1.797 0.073346 .
## TmCHO -3.871668 2.132487 -1.816 0.070391 .
## TmCLE 0.207959 2.228365 0.093 0.925706
## TmDAL -6.372841 2.070643 -3.078 0.002270 **
## TmDEN -8.074077 2.148268 -3.758 0.000204 ***
## TmDET -5.471832 2.270038 -2.410 0.016507 *
## TmGSW -4.810971 2.097455 -2.294 0.022467 *
## TmHOU -7.037423 2.221988 -3.167 0.001691 **
## TmIND -4.200427 2.253281 -1.864 0.063235 .
## TmLAC -0.199023 2.251425 -0.088 0.929616
## TmLAL -7.129292 2.447613 -2.913 0.003839 **
## TmMEM -3.135079 2.241505 -1.399 0.162905
## TmMIA -3.437127 2.158878 -1.592 0.112371
## TmMIL -3.796548 2.149454 -1.766 0.078320 .
## TmMIN -9.160278 2.254586 -4.063 6.13e-05 ***
## TmNOP -2.455896 2.254089 -1.090 0.276756
## TmNYK -5.770647 2.193575 -2.631 0.008941 **
## TmOKC -0.664463 2.181226 -0.305 0.760851
## TmORL -5.307012 2.125643 -2.497 0.013049 *
## TmPHI -8.986538 2.318202 -3.877 0.000129 ***
## TmPHO -5.269972 2.144993 -2.457 0.014556 *
## TmPOR -2.373148 2.135765 -1.111 0.267355
## TmSAC -8.191351 2.088563 -3.922 0.000108 ***
## TmSAS -3.958534 2.126741 -1.861 0.063634 .
## TmTOR -1.296021 2.109286 -0.614 0.539372
## TmTOT -6.371101 1.720952 -3.702 0.000253 ***
## TmUTA -3.491915 2.090357 -1.670 0.095819 .
## TmWAS -3.848108 2.173036 -1.771 0.077557 .
## G -0.159647 0.028194 -5.662 3.37e-08 ***
## MP 0.004207 0.001148 3.664 0.000291 ***
## `TS%` 14.717145 12.627825 1.165 0.244720
## FG 0.087203 0.017619 4.949 1.22e-06 ***
## FGA -0.026295 0.007706 -3.412 0.000729 ***
## `FG%` -22.897220 12.038176 -1.902 0.058079 .
## `2PA` -0.012263 0.004611 -2.659 0.008232 **
## FTA 0.014020 0.003963 3.538 0.000464 ***
## AST 0.008910 0.003571 2.495 0.013101 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.822 on 314 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6205
## F-statistic: 14.57 on 43 and 314 DF, p-value: < 2.2e-16
The above model run is the best model we’ll be able to eke out without any type of data transformation. If we remove the factor with the next highest p-value, TS%, the model returns a higher residual standard error (4.825 versus 4.822) as well as a lower adjusted R-squared value (0.62 versus 0.6205):
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Tm + G + MP + FG +
## FGA + `FG%` + `2PA` + FTA + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8021 -2.9458 -0.3667 2.9623 15.7687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.659795 4.093191 4.559 7.37e-06 ***
## PosPF -0.883689 0.925798 -0.955 0.340554
## PosPG -4.181362 1.217512 -3.434 0.000673 ***
## PosSF -2.533525 1.035919 -2.446 0.015005 *
## PosSG -3.220868 1.085514 -2.967 0.003236 **
## TmBOS -6.404589 2.174593 -2.945 0.003468 **
## TmBRK -6.402664 2.262544 -2.830 0.004956 **
## TmCHI -3.804716 2.069279 -1.839 0.066905 .
## TmCHO -3.850031 2.133619 -1.804 0.072114 .
## TmCLE 0.057248 2.225874 0.026 0.979498
## TmDAL -6.424537 2.071344 -3.102 0.002099 **
## TmDEN -8.148359 2.148543 -3.793 0.000179 ***
## TmDET -5.616280 2.267940 -2.476 0.013797 *
## TmGSW -5.030329 2.090181 -2.407 0.016675 *
## TmHOU -7.425419 2.198156 -3.378 0.000822 ***
## TmIND -4.432995 2.245703 -1.974 0.049256 *
## TmLAC -0.505987 2.237237 -0.226 0.821219
## TmLAL -7.370429 2.440239 -3.020 0.002731 **
## TmMEM -3.170121 2.242578 -1.414 0.158465
## TmMIA -3.572091 2.156996 -1.656 0.098708 .
## TmMIL -3.985195 2.144569 -1.858 0.064063 .
## TmMIN -9.062082 2.254292 -4.020 7.29e-05 ***
## TmNOP -2.501026 2.255038 -1.109 0.268240
## TmNYK -5.799560 2.194682 -2.643 0.008640 **
## TmOKC -0.906151 2.172580 -0.417 0.676900
## TmORL -5.363804 2.126292 -2.523 0.012141 *
## TmPHI -9.211694 2.311452 -3.985 8.38e-05 ***
## TmPHO -5.362271 2.144749 -2.500 0.012921 *
## TmPOR -2.686588 2.119968 -1.267 0.205992
## TmSAC -8.264730 2.088801 -3.957 9.39e-05 ***
## TmSAS -4.000362 2.127648 -1.880 0.061006 .
## TmTOR -1.346284 2.110044 -0.638 0.523914
## TmTOT -6.427731 1.721244 -3.734 0.000223 ***
## TmUTA -3.606637 2.089225 -1.726 0.085273 .
## TmWAS -3.919862 2.173399 -1.804 0.072255 .
## G -0.156140 0.028049 -5.567 5.56e-08 ***
## MP 0.004016 0.001137 3.532 0.000474 ***
## FG 0.089992 0.017466 5.152 4.55e-07 ***
## FGA -0.024965 0.007625 -3.274 0.001178 **
## `FG%` -11.915701 7.496500 -1.590 0.112950
## `2PA` -0.015743 0.003517 -4.477 1.06e-05 ***
## FTA 0.015450 0.003770 4.098 5.30e-05 ***
## AST 0.009082 0.003570 2.544 0.011433 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.825 on 315 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.62
## F-statistic: 14.87 on 42 and 315 DF, p-value: < 2.2e-16
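To see why dropping TS% lowers the adjusted R-squared even though the model gets simpler, it can help to recompute the statistic from the reported values. The sketch below uses the standard adjusted R-squared formula; the helper name `adj_r2` is made up for illustration, and the sample size is inferred from the summaries above (314 residual degrees of freedom plus 43 coefficients plus the intercept).

```r
# Adjusted R-squared from R-squared, sample size n, and the number of
# estimated slope coefficients p (intercept not counted).
# adj_r2 is a hypothetical helper name, not part of any package.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n inferred from the model summaries: 314 residual df + 43 + 1 = 358.
n <- 358

with_ts    <- adj_r2(r2 = 0.6662, n = n, p = 43)  # model that keeps TS%
without_ts <- adj_r2(r2 = 0.6647, n = n, p = 42)  # model after dropping TS%

print(round(c(with_ts = with_ts, without_ts = without_ts), 4))
```

The drop in plain R-squared (0.6662 to 0.6647) is slightly larger than what the one saved degree of freedom gives back, which is why the adjusted value slips.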
While our final model still isn’t amazing, it is about twice as good as our initial model - and we still haven’t performed any transformations to the dataset. I think there is still some room for improvement here, which we’ll explore next week.
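One round of the manual elimination used above can be sketched as follows. This is an illustration on the built-in `mtcars` data rather than the NBA data, and it swaps in `drop1()` for reading p-values off `summary()`: `drop1()` reports a single F-test p-value per term, which is convenient when a term is a multi-level factor such as Pos or Tm. The helper `fit_stats` is a made-up name.

```r
# Stand-in model on built-in data (assumption: not the NBA dataset).
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# drop1() refits the model once per term, leaving that term out each time,
# and reports an F-test p-value for the term as a whole.
candidates <- drop1(full, test = "F")
print(candidates)

# Pick the term with the largest p-value (the "<none>" row is NA and ignored).
p_values <- candidates[["Pr(>F)"]]
names(p_values) <- rownames(candidates)
weakest <- names(which.max(p_values))

# Refit without that term.
reduced <- update(full, as.formula(paste(". ~ . -", weakest)))

# Compare the two statistics tracked throughout the elimination.
fit_stats <- function(model) {
  s <- summary(model)
  c(adj_r_squared = s$adj.r.squared, resid_std_error = s$sigma)
}
print(rbind(full = fit_stats(full), reduced = fit_stats(reduced)))
```

Repeating this until the adjusted R-squared stops improving mirrors the by-hand process. Column names that need backticks (such as `TS%`) would need quoting before being pasted into a formula.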
You can actually run this analysis automatically using the stepAIC function from the MASS package, as shown below, with very similar results: it arrives at the same set of predictors as the manual process, except that it also drops TS%. Having done it once by hand, I’ll probably save some time next time by using this function.
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
fit <- lm(salary_2017_in_millions ~ ., data = nba_data_for_model)
step <- MASS::stepAIC(fit, direction = 'both')
## Start: AIC=1184.21
## salary_2017_in_millions ~ Pos + Age + Tm + G + MP + PER + `TS%` +
## FTr + FG + FGA + `FG%` + `2P` + `2PA` + `2P%` + `eFG%` +
## FT + FTA + `FT%` + TRB + AST
##
## Df Sum of Sq RSS AIC
## - TRB 1 0.00 7275.8 1182.2
## - FT 1 0.10 7275.9 1182.2
## - `2P` 1 0.32 7276.1 1182.2
## - `2P%` 1 0.63 7276.4 1182.2
## - PER 1 2.36 7278.1 1182.3
## - Age 1 2.90 7278.7 1182.4
## - `2PA` 1 4.94 7280.7 1182.5
## - FTr 1 5.07 7280.8 1182.5
## - `eFG%` 1 5.10 7280.9 1182.5
## - `FT%` 1 13.29 7289.1 1182.9
## - FTA 1 13.61 7289.4 1182.9
## - `TS%` 1 14.76 7290.5 1182.9
## - `FG%` 1 25.75 7301.5 1183.5
## <none> 7275.8 1184.2
## - FGA 1 55.22 7331.0 1184.9
## - Pos 4 211.69 7487.5 1186.5
## - AST 1 98.91 7374.7 1187.0
## - FG 1 101.68 7377.5 1187.2
## - MP 1 172.78 7448.5 1190.6
## - Tm 30 1900.08 9175.9 1207.3
## - G 1 672.45 7948.2 1213.9
##
## Step: AIC=1182.21
## salary_2017_in_millions ~ Pos + Age + Tm + G + MP + PER + `TS%` +
## FTr + FG + FGA + `FG%` + `2P` + `2PA` + `2P%` + `eFG%` +
## FT + FTA + `FT%` + AST
##
## Df Sum of Sq RSS AIC
## - FT 1 0.10 7275.9 1180.2
## - `2P` 1 0.32 7276.1 1180.2
## - `2P%` 1 0.63 7276.4 1180.2
## - PER 1 2.71 7278.5 1180.3
## - Age 1 2.90 7278.7 1180.4
## - `2PA` 1 4.94 7280.7 1180.5
## - FTr 1 5.07 7280.8 1180.5
## - `eFG%` 1 5.15 7280.9 1180.5
## - `FT%` 1 13.38 7289.2 1180.9
## - FTA 1 14.57 7290.3 1180.9
## - `TS%` 1 14.98 7290.8 1181.0
## - `FG%` 1 26.41 7302.2 1181.5
## <none> 7275.8 1182.2
## - FGA 1 56.04 7331.8 1183.0
## + TRB 1 0.00 7275.8 1184.2
## - AST 1 100.60 7376.4 1185.1
## - FG 1 101.70 7377.5 1185.2
## - Pos 4 294.64 7570.4 1188.4
## - MP 1 249.46 7525.2 1192.3
## - Tm 30 1901.56 9177.3 1205.3
## - G 1 676.68 7952.4 1212.0
##
## Step: AIC=1180.22
## salary_2017_in_millions ~ Pos + Age + Tm + G + MP + PER + `TS%` +
## FTr + FG + FGA + `FG%` + `2P` + `2PA` + `2P%` + `eFG%` +
## FTA + `FT%` + AST
##
## Df Sum of Sq RSS AIC
## - `2P` 1 0.33 7276.2 1178.2
## - `2P%` 1 0.67 7276.5 1178.2
## - Age 1 2.90 7278.8 1178.4
## - PER 1 2.91 7278.8 1178.4
## - `2PA` 1 4.84 7280.7 1178.5
## - FTr 1 5.66 7281.5 1178.5
## - `eFG%` 1 7.87 7283.7 1178.6
## - `FT%` 1 13.46 7289.3 1178.9
## - `TS%` 1 24.39 7300.3 1179.4
## - `FG%` 1 30.52 7306.4 1179.7
## <none> 7275.9 1180.2
## - FGA 1 59.22 7335.1 1181.1
## + FT 1 0.10 7275.8 1182.2
## + TRB 1 0.01 7275.9 1182.2
## - FTA 1 96.73 7372.6 1183.0
## - FG 1 102.54 7378.4 1183.2
## - AST 1 103.01 7378.9 1183.2
## - Pos 4 295.84 7571.7 1186.5
## - MP 1 266.28 7542.2 1191.1
## - Tm 30 1923.34 9199.2 1204.2
## - G 1 679.27 7955.2 1210.2
##
## Step: AIC=1178.24
## salary_2017_in_millions ~ Pos + Age + Tm + G + MP + PER + `TS%` +
## FTr + FG + FGA + `FG%` + `2PA` + `2P%` + `eFG%` + FTA + `FT%` +
## AST
##
## Df Sum of Sq RSS AIC
## - `2P%` 1 0.36 7276.6 1176.2
## - PER 1 2.63 7278.8 1176.4
## - Age 1 2.94 7279.2 1176.4
## - FTr 1 5.74 7281.9 1176.5
## - `eFG%` 1 8.10 7284.3 1176.6
## - `FT%` 1 14.27 7290.5 1176.9
## - `TS%` 1 27.36 7303.6 1177.6
## - `FG%` 1 30.61 7306.8 1177.7
## <none> 7276.2 1178.2
## + `2P` 1 0.33 7275.9 1180.2
## + FT 1 0.12 7276.1 1180.2
## + TRB 1 0.00 7276.2 1180.2
## - FTA 1 96.82 7373.0 1181.0
## - AST 1 104.95 7381.2 1181.4
## - `2PA` 1 126.34 7402.5 1182.4
## - Pos 4 298.37 7574.6 1184.6
## - MP 1 272.80 7549.0 1189.4
## - FGA 1 273.78 7550.0 1189.5
## - FG 1 445.12 7721.3 1197.5
## - Tm 30 1926.34 9202.6 1202.3
## - G 1 684.57 7960.8 1208.4
##
## Step: AIC=1176.25
## salary_2017_in_millions ~ Pos + Age + Tm + G + MP + PER + `TS%` +
## FTr + FG + FGA + `FG%` + `2PA` + `eFG%` + FTA + `FT%` + AST
##
## Df Sum of Sq RSS AIC
## - PER 1 2.55 7279.1 1174.4
## - Age 1 2.92 7279.5 1174.4
## - FTr 1 6.20 7282.8 1174.6
## - `eFG%` 1 8.33 7284.9 1174.7
## - `FT%` 1 14.39 7291.0 1175.0
## - `TS%` 1 27.73 7304.3 1175.6
## <none> 7276.6 1176.2
## - `FG%` 1 41.09 7317.7 1176.3
## + `2P%` 1 0.36 7276.2 1178.2
## + FT 1 0.14 7276.4 1178.2
## + `2P` 1 0.02 7276.5 1178.2
## + TRB 1 0.01 7276.6 1178.2
## - FTA 1 98.94 7375.5 1179.1
## - AST 1 105.19 7381.8 1179.4
## - `2PA` 1 137.48 7414.1 1181.0
## - Pos 4 299.19 7575.8 1182.7
## - MP 1 272.45 7549.0 1187.4
## - FGA 1 274.30 7550.9 1187.5
## - FG 1 452.13 7728.7 1195.8
## - Tm 30 1926.17 9202.7 1200.3
## - G 1 686.83 7963.4 1206.5
##
## Step: AIC=1174.38
## salary_2017_in_millions ~ Pos + Age + Tm + G + MP + `TS%` + FTr +
## FG + FGA + `FG%` + `2PA` + `eFG%` + FTA + `FT%` + AST
##
## Df Sum of Sq RSS AIC
## - Age 1 3.64 7282.8 1172.6
## - FTr 1 6.37 7285.5 1172.7
## - `eFG%` 1 11.04 7290.2 1172.9
## - `FT%` 1 16.44 7295.6 1173.2
## - `TS%` 1 39.51 7318.6 1174.3
## - `FG%` 1 39.72 7318.8 1174.3
## <none> 7279.1 1174.4
## + PER 1 2.55 7276.6 1176.2
## + TRB 1 0.47 7278.6 1176.4
## + FT 1 0.33 7278.8 1176.4
## + `2P%` 1 0.28 7278.8 1176.4
## + `2P` 1 0.01 7279.1 1176.4
## - FTA 1 106.36 7385.5 1177.6
## - AST 1 130.20 7409.3 1178.7
## - `2PA` 1 142.42 7421.5 1179.3
## - Pos 4 349.76 7628.9 1183.2
## - FGA 1 288.78 7567.9 1186.3
## - MP 1 308.98 7588.1 1187.3
## - FG 1 549.45 7828.6 1198.4
## - Tm 30 1972.90 9252.0 1200.2
## - G 1 729.05 8008.2 1206.5
##
## Step: AIC=1172.56
## salary_2017_in_millions ~ Pos + Tm + G + MP + `TS%` + FTr + FG +
## FGA + `FG%` + `2PA` + `eFG%` + FTA + `FT%` + AST
##
## Df Sum of Sq RSS AIC
## - FTr 1 5.78 7288.5 1170.8
## - `eFG%` 1 10.92 7293.7 1171.1
## - `FT%` 1 16.81 7299.6 1171.4
## - `FG%` 1 37.82 7320.6 1172.4
## - `TS%` 1 38.15 7320.9 1172.4
## <none> 7282.8 1172.6
## + Age 1 3.64 7279.1 1174.4
## + PER 1 3.27 7279.5 1174.4
## + TRB 1 0.70 7282.1 1174.5
## + FT 1 0.35 7282.4 1174.5
## + `2P%` 1 0.25 7282.5 1174.5
## + `2P` 1 0.01 7282.8 1174.6
## - FTA 1 105.45 7388.2 1175.7
## - AST 1 126.59 7409.4 1176.7
## - `2PA` 1 141.94 7424.7 1177.5
## - Pos 4 349.19 7631.9 1181.3
## - FGA 1 285.35 7568.1 1184.3
## - MP 1 305.34 7588.1 1185.3
## - FG 1 547.04 7829.8 1196.5
## - Tm 30 1973.81 9256.6 1198.4
## - G 1 726.99 8009.7 1204.6
##
## Step: AIC=1170.84
## salary_2017_in_millions ~ Pos + Tm + G + MP + `TS%` + FG + FGA +
## `FG%` + `2PA` + `eFG%` + FTA + `FT%` + AST
##
## Df Sum of Sq RSS AIC
## - `eFG%` 1 5.58 7294.1 1169.1
## - `FT%` 1 11.70 7300.2 1169.4
## - `TS%` 1 32.77 7321.3 1170.5
## <none> 7288.5 1170.8
## - `FG%` 1 59.36 7347.9 1171.8
## + FTr 1 5.78 7282.8 1172.6
## + PER 1 3.38 7285.2 1172.7
## + Age 1 3.05 7285.5 1172.7
## + `2P%` 1 0.63 7287.9 1172.8
## + TRB 1 0.43 7288.1 1172.8
## + FT 1 0.34 7288.2 1172.8
## + `2P` 1 0.04 7288.5 1172.8
## - AST 1 134.12 7422.7 1175.4
## - `2PA` 1 137.15 7425.7 1175.5
## - FTA 1 159.73 7448.3 1176.6
## - Pos 4 349.85 7638.4 1179.6
## - FGA 1 280.78 7569.3 1182.4
## - MP 1 304.30 7592.8 1183.5
## - FG 1 549.91 7838.5 1194.9
## - Tm 30 1968.97 9257.5 1196.5
## - G 1 728.71 8017.2 1203.0
##
## Step: AIC=1169.12
## salary_2017_in_millions ~ Pos + Tm + G + MP + `TS%` + FG + FGA +
## `FG%` + `2PA` + FTA + `FT%` + AST
##
## Df Sum of Sq RSS AIC
## - `FT%` 1 6.80 7300.9 1167.5
## - `TS%` 1 37.91 7332.0 1169.0
## <none> 7294.1 1169.1
## + PER 1 5.66 7288.5 1170.8
## + `eFG%` 1 5.58 7288.5 1170.8
## + Age 1 3.32 7290.8 1171.0
## + FT 1 3.25 7290.9 1171.0
## + `2P%` 1 0.54 7293.6 1171.1
## + FTr 1 0.44 7293.7 1171.1
## + `2P` 1 0.03 7294.1 1171.1
## + TRB 1 0.02 7294.1 1171.1
## - `FG%` 1 82.08 7376.2 1171.1
## - `2PA` 1 138.63 7432.8 1173.9
## - AST 1 140.84 7435.0 1174.0
## - FTA 1 255.29 7549.4 1179.4
## - Pos 4 384.57 7678.7 1179.5
## - FGA 1 276.01 7570.1 1180.4
## - MP 1 300.99 7595.1 1181.6
## - FG 1 575.21 7869.3 1194.3
## - Tm 30 1965.34 9259.5 1194.5
## - G 1 747.27 8041.4 1202.0
##
## Step: AIC=1167.45
## salary_2017_in_millions ~ Pos + Tm + G + MP + `TS%` + FG + FGA +
## `FG%` + `2PA` + FTA + AST
##
## Df Sum of Sq RSS AIC
## - `TS%` 1 31.58 7332.5 1167.0
## <none> 7300.9 1167.5
## + `FT%` 1 6.80 7294.1 1169.1
## + PER 1 5.99 7294.9 1169.2
## + Age 1 3.75 7297.2 1169.3
## + `eFG%` 1 0.68 7300.2 1169.4
## + `2P%` 1 0.44 7300.5 1169.4
## + TRB 1 0.22 7300.7 1169.4
## + FTr 1 0.14 7300.8 1169.4
## + FT 1 0.11 7300.8 1169.4
## + `2P` 1 0.01 7300.9 1169.5
## - `FG%` 1 84.12 7385.0 1169.5
## - AST 1 144.77 7445.7 1172.5
## - `2PA` 1 164.43 7465.4 1173.4
## - Pos 4 394.61 7695.5 1178.3
## - FGA 1 270.74 7571.7 1178.5
## - FTA 1 291.07 7592.0 1179.5
## - MP 1 312.22 7613.1 1180.4
## - FG 1 569.55 7870.5 1192.3
## - Tm 30 1958.56 9259.5 1192.5
## - G 1 745.50 8046.4 1200.3
##
## Step: AIC=1167
## salary_2017_in_millions ~ Pos + Tm + G + MP + FG + FGA + `FG%` +
## `2PA` + FTA + AST
##
## Df Sum of Sq RSS AIC
## <none> 7332.5 1167.0
## + `TS%` 1 31.58 7300.9 1167.5
## - `FG%` 1 58.81 7391.3 1167.9
## + PER 1 18.67 7313.8 1168.1
## + `eFG%` 1 10.97 7321.5 1168.5
## + FT 1 8.48 7324.0 1168.6
## + `2P` 1 1.91 7330.6 1168.9
## + Age 1 1.27 7331.2 1168.9
## + `FT%` 1 0.47 7332.0 1169.0
## + `2P%` 1 0.46 7332.0 1169.0
## + FTr 1 0.31 7332.2 1169.0
## + TRB 1 0.15 7332.4 1169.0
## - AST 1 150.67 7483.2 1172.3
## - Pos 4 368.31 7700.8 1176.5
## - FGA 1 249.52 7582.0 1177.0
## - MP 1 290.44 7622.9 1178.9
## - FTA 1 390.96 7723.5 1183.6
## - `2PA` 1 466.52 7799.0 1187.1
## - Tm 30 1946.23 9278.7 1191.3
## - FG 1 617.95 7950.5 1194.0
## - G 1 721.32 8053.8 1198.6
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## salary_2017_in_millions ~ Pos + Age + Tm + G + MP + PER + `TS%` +
## FTr + FG + FGA + `FG%` + `2P` + `2PA` + `2P%` + `eFG%` +
## FT + FTA + `FT%` + TRB + AST
##
## Final Model:
## salary_2017_in_millions ~ Pos + Tm + G + MP + FG + FGA + `FG%` +
## `2PA` + FTA + AST
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 305 7275.772 1184.214
## 2 - TRB 1 0.000207136 306 7275.772 1182.214
## 3 - FT 1 0.104079693 307 7275.876 1180.220
## 4 - `2P` 1 0.333172397 308 7276.210 1178.236
## 5 - `2P%` 1 0.358263313 309 7276.568 1176.254
## 6 - PER 1 2.551185046 310 7279.119 1174.379
## 7 - Age 1 3.639134778 311 7282.758 1172.558
## 8 - FTr 1 5.781650500 312 7288.540 1170.842
## 9 - `eFG%` 1 5.583338028 313 7294.123 1169.116
## 10 - `FT%` 1 6.799831890 314 7300.923 1167.450
## 11 - `TS%` 1 31.581893335 315 7332.505 1166.995
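The AIC values in the trace above can be reproduced from the RSS column. For `lm` fits, `stepAIC()` prints the value returned by `extractAIC()`, which is `n * log(RSS / n) + 2 * edf`, where `edf` is the number of estimated coefficients. The sketch below plugs in the reported figures; `step_aic` is a made-up helper name, and `n = 358` is inferred from the residual degrees of freedom.

```r
# AIC on the scale stepAIC() prints for lm fits (as computed by extractAIC()).
# step_aic is a hypothetical helper name used only for this check.
step_aic <- function(rss, n, edf) {
  n * log(rss / n) + 2 * edf
}

n <- 358  # inferred: 305 residual df + 53 estimated coefficients

initial <- step_aic(rss = 7275.772, n = n, edf = n - 305)  # full model
final   <- step_aic(rss = 7332.505, n = n, edf = n - 315)  # after 10 drops
print(round(c(initial = initial, final = final), 2))

# Change in AIC from the last step (dropping TS%): the RSS term rises,
# but the penalty falls by 2 for the one coefficient removed.
delta <- n * log(7332.505 / 7300.923) - 2
print(round(delta, 3))
```

A negative `delta` means the penalty saved outweighs the loss of fit, so AIC favors dropping TS%. That is why the automated search goes one step further than the manual process, which stopped when the adjusted R-squared dipped.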
final_model <- lm(salary_2017_in_millions ~ Pos + Tm + G + MP + FG + FGA + `FG%` + `2PA` + FTA + AST, data = nba_data_for_model)
summary(final_model)
##
## Call:
## lm(formula = salary_2017_in_millions ~ Pos + Tm + G + MP + FG +
## FGA + `FG%` + `2PA` + FTA + AST, data = nba_data_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8021 -2.9458 -0.3667 2.9623 15.7687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.659795 4.093191 4.559 7.37e-06 ***
## PosPF -0.883689 0.925798 -0.955 0.340554
## PosPG -4.181362 1.217512 -3.434 0.000673 ***
## PosSF -2.533525 1.035919 -2.446 0.015005 *
## PosSG -3.220868 1.085514 -2.967 0.003236 **
## TmBOS -6.404589 2.174593 -2.945 0.003468 **
## TmBRK -6.402664 2.262544 -2.830 0.004956 **
## TmCHI -3.804716 2.069279 -1.839 0.066905 .
## TmCHO -3.850031 2.133619 -1.804 0.072114 .
## TmCLE 0.057248 2.225874 0.026 0.979498
## TmDAL -6.424537 2.071344 -3.102 0.002099 **
## TmDEN -8.148359 2.148543 -3.793 0.000179 ***
## TmDET -5.616280 2.267940 -2.476 0.013797 *
## TmGSW -5.030329 2.090181 -2.407 0.016675 *
## TmHOU -7.425419 2.198156 -3.378 0.000822 ***
## TmIND -4.432995 2.245703 -1.974 0.049256 *
## TmLAC -0.505987 2.237237 -0.226 0.821219
## TmLAL -7.370429 2.440239 -3.020 0.002731 **
## TmMEM -3.170121 2.242578 -1.414 0.158465
## TmMIA -3.572091 2.156996 -1.656 0.098708 .
## TmMIL -3.985195 2.144569 -1.858 0.064063 .
## TmMIN -9.062082 2.254292 -4.020 7.29e-05 ***
## TmNOP -2.501026 2.255038 -1.109 0.268240
## TmNYK -5.799560 2.194682 -2.643 0.008640 **
## TmOKC -0.906151 2.172580 -0.417 0.676900
## TmORL -5.363804 2.126292 -2.523 0.012141 *
## TmPHI -9.211694 2.311452 -3.985 8.38e-05 ***
## TmPHO -5.362271 2.144749 -2.500 0.012921 *
## TmPOR -2.686588 2.119968 -1.267 0.205992
## TmSAC -8.264730 2.088801 -3.957 9.39e-05 ***
## TmSAS -4.000362 2.127648 -1.880 0.061006 .
## TmTOR -1.346284 2.110044 -0.638 0.523914
## TmTOT -6.427731 1.721244 -3.734 0.000223 ***
## TmUTA -3.606637 2.089225 -1.726 0.085273 .
## TmWAS -3.919862 2.173399 -1.804 0.072255 .
## G -0.156140 0.028049 -5.567 5.56e-08 ***
## MP 0.004016 0.001137 3.532 0.000474 ***
## FG 0.089992 0.017466 5.152 4.55e-07 ***
## FGA -0.024965 0.007625 -3.274 0.001178 **
## `FG%` -11.915701 7.496500 -1.590 0.112950
## `2PA` -0.015743 0.003517 -4.477 1.06e-05 ***
## FTA 0.015450 0.003770 4.098 5.30e-05 ***
## AST 0.009082 0.003570 2.544 0.011433 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.825 on 315 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.62
## F-statistic: 14.87 on 42 and 315 DF, p-value: < 2.2e-16
Now that we have a better model, let’s do some analysis of the residuals:
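A four-panel diagnostic grid with the layout discussed in the following paragraphs can be sketched with base R’s `plot()` method for `lm` objects. This is an assumption about how such a figure could be produced, not the code behind the original plot. It uses the built-in `mtcars` data as a stand-in, and `par(mfcol = ...)` fills the panels column by column so that Residuals vs Fitted lands top-left and Normal Q-Q bottom-left.

```r
# Stand-in model on built-in data (assumption: swap in the fitted NBA model).
model <- lm(mpg ~ wt + hp, data = mtcars)

# mfcol fills the 2 x 2 grid column by column:
#   top-left: Residuals vs Fitted, bottom-left: Normal Q-Q,
#   top-right: Scale-Location,     bottom-right: Residuals vs Leverage
old_par <- par(mfcol = c(2, 2))
plot(model, which = c(1, 2, 3, 5))
par(old_par)  # restore the previous graphics settings
```

Using `par(mfrow = c(2, 2))` instead would fill row by row and put the Normal Q-Q plot in the top-right.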
Let’s start by analyzing the Residuals vs Fitted plot on the top-left. This plot shows us whether our residuals have any non-linear patterns. Looking at the plot, we can see that the red line is not perfectly straight and that it almost seems to have a very gentle sigmoid curve. As the curves are fairly faint, I think it’s safe to say that it is appropriate to categorize the relationships between our predictor variables and our outcome variable as linear.
Moving to the Normal Q-Q plot on the bottom-left, this plot shows us whether our residuals are normally distributed. For the most part, our residuals follow the diagonal line. We do see some deviation in the top right of the chart, although it is not severe.
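The visual reading of a Q-Q plot can be paired with a numeric check. The sketch below is one option, assuming simulated stand-in data rather than the NBA model: it runs the Shapiro-Wilk test from base R on the residuals and lists the largest standardized residuals, which are the points that sit in the upper-right tail of a Q-Q plot.

```r
set.seed(605)  # arbitrary seed so the illustration is reproducible

# Simulated stand-in data with normally distributed errors (assumption).
n <- 300
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)
model <- lm(y ~ x)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
res <- residuals(model)
sw <- shapiro.test(res)
print(sw)

# Largest standardized residuals, i.e. the upper tail of the Q-Q plot.
print(head(sort(rstandard(model), decreasing = TRUE), 5))
```

With several hundred observations the test can flag deviations too small to matter in practice, so it complements the plot rather than replacing it.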
Turning our attention now to the Scale-Location plot on the top-right, this plot helps us check the assumption of equal variance (homoscedasticity). If the residuals had equal variance, we would see a horizontal line with equally spread points. In our plot, however, the points at fitted values from 0 to 5 have a smaller variance than the rest of the plot. What we are seeing is referred to as heteroscedasticity. There isn’t a tremendous amount of difference in the variation, but there definitely is some. This plot is probably right on the edge of what we would deem acceptable, but as the heteroscedasticity is fairly faint, I’ll consider this assumption met.
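When a Scale-Location plot is borderline, a rough numeric check can help. The sketch below hand-rolls a Breusch-Pagan-style test (the studentized variant) in base R by regressing the squared residuals on the fitted values. The data are simulated with deliberately non-constant variance, so this is an illustration of the technique, not a result for the NBA model.

```r
set.seed(605)  # arbitrary seed so the illustration is reproducible

# Simulated stand-in data: the error spread grows with x (assumption).
n <- 300
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)
model <- lm(y ~ x)

# Auxiliary regression: squared residuals on the fitted values.
aux <- lm(residuals(model)^2 ~ fitted(model))

# Under constant variance, n * R^2 of the auxiliary regression is
# approximately chi-squared with 1 degree of freedom.
stat <- n * summary(aux)$r.squared
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)
print(c(statistic = stat, p_value = p_value))
```

A small p-value points to variance that changes with the fitted values. Faint heteroscedasticity like that described above may or may not cross a conventional threshold.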
Lastly, let’s look at the Residuals vs Leverage plot on the bottom-right. This plot helps us determine whether we have influential outliers in our data that are pulling the regression line in one direction or another. In this plot, we aren’t looking for patterns; we are looking for outlying values in the upper-right or lower-right corner. Those are the spots where cases can be heavily influential on the least-squares line. Specifically, we are looking for points that fall outside the red dashed lines, which mark Cook’s distance contours. Points outside those lines would be influential to the regression results. In our plot, we can’t even see the Cook’s distance lines because all the cases are well inside them, so we can take some comfort that our outliers, if any, are not having a large effect on the model.
All in all, we’ve built a reasonably capable multi-factor linear regression model whose predictions should at least be directionally useful. Our next step in the coming week will be to see if we can improve the model with some data transformations (log transformation, including a quadratic term, etc.).
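As a numeric cross-check of the Residuals vs Leverage reading, Cook’s distance can be computed directly. The sketch below uses the built-in `mtcars` data as a stand-in for the NBA model. The 0.5 cutoff matches the inner dashed contour that `plot()` draws by default for `lm` objects (contours at 0.5 and 1).

```r
# Stand-in model on built-in data (assumption: swap in the fitted NBA model).
model <- lm(mpg ~ wt + hp, data = mtcars)

cooks <- cooks.distance(model)  # influence of each observation on the fit
lever <- hatvalues(model)       # leverage: x-axis of the plot

# The three most influential observations and their leverage.
top <- head(sort(cooks, decreasing = TRUE), 3)
print(cbind(cooks_distance = top, leverage = lever[names(top)]))

# Observations beyond the inner contour; an empty result means no case
# crosses it, which is when the dashed lines fall outside the plotted area.
print(names(cooks)[cooks > 0.5])
```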