The introduction of advanced statistics to basketball, and particularly NBA basketball, is changing the way that experts and analysts cover the sport. As the search for an exact blueprint for individual and team success continues, the statistics that are constructed become increasingly outlandish. One of the earliest “advanced” statistics to be popularized is a player’s plus/minus rating. This value is the team’s points minus the opponent’s points over the time that player was in the game. For example, if you played 30 minutes in a game, and in those 30 minutes your team scored 65 points and the opponent scored 55, you would have a +10 plus/minus rating for that game. The goal of this machine learning model is to see whether we can predict a player’s plus/minus rating in a particular game from an assortment of that player’s other statistics that day.
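To make the arithmetic concrete, here is a minimal sketch in R. The stint-by-stint breakdown is a made-up illustration (the data set used here only records one plus/minus value per player per game), chosen so that the totals match the 65 to 55 example above.

```r
# Hypothetical stint log for one player in one game: points scored by each
# side during each stretch the player was on the court.
stints <- data.frame(
  team_pts = c(22, 18, 25),
  opp_pts  = c(20, 15, 20)
)

# Plus/minus is the team's points minus the opponent's points over those stints.
plus_minus <- sum(stints$team_pts) - sum(stints$opp_pts)
plus_minus
## [1] 10
```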
The data I used contains information from NBA games between 2008 and 2015, and it covers 984 different players in that timespan. The data itself has 105 variables, each representing a certain statistic, and each row represents the statistics of a player in a particular game. The rows are grouped by player name and sorted in alphabetical order, and the games for each player are then sorted in chronological order. I will discuss later how I cut down the number of variables in the table, but I ended up considering only 58 of them; here are their labels and explanations:
| Variable | Type | Description |
|---|---|---|
| name | Factor | player name |
| venue | Factor | home or away |
| team | Factor | player’s team |
| date | Factor | date of game |
| start | Factor | did the player start? |
| opp | Factor | opposing team |
| minutes | num | minutes played |
| true_shooting | num | true shooting percentage |
| efg | num | effective field goal percentage |
| tpar | num | three point attempt rate |
| ftar | num | free throw attempt rate |
| orb_perc | num | percentage of offensive rebounds player got while on court (offensive rebounding percentage) |
| drb_perc | num | percentage of defensive rebounds player got while on court (defensive rebounding percentage) |
| trb_perc | num | percentage of total rebounds player got while on court (total rebounding percentage) |
| ast_perc | num | percentage of field goals player assisted while on court |
| stl_perc | num | percentage of opponent’s possessions ended with steal by player |
| blk_perc | num | percentage of opponent’s shots that were blocked by player |
| tov_perc | num | estimate of turnovers by player per 100 plays |
| usg_perc | num | percentage of team’s plays used by player while on floor |
| off_rtg | num | number of points produced by player over 100 possessions (offensive rating) |
| def_rtg | int | number of points allowed by player over 100 possessions (defensive rating) |
| fg | int | field goals player made in game |
| fga | int | field goals player attempted in game |
| fg_perc | num | field goal percentage (fg/fga) |
| tp | int | three point field goals player made in game |
| tpa | int | three point field goals player attempted in game |
| tp_perc | num | three point field goal percentage (tp/tpa) |
| ft | int | free throws player made in game |
| fta | int | free throws player attempted in game |
| ft_perc | num | free throw percentage (ft/fta) |
| orb | int | number of offensive rebounds player had in game |
| drb | int | number of defensive rebounds player had in game |
| trb | int | number of total rebounds player had in game |
| ast | int | number of assists player had in game |
| stl | int | number of steals player had in game |
| blk | int | number of blocks player had in game |
| tov | int | number of turnovers player had in game |
| pf | int | number of personal fouls player had in game |
| pts | int | number of points player had in game |
| plus_minus | int | player’s plus/minus rating in game |
| pace | num | number of possessions per 48 minutes by player’s team (pace) |
| team_efg_perc | num | player’s team’s effective field goal percentage |
| team_tov_perc | num | estimate of number of turnovers per 100 possessions by player’s team |
| team_orb_perc | num | player’s team’s offensive rebounding percentage |
| team_fg_fga | num | player’s team’s field goal attempt percentage |
| team_off_rtg | num | player’s team’s points scored per 100 possessions (offensive rating) |
| opp_pts | int | number of points scored by player’s opponent |
| opp_trb | int | number of total rebounds by player’s opponent |
| opp_tov | int | number of turnovers by player’s opponent |
| opp_stl | int | number of steals by player’s opponent |
| opp_ast | int | number of assists by player’s opponent |
| opp_tp | int | number of three point field goals by player’s opponent |
| opp_pace | num | number of possessions per 48 minutes by player’s opponent (pace) |
| opp_team_efg_perc | num | opposing team’s effective field goal percentage |
| opp_team_tov_perc | num | estimate of number of turnovers per 100 possessions by opposing team |
| opp_team_orb_perc | num | opposing team’s offensive rebounding percentage |
| opp_team_fg_fga | num | opposing team’s field goal attempt percentage |
| opp_team_off_rtg | num | opposing team’s points scored per 100 possessions (offensive rating) |
Note: effective field goal percentage is an advanced statistic that accounts for the additional points provided by three point shots, and true shooting percentage is an advanced statistic that accounts for field goals, three point field goals, and free throws.
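For reference, the usual definitions of these two statistics can be sketched as small R functions. The function names and the sample stat line are hypothetical, and I am assuming that the data set follows the standard formulas (a made three counts as 1.5 field goals, and 0.44 free throw attempts count as one shooting possession).

```r
# Effective field goal percentage: a made three counts as 1.5 made field goals.
efg_pct <- function(fg, tp, fga) (fg + 0.5 * tp) / fga

# True shooting percentage: points per shooting possession, where 0.44 * fta
# approximates the possessions used on free throws.
ts_pct <- function(pts, fga, fta) pts / (2 * (fga + 0.44 * fta))

# Hypothetical line: 5-of-10 from the field with 2 threes, 3-of-4 at the line.
fg <- 5; tp <- 2; fga <- 10; ft <- 3; fta <- 4
pts <- 2 * fg + tp + ft   # 15 points

efg <- efg_pct(fg, tp, fga)
ts  <- ts_pct(pts, fga, fta)
c(efg = efg, ts = ts)
```

With no shot attempts at all, both formulas divide zero by zero, which is presumably where many of the NA values in low-minute rows come from.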
The data that I used was found on the following Reddit thread: https://www.reddit.com/r/dfsports/comments/3q89gx/nba_basketball_research_dataset/.
This user was kind enough to compile and post an enormous Excel spreadsheet of data that can be used for all sorts of NBA-related purposes, as there are over 200,000 rows in the file. A great deal of web scraping must have been done for this file, which mostly was probably from
For the most part, this data is pretty clean, but a few changes needed to be made to fit the purpose of this model. First, I loaded all the required packages for this model, imported the data from a csv into RStudio, and took a look at some of the fields:
library(dplyr)
library(ggplot2)
library(caret)
basketball <- read.csv("~/Desktop/basketball research data set.csv", header=TRUE)
head(basketball,20)[1:9]
## name venue team date start opp minutes true_shooting efg
## 1 A.J. Price H CLE 2014-12-02 No MIL 0.000000 NA NA
## 2 A.J. Price A CLE 2014-12-04 No NYK 0.600000 NA NA
## 3 A.J. Price A CLE 2014-12-05 No TOR 0.000000 NA NA
## 4 A.J. Price A CLE 2014-12-08 No BRK 3.366667 0.000 0.00
## 5 A.J. Price H CLE 2014-12-09 No TOR 0.000000 NA NA
## 6 A.J. Price A CLE 2014-12-11 No OKC 1.516667 NA NA
## 7 A.J. Price A CLE 2014-12-12 No NOP 0.000000 NA NA
## 8 A.J. Price H CLE 2014-12-15 No CHO 0.000000 NA NA
## 9 A.J. Price H CLE 2014-12-17 No ATL 7.683333 0.000 0.00
## 10 A.J. Price H CLE 2014-12-19 No BRK 0.000000 NA NA
## 11 A.J. Price H CLE 2014-12-21 No MEM 0.000000 NA NA
## 12 A.J. Price H CLE 2014-12-23 No MIN 2.733333 NA NA
## 13 A.J. Price A CLE 2014-12-26 No ORL 6.366667 NA NA
## 14 A.J. Price H CLE 2014-12-28 No DET 17.650000 0.129 0.00
## 15 A.J. Price A CLE 2014-12-30 No ATL 8.083333 0.500 0.50
## 16 A.J. Price H CLE 2014-12-31 No MIL 11.083333 0.410 0.25
## 17 A.J. Price A CLE 2015-01-02 No CHO 0.000000 NA NA
## 18 A.J. Price H CLE 2015-01-04 No DAL 13.933333 0.500 0.50
## 19 A.J. Price A CLE 2015-01-05 No PHI 13.716667 0.300 0.30
## 20 A.J. Price H IND 2009-10-30 No MIA 1.316667 NA NA
At first glance, this doesn’t look too pretty, but in reality, it is not bad at all. The only glaring issue is that NA values need to be replaced with 0, which is an easy fix. First, however, we should consider the fact that a player who played only one or two minutes would have a negligible impact on the game. Thus, we want to remove the games in which a player had such a small impact. But where do we draw that line? I decided to draw it at twelve minutes, so a player would have to play the equivalent of an entire quarter to qualify for this experiment.
nrow(basketball)
## [1] 208782
basketball <- filter(basketball,minutes >= 12)
nrow(basketball)
## [1] 161678
We see now that we have cut down the number of rows significantly by eliminating data that will be essentially useless for our purposes. Now, we should cut unnecessary variables out of our data set.
variable.names(basketball)
## [1] "name" "venue"
## [3] "team" "date"
## [5] "start" "opp"
## [7] "minutes" "true_shooting"
## [9] "efg" "tpar"
## [11] "ftar" "orb_perc"
## [13] "drb_perc" "trb_perc"
## [15] "ast_perc" "stl_perc"
## [17] "blk_perc" "tov_perc"
## [19] "usg_perc" "off_rtg"
## [21] "def_rtg" "fg"
## [23] "fga" "fg_perc"
## [25] "tp" "tpa"
## [27] "tp_perc" "ft"
## [29] "fta" "ft_perc"
## [31] "orb" "drb"
## [33] "trb" "ast"
## [35] "stl" "blk"
## [37] "tov" "pf"
## [39] "pts" "plus_minus"
## [41] "pace" "team_efg_perc"
## [43] "team_tov_perc" "team_orb_perc"
## [45] "team_fg_fga" "team_off_rtg"
## [47] "suspended" "dnp"
## [49] "season" "dk_fp"
## [51] "fd_fp" "pts_ma"
## [53] "trb_ma" "tov_ma"
## [55] "stl_ma" "ast_ma"
## [57] "tp_ma" "min_ma"
## [59] "dk_fp_ma" "fd_fp_ma"
## [61] "pts_ma_1" "trb_ma_1"
## [63] "tov_ma_1" "stl_ma_1"
## [65] "ast_ma_1" "tp_ma_1"
## [67] "dk_fp_ma_1" "fd_fp_ma_1"
## [69] "min_ma_1" "opp_pts"
## [71] "opp_trb" "opp_tov"
## [73] "opp_stl" "opp_ast"
## [75] "opp_tp" "opp_pace"
## [77] "opp_team_efg_perc" "opp_team_tov_perc"
## [79] "opp_team_orb_perc" "opp_team_fg_fga"
## [81] "opp_team_off_rtg" "opp_pts_ma"
## [83] "opp_trb_ma" "opp_tov_ma"
## [85] "opp_stl_ma" "opp_ast_ma"
## [87] "opp_tp_ma" "opp_pace_ma"
## [89] "opp_team_efg_perc_ma" "opp_team_tov_perc_ma"
## [91] "opp_team_orb_perc_ma" "opp_team_fg_fga_ma"
## [93] "opp_team_off_rtg_ma" "opp_pts_ma_1"
## [95] "opp_trb_ma_1" "opp_tov_ma_1"
## [97] "opp_stl_ma_1" "opp_ast_ma_1"
## [99] "opp_tp_ma_1" "opp_pace_ma_1"
## [101] "opp_team_efg_perc_ma_1" "opp_team_tov_perc_ma_1"
## [103] "opp_team_orb_perc_ma_1" "opp_team_fg_fga_ma_1"
## [105] "opp_team_off_rtg_ma_1"
Every variable whose name ends in “_ma” or “_ma_1” is a “moving average” variable, which means that it is computed from previous games. Since we are predicting a statistic from what happened within a single game, every moving average variable is useless to us. Additionally, the “dnp” (which means “did not play”) and “suspended” variables are made redundant by our minutes restriction, so they can be removed. There are also a few fantasy point variables (the “dk_fp” and “fd_fp” columns) that come from DraftKings and similar sites, so they represent online gambling values that have no purpose here. Furthermore, the “season” variable is unimportant, because the median plus/minus rating is about zero regardless of the season in which the game is played. Let’s remove these columns now:
basketball <- basketball[,c(1:46,70:81)]
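Selecting by position works, but it silently breaks if the column order ever changes. As a hedged alternative, here is a sketch that drops the same kinds of columns by name. The toy data frame is made up and only borrows a handful of column names from the real file; going by the names printed above, the same rule would appear to keep the same 58 columns.

```r
# Toy data frame with a few column names borrowed from the data set.
toy <- data.frame(
  pts = c(10, 12), pts_ma = c(9.5, 10.1), pts_ma_1 = c(8, 10),
  dnp = c(0, 0), suspended = c(0, 0), season = c(2014, 2014),
  dk_fp = c(20.5, 31), fd_fp = c(18, 27.5), dk_fp_ma = c(19, 22),
  opp_pts = c(99, 101)
)

# Moving-average columns end in "_ma" or "_ma_1".
is_ma <- grepl("_ma(_1)?$", names(toy))
# The other columns that the text discards.
is_other <- names(toy) %in% c("dnp", "suspended", "season", "dk_fp", "fd_fp")

kept <- toy[, !(is_ma | is_other), drop = FALSE]
names(kept)
## [1] "pts"     "opp_pts"
```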
Now, we can finally remove all of those NA values by setting them to zero:
# Replace NA with 0 in every column that contains at least one NA
for (i in seq_along(basketball)) {
  na_rows <- is.na(basketball[, i])
  if (any(na_rows)) {
    basketball[na_rows, i] <- 0
  }
}
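The same idea can be written without an explicit loop. This is only a sketch on a made-up three-row data frame, and it restricts the replacement to numeric columns so that factor columns such as the player name are left alone.

```r
# Toy stand-in for the real data: one factor column, two numeric columns.
toy <- data.frame(
  name = c("A", "B", "C"),
  true_shooting = c(NA, 0.5, 0.41),
  efg = c(NA, 0.5, 0.25),
  stringsAsFactors = TRUE
)

# Find the numeric columns, then zero out the NA entries in each of them.
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], function(x) { x[is.na(x)] <- 0; x })

colSums(is.na(toy))
```

Keep in mind that a zero here means “no attempts” rather than “missed everything”, so the filled-in shooting percentages should be read with that in mind.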
Luckily, that wasn’t too bad at all, as the data was pretty clean to begin with. Let’s look at some visualizations of the data now to see what we can do with it.
As mentioned earlier, the median of the plus/minus rating should be about zero. The rating is integer-valued, but it takes enough distinct values that we can treat it as a continuous variable that is approximately normally distributed. This all follows from logic, but let’s test that with a histogram:
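A check of this kind can be sketched as follows. The values are simulated rather than read from the real file, and the standard deviation of 11 is an assumption picked for illustration; on the actual data the same calls would be made on `basketball$plus_minus`.

```r
set.seed(1)
# Simulated stand-in for basketball$plus_minus: rounded normal draws.
pm <- round(rnorm(5000, mean = 0, sd = 11))

# The median should sit at or very near zero.
median(pm)

# Draw the histogram; h keeps the bin counts for inspection.
h <- hist(pm, breaks = 30, main = "Simulated plus/minus", xlab = "plus_minus")
```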
We can see here that our initial intuition was correct! Now, let’s think about some variables that could possibly affect this value. My first thought is to say that if a player gets a lot of minutes, he must be a superior player, and thus he must have a larger plus/minus value as a result. Let’s test this out:
Whoa, that was an incorrect assumption. The spread of plus/minus values looks about the same all the way across the x axis, which makes sense, because even when a team is losing, it will have its best players on the court. This makes me look towards less traditional statistics for our answer. Let’s start with the player’s team’s offensive rating, which represents the number of points a team generates per 100 possessions. However, we must also take into account the opposing team’s offensive rating, so let’s do some feature engineering and create a field that represents the difference between these two values.
basketball$off_rtg_diff <- basketball$team_off_rtg - basketball$opp_team_off_rtg
Cool, now let’s plot this relationship.
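Here is a minimal sketch of such a plot using base graphics. Everything in it is simulated: the rating means and spreads, the 0.5 slope, and the noise level are assumptions for illustration, not values estimated from the real file.

```r
set.seed(2)
n <- 1000
# Simulated team and opponent offensive ratings (assumed mean 106, sd 11).
team_off_rtg     <- rnorm(n, 106, 11)
opp_team_off_rtg <- rnorm(n, 106, 11)
off_rtg_diff     <- team_off_rtg - opp_team_off_rtg

# Simulated plus/minus: an assumed linear signal plus noise, rounded to integers.
plus_minus <- round(0.5 * off_rtg_diff + rnorm(n, sd = 8))

# Strength of the linear relationship, then the scatter plot with a fitted line.
r <- cor(off_rtg_diff, plus_minus)
r
plot(off_rtg_diff, plus_minus, pch = 16, col = rgb(0, 0, 0, 0.2))
abline(lm(plus_minus ~ off_rtg_diff), col = "red")
```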
Well, that looks like a legitimate positive correlation. Let’s start with these variables and try to determine the correct model for this data.
Since we are treating plus/minus as a continuous variable, a linear regression model seems like the obvious and logical choice. Let’s enter the team offensive rating and the opposing team’s offensive rating into a linear regression model.
fit <- lm(plus_minus ~ team_off_rtg + opp_team_off_rtg, data = basketball)
summary(fit)
##
## Call:
## lm(formula = plus_minus ~ team_off_rtg + opp_team_off_rtg, data = basketball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.367 -5.462 0.049 5.509 35.633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.289672 0.251499 1.152 0.249
## team_off_rtg 0.508700 0.001786 284.796 <2e-16 ***
## opp_team_off_rtg -0.507866 0.001788 -284.077 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.228 on 161675 degrees of freedom
## Multiple R-squared: 0.4659, Adjusted R-squared: 0.4659
## F-statistic: 7.052e+04 on 2 and 161675 DF, p-value: < 2.2e-16
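Notice that the two slopes above are almost equal in size and opposite in sign (0.5087 and -0.5079), which suggests that the model is effectively using the difference in offensive ratings. A minimal sketch on simulated data (the means, spreads, and the 0.5 slope are assumptions, not estimates from the real file) shows that fitting on the difference alone gives nearly the same fit:

```r
set.seed(3)
n <- 2000
d <- data.frame(team_off_rtg     = rnorm(n, 106, 11),
                opp_team_off_rtg = rnorm(n, 106, 11))
d$off_rtg_diff <- d$team_off_rtg - d$opp_team_off_rtg
d$plus_minus   <- 0.5 * d$off_rtg_diff + rnorm(n, sd = 8)

# Two free slopes versus a single slope on the difference.
fit_two  <- lm(plus_minus ~ team_off_rtg + opp_team_off_rtg, data = d)
fit_diff <- lm(plus_minus ~ off_rtg_diff, data = d)

coef(fit_two)
coef(fit_diff)
adj <- c(two  = summary(fit_two)$adj.r.squared,
         diff = summary(fit_diff)$adj.r.squared)
adj
```

The one-slope version is simply the two-slope model with the constraint that the slopes are equal and opposite, so it can never fit better in-sample, only more simply.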
An adjusted r-squared value of .4659 is pretty solid for just two variables! Now let’s add the individual player’s offensive rating, which represents the number of points a player produces over 100 of his team’s possessions, and the player’s defensive rating, which represents the number of points a player allows over 100 of his team’s possessions. The way these statistics are calculated is really cool and somewhat complicated, and a very in depth explanation can be found here: http://www.basketball-reference.com/about/ratings.html.
fit <- update(fit,.~. + off_rtg + def_rtg, data = basketball)
summary(fit)
##
## Call:
## lm(formula = plus_minus ~ team_off_rtg + opp_team_off_rtg + off_rtg +
## def_rtg, data = basketball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.462 -5.321 0.005 5.368 35.247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5118931 0.2447657 2.091 0.0365 *
## team_off_rtg 0.4564635 0.0018484 246.945 <2e-16 ***
## opp_team_off_rtg -0.3436589 0.0040329 -85.215 <2e-16 ***
## off_rtg 0.0505466 0.0006124 82.538 <2e-16 ***
## def_rtg -0.1638360 0.0036503 -44.883 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.007 on 161673 degrees of freedom
## Multiple R-squared: 0.4942, Adjusted R-squared: 0.4942
## F-statistic: 3.949e+04 on 4 and 161673 DF, p-value: < 2.2e-16
This provides an additional boost in the adjusted r-squared value. What more can we add that isn’t included in those statistics? Modern basketball is moving outwards in the direction of the three point line, as the three point shot has become increasingly popular and also analytically applauded. Let’s see if a player’s three point attempt rate has any effect on his plus/minus rating. Additionally, a player’s usage rate, which is the percentage of his team’s plays that he uses while on the court, will theoretically have a direct impact on his performance, so let’s add that too.
fit <- update(fit,.~. + tpar + usg_perc, data = basketball)
summary(fit)
##
## Call:
## lm(formula = plus_minus ~ team_off_rtg + opp_team_off_rtg + off_rtg +
## def_rtg + tpar + usg_perc, data = basketball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.026 -5.313 0.008 5.362 35.527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5792048 0.2509797 -2.308 0.021 *
## team_off_rtg 0.4550637 0.0018478 246.270 <2e-16 ***
## opp_team_off_rtg -0.3338993 0.0041025 -81.388 <2e-16 ***
## off_rtg 0.0509090 0.0006119 83.199 <2e-16 ***
## def_rtg -0.1734127 0.0037261 -46.540 <2e-16 ***
## tpar 1.1478318 0.0814357 14.095 <2e-16 ***
## usg_perc 0.0464741 0.0027638 16.815 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.996 on 161671 degrees of freedom
## Multiple R-squared: 0.4956, Adjusted R-squared: 0.4956
## F-statistic: 2.648e+04 on 6 and 161671 DF, p-value: < 2.2e-16
Well, this doesn’t seem to have too large of an impact, but it does increase the adjusted r-squared slightly. Let’s try to add a few other variables to test their impact. This exercise becomes very difficult because the offensive and defensive rating statistics incorporate so many of a player’s statistics, and double counting variables would lead to a poor model. Interestingly enough, the individual offensive rating value does not incorporate three point percentage at all, so let’s add that in. Also, the offensive rating has the number of offensive rebounds within it, but the number of defensive rebounds is never involved, so let’s put that into our model as well. Both of these statistics are an important and basic evaluation of a player’s performance.
fit <- update(fit,.~. + drb + tp_perc, data = basketball)
summary(fit)
##
## Call:
## lm(formula = plus_minus ~ team_off_rtg + opp_team_off_rtg + off_rtg +
## def_rtg + tpar + usg_perc + drb + tp_perc, data = basketball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.317 -5.316 -0.024 5.330 35.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.9624932 0.2565506 -7.65 2.03e-14 ***
## team_off_rtg 0.4556942 0.0018415 247.46 < 2e-16 ***
## opp_team_off_rtg -0.3861599 0.0044337 -87.10 < 2e-16 ***
## off_rtg 0.0465270 0.0006572 70.80 < 2e-16 ***
## def_rtg -0.1127784 0.0042051 -26.82 < 2e-16 ***
## tpar 1.0780860 0.0930998 11.58 < 2e-16 ***
## usg_perc 0.0313286 0.0028128 11.14 < 2e-16 ***
## drb 0.2825914 0.0087538 32.28 < 2e-16 ***
## tp_perc 0.9011014 0.0834246 10.80 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.968 on 161669 degrees of freedom
## Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
## F-statistic: 2.014e+04 on 8 and 161669 DF, p-value: < 2.2e-16
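As a quick aside on what these two numbers mean: r-squared is the share of the variance in plus/minus that the model accounts for, and the adjusted version applies a small penalty for each predictor. A minimal sketch of the adjustment, using the rounded values printed above (the function name is hypothetical):

```r
# Adjusted R-squared from R-squared, sample size n, and predictor count p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values taken from the summary above: R-squared 0.4992, 161678 rows, 8 predictors.
adj <- adj_r2(0.4992, n = 161678, p = 8)
adj
```

With this many rows the penalty is tiny, which is why the two figures differ only in the fourth decimal place.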
That again increases the adjusted r-squared by a small but noticeable amount. It does feel as though we have covered the impact that a player can make through these values, but the model still explains only about 50% of the variance in plus/minus. Just for fun, let’s try to add all of the variables in the table to a separate linear model and see if that would increase our adjusted r-squared, but let’s exclude factor variables with a large number of levels to make this process quick and easy.
summary(lm(plus_minus ~ . - date - team - opp - name, data = basketball))
##
## Call:
## lm(formula = plus_minus ~ . - date - team - opp - name, data = basketball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.182 -5.253 -0.034 5.227 36.547
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4407334 1.2333705 -2.790 0.005276 **
## venueH -0.0071203 0.0404500 -0.176 0.860273
## startYes -1.7106773 0.0503667 -33.964 < 2e-16 ***
## minutes 0.0900032 0.0079497 11.322 < 2e-16 ***
## true_shooting -0.2005796 0.4499920 -0.446 0.655785
## efg 0.8098755 0.8065443 1.004 0.315317
## tpar -0.2224776 0.2013283 -1.105 0.269140
## ftar -0.0024593 0.0677452 -0.036 0.971041
## orb_perc -0.0866994 0.0199544 -4.345 1.39e-05 ***
## drb_perc -0.0478886 0.0169154 -2.831 0.004640 **
## trb_perc 0.1700196 0.0366985 4.633 3.61e-06 ***
## ast_perc 0.0099755 0.0043352 2.301 0.021392 *
## stl_perc 0.0226895 0.0402928 0.563 0.573358
## blk_perc -0.0367563 0.0222142 -1.655 0.098002 .
## tov_perc -0.0084220 0.0031626 -2.663 0.007745 **
## usg_perc 0.0417928 0.0086298 4.843 1.28e-06 ***
## off_rtg 0.0019968 0.0018270 1.093 0.274433
## def_rtg -0.0384843 0.0126501 -3.042 0.002349 **
## fg 0.8726931 0.0250248 34.873 < 2e-16 ***
## fga -0.5196774 0.0197794 -26.274 < 2e-16 ***
## fg_perc -0.3845369 0.7729652 -0.497 0.618849
## tp 0.4208343 0.0521287 8.073 6.91e-16 ***
## tpa 0.1063156 0.0280664 3.788 0.000152 ***
## tp_perc -0.0087181 0.1217901 -0.072 0.942934
## ft 0.3175718 0.0335880 9.455 < 2e-16 ***
## fta -0.1703355 0.0274795 -6.199 5.71e-10 ***
## ft_perc 0.2107740 0.0707789 2.978 0.002903 **
## orb 0.0836990 0.0393902 2.125 0.033599 *
## drb 0.2096120 0.0246798 8.493 < 2e-16 ***
## trb NA NA NA NA
## ast 0.4438777 0.0242284 18.321 < 2e-16 ***
## stl 0.3414796 0.0594536 5.744 9.28e-09 ***
## blk 0.4028373 0.0614580 6.555 5.59e-11 ***
## tov -0.6306077 0.0295025 -21.375 < 2e-16 ***
## pf -0.0668252 0.0146654 -4.557 5.20e-06 ***
## pts NA NA NA NA
## pace 0.0449642 0.0081646 5.507 3.65e-08 ***
## team_efg_perc -5.7269918 1.5893272 -3.603 0.000314 ***
## team_tov_perc 0.0453179 0.0172463 2.628 0.008598 **
## team_orb_perc 0.0197116 0.0062646 3.147 0.001653 **
## team_fg_fga 0.5887362 0.4337236 1.357 0.174656
## team_off_rtg 0.4661945 0.0100075 46.584 < 2e-16 ***
## opp_pts -0.0765618 0.0104599 -7.320 2.50e-13 ***
## opp_trb 0.0009631 0.0089680 0.107 0.914474
## opp_tov 0.2266360 0.0633778 3.576 0.000349 ***
## opp_stl -0.0122907 0.0101080 -1.216 0.224010
## opp_ast -0.0085705 0.0053068 -1.615 0.106308
## opp_tp -0.0062613 0.0079875 -0.784 0.433110
## opp_pace NA NA NA NA
## opp_team_efg_perc 3.4086286 1.6083701 2.119 0.034066 *
## opp_team_tov_perc -0.2454530 0.0711510 -3.450 0.000561 ***
## opp_team_orb_perc -0.0019395 0.0078368 -0.247 0.804530
## opp_team_fg_fga 0.6972796 0.4465810 1.561 0.118438
## opp_team_off_rtg -0.3971572 0.0174680 -22.736 < 2e-16 ***
## off_rtg_diff NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.826 on 161627 degrees of freedom
## Multiple R-squared: 0.5169, Adjusted R-squared: 0.5168
## F-statistic: 3459 on 50 and 161627 DF, p-value: < 2.2e-16
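The summary above notes that 4 coefficients are “not defined because of singularities”. That happens when a predictor is an exact linear combination of others, as trb is of orb and drb, and as off_rtg_diff is of the two team offensive ratings. A minimal sketch on made-up rebound counts (the Poisson rates and slopes are arbitrary assumptions) reproduces the behavior:

```r
set.seed(4)
n <- 200
toy <- data.frame(orb = rpois(n, 1), drb = rpois(n, 4))
toy$trb <- toy$orb + toy$drb          # exact linear combination of orb and drb
toy$plus_minus <- 0.1 * toy$orb + 0.3 * toy$drb + rnorm(n, sd = 8)

# lm() cannot separate trb from orb + drb, so it drops trb and reports NA.
fit_toy <- lm(plus_minus ~ orb + drb + trb, data = toy)
coef(fit_toy)
```

Dropping the redundant columns before fitting would give the same fitted values with a cleaner summary.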
Well, clearly our original model is not bad at all! With 8 variables, our adjusted r-squared is only slightly below that of a linear model that includes all of the variables (the full model throws in everything, including redundant predictors, four of which could not even be estimated because of singularities, so it was always going to produce a somewhat larger adjusted r-squared). Now, we should evaluate our model.
Let’s cross validate our model now and pray that our error is small so we can see which variables have the largest effect on our outcome.
TRAINCONTROL = trainControl(method = "cv", number = 10, verboseIter = TRUE)
basketball.glm <- train(plus_minus ~ off_rtg + def_rtg + team_off_rtg + opp_team_off_rtg + tpar + usg_perc + drb + tp_perc, data = basketball, method = "glm", trControl = TRAINCONTROL)
basketball.glm
## Generalized Linear Model
##
## 161678 samples
## 58 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 145508, 145510, 145510, 145510, 145508, 145510, ...
## Resampling results:
##
## RMSE Rsquared
## 7.967788 0.499138
##
##
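For anyone curious about what caret is doing in the run above, here is a rough base R sketch of 10-fold cross-validation on simulated data. The single predictor, the 0.5 slope, and the noise level are assumptions for illustration. The sketch also reports the mean absolute error alongside the RMSE, since the two are easy to confuse.

```r
set.seed(5)
n <- 1000
d <- data.frame(x = rnorm(n, 0, 15))
d$y <- 0.5 * d$x + rnorm(n, sd = 8)

k <- 10
fold <- sample(rep(1:k, length.out = n))   # random fold assignment

rmse <- numeric(k)
mae  <- numeric(k)
for (i in 1:k) {
  train <- d[fold != i, ]                  # fit on the other nine folds
  test  <- d[fold == i, ]                  # score on the held-out fold
  fit_i <- lm(y ~ x, data = train)
  err   <- test$y - predict(fit_i, newdata = test)
  rmse[i] <- sqrt(mean(err^2))
  mae[i]  <- mean(abs(err))
}

c(RMSE = mean(rmse), MAE = mean(mae))
```

RMSE squares the errors before averaging, so it is always at least as large as the mean absolute error and is pulled upward by the occasional big miss.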
Our root mean squared error is 7.967788, which means that a typical prediction misses the actual plus/minus by about 8 points. The cross-validated r-squared of 0.499138 is essentially the same as the adjusted r-squared of 0.4991 from the fit on the full data, so the model explains about half of the variance in plus/minus and holds up on data it was not trained on. This is not a terrible outcome. Now, let’s see which parameters made the largest impact.
summary(basketball.glm$finalModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -36.317 -5.316 -0.024 5.330 35.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.9624932 0.2565506 -7.65 2.03e-14 ***
## off_rtg 0.0465270 0.0006572 70.80 < 2e-16 ***
## def_rtg -0.1127784 0.0042051 -26.82 < 2e-16 ***
## team_off_rtg 0.4556942 0.0018415 247.46 < 2e-16 ***
## opp_team_off_rtg -0.3861599 0.0044337 -87.10 < 2e-16 ***
## tpar 1.0780860 0.0930998 11.58 < 2e-16 ***
## usg_perc 0.0313286 0.0028128 11.14 < 2e-16 ***
## drb 0.2825914 0.0087538 32.28 < 2e-16 ***
## tp_perc 0.9011014 0.0834246 10.80 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 63.48344)
##
## Null deviance: 20492516 on 161677 degrees of freedom
## Residual deviance: 10263304 on 161669 degrees of freedom
## AIC: 1129923
##
## Number of Fisher Scoring iterations: 2
The values in the estimate column represent how much the predicted plus/minus rating changes when that variable increases by one unit, holding the other variables fixed. The two variables with the largest estimates here are the three point attempt rate and the three point percentage. That is interesting, though both are proportions between 0 and 1, so a one-unit increase spans their entire range, and their estimates are not directly comparable to those of the ratings, which vary by tens of points from game to game. With that caveat, the positive signs are at least consistent with the popular opinion amongst basketball analysts that the three point shot is extremely valuable and that teams should look to attempt more three pointers. Another interesting wrinkle here is that both teams’ offensive ratings seem to impact the plus/minus score more than a player’s individual offensive and defensive ratings.
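One hedged way to put the estimates on a common footing is to multiply each slope by the standard deviation of its predictor, which gives the change in plus/minus per one standard deviation of that variable. The sketch below uses simulated data with assumed slopes (0.45 and 1.1) and assumed spreads; it is not a re-analysis of the real file.

```r
set.seed(6)
n <- 20000
d <- data.frame(team_off_rtg = rnorm(n, 106, 11),   # varies by tens of points
                tpar         = runif(n, 0, 1))      # a proportion in [0, 1]
d$plus_minus <- 0.45 * d$team_off_rtg + 1.1 * d$tpar + rnorm(n, sd = 8)

# Raw slopes: change in plus/minus per one unit of each predictor.
raw <- coef(lm(plus_minus ~ team_off_rtg + tpar, data = d))[-1]

# Per-SD slopes: scale each raw slope by its predictor's standard deviation.
per_sd <- raw * vapply(d[names(raw)], sd, numeric(1))

rbind(raw = raw, per_sd = per_sd)
```

In this simulation tpar has the larger raw slope, but team_off_rtg moves the outcome far more per standard deviation, which is the same pattern the t values in the summary above point to.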
There were many interesting observations to be taken away from this model, the first of which is the positive association between the three point shot and plus/minus. Secondly, the performance of the team around a player seems to affect his plus/minus score more than his own actions, and that makes intuitive sense, as basketball is a team sport in which outside circumstances can control much more than a single player can. Lastly, we must put our findings into perspective. We have been saying that an adjusted r-squared of .4991 and an RMSE slightly below 8 are good outcomes, but numerically, they don’t seem very good. However, it is important to remember that what we have attempted to do was never going to yield optimal results; it is impossible to quantify and detect exactly the things a player can do to have the maximum possible positive impact on a game. Still, this exercise has shed some light on what aspects may nudge a player’s impact in a more positive direction, and that is not something to be taken lightly. Furthermore, the large impact of advanced statistics such as offensive and defensive ratings shows the developments that have been made in the realm of basketball analytics, and the future of the field appears bright. Maybe one day, with further developments, the question of what determines a player’s plus/minus value can be more reliably answered, and a blueprint to basketball success can become more concrete.