Introduction

The introduction of advanced statistics to basketball, and particularly NBA basketball, is changing the way that experts and analysts cover the sport. As the search for an exact blueprint for individual and team success continues, the statistics that are constructed become increasingly outlandish. One of the earliest “advanced” statistics to be popularized is a player’s plus/minus rating. This value is calculated by taking the difference between the team’s scoring and the opponent’s scoring while that particular player was in the game. For example, if you played 30 minutes in a game, and in those 30 minutes your team scored 65 points and the opponent scored 55, you would have a +10 plus/minus rating for that game. The goal of this machine learning model is to see if we can predict a player’s plus/minus rating in a particular game based on an assortment of that player’s other statistics that day.
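The arithmetic in that example can be written out as a tiny helper. This is only an illustrative sketch: the function name `plus_minus_rating` and its argument names are made up here and are not part of the dataset or of any package.

```r
# Hypothetical helper: plus/minus is the team's points minus the opponent's
# points, counted only while the player was on the court.
plus_minus_rating <- function(team_pts_on_court, opp_pts_on_court) {
  team_pts_on_court - opp_pts_on_court
}

# The example from the text: 65 points scored and 55 allowed in 30 minutes.
example_rating <- plus_minus_rating(65, 55)
print(example_rating)
```

A negative result simply means the opponent outscored the player's team while he was on the floor.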

Data Explanation

The data I used contains information from NBA games between 2008 and 2015, and it covers 984 different players in that timespan. The data has 105 variables, each representing a certain statistic, and each row holds the statistics of one player in one game. The rows are grouped by player name in alphabetical order, and within each player and team the games are sorted in chronological order. I will discuss later how I cut down the number of variables in the table, but I ended up considering only 58 of them. Here are their labels and explanations:

Variable Type Description
name Factor player name
venue Factor home or away
team Factor player’s team
date Factor date of game
start Factor did the player start?
opp Factor opposing team
minutes num minutes played
true_shooting num true shooting percentage
efg num effective field goal percentage
tpar num three point attempt rate
ftar num free throw attempt rate
orb_perc num percentage of offensive rebounds player got while on court (offensive rebounding percentage)
drb_perc num percentage of defensive rebounds player got while on court (defensive rebounding percentage)
trb_perc num percentage of total rebounds player got while on court (total rebounding percentage)
ast_perc num percentage of field goals player assisted while on court
stl_perc num percentage of opponent’s possessions ended with steal by player
blk_perc num percentage of opponent’s shots that were blocked by player
tov_perc num estimate of turnovers by player per 100 plays
usg_perc num percentage of team’s plays used by player while on floor
off_rtg num number of points produced by player over 100 possessions (offensive rating)
def_rtg int number of points allowed by player over 100 possessions (defensive rating)
fg int field goals player made in game
fga int field goals player attempted in game
fg_perc num field goal percentage (fg/fga)
tp int three point field goals player made in game
tpa int three point field goals player attempted in game
tp_perc num three point field goal percentage (tp/tpa)
ft int free throws player made in game
fta int free throws player attempted in game
ft_perc num free throw percentage (ft/fta)
orb int number of offensive rebounds player had in game
drb int number of defensive rebounds player had in game
trb int number of total rebounds player had in game
ast int number of assists player had in game
stl int number of steals player had in game
blk int number of blocks player had in game
tov int number of turnovers player had in game
pf int number of personal fouls player had in game
pts int number of points player had in game
plus_minus int player’s plus minus rating in game
pace num number of possessions per 48 minutes by player’s team (pace)
team_efg_perc num player’s team’s effective field goal percentage
team_tov_perc num player’s team’s turnover percentage
team_orb_perc num player’s team’s offensive rebounding percentage
team_fg_fga num player’s team’s field goal attempt percentage
team_off_rtg num player’s team’s points scored per 100 possessions (offensive rating)
opp_pts int number of points scored by player’s opponent
opp_trb int number of rebounds by player’s opponent
opp_tov int number of turnovers by player’s opponent
opp_stl int number of steals by player’s opponent
opp_ast int number of assists by player’s opponent
opp_tp int number of three point field goals by player’s opponent
opp_pace num number of possessions per 48 minutes by player’s opponent (pace)
opp_team_efg_perc num opposing team’s effective field goal percentage
opp_team_tov_perc num estimate of number of turnovers per 100 possessions by opposing team
opp_team_orb_perc num opposing team’s offensive rebounding percentage
opp_team_fg_fga num opposing team’s field goal attempt percentage
opp_team_off_rtg num opposing team’s points scored per 100 possessions (offensive rating)

Note: effective field goal percentage is an advanced statistic that accounts for the additional points provided by three point shots, and true shooting percentage is an advanced statistic that accounts for field goals, three point field goals, and free throws.
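To make the note concrete, here is a small sketch of the two shooting statistics using their standard definitions. The function names and the stat line are made up for illustration; the argument names follow the column names in the table above.

```r
# Effective field goal percentage: a made three counts as 1.5 made shots.
effective_fg <- function(fg, tp, fga) {
  (fg + 0.5 * tp) / fga
}

# True shooting percentage: points per scoring attempt, where 0.44 is the
# usual weight that converts free throw attempts into possessions.
true_shooting_pct <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Made-up stat line: 6-for-12 from the field with 2 threes, 4-for-5 at the line.
fg <- 6; tp <- 2; fga <- 12; ft <- 4; fta <- 5
pts <- 2 * (fg - tp) + 3 * tp + ft

efg_example <- effective_fg(fg, tp, fga)
ts_example  <- true_shooting_pct(pts, fga, fta)
print(c(efg = efg_example, true_shooting = ts_example))

# With no attempts at all, both formulas divide zero by zero.
no_attempts <- effective_fg(0, 0, 0)
```

The last line returns `NaN`, which is presumably why these columns show up as missing for games in which a player never took a shot.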

Data Origin and Collection

The data that I used was found on the following Reddit thread: https://www.reddit.com/r/dfsports/comments/3q89gx/nba_basketball_research_dataset/.

This user was kind enough to compile and post an enormous Excel spreadsheet of data that can be used for all sorts of NBA-related purposes, as there are over 200,000 rows in the file. A great deal of web scraping must have gone into this file, at least some of it probably from the DraftKings website. With such a large amount of actual NBA statistics, this is the perfect data set to create a model from.

Data Preparation

For the most part, this data is pretty clean, but a few changes needed to be made to fit the purpose of this model. First, I loaded all the required packages for this model, imported the data from a csv into RStudio, and took a look at some of the fields:

library(dplyr)
library(ggplot2)
library(caret)
basketball <- read.csv("~/Desktop/basketball research data set.csv", header=TRUE)
head(basketball,20)[1:9]
##          name venue team       date start opp   minutes true_shooting  efg
## 1  A.J. Price     H  CLE 2014-12-02    No MIL  0.000000            NA   NA
## 2  A.J. Price     A  CLE 2014-12-04    No NYK  0.600000            NA   NA
## 3  A.J. Price     A  CLE 2014-12-05    No TOR  0.000000            NA   NA
## 4  A.J. Price     A  CLE 2014-12-08    No BRK  3.366667         0.000 0.00
## 5  A.J. Price     H  CLE 2014-12-09    No TOR  0.000000            NA   NA
## 6  A.J. Price     A  CLE 2014-12-11    No OKC  1.516667            NA   NA
## 7  A.J. Price     A  CLE 2014-12-12    No NOP  0.000000            NA   NA
## 8  A.J. Price     H  CLE 2014-12-15    No CHO  0.000000            NA   NA
## 9  A.J. Price     H  CLE 2014-12-17    No ATL  7.683333         0.000 0.00
## 10 A.J. Price     H  CLE 2014-12-19    No BRK  0.000000            NA   NA
## 11 A.J. Price     H  CLE 2014-12-21    No MEM  0.000000            NA   NA
## 12 A.J. Price     H  CLE 2014-12-23    No MIN  2.733333            NA   NA
## 13 A.J. Price     A  CLE 2014-12-26    No ORL  6.366667            NA   NA
## 14 A.J. Price     H  CLE 2014-12-28    No DET 17.650000         0.129 0.00
## 15 A.J. Price     A  CLE 2014-12-30    No ATL  8.083333         0.500 0.50
## 16 A.J. Price     H  CLE 2014-12-31    No MIL 11.083333         0.410 0.25
## 17 A.J. Price     A  CLE 2015-01-02    No CHO  0.000000            NA   NA
## 18 A.J. Price     H  CLE 2015-01-04    No DAL 13.933333         0.500 0.50
## 19 A.J. Price     A  CLE 2015-01-05    No PHI 13.716667         0.300 0.30
## 20 A.J. Price     H  IND 2009-10-30    No MIA  1.316667            NA   NA

At first glance, this doesn’t look too pretty, but in reality, it is not bad at all. The only glaring issue is that NA values need to be replaced with 0, which is an easy fix. First, however, we should consider the fact that a player who played only one or two minutes in a game would have a negligible impact on any game. Thus, we want to remove the games for a player where he had such a small impact. But where do we draw that line? I decided to draw the line at twelve minutes, so a player would have to play an entire quarter to be qualified for this experiment.

nrow(basketball)
## [1] 208782
basketball <- filter(basketball, minutes >= 12)
nrow(basketball)
## [1] 161678
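Since the twelve-minute line is a judgment call, it can help to see how many rows different cutoffs would keep. The sketch below does this for the 20 rows printed earlier; the cutoffs other than 12 are hypothetical alternatives, not values used in this analysis.

```r
# Minutes for the 20 A.J. Price rows shown in the head() output above.
minutes <- c(0, 0.6, 0, 3.366667, 0, 1.516667, 0, 0, 7.683333, 0,
             0, 2.733333, 6.366667, 17.65, 8.083333, 11.083333, 0,
             13.933333, 13.716667, 1.316667)

# Count the rows that survive each candidate cutoff.
cutoffs <- c(1, 6, 12)
rows_kept <- sapply(cutoffs, function(cutoff) sum(minutes >= cutoff))
names(rows_kept) <- paste0(">=", cutoffs)
print(rows_kept)
```

The same `sapply` call with `basketball$minutes` in place of `minutes` would show the trade-off on the full data set.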

We see now that we have cut down the number of rows significantly by eliminating data that will be essentially useless for our purposes. Now, we should cut unnecessary variables out of our data set.

variable.names(basketball)
##   [1] "name"                   "venue"                 
##   [3] "team"                   "date"                  
##   [5] "start"                  "opp"                   
##   [7] "minutes"                "true_shooting"         
##   [9] "efg"                    "tpar"                  
##  [11] "ftar"                   "orb_perc"              
##  [13] "drb_perc"               "trb_perc"              
##  [15] "ast_perc"               "stl_perc"              
##  [17] "blk_perc"               "tov_perc"              
##  [19] "usg_perc"               "off_rtg"               
##  [21] "def_rtg"                "fg"                    
##  [23] "fga"                    "fg_perc"               
##  [25] "tp"                     "tpa"                   
##  [27] "tp_perc"                "ft"                    
##  [29] "fta"                    "ft_perc"               
##  [31] "orb"                    "drb"                   
##  [33] "trb"                    "ast"                   
##  [35] "stl"                    "blk"                   
##  [37] "tov"                    "pf"                    
##  [39] "pts"                    "plus_minus"            
##  [41] "pace"                   "team_efg_perc"         
##  [43] "team_tov_perc"          "team_orb_perc"         
##  [45] "team_fg_fga"            "team_off_rtg"          
##  [47] "suspended"              "dnp"                   
##  [49] "season"                 "dk_fp"                 
##  [51] "fd_fp"                  "pts_ma"                
##  [53] "trb_ma"                 "tov_ma"                
##  [55] "stl_ma"                 "ast_ma"                
##  [57] "tp_ma"                  "min_ma"                
##  [59] "dk_fp_ma"               "fd_fp_ma"              
##  [61] "pts_ma_1"               "trb_ma_1"              
##  [63] "tov_ma_1"               "stl_ma_1"              
##  [65] "ast_ma_1"               "tp_ma_1"               
##  [67] "dk_fp_ma_1"             "fd_fp_ma_1"            
##  [69] "min_ma_1"               "opp_pts"               
##  [71] "opp_trb"                "opp_tov"               
##  [73] "opp_stl"                "opp_ast"               
##  [75] "opp_tp"                 "opp_pace"              
##  [77] "opp_team_efg_perc"      "opp_team_tov_perc"     
##  [79] "opp_team_orb_perc"      "opp_team_fg_fga"       
##  [81] "opp_team_off_rtg"       "opp_pts_ma"            
##  [83] "opp_trb_ma"             "opp_tov_ma"            
##  [85] "opp_stl_ma"             "opp_ast_ma"            
##  [87] "opp_tp_ma"              "opp_pace_ma"           
##  [89] "opp_team_efg_perc_ma"   "opp_team_tov_perc_ma"  
##  [91] "opp_team_orb_perc_ma"   "opp_team_fg_fga_ma"    
##  [93] "opp_team_off_rtg_ma"    "opp_pts_ma_1"          
##  [95] "opp_trb_ma_1"           "opp_tov_ma_1"          
##  [97] "opp_stl_ma_1"           "opp_ast_ma_1"          
##  [99] "opp_tp_ma_1"            "opp_pace_ma_1"         
## [101] "opp_team_efg_perc_ma_1" "opp_team_tov_perc_ma_1"
## [103] "opp_team_orb_perc_ma_1" "opp_team_fg_fga_ma_1"  
## [105] "opp_team_off_rtg_ma_1"

Every variable with “_ma” or “_ma_1” at the end of its name is a “moving average” variable, which means that it depends on previous games. Since we are predicting a statistic from a player’s other statistics in that same game, every moving average variable is useless to us. Additionally, the “dnp” variable, which means “did not play”, and the “suspended” variable are useless due to our minutes restriction, so they can be removed. There are also a few fantasy point variables (“dk_fp” and “fd_fp”) that come from sites like DraftKings, so they represent online gambling values that have no purpose here. Furthermore, the “season” variable is unimportant, because the median plus/minus rating is zero regardless of the season in which the game is played. Let’s remove these things now:

basketball <- basketball[,c(1:46,70:81)]
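Selecting by position works, but it silently breaks if the column order ever changes. As an alternative sketch, the same columns can be dropped by name. The vector `cols` below is only a handful of names from the listing above, standing in for the full set, and `other_drops` is a name invented for this example.

```r
# A few column names from the listing above, standing in for all 105.
cols <- c("name", "minutes", "plus_minus", "team_off_rtg", "suspended",
          "dnp", "season", "dk_fp", "fd_fp", "pts_ma", "dk_fp_ma_1",
          "opp_pts", "opp_team_off_rtg", "opp_pts_ma", "opp_pace_ma_1")

# Moving-average columns end in "_ma" or "_ma_1".
is_moving_avg <- grepl("_ma(_1)?$", cols)

# Columns the text drops for other reasons.
other_drops <- c("suspended", "dnp", "season", "dk_fp", "fd_fp")

kept <- cols[!is_moving_avg & !(cols %in% other_drops)]
print(kept)
```

Applied to `names(basketball)`, the same two filters should leave the 58 columns that the positional index keeps, since every column in positions 47 to 69 and 82 to 105 is either a moving average or one of the five named drops.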

Now, we can finally remove all of those NA values by setting them to zero:

# Replace every NA with 0, one column at a time
for (i in seq_along(basketball)) {
  is_missing <- is.na(basketball[[i]])
  if (any(is_missing)) {
    basketball[[i]][is_missing] <- 0
  }
}

Luckily, that wasn’t too bad at all, as the data was pretty clean to begin with. Let’s look at some visualizations of the data now to see what we can do with it.

Data Exploration

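As mentioned earlier, the median of the plus/minus rating should be zero, and although the rating is recorded as an integer, it takes enough distinct values to be treated as a continuous variable that is approximately normally distributed. This follows from logic, since every point of positive plus/minus for one team is a point of negative plus/minus for the other, but let’s test that with a histogram: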
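The original figure is not reproduced here, so the following is only a sketch of how such a check could be run. It uses base R graphics so that it runs without extra packages, and it draws on simulated values rather than the real data: `sim_plus_minus` is a made-up stand-in for `basketball$plus_minus`, with a standard deviation of about 11 chosen to roughly match the spread implied by the model output later in this post.

```r
set.seed(1)
# Simulated stand-in for basketball$plus_minus (an assumption, not real data).
sim_plus_minus <- round(rnorm(5000, mean = 0, sd = 11))

# Histogram of the ratings.
h <- hist(sim_plus_minus, breaks = 30,
          main = "Plus/minus rating", xlab = "plus_minus")

# Two numeric checks to go with the picture: the median should sit near zero
# and the skewness of a symmetric distribution should be near zero.
sim_median <- median(sim_plus_minus)
sim_skew <- mean((sim_plus_minus - mean(sim_plus_minus))^3) /
  sd(sim_plus_minus)^3
print(c(median = sim_median, skewness = sim_skew))
```

Swapping `sim_plus_minus` for `basketball$plus_minus` runs the same check on the real ratings.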

We can see here that our initial intuition was correct! Now, let’s think about some variables that could possibly affect this value. My first thought is to say that if a player gets a lot of minutes, he must be a superior player, and thus he must have a larger plus/minus value as a result. Let’s test this out:
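The plot itself is not reproduced here. As a sketch of the mechanics only, the block below builds a simulated data frame `sim` in which plus/minus is drawn independently of minutes (an assumption made purely so the example runs), then looks at the relationship with a scatterplot, binned medians, and a correlation.

```r
set.seed(2)
# Simulated stand-in data (not the real data set): minutes between 12 and 48,
# with plus/minus drawn independently of minutes.
sim <- data.frame(minutes = runif(3000, min = 12, max = 48))
sim$plus_minus <- round(rnorm(3000, mean = 0, sd = 11))

# Scatterplot of the raw relationship.
plot(sim$minutes, sim$plus_minus, pch = ".",
     xlab = "minutes", ylab = "plus_minus")

# Median plus/minus within 6-minute bins, plus the overall correlation.
sim$minute_bin <- cut(sim$minutes, breaks = seq(12, 48, by = 6),
                      include.lowest = TRUE)
median_by_bin <- tapply(sim$plus_minus, sim$minute_bin, median)
minutes_cor <- cor(sim$minutes, sim$plus_minus)
print(median_by_bin)
print(minutes_cor)
```

With the real data, `basketball` would take the place of `sim`; binned medians that stay flat across the x axis indicate that minutes alone say little about plus/minus.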

Whoa, that was an incorrect assumption. The points seem to be spread evenly around zero across the whole x axis, which makes sense, because even when a team is losing, it will have its best players on the court. Hmm, this makes me look towards less traditional statistics for our answer. Let’s start with the player’s team’s offensive rating, which represents the number of points a team generates per 100 possessions. However, we must also take into account the opposing team’s offensive rating, so let’s execute some feature engineering and create a field that represents the difference in these two values.

basketball$off_rtg_diff <- basketball$team_off_rtg - basketball$opp_team_off_rtg

Cool, now let’s plot this relationship.
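The figure is not reproduced here, so this is only a sketch of how the relationship could be plotted and summarized. The data frame `sim` is simulated: the rating mean of 106, the slope of 0.5, and the noise level are assumptions that loosely mimic the regression output later in this post, not values taken from the data.

```r
set.seed(3)
n <- 3000
# Simulated stand-in data (assumed values, not the real data set).
sim <- data.frame(team_off_rtg     = rnorm(n, mean = 106, sd = 11),
                  opp_team_off_rtg = rnorm(n, mean = 106, sd = 11))
sim$off_rtg_diff <- sim$team_off_rtg - sim$opp_team_off_rtg
sim$plus_minus <- round(0.5 * sim$off_rtg_diff + rnorm(n, sd = 8))

# Scatterplot of the engineered feature against plus/minus.
plot(sim$off_rtg_diff, sim$plus_minus, pch = ".",
     xlab = "off_rtg_diff", ylab = "plus_minus")

# A single number to go with the picture.
diff_cor <- cor(sim$off_rtg_diff, sim$plus_minus)
print(diff_cor)
```

Replacing `sim` with `basketball` in the `plot` and `cor` calls gives the real version of the plot.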

Well, that looks like a legitimate positive correlation. Let’s start with these variables and try to determine the correct model for this data.

Modelling

Since this is a continuous variable, a linear regression model seems like the obvious and logical choice. Let’s enter the team offensive rating and opposing team’s offensive rating into a linear regression model.

fit <- lm(plus_minus ~ team_off_rtg + opp_team_off_rtg, data = basketball)
summary(fit)
## 
## Call:
## lm(formula = plus_minus ~ team_off_rtg + opp_team_off_rtg, data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.367  -5.462   0.049   5.509  35.633 
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       0.289672   0.251499    1.152    0.249    
## team_off_rtg      0.508700   0.001786  284.796   <2e-16 ***
## opp_team_off_rtg -0.507866   0.001788 -284.077   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.228 on 161675 degrees of freedom
## Multiple R-squared:  0.4659, Adjusted R-squared:  0.4659 
## F-statistic: 7.052e+04 on 2 and 161675 DF,  p-value: < 2.2e-16
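The two coefficients above are almost exactly equal and opposite (0.5087 and -0.5079), which suggests that this model is effectively regressing on the difference between the two ratings, that is, on `off_rtg_diff`. The sketch below illustrates the idea on simulated data that is built to have this symmetry by assumption; the means, slope, and noise level are made up.

```r
set.seed(4)
n <- 5000
# Simulated stand-in data with a symmetric effect built in (an assumption).
sim <- data.frame(team_off_rtg     = rnorm(n, mean = 106, sd = 11),
                  opp_team_off_rtg = rnorm(n, mean = 106, sd = 11))
sim$off_rtg_diff <- sim$team_off_rtg - sim$opp_team_off_rtg
sim$plus_minus <- 0.5 * sim$off_rtg_diff + rnorm(n, sd = 8)

# Two separate ratings versus the single engineered difference.
fit_two  <- lm(plus_minus ~ team_off_rtg + opp_team_off_rtg, data = sim)
fit_diff <- lm(plus_minus ~ off_rtg_diff, data = sim)

r2_two  <- summary(fit_two)$r.squared
r2_diff <- summary(fit_diff)$r.squared
print(c(two_ratings = r2_two, difference_only = r2_diff))
```

Because `off_rtg_diff` is an exact linear combination of the two ratings, a model that contains all three cannot estimate a separate coefficient for it, which is why it is reported as `NA` in the all-variables model further down.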

An adjusted r-squared value of 0.4659 is pretty solid for just two variables! Now let’s add the individual player’s offensive rating, which represents the number of points a player produces over 100 of his team’s possessions, and the player’s defensive rating, which represents the number of points a player allows over 100 of his team’s possessions. The way these statistics are calculated is really cool and somewhat complicated, and a very in-depth explanation can be found here: http://www.basketball-reference.com/about/ratings.html.

fit <- update(fit,.~. + off_rtg + def_rtg, data = basketball)
summary(fit)
## 
## Call:
## lm(formula = plus_minus ~ team_off_rtg + opp_team_off_rtg + off_rtg + 
##     def_rtg, data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.462  -5.321   0.005   5.368  35.247 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.5118931  0.2447657   2.091   0.0365 *  
## team_off_rtg      0.4564635  0.0018484 246.945   <2e-16 ***
## opp_team_off_rtg -0.3436589  0.0040329 -85.215   <2e-16 ***
## off_rtg           0.0505466  0.0006124  82.538   <2e-16 ***
## def_rtg          -0.1638360  0.0036503 -44.883   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.007 on 161673 degrees of freedom
## Multiple R-squared:  0.4942, Adjusted R-squared:  0.4942 
## F-statistic: 3.949e+04 on 4 and 161673 DF,  p-value: < 2.2e-16

This provides an additional boost in the adjusted r-squared value. What more can we add that isn’t included in those statistics? Modern basketball is moving outwards in the direction of the three point line, as the three point shot has become increasingly popular and also analytically applauded. Let’s see if a player’s three point attempt rate has any effect on his plus/minus rating. Additionally, a player’s usage rate, which is the percentage of possessions he is used in while on the court, will theoretically have a direct impact on his performance, so let’s add that too.

fit <- update(fit,.~. + tpar + usg_perc, data = basketball)
summary(fit)
## 
## Call:
## lm(formula = plus_minus ~ team_off_rtg + opp_team_off_rtg + off_rtg + 
##     def_rtg + tpar + usg_perc, data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.026  -5.313   0.008   5.362  35.527 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.5792048  0.2509797  -2.308    0.021 *  
## team_off_rtg      0.4550637  0.0018478 246.270   <2e-16 ***
## opp_team_off_rtg -0.3338993  0.0041025 -81.388   <2e-16 ***
## off_rtg           0.0509090  0.0006119  83.199   <2e-16 ***
## def_rtg          -0.1734127  0.0037261 -46.540   <2e-16 ***
## tpar              1.1478318  0.0814357  14.095   <2e-16 ***
## usg_perc          0.0464741  0.0027638  16.815   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.996 on 161671 degrees of freedom
## Multiple R-squared:  0.4956, Adjusted R-squared:  0.4956 
## F-statistic: 2.648e+04 on 6 and 161671 DF,  p-value: < 2.2e-16

Well, this doesn’t seem to have too large of an impact, but it does increase the adjusted r-squared slightly. Let’s try to add a few other variables to test their impact. This exercise becomes very difficult because the offensive and defensive rating statistics incorporate so many of a player’s statistics, and double counting variables would lead to a poor model. Interestingly enough, the individual offensive rating value does not incorporate three point percentage at all, so let’s add that in. Also, the offensive rating includes the number of offensive rebounds, but the number of defensive rebounds is never involved, so let’s put that into our model as well. Both of these statistics are important, basic evaluations of a player’s performance.

fit <- update(fit,.~. + drb + tp_perc, data = basketball)
summary(fit)
## 
## Call:
## lm(formula = plus_minus ~ team_off_rtg + opp_team_off_rtg + off_rtg + 
##     def_rtg + tpar + usg_perc + drb + tp_perc, data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.317  -5.316  -0.024   5.330  35.630 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.9624932  0.2565506   -7.65 2.03e-14 ***
## team_off_rtg      0.4556942  0.0018415  247.46  < 2e-16 ***
## opp_team_off_rtg -0.3861599  0.0044337  -87.10  < 2e-16 ***
## off_rtg           0.0465270  0.0006572   70.80  < 2e-16 ***
## def_rtg          -0.1127784  0.0042051  -26.82  < 2e-16 ***
## tpar              1.0780860  0.0930998   11.58  < 2e-16 ***
## usg_perc          0.0313286  0.0028128   11.14  < 2e-16 ***
## drb               0.2825914  0.0087538   32.28  < 2e-16 ***
## tp_perc           0.9011014  0.0834246   10.80  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.968 on 161669 degrees of freedom
## Multiple R-squared:  0.4992, Adjusted R-squared:  0.4991 
## F-statistic: 2.014e+04 on 8 and 161669 DF,  p-value: < 2.2e-16

That again increases the adjusted r-squared by a small but noticeable amount. It does feel as though we have covered the impact that a player can make through these values, but we are still explaining only about 50% of the variance in plus/minus rating. Just for fun, let’s try to add all of the variables in the table to a separate linear model and see if that would increase our adjusted r-squared, but let’s exclude factor variables with a large number of levels to make this process quick and easy.

summary(lm(plus_minus ~ . - date - team - opp - name, data = basketball))
## 
## Call:
## lm(formula = plus_minus ~ . - date - team - opp - name, data = basketball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.182  -5.253  -0.034   5.227  36.547 
## 
## Coefficients: (4 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.4407334  1.2333705  -2.790 0.005276 ** 
## venueH            -0.0071203  0.0404500  -0.176 0.860273    
## startYes          -1.7106773  0.0503667 -33.964  < 2e-16 ***
## minutes            0.0900032  0.0079497  11.322  < 2e-16 ***
## true_shooting     -0.2005796  0.4499920  -0.446 0.655785    
## efg                0.8098755  0.8065443   1.004 0.315317    
## tpar              -0.2224776  0.2013283  -1.105 0.269140    
## ftar              -0.0024593  0.0677452  -0.036 0.971041    
## orb_perc          -0.0866994  0.0199544  -4.345 1.39e-05 ***
## drb_perc          -0.0478886  0.0169154  -2.831 0.004640 ** 
## trb_perc           0.1700196  0.0366985   4.633 3.61e-06 ***
## ast_perc           0.0099755  0.0043352   2.301 0.021392 *  
## stl_perc           0.0226895  0.0402928   0.563 0.573358    
## blk_perc          -0.0367563  0.0222142  -1.655 0.098002 .  
## tov_perc          -0.0084220  0.0031626  -2.663 0.007745 ** 
## usg_perc           0.0417928  0.0086298   4.843 1.28e-06 ***
## off_rtg            0.0019968  0.0018270   1.093 0.274433    
## def_rtg           -0.0384843  0.0126501  -3.042 0.002349 ** 
## fg                 0.8726931  0.0250248  34.873  < 2e-16 ***
## fga               -0.5196774  0.0197794 -26.274  < 2e-16 ***
## fg_perc           -0.3845369  0.7729652  -0.497 0.618849    
## tp                 0.4208343  0.0521287   8.073 6.91e-16 ***
## tpa                0.1063156  0.0280664   3.788 0.000152 ***
## tp_perc           -0.0087181  0.1217901  -0.072 0.942934    
## ft                 0.3175718  0.0335880   9.455  < 2e-16 ***
## fta               -0.1703355  0.0274795  -6.199 5.71e-10 ***
## ft_perc            0.2107740  0.0707789   2.978 0.002903 ** 
## orb                0.0836990  0.0393902   2.125 0.033599 *  
## drb                0.2096120  0.0246798   8.493  < 2e-16 ***
## trb                       NA         NA      NA       NA    
## ast                0.4438777  0.0242284  18.321  < 2e-16 ***
## stl                0.3414796  0.0594536   5.744 9.28e-09 ***
## blk                0.4028373  0.0614580   6.555 5.59e-11 ***
## tov               -0.6306077  0.0295025 -21.375  < 2e-16 ***
## pf                -0.0668252  0.0146654  -4.557 5.20e-06 ***
## pts                       NA         NA      NA       NA    
## pace               0.0449642  0.0081646   5.507 3.65e-08 ***
## team_efg_perc     -5.7269918  1.5893272  -3.603 0.000314 ***
## team_tov_perc      0.0453179  0.0172463   2.628 0.008598 ** 
## team_orb_perc      0.0197116  0.0062646   3.147 0.001653 ** 
## team_fg_fga        0.5887362  0.4337236   1.357 0.174656    
## team_off_rtg       0.4661945  0.0100075  46.584  < 2e-16 ***
## opp_pts           -0.0765618  0.0104599  -7.320 2.50e-13 ***
## opp_trb            0.0009631  0.0089680   0.107 0.914474    
## opp_tov            0.2266360  0.0633778   3.576 0.000349 ***
## opp_stl           -0.0122907  0.0101080  -1.216 0.224010    
## opp_ast           -0.0085705  0.0053068  -1.615 0.106308    
## opp_tp            -0.0062613  0.0079875  -0.784 0.433110    
## opp_pace                  NA         NA      NA       NA    
## opp_team_efg_perc  3.4086286  1.6083701   2.119 0.034066 *  
## opp_team_tov_perc -0.2454530  0.0711510  -3.450 0.000561 ***
## opp_team_orb_perc -0.0019395  0.0078368  -0.247 0.804530    
## opp_team_fg_fga    0.6972796  0.4465810   1.561 0.118438    
## opp_team_off_rtg  -0.3971572  0.0174680 -22.736  < 2e-16 ***
## off_rtg_diff              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.826 on 161627 degrees of freedom
## Multiple R-squared:  0.5169, Adjusted R-squared:  0.5168 
## F-statistic:  3459 on 50 and 161627 DF,  p-value: < 2.2e-16

Well, clearly our original model is not bad at all! With only 8 variables, its adjusted r-squared of 0.4991 is only slightly below the 0.5168 of a linear model that includes all of the variables. (The larger model runs a greater risk of overfitting, and a model with more predictors will always fit the training data at least as well, so a somewhat larger value was to be expected.) Now, we should evaluate our model.
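It is worth seeing how little adjusted r-squared actually penalizes extra variables at this sample size. The sketch below applies the usual formula to the all-variables model; the function name `adjusted_r2` is made up, and the input is the rounded r-squared printed above, so the last digit may differ from R's own output.

```r
# Adjusted r-squared from plain r-squared, the number of rows n, and the
# number of predictors p (not counting the intercept).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_rows <- 161678

# All-variables model: r-squared 0.5169 with 50 estimated predictors.
adj_full <- adjusted_r2(0.5169, n_rows, p = 50)

# How much the adjustment takes away.
penalty_full <- 0.5169 - adj_full
print(c(adjusted = adj_full, penalty = penalty_full))
```

With over 160,000 rows the penalty for 50 predictors is tiny, so adjusted r-squared on its own is a weak guard against overfitting here; held-out error from cross validation is the more direct check.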

Model Evaluation and Results

Let’s cross validate our model now and pray that our error is small so we can see which variables have the largest effect on our outcome.

TRAINCONTROL <- trainControl(method = "cv", number = 10, verboseIter = TRUE)
basketball.glm <- train(
  plus_minus ~ off_rtg + def_rtg + team_off_rtg + opp_team_off_rtg +
    tpar + usg_perc + drb + tp_perc,
  data = basketball, method = "glm", trControl = TRAINCONTROL
)
basketball.glm
## Generalized Linear Model 
## 
## 161678 samples
##     58 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 145508, 145510, 145510, 145510, 145508, 145510, ... 
## Resampling results:
## 
##   RMSE      Rsquared
##   7.967788  0.499138
## 
## 
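For readers who have not used caret, here is roughly what 10-fold cross validation does, written out by hand in base R. This is a sketch on a made-up data set with one predictor (`x`, `y`, and the noise level of 3 are all invented), not a reimplementation of caret's internals.

```r
set.seed(5)
n <- 1000
# Made-up data: one predictor and a noisy linear response.
sim <- data.frame(x = rnorm(n))
sim$y <- 2 * sim$x + rnorm(n, sd = 3)

k <- 10
# Give every row a random fold label from 1 to k.
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other nine folds, predict the held-out fold,
# and record the root mean squared error of those predictions.
fold_rmse <- sapply(1:k, function(j) {
  train_rows <- sim[fold != j, ]
  test_rows  <- sim[fold == j, ]
  fit <- lm(y ~ x, data = train_rows)
  pred <- predict(fit, newdata = test_rows)
  sqrt(mean((test_rows$y - pred)^2))
})

cv_rmse <- mean(fold_rmse)
print(cv_rmse)
```

Because every prediction is made on rows the model never saw, the averaged RMSE is an honest estimate of how far off the model would be on new games.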

Our root mean squared error is 7.967788, which means that our predictions typically miss the true plus/minus rating by about 8 points. The cross-validated r-squared of 0.499 essentially matches the adjusted r-squared of 0.4991 from the full fit, so the model explains about half of the variance in plus/minus rating and holds up on data it was not trained on. That is not a terrible outcome. Now, let’s see which parameters made the largest impact.

summary(basketball.glm$finalModel)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -36.317   -5.316   -0.024    5.330   35.630  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.9624932  0.2565506   -7.65 2.03e-14 ***
## off_rtg           0.0465270  0.0006572   70.80  < 2e-16 ***
## def_rtg          -0.1127784  0.0042051  -26.82  < 2e-16 ***
## team_off_rtg      0.4556942  0.0018415  247.46  < 2e-16 ***
## opp_team_off_rtg -0.3861599  0.0044337  -87.10  < 2e-16 ***
## tpar              1.0780860  0.0930998   11.58  < 2e-16 ***
## usg_perc          0.0313286  0.0028128   11.14  < 2e-16 ***
## drb               0.2825914  0.0087538   32.28  < 2e-16 ***
## tp_perc           0.9011014  0.0834246   10.80  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 63.48344)
## 
##     Null deviance: 20492516  on 161677  degrees of freedom
## Residual deviance: 10263304  on 161669  degrees of freedom
## AIC: 1129923
## 
## Number of Fisher Scoring iterations: 2
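The deviance figures in this output tie back to the numbers we saw earlier. For a gaussian model, the null deviance is the total sum of squares around the mean and the residual deviance is the model's residual sum of squares, so r-squared and the in-sample RMSE can be recovered directly. The variable names below are made up; the three numbers are copied from the output above.

```r
null_deviance     <- 20492516   # total sum of squares around the mean
residual_deviance <- 10263304   # residual sum of squares of the model
n_rows            <- 161678

# Share of the variance explained by the model.
r2_from_deviance <- 1 - residual_deviance / null_deviance

# Typical size of an in-sample prediction error.
in_sample_rmse <- sqrt(residual_deviance / n_rows)

print(c(r_squared = r2_from_deviance, rmse = in_sample_rmse))
```

The r-squared rounds to the 0.4992 reported by `lm` earlier, and the in-sample RMSE lands within a hair of the cross-validated 7.967788, which is another sign that the 8-variable model is not overfitting.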

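The values in the estimate column represent how much the predicted plus/minus rating changes when that variable increases by one unit and the other variables are held fixed. The two largest estimates belong to the three point attempt rate and the three point percentage, which is extremely interesting, but some care is needed: both are proportions between 0 and 1, so a one-unit increase spans their entire range, whereas the ratings are measured in points per 100 possessions and routinely move by many units from game to game. Judged by t value, which does not depend on the units, the team’s and the opposing team’s offensive ratings are by far the strongest predictors. Even so, the positive and significant coefficients on both three point variables are consistent with the popular opinion amongst basketball analysts that the three point shot is valuable and that teams should look to attempt more three pointers. Another interesting wrinkle here is that both teams’ offensive ratings seem to impact the plus/minus score more than a player’s individual offensive and defensive ratings.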
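One common way to compare coefficients across variables with very different units is to rescale every predictor to a standard deviation of one before fitting. The sketch below shows the effect on simulated data; the data frame `sim`, its scales, and its true coefficients are assumptions chosen to resemble the two kinds of variables in our model, not values from the real data.

```r
set.seed(6)
n <- 20000
# Simulated stand-in data: a rating in points per 100 possessions and a rate
# that is a proportion between 0 and 1 (assumed scales).
sim <- data.frame(team_off_rtg = rnorm(n, mean = 106, sd = 11),
                  tpar         = runif(n, min = 0, max = 1))
sim$plus_minus <- 0.45 * sim$team_off_rtg + 1.1 * sim$tpar + rnorm(n, sd = 8)

raw_fit <- lm(plus_minus ~ team_off_rtg + tpar, data = sim)

# Refit after scaling each predictor to mean 0 and standard deviation 1, so
# every coefficient reads as "change in plus/minus per one standard deviation".
scaled <- sim
scaled$team_off_rtg <- as.numeric(scale(sim$team_off_rtg))
scaled$tpar         <- as.numeric(scale(sim$tpar))
std_fit <- lm(plus_minus ~ team_off_rtg + tpar, data = scaled)

raw_coefs <- coef(raw_fit)[-1]
std_coefs <- coef(std_fit)[-1]
print(rbind(raw = raw_coefs, standardized = std_coefs))
```

Each standardized coefficient is just the raw coefficient multiplied by that predictor's standard deviation, so a variable with a large raw estimate but a narrow range can end up with a much smaller standardized effect than a variable like an offensive rating.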

Conclusions

There were many interesting observations to be taken away from this model, the first of which is the apparent importance of the three point shot, with the caveat that coefficients on variables measured on different scales are not directly comparable. Secondly, the performance of the team around a player seems to affect his plus/minus score more than his own actions, and that makes intuitive sense, as basketball is a team sport in which outside circumstances can control much more than a single player can. Lastly, we must put our findings into perspective. We have been saying that an adjusted r-squared of 0.4991 and an RMSE slightly below 8 are good outcomes, but numerically, they don’t seem very good. However, it is important to remember that what we have attempted to do was never going to yield optimal results; it is impossible to quantify and detect exactly the things a player can do to have the maximum possible positive impact on a game. Still, this exercise has shed some light on what aspects may nudge a player’s impact in a more positive direction, and that is not something to be taken lightly. Furthermore, the large impact of advanced statistics such as offensive and defensive ratings shows the developments that have been made in the realm of basketball analytics, and the future of the field appears bright. Maybe one day, with further developments, the question of what determines a player’s plus/minus value can be more reliably answered, and a blueprint to basketball success can become more concrete.