Introduction

For this project, I decided to predict whether a team in the National Basketball Association (NBA) would partake in postseason play given the squad's traditional statistics, such as total field goals, turnovers, rebounds, and free throws. In order to follow along, I have provided the seed I used, so you can generate identical random sequences.

set.seed(5)

Data Origin & Collection

My data for this analytical survey comes from nothing but the best: basketball-reference.com. Going beyond the most recent statistics, I drew team statistics dating back to the 1980 season, before which team data was logged on the site differently. Using Hadley Wickham's rvest R package, scraping the data for each year was distilled down to a rather simple call to the page. Once the data had been collected, I used rbind to merge the data frames from different years. Initially, the “Year” predictor was not listed, so I used dplyr (another Wickham package) to mutate the data. Finally, as the column names were not formatted in a clear fashion, I used the stringr package to replace “.” with “per”, which made sense in the context of the data.

library(rvest)    # read_html, html_nodes, html_table
library(dplyr)    # mutate, filter, select
library(stringr)  # str_replace_all, str_detect

nba <- data.frame()
years <- 1980:2016
for (y in years) {
  nba.y <- read_html(paste("http://www.basketball-reference.com/leagues/NBA_", y, ".html#all_team_stats", sep = "")) %>%
    html_nodes(css = "#team") %>% html_table %>% data.frame %>% mutate(Year = y)
  # Align this year's columns with those already collected; skip the first pass,
  # when nba is still empty and has no names to copy
  if (nrow(nba) > 0) names(nba.y) <- names(nba)
  nba <- rbind(nba, nba.y)
}
rm(years, y, nba.y)
# Get Column Names
names.no.year <- read_html("http://www.basketball-reference.com/leagues/NBA_2016.html#all_team_stats") %>%
  html_nodes(css = "#team") %>% html_table %>% data.frame %>% names
names(nba) <- c(names.no.year, "Year")
rm(names.no.year)
# Fix Up Column Names: data.frame() turns characters like "%" into ".", so swap "." for "per"
names(nba) <- str_replace_all(names(nba), "\\.", "per")
head(nba)
##   Rk                Team  G    MP   FG  FGA FGper X3P X3PA X3Pper  X2P
## 1  1  San Antonio Spurs* 82 19755 3856 7738 0.498  52  206  0.252 3804
## 2  2 Los Angeles Lakers* 82 19880 3898 7368 0.529  20  100  0.200 3878
## 3  3 Cleveland Cavaliers 82 19930 3811 8041 0.474  36  187  0.193 3775
## 4  4     New York Knicks 82 19780 3802 7672 0.496  42  191  0.220 3760
## 5  5     Boston Celtics* 82 19880 3617 7387 0.490 162  422  0.384 3455
## 6  6      Indiana Pacers 82 19855 3639 7689 0.473  88  314  0.280 3551
##   X2PA X2Pper   FT  FTA FTper  ORB  DRB  TRB  AST STL BLK  TOV   PF  PTS
## 1 7532  0.505 2024 2528 0.801 1153 2515 3668 2326 771 333 1589 2103 9788
## 2 7268  0.534 1622 2092 0.775 1085 2653 3738 2413 774 546 1639 1784 9438
## 3 7854  0.481 1702 2205 0.772 1307 2381 3688 2108 764 342 1370 1934 9360
## 4 7481  0.503 1698 2274 0.747 1236 2303 3539 2265 881 457 1613 2168 9344
## 5 6965  0.496 1907 2449 0.779 1227 2457 3684 2198 809 308 1539 1974 9303
## 6 7375  0.481 1753 2333 0.751 1398 2326 3724 2148 900 530 1517 1973 9119
##   PTSperG Year
## 1   119.4 1980
## 2   115.1 1980
## 3   114.1 1980
## 4   114.0 1980
## 5   113.5 1980
## 6   111.2 1980
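As a quick sanity check on the scrape (my own addition, not part of the original pipeline; exact row counts depend on the number of franchises per season), the coverage can be confirmed with:

dim(nba)         # one row per team-season, plus one "League Average" row per year
table(nba$Year)  # every season from 1980 through 2016 should appear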

Data Explanation

As is the case with most sports data, the majority of the statistics listed below are numerical (integer or floating-point) values. However, the column of utmost importance for this regression study, POFF (added later), contains strictly logical values, a variable type which can take one of two values: TRUE (the team made the playoffs) or FALSE (the unfortunate alternative).

sapply(nba, class)
##          Rk        Team           G          MP          FG         FGA 
##   "integer" "character"   "integer"   "integer"   "integer"   "integer" 
##       FGper         X3P        X3PA      X3Pper         X2P        X2PA 
##   "numeric"   "integer"   "integer"   "numeric"   "integer"   "integer" 
##      X2Pper          FT         FTA       FTper         ORB         DRB 
##   "numeric"   "integer"   "integer"   "numeric"   "integer"   "integer" 
##         TRB         AST         STL         BLK         TOV          PF 
##   "integer"   "integer"   "integer"   "integer"   "integer"   "integer" 
##         PTS     PTSperG        Year 
##   "integer"   "numeric"   "integer"

In the case of the POFF variable, which stands for “playoff,” the TRUE/FALSE values were assessed on the basis of the presence of an asterisk in the team name.


Preparation

To prepare this data for proper logistic regression, a few predictor columns were stricken from the record, namely the rank (Rk) and the team name (Team). Additionally, the data from basketball-reference.com contained a “League Average” row for each season, which held an NA value for the team rank; accordingly, I filtered such rows out of the set. The POFF column, described above, is implemented in the section below with the help of another Wickham-based function from the stringr package: str_detect.

# Remove Rk/Team Column & Add Playoff Data
# Remove Rk/Team Column & Add Playoff Data
nba <- nba %>%
  mutate(POFF = str_detect(Team, "\\*")) %>%  # an asterisk marks a playoff team
  filter(!is.na(Rk)) %>%                      # drop "League Average" rows
  dplyr::select(-Rk, -Team)
head(nba)
##    G    MP   FG  FGA FGper X3P X3PA X3Pper  X2P X2PA X2Pper   FT  FTA
## 1 82 19755 3856 7738 0.498  52  206  0.252 3804 7532  0.505 2024 2528
## 2 82 19880 3898 7368 0.529  20  100  0.200 3878 7268  0.534 1622 2092
## 3 82 19930 3811 8041 0.474  36  187  0.193 3775 7854  0.481 1702 2205
## 4 82 19780 3802 7672 0.496  42  191  0.220 3760 7481  0.503 1698 2274
## 5 82 19880 3617 7387 0.490 162  422  0.384 3455 6965  0.496 1907 2449
## 6 82 19855 3639 7689 0.473  88  314  0.280 3551 7375  0.481 1753 2333
##   FTper  ORB  DRB  TRB  AST STL BLK  TOV   PF  PTS PTSperG Year  POFF
## 1 0.801 1153 2515 3668 2326 771 333 1589 2103 9788   119.4 1980  TRUE
## 2 0.775 1085 2653 3738 2413 774 546 1639 1784 9438   115.1 1980  TRUE
## 3 0.772 1307 2381 3688 2108 764 342 1370 1934 9360   114.1 1980 FALSE
## 4 0.747 1236 2303 3539 2265 881 457 1613 2168 9344   114.0 1980 FALSE
## 5 0.779 1227 2457 3684 2198 809 308 1539 1974 9303   113.5 1980  TRUE
## 6 0.751 1398 2326 3724 2148 900 530 1517 1973 9119   111.2 1980 FALSE

Exploratory Analysis

In order to better understand the data and the patterns within, I used the ggplot2 package for visualization. The plots below show the relationship between a team's points per game and its playoff chances. The first plot (left) shows a few outliers; these come from the 1998-99 season, when an owner-imposed lockout removed 39% of the usual games (82 -> 50) and scoring dipped accordingly. On the right plot, coloring by games played (G) identifies the outliers.

library(ggplot2)    # grammar-of-graphics plotting
library(gridExtra)  # grid.arrange, for side-by-side plots

g1 <- ggplot(nba, aes(x = PTSperG, y = POFF, colour = POFF)) + geom_point()
g2 <- ggplot(nba, aes(x = PTSperG, y = POFF, colour = G)) + geom_point()
grid.arrange(g1, g2, ncol = 2, nrow = 1)
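The lockout-shortened seasons behind those outliers can be confirmed directly; a quick check of my own using the dplyr verbs already loaded:

# Seasons in which teams played fewer than 82 games
nba %>% filter(G < 82) %>% distinct(Year, G)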

The next two plots show playoff chances' (1) direct proportionality with three-point percentage and (2) inverse proportionality with turnovers.

g3 <- ggplot(nba, aes(X3Pper, POFF, colour=POFF)) + geom_point()
g4 <- ggplot(nba, aes(TOV, POFF, colour=POFF)) + geom_point() + theme(legend.position="none")
grid.arrange(g3,g4,ncol=2,nrow=1)
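The visual impression can be quantified with a grouped summary (a quick sketch of my own, not part of the original analysis):

nba %>% group_by(POFF) %>%
  summarise(mean.X3Pper = mean(X3Pper, na.rm = TRUE),
            mean.TOV    = mean(TOV))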


Modeling

In deciding whether or not a team will make the NBA playoffs, the variable of interest is binary: TRUE or FALSE. Consequently, logistic regression and a decision tree are two appropriate models.
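Formally, logistic regression models the log-odds of making the playoffs as a linear combination of the predictors:

$$\log \frac{P(\mathrm{POFF} = \mathrm{TRUE})}{1 - P(\mathrm{POFF} = \mathrm{TRUE})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$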

Before fitting and assessing any models, I split the data into training and testing sets, opting for an 80-20 divide.

library(caret)  # createDataPartition, confusionMatrix

# Split Data (train/test)
index <- createDataPartition(nba$POFF, p = 0.8)[[1]]
train.data <- nba[index, ]
test.data  <- nba[-index, ]
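Since createDataPartition samples within the levels of the outcome, the playoff proportion should be roughly preserved across the two sets; a quick check of my own:

# Playoff rate should be similar in train and test
mean(train.data$POFF); mean(test.data$POFF)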

Logistic Regression

For the first type of model, I looked at logistic regression. To begin, let us consider the saturated model - that which contains all the variables available in the dataset.

# fully saturated glm
# nba.glm <- glm(POFF ~ ., data=train.data, family=binomial) <<< does not converge
nba.glm2 <- glm(POFF ~ . - X2P - X2PA - TRB - PTS, data=train.data, family=binomial)
predictions.glm2 <- ifelse(predict(nba.glm2, test.data, type="response") > 0.5, "TRUE", "FALSE")
### Accuracy - 0.8713
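The accuracy figures quoted in these comments come from simple agreement between the character predictions and the logical POFF column; a one-line check of my own (not in the original script):

# Proportion of correct predictions on the held-out set
mean(predictions.glm2 == as.character(test.data$POFF))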

During my testing of the completely saturated model, I encountered a message warning of a non-convergent model. After further investigation, it emerged that four of the variables in the glm were singularities; they were linear combinations of the other columns. In this case, X2P, X2PA, TRB, and PTS could be defined in terms of the other predictors and hence offered no further information to the model. In this light, they were stricken from the model.
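These dependencies follow from box-score identities and can be verified directly (a quick check of my own, not part of the original script):

with(nba, all(FG  == X2P  + X3P))            # field goals = twos + threes
with(nba, all(FGA == X2PA + X3PA))           # attempts, likewise
with(nba, all(TRB == ORB  + DRB))            # total rebounds = offensive + defensive
with(nba, all(PTS == FT + 2*X2P + 3*X3P))    # the points identity

The next model of interest was the parsimonious model, which contains only the predictors of statistical significance.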

# parsimonious glm
nba.pars <- glm(POFF ~ FGA + X3Pper + ORB + DRB + AST + STL + TOV + Year,
                data=train.data, family=binomial)
predictions.pars <- ifelse(predict(nba.pars, test.data, type="response") > 0.5, "TRUE", "FALSE")
### Accuracy -- 0.7871

The second model, composed of the above variables, actually performed more poorly than the first. So although parsimony receives preferential treatment, the reduced set of predictors proved detrimental to the accuracy. To test each model's accuracy, I used the predict function and an ifelse statement, as seen above. In the following section (Model Evaluation & Results), I use caret's confusionMatrix function to analyze each model further. The third and final member of the regression model set was generated by the stepAIC function in the MASS package. This function performs a stepwise search, adding and dropping predictors while recomputing the AIC (Akaike Information Criterion) at each step, and returns the model that minimizes it.

# AIC-optimized glm
to.optimize <- glm(POFF ~ ., train.data, family=binomial)
nba.aic <- MASS::stepAIC(to.optimize, direction="both")
predictions.aic <- ifelse(predict(nba.aic, test.data, type="response") > 0.5, "TRUE", "FALSE")
summary(nba.aic)
## 
## Call:
## glm(formula = POFF ~ G + FGA + X3Pper + FTA + ORB + DRB + AST + 
##     STL + BLK + TOV + PTSperG + Year, family = binomial, data = train.data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.69520  -0.38837   0.06235   0.43081   2.52027  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.813e+02  5.210e+01   5.400 6.68e-08 ***
## G            6.094e-01  9.461e-02   6.442 1.18e-10 ***
## FGA         -1.416e-02  1.241e-03 -11.417  < 2e-16 ***
## X3Pper      -1.001e+01  3.670e+00  -2.727 0.006395 ** 
## FTA         -3.654e-03  9.516e-04  -3.840 0.000123 ***
## ORB          1.218e-02  1.676e-03   7.270 3.60e-13 ***
## DRB          1.508e-02  1.443e-03  10.453  < 2e-16 ***
## AST          4.916e-03  1.097e-03   4.482 7.38e-06 ***
## STL          1.970e-02  2.005e-03   9.826  < 2e-16 ***
## BLK          2.498e-03  1.629e-03   1.534 0.125107    
## TOV         -1.781e-02  1.643e-03 -10.838  < 2e-16 ***
## PTSperG      5.167e-01  6.289e-02   8.217  < 2e-16 ***
## Year        -1.626e-01  2.671e-02  -6.086 1.16e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1110.72  on 811  degrees of freedom
## Residual deviance:  494.08  on 799  degrees of freedom
## AIC: 520.08
## 
## Number of Fisher Scoring iterations: 6
### Accuracy -- 0.8812

In the case of the AIC-selected model, the formula exhibits 88% accuracy in determining whether or not a team will make the playoffs. Looking at the variables in the model, it is apparent that nearly every variable (save the BLK column) is statistically significant. According to p-value, the most important columns were FGA, STL, TOV, and PTSperG. Naturally, this makes sense, as these represent the two most crucial facets of the sport - offense (1 & 4) and defense (2 & 3).

Decision Tree

For my fourth model, I decided to grow a decision tree. With the output variable taking one of two possible values, the rpart package and its eponymous function seemed most apt. Again, paired with the predict function and an ifelse statement, I was able to ascertain the degree of accuracy for this model.

library(rpart)  # recursive partitioning decision trees

# rpart decision tree
nba.rpart <- rpart(POFF ~ ., data = train.data)
predictions.rpart <- ifelse(predict(nba.rpart, test.data) > 0.5, "TRUE", "FALSE")
### Accuracy --- 0.797
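One detail worth noting: because POFF is stored as a logical, rpart defaults to a regression tree whose predictions are fitted probabilities, which is why the 0.5 threshold above works. An explicit classification tree is an alternative (a hypothetical variant of my own, not the model evaluated below):

# Hypothetical alternative: a classification tree on a factor outcome
nba.rpart.cls <- rpart(factor(POFF) ~ ., data = train.data, method = "class")
predictions.cls <- predict(nba.rpart.cls, test.data, type = "class")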

Below I have visualized the decision tree via the fancyRpartPlot function in the rattle package.

library(rattle)
library(rpart.plot)
fancyRpartPlot(nba.rpart)


Model Evaluation & Results

To evaluate my first model, which contained the most variables (21, excluding singularities), I examined the output of the confusionMatrix function, which lists the accuracy first, alongside the model's sensitivity and specificity.

confusionMatrix(predictions.glm2, test.data$POFF, positive="TRUE")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    74   13
##      TRUE     13  102
##                                           
##                Accuracy : 0.8713          
##                  95% CI : (0.8171, 0.9142)
##     No Information Rate : 0.5693          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7375          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8870          
##             Specificity : 0.8506          
##          Pos Pred Value : 0.8870          
##          Neg Pred Value : 0.8506          
##              Prevalence : 0.5693          
##          Detection Rate : 0.5050          
##    Detection Prevalence : 0.5693          
##       Balanced Accuracy : 0.8688          
##                                           
##        'Positive' Class : TRUE            
## 
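(A reproducibility note: recent caret releases require both arguments to confusionMatrix to be factors with identical levels, so if the character vectors above throw an error, coerce them first. A sketch:)

confusionMatrix(factor(predictions.glm2, levels = c("FALSE", "TRUE")),
                factor(test.data$POFF,  levels = c("FALSE", "TRUE")),
                positive = "TRUE")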

With 87.13% accuracy, the first model appears to perform spectacularly. However, as I learned in class, parsimonious models are looked upon more favorably - the cream of the crop. Accordingly, the second model used fewer variables - only the eight of proven statistical significance.

confusionMatrix(predictions.pars, test.data$POFF, positive = "TRUE")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    65   21
##      TRUE     22   94
##                                           
##                Accuracy : 0.7871          
##                  95% CI : (0.7242, 0.8415)
##     No Information Rate : 0.5693          
##     P-Value [Acc > NIR] : 6.475e-11       
##                                           
##                   Kappa : 0.5653          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8174          
##             Specificity : 0.7471          
##          Pos Pred Value : 0.8103          
##          Neg Pred Value : 0.7558          
##              Prevalence : 0.5693          
##          Detection Rate : 0.4653          
##    Detection Prevalence : 0.5743          
##       Balanced Accuracy : 0.7823          
##                                           
##        'Positive' Class : TRUE            
## 

The second model did not capture all the important information in the data, finishing with an accuracy of 78.71%. Though this simple glm indeed beats flipping a fair, two-sided coin, there exists much room for improvement. To find said improvement, I turned to the MASS library's stepAIC function, whose output proved to be the best model of the lot per the confusionMatrix function, boasting 88.12% accuracy.

confusionMatrix(predictions.aic, test.data$POFF, positive = "TRUE")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    75   12
##      TRUE     12  103
##                                           
##                Accuracy : 0.8812          
##                  95% CI : (0.8284, 0.9224)
##     No Information Rate : 0.5693          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7577          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8957          
##             Specificity : 0.8621          
##          Pos Pred Value : 0.8957          
##          Neg Pred Value : 0.8621          
##              Prevalence : 0.5693          
##          Detection Rate : 0.5099          
##    Detection Prevalence : 0.5693          
##       Balanced Accuracy : 0.8789          
##                                           
##        'Positive' Class : TRUE            
## 

Taking a slightly different approach for the fourth model, I tested the accuracy of a decision tree in determining playoff chances. However, the model fell short of the first and third glm models, barely surpassing the parsimonious model with a modest 79.7%.

confusionMatrix(predictions.rpart, test.data$POFF, positive="TRUE")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    69   23
##      TRUE     18   92
##                                           
##                Accuracy : 0.797           
##                  95% CI : (0.7349, 0.8502)
##     No Information Rate : 0.5693          
##     P-Value [Acc > NIR] : 7.693e-12       
##                                           
##                   Kappa : 0.589           
##  Mcnemar's Test P-Value : 0.5322          
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.7931          
##          Pos Pred Value : 0.8364          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.5693          
##          Detection Rate : 0.4554          
##    Detection Prevalence : 0.5446          
##       Balanced Accuracy : 0.7966          
##                                           
##        'Positive' Class : TRUE            
## 

Conclusion

Whilst discussing this model with a few colleagues over a mighty fine steak dinner, one expressed qualms with my project. He pointed out that this model only accurately predicts a team's chances after all regular-season play has come to an end, at which point a team, if not the coach/manager, should know whether or not the franchise will be seeing the postseason. However, looking at this project in a different light, any and every team can compare itself against the compiled data from the last 36 years to assess its progress during the season. To improve these models, I could dig into more gameplay-specific, non-traditional statistics. For example, stats.nba.com lists effective field goal percentage (a stat adjusted to balance the value of two- and three-point shots) and PACE (a stat measuring a team's number of possessions per 48 minutes).
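As it happens, effective field goal percentage is derivable from columns already scraped, so at least one of these improvements requires no new data; a sketch of my own using the standard eFG% formula:

# eFG% credits threes as 1.5 field goals: (FG + 0.5 * X3P) / FGA
nba <- mutate(nba, eFGper = (FG + 0.5 * X3P) / FGA)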