For this project, I decided to predict whether a team in the National Basketball Association (NBA) would partake in postseason play given the squad's traditional statistics, such as total field goals, turnovers, rebounds, and free throws. To follow along, I have provided my seed code below so you can generate identical random sequences.
set.seed(5)
My data for this analytical survey comes from nothing but the best: basketball-reference.com. Beyond the most recent statistics, I drew team statistics dating back to the 1980 season, before which team data was logged on the site differently. Using Hadley Wickham's rvest R package, scraping the data for each year was distilled down to a rather simple call to the page. Once the data had been collected, I used rbind to merge the data frames from the different years. Initially, the "Year" predictor was not listed, so I used dplyr (another Wickham package) to mutate the data. Finally, as the column names were not formatted in a clear fashion, I used the stringr package to replace "." with "per", which made sense in the context of the data.
# Scrape the team-stats table for each season (rvest, dplyr, stringr are used below)
library(rvest)
library(dplyr)
library(stringr)
nba <- data.frame()
years <- 1980:2016
for(y in years) {
  nba.y <- read_html(paste("http://www.basketball-reference.com/leagues/NBA_",y,".html#all_team_stats",sep="")) %>%
    html_nodes(css = "#team") %>% html_table %>% data.frame %>% mutate(Year = y)
  if (ncol(nba) > 0) names(nba.y) <- names(nba)  # keep column names consistent before rbind
  nba <- rbind(nba,nba.y)
}
rm(years,y,nba.y)
# Get Column Names
names.no.year <- read_html("http://www.basketball-reference.com/leagues/NBA_2016.html#all_team_stats") %>%
html_nodes(css = "#team") %>% html_table %>% data.frame %>% names
names(nba) <- c(names.no.year,"Year")
rm(names.no.year)
# Fix Up Column Names
names(nba) <- str_replace_all(names(nba), "\\.", "per")
head(nba)
## Rk Team G MP FG FGA FGper X3P X3PA X3Pper X2P
## 1 1 San Antonio Spurs* 82 19755 3856 7738 0.498 52 206 0.252 3804
## 2 2 Los Angeles Lakers* 82 19880 3898 7368 0.529 20 100 0.200 3878
## 3 3 Cleveland Cavaliers 82 19930 3811 8041 0.474 36 187 0.193 3775
## 4 4 New York Knicks 82 19780 3802 7672 0.496 42 191 0.220 3760
## 5 5 Boston Celtics* 82 19880 3617 7387 0.490 162 422 0.384 3455
## 6 6 Indiana Pacers 82 19855 3639 7689 0.473 88 314 0.280 3551
## X2PA X2Pper FT FTA FTper ORB DRB TRB AST STL BLK TOV PF PTS
## 1 7532 0.505 2024 2528 0.801 1153 2515 3668 2326 771 333 1589 2103 9788
## 2 7268 0.534 1622 2092 0.775 1085 2653 3738 2413 774 546 1639 1784 9438
## 3 7854 0.481 1702 2205 0.772 1307 2381 3688 2108 764 342 1370 1934 9360
## 4 7481 0.503 1698 2274 0.747 1236 2303 3539 2265 881 457 1613 2168 9344
## 5 6965 0.496 1907 2449 0.779 1227 2457 3684 2198 809 308 1539 1974 9303
## 6 7375 0.481 1753 2333 0.751 1398 2326 3724 2148 900 530 1517 1973 9119
## PTSperG Year
## 1 119.4 1980
## 2 115.1 1980
## 3 114.1 1980
## 4 114.0 1980
## 5 113.5 1980
## 6 111.2 1980
As is the case with most sports data, the majority of the statistics listed below are numeric (integer or floating point) values. However, the column of utmost importance for this regression study, POFF (added later), contains strictly logical values, a variable type which can take one of two values: TRUE (the team made the playoffs) and FALSE (the unfortunate alternative).
sapply(nba, class)
## Rk Team G MP FG FGA
## "integer" "character" "integer" "integer" "integer" "integer"
## FGper X3P X3PA X3Pper X2P X2PA
## "numeric" "integer" "integer" "numeric" "integer" "integer"
## X2Pper FT FTA FTper ORB DRB
## "numeric" "integer" "integer" "numeric" "integer" "integer"
## TRB AST STL BLK TOV PF
## "integer" "integer" "integer" "integer" "integer" "integer"
## PTS PTSperG Year
## "integer" "numeric" "integer"
In the case of the POFF variable, which stands for "playoff," the TRUE/FALSE values were assigned based on the presence of an asterisk in the team name, which basketball-reference.com uses to mark playoff teams.
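As a quick illustration of that convention, using two team names from the head of the table above, str_detect flags the asterisk directly:
str_detect(c("Boston Celtics*", "Indiana Pacers"), "\\*")
## [1]  TRUE FALSE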
To prepare this data for proper logistic regression, a few predictor columns were stricken from the record, namely the rank (Rk) and the team name (Team). Additionally, the data from basketball-reference.com contained a "League Average" row, which held an NA value for the team rank; accordingly, I filtered such rows out of the set. The POFF column was described above and is implemented in the code below with the help of another Wickham function from the stringr package: str_detect.
# Remove Rk/Team Column & Add Playoff Data
nba <- mutate(nba, POFF = str_detect(Team, "\\*")) %>%
  filter(!is.na(Rk)) %>% dplyr::select(-Rk, -Team)
head(nba)
## G MP FG FGA FGper X3P X3PA X3Pper X2P X2PA X2Pper FT FTA
## 1 82 19755 3856 7738 0.498 52 206 0.252 3804 7532 0.505 2024 2528
## 2 82 19880 3898 7368 0.529 20 100 0.200 3878 7268 0.534 1622 2092
## 3 82 19930 3811 8041 0.474 36 187 0.193 3775 7854 0.481 1702 2205
## 4 82 19780 3802 7672 0.496 42 191 0.220 3760 7481 0.503 1698 2274
## 5 82 19880 3617 7387 0.490 162 422 0.384 3455 6965 0.496 1907 2449
## 6 82 19855 3639 7689 0.473 88 314 0.280 3551 7375 0.481 1753 2333
## FTper ORB DRB TRB AST STL BLK TOV PF PTS PTSperG Year POFF
## 1 0.801 1153 2515 3668 2326 771 333 1589 2103 9788 119.4 1980 TRUE
## 2 0.775 1085 2653 3738 2413 774 546 1639 1784 9438 115.1 1980 TRUE
## 3 0.772 1307 2381 3688 2108 764 342 1370 1934 9360 114.1 1980 FALSE
## 4 0.747 1236 2303 3539 2265 881 457 1613 2168 9344 114.0 1980 FALSE
## 5 0.779 1227 2457 3684 2198 809 308 1539 1974 9303 113.5 1980 TRUE
## 6 0.751 1398 2326 3724 2148 900 530 1517 1973 9119 111.2 1980 FALSE
In order to better understand the data as well as the patterns to be found within, I decided to use the ggplot2 package for visualization. The plots below show the relationship between a team's points per game and its playoff chances. The first plot (left) shows a few outliers; these come from the lockout-shortened 1998-99 season, when the owners' lockout removed 39% of the usual games (82 -> 50). On the right plot, the outliers are identified by colouring each point by games played (G).
library(ggplot2)
library(gridExtra)
g1 <- ggplot(nba, aes(x = PTSperG, y = POFF, colour = POFF)) + geom_point()
g2 <- ggplot(nba, aes(x = PTSperG, y = POFF, colour = G)) + geom_point()
grid.arrange(g1,g2,ncol=2,nrow=1)
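To confirm that those stray points come from shortened seasons rather than bad data, a quick dplyr check (a small sketch using the nba frame built above) lists the seasons in which teams played fewer than 82 games:
# Seasons in which teams played fewer than the usual 82 games
nba %>% filter(G < 82) %>% count(Year)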
The next two plots show that playoff chances are (1) positively associated with three-point percentage and (2) negatively associated with turnovers.
g3 <- ggplot(nba, aes(X3Pper, POFF, colour=POFF)) + geom_point()
g4 <- ggplot(nba, aes(TOV, POFF, colour=POFF)) + geom_point() + theme(legend.position="none")
grid.arrange(g3,g4,ncol=2,nrow=1)
In deciding whether or not a team will make the NBA playoffs, the variable of interest is binary: TRUE or FALSE. Consequently, logistic regression and a decision tree are two appropriate models.
Before fitting and assessing any models, I split the data into training and testing sets, choosing an 80-20 divide.
# Split Data (train/test) -- createDataPartition comes from the caret package
library(caret)
index <- createDataPartition(nba$POFF, p = 0.8)[[1]]
train.data <- nba[index,]
test.data <- nba[-index,]
For the first type of model, I looked at logistic regression. To begin, let us consider the saturated model: the one containing all the variables available in the dataset.
# fully saturated glm
# nba.glm <- glm(POFF ~ ., data=train.data, family=binomial) <<< does not converge
nba.glm2 <- glm(POFF ~ . - X2P - X2PA - TRB - PTS, data=train.data, family=binomial)
predictions.glm2 <- ifelse(predict(nba.glm2, test.data, type="response") > 0.5, "TRUE", "FALSE")
### Accuracy - 0.8713
During my testing of the fully saturated model, I encountered a warning that the model had not converged. On further investigation, it turned out that four of the variables in the glm were singularities: they were linear combinations of the other columns. In this case, X2P, X2PA, TRB, and PTS can each be defined in terms of the other predictors and hence offered no further information to the model, so they were stricken from it. The next model of interest was the parsimonious model, which retains only the predictors of statistical significance.
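Before fitting it, the dependencies just described are easy to confirm directly from the scraped columns; each of the checks below should return TRUE if the scrape is clean (a quick sanity sketch, not part of the original modelling code):
# Each dropped column is an exact linear combination of retained predictors
all(nba$X2P  == nba$FG  - nba$X3P)                  # 2P  = FG  - 3P
all(nba$X2PA == nba$FGA - nba$X3PA)                 # 2PA = FGA - 3PA
all(nba$TRB  == nba$ORB + nba$DRB)                  # TRB = ORB + DRB
all(nba$PTS  == 2*nba$X2P + 3*nba$X3P + nba$FT)     # PTS = 2*2P + 3*3P + FT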
# parsimonious glm
nba.pars <- glm(POFF ~ FGA + X3Pper + ORB + DRB + AST + STL + TOV + Year,
data=train.data, family=binomial)
predictions.pars <- ifelse(predict(nba.pars, test.data, type="response") > 0.5, "TRUE", "FALSE")
### Accuracy -- 0.7871
The second model, composed of the variables above, actually performed worse than the first. So although parsimony normally receives preferential treatment, dropping predictors here proved detrimental to accuracy. To test the accuracy of each model, I used the predict function and an ifelse statement, as shown above. In the following section (Model Evaluation & Results), I use caret's confusionMatrix function to analyze each model further. The third and final regression model was generated by the stepAIC function in the MASS package, which performs stepwise selection, adding and dropping predictors from the supplied model and keeping the combination that minimizes the AIC (Akaike Information Criterion).
# AIC-optimized glm
to.optimize <- glm(POFF ~ ., train.data, family=binomial)
nba.aic <- MASS::stepAIC(to.optimize, direction="both")
predictions.aic <- ifelse(predict(nba.aic, test.data, type="response") > 0.5, "TRUE", "FALSE")
summary(nba.aic)
##
## Call:
## glm(formula = POFF ~ G + FGA + X3Pper + FTA + ORB + DRB + AST +
## STL + BLK + TOV + PTSperG + Year, family = binomial, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.69520 -0.38837 0.06235 0.43081 2.52027
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.813e+02 5.210e+01 5.400 6.68e-08 ***
## G 6.094e-01 9.461e-02 6.442 1.18e-10 ***
## FGA -1.416e-02 1.241e-03 -11.417 < 2e-16 ***
## X3Pper -1.001e+01 3.670e+00 -2.727 0.006395 **
## FTA -3.654e-03 9.516e-04 -3.840 0.000123 ***
## ORB 1.218e-02 1.676e-03 7.270 3.60e-13 ***
## DRB 1.508e-02 1.443e-03 10.453 < 2e-16 ***
## AST 4.916e-03 1.097e-03 4.482 7.38e-06 ***
## STL 1.970e-02 2.005e-03 9.826 < 2e-16 ***
## BLK 2.498e-03 1.629e-03 1.534 0.125107
## TOV -1.781e-02 1.643e-03 -10.838 < 2e-16 ***
## PTSperG 5.167e-01 6.289e-02 8.217 < 2e-16 ***
## Year -1.626e-01 2.671e-02 -6.086 1.16e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1110.72 on 811 degrees of freedom
## Residual deviance: 494.08 on 799 degrees of freedom
## AIC: 520.08
##
## Number of Fisher Scoring iterations: 6
### Accuracy -- 0.8812
In the case of the AIC-selected model, the formula exhibits 88% accuracy in determining whether or not a team will make the playoffs. Looking at the variables in the model, nearly every one (save the BLK column) is statistically significant. Judging by p-value, the most important columns were FGA, STL, TOV, and PTSperG. Naturally, this makes sense, as they represent the two most crucial facets of the sport: offense (FGA and PTSperG) and defense (STL and TOV).
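To rank the predictors by p-value rather than scanning the summary by eye, the coefficient table of the fitted model can be sorted directly (a small convenience sketch; coefs is just a temporary name):
# Order the AIC-selected model's coefficients by p-value (smallest first)
coefs <- summary(nba.aic)$coefficients
coefs[order(coefs[, "Pr(>|z|)"]), ]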
For my fourth model, I decided to grow a decision tree. With the output variable taking one of two possible values, the rpart package and its eponymous function seemed most apt. Again, paired with the predict function and an ifelse statement, I was able to ascertain the degree of accuracy for this model.
# rpart decision tree
library(rpart)
nba.rpart <- rpart(POFF ~ ., data=train.data)
predictions.rpart <- ifelse(predict(nba.rpart, test.data) > 0.5, "TRUE", "FALSE")
### Accuracy --- 0.797
Below I have re-created the decision tree for the model via the fancyRpartPlot function in the rattle package.
library(rattle)
library(rpart.plot)
fancyRpartPlot(nba.rpart)
To evaluate my first model, which contained the most variables (21, excluding singularities), I examined the results from the confusionMatrix function. Beyond providing the sensitivity and specificity of the model, it lists the accuracy first.
confusionMatrix(predictions.glm2, test.data$POFF, positive="TRUE")
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 74 13
## TRUE 13 102
##
## Accuracy : 0.8713
## 95% CI : (0.8171, 0.9142)
## No Information Rate : 0.5693
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7375
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8870
## Specificity : 0.8506
## Pos Pred Value : 0.8870
## Neg Pred Value : 0.8506
## Prevalence : 0.5693
## Detection Rate : 0.5050
## Detection Prevalence : 0.5693
## Balanced Accuracy : 0.8688
##
## 'Positive' Class : TRUE
##
With 87.13% accuracy, the first model appears to perform spectacularly. However, as I learned in class, parsimonious models are looked upon more favorably; they are the cream of the crop. Accordingly, the second model used fewer variables of proven statistical significance: only eight.
confusionMatrix(predictions.pars, test.data$POFF, positive = "TRUE")
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 65 21
## TRUE 22 94
##
## Accuracy : 0.7871
## 95% CI : (0.7242, 0.8415)
## No Information Rate : 0.5693
## P-Value [Acc > NIR] : 6.475e-11
##
## Kappa : 0.5653
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8174
## Specificity : 0.7471
## Pos Pred Value : 0.8103
## Neg Pred Value : 0.7558
## Prevalence : 0.5693
## Detection Rate : 0.4653
## Detection Prevalence : 0.5743
## Balanced Accuracy : 0.7823
##
## 'Positive' Class : TRUE
##
With an accuracy of 78.71%, the second model did not capture all of the important information in the data. Though this simple glm indeed beats flipping a fair, two-sided coin, there is much room for improvement. To find said improvement, I used the MASS library's stepAIC function. Using the confusionMatrix function, I found it produced the best model of the lot, boasting 88.12% accuracy.
confusionMatrix(predictions.aic, test.data$POFF, positive = "TRUE")
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 75 12
## TRUE 12 103
##
## Accuracy : 0.8812
## 95% CI : (0.8284, 0.9224)
## No Information Rate : 0.5693
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7577
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8957
## Specificity : 0.8621
## Pos Pred Value : 0.8957
## Neg Pred Value : 0.8621
## Prevalence : 0.5693
## Detection Rate : 0.5099
## Detection Prevalence : 0.5693
## Balanced Accuracy : 0.8789
##
## 'Positive' Class : TRUE
##
Taking a slightly different approach for the fourth model, I tested the accuracy of a decision tree in determining playoff chances. However, the model fell short of the first and third glm models, barely surpassing the parsimonious model with a less-than-lusty 79.7%.
confusionMatrix(predictions.rpart, test.data$POFF, positive="TRUE")
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 69 23
## TRUE 18 92
##
## Accuracy : 0.797
## 95% CI : (0.7349, 0.8502)
## No Information Rate : 0.5693
## P-Value [Acc > NIR] : 7.693e-12
##
## Kappa : 0.589
## Mcnemar's Test P-Value : 0.5322
##
## Sensitivity : 0.8000
## Specificity : 0.7931
## Pos Pred Value : 0.8364
## Neg Pred Value : 0.7500
## Prevalence : 0.5693
## Detection Rate : 0.4554
## Detection Prevalence : 0.5446
## Balanced Accuracy : 0.7966
##
## 'Positive' Class : TRUE
##
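Before closing, it is convenient to see the four test-set accuracies side by side; the short sketch below recomputes them from the prediction vectors defined earlier (the accuracy helper is my own convenience function):
# Test-set accuracy for each of the four models
accuracy <- function(pred) mean(pred == as.character(test.data$POFF))
sapply(list(saturated    = predictions.glm2,
            parsimonious = predictions.pars,
            stepAIC      = predictions.aic,
            rpart        = predictions.rpart),
       accuracy)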
Whilst discussing this model with a few colleagues over a mighty fine steak dinner, one expressed qualms with my project. He pointed out that this model only accurately predicts a team's chances after all regular-season play has come to an end, at which point the team, if not the coach/manager, should already know whether or not the franchise will be seeing the postseason. However, looking at this project in a different light, any and every team can look at the compiled data from the last 36 years in order to assess their progress during the season. In order to improve these models, I could dig into more gameplay-specific, non-traditional statistics. For example, stats.nba.com lists effective field goal percentage (a stat that is adjusted to balance the importance of two/three-point shots) and PACE (a stat that measures a team's number of possessions per 48 minutes).
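As a starting point, effective field goal percentage does not even require new data; it can be derived from the columns already scraped using the standard formula eFG% = (FG + 0.5 * 3P) / FGA (a minimal sketch; the eFGper column name is my own). PACE, by contrast, depends on possession counts that are not in this table and would require pulling the advanced-stats pages.
# Effective field goal percentage: weights made three-pointers at 1.5x
nba <- mutate(nba, eFGper = (FG + 0.5 * X3P) / FGA)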