Predicting NBA Season Point Totals

I. Introduction
II. Getting the Data
III. Feature Selection and Checking Assumptions
IV. Optimum Model by Adjusted R-Squared
V. Optimum Model by the Bayesian Information Criterion
VI. Conclusion: The Model Through Time
Appendix A. Three Point Shooting

I. Introduction

The NBA has seen an explosion of data. Advancements in tracking technologies and developments of more sophisticated measures of performance encourage richer understandings of what happens between the baselines. Furthermore, the availability and accessability of this data has never been greater, although much remains behind closed doors.

And whether framed as evolution or devolution, the NBA game has changed along with measures of its performance. A cursory search through the numbers and a glance over the trends in season totals confirm that the mainstream identity of an NBA offenses has fluctuated over time. For example, consider the plots below. The first, “Mean Three Point Rate by Year, 1980-2015” and the second, “Mean Pace by Year, 1980-2015”, both illustrate these changes.

This fluctuation problematizes predictive modeling over time with NBA statistics. In 2015, for example, three point makes were strongly correlated to team success (r = 0.60) and total points (r = 0.66). By contrast, three point makes in 1983 had the opposite effect: they were negatively correlated to wins (r = -0.32) and negatively correlated to total points (r = -0.13). To call out two explicit examples, the San Antonio Spurs hit 94 three point shots in 1983 en route to 53 wins, and 22 years later, the Houston Rockets hit 839 more three point shots (933), while winning only three more games (56). Again, the game has changed.

The purpose of this project is to explore how NBA teams have manufactured points over a thirty-five year period (1980-2015). The goal is not to predict precise point totals, but to begin to quantify stylistic changes. The 1979-1980 season is a sensible starting point, as it marked the advent of the three-point shot.

Is it possible to build a single model that transcends these fluctuations and models the competitive balance of the eighties, nineties, tens, and teens? A model that handles the grinding, slow teams of the late nineties as well as the trigger-happy, three point shooting teams of the present? Perhaps. At the very least, the attempt may provide clues to how NBA teams adapt (or fail to adapt) to a relatively fluid competitive landscape. We can pay special attention to the seasons before and after important rule changes such as the hand-check rule introduced during the 2004-2005 season, and the extension of the three-point line before the 1997-1998 season.

II. Getting the Data

The data was scraped from Basketball Reference, a popular sports website that has collected data from every game and team in the NBA’s history. Using the rvest package and the Chrome extension SelectorGadget, I created get_nba_season_data(), a function that takes a single argument, year, and returns a data.frame of 72 variables or features of team data for that season. The source code for the function is below, along with some limited commentary:

library(dplyr)
library(magrittr)
library(rvest)
library(stringr)
library(knitr)

get_nba_season_data <- function(year) {
  
  options(warn = -1) #silences later left_join() warning about coercion to factor
  
    #Concatenate url to scrape
    url <- paste0("http://www.basketball-reference.com/leagues/NBA_", year, ".html")
    
    #Get season data
    season_data <- url %>%
      read_html() %>% 
      html_nodes("#team") %>%
      html_table(header = FALSE, fill = FALSE) %>%
      data.frame()
    
    #Clean season data header
    names(season_data) <- season_data %>% 
      slice(1) %>%
      str_replace_all("%", "_pct") %>%
      str_replace_all("/", "_per_") %>%
      make.names(unique = TRUE, allow = TRUE) %>% 
      str_replace_all("[.]", "") %>%
      str_replace_all("1", "_pct") %>% 
      str_replace_all("X", "") %>% 
      tolower()
    
    #Clean season data, create new variables: year, playoff
    season_data %<>%
      slice(-c(1, nrow(.))) %>% 
      mutate(year = rep(year, nrow(.)),
             playoffs = ifelse(grepl("[*]", team), "playoffs", "lottery")) %>% 
      mutate(team = str_replace_all(team, "[*]", "")) %>% 
      select(-rk, -g)
    
    #Get miscellaneous data
    misc_data <- url %>%
      read_html() %>% 
      html_nodes("#misc") %>%
      html_table(header = FALSE, fill = FALSE) %>%
      data.frame()
    
    #Clean miscellaneous data header
    names(misc_data) <- misc_data %>% 
      slice(2) %>%
      str_replace_all("%", "_pct") %>%
      str_replace_all("/", "_per_") %>%
      make.names(unique = TRUE, allow = TRUE) %>% 
      str_replace_all("[.]", "") %>%
      str_replace_all("X", "") %>% 
      tolower()
    
    #Clean miscellaneous data
    misc_data %<>%
      slice(-c(1:2, nrow(.))) %>% 
      select(-rk, -arena, -attendance) %>% 
      mutate(team = str_replace_all(team, "[*]", "")) 
    
    #Specify certain 'opponent' variables
    names(misc_data)[(ncol(misc_data) - 4):ncol(misc_data)] <- 
      paste0('opp_', names(misc_data)[(ncol(misc_data) - 4):ncol(misc_data)]) 
    
    #Join season data and miscellaneous data
    all_data <- left_join(season_data, misc_data, by = "team")
    
    #Get oppononent data
    opp_data <- url %>%
      read_html() %>% 
      html_nodes("#opponent") %>%
      html_table(header = FALSE, fill = FALSE) %>%
      data.frame()
    
    #Clean opponent data header
    names(opp_data) <- opp_data %>% 
      slice(1) %>% 
      str_replace_all("%", "_pct") %>%
      str_replace_all("/", "_per_") %>%
      make.names(unique = TRUE, allow = TRUE) %>% 
      str_replace_all("[.]", "") %>%
      str_replace_all("X", "") %>%
      str_replace_all("[1]", "_pct") %>% 
      tolower() %>% 
      paste0("opp_", .)
    
    #Clean opponent data
    opp_data %<>%
      slice(-c(1, nrow(.))) %>% 
      select(-c(1, 3:4)) %>% 
      rename(team = opp_team) %>% 
      mutate(team = str_replace_all(team, "[*]", "")) 
    
    #Join opponent data to larger dataset
    all_data <- left_join(all_data, opp_data, by = "team")
    
    #Get wins data from different url
    wins_url <- paste0('http://www.basketball-reference.com/leagues/NBA_', year, '_standings.html')
    wins_data <- wins_url %>%
      read_html() %>% 
      html_nodes("#expanded-standings td:nth-child(2) , 
                 #expanded-standings td:nth-child(2) , 
                 #expanded-standings td:nth-child(3) , 
                 #expanded-standings .sort_default_asc+ .tooltip") %>%
      html_text()
    
    teams <- wins_data[c(TRUE, FALSE)]
    wins <- substr(wins_data[c(FALSE, TRUE)], 1, 2)
    
    wins_df <- data.frame(team = teams,
                          wins = wins)
    
    #Join wins data to larger dataset
    all_season_data <- left_join(all_data, wins_df, by = "team")
    
    #Get season champion
    champion <- url %>% 
      read_html() %>%
      html_nodes('.sub_index+ p a') %>% 
      html_text()
    
    #Create champion variable, specify variable order
    all_season_data %<>%
      mutate(champion = ifelse(team == champion, "Champion", "Non-Champion"))%>% 
      select(team, playoffs, champion, everything())
    
    #Convert factor classes to character to numeric
    all_season_data[,4:ncol(all_season_data)] <- 
      sapply(all_season_data[,4:ncol(all_season_data)], as.character)
    all_season_data[,4:ncol(all_season_data)] <- 
      sapply(all_season_data[,4:ncol(all_season_data)], as.numeric)
    
    return(all_season_data)
}

Now that we have our function, we can scrape the season data between 1980 and 2015:

years <- 1980:2015
nba_data <- data.frame() #Initialize empty data frame
for (year in years) {
  year_data <- get_nba_season_data(year)
  nba_data <- rbind(nba_data, year_data) #Bind rows from each function call
}

III. Feature Selection and Checking Assumptions

Our nba_data dataset has 984 observations and 71 features. The features we want to include in the model should exhibit a moderately strong linear relationship with our predictor variable, pts, and here I’ll set the minimum threshold to 0.4:

nba_correlations <- apply(nba_data[sapply(nba_data, is.numeric)], 
                      2, 
                      function(x) with(nba_data, cor(x, pts, use = "complete.obs")))

mod_names <- nba_correlations[abs(nba_correlations) > 0.4]

nba_data_for_model <- nba_data %>% 
  select_(.dots = names(mod_names))

The remaining features to be used in the model:

Feature	Description
orb	Offensive Rebounds
stl	Steals
ast	Assists
tov	Turnovers
pace	Estimate of the number of possessions per 48 minutes
opp_fga	Opponent Field Goal Attempts
opp_orb	Opponent Offensive Rebounds
opp_drb	Opponent Defensive Rebounds
opp_ast	Opponent Assists
opp_stl	Opponent Steals
opp_tov	Opponent Turnovers

Each of these variables makes intuitive sense. With the exception of pace, they all indicate some measure of a team’s ability to increase their number of possessions, and thus their opportunities to score, or approximate the quality of shot attempts.

Behind the scenes, I removed some additional variables that are also highly correlated to pts such as pts_per_game, ortg (Offensive Rating), pw (Pythagorean Wins), fg_pct (Field Goal Percentage), fga (Field Goal Attempts), and fg (Field Goals). Including these variables in any model is senseless because many of these variables are calculated from or caused by pts.

IV. Optimum Model by Adjusted R-Squared

We can now use the package leaps to evaluate all the best-subset models. We’ll first attempt to build a model using adjusted R-squared as our preferred measure.

library(leaps)
regfit.full <- regsubsets(pts ~ ., data = nba_data_for_model, nvmax = 12)
reg.summary <- summary(regfit.full)

plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R-Squared")
optimal_variables <- which.max(reg.summary$adjr2) 
points(optimal_variables, 
       reg.summary$adjr2[optimal_variables], 
       pch=20, col="red")
plot(regfit.full, scale = "adjr2", ylab = "Adjusted R Squared")

The model with the highest adjusted R-squared has nine features and includes ast, stl, tov, opp_fga, opp_orb, opp_drb, opp_ast, opp_stl, and opp_tov. There is, however, no significant change after six variables, so we can safely choose the corresponding model with ast, tov, opp_fga, opp_drb, opp_orb, and opp_tov.

Now we can build the model with those features included:

set.seed(50)
sample_index <- sample(1:2, nrow(nba_data_for_model), replace = TRUE, prob = c(0.80, 0.20))
nba_training <- nba_data_for_model[sample_index == 1,]
nba_testing <- nba_data_for_model[sample_index == 2,]

nba_mod_r2 <- lm(pts ~ ast + tov + opp_fga + 
                   opp_drb + opp_orb + opp_tov, data = nba_training)
summary(nba_mod_r2)

## 
## Call:
## lm(formula = pts ~ ast + tov + opp_fga + opp_drb + opp_orb + 
##     opp_tov, data = nba_training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -594.52 -148.02   -9.68  137.17  617.07 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 96.84688   84.31650   1.149    0.251    
## ast          0.29286    0.05537   5.289 1.59e-07 ***
## tov         -1.41628    0.08099 -17.487  < 2e-16 ***
## opp_fga      1.68213    0.03942  42.668  < 2e-16 ***
## opp_drb     -1.07222    0.06029 -17.785  < 2e-16 ***
## opp_orb     -1.79749    0.10231 -17.569  < 2e-16 ***
## opp_tov      1.91615    0.07750  24.726  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207 on 783 degrees of freedom
## Multiple R-squared:  0.9468, Adjusted R-squared:  0.9464 
## F-statistic:  2321 on 6 and 783 DF,  p-value: < 2.2e-16

Diagnostic plots:

par(mfrow = c(1, 2))
plot(nba_mod_r2, which = c(1, 2))

Test performance:

predictions_adjr2 <- predict.lm(nba_mod_r2, newdata = nba_testing)
nba_df <- data.frame(predictions = predictions_adjr2,
                     values = nba_testing$pts)

ggplot(nba_df, aes(predictions, values)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  labs(x = "Predicted Values", y = "Actual Values") +
  theme_minimal()

nba_df %>%
  mutate(residual = predictions - values) %>% 
  summarize(Mean_Difference = mean(abs(residual)))

##   Mean_Difference
## 1        169.0154

The model’s point prediction was off by an average of 169 points.

V. Optimum Model by the Bayesian Information Criterion

Now let’s rebuild the model using the Bayesian Information Criterion (BIC):

optimal_variables <- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "Bayesian Information Criterion")
points(optimal_variables, 
       reg.summary$bic[optimal_variables], 
       pch=20, col="red")
plot(regfit.full, scale = "bic", ylab = "Bayesian Information Criterion")

The model with the lowest BIC has only six features and includes ast, tov, opp_fga, opp_orb, opp_drb, and opp_tov, the same variables we ascertained from the previous model.

VI. Conclusion: The Model Through Time

Earlier I mentioned how team statistics fluctuate over time. Our preferred model was trained and tested on data spanning a thirty-five year period, and it is worth questioning the model’s consistency. Does the model perform better in a particular “era”, or period of NBA history? Using the purrr package, we can split and model the our data by year while extracting the adjusted R-squared for each year:

library(purrr)
year_adjr2 <- nba_data %>% 
  split(.$year) %>% 
  map(~ lm(pts ~ ast + tov + opp_fga + opp_orb + opp_drb + opp_tov, data = .)) %>% 
  map(summary) %>% 
  map_dbl("adj.r.squared") 

df_year_adjr2 <- data.frame(year = as.numeric(names(year_adjr2)), 
                            rsq = unname(year_adjr2))

ggplot(df_year_adjr2, aes(year, rsq)) +
  geom_point() +
  geom_smooth(color = "red") +
  labs(x = "Year", y = "Adjusted R-Squared") +
  theme_minimal()

There is a slight trend downward in the adjusted R squared from 1980 to 2000, before the value plateaus, and perhaps even begins another decline. This phenomenon resists easy interpretation. The trend appears to loosely follow the general decline in pace, but another explanation is the stylistic changes from 2000-2015. The early 2000s were marked by heavy doses of isolation play, but towards the end of the decade and soon into the tens, NBA teams attempted and made three pointers more aggressively. Still, the model’s performance is remarkably consistent over the 35 year period.

Appendix A. Three Point Shooting

As mentioned above, teams are shooting more and more three pointers. In the eighties and early nineties, no three-point shooting metric was a strong indicator of offensive production. That changed in the 2000s.

This model takes three predictable variables: threes (Made 3-Point Field Goals), three_rate (3-Point Attempt Rate), and threep_pct (3-Point Percentage).

threes_nba <- nba_data %>%
  filter(year != 2012) %>% 
  split(.$year) %>% 
  map(~ lm(pts ~ threes + three_rate + threep_pct, data = .)) %>% 
  map(summary) %>% 
  map_dbl("adj.r.squared") 

df_threes <- data.frame(year = as.numeric(names(threes_nba)), 
                            rsq = unname(threes_nba))

ggplot(df_threes, aes(year, rsq, label = year)) +
  geom_point() +
  geom_smooth(color = "red") +
  labs(x = "Year", y = "Adjusted R-Squared") +
  theme_minimal()