Introduction

Sport gambling has become a lucrative industry in the United States since the Supreme Cuurt’s 2018 ruling legalizing it. While not a gambler myself, I am an avid sports fan. I was able to pull two datasets (one small and one large) rleated to both the National Football League Game results in the Super Bowl Era (since 1967) and NBA individual box scores since 1997.

ELO Ratings

I pulled an ELO dataset for NFL teams dating back to 1920 as well. ELO is a rating system, originally used to produce the relative ratings of chess players, which can be re-purposed for other competitions. You can read more about FiveThirtyEight’s ELO Rating system here. In our case, we’ll be looking to see whether ELO ratings are helpful predictors for NFL spread data.

There’s ~13k rows here in this dataset. In terms of building a predictive model, there’s multiple ways one could approach this dataset. For instance, one could try to build a classification model that selects which of the participant teams ois likely to win outright or cover the predicted point spread (i.e., the proposed difference in point totals which bookmakers collect wagers on). In this case, I’d like to build a model to help predict the spread of a game itself, which is more of a regression task. For simplicity, a simple linear regression model could be used.

elo <- read_csv("data/nfl_elo.csv")
head(elo)
## # A tibble: 6 × 33
##   date       season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1
##   <date>      <dbl>   <dbl> <chr>   <chr> <chr>    <dbl>    <dbl>     <dbl>
## 1 1920-09-26   1920       0 <NA>    RII   STP      1504.    1300      0.825
## 2 1920-10-03   1920       0 <NA>    CBD   PTQ      1505.    1300      0.825
## 3 1920-10-03   1920       0 <NA>    CHI   MUT      1368.    1300      0.683
## 4 1920-10-03   1920       0 <NA>    RII   MUN      1516.    1478.     0.644
## 5 1920-10-03   1920       0 <NA>    DAY   COL      1493.    1505.     0.576
## 6 1920-10-03   1920       0 <NA>    RCH   ABU      1503.    1300      0.824
## # ℹ 24 more variables: elo_prob2 <dbl>, elo1_post <dbl>, elo2_post <dbl>,
## #   qbelo1_pre <dbl>, qbelo2_pre <dbl>, qb1 <chr>, qb2 <chr>,
## #   qb1_value_pre <dbl>, qb2_value_pre <dbl>, qb1_adj <dbl>, qb2_adj <dbl>,
## #   qbelo_prob1 <dbl>, qbelo_prob2 <dbl>, qb1_game_value <dbl>,
## #   qb2_game_value <dbl>, qb1_value_post <dbl>, qb2_value_post <dbl>,
## #   qbelo1_post <dbl>, qbelo2_post <dbl>, score1 <dbl>, score2 <dbl>,
## #   quality <dbl>, importance <dbl>, total_rating <dbl>

We can filter out any matchups prior to the 1966 season, where our spread data begins

# Clean up some names/IDs
elo <- elo %>% filter(season > 1967) %>% 
  mutate(team1 = ifelse(team1=="WSH", "WAS", team1),
         team2 = ifelse(team2=="WSH", "WAS", team2))

# Calculate acutal point spread of game result
elo$actual_spread <- elo$score1 - elo$score2

head(elo)
## # A tibble: 6 × 34
##   date       season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1
##   <date>      <dbl>   <dbl> <chr>   <chr> <chr>    <dbl>    <dbl>     <dbl>
## 1 1968-09-06   1968       0 <NA>    LAC   CIN      1477.    1300      0.801
## 2 1968-09-08   1968       0 <NA>    BUF   NE       1452.    1407.     0.653
## 3 1968-09-09   1968       0 <NA>    TEN   KC       1508.    1585.     0.483
## 4 1968-09-14   1968       0 <NA>    MIN   ATL      1451.    1329.     0.745
## 5 1968-09-14   1968       0 <NA>    MIA   TEN      1390.    1491.     0.448
## 6 1968-09-15   1968       0 <NA>    CHI   WAS      1534.    1460.     0.690
## # ℹ 25 more variables: elo_prob2 <dbl>, elo1_post <dbl>, elo2_post <dbl>,
## #   qbelo1_pre <dbl>, qbelo2_pre <dbl>, qb1 <chr>, qb2 <chr>,
## #   qb1_value_pre <dbl>, qb2_value_pre <dbl>, qb1_adj <dbl>, qb2_adj <dbl>,
## #   qbelo_prob1 <dbl>, qbelo_prob2 <dbl>, qb1_game_value <dbl>,
## #   qb2_game_value <dbl>, qb1_value_post <dbl>, qb2_value_post <dbl>,
## #   qbelo1_post <dbl>, qbelo2_post <dbl>, score1 <dbl>, score2 <dbl>,
## #   quality <dbl>, importance <dbl>, total_rating <dbl>, actual_spread <dbl>

We can use the ggpairs function to plot the relationship between the variables we care about and our predictor

elo <- elo %>% mutate(log_elo1_pre = log(elo1_pre), 
                      elo_diff = elo1_pre - elo2_pre,
                      elo_ratio = elo1_pre / elo2_pre,
                      qb_diff = qbelo1_pre - qbelo2_pre,
                      qb_ratio = qbelo1_pre / qbelo2_pre)
elo_ratings <- elo %>% select(actual_spread, elo1_pre, elo2_pre, qbelo1_pre, qbelo2_pre, elo_diff, elo_ratio, qb_diff, qb_ratio)
ggpairs(elo_ratings, upper=NULL)

We see relatively normal distributions of our predictor variables and our outcome (actual_spread). However, this plot is a bit busy due to the density of Elo ratings. One easy way for us to visualize the relationships between features and outcome variables is a correlation plot. The corrplot package in R is handy for this.

# Make correlation plot
cor <- cor(elo_ratings)

corrplot(cor)

Modeling

We can build a simple linear regression model to predict the game’s spread based on the team and quarterback Elo ratings ratios prior to the game.

# Set up linear model
nfl_model <- lm(actual_spread ~ elo_diff + qb_diff, elo)

summary(nfl_model)
## 
## Call:
## lm(formula = actual_spread ~ elo_diff + qb_diff, data = elo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.158  -8.670  -0.160   8.431  52.346 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.624993   0.119949  21.884  < 2e-16 ***
## elo_diff    0.014226   0.005264   2.702  0.00689 ** 
## qb_diff     0.029111   0.005422   5.369 8.06e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.63 on 12920 degrees of freedom
## Multiple R-squared:  0.156,  Adjusted R-squared:  0.1559 
## F-statistic:  1194 on 2 and 12920 DF,  p-value: < 2.2e-16

We see an \(R^2_{adj}\) value of 0.1569, which isn’t great in terms of prediction. We only have ~13k instances of NFL contests, which wouldn’t qualify as big data, but could be neough to produce a regression. In this case, however, it looks liek Elo ratingsmay not be the strongest predictors of the outcome of an NFL game.

NBA Boxscores

I also found this other Kaggle dataset with NBA Boxscores which would be considered more “big data” (~700k rows). Each row in this dataset consists of an individual NBA player’s statistics (points, rebounds assists, and other counting stats) in a single game from the 1996 season through the 2023 season.

A typical NBA game will see anywhere from 15-20 unique players, and with a single team 82 games in a regular reasons between 30 teams in the NBA, there can quickly be many instances of individual player performance. While not traditionally “big data”, this dataset is considerably larger than our NFL dataset above.

In our case, we’ll be interested in predicting the total number of points scored by a given player in a single game. We’ll train a simple regression decision tree, as it’s a robust and explainable model. We’ll want to be sure to avoid overfitting, as decision trees can often be highly sensitive to changes in training data

nba <- read_csv("data/traditional.csv")
## Rows: 702387 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): type, player, team, home, away
## dbl  (24): gameid, playerid, MIN, PTS, FGM, FGA, FG%, 3PM, 3PA, 3P%, FTM, FT...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nba)
## # A tibble: 6 × 30
##     gameid date       type   playerid player team  home  away    MIN   PTS   FGM
##      <dbl> <date>     <chr>     <dbl> <chr>  <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 29600001 1996-11-01 regul…      893 Micha… CHI   BOS   CHI      43    30    10
## 2 29600001 1996-11-01 regul…      937 Scott… CHI   BOS   CHI      40    18     8
## 3 29600001 1996-11-01 regul…      677 Eric … BOS   BOS   CHI      25    14     6
## 4 29600001 1996-11-01 regul…      146 Jud B… CHI   BOS   CHI       1     0     0
## 5 29600001 1996-11-01 regul…      166 Ron H… CHI   BOS   CHI      25     7     3
## 6 29600001 1996-11-01 regul…      442 Pervi… BOS   BOS   CHI      31     7     2
## # ℹ 19 more variables: FGA <dbl>, `FG%` <dbl>, `3PM` <dbl>, `3PA` <dbl>,
## #   `3P%` <dbl>, FTM <dbl>, FTA <dbl>, `FT%` <dbl>, OREB <dbl>, DREB <dbl>,
## #   REB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>,
## #   `+/-` <dbl>, win <dbl>, season <dbl>
# Convert player and teams to a factor
nba$player <- as.factor(nba$player)
nba$team <- as.factor(nba$team)
nba$home <- as.factor(nba$home)
nba$away <- as.factor(nba$away)

In the case of this dataset, we’d be interested in predicting the points a player will score in a given game. Betting markets also exist that set lines on individual players’ rebounds, assists, or combinations of points, rebounds, and assists. For our purposes, we’ll stick to attempting to predict the total amount of points a player will score in a given game

Before we build any sort of predictvie model, however, an exploratory data analysis (EDA) will be a helpful exercise. Firstly, we can plot the distribution of points scored in games by individual NBA players

ggplot(nba, aes(x=PTS)) + 
  geom_histogram() + 
  labs(x="Points", title="NBA Player Point Totals: 1996 - 2023")

We see a right-skewed distribution, which makes sense as most players in the game will not be scoring abnormally high point totals. There are a few outlier points as well, such as Kobe Bryant’s 81-point game against the Toronto Raptors as well.

Computing Rolling Averages

Int his dataset, it’s not feasible for us to necessarily predict a player’s point total based on his rebounds and assists from the same game, since that wouldn’t necessarily help us in predicting before the game. However, from this dataset we can compute rolling averages per player in a given season of his average points, rebounds, and assists per game. This could give us some historical context on a player’s performance going into a given game.

# Calculate rolling averages of a player's points, rebounds, and assists per game
nba_per_game <- nba %>% 
  group_by(player, season) %>% 
  arrange(date, .by_group = TRUE) %>%
  mutate(
    game_number = row_number(),
    rolling_pt_total = cumsum(PTS),
    rolling_reb = cumsum(REB),
    rolling_ast = cumsum(AST)
    ) %>%
  mutate(PPG = rolling_pt_total / game_number,
         RPG = rolling_reb / game_number,
         APG = rolling_ast / game_number,
         )

As a check, let’s plot Michael Jordan’s points-per-game average for the 1997 season, in which he finished the season averaging 29.6 points-per-game (no small feat!). We see more variability at the start of the season, with the curve approching his season average towards the end, which makes sense as the rolling average will be subject to fewer larger fluctuations as the sample size grows.

jordan97 <- nba_per_game %>% filter(player=="Michael Jordan", season==1997)
ggplot(jordan97, aes(x=date, y=PPG)) + geom_line()

Three-point attempts

One point of interest will be how the game of basketball has been played differently over time. One phenomenon that has changed the in-game strategy has been the proliferation of the three-point shot, which has been recognized as a more efficient shot than a 2-pointers from a comparable distance (a.k.a. a Mid-range). The chart below depicts the total number of three-pointers attempted per game in the NBA since the 1996 season. It’s clear to see that the number of 3-pointers attempted has risen steadily since around 2013, when the phenomenon began gaining traction.

From a modeling perspective, this is a interesting phenomenon because it impacts the predictions of the kind of player. Those who are strong outside shooters would likely have more opportunities to shoot (and score) in later years, as their skill was recognized.

# Plot avg number of 3PA per game
threes_per_game <- nba %>% 
  group_by(season, gameid) %>%
  summarise(total_threes_attempted = sum(`3PA`))

avg_threes_per_game <- threes_per_game %>% 
  group_by(season) %>% 
  summarise(avg_threes_per_game = mean(total_threes_attempted))

ggplot(avg_threes_per_game, aes(x=season, y=avg_threes_per_game)) + 
    geom_line() + labs(x="NBA Season", y="Average 3PA per Game", title="Average Number of 3-pt Attempts Per Game: NBA 1996 - 2023")

Luckily, we have many more instances than dimensions in this dataset, so we will have to worry less about dimensionality. I’m looking to train a decision tree to produce a regression of points scored in a game by a player. First, we can set up a training and testing set for model evaluation.

#make this example reproducible
set.seed(23)

#create ID column
nba_per_game$id <- 1:nrow(nba_per_game)

#use 70% of dataset as training set and 30% as test set 
train <- nba_per_game %>% dplyr::sample_frac(0.80)
test  <- dplyr::anti_join(nba_per_game, train, by = 'id')

As a sanity check, let’s ensure the distribution of points in each of these datasets resemble each other

train$label <- "Train"
test$label <- "Test"
combo <- rbind(train, test)

# Plot distributions of PTS
ggplot(combo, aes(PTS, fill = label)) + geom_density(alpha = 0.2)

That looks like a pretty close match between these distributions. Let’s check some of our other variables that we’d use as features (PPG, RPG, APG)

# Points-per-game
ggplot(combo, aes(PPG, fill = label)) + geom_density(alpha = 0.2)

#Rebounds-per-game
ggplot(combo, aes(RPG, fill = label)) + geom_density(alpha = 0.2)

# Asssists-per-game
ggplot(combo, aes(APG, fill = label)) + geom_density(alpha = 0.2)

Modeling

We’re seeing pretty good alignment in the distributions of our features between our training and test data, which is a good sign that our decision tree will be both trained and evaluated on data reflecting the population. I found this datacamp

tree_spec <- decision_tree() %>%
 set_engine("rpart") %>%
 set_mode("regression")

# Fit the model to the data
tree_fit <- tree_spec %>%
 fit(PTS ~ PPG + RPG + APG + player + team + home + away, data = train)
# Make predictions on the testing data
# Make predictions on the testing data
predictions <- tree_fit %>%
 predict(test) %>%
 pull(.pred)

# Calculate RMSE and R-squared
results <- data.frame(truth=test$PTS, estimate=predictions)

rmse(test$PTS, predictions)
## [1] 5.597105
results %>% rsq(truth, estimate)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.533

We see a value of \(R^2 = 0.533\). Some tuning of our decision tree model (different leaf node sizes/engine) could see us improve the predictive capability of this model.