Sport gambling has become a lucrative industry in the United States since the Supreme Cuurt’s 2018 ruling legalizing it. While not a gambler myself, I am an avid sports fan. I was able to pull two datasets (one small and one large) rleated to both the National Football League Game results in the Super Bowl Era (since 1967) and NBA individual box scores since 1997.
I pulled an ELO dataset for NFL teams dating back to 1920 as well. ELO is a rating system, originally used to produce the relative ratings of chess players, which can be re-purposed for other competitions. You can read more about FiveThirtyEight’s ELO Rating system here. In our case, we’ll be looking to see whether ELO ratings are helpful predictors for NFL spread data.
There’s ~13k rows here in this dataset. In terms of building a predictive model, there’s multiple ways one could approach this dataset. For instance, one could try to build a classification model that selects which of the participant teams ois likely to win outright or cover the predicted point spread (i.e., the proposed difference in point totals which bookmakers collect wagers on). In this case, I’d like to build a model to help predict the spread of a game itself, which is more of a regression task. For simplicity, a simple linear regression model could be used.
elo <- read_csv("data/nfl_elo.csv")
head(elo)
## # A tibble: 6 × 33
## date season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1
## <date> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1920-09-26 1920 0 <NA> RII STP 1504. 1300 0.825
## 2 1920-10-03 1920 0 <NA> CBD PTQ 1505. 1300 0.825
## 3 1920-10-03 1920 0 <NA> CHI MUT 1368. 1300 0.683
## 4 1920-10-03 1920 0 <NA> RII MUN 1516. 1478. 0.644
## 5 1920-10-03 1920 0 <NA> DAY COL 1493. 1505. 0.576
## 6 1920-10-03 1920 0 <NA> RCH ABU 1503. 1300 0.824
## # ℹ 24 more variables: elo_prob2 <dbl>, elo1_post <dbl>, elo2_post <dbl>,
## # qbelo1_pre <dbl>, qbelo2_pre <dbl>, qb1 <chr>, qb2 <chr>,
## # qb1_value_pre <dbl>, qb2_value_pre <dbl>, qb1_adj <dbl>, qb2_adj <dbl>,
## # qbelo_prob1 <dbl>, qbelo_prob2 <dbl>, qb1_game_value <dbl>,
## # qb2_game_value <dbl>, qb1_value_post <dbl>, qb2_value_post <dbl>,
## # qbelo1_post <dbl>, qbelo2_post <dbl>, score1 <dbl>, score2 <dbl>,
## # quality <dbl>, importance <dbl>, total_rating <dbl>
We can filter out any matchups prior to the 1966 season, where our spread data begins
# Clean up some names/IDs
elo <- elo %>% filter(season > 1967) %>%
mutate(team1 = ifelse(team1=="WSH", "WAS", team1),
team2 = ifelse(team2=="WSH", "WAS", team2))
# Calculate acutal point spread of game result
elo$actual_spread <- elo$score1 - elo$score2
head(elo)
## # A tibble: 6 × 34
## date season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1
## <date> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1968-09-06 1968 0 <NA> LAC CIN 1477. 1300 0.801
## 2 1968-09-08 1968 0 <NA> BUF NE 1452. 1407. 0.653
## 3 1968-09-09 1968 0 <NA> TEN KC 1508. 1585. 0.483
## 4 1968-09-14 1968 0 <NA> MIN ATL 1451. 1329. 0.745
## 5 1968-09-14 1968 0 <NA> MIA TEN 1390. 1491. 0.448
## 6 1968-09-15 1968 0 <NA> CHI WAS 1534. 1460. 0.690
## # ℹ 25 more variables: elo_prob2 <dbl>, elo1_post <dbl>, elo2_post <dbl>,
## # qbelo1_pre <dbl>, qbelo2_pre <dbl>, qb1 <chr>, qb2 <chr>,
## # qb1_value_pre <dbl>, qb2_value_pre <dbl>, qb1_adj <dbl>, qb2_adj <dbl>,
## # qbelo_prob1 <dbl>, qbelo_prob2 <dbl>, qb1_game_value <dbl>,
## # qb2_game_value <dbl>, qb1_value_post <dbl>, qb2_value_post <dbl>,
## # qbelo1_post <dbl>, qbelo2_post <dbl>, score1 <dbl>, score2 <dbl>,
## # quality <dbl>, importance <dbl>, total_rating <dbl>, actual_spread <dbl>
We can use the ggpairs function to plot the relationship
between the variables we care about and our predictor
elo <- elo %>% mutate(log_elo1_pre = log(elo1_pre),
elo_diff = elo1_pre - elo2_pre,
elo_ratio = elo1_pre / elo2_pre,
qb_diff = qbelo1_pre - qbelo2_pre,
qb_ratio = qbelo1_pre / qbelo2_pre)
elo_ratings <- elo %>% select(actual_spread, elo1_pre, elo2_pre, qbelo1_pre, qbelo2_pre, elo_diff, elo_ratio, qb_diff, qb_ratio)
ggpairs(elo_ratings, upper=NULL)
We see relatively normal distributions of our predictor variables and
our outcome (
actual_spread). However, this plot is a bit
busy due to the density of Elo ratings. One easy way for us to visualize
the relationships between features and outcome variables is a
correlation plot. The corrplot package in R is handy for
this.
# Make correlation plot
cor <- cor(elo_ratings)
corrplot(cor)
We can build a simple linear regression model to predict the game’s spread based on the team and quarterback Elo ratings ratios prior to the game.
# Set up linear model
nfl_model <- lm(actual_spread ~ elo_diff + qb_diff, elo)
summary(nfl_model)
##
## Call:
## lm(formula = actual_spread ~ elo_diff + qb_diff, data = elo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.158 -8.670 -0.160 8.431 52.346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.624993 0.119949 21.884 < 2e-16 ***
## elo_diff 0.014226 0.005264 2.702 0.00689 **
## qb_diff 0.029111 0.005422 5.369 8.06e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.63 on 12920 degrees of freedom
## Multiple R-squared: 0.156, Adjusted R-squared: 0.1559
## F-statistic: 1194 on 2 and 12920 DF, p-value: < 2.2e-16
We see an \(R^2_{adj}\) value of 0.1569, which isn’t great in terms of prediction. We only have ~13k instances of NFL contests, which wouldn’t qualify as big data, but could be neough to produce a regression. In this case, however, it looks liek Elo ratingsmay not be the strongest predictors of the outcome of an NFL game.
I also found this other Kaggle dataset with NBA Boxscores which would be considered more “big data” (~700k rows). Each row in this dataset consists of an individual NBA player’s statistics (points, rebounds assists, and other counting stats) in a single game from the 1996 season through the 2023 season.
A typical NBA game will see anywhere from 15-20 unique players, and with a single team 82 games in a regular reasons between 30 teams in the NBA, there can quickly be many instances of individual player performance. While not traditionally “big data”, this dataset is considerably larger than our NFL dataset above.
In our case, we’ll be interested in predicting the total number of points scored by a given player in a single game. We’ll train a simple regression decision tree, as it’s a robust and explainable model. We’ll want to be sure to avoid overfitting, as decision trees can often be highly sensitive to changes in training data
nba <- read_csv("data/traditional.csv")
## Rows: 702387 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): type, player, team, home, away
## dbl (24): gameid, playerid, MIN, PTS, FGM, FGA, FG%, 3PM, 3PA, 3P%, FTM, FT...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nba)
## # A tibble: 6 × 30
## gameid date type playerid player team home away MIN PTS FGM
## <dbl> <date> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 29600001 1996-11-01 regul… 893 Micha… CHI BOS CHI 43 30 10
## 2 29600001 1996-11-01 regul… 937 Scott… CHI BOS CHI 40 18 8
## 3 29600001 1996-11-01 regul… 677 Eric … BOS BOS CHI 25 14 6
## 4 29600001 1996-11-01 regul… 146 Jud B… CHI BOS CHI 1 0 0
## 5 29600001 1996-11-01 regul… 166 Ron H… CHI BOS CHI 25 7 3
## 6 29600001 1996-11-01 regul… 442 Pervi… BOS BOS CHI 31 7 2
## # ℹ 19 more variables: FGA <dbl>, `FG%` <dbl>, `3PM` <dbl>, `3PA` <dbl>,
## # `3P%` <dbl>, FTM <dbl>, FTA <dbl>, `FT%` <dbl>, OREB <dbl>, DREB <dbl>,
## # REB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>,
## # `+/-` <dbl>, win <dbl>, season <dbl>
# Convert player and teams to a factor
nba$player <- as.factor(nba$player)
nba$team <- as.factor(nba$team)
nba$home <- as.factor(nba$home)
nba$away <- as.factor(nba$away)
In the case of this dataset, we’d be interested in predicting the
points a player will score in a given game. Betting markets
also exist that set lines on individual players’ rebounds, assists, or
combinations of points, rebounds, and assists. For our purposes, we’ll
stick to attempting to predict the total amount of points a player will
score in a given game
Before we build any sort of predictvie model, however, an exploratory data analysis (EDA) will be a helpful exercise. Firstly, we can plot the distribution of points scored in games by individual NBA players
ggplot(nba, aes(x=PTS)) +
geom_histogram() +
labs(x="Points", title="NBA Player Point Totals: 1996 - 2023")
We see a right-skewed distribution, which makes sense as most players in
the game will not be scoring abnormally high point totals. There are a
few outlier points as well, such as Kobe
Bryant’s 81-point game against the Toronto Raptors as well.
Int his dataset, it’s not feasible for us to necessarily predict a player’s point total based on his rebounds and assists from the same game, since that wouldn’t necessarily help us in predicting before the game. However, from this dataset we can compute rolling averages per player in a given season of his average points, rebounds, and assists per game. This could give us some historical context on a player’s performance going into a given game.
# Calculate rolling averages of a player's points, rebounds, and assists per game
nba_per_game <- nba %>%
group_by(player, season) %>%
arrange(date, .by_group = TRUE) %>%
mutate(
game_number = row_number(),
rolling_pt_total = cumsum(PTS),
rolling_reb = cumsum(REB),
rolling_ast = cumsum(AST)
) %>%
mutate(PPG = rolling_pt_total / game_number,
RPG = rolling_reb / game_number,
APG = rolling_ast / game_number,
)
As a check, let’s plot Michael Jordan’s points-per-game average for the 1997 season, in which he finished the season averaging 29.6 points-per-game (no small feat!). We see more variability at the start of the season, with the curve approching his season average towards the end, which makes sense as the rolling average will be subject to fewer larger fluctuations as the sample size grows.
jordan97 <- nba_per_game %>% filter(player=="Michael Jordan", season==1997)
ggplot(jordan97, aes(x=date, y=PPG)) + geom_line()
One point of interest will be how the game of basketball has been played differently over time. One phenomenon that has changed the in-game strategy has been the proliferation of the three-point shot, which has been recognized as a more efficient shot than a 2-pointers from a comparable distance (a.k.a. a Mid-range). The chart below depicts the total number of three-pointers attempted per game in the NBA since the 1996 season. It’s clear to see that the number of 3-pointers attempted has risen steadily since around 2013, when the phenomenon began gaining traction.
From a modeling perspective, this is a interesting phenomenon because it impacts the predictions of the kind of player. Those who are strong outside shooters would likely have more opportunities to shoot (and score) in later years, as their skill was recognized.
# Plot avg number of 3PA per game
threes_per_game <- nba %>%
group_by(season, gameid) %>%
summarise(total_threes_attempted = sum(`3PA`))
avg_threes_per_game <- threes_per_game %>%
group_by(season) %>%
summarise(avg_threes_per_game = mean(total_threes_attempted))
ggplot(avg_threes_per_game, aes(x=season, y=avg_threes_per_game)) +
geom_line() + labs(x="NBA Season", y="Average 3PA per Game", title="Average Number of 3-pt Attempts Per Game: NBA 1996 - 2023")
Luckily, we have many more instances than dimensions in this dataset,
so we will have to worry less about dimensionality. I’m looking to train
a decision tree to produce a regression of points scored in
a game by a player. First, we can set up a training and
testing set for model evaluation.
#make this example reproducible
set.seed(23)
#create ID column
nba_per_game$id <- 1:nrow(nba_per_game)
#use 70% of dataset as training set and 30% as test set
train <- nba_per_game %>% dplyr::sample_frac(0.80)
test <- dplyr::anti_join(nba_per_game, train, by = 'id')
As a sanity check, let’s ensure the distribution of points in each of these datasets resemble each other
train$label <- "Train"
test$label <- "Test"
combo <- rbind(train, test)
# Plot distributions of PTS
ggplot(combo, aes(PTS, fill = label)) + geom_density(alpha = 0.2)
That looks like a pretty close match between these distributions.
Let’s check some of our other variables that we’d use as features
(PPG, RPG, APG)
# Points-per-game
ggplot(combo, aes(PPG, fill = label)) + geom_density(alpha = 0.2)
#Rebounds-per-game
ggplot(combo, aes(RPG, fill = label)) + geom_density(alpha = 0.2)
# Asssists-per-game
ggplot(combo, aes(APG, fill = label)) + geom_density(alpha = 0.2)
We’re seeing pretty good alignment in the distributions of our features between our training and test data, which is a good sign that our decision tree will be both trained and evaluated on data reflecting the population. I found this datacamp
tree_spec <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("regression")
# Fit the model to the data
tree_fit <- tree_spec %>%
fit(PTS ~ PPG + RPG + APG + player + team + home + away, data = train)
# Make predictions on the testing data
# Make predictions on the testing data
predictions <- tree_fit %>%
predict(test) %>%
pull(.pred)
# Calculate RMSE and R-squared
results <- data.frame(truth=test$PTS, estimate=predictions)
rmse(test$PTS, predictions)
## [1] 5.597105
results %>% rsq(truth, estimate)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.533
We see a value of \(R^2 = 0.533\). Some tuning of our decision tree model (different leaf node sizes/engine) could see us improve the predictive capability of this model.