Betting on Basketball

Introduction

According to an article published by Front Office Sports, the market for sports betting doubled in 2021 and has surpassed $52 billion.

Sports betting is new and lucrative right now: many states are still in the process of legislating on it, and 11 states have legalized it in the last year.

So, what’s a good betting strategy?

Let’s look at historic college basketball data, specifically 2013-2021 Division I basketball teams.

High rank correlates with more wins, right?

BARTHAG: Power Rating (Chance of beating an average Division I team)

A risky but potentially profitable strategy would be to bet on a volatile team, one that wins unexpectedly.

The Data

TEAM CONF G W ADJOE ADJDE BARTHAG EFG_O EFG_D TOR TORD ORB DRB FTR FTRD X2P_O X2P_D X3P_O X3P_D ADJ_T WAB POSTSEASON SEED YEAR
North Carolina ACC 40 33 123.3 94.9 0.9531 52.6 48.1 15.4 18.2 40.7 30.0 32.3 30.4 53.9 44.6 32.7 36.2 71.7 8.6 2ND 1 2016
Wisconsin B10 40 36 129.1 93.6 0.9758 54.8 47.7 12.4 15.8 32.1 23.7 36.2 22.4 54.8 44.7 36.5 37.5 59.3 11.3 2ND 1 2015
Michigan B10 40 33 114.4 90.4 0.9375 53.9 47.7 14.0 19.5 25.5 24.9 30.7 30.0 54.7 46.8 35.2 33.2 65.9 6.9 2ND 3 2018
Texas Tech B12 38 31 115.2 85.2 0.9696 53.5 43.0 17.7 22.8 27.4 28.7 32.9 36.6 52.8 41.9 36.5 29.7 67.5 7.0 2ND 3 2019
Gonzaga WCC 39 37 117.8 86.3 0.9728 56.6 41.1 16.2 17.1 30.0 26.2 39.0 26.9 56.3 40.0 38.2 29.0 71.5 7.7 2ND 1 2017
Kentucky SEC 40 29 117.2 96.2 0.9062 49.9 46.0 18.1 16.1 42.0 29.7 51.8 36.8 50.0 44.9 33.2 32.2 65.9 3.9 2ND 8 2014
  • TEAM: The Division I college basketball school
  • CONF: The Athletic Conference in which the school participates
  • G: Number of games played
  • W: Number of games won
  • ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)
  • ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
  • BARTHAG: Power Rating (Chance of beating an average Division I team)
  • EFG_O: Effective Field Goal Percentage Shot
  • EFG_D: Effective Field Goal Percentage Allowed
  • TOR: Turnover Percentage Allowed (Turnover Rate)
  • TORD: Turnover Percentage Committed (Steal Rate)
  • ORB: Offensive Rebound Rate
  • DRB: Offensive Rebound Rate Allowed
  • FTR: Free Throw Rate (How often the given team shoots Free Throws)
  • FTRD: Free Throw Rate Allowed
  • X2P_O: Two-Point Shooting Percentage (2P_O in the source data; R prefixes column names that start with a digit with X)
  • X2P_D: Two-Point Shooting Percentage Allowed
  • X3P_O: Three-Point Shooting Percentage
  • X3P_D: Three-Point Shooting Percentage Allowed
  • ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)

Calculating Volatility

We can use the power rating from year to year to calculate the volatility of each team. Does a team’s power rating, i.e. its performance, vary from year to year? If so, that is a team we want to bet on. These teams are the underdogs whose unexpected wins could earn you the $!

  • Calculate the volatility by team: sqrt(variance) × sqrt(years), where the variance is taken over the team’s yearly power ratings

  • Classify each team as High or Low volatility, with High meaning a volatility score in the top quartile (above the third quartile)
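The two steps above can be sketched in base R. This is a minimal sketch, not the original analysis code: the data frame `cbb` is a made-up stand-in for the real data, "years" is assumed to mean the number of seasons observed for a team, and the third quartile is assumed to be taken over one volatility score per team.

```r
# Toy stand-in for the real data: three hypothetical teams, four seasons each.
cbb <- data.frame(
  TEAM    = rep(c("A", "B", "C"), each = 4),
  YEAR    = rep(2013:2016, times = 3),
  BARTHAG = c(0.90, 0.91, 0.92, 0.90,   # steady team
              0.30, 0.80, 0.40, 0.90,   # volatile team
              0.50, 0.55, 0.45, 0.50),  # fairly steady team
  stringsAsFactors = FALSE
)

# Volatility per team: sqrt(variance of BARTHAG) * sqrt(number of seasons).
# Assumption: "years" is the number of seasons observed for that team.
vol_by_team <- sapply(split(cbb$BARTHAG, cbb$TEAM),
                      function(x) sqrt(var(x)) * sqrt(length(x)))

# Attach each team's volatility score to every one of its rows.
cbb$VOL <- unname(vol_by_team[as.character(cbb$TEAM)])

# Assumption: the third quartile is computed over one score per team.
q3 <- quantile(vol_by_team, 0.75)
cbb$volatility_index <- ifelse(cbb$VOL > q3, "High", "Low")
```

In this toy data only team B, whose rating swings between 0.30 and 0.90, ends up in the High class.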

TEAM YEAR BARTHAG VOL volatility_index
North Carolina 2016 0.9531 0.0841434 Low
Wisconsin 2015 0.9758 0.1919160 Low
Michigan 2018 0.9375 0.1622063 Low
Texas Tech 2019 0.9696 0.5517687 High
Gonzaga 2017 0.9728 0.0722673 Low
Kentucky 2014 0.9062 0.1531929 Low

Can we predict high-volatility teams well with a random forest?

Exploratory Data Analysis

Class Imbalance

Barthag Distribution

Barthag by Conference

Volatility Distribution

Summary Statistics of Volatility

Statistic Value
Min. 0.0360298
1st Qu. 0.2165824
Median 0.3010490
Mean 0.3153891
3rd Qu. 0.4039958
Max. 0.7807641

Prevalence of Target Variable: Volatility (High or Low)

Recall that High volatility is the top quartile of volatility scores.

Var1 Freq
High 620
Low 1833

Prevalence = 25.27%
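As a quick check, the prevalence can be recomputed from the counts in the table. The sketch below also derives the accuracy of a classifier that always predicts Low, which is the baseline any model has to beat; the variable names are illustrative only.

```r
# Class counts as reported in the table above.
counts <- c(High = 620, Low = 1833)

# Prevalence of the positive class (High): roughly a quarter of all rows.
prevalence <- counts[["High"]] / sum(counts)

# Accuracy of a classifier that always predicts the majority class (Low).
baseline_accuracy <- counts[["Low"]] / sum(counts)
```

With about a quarter of the rows in the High class, a model can reach roughly 75% accuracy without ever identifying a volatile team.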

Random Forest Model

We want to build a random forest that is really good at predicting highly volatile teams.

Random forest classifier with volatility index (High or Low) as the response.

Use this model to predict and pinpoint highly volatile teams, which would make for risky but potentially successful bets.

Our metric here is true positive rate: we want to be really good at predicting highly volatile teams (low error rate in the positive class).

Metrics of Interest:

  • True Positive Rate
  • F1 score (to account for the class imbalance)

Initial Model Outputs

High Low class.error
High 0 434 1
Low 0 1284 0

74.6% accuracy is misleading: the model predicts every team as Low, so it never identifies a high-volatility team.

OOB error: 25.26%

We want to adjust the threshold.

Final Model: Adjusting the Threshold to 0.3

Random Forest Hyperparameters:

  • Start from the default threshold of 0.5 and tune the hyperparameters, then lower the cutoff for the High class to 0.3
High Low class.error
High 157 277 0.6382488
Low 172 1112 0.1339564

randomForest(formula = as.factor(volatility_index) ~ ., data = train,
             ntree = 1000, mtry = 4, replace = TRUE, sampsize = 100,
             nodesize = 5, importance = TRUE, cutoff = c(0.3, 1 - 0.3))
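How the `cutoff` argument changes predictions can be sketched in base R. As documented for `randomForest`, the winning class is the one with the largest ratio of vote share to cutoff. The vote shares below are made up for illustration, and `predict_with_cutoff` is a hypothetical helper, not part of the package.

```r
# Made-up vote shares for four teams (each row sums to 1).
votes <- matrix(c(0.45, 0.55,
                  0.35, 0.65,
                  0.25, 0.75,
                  0.60, 0.40),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("High", "Low")))

# Hypothetical helper: pick the class with the largest votes / cutoff ratio.
predict_with_cutoff <- function(votes, cutoff) {
  ratios <- sweep(votes, 2, cutoff, "/")   # divide each column by its cutoff
  colnames(votes)[max.col(ratios, ties.method = "first")]
}

default_pred <- predict_with_cutoff(votes, c(0.5, 0.5))  # plain majority vote
lowered_pred <- predict_with_cutoff(votes, c(0.3, 0.7))  # High needs > 30% of votes
```

With the default cutoff only the last team is labelled High. With the cutoff at 0.3, every team whose High vote share exceeds 30% is labelled High, which trades more false positives for a better true positive rate.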

Adjusting the Threshold to 0.1

Problematic because we predict nearly every team as high volatility

Positive class error rate minimized, but not useful for betting

High Low class.error
High 430 4 0.0092166
Low 1236 48 0.9626168
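The trade-off between the three thresholds can be read off the out-of-bag confusion matrices reported above. The following is a minimal sketch that re-enters those counts by hand; it assumes the initial model used the default 0.5 cutoff, and `summarise_cm` is a hypothetical helper.

```r
# Out-of-bag confusion matrices from above (rows = actual, columns = predicted,
# both in the order High, Low). Assumption: the initial model used cutoff 0.5.
conf <- list(
  "0.5" = matrix(c(0, 434, 0, 1284),     nrow = 2, byrow = TRUE),
  "0.3" = matrix(c(157, 277, 172, 1112), nrow = 2, byrow = TRUE),
  "0.1" = matrix(c(430, 4, 1236, 48),    nrow = 2, byrow = TRUE)
)

# Hypothetical helper: true positive rate, precision on High, overall accuracy.
summarise_cm <- function(m) {
  c(tpr       = m[1, 1] / sum(m[1, ]),
    precision = if (sum(m[, 1]) > 0) m[1, 1] / sum(m[, 1]) else NA,
    accuracy  = sum(diag(m)) / sum(m))
}

# One row per threshold, one column per metric.
res <- t(sapply(conf, summarise_cm))
round(res, 3)
```

Lowering the threshold raises the true positive rate, but at 0.1 precision and accuracy collapse because almost every team is flagged as High.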

Predicting with the Test Set: Evaluation Metrics

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction High Low
##       High   30  46
##       Low    63 228
##                                           
##                Accuracy : 0.703           
##                  95% CI : (0.6534, 0.7493)
##     No Information Rate : 0.7466          
##     P-Value [Acc > NIR] : 0.9747          
##                                           
##                   Kappa : 0.1646          
##                                           
##  Mcnemar's Test P-Value : 0.1254          
##                                           
##             Sensitivity : 0.32258         
##             Specificity : 0.83212         
##          Pos Pred Value : 0.39474         
##          Neg Pred Value : 0.78351         
##               Precision : 0.39474         
##                  Recall : 0.32258         
##                      F1 : 0.35503         
##              Prevalence : 0.25341         
##          Detection Rate : 0.08174         
##    Detection Prevalence : 0.20708         
##       Balanced Accuracy : 0.57735         
##                                           
##        'Positive' Class : High            
## 

Our accuracy (70.3%) is slightly below the no-information rate (74.7%), which is the accuracy we would get by always predicting Low.
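The headline metrics can be recomputed by hand from the test-set confusion matrix, which makes the comparison with the no-information rate explicit. This is a sketch in base R that re-enters the counts from the output above; note that here the rows are predictions and the columns are actual classes.

```r
# Test-set confusion matrix from the output above (rows = predicted, columns = actual).
cm <- matrix(c(30, 46,
               63, 228),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("High", "Low"),
                             Actual     = c("High", "Low")))

tp <- cm["High", "High"]; fp <- cm["High", "Low"]
fn <- cm["Low", "High"];  tn <- cm["Low", "Low"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # true positive rate, also called recall
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

# No-information rate: accuracy from always predicting the majority class (Low).
nir <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        precision = precision, f1 = f1, nir = nir), 4)
```

The accuracy falls short of the no-information rate, and the model catches only about a third of the truly volatile teams.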

Variable Importances

ROC Curve

Conclusions

  • Our model isn’t much better than random guessing
  • Class imbalance
  • Forecasting and time series
  • Lots of relevant variables are hard to measure: team chemistry, home court advantage, injuries, etc.
  • Teams change so much within a year
  • How educated of a guess can you make when gambling?

Fairness

  • Not all schools and not all teams are created equal
  • Keep biases in mind.

Limitations and Future Work

  • Incorporating other variables related to schools
  • 2020 data are unavailable because of COVID-19
  • Trying unsupervised approaches or neural networks
  • Restructuring the problem to use time series data: change the train and test sets
  • Rethinking the metrics for what makes a good bet
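One way to restructure the train and test sets for a time series framing is to split by season, so the model is always evaluated on years it has not seen. This is a minimal sketch under stated assumptions: `cbb` is a made-up stand-in for the real data and `split_year` is a hypothetical cut-off, neither taken from the original analysis.

```r
# Toy stand-in for the real data: one row per hypothetical team and season
# (2020 is left out, matching the missing season noted above).
cbb <- expand.grid(TEAM = c("A", "B", "C"),
                   YEAR = c(2013:2019, 2021),
                   stringsAsFactors = FALSE)

# Hypothetical cut-off: train on seasons up to 2018, test on later seasons.
split_year <- 2018
train <- cbb[cbb$YEAR <= split_year, ]
test  <- cbb[cbb$YEAR >  split_year, ]
```

Unlike a random split, this keeps every test season strictly after every training season, which is closer to how a bettor would actually use the model.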

Betting and team success aren’t predictable, and that’s why it’s fun!
