Want to get rich? Bet on Sports.

Betting on Basketball Image

Introduction

According to an article published by Front Office Sports, the market for sports betting doubled in 2021 and has surpassed $52 billion.

Sports betting is new and lucrative right now: many states are still in the process of legislating on it, with 11 states legalizing sports betting in the last year.

So, what’s a good betting strategy?

Let’s look at historic college basketball data, specifically 2013-2021 Division I basketball teams.

High rank correlates with more wins, right?

BARTHAG: Power Rating (Chance of beating an average Division I team)

A risky but potentially profitable strategy would be to bet on a volatile team, one that wins unexpectedly.

The Data

TEAM	CONF	G	W	ADJOE	ADJDE	BARTHAG	EFG_O	EFG_D	TOR	TORD	ORB	DRB	FTR	FTRD	X2P_O	X2P_D	X3P_O	X3P_D	ADJ_T	WAB	POSTSEASON	SEED	YEAR
North Carolina	ACC	40	33	123.3	94.9	0.9531	52.6	48.1	15.4	18.2	40.7	30.0	32.3	30.4	53.9	44.6	32.7	36.2	71.7	8.6	2ND	1	2016
Wisconsin	B10	40	36	129.1	93.6	0.9758	54.8	47.7	12.4	15.8	32.1	23.7	36.2	22.4	54.8	44.7	36.5	37.5	59.3	11.3	2ND	1	2015
Michigan	B10	40	33	114.4	90.4	0.9375	53.9	47.7	14.0	19.5	25.5	24.9	30.7	30.0	54.7	46.8	35.2	33.2	65.9	6.9	2ND	3	2018
Texas Tech	B12	38	31	115.2	85.2	0.9696	53.5	43.0	17.7	22.8	27.4	28.7	32.9	36.6	52.8	41.9	36.5	29.7	67.5	7.0	2ND	3	2019
Gonzaga	WCC	39	37	117.8	86.3	0.9728	56.6	41.1	16.2	17.1	30.0	26.2	39.0	26.9	56.3	40.0	38.2	29.0	71.5	7.7	2ND	1	2017
Kentucky	SEC	40	29	117.2	96.2	0.9062	49.9	46.0	18.1	16.1	42.0	29.7	51.8	36.8	50.0	44.9	33.2	32.2	65.9	3.9	2ND	8	2014

TEAM: The Division I college basketball school
CONF: The Athletic Conference in which the school participates
G: Number of games played
W: Number of games won
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)
ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
BARTHAG: Power Rating (Chance of beating an average Division I team)
EFG_O: Effective Field Goal Percentage Shot
EFG_D: Effective Field Goal Percentage Allowed
TOR: Turnover Percentage Allowed (Turnover Rate)
TORD: Turnover Percentage Committed (Steal Rate)
ORB: Offensive Rebound Rate
DRB: Offensive Rebound Rate Allowed
FTR: Free Throw Rate (How often the given team shoots Free Throws)
FTRD: Free Throw Rate Allowed
X2P_O: Two-Point Shooting Percentage
X2P_D: Two-Point Shooting Percentage Allowed
X3P_O: Three-Point Shooting Percentage
X3P_D: Three-Point Shooting Percentage Allowed
ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)
WAB: Wins Above Bubble (The bubble is the cutoff between making the NCAA March Madness Tournament and not making it)
POSTSEASON: Round in which the team was eliminated or its season ended
SEED: Seed in the NCAA March Madness Tournament
YEAR: Season

Calculating Volatility

We can use each team's power rating from year to year to calculate its volatility. Does a team's power rating, i.e., its performance, vary from year to year? If so, that is a team we want to bet on. These teams are the underdogs, and they are the ones that will earn you the $!

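Calculate the volatility by team: $\sqrt{\text{variance}} \times \sqrt{\text{years}}$, i.e., the standard deviation of the team's power rating scaled by the square root of the number of seasons
Classify each team as high or low volatility, with High meaning a volatility above the third quartile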

TEAM	YEAR	BARTHAG	VOL	volatility_index
North Carolina	2016	0.9531	0.0841434	Low
Wisconsin	2015	0.9758	0.1919160	Low
Michigan	2018	0.9375	0.1622063	Low
Texas Tech	2019	0.9696	0.5517687	High
Gonzaga	2017	0.9728	0.0722673	Low
Kentucky	2014	0.9062	0.1531929	Low
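The two steps above can be sketched in a few lines. The analysis itself was done in R; this is only an illustrative Python sketch. It assumes that the two square roots are multiplied, that the sample variance is used, and that "third quartile" follows R's default quantile definition. The `ratings` dictionary is made-up data, not taken from the dataset, and `volatility` is a hypothetical helper.

```python
import math
import statistics

# Hypothetical BARTHAG ratings per team, one value per season (not real data).
ratings = {
    "Team A": [0.95, 0.94, 0.96, 0.95],
    "Team B": [0.30, 0.75, 0.20, 0.85],
    "Team C": [0.60, 0.55, 0.65, 0.58],
    "Team D": [0.40, 0.45, 0.38, 0.50],
    "Team E": [0.80, 0.70, 0.85, 0.78],
}

def volatility(values):
    """Standard deviation of the ratings scaled by sqrt(number of seasons)."""
    # Assumption: sample variance (n - 1 denominator), as R's var() would give.
    return math.sqrt(statistics.variance(values)) * math.sqrt(len(values))

vol = {team: volatility(values) for team, values in ratings.items()}

# Third quartile of the volatility scores; method="inclusive" is intended to
# mirror R's default quantile() definition.
q3 = statistics.quantiles(list(vol.values()), n=4, method="inclusive")[2]

# Teams strictly above the third quartile are labelled High, the rest Low.
volatility_index = {team: "High" if v > q3 else "Low" for team, v in vol.items()}

for team in ratings:
    print(f"{team}: VOL={vol[team]:.4f} -> {volatility_index[team]}")
```

In this toy data only Team B, whose rating swings widely between seasons, lands above the third quartile.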

Can we predict high volatility teams well via a random forest?

Exploratory Data Analysis

Class Imbalance

Barthag Distribution

Barthag by Conference

Volatility Distribution

Five-Number Summary (and Mean) of Volatility

Statistic	Value
Min.	0.0360298
1st Qu.	0.2165824
Median	0.3010490
Mean	0.3153891
3rd Qu.	0.4039958
Max.	0.7807641

Prevalence of Target Variable: Volatility (High or Low)

Recall that High volatility is the top quartile of volatility scores.

volatility_index	Freq
High	620
Low	1833

Prevalence = 620 / 2453 = 25.28%

Random Forest Model

We want to build a random forest that is really good at predicting highly volatile teams.

Random forest classifier with volatility index (High or Low) as the response.

Use this model to predict and pinpoint highly volatile teams, which would make for risky and successful bets.

Our metric here is true positive rate: we want to be really good at predicting highly volatile teams (low error rate in the positive class).

Metrics of Interest:

True Positive Rate
F1 score (to account for class imbalance)

Initial Model Outputs

	High	Low	class.error
High	0	434	1
Low	0	1284	0

74.7% accuracy: misleading, because the model predicts every team as Low.

OOB Error: 25.26%

Want to adjust the threshold.
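The figures above can be re-derived from the out-of-bag confusion matrix, which also shows why the accuracy is misleading. This is a minimal Python sketch (the analysis itself used R), assuming rows are actual classes and columns are predicted classes, as the class.error column suggests.

```python
# OOB confusion matrix of the initial model: rows = actual, columns = predicted.
confusion = {
    "High": {"High": 0, "Low": 434},
    "Low": {"High": 0, "Low": 1284},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in confusion)

accuracy = correct / total
oob_error = 1 - accuracy
# True positive rate: share of actual High teams that were predicted High.
true_positive_rate = confusion["High"]["High"] / sum(confusion["High"].values())

print(f"accuracy  = {accuracy:.4f}")            # 0.7474
print(f"OOB error = {oob_error:.4f}")           # 0.2526
print(f"TPR       = {true_positive_rate:.4f}")  # 0.0000
```

The accuracy comes entirely from the majority class: not a single High team is identified, so the metric we care about is zero.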

Final Model: Adjusting the Threshold to 0.3

Random Forest Hyperparameters:

The default threshold is 0.5; here it is lowered to 0.3 and the other hyperparameters are tuned.

	High	Low	class.error
High	157	277	0.6382488
Low	172	1112	0.1339564

randomForest(formula = as.factor(volatility_index) ~ ., data = train, ntree = 1000, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, cutoff = c(0.3, 1 - 0.3))
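The cutoff argument in the call above changes how the trees' votes are turned into a class: as documented for randomForest, the winning class is the one with the largest ratio of vote share to cutoff. Below is a minimal Python sketch of that rule, for illustration only. The vote shares are made up, `predict_class` is a hypothetical helper rather than a library function, and the sketch assumes the first cutoff value applies to High (factor levels in alphabetical order).

```python
def predict_class(vote_share, cutoff):
    """Pick the class with the largest ratio of vote share to cutoff."""
    return max(vote_share, key=lambda cls: vote_share[cls] / cutoff[cls])

# Hypothetical team: 35% of trees vote High, 65% vote Low.
votes = {"High": 0.35, "Low": 0.65}

default_cutoff = {"High": 0.5, "Low": 0.5}   # plain majority vote
lowered_cutoff = {"High": 0.3, "Low": 0.7}   # cutoff = c(0.3, 1 - 0.3)

print(predict_class(votes, default_cutoff))  # Low
print(predict_class(votes, lowered_cutoff))  # High
```

With two classes and cutoffs that sum to 1, this amounts to predicting High whenever more than 30% of the trees vote High.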

Adjusting the Threshold to 0.1

Problematic because we predict nearly all teams as high volatility.

The positive class error rate is minimized, but the model is not useful for betting.

	High	Low	class.error
High	430	4	0.0092166
Low	1236	48	0.9626168
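The trade-off between the three thresholds can be made concrete by computing precision, recall (the true positive rate), and F1 from the out-of-bag tables reported above. This is a Python sketch rather than the R used for the analysis. It assumes rows are actual classes and columns are predictions, and that the initial model corresponds to the default threshold of 0.5; the models may also differ in other settings. `precision_recall_f1` is a hypothetical helper.

```python
# OOB counts for the High class at each threshold, read off the tables above:
# (true positives, false negatives, false positives).
oob_counts = {
    0.5: (0, 434, 0),
    0.3: (157, 277, 172),
    0.1: (430, 4, 1236),
}

def precision_recall_f1(tp, fn, fp):
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

results = {t: precision_recall_f1(*c) for t, c in oob_counts.items()}

for threshold, (precision, recall, f1) in results.items():
    print(f"threshold {threshold}: precision={precision:.3f} "
          f"recall={recall:.3f} F1={f1:.3f}")
```

On these counts, lowering the threshold from 0.3 to 0.1 raises recall from about 0.36 to 0.99 but drops precision from about 0.48 to 0.26, so F1 barely moves (about 0.41 in both cases).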

Predicting with the Test Set: Evaluation Metrics

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction High Low
##       High   30  46
##       Low    63 228
##                                           
##                Accuracy : 0.703           
##                  95% CI : (0.6534, 0.7493)
##     No Information Rate : 0.7466          
##     P-Value [Acc > NIR] : 0.9747          
##                                           
##                   Kappa : 0.1646          
##                                           
##  Mcnemar's Test P-Value : 0.1254          
##                                           
##             Sensitivity : 0.32258         
##             Specificity : 0.83212         
##          Pos Pred Value : 0.39474         
##          Neg Pred Value : 0.78351         
##               Precision : 0.39474         
##                  Recall : 0.32258         
##                      F1 : 0.35503         
##              Prevalence : 0.25341         
##          Detection Rate : 0.08174         
##    Detection Prevalence : 0.20708         
##       Balanced Accuracy : 0.57735         
##                                           
##        'Positive' Class : High            
##

Our accuracy (70.3%) is slightly below the no-information rate (74.7%), the accuracy we would get by always predicting Low (1 - prevalence).
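The comparison between accuracy and the majority-class baseline can be checked directly from the test confusion matrix. The Python sketch below (the analysis itself used R) recomputes the accuracy, the no-information rate, and Cohen's kappa from the four counts in the output above.

```python
# Test-set confusion matrix from the output above: (prediction, actual) -> count.
counts = {
    ("High", "High"): 30, ("High", "Low"): 46,
    ("Low", "High"): 63, ("Low", "Low"): 228,
}
classes = ["High", "Low"]
n = sum(counts.values())

accuracy = sum(counts[(c, c)] for c in classes) / n

# Totals per actual class and per predicted class.
actual = {c: sum(counts[(p, c)] for p in classes) for c in classes}
predicted = {c: sum(counts[(c, a)] for a in classes) for c in classes}

# No-information rate: accuracy of always predicting the most common class.
no_information_rate = max(actual.values()) / n

# Cohen's kappa: agreement beyond what the class totals alone would produce.
expected = sum(actual[c] * predicted[c] for c in classes) / n**2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy            = {accuracy:.4f}")             # 0.7030
print(f"no-information rate = {no_information_rate:.4f}")  # 0.7466
print(f"kappa               = {kappa:.4f}")                # 0.1646
```

A kappa this close to zero says the predictions agree with the true labels only slightly more than the class totals alone would explain.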

Variable Importances

ROC Curve

Conclusions

Our model isn’t much better than random guessing
Class imbalance
Forecasting and time series
Lots of relevant variables are hard to measure: team chemistry, home court advantage, injuries, etc.
Teams change so much within a year
How educated of a guess can you make when gambling?

Fairness

Not all schools and not all teams are created equal
Keep biases in mind.

Limitations and Future Work

Incorporate other variables related to schools
2020 data unavailable because of COVID
Try unsupervised approaches or neural networks
Restructure the problem to utilize time series data: change train and test sets
Rethink metrics: what makes a good bet?
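One way to change the train and test sets for a time series framing is to split by season instead of at random, so the model is always evaluated on seasons later than the ones it was trained on. Below is a minimal Python sketch of such a split. It is not something done in this analysis; the records are made up, and `temporal_split` and the cutoff season are hypothetical.

```python
# Hypothetical records: (team, season, BARTHAG). Values are made up.
records = [
    ("Team A", 2017, 0.91), ("Team A", 2018, 0.93),
    ("Team A", 2019, 0.90), ("Team A", 2021, 0.94),
    ("Team B", 2017, 0.35), ("Team B", 2018, 0.72),
    ("Team B", 2019, 0.41), ("Team B", 2021, 0.80),
]

def temporal_split(rows, last_train_season):
    """Train on seasons up to last_train_season, test on later seasons."""
    train = [r for r in rows if r[1] <= last_train_season]
    test = [r for r in rows if r[1] > last_train_season]
    return train, test

train, test = temporal_split(records, last_train_season=2018)

print("train seasons:", sorted({season for _, season, _ in train}))
print("test seasons: ", sorted({season for _, season, _ in test}))
```

Any volatility label would then have to be computed from the training seasons only, so that nothing from the test seasons leaks into the features or the target.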

Betting and team success isn’t predictable, and that’s why it’s fun!

Sources

https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset
https://rpubs.com/phamdinhkhanh/389752
https://stackoverflow.com/questions/57939453/building-a-randomforest-with-caret
https://stackoverflow.com/questions/25433805/how-do-you-change-the-cutoff-parameter-in-rs-randomforest
https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/tutorial-random-forest-parameter-tuning-r/tutorial/