March Madness

Background

I wanted to see if I could build a useful logistic regression model to determine what helps the higher-seeded teams (those with the lower seed numbers) win in the March Madness tournament held every year. To do so, I gathered data from the 2013-2017 tournaments, which gave me enough games to find significant trends.

Hypothesis

What best predicts higher seeded teams winning in the tournament?

\[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \]

\[ H_0: \beta_2 = 0 \\ H_a: \beta_2 \neq 0 \]

\[ H_0: \beta_3 = 0 \\ H_a: \beta_3 \neq 0 \]

Here is my proposed final model, where \(x_{1i}\) is the power rating difference (BARTHAG_diff), \(x_{2i}\) is the seed difference (seed_diff), and \(x_{3i}\) is the adjusted tempo difference (ADJT_diff) for game \(i\):

\[ P(Y_i = 1 \mid x_{1i}, x_{2i}, x_{3i}) = \frac{e^{\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i}}}{1 + e^{\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i}}} = \pi_i \]

And our significance threshold:

\[ \alpha = .05 \]

Analysis

library(pander)  # for pander() table output
madness.glm <- glm(lower_seed_won == 1 ~ BARTHAG_diff + seed_diff + ADJT_diff, data = madness2, family = binomial)

pander(summary(madness.glm))

	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-0.7314	0.2734	-2.675	0.007472
BARTHAG_diff	24.55	3.423	7.173	7.355e-13
seed_diff	0.1431	0.05835	2.452	0.0142
ADJT_diff	0.1141	0.04348	2.624	0.008686

(Dispersion parameter for binomial family taken to be 1 )

Null deviance:	411.4 on 317 degrees of freedom
Residual deviance:	245.2 on 314 degrees of freedom

All three predictors are significant at our stated threshold. With a residual deviance of 245.2 and four estimated coefficients, the model has an AIC of about 253.2 (245.2 + 2 × 4), the lowest of all the models I tried after starting around the 420 mark. We can now use this model and interpret it below.
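The deviances in the summary output also allow a quick check of the model as a whole. The sketch below uses only the numbers printed above (the variable names are my own) to compare the fitted model with the intercept-only model through a likelihood ratio test, and to compute McFadden's pseudo R-squared as a rough measure of fit. It is a sanity check under the usual chi-square approximation, not part of the original analysis.

```r
# Deviances and degrees of freedom copied from the summary output above
null_dev  <- 411.4
null_df   <- 317
resid_dev <- 245.2
resid_df  <- 314

# Likelihood ratio test: drop in deviance from adding the three predictors
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared: share of the null deviance explained
mcfadden_r2 <- 1 - resid_dev / null_dev

round(c(lr_stat = lr_stat, lr_df = lr_df, mcfadden_r2 = mcfadden_r2), 3)
p_value
```

A very small p-value here says the three predictors together improve on the intercept-only model, which agrees with the individual z tests.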

Graphics

plot(lower_seed_won == 1 ~ BARTHAG_diff, data = madness2, pch = 21, bg=rgb(.1,.1,.6,.1), col = "gray", cex = 1.5,
     main = "March Madness Logistic Regression", xlab = "Power Rating Difference", ylab = "Lower Seed Won")



b <- coef(madness.glm)
b

##  (Intercept) BARTHAG_diff    seed_diff    ADJT_diff 
##   -0.7313701   24.5523302    0.1430917    0.1140932

# curve() supplies badiff as the x-axis variable (xname = "badiff"),
# so only seeddiff and adjtdiff need to be set for each scenario.

# Baseline: seed difference 1, tempo difference 1
seeddiff <- 1
adjtdiff <- 1
curve(exp(b[1] + b[2]*badiff + b[3]*seeddiff + b[4]*adjtdiff) / (1 + exp(b[1] + b[2]*badiff + b[3]*seeddiff + b[4]*adjtdiff)),
      add = TRUE, col = palette()[1], xname = "badiff")

# Scenario 2: seed difference 5, tempo difference 1
seeddiff <- 5
adjtdiff <- 1
curve(exp(b[1] + b[2]*badiff + b[3]*seeddiff + b[4]*adjtdiff) / (1 + exp(b[1] + b[2]*badiff + b[3]*seeddiff + b[4]*adjtdiff)),
      add = TRUE, col = palette()[2], xname = "badiff")

# Scenario 3: seed difference 8, tempo difference 0
seeddiff <- 8
adjtdiff <- 0
curve(exp(b[1] + b[2]*badiff + b[3]*seeddiff + b[4]*adjtdiff) / (1 + exp(b[1] + b[2]*badiff + b[3]*seeddiff + b[4]*adjtdiff)),
      add = TRUE, col = palette()[3], xname = "badiff")

legend("topleft", legend = c("Baseline", "Scenario 2", "Scenario 3"), lwd = c(2, 2, 2), col = palette()[1:3], bty = "n")

Scenario 1 (Baseline):

I call the first scenario the baseline: the seed difference and the adjusted tempo difference both equal 1, and the power rating difference is 0.

pander(predict(madness.glm, data.frame(BARTHAG_diff= 0, seed_diff = 1, ADJT_diff = 1), type = "response"))

1
0.3836

This scenario gives the higher-seeded team a 38.4% chance to win.

Scenario 2:

pander(predict(madness.glm, data.frame(BARTHAG_diff=.05, seed_diff = 5, ADJT_diff = 1), type = "response"))

1
0.7901

This is a scenario that you may see when comparing two teams. We input a power rating difference of 0.05, a seed difference of 5, and an adjusted tempo rating difference of 1. The output tells us that the higher-seeded team has a 79% chance to win based on these statistics.

Scenario 3:

pander(predict(madness.glm, data.frame(BARTHAG_diff= .07, seed_diff = 8, ADJT_diff = 0), type = "response"))

1
0.894

This is a scenario that you may see in the first round. We input a power rating difference of 0.07, a seed difference of 8, and an adjusted tempo rating difference of 0. The output tells us that the higher-seeded team has an 89.4% chance to win based on these statistics.
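The three predictions above can also be reproduced by hand from the fitted coefficients, which makes it easier to see how each input moves the probability. The sketch below copies the coefficients printed earlier and applies the inverse logit with plogis(); the scenarios data frame and the other variable names are my own, not part of the original analysis.

```r
# Coefficients copied from the coef(madness.glm) output above
b <- c("(Intercept)" = -0.7313701,
       BARTHAG_diff  = 24.5523302,
       seed_diff     = 0.1430917,
       ADJT_diff     = 0.1140932)

# One row per scenario (baseline, scenario 2, scenario 3)
scenarios <- data.frame(
  BARTHAG_diff = c(0, 0.05, 0.07),
  seed_diff    = c(1, 5, 8),
  ADJT_diff    = c(1, 1, 0)
)

# Linear predictor (log-odds), then the inverse logit to get probabilities
eta  <- b[["(Intercept)"]] + drop(as.matrix(scenarios) %*% b[names(scenarios)])
prob <- plogis(eta)

round(cbind(scenarios, log_odds = eta, probability = prob), 4)
```

The probability column should match the predict() results for the three scenarios, up to rounding.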

Interpretation of Variables

The number above each sentence is an odds ratio: the exponentiated coefficient, which gives the multiplicative change in the odds that the higher-seeded team wins for a one-unit increase in that variable (0.05 units for the power rating), holding the other variables constant. Each 0.01 above 1 is a 1% increase in the odds, so 1.154 is a 15.4% increase. A value below 1 would be a percent decrease, found by subtracting the odds ratio from 1. The first value is above 3, a change of well over 100%, so I interpret it as a multiple (3.4 times the odds) for easier understanding.

pander(exp(24.5523302*.05))

3.413

The odds of winning the game as the higher seed are 3.4 times as large for every 0.05 increase in your power rating advantage over your opponent.

pander(exp(0.1430917))

1.154

The odds of the higher seed winning increase by 15.4% for every additional seed line separating you from your opponent.

pander(exp(0.1140932))

1.121

The odds of the higher-seeded team winning increase by 12.1% for every point of advantage in adjusted tempo rating over the other team.
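To show how precise these odds ratios are, the sketch below adds approximate 95% Wald intervals using the estimates and standard errors from the summary table. The 0.05 step for the power rating matches the interpretation above; the variable names and the choice of a Wald interval (rather than a profile-likelihood interval from confint()) are my own assumptions.

```r
# Estimates and standard errors copied from the summary output
est <- c(BARTHAG_diff = 24.5523302, seed_diff = 0.1430917, ADJT_diff = 0.1140932)
se  <- c(BARTHAG_diff = 3.423,      seed_diff = 0.05835,   ADJT_diff = 0.04348)

# Size of the change being interpreted for each variable
step <- c(BARTHAG_diff = 0.05, seed_diff = 1, ADJT_diff = 1)

# Wald interval on the log-odds scale, then exponentiate
z <- qnorm(0.975)
odds_ratios <- data.frame(
  odds_ratio = exp(est * step),
  lower      = exp((est - z * se) * step),
  upper      = exp((est + z * se) * step)
)

round(odds_ratios, 3)
```

An interval that stays above 1 agrees with the variable being significant at the 0.05 level.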

Model Validation

set.seed(121)
n <- 200

keep <- sample(1:nrow(madness2), n)

mytrain <- madness2[keep, ]
mytest <- madness2[-keep, ]


madness.val <- glm(lower_seed_won == 1 ~  BARTHAG_diff + seed_diff + ADJT_diff, 
            data=mytrain, family="binomial")

mypreds <- predict(madness.val, mytest, type="response")

pander(table(mytest$lower_seed_won, ifelse(mypreds > 0.5, 1, 0))) #confusion matrix

	0	1
0	28	8
1	12	70

pander((28 + 70)/ (28 + 70 + 12 + 8))

0.8305

We validated our model by splitting the data into a 200-game training set and a 118-game test set, fitting the model on the training set, and predicting the test games. We predicted 83% of the test games correctly from our data that spans 2013 to 2017, which is very good for a sport with high variability in outcomes.
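Accuracy alone can hide how the model does on each outcome, so the sketch below breaks the confusion matrix into a few more rates. It rebuilds the table from the counts printed above (rows are actual outcomes, columns are predictions) and compares the model with a naive rule that always predicts the more common outcome. The variable names are my own.

```r
# Confusion matrix copied from the output above
cm <- matrix(c(28, 12, 8, 70), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # share of games predicted correctly
sensitivity <- cm["1", "1"] / sum(cm["1", ])    # higher-seed wins that were predicted
specificity <- cm["0", "0"] / sum(cm["0", ])    # upsets that were predicted

# Naive baseline: always predict the more common actual outcome
baseline <- max(rowSums(cm)) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 4)
```

The gap between accuracy and the baseline shows how much the predictors add beyond always picking the higher seed.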

Conclusion

Our model seemed to work very well. We found that the difference in power rating is a very important factor in determining the winner of the game, and that the seed differential and the difference in adjusted tempo also give good insight into the chances of the higher-seeded team winning.

Appendix

While the data provided to me was a little outdated, I wanted to see how this model would predict the 2023 March Madness tournament. I measured the success of the model in two different situations: one was filling out a bracket from start to finish, and the other was predicting each matchup on a round-by-round basis, regardless of who was predicted to win the previous round.

The Perfect Bracket

First we will start with the task everyone wishes they could get right: predicting the perfect bracket. With the prize money for a perfect bracket increasing every year, the desire to crack the code keeps growing. While this model wasn't perfect, it could get you to the top of your company or friend group and earn you a little money on the side.

My model received a total score of 760 out of a possible 1,920 for the 2023 tournament. That may not sound like a lot, but with all of the early upsets in this year's tournament, no perfect brackets remained after the first round, and the highest score anyone received was 1,600. A score of 760 would put you in the top 3% of the 20 million brackets created. The model fell a little short in the Sweet Sixteen and Elite Eight rounds but was successful in predicting UConn to be a title contender, which earned points deep into the tournament. Below are the point breakdowns for each round.

Round of 64: 240 of 320

Round of 32: 240 of 320

Sweet Sixteen: 40 of 320

Elite Eight: 80 of 320

Final Four: 160 of 320

Championship: 0 of 320
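The round totals above can be tallied with a few lines of R. The sketch below assumes every game within a round is worth the same number of points (320 divided by the number of games in that round), which is my assumption about the scoring rather than something stated above, and uses it to back out how many picks were correct in each round. The variable names are my own.

```r
# Points earned per round, copied from the breakdown above
round_points <- c("Round of 64" = 240, "Round of 32" = 240, "Sweet Sixteen" = 40,
                  "Elite Eight" = 80, "Final Four" = 160, "Championship" = 0)

games_per_round <- c(32, 16, 8, 4, 2, 1)
max_per_round   <- 320

# Assumed scoring: each game in a round is worth 320 / (games in that round)
points_per_game <- max_per_round / games_per_round
correct_picks   <- round_points / points_per_game

total_score <- sum(round_points)
max_score   <- max_per_round * length(round_points)

data.frame(points = round_points, points_per_game, correct_picks)
c(total_score = total_score, max_score = max_score)
```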

A Bettor’s Model

A bettor may care less about filling out a fully successful bracket and more about who is going to win each matchup. When predicting on a game-by-game basis, the model can predict the matchups in the middle and late rounds knowing which teams actually advanced. The model predicted 45 out of 63 games correctly throughout the whole tournament, which is 71% accurate. This is a little lower than the validated accuracy of our model. That could be due to an unusually large number of deep tournament runs by lower seeds and to the slightly older data. Even so, the model predicted enough games correctly to put you in plus territory in a bettor's world. It also predicted the final two rounds with 100% accuracy. Below is the round-by-round breakdown of the predicted outcomes.

Round of 64: 24 out of 32 - 75%

Round of 32: 13 out of 16 - 81%

Sweet Sixteen: 3 out of 8 - 38%

Elite Eight: 2 out of 4 - 50%

Final Four: 2 out of 2 - 100%

Championship: 1 out of 1 - 100%

Total Games: 45 out of 63 - 71%
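The percentages above are simple ratios, and with only 63 games the overall figure carries real uncertainty. The sketch below recomputes the per-round and overall accuracy from the counts listed above and, as an addition of my own, uses binom.test() to get an exact 95% confidence interval for the overall accuracy. The variable names are my own.

```r
# Correct predictions and games played per round, copied from the list above
rounds  <- c("Round of 64", "Round of 32", "Sweet Sixteen",
             "Elite Eight", "Final Four", "Championship")
correct <- c(24, 13, 3, 2, 2, 1)
games   <- c(32, 16, 8, 4, 2, 1)

round_accuracy   <- setNames(correct / games, rounds)
overall_accuracy <- sum(correct) / sum(games)

round(100 * round_accuracy)
round(100 * overall_accuracy, 1)

# Exact (Clopper-Pearson) 95% interval for the overall accuracy
overall_ci <- binom.test(sum(correct), sum(games))$conf.int
round(overall_ci, 3)
```

The later rounds have so few games that their individual percentages should be read loosely.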