Abstract

In college basketball, predicting games in the March Madness tournament can be a difficult task. However, using statistics we can fit linear and logistic models that predict, for any given tournament matchup, the higher ranked team's margin of victory or probability of victory, respectively. Using 21 years of tournament data, we fit five different models using methods such as minimizing Akaike's Information Criterion and maximizing the Adjusted R-squared, and then tested each model's predictive accuracy on past tournaments. The results are promising: many of our models predicted games correctly over 70% of the time, and one model correctly predicted 51% of all upsets from 2000 to 2021.


Introduction

When you hear the words college basketball and March in the same sentence, it can only mean one thing: it's time for March Madness. Every year, at the end of the college basketball season, a committee gets together and picks the 64 best Division I teams to compete in a single elimination tournament until only one team is left. This tournament is a big deal because people all over the world compete in bracket challenges hoping to make a perfect bracket. That has never been accomplished in the history of the tournament, and it is not hard to see why: if you treat every game as a 50/50 toss-up, the odds of predicting all 63 games correctly are about 1 in 9.2 quintillion (2^63).

The grand prize for predicting a perfect bracket is $1 billion, and Warren Buffett has recently revamped the contest, announcing that the top 20 imperfect brackets would also receive $100,000 each in the event of a perfect one. The best attempt at predicting all 63 games correctly came in 2019, when Gregg Nigl correctly chose the first 49 games, beating the previous record by 10. To put this in perspective, between 60 million and 100 million brackets are filled out each year, and only two have ever had 39 or more consecutive games predicted correctly. Our thought process is that if we can figure out which factors consistently have the most influence on the games, we might be able to predict future winners and increase our chances of producing the elusive perfect bracket.

Our goal is to collect data from previous tournaments and develop a regression model that can predict the winner of each game in the tournament. We gathered data on every March Madness game from 2000 to 2021, originally collecting 54 variables for each game. The majority of these predictors are various statistics we thought would be relevant for each team; average points, assists, and rebounds per game were included, among many others. For consistency, we designated the higher ranked team in each game as "Team A" and the lower as "Team B." Originally, we wanted to predict Team A's margin of victory (MOV), or how much they would win by. A positive margin of victory indicates a victory for A, while a negative value means that B won. Further along in our research, we also decided to fit a model that predicts the probability of A winning rather than MOV.


Methods

Early in our analysis, we ran into a problem: we would often find that one of a pair of matching statistics was significant while the other was not. For example, Team A's average assists per game might be significant but Team B's would not be. Instead of debating whether to keep both variables when only one of the pair is significant, we decided to combine A's and B's stats by taking their difference. Instead of having separate variables "A_AST" and "B_AST", we created "diffAST", which is A's average assists minus B's average assists, and we did this for every A and B pair. If diffAST is positive, A averaged more assists; if it is negative, B did. We found that this new dataset of differences solved the issue and still provided meaningful results.
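
As a rough illustration, here is a minimal sketch of how such difference columns could be built in R. The A_/B_ prefixes and "diffAST" follow the naming in the text, but the data frame name `games` and the particular list of stat abbreviations are our assumptions, not the authors' code.

```r
# Build one "diff" column per paired stat: Team A's value minus Team B's.
# `games` is a hypothetical data frame; the stat list below is an illustrative subset.
stats <- c("AST", "STL", "PF", "PTS")

for (s in stats) {
  games[[paste0("diff", s)]] <- games[[paste0("A_", s)]] - games[[paste0("B_", s)]]
}
```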

Using this dataset of differences, we created four models to predict MOV: one that minimizes Akaike's Information Criterion (AIC), one that maximizes the Adjusted R-squared, one that minimizes AIC but removes the y-intercept, and one that contains only a y-intercept. We also used logistic regression to fit a model that predicts the probability of Team A winning. The following provides a brief description of each model and how it was chosen.

Our first model was selected by minimizing Akaike's Information Criterion, a popular method for deciding which predictors to include in a model. The "best" model is the one with the lowest AIC score, which we computed using a formula from Julian Faraway's book *Linear Models with R* (Faraway 2014); when comparing all possible combinations of predictors, the model with eight of them minimized the AIC (Fig. 1).


Figure 1: Plot of AIC for the best model for each number of predictors. The model with 8 predictors minimizes AIC.

The eight most significant variables for predicting MOV (selected using the regsubsets function from the leaps package in R) are the differences in assists, steals, personal fouls, points per game, opponent's points per game, win-loss percentage, strength of schedule (SOS), and pace. The Adjusted R-squared for this model is 0.3665. All of the predictors are significant at the 10% significance level (Fig. 2).


Figure 2: R summary output for each of the predictors for the AIC model. All are significant at the 1% significance level except diffPF, which is significant at the 10% level.
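
A sketch of this selection step is shown below, assuming the difference dataset lives in a data frame called `diff_data` holding MOV plus the difference predictors; that name, `aic_model`, and the specific predictor column names are our assumptions rather than the authors' code.

```r
library(leaps)

# Exhaustive best-subset search over the difference variables
# (assumes diff_data contains MOV plus only the candidate predictors).
subsets <- regsubsets(MOV ~ ., data = diff_data, nvmax = 15)
rs <- summary(subsets)

# AIC for each subset size from its RSS, using the form given in Faraway (2014):
# AIC = n*log(RSS/n) + 2p, where p is the number of estimated parameters.
n   <- nrow(diff_data)
p   <- seq_along(rs$rss) + 1          # predictors plus the intercept
aic <- n * log(rs$rss / n) + 2 * p

plot(aic, type = "b", xlab = "Number of predictors", ylab = "AIC")
which.min(aic)                        # 8 predictors in the fit described above

# Refit the chosen size as an ordinary lm so it can be used for prediction;
# the eight predictor names are guesses at the dataset's column names.
aic_model <- lm(MOV ~ diffAST + diffSTL + diffPF + diffPTS + diffOppPTS +
                  diffWL + diffSOS + diffPace, data = diff_data)
summary(aic_model)
```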

We selected our second model by maximizing the Adjusted R-squared, another common technique for model selection. In general, a higher Adjusted R-squared is considered to be better, and as Fig. 3 shows, the model with 13 parameters (12 predictors plus one y-intercept) maximizes it.


Figure 3: Plot of Adjusted R-squared for each possible number of parameters. The model with 13 parameters maximizes the Adjusted R-squared.

The 12 most significant predictors are the same as those in the AIC model, plus the differences in free throw percentage, defensive rating, offensive rating, and win-loss percentage over the last 10 games. The Adjusted R-squared for this model was 0.3672. None of the added variables are significant (Fig. 4).


Figure 4: R summary output for each of the predictors for the R-squared model.
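
Continuing the earlier sketch, the summary of a regsubsets fit reports the Adjusted R-squared for each subset size directly, so this selection only needs to pick out the size that maximizes it (`rs` is the hypothetical summary object from the previous sketch):

```r
# Adjusted R-squared by subset size, from the same best-subset search as before.
plot(rs$adjr2, type = "b", xlab = "Number of predictors", ylab = "Adjusted R-squared")
which.max(rs$adjr2)    # 12 predictors (13 parameters with the intercept) in the fit above
```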

The other two models we fit to predict MOV were a model with no y-intercept and a model with only a y-intercept. The no-intercept model was found by minimizing AIC and ended up using the same eight predictors as the previous AIC model. The reasoning behind removing the intercept was to give both teams an "even playing field": as it stands, the original AIC model has a negative intercept, meaning that in a completely even matchup it will actually favor the lower ranked team. The intercept-only model represents the strategy of always picking the higher ranked team no matter what, which is the equivalent of "auto-filling" a bracket. Neither of these models receives as in-depth a discussion as the others because neither ended up predicting particularly well.
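
In R these two reference fits are one-liners; a sketch under the same assumed object and column names as before:

```r
# Same eight predictors with the intercept dropped (the "even playing field" fit),
# and the pick-the-higher-seed baseline, which is intercept only.
no_intercept_model   <- lm(MOV ~ 0 + diffAST + diffSTL + diffPF + diffPTS +
                             diffOppPTS + diffWL + diffSOS + diffPace,
                           data = diff_data)
intercept_only_model <- lm(MOV ~ 1, data = diff_data)
```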

Lastly, we fit a model using logistic regression, which predicts the probability that the higher ranked team will win as opposed to margin of victory. The predictors for this model were chosen by minimizing AIC as shown in Fig. 5.


Figure 5: Plot of AIC for the best model for each number of predictors for predicting the probability of Team A winning. The model with 6 predictors minimizes AIC.

We found that the six predictors most significant for predicting the probability of the higher ranked team winning are the difference in assists, steals, personal fouls, win-loss percentage, SOS, and offensive rating. All six of these variables are significant at the 10% level (Fig. 6).


Figure 6: R summary output for each of the predictors for the logistic model.
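
A minimal sketch of the logistic fit, assuming the data frame also carries a 0/1 indicator of the higher ranked team winning (called A_Win later in the paper); the six difference column names are again our guesses:

```r
# Logistic regression for the probability that Team A (the higher seed) wins.
logit_model <- glm(A_Win ~ diffAST + diffSTL + diffPF + diffWL + diffSOS + diffORtg,
                   data = diff_data, family = binomial)
summary(logit_model)
```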

At this point, we had five different models and no way to say which one is "best." To figure this out, we used each model to predict March Madness games in a variety of situations to see which ones predicted well most consistently. The following section goes into more detail about how our models performed when it came to actually predicting games.


Results/Discussion


Figure 7: Table displaying prediction accuracies for each model.

For each model, we wanted to determine what percentage of games it got correct across all games from 2000 to 2021, within the 2022, 2021, and 2014 tournaments individually, and how many upsets it predicted correctly overall. To calculate this, we created a variable "A_Win" that is 1 if the higher ranked team wins and 0 if it does not. For the models predicting margin of victory, if the predicted MOV for a game was positive we converted it to a 1, since that means the higher ranked team is expected to score more points; if it was negative, we made it a 0. For the logistic model we did something similar: since it predicts the probability of Team A winning, we converted predicted probabilities greater than 0.5 to 1 and everything else to 0. This left us with a long list of ones and zeros for each model; the next step was to count how often those matched the ones and zeros corresponding to the actual wins and losses, and from there we calculated each model's prediction accuracy.
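
A sketch of that scoring step, reusing the hypothetical `aic_model` and `logit_model` objects from earlier and a placeholder `test_games` data frame holding the games being scored:

```r
# Convert raw model output into 1/0 picks and compare against A_Win.
pred_mov  <- predict(aic_model,   newdata = test_games)                     # margin of victory
pred_prob <- predict(logit_model, newdata = test_games, type = "response")  # P(Team A wins)

pick_mov  <- as.numeric(pred_mov  > 0)     # positive predicted margin -> pick Team A
pick_prob <- as.numeric(pred_prob > 0.5)   # predicted probability over 0.5 -> pick Team A

mean(pick_mov  == test_games$A_Win)        # per-game accuracy, AIC model
mean(pick_prob == test_games$A_Win)        # per-game accuracy, logistic model
```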

One important thing to note is that this calculates the percentage of games our models predict correctly given that each matchup has already been determined, i.e., how well you would do if you chose the winner right before each game starts. The percentage you would get correct if you used each model to fill out an entire bracket before any games had been played ends up being lower (see the note in Fig. 7 for more details). These latter percentages were calculated by using Excel to fill out a bracket based on each model and comparing those predictions to what actually happened each year. For the following discussion, we will primarily focus on the AIC and logistic models, as they ended up being the best overall.

For the first round of predictions, we used all our data from 2000 to 2021 to see how well our models predicted 2022’s tournament. The results were less than ideal, with some of our models predicting barely better than a coin flip. However, after further experimentation with predictions for other years along with an analysis of the 2022 tournament, we have come to the conclusion that 2022 was an abnormal year and that our models are still good overall.


Figure 8: Bracket for 2022 as filled in by the AIC model. Games in green are correct predictions; games in red are incorrect.

For 2022, given that every matchup had occurred, the AIC model was right 61.9% of the time, but when used to fill out a bracket it only predicted 50.8% of the games correctly. The logistic model fared a little better, with a per-game accuracy of 63.5% and a bracket accuracy of 52.4%. The intercept-only model, which is the equivalent of filling out a bracket by always picking the higher ranked team, ended up being the best of the five this year (the only year it was not the worst model, as seen in Fig. 7).

In defense of our models, we believe that 2022 was a "bad" year for brackets. There were many anomalies that contributed to this, such as Saint Peter's beating 2, 3, and 7 seeds to reach the Elite 8 as a 15 seed (a first in tournament history), North Carolina making the championship game as an 8 seed, Miami reaching the Elite 8 as a 10 seed, and three of the four 1 seeds losing before the Elite 8. Additionally, in a normal year the higher ranked team wins approximately 71% of the time, but in 2022 higher seeds won about two thirds of the time, a drop of roughly four percentage points. That is also assuming that every matchup has been determined; if you filled out a bracket only picking the higher ranked teams, you would have gotten only about 57% of the games right. None of this necessarily proves our models are good, though. To test that, we evaluated the models on some of the other "bad" years for brackets.


Figure 9: Bracket for 2021 as filled in by the AIC model.

We chose to test how our models would perform for 2021’s tournament because 2021 had the most double-digit seed wins of any tournament (15, the next closest years had 12) and it is tied for third for total upsets (20), where an “upset” is defined as the lower ranked team winning; we think these factors qualify it as a “bad” year. We took 2021 out of the dataset and refit our models to avoid getting predictions that were too optimistic. Both the AIC and the logistic models had a per-game accuracy of 75.8%, while the AIC had a bracket accuracy of 75.8% compared to the logistic model’s 74.2%.
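
A sketch of that hold-out step, assuming the data frame has a Year column and reusing the hypothetical model objects from the earlier sketches:

```r
# Refit without the held-out tournament, then score that year's games.
train <- subset(diff_data, Year != 2021)
test  <- subset(diff_data, Year == 2021)

aic_2021   <- update(aic_model,   data = train)
logit_2021 <- update(logit_model, data = train)

mean((predict(aic_2021, newdata = test) > 0) == test$A_Win)                          # per-game, AIC
mean((predict(logit_2021, newdata = test, type = "response") > 0.5) == test$A_Win)   # per-game, logistic
```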


Figure 10: Bracket for 2021 as filled in by the logistic model.

This year contains several "triumphs" for our models. The AIC model correctly predicted six of the eight teams in the Elite 8, and both the AIC and logistic models got three out of four Final Four teams, the championship game, and the champion correct. Both models also correctly predicted UCLA reaching the Sweet 16 as an 11 seed (they ended up making the Final Four) as well as the 6 seed USC versus 7 seed Oregon matchup in the Sweet 16. The AIC model also does roughly 19% better than a bracket that picks the higher ranked team every time.


Figure 11: Bracket for 2014 as filled in by the logistic model.

We also decided to test our models against 2014's tournament because 2014 had the most total upsets of any bracket (23). As with 2021, we took 2014 out of the data and refit our models to avoid overestimating their accuracy. For this year, the AIC model has a per-game accuracy of 74.6% and a bracket accuracy of 68.3%, while the logistic model has a per-game accuracy of about 81% and a bracket accuracy of 71.4%. One of the most striking things about our predictions for this tournament is that both the AIC and logistic models predicted 7 seed UConn getting all the way to the Final Four (they ended up winning the championship). The logistic model also predicted three of the four Final Four teams correctly.

Here is a link to pictures of how all five models filled out the brackets each year if you are interested in seeing how models not mentioned in this discussion did.

These are all predictions for individual years, but we also wanted to get an idea of how well the models perform overall. To do this, we tested each model's predictive ability on every game from 2000 to 2021. The logistic model predicted 78.29% of the games correctly, while the AIC model came in a little lower at 76.63%. It should be noted that these percentages are probably too optimistic, as the models were fit to the very games we are trying to predict.

In addition to this, we wanted to see how well each model predicts upsets, where an upset is defined as the lower ranked team winning. We looked at every upset from 2000 to 2021 and found that the logistic model predicted a little over half of all upsets correctly (51.04%). The AIC model had the second highest upset prediction accuracy, coming in at 38.54%. Again, these are potentially optimistic estimates, as the games we are predicting are the same games we used to fit the models.
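
A sketch of how that upset check can be computed, with the same caveat that the models are being scored on games they were fit to (object names as in the earlier sketches):

```r
# Among games the lower ranked team actually won (A_Win == 0), the share of
# games where each model also picked the lower ranked team.
upsets <- subset(diff_data, A_Win == 0)

mean(predict(logit_model, newdata = upsets, type = "response") <= 0.5)   # logistic model
mean(predict(aic_model,   newdata = upsets) <= 0)                        # AIC model
```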

The two standout models from our testing are the logistic and AIC models. The logistic model seems to be the best overall; it consistently does either the best or close to it. The AIC model is the best of the models predicting MOV, and it is usually relatively close to the logistic model in predictive accuracy. It also appears to meet the standard regression assumptions, as the plots in Fig. 12 show; the only questionable one is the normality of the residuals, which look potentially positively skewed. A Shapiro-Wilk test returns a p-value of 0.03257, suggesting that the residuals are not normally distributed; however, we concluded that they are close enough to normal and that the sample size of 1322 games is sufficiently large for the model to still be useful.


Figure 12: Plots to test assumptions of the AIC model.
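
The diagnostics in Fig. 12 and the normality test can be reproduced in base R for a fitted lm such as our hypothetical `aic_model`:

```r
# Standard lm diagnostic plots plus a Shapiro-Wilk test on the residuals.
par(mfrow = c(2, 2))
plot(aic_model)
shapiro.test(residuals(aic_model))   # the text above reports p = 0.03257
```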

One thing that should be mentioned about this model is that three of the variables, the differences in points per game, opponent's points per game, and pace, exhibit a significant amount of collinearity, which is not ideal. However, we think they are still worth including, because collinearity does not have a major effect on predictive accuracy as long as the data you are predicting on have similar correlations, which the games in 2022 do (e.g., when a team's points per game goes up, so does its pace, and that holds both for our games from 2000 to 2021 and for the games in 2022). What collinearity does affect is our ability to interpret the slopes of the correlated variables, but we are not too worried about that, as the goal of our models is making predictions rather than interpreting the effect each variable has on MOV.
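
One common way to quantify this is with variance inflation factors from the car package (one of the packages listed in the references); a sketch, again using the hypothetical `aic_model`:

```r
library(car)

# VIFs well above 5 or 10 for the pace and scoring differences would flag
# the collinearity described above.
vif(aic_model)
```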

Additionally, up to this point we have only focused on how well the AIC model predicts the outcome of games and have neglected to mention how well it predicts what it actually outputs: margin of victory. Unfortunately, if you are looking for a model to tell you how much a team will win by, ours may not be the best. In 2022, the AIC model was off by an average of 10 points per game, and across all games from 2000 to 2021 it was off by about 8. These numbers are quite high, but that is most likely due to basketball scores being highly variable and to the presence of major outliers, which will be discussed later.
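
The average miss quoted above is simply a mean absolute error; a sketch, with `games_2022` as a placeholder data frame for that year's matchups and MOV as the actual margin:

```r
# Mean absolute error of the predicted margins, in points per game.
mean(abs(predict(aic_model, newdata = games_2022) - games_2022$MOV))
```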

The logistic model does not hold up as well when you look at its assumptions. Most of them are met: the response variable is binary, the observations are independent, there is no multicollinearity, and the sample size is sufficiently large. However, one of the assumptions is that each of the predictor variables has a linear relationship with the logit of the response variable. As can be seen in Fig. 13, it might be a stretch to call some of these relationships linear.


Figure 13: Plot of each predictor against the logit of the response; the blue lines should be approximately linear.
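
A rough sketch of the check behind Fig. 13: plot each predictor against the logit of the model's fitted probabilities and overlay a smoother, which should look close to a straight line (predictor and data frame names are our assumptions):

```r
# Linearity-in-the-logit check for the logistic model.
probs <- predict(logit_model, type = "response")
logit <- log(probs / (1 - probs))

vars <- c("diffAST", "diffSTL", "diffPF", "diffWL", "diffSOS", "diffORtg")
par(mfrow = c(2, 3))
for (v in vars) {
  plot(diff_data[[v]], logit, xlab = v, ylab = "Logit of fitted probability")
  lines(lowess(diff_data[[v]], logit), col = "blue")   # should look roughly linear
}
```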

That being said, the reason this model is still in consideration is that it is still predicting games well. One other problem is that, as mentioned earlier, this dataset has some rather extreme outliers. For instance, a game like Virginia versus UMBC in 2018, where UMBC became the first 16 seed in tournament history to beat a 1 seed, is a major outlier; the logistic model said Virginia had a 99.7% chance of winning, while the AIC model predicted them to win by roughly 24 points. UMBC ended up winning by 20. Single games such as this can have a significant impact on our models, which is not ideal.

This led us to the question of what to do with these outliers. One way of dealing with them is to just remove them from the dataset altogether; however, we could not in good faith justify removing these games. Yes, they are outliers and yes, they may be negatively impacting the quality of our models, but they are still real outcomes that did happen. We cannot just pretend UMBC did not beat Virginia, or that Saint Peter’s did not upset Kentucky in 2022, and so we chose to keep the outliers.

We also wanted to compare the AIC and logistic models in more ways than overall predictive accuracy, to see which is better in specific situations. For instance, we noticed that the logistic model does much better at predicting upsets, so we also wanted to see whether it was better at predicting close games, where a "close game" is defined as one with a final margin of five points or fewer. Across all our data from 2000 to 2021, the logistic model did the best job of this, predicting the winners of close games correctly 66.4% of the time, while the AIC model was correct only 59.9% of the time. In 2022, the logistic model got 55.56% of close games correct, while the AIC model got only half right. What we can conclude from this is that for games you believe will be close, you may want to pay more heed to what the logistic model is predicting. You could even use the two models in tandem: if the AIC model predicts a close game, say Team A winning by two, but the logistic model says Team B will win, it might be better to go with Team B.
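
A sketch of that close-game comparison, again on the hypothetical difference data frame with the actual margin stored in MOV:

```r
# Accuracy restricted to games decided by five points or fewer.
close <- subset(diff_data, abs(MOV) <= 5)

mean((predict(logit_model, newdata = close, type = "response") > 0.5) == close$A_Win)  # logistic
mean((predict(aic_model,   newdata = close) > 0) == close$A_Win)                       # AIC
```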

This leads us to the question you have most likely been waiting for: which model do I use to fill out my bracket? Unfortunately, the answer is not as simple as "use this model." We recommend using a combination of the models, which allows you to see whether they contradict or back each other up. You can also gain insight into which teams to look out for as potential upsets. For instance, in the 5 seed Tennessee versus 12 seed Oregon State game in 2021, which Oregon State ended up winning, the logistic model predicted Oregon State to win, while the AIC, R-squared, and no-intercept models all predicted Tennessee to win by 2, 1, and 3 points respectively. Instead of blindly trusting the logistic model, you can take those latter predictions as an indication that the game will be close; most 5 versus 12 matchups are not predicted to be within so few points. So even though those models predicted Tennessee to win, they showed it would be close, which is a good indication that you should consider choosing Oregon State.

And that brings us to a crucial point: our models are not "golden standards." You probably should not rely solely on them to make decisions for your bracket. They do well, but they should be treated as suggestions to inform your picks. Your intuition and gut feelings are still useful; our models are complementary to them. In March Madness, uncertainty is king; anything can happen, and our models are here to try to make things just a little more certain.

If you want to experiment with our models and predict March Madness games yourself, you can download a .zip file containing the Excel sheet used to do so as well as various other files such as the code used to create these models and our datasets here.


Additional Research

With more time, it may have been beneficial to test our models' predictions against more years to see how they would have done in every single year from 2000 to 2022. Doing so could potentially show that our models are better or worse overall than we thought, because even though we consider all of the individual years we picked to be "bad," they may not actually be "bad" for our models.

We also would have liked to test different types of classification models, as the one we had (the logistic model) ended up being arguably the best of our models. Other classification methods like Linear Discriminant Analysis, Naive Bayes, or k-Nearest Neighbors may prove to be even better than our logistic model, so that might be worth looking into.

Lastly, it could also be worthwhile to collect even more variables than we did. When collecting our data, we selected the variables we personally thought would be important, which means we may have left out some that do have a significant impact; expanding the number of variables per team could yield new results and potentially make our models even better.


References

Faraway, Julian J. 2014. *Linear Models with R*. Second Edition. Chapman & Hall/CRC.

 

Data collected from:

  • www.sportsreference.com

  • www.basketball.realgm.com

  • www.teamrankings.com

R packages used:

  • tidyverse

  • leaps

  • car