Something that has fascinated me for quite some time is the idea of probability and variance as they relate to sporting events. If we can generalize a sport, or any process for that matter, so that it follows a known probability distribution, we can make remarkably accurate predictions because of the level of detail that distribution gives us access to.
I’ve largely studied this concept within baseball, and if you are interested in learning more you can find that work here. Whereas making this idea work in baseball required a lot of heavy lifting, I think basketball naturally lends itself to the concept. Of the four major American sports, basketball is distinct in that scoring is a common occurrence. This is a big deal because it means the lower bound of the scoring distribution (being shut out) is so improbable that it shouldn’t skew the distribution; you can imagine how dramatically this would affect the scoring distributions in a sport like hockey or baseball.
In the spirit of the upcoming March Madness, let’s focus on college basketball rather than the NBA; the general findings from one should translate easily to the other anyway. Let’s start by taking a look at all Division I games played since 2015:
The shape of this scoring distribution certainly suggests a normal distribution, but let’s make sure and dig a little deeper.
If our data set is actually normally distributed, it should follow the empirical rule for its variance: approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three.
The exact standard deviation of our sample is 13.22 points, while the mean is 71. Using this information, we can find exactly how close this distribution is to a perfect normal. Consider the following chart:
It appears that the scoring in basketball naturally follows a normal distribution. The differences in proportions here are so small that I believe approximating NCAAB scoring to be perfectly normal is a harmless assumption.
To really drive this point home, we can check that the mean of our distribution is very close to the median and mode (for a perfect normal they would be identical). Our distribution passes this test with flying colors as well: the mean and median are both 71, while the mode is 69.
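For reference, here’s a quick sketch of how these checks might be run in R, assuming a hypothetical numeric vector `team_scores` with one entry per team per game:

```r
# Hypothetical vector of points scored, one entry per team per game
mu    <- mean(team_scores)
sigma <- sd(team_scores)

# Proportion of scores within 1, 2, and 3 standard deviations of the mean
sapply(1:3, function(k) mean(abs(team_scores - mu) <= k * sigma))

# Mean, median, and mode (base R has no built-in mode function)
mean(team_scores)
median(team_scores)
as.numeric(names(which.max(table(team_scores))))
```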
So the scoring in college basketball follows a normal distribution. Great. But how does this help us make predictions? Well, if we assume that scoring in general follows a normal distribution, it would naturally follow that the scoring distribution for a specific team would as well.
As a simplified example, let’s say we project Team A to score 75 points and Team B to score 68 (let’s also assume each team has the same standard deviation of 8). Because we can approximate scoring with a normal distribution, we can directly calculate the probability of either team winning. This can also be expressed graphically:
Here we can read off the probability of either team scoring any given number of points.
In this toy example, the probability of Team A winning would be 76.03%. As you can imagine, this is a very powerful tool for predicting the outcomes of individual games, or of something more interesting, such as an entire tournament like the March Madness bracket. All that remains before we can do that is to specify the mean and standard deviation of each team’s scoring distribution.
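For reference, here’s a minimal sketch of that calculation under the assumption of independent normal scores; the exact probability it returns depends on the standard deviation assigned to each team, so it won’t necessarily match the figure above exactly:

```r
# Team A ~ N(75, sd_a), Team B ~ N(68, sd_b), assumed independent
mu_a <- 75; mu_b <- 68
sd_a <- 8;  sd_b <- 8

# The margin (A - B) is also normal, so P(A wins) is a single pnorm() call
margin_sd <- sqrt(sd_a^2 + sd_b^2)
1 - pnorm(0, mean = mu_a - mu_b, sd = margin_sd)
```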
Centering each distribution is actually quite simple, and we’ll use the Kenpom Adjusted Ratings to develop these estimates, which have been quite reliable over the past 20+ years.
The offensive and defensive ratings Kenpom provides are more than enough to give us a good estimate for the “center” of any team’s distribution.
As an example, let’s say we were interested in a game between West Virginia and Gonzaga: According to Kenpom, West Virginia has an offensive and defensive rating of 114 and 96 respectively, while Gonzaga has ratings of 124 and 103.
These values are given per 100 possessions, so dividing them by 100 gives us a per possession estimate of each side of the ball.
We can directly calculate the value of each side of the ball relative to league average using the fact that we know teams since 2015 have averaged 1.025 points per possession.
Sticking with our example, let’s calculate the value of each team’s offense and defense on a possession level basis relative to league average:
West Virginia Offense Points per Possession = \(1.14-1.025=.115\)
West Virginia Defense Points per Possession = \(.96-1.025=-.065\)
Gonzaga Offense Points per Possession = \(1.24-1.025=.215\)
Gonzaga Defense Points per Possession = \(1.03-1.025=.005\)
Essentially, this means that Gonzaga’s offense scores .215 more points per possession than the average team, while West Virginia’s defense allows .065 fewer points per possession than league average.
Alright, simple enough. Now if we want to project how many points Gonzaga will score against West Virginia per possession, we linearly combine their offense’s value with West Virginia’s defensive value:
Gonzaga Points Per Possession vs. West Virginia: \(1.025+.215-.065=1.175\)
and we can do the same for West Virginia’s offense…
West Virginia Points Per Possession vs. Gonzaga: \(1.025+.115+.005=1.145\)
We now have per-possession estimates of each team’s production; all that remains is to multiply each number by the expected pace (number of possessions for each team) in the game. Kenpom also provides tempo numbers that streamline this step: West Virginia averages 66 possessions a game while Gonzaga averages 72.
Pace is something both teams influence, and in any given game the two teams’ possession counts are essentially one to one. The average of the two teams’ tempo values should therefore serve as a fine estimate of the expected pace for both sides; in this scenario, that is 69 possessions per team.
Thus, the expected total points per team is as follows:
Gonzaga Expected Points: \(1.175*69=81.075\)
West Virginia Expected Points: \(1.145*69=79.005\)
These estimates can now serve as the centers of their respective normal distributions.
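Putting the last few steps together, a small helper along these lines (the function and variable names are my own) reproduces both estimates from the Kenpom figures quoted above:

```r
# Kenpom ratings are per 100 possessions; league average is 1.025 points
# per possession since 2015
league_avg <- 1.025

expected_points <- function(off_rating, opp_def_rating, pace) {
  off_vs_avg <- off_rating / 100 - league_avg      # offense relative to average
  def_vs_avg <- opp_def_rating / 100 - league_avg  # opposing defense relative to average
  (league_avg + off_vs_avg + def_vs_avg) * pace
}

pace <- (66 + 72) / 2  # average of the two teams' tempos

expected_points(124, 96, pace)   # Gonzaga vs. West Virginia: 81.075
expected_points(114, 103, pace)  # West Virginia vs. Gonzaga: 79.005
```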
All that remains now is the variance. We cannot just use the standard deviation of the entire data set, 13.22, since that value pools many different teams, each with its own center; it is certainly over-inflated and incorrect. In fact, a spread that large would make it impossible for the model to project a chance of winning higher than 80%, regardless of circumstance, which we know cannot be right.
So how can we estimate this value correctly? The answer lies in the pace of each game. More possessions clearly lead to more points: it is baked into our estimates and is a general rule of basketball. If you aren’t fully sold on this idea, consider the following plot, which shows the average number of points scored given the final number of possessions a team had:
You’ll notice that the trend appears to be linear, with a slope equal to 1.025 (the exact points-per-possession figure we discussed earlier!).
Worth noting is that while any particular game obviously doesn’t have to fall between 55 and 85 possessions, those values represent the range we can reasonably project over, since each year the slowest and fastest teams play near those boundaries per 40 minutes. Even with this restriction, the range covers more than 98% of games played since 2015, which is a very large sample.
Games with possession totals outside this range have fewer observations and are inherently noisier, but we should be able to extrapolate if need be given the strength of our data within 55-85 possessions. So the number of possessions a team has certainly affects our best estimate of their final score, but could it also affect our variance? Let’s take a look at the variance in final scores given the number of possessions a team had:
Alright, so it appears the average spread of scoring increases as the number of possessions increases, which makes sense: extra possessions expand the range of values a final score can take. But there is still an issue: the standard deviation remains far too large to draw sensible results from.
I suspect this has to do with the fact that any given team doesn’t have full control over the pace of a game, which also depends on their opponent. That would inflate the observed variation, since it effectively mixes multiple variables.
Let’s go back to the graph showing the average total points given the exact number of possessions. We can actually model this relationship using another well-known distribution, the Binomial Distribution. This works through the definition of effective field goal percentage, or eFG, which lets us treat each possession like a coin flip: a team either scores 2 points or it doesn’t.
We know that teams score an average of 1.025 points per possession, so according to eFG, the probability of scoring is effectively \(\frac{1.025}{2}=.5125\), making our distribution \(Binom(n, .5125)\), where n represents the number of possessions per team.
The reason for going through this is that the Binomial Distribution converges to the Normal Distribution when \(n\) is large enough, which it will be for any realistic number of possessions. This gives us an independent estimate of the standard deviation per team, which is exactly what we’re looking for.
The formula for the standard deviation of a Binomial Distribution is relatively straightforward:
\(\sqrt{np(1-p)}\)
Using this formula, let’s get a sense of what an unbiased estimate of the standard deviation would be given the total number of possessions per team.
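As a sketch, here’s how those estimates might be computed over a realistic range of per-team possession counts (the factor of two converts “made shots” back into points, as discussed below):

```r
# Binomial standard deviation of a team's score, in points:
# each "success" is worth 2 points under the eFG framing, hence the doubling
binom_points_sd <- function(n, p = 0.5125) 2 * sqrt(n * p * (1 - p))

# Estimates across a realistic range of per-team possession counts
possessions <- seq(55, 85, by = 5)
data.frame(possessions, sd_points = round(binom_points_sd(possessions), 2))
```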
These are more sensible, and there is theoretical backing to suggest that these should be effective estimates.
Going a little further, it may be worth checking whether these estimates produce values close to those observed once they are convolved. Theoretically speaking, if a given game’s variance is driven by both teams, we can convolve our Binomial approximations and see whether the resulting standard deviation is close to the observed value.
Consider the following game, where each team plays 70 possessions, which is around league average. We can then derive the full game distribution using the Binomial as follows:
\(Binom(70, .5125)+Binom(70, .5125)=Binom(140, .5125)\)
and the standard deviation:
\(\sqrt{140*.5125*(1-.5125)}*2=11.83\).
Remember, eFG calculates the probability of scoring 2 points, so the standard deviation is doubled.
11.83 is not terribly far off from our observed value of 13.22, and the difference likely indicates that the spread of an individual game is more complex than a simple convolution.
That being said, this difference matters less at the level we will be using it, since we won’t need to convolve and the standard deviation only differs slightly with our adjusted pace.
We now have estimates for the center and spread of any potential match-up just using some probability and Kenpom estimates.
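To make that concrete, here’s a rough end-to-end sketch; the function name is mine, and combining the two team distributions as independent normals with the Binomial spread is my own reading of the approach:

```r
# Probability that team A beats team B, given Kenpom ratings and tempos
win_probability <- function(off_a, def_a, tempo_a,
                            off_b, def_b, tempo_b,
                            league_avg = 1.025, p = 0.5125) {
  pace <- (tempo_a + tempo_b) / 2
  mu_a <- (league_avg + (off_a / 100 - league_avg) + (def_b / 100 - league_avg)) * pace
  mu_b <- (league_avg + (off_b / 100 - league_avg) + (def_a / 100 - league_avg)) * pace
  sigma <- 2 * sqrt(pace * p * (1 - p))  # per-team spread from the Binomial
  margin_sd <- sqrt(2) * sigma           # treating the two scores as independent
  1 - pnorm(0, mean = mu_a - mu_b, sd = margin_sd)
}

# Gonzaga (offense 124, defense 103, tempo 72) vs.
# West Virginia (offense 114, defense 96, tempo 66)
win_probability(124, 103, 72, 114, 96, 66)
```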
We can get a general sense of our model’s accuracy by comparing it against the Vegas point spread, which is generally regarded as the most reliable predictor of the outcome of individual games. If you are unfamiliar with the Vegas spread system, you can read up on it here.
If our model is reflective of reality, the implied probability of an 8-point favorite winning in our model should be in the same ballpark as what Vegas values an 8-point favorite to be. The following graph shows how the probability of a favorite winning changes with the spread, according to odds-makers’ historical data:
Let’s now compare this against our own model. Remember, the exact probability of victory differs slightly given how we estimate the pace of the game. Therefore, we’ll approach this by looking at the range of possible probabilities given all reasonable pace inputs. That allows us to produce the following plot:
The error bars here represent the range of probabilities our model could produce, while the points are still the Vegas spread estimates. You’ll notice that the disagreement is largest around a spread of 11.5, which makes sense, since the standard deviation is less influential in games projected to be very close or very lopsided.
Now I’m not going to claim that this model is a better predictor than the sophisticated models employed by sportsbooks, but it is remarkable how well such a naive model can do with just a few probabilistic assumptions. I find that amazing in its own right, but now let’s get to the fun stuff: predicting and simulating entire tournaments, like the upcoming major conference tourneys (I am writing this less than 24 hours before they begin, making it the perfect time for some predictions!). Let’s begin with the Big Ten:
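Before getting to the results, here is a minimal sketch of how a single game inside such a simulation might be resolved (the bracket logic with byes and re-seeding sits on top of this and isn’t shown; the means and spread below are simply the Gonzaga/West Virginia figures from earlier):

```r
# Simulate one game by drawing each team's score from its normal distribution
simulate_game <- function(mu_a, mu_b, sigma) {
  score_a <- rnorm(1, mu_a, sigma)
  score_b <- rnorm(1, mu_b, sigma)
  if (score_a > score_b) "A" else "B"
}

# Repeating this many times recovers the advancement probabilities
set.seed(1)
sims <- replicate(100000, simulate_game(81.075, 79.005, 8.3))
mean(sims == "A")
```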
After running a simulation of 100,000 tourneys, let’s see what the results were…
## # A tibble: 14 × 7
## Team Seed Rnd2 Quarters Semis Final Champ
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Purdue 1 1 1 0.697 0.473 0.317
## 2 Indiana 3 1 1 0.515 0.279 0.121
## 3 Michigan St. 4 1 1 0.563 0.228 0.118
## 4 Maryland 6 1 0.777 0.422 0.245 0.117
## 5 Northwestern 2 1 1 0.483 0.216 0.0845
## 6 Illinois 7 1 0.554 0.300 0.145 0.0619
## 7 Iowa 5 1 0.577 0.271 0.101 0.0492
## 8 Penn St. 10 1 0.446 0.217 0.0943 0.0357
## 9 Michigan 8 1 0.506 0.154 0.0748 0.0356
## 10 Rutgers 9 1 0.494 0.149 0.0712 0.0332
## 11 Ohio St. 13 0.521 0.226 0.0914 0.0289 0.0121
## 12 Wisconsin 12 0.479 0.197 0.0752 0.0226 0.00902
## 13 Nebraska 11 0.747 0.198 0.0597 0.0197 0.00491
## 14 Minnesota 14 0.253 0.0250 0.00304 0.000434 0.0000329
The probabilities you see here represent the probability of each team reaching a specific round. For instance, the chances of Purdue reaching the second round are 100% since they have a bye. Something interesting here is that our model gives Maryland a better chance of winning the tournament than Northwestern, despite Northwestern having an additional bye.
Very cool! How about we now take a look at the Big 12, which has been by far the most competitive conference in college hoops this year:
## # A tibble: 10 × 6
## Seed Team Quarters Semis Final Champ
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2 Texas 1 0.687 0.419 0.234
## 2 1 Kansas 1 0.628 0.364 0.207
## 3 4 Baylor 1 0.543 0.264 0.136
## 4 3 Kansas St. 1 0.538 0.250 0.116
## 5 5 Iowa St. 1 0.457 0.201 0.0938
## 6 6 TCU 1 0.462 0.200 0.0864
## 7 8 West Virginia 0.632 0.263 0.131 0.0641
## 8 7 Oklahoma St. 0.526 0.170 0.0731 0.0278
## 9 10 Oklahoma 0.474 0.143 0.0576 0.0206
## 10 9 Texas Tech 0.368 0.108 0.0399 0.0143
Despite Kansas being the better team on paper, our model pins Texas as the favorite since their theoretical path to the final is so much easier than Kansas’. These kinds of nuances are what make these analyses important, and they can help us predict which teams are best poised to make deep runs late in March.
I for one look forward to applying this methodology to the full bracket after Selection Sunday to better gauge the field. I also hope you leave this study with an understanding of how powerful a probability distribution can be in predictive analytics.
If you are interested in the projections for the remaining major conference tournaments (ACC, SEC, Big East, and Pac-12, in that order), they are listed below.
## # A tibble: 15 × 7
## Seed Team Round2 Quarters Semis Finals Champ
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 Duke 1 1 0.650 0.379 0.214
## 2 1 Miami FL 1 1 0.693 0.356 0.191
## 3 2 Virginia 1 1 0.574 0.327 0.171
## 4 7 North Carolina 1 0.840 0.399 0.218 0.109
## 5 3 Clemson 1 1 0.499 0.222 0.0997
## 6 6 N.C. State 1 0.616 0.330 0.157 0.0755
## 7 5 Pittsburgh 1 0.783 0.316 0.155 0.0724
## 8 11 Virginia Tech 0.746 0.331 0.158 0.0671 0.0288
## 9 9 Wake Forest 1 0.593 0.202 0.0725 0.0273
## 10 8 Syracuse 1 0.407 0.106 0.0297 0.00876
## 11 13 Georgia Tech 0.570 0.136 0.0233 0.00545 0.00113
## 12 10 Boston College 0.701 0.135 0.0247 0.00565 0.00110
## 13 14 Notre Dame 0.254 0.0531 0.0126 0.00249 0.000536
## 14 12 Florida St. 0.430 0.0805 0.0104 0.00186 0.000294
## 15 15 Louisville 0.299 0.0244 0.00190 0.000189 0.0000173
## # A tibble: 14 × 7
## Seed Team Round2 Quarters Semis Finals Champ
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Alabama 1 1 0.763 0.457 0.315
## 2 5 Tennessee 1 0.902 0.689 0.378 0.255
## 3 3 Kentucky 1 1 0.725 0.366 0.139
## 4 2 Texas A&M 1 1 0.497 0.273 0.102
## 5 10 Arkansas 1 0.532 0.276 0.157 0.0608
## 6 7 Auburn 1 0.468 0.227 0.122 0.0436
## 7 4 Missouri 1 1 0.282 0.0791 0.0314
## 8 9 Mississippi St. 1 0.545 0.139 0.0498 0.0218
## 9 6 Vanderbilt 1 0.696 0.225 0.0731 0.0164
## 10 8 Florida 1 0.455 0.0978 0.0311 0.0122
## 11 13 Mississippi 0.685 0.0824 0.0265 0.00416 0.00102
## 12 11 Georgia 0.508 0.155 0.0259 0.00437 0.000484
## 13 14 LSU 0.492 0.149 0.0246 0.00406 0.000431
## 14 12 South Carolina 0.315 0.0156 0.00267 0.000189 0.0000227
## # A tibble: 11 × 6
## Seed Team Quarters Semis Finals Champ
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4 Connecticut 1 0.735 0.480 0.311
## 2 3 Creighton 1 0.739 0.444 0.216
## 3 1 Marquette 1 0.796 0.369 0.204
## 4 2 Xavier 1 0.709 0.359 0.158
## 5 5 Providence 1 0.265 0.111 0.0450
## 6 6 Villanova 0.821 0.248 0.0987 0.0291
## 7 7 Seton Hall 0.739 0.253 0.0904 0.0264
## 8 8 St. John's 0.610 0.141 0.0303 0.00837
## 9 9 Butler 0.390 0.0634 0.00962 0.00190
## 10 10 DePaul 0.261 0.0379 0.00607 0.000708
## 11 11 Georgetown 0.179 0.0130 0.00157 0.000113
## # A tibble: 12 × 6
## Seed Team Quarters Semis Finals Champ
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 UCLA 1 0.830 0.643 0.452
## 2 2 Arizona 1 0.738 0.492 0.238
## 3 3 USC 1 0.631 0.271 0.0980
## 4 4 Oregon 1 0.582 0.173 0.0752
## 5 5 Washington St. 0.885 0.407 0.108 0.0425
## 6 6 Arizona St. 0.820 0.345 0.122 0.0361
## 7 9 Colorado 0.655 0.133 0.0631 0.0245
## 8 7 Utah 0.580 0.168 0.0767 0.0224
## 9 10 Stanford 0.420 0.0941 0.0354 0.00813
## 10 8 Washington 0.345 0.0373 0.0121 0.00314
## 11 11 Oregon St. 0.180 0.0240 0.00251 0.000182
## 12 12 California 0.115 0.0108 0.000437 0.0000269