Note: It’s unclear whether the KenPom data is calculated before or after the NCAA tournament (though it seems like it may be after). For the purpose of this exercise, I assume it is pre-March Madness data. However, given the uncertainty, please treat this more as a fun coding exercise than as a source of conclusions (at least for arguments about KenPom’s predictive power).
As March Madness begins this week, I took some time to explore historical KenPom data, pulled with the CBBData R package, and how teams have fared in the tourney based on their KenPom ratings. My analysis takes every tourney team from 2001-2023 and labels it “underrated”, “rated”, or “overrated” relative to KenPom.

My methodology creates hypothetical tournament seeds from KenPom rankings (i.e. KenPom’s top four teams would be 1 seeds, the #5 team would be a 2 seed, etc.) and compares them to the committee’s seeding. A team the committee seeded worse than its KenPom seed is underrated, a team seeded better is overrated, and a team whose two seeds match is rated. For example, Auburn is a 4 seed in this year’s tournament, but they are KenPom’s 4th overall ranked team (equivalent to the last #1 seed), so we call them underrated. This is an imperfect methodology, but I resorted to it given limited time and data. First, I show how many teams fall under each category in each year of our period:
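The original analysis used the CBBData R package; the Python below is only a minimal sketch of the labeling and counting logic, not the code behind the chart. It assumes the KenPom seed is derived from a team's overall KenPom rank in blocks of four and is left uncapped, so a team ranked far outside the top 64 simply gets a seed number above 16. The function names and every row except Auburn's are hypothetical.

```python
from collections import Counter


def kenpom_seed(overall_rank: int) -> int:
    """Map an overall KenPom rank (1 = best) to a hypothetical seed line.

    Assumption: four teams per seed line, so ranks 1-4 -> 1 seed, 5-8 -> 2 seed.
    The value is not capped at 16.
    """
    return (overall_rank - 1) // 4 + 1


def classify(committee_seed: int, kp_seed: int) -> str:
    """Label a team by comparing the committee's seed to its KenPom seed."""
    if kp_seed < committee_seed:
        return "underrated"  # KenPom likes the team more than the committee did
    if kp_seed > committee_seed:
        return "overrated"   # the committee seeded the team better than KenPom would
    return "rated"


# (year, team, committee_seed, kenpom_overall_rank)
# Only the Auburn row comes from the post; the rest are made-up placeholders.
field = [
    (2024, "Auburn", 4, 4),
    (2024, "Hypothetical A", 2, 6),
    (2024, "Hypothetical B", 16, 250),
    (2023, "Hypothetical C", 3, 20),
]

# Count how many teams fall into each category in each year.
counts = Counter(
    (year, classify(seed, kenpom_seed(rank))) for year, _team, seed, rank in field
)

for (year, category), n in sorted(counts.items()):
    print(year, category, n)
```

Whether to cap the KenPom seed at 16 is a design choice: leaving it uncapped makes any 16 seed ranked outside the top 64 count as overrated, while capping would label it rated.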
There are far more overrated teams than rated or underrated ones, but we don’t see any general trends over the years. This is slightly interesting, as I would expect the number of ‘rated’ teams to increase: KenPom has become a very popular metric, and you would think the committee takes it into account when seeding teams (keeping in mind, of course, that some teams are automatic qualifiers that the committee has to include).
My next analyses look at the success of teams according to their rating. Many have already shown the power of KenPom ratings to predict the tournament, but I’ll show a couple of hopefully novel illustrations. First, we see the average wins of teams based on their NCAA seed and their KenPom rating:
For each seed, overrated teams consistently win fewer games on average, and underrated teams win more. Interestingly, for higher seeds, the greatest effect seems to be the boost from being underrated, while for lower seeds, the greatest effect is the penalty from being overrated. Again, keep in mind that many of those lower seeds are automatic qualifiers, so this is not much of a surprise. Still, this gives some credence to choosing underrated top-seeded teams to go deep in the tournament. The next graph shows the same idea, but instead of average wins, we look at average win percentage in the first round:
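Both summaries come down to grouping teams by seed and category. A minimal Python sketch of that grouping is below; it is an illustration, not the original R code. The rows are made-up numbers, and it assumes `wins` counts only games from the round of 64 onward, so a team with at least one win is treated as having won its first-round game.

```python
from collections import defaultdict

# Made-up rows for illustration only: (committee_seed, category, tournament_wins)
rows = [
    (1, "underrated", 4),
    (1, "underrated", 6),
    (1, "overrated", 2),
    (12, "overrated", 0),
    (12, "underrated", 1),
    (12, "underrated", 0),
]


def summarize(rows):
    """Return average wins and first-round win rate per (seed, category)."""
    groups = defaultdict(list)
    for seed, category, wins in rows:
        groups[(seed, category)].append(wins)

    summary = {}
    for key, wins_list in groups.items():
        n = len(wins_list)
        summary[key] = {
            "avg_wins": sum(wins_list) / n,
            # Assumption: wins >= 1 means the team won its first-round game.
            "first_round_win_pct": sum(w >= 1 for w in wins_list) / n,
        }
    return summary


summary = summarize(rows)
for key in sorted(summary):
    print(key, summary[key])
```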
Another way I look at overperformance vs. underperformance is by calculating the average wins for each seed and then subtracting that average from each team’s actual wins, resulting in ‘wins above average’ for a given seeded team.
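As a concrete illustration, here is a minimal Python sketch of that calculation (not the original R code). The team labels and win totals are made up, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def wins_above_average(rows):
    """rows: list of (label, seed, wins).

    Returns a list of (label, wins minus the mean wins of all teams with that seed).
    """
    wins_by_seed = defaultdict(list)
    for _label, seed, wins in rows:
        wins_by_seed[seed].append(wins)

    seed_avg = {seed: mean(w) for seed, w in wins_by_seed.items()}
    return [(label, wins - seed_avg[seed]) for label, seed, wins in rows]


# Made-up rows for illustration only: (label, committee_seed, tournament_wins)
rows = [
    ("Team A", 1, 4),
    ("Team B", 1, 2),
    ("Team C", 8, 1),
    ("Team D", 8, 0),
]

result = wins_above_average(rows)
for label, waa in result:
    print(label, waa)
```

A positive value means the team won more games than the typical team on its seed line, and a negative value means it won fewer.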
This boxplot shows the distribution of each team’s wins above average from 2001-2023. About 75% of overrated teams win fewer games than the average for their seed. This is quite a remarkable stat, given that comparing within each seed already accounts for the low-quality automatic qualifiers receiving worse seeds. The data in this group is also negatively skewed. There is more variation among rated and underrated teams, but the medians are actually fairly close across all three groups. Essentially, while the median teams perform relatively similarly across groups, there is more positive variance for rated and underrated teams. That is, overrated teams rarely overperform the average for their seed, while the 75th percentile underrated team wins about 1 game more than its seed average.
Next, I look at KenPom’s offensive and defensive rankings and how they correlate with wins.
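Because these are rankings rather than raw efficiency numbers, one reasonable way to measure the relationship is a rank (Spearman) correlation. The Python sketch below is only an illustration of that idea; the post does not say which correlation measure was used, and the sample numbers are made up.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up numbers for illustration only.
offense_rank = [1, 5, 20, 60, 150]   # lower rank = better offense
wins = [5, 3, 2, 1, 0]

rho = spearman(offense_rank, wins)
print(round(rho, 3))
```

Since a lower rank number means a better unit, a negative correlation here means better-ranked offenses (or defenses) tend to win more games.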
Finally, I present the 2024 tourney teams’ KenPom seeds and actual seeds. Note that teams above the dashed line are ‘overrated’ and those below the line are ‘underrated.’ I chose to show the wordmarks because I simply do not know all the logos off the top of my head.