Every March, 64 teams get the opportunity to play in the largest college basketball tournament ever. Over the span of three weeks, these 64 teams meet on the basketball floor and play in order to survive and advance. In the end, only one team is crowned the National Champion. This past year it was Connecticut, beating San Diego State, 76-59. The year prior, Kansas was victorious over the University of North Carolina, 72-69.
On top of the yearly competition, individuals from around the world compete in bracket pools, in which they attempt to make the ‘perfect’ bracket. But the odds of accomplishing this task are 1 in 9.2 quintillion if you flip a coin for each game and 1 in 120.2 billion if you know a little bit about basketball. Either way, these are astronomical odds and no one has ever correctly predicted all of the games.
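The coin-flip figure above follows directly from the bracket size: a 64-team single-elimination bracket has 63 games, so there are 2^63 possible brackets. The short sketch below checks that arithmetic. It also backs out the per-game pick accuracy implied by the 1 in 120.2 billion figure, under the simplifying assumption (mine, not the source's) that every game is picked independently with the same accuracy. The function name `implied_pick_accuracy` is illustrative.

```python
# A 64-team single-elimination bracket has 63 games (each game eliminates one team).
GAMES = 64 - 1

# If every game is a fair coin flip, each of the 2**63 brackets is equally likely.
coin_flip_brackets = 2 ** GAMES


def implied_pick_accuracy(one_in: float, games: int = GAMES) -> float:
    """Per-game accuracy p such that p**games == 1 / one_in.

    Assumes every game is picked independently with the same accuracy,
    which is a simplification made for this illustration.
    """
    return (1.0 / one_in) ** (1.0 / games)


print(f"{coin_flip_brackets:,}")              # about 9.2 quintillion
print(round(implied_pick_accuracy(120.2e9), 3))
```

Under that assumption, the 120.2 billion figure corresponds to picking each game correctly roughly two-thirds of the time.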
For the remainder of this report, I explore some statistical differences within data collected surrounding the NCAA tournament. I want to know whether there are any trends that help a team make the NCAA tournament.
Before we step into the analysis, it is important to explore some of the statistical data that has been captured and will be used throughout this report.
This report uses data captured from Sports Reference and Reddit.
Sports Reference is a website whose mission is to “democratize data, so our users enjoy, understand, and share the sports they love”. It houses all kinds of statistical information, some of which is detailed below. Each link is for the 2023 stats, but the data goes back further.
For a complete data dictionary, click here.
Important Note: I have removed ALL data from 2020 as the NCAA tournament did not happen.
Lastly, let’s talk about a couple of quick summary statistics that I noticed throughout this data. Following is a table of these summary statistics from 2000 to 2023.

| Statistic | Value |
|---|---|
| Average Points Per Game | 70.36828 |
| Average Points Per Game Against | 69.35283 |
| Average Turnover Percentage | 17.25193 |
| Average Pace | 68.14219 |

We can see that across the 23 seasons in the data, teams scored, on average, 70.4 points per game while allowing 69.4. Further, 17.25% of their possessions ended in turnovers. Finally, using average pace, we see that teams had approximately 68.1 possessions per game.
The following analysis is not a complete list. These are factors that I was interested in exploring as they relate to the NCAA tournament. They focus more on style of play and advanced statistics than on base statistics such as total points and total rebounds. Base statistics are informative, but they lose some of their appeal when some teams play more games than others. In addition, a lot of the ‘basic’ stats can be seen in the ‘eye-test’, and I wanted to challenge myself to look even further.
The first thing I wanted to explore was related to margin of victory.
Are teams that are making the tournament dramatically outscoring their
opponents or are they in tight, competitive games?
Looking at the comparison, we see that teams who make the tournament are
typically winning by a larger margin than those that do not. Teams that
do not make the tournament are in tighter matches, with the 75th
percentile winning games by roughly 4 points on average. Among the teams
that do make the tournament, there are some outliers with a negative
margin of victory, but on average, teams who make the tournament are
winning by over 9 points per game.
The Simple Rating System (SRS) is a way to compare teams based on strength of schedule and point differential. While I was unable to find the calculation, I was able to find the following on Sports Reference as an example of how it can be used:
If Team A’s rating is 3 bigger than Team B’s, this means that the system thinks Team A is 3 points better than Team B. (Sports Reference)
For college basketball, it only counts games against major opponents.
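Sports Reference's exact SRS computation is not reproduced in this report. One commonly described formulation sets each team's rating equal to its average point margin plus the average rating of its opponents. The sketch below solves that formulation by damped iteration on a made-up three-game schedule. The function name `simple_ratings`, the team labels, and the scores are all hypothetical, and the sketch ignores the "major opponents only" rule mentioned above.

```python
from collections import defaultdict


def simple_ratings(games, iterations=200):
    """Sketch of an SRS-style rating: average margin + average opponent rating.

    games: list of (team, opponent, team_points, opponent_points).
    This is an assumed formulation, not Sports Reference's published code.
    """
    margins = defaultdict(list)
    opponents = defaultdict(list)
    for team, opp, pts, opp_pts in games:
        margins[team].append(pts - opp_pts)
        margins[opp].append(opp_pts - pts)
        opponents[team].append(opp)
        opponents[opp].append(team)

    avg_margin = {t: sum(m) / len(m) for t, m in margins.items()}
    ratings = dict.fromkeys(avg_margin, 0.0)

    for _ in range(iterations):
        # Target rating: own average margin plus mean rating of opponents faced.
        target = {
            t: avg_margin[t] + sum(ratings[o] for o in opponents[t]) / len(opponents[t])
            for t in ratings
        }
        # Move halfway toward the target each step to avoid oscillation.
        ratings = {t: 0.5 * ratings[t] + 0.5 * target[t] for t in ratings}

    # Ratings are only defined up to an additive constant, so center them on zero.
    mean = sum(ratings.values()) / len(ratings)
    return {t: r - mean for t, r in ratings.items()}


# Hypothetical schedule: A beats B by 10, B beats C by 5, A beats C by 15.
toy_games = [("A", "B", 80, 70), ("B", "C", 65, 60), ("A", "C", 75, 60)]
toy_ratings = simple_ratings(toy_games)
print({t: round(r, 2) for t, r in toy_ratings.items()})
```

In this toy schedule the rating gap between A and B comes out to 10 points, which matches the interpretation quoted above: a team rated 10 higher is judged to be 10 points better.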
Compared to the gap we saw in Margin of Victory, there is a much larger
disparity in the Simple Rating System between teams that made the
tournament and those that did not. Over half of the teams that made the
tournament had a Simple Rating System score of 12 or more, while those
that did not typically had a negative score.
Again, we see some outliers among the teams that did not make the tournament. A few schools had a Simple Rating System score of more than 20 and still did not make the tournament. That could be a result of the level of competition in that specific year.
Like I mentioned earlier, I was more interested in diving into the statistics that are beyond just the eye test, and for the remainder of this section, that is exactly what we will do.
Strength of Schedule is a measure of how hard a school’s schedule is.
In other words, it measures how difficult a school’s opponents are
when they play them.
Looking at the distribution of Strength of Schedule, we are able to see
that teams that typically make the tournament have a higher strength of
schedule, but not always. We see a couple of schools with an
extremely low strength of schedule that made the tournament.
Upon further research, these outliers are Alcorn State, Alabama State, and Alabama A&M. Interestingly enough, all of these teams are members of the Southwestern Athletic Conference and made the tournament in 2002, 2001, and 2005 respectively. They each made the tournament by winning the conference. Since this is not a Power 5 or Mid Major conference, my assumption is that they do not get the attention they deserve, nor do they face the level of competition needed for more than just the conference winner to make the tournament.
A team’s turnover percentage describes how many times a team turns over the basketball per 100 plays. Typically, teams want their own number to be quite low, meaning they are controlling the ball, while simultaneously forcing the opponent to have a high turnover percentage.
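To make "turnovers per 100 plays" concrete, here is a minimal sketch using the widely used estimate in which a "play" is a field goal attempt, a weighted free throw attempt, or a turnover. The free throw weight of 0.44 is an assumption on my part (it is the value commonly quoted for this estimate and may differ from what Sports Reference uses for college data), and the box score numbers are made up.

```python
def turnover_percentage(tov: float, fga: float, fta: float, ft_weight: float = 0.44) -> float:
    """Estimated turnovers per 100 plays.

    plays = FGA + ft_weight * FTA + TOV
    ft_weight=0.44 is an assumed default; adjust it to match your data source.
    """
    plays = fga + ft_weight * fta + tov
    return 100.0 * tov / plays


# Hypothetical single-game box score: 12 turnovers, 55 shot attempts, 20 free throws.
example_tov_pct = turnover_percentage(tov=12, fga=55, fta=20)
print(round(example_tov_pct, 2))
```

With these made-up inputs the team ends just under 16% of its plays with a turnover, which sits in the same range as the averages discussed in this report.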
I wanted to explore and see if there was a correlation between these
two categories.
Comparing both facets, there is not a significant difference between a
school’s own turnover percentage compared to the opponent’s turnover
percentage. Across all instances within the data, both the school
turnover percentage and opponent turnover percentage are centered around
16-18%.
Nearing the end of the statistical analysis, I wanted to explore
Offensive Rating. Offensive Rating is an estimate of points scored per 100
possessions by either a player or a team.
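As a quick illustration of the definition, the sketch below computes points per 100 possessions. Plugging in the league-wide averages from the summary table (points per game and pace) gives only a rough figure, since it combines two separate averages rather than per-team possession counts. The function name is illustrative.

```python
def offensive_rating(points: float, possessions: float) -> float:
    """Points scored per 100 possessions."""
    return 100.0 * points / possessions


# Rough illustration using the averages from the summary table:
# 70.36828 points per game and 68.14219 possessions per game (pace).
rough_rating = offensive_rating(70.36828, 68.14219)
print(round(rough_rating, 1))
```

An average team in this data therefore scores a little over one point per possession, which gives a baseline for reading the Offensive Rating comparison.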
Unsurprisingly, schools that made the tournament typically had a higher
overall Offensive Rating than those that did not make the tournament.
This implies that they are simply scoring more efficiently on each possession.
However, looking at the Opponent Offensive Rating, both clusters are still relatively the same. There is not as much of a difference in Opponent Offensive Rating as there is in a school’s own Offensive Rating. This tells me that schools that make the tournament typically win by focusing on scoring more points, rather than by significantly limiting their opponents’ points.
Up to this point, we have explored both offensive and defensive
styles of play. In my mind, there is one more aspect of the game that
should be explored: how aggressive a team is. Personal fouls are either
a measure of how sloppy a team plays the game or how aggressive they are
being when playing.
Based on this chart, we are able to see that teams that make the
tournament typically play with fewer fouls per game. This could be due to
a more polished style of play, or to having more skill to avoid fouls
and make clean plays. There is still a large amount of overlap between
the two groups, as around 50% of all teams fall within 17-20 fouls
per game.
On top of exploring statistical commonalities and differences between teams that make the NCAA tournament, I wanted to explore their respective fan reactions.
Originally, the plan was to use Twitter’s API to pull the data down, but as that is no longer possible due to the restrictions of the API, I had to pivot to another source: Reddit (More specifically, the subreddit: r/CollegeBasketball).
Due to my own familiarity, I decided to focus only on schools in the Big East Conference, and then add in Purdue and FDU, since their matchup was one of the largest upsets in recent history. Further, I am using data from after March 1st, 2023 in order to capture just Selection Sunday and the rest of the tournament.
Lastly, in order to find the overall sentiment, I am using the NRC lexicon.
First, I wanted to see if there was any meaningful difference in the
total volume by season result.
Unsurprisingly, teams that made the tournament had a higher overall
total number of words in each sentiment category. They were actively
playing over this time period and individuals were engaging with each
other over their teams, thus leading to a higher number of words being
used.
While volume is interesting, I think it is also important to examine
the sentiment of the fans using a relative word count. This would help
us see the frequency of each sentiment category in proportion to the
total volume.
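The difference between raw volume and relative word count can be sketched as follows. The tiny word-to-category mapping is a made-up stand-in for the NRC lexicon, not the real lexicon, and the comments are invented. The point is only the mechanics: count the categories each word maps to, then divide by the total so groups with different comment volumes can be compared.

```python
import re
from collections import Counter

# Toy stand-in for the NRC lexicon (word -> sentiment categories). NOT the real lexicon.
TOY_LEXICON = {
    "win": ["positive", "joy"],
    "great": ["positive", "trust"],
    "coach": ["trust"],
    "lose": ["negative", "sadness"],
    "awful": ["negative", "disgust"],
}


def sentiment_counts(comments):
    """Raw number of word hits per sentiment category (the 'volume' view)."""
    counts = Counter()
    for comment in comments:
        for word in re.findall(r"[a-z']+", comment.lower()):
            counts.update(TOY_LEXICON.get(word, []))
    return counts


def relative_share(counts):
    """Each category's share of all sentiment hits (the 'relative' view)."""
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


# Invented comments, for illustration only.
toy_comments = ["Great win, great coach", "Awful way to lose"]
toy_counts = sentiment_counts(toy_comments)
toy_shares = relative_share(toy_counts)
print(toy_counts.most_common(3))
print({cat: round(share, 2) for cat, share in toy_shares.items()})
```

Because a single word can belong to several categories, the shares are proportions of sentiment hits rather than proportions of words.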
We can see that the trends are broadly similar. Positivity is
the highest, with trust and negative rounding out the top 3 overall
sentiments. Interestingly however, teams that did make the tournament
have a slightly higher positive and trusting sentiment towards their
teams. This could be due to a number of reasons including:
Quickly, out of my own curiosity, I wanted to analyze the overall
sentiment following the historic Purdue-Fairleigh Dickinson (FDU)
upset.
In all honesty, I was surprised by the similarities between these two
teams. Like the overall sentiment analysis performed above, both teams
had ‘positive’, ‘negative’ and ‘trust’ as their top three
sentiments. FDU’s positive usage was around 2% higher than Purdue’s,
while Purdue’s negative usage was around 2% higher than FDU’s.
Through this report, we have examined a few key pieces of what helps a team make the NCAA Men’s Basketball Tournament and participate in March Madness. We have seen how teams that make the tournament have, on average, a:
Finally, using the r/CollegeBasketball subreddit, we explored fans’ sentiment as it relates to teams in the Big East and to Purdue vs FDU. We found that teams that make the tournament have more engagement on the platform, but there is little difference in the relative word-use percentage for any given sentiment category.
Overall, there are a lot of factors that play into a team making the tournament. Some of these factors are tangible and can be found within the statistical analysis, but others are intangible and are merely based on the ‘eye test’. Either way, it is important to look at the advanced statistics to understand not only whether your team will make the tournament, but also whether they have a shot at winning the national championship.