The NCAA Division 1 Basketball Tournament, more commonly known as March Madness, is one of the most unpredictable sporting events. Upsets and Cinderella stories captivate fans every year. While top-seeded teams and “Power-5” Conferences often perform well, lower-seeded and lesser-known teams frequently defy expectations, making it challenging to predict tournament success. This analysis explores key statistical factors—such as offensive efficiency, three point percentage, regular season win percentage, and net rating—that may influence how far a team advances in the NCAA tournament. By analyzing historical data from past tournaments, we aim to identify metrics that provide a more accurate picture of a team’s potential for a deep run. Understanding these factors can offer valuable insights for analysts, fans, and even bracket enthusiasts looking for an edge each year when they fill out their brackets.
This project utilizes Python to explore and visualize College Basketball and March Madness Data using a variety of metrics. I am especially interested in the factors that are associated with a deep tournament run, and how those factors differ based on each season.
The dataset was found on a Kaggle page titled “college basketball march madness data”. The data originates from another Kaggle dataset titled “college-basketball-dataset” and was updated with data from “https://barttorvik.com/”. The resulting dataset is a CSV file named “alldataclean.csv”. The dataset covers the 2013-2022 seasons, with no data for 2020 because that year's March Madness tournament was cancelled. The dataset includes variables such as team name, season/year, conference, March Madness seed, games played, wins, offensive three point percentage, and adjusted offensive efficiency. The original dataset includes every Division 1 basketball team, including teams that did not qualify for March Madness. The filtered dataset that I use for the majority of my analysis includes only the teams that qualified for March Madness from 2013 to 2022. Each qualifier in a given year is its own observation, so a college such as Duke that qualified in multiple years has multiple observations.
# Import Libraries
import matplotlib.pyplot as plt
import warnings
import pandas as pd
import numpy as np
import seaborn as sns
warnings.filterwarnings("ignore")
path = "C:/Users/Admin/Downloads/IS460/Python/Datafiles/"
filename = "alldataclean.csv"
# get rid of columns we are not interested in
df = pd.read_csv(path+filename, usecols = ['YEAR','TEAM','CONF','G','W','SEED','POSTSEASON','3P_O','ADJOE', 'ADJDE','EFG_O','ORB','ADJ_T','FTR','BARTHAG','TOR'])
pd.unique(df['POSTSEASON'])
## array(['S16', 'E8', 'Champions', 'R32', 'F4', 'R64', '2ND', nan, 'R68'],
## dtype=object)
pd.unique(df['SEED'])
## array([ 1., 5., 3., 2., 4., 6., 8., 9., 11., 10., 12., nan, 7.,
## 13., 15., 14., 16.])
# there should be fewer non-null values for SEED and POSTSEASON, as only 68 teams per year qualify for the March Madness tournament
df.notna().sum()
## TEAM 3160
## CONF 3160
## G 3160
## W 3160
## ADJOE 3160
## ADJDE 3160
## BARTHAG 3160
## EFG_O 3160
## TOR 3160
## ORB 3160
## FTR 3160
## 3P_O 3160
## ADJ_T 3160
## POSTSEASON 612
## SEED 612
## YEAR 3160
## dtype: int64
df
## TEAM CONF G W ... ADJ_T POSTSEASON SEED YEAR
## 0 Gonzaga WCC 32 28 ... 72.6 S16 1.0 2022
## 1 Houston Amer 38 32 ... 63.7 E8 5.0 2022
## 2 Kansas B12 40 34 ... 69.1 Champions 1.0 2022
## 3 Texas Tech B12 37 27 ... 66.3 S16 3.0 2022
## 4 Baylor B12 34 27 ... 67.6 R32 1.0 2022
## ... ... ... .. .. ... ... ... ... ...
## 3155 Michigan St. B10 35 26 ... 64.4 S16 3.0 2013
## 3156 Arizona P12 35 27 ... 66.8 S16 6.0 2013
## 3157 Oregon P12 37 28 ... 69.2 S16 12.0 2013
## 3158 La Salle A10 34 24 ... 66.0 S16 13.0 2013
## 3159 Florida Gulf Coast ASun 35 24 ... 69.1 S16 15.0 2013
##
## [3160 rows x 16 columns]
df.describe()
## G W ... SEED YEAR
## count 3160.000000 3160.000000 ... 612.000000 3160.000000
## mean 30.427848 15.889557 ... 8.802288 2017.234494
## std 4.009704 6.616489 ... 4.674526 2.899457
## min 4.000000 0.000000 ... 1.000000 2013.000000
## 25% 29.000000 11.000000 ... 5.000000 2015.000000
## 50% 31.000000 15.500000 ... 9.000000 2017.000000
## 75% 33.000000 21.000000 ... 13.000000 2019.000000
## max 40.000000 38.000000 ... 16.000000 2022.000000
##
## [8 rows x 13 columns]
Filtered Data Below (all 612 March Madness Qualifiers)
# Filtered Dataframe (only March Madness Teams)
df1 = df.copy()
df1 = df1[df1['SEED'].notna()].reset_index(drop=True)
df1.columns = df1.columns.str.lower()
df1['seed'] = df1['seed'].astype(int)
# removes the teams that did not make March Madness, changes seeds to integers
# create a win percentage column
df1['win%'] = round(df1['w']/df1['g'],2)
df1
## team conf g w ... postseason seed year win%
## 0 Gonzaga WCC 32 28 ... S16 1 2022 0.88
## 1 Houston Amer 38 32 ... E8 5 2022 0.84
## 2 Kansas B12 40 34 ... Champions 1 2022 0.85
## 3 Texas Tech B12 37 27 ... S16 3 2022 0.73
## 4 Baylor B12 34 27 ... R32 1 2022 0.79
## .. ... ... .. .. ... ... ... ... ...
## 607 Michigan St. B10 35 26 ... S16 3 2013 0.74
## 608 Arizona P12 35 27 ... S16 6 2013 0.77
## 609 Oregon P12 37 28 ... S16 12 2013 0.76
## 610 La Salle A10 34 24 ... S16 13 2013 0.71
## 611 Florida Gulf Coast ASun 35 24 ... S16 15 2013 0.69
##
## [612 rows x 17 columns]
df1.describe()
## g w adjoe ... seed year win%
## count 612.000000 612.000000 612.000000 ... 612.000000 612.000000 612.000000
## mean 33.390523 24.063725 111.158987 ... 8.802288 2017.222222 0.719788
## std 3.365772 4.477716 6.372912 ... 4.674526 2.899793 0.101500
## min 16.000000 12.000000 90.600000 ... 1.000000 2013.000000 0.360000
## 25% 32.000000 21.000000 107.000000 ... 5.000000 2015.000000 0.640000
## 50% 34.000000 24.000000 111.200000 ... 9.000000 2017.000000 0.715000
## 75% 35.000000 27.000000 115.600000 ... 13.000000 2019.000000 0.790000
## max 40.000000 38.000000 129.100000 ... 16.000000 2022.000000 1.000000
##
## [8 rows x 14 columns]
There are 3160 observations and 16 features in the original dataset, while the filtered dataset has 612 observations. Summary statistics can be seen above, both before the filtering and after (the filtering entails Division 1 Basketball Teams who qualified for March Madness from 2013-2022). Teams qualify in one of two ways: automatic bid by winning their conference, or an at-large bid. An at-large bid occurs when a team does not win their conference, but their resume is impressive enough to be chosen by the selection committee for the tournament.
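As a quick sanity check on the filter, 9 tournaments with 68 teams each gives 9 × 68 = 612, which matches the filtered row count. The sketch below shows one way to verify the per-season counts. The helper name `teams_per_year` is hypothetical, and the toy frame is made up for illustration; it does not contain values from alldataclean.csv.

```python
import pandas as pd

def teams_per_year(frame: pd.DataFrame) -> pd.Series:
    """Count seeded (tournament) teams in each season."""
    seeded = frame[frame["SEED"].notna()]
    return seeded.groupby("YEAR")["TEAM"].count()

# Toy rows for illustration only (not real data)
toy = pd.DataFrame({
    "YEAR": [2021, 2021, 2021, 2022, 2022],
    "TEAM": ["A", "B", "C", "D", "E"],
    "SEED": [1.0, None, 16.0, 2.0, None],
})
counts = teams_per_year(toy)
print(counts)

# Expected size of the filtered data: 9 tournaments x 68 teams
expected_rows = 9 * 68
print(expected_rows)  # 612
```

On the real data, calling `teams_per_year(df)` should show the same count for every season if the filter is behaving as described.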
The first Tab looks at all NCAA division 1 teams, including those who did not qualify for March Madness. Tabs 2-5 look at only the teams who qualified for March Madness between 2013 and 2022. The objective is to understand the factors that contribute to sustained success in the March Madness tournament.
Before diving into March Madness specific data, I wanted to get a general overview of how well all NCAA Division 1 basketball teams shoot from three, and what the average adjusted offensive efficiency rating (ADJOE) is. ADJOE is the number of points a team would score per 100 possessions against an average Division 1 opponent. The “adjusted” part of this efficiency metric is that it accounts for the quality of the opponents a team scores against.
Note: This is the only graph that includes all Division 1 Men’s Basketball teams, as the rest of the graphs focus on teams who qualified for March Madness from 2013-2022.
df2 = df.copy()
df2.rename(columns={'YEAR': 'Year'}, inplace=True)
# Dual Axis Bar Chart
# Define years, skipping 2020
years = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2021, 2022]
# Compute mean values for each year
mean_3P_O = df2[df2['Year'].isin(years)].groupby('Year')['3P_O'].mean()
mean_ADJOE = df2[df2['Year'].isin(years)].groupby('Year')['ADJOE'].mean()
# Define bar width and positions
bar_width = 0.4
x_indexes = np.arange(len(years))
# Create the figure and axes with increased size
fig, ax1 = plt.subplots(figsize=(20, 13))
ax2 = ax1.twinx() # Create second y-axis
# Plot bar charts side by side
bars1 = ax1.bar(x_indexes - bar_width/2, mean_3P_O, bar_width, color='green', alpha=0.6, label="Mean 3P% (Left Axis)")
bars2 = ax2.bar(x_indexes + bar_width/2, mean_ADJOE, bar_width, color='gray', alpha=0.6, label="Mean ADJOE (Right Axis)")
# Set proper y-axis limits with extra space
ax1.set_ylim(0, (max(mean_3P_O) + 6));
ax2.set_ylim(0, (max(mean_ADJOE) + 8));
# Set proper tick increments
ax1.set_yticks(np.arange(0, ax1.get_ylim()[1] + 1, 5));
ax2.set_yticks(np.arange(0, ax2.get_ylim()[1] + 1, 10));
# Labels and title
ax1.set_xlabel("Season")
ax1.set_ylabel("Mean Three Point Percentage", color='black',fontsize=15)
ax2.set_ylabel("Mean Adjusted Offensive Efficiency", color='black',fontsize=15)
ax1.set_title("NCAA Basketball 3 Point Percentage vs. Adjusted Offensive Efficiency",fontsize=20)
# Set x-ticks and labels
ax1.set_xticks(x_indexes);
ax1.set_xticklabels(years);
# Add values on top of each bar
for bar in bars1:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2, height + .3, f'{height:.2f}', ha='center', va='bottom', fontsize=10, fontweight='bold', color='black')
for bar in bars2:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2, height + .3, f'{height:.2f}', ha='center', va='bottom', fontsize=10, fontweight='bold', color='black')
# Improve readability
ax1.tick_params(axis='y', colors='black')
ax2.tick_params(axis='y', colors='black')
# Position legends above bars
ax1.legend(loc="upper left", bbox_to_anchor=(0, 1.1),fontsize=15)
ax2.legend(loc="upper right", bbox_to_anchor=(1, 1.1),fontsize=15)
# Show the plot
plt.show()
The goal of the first visualization is to understand how three point
percentage and ADJOE are related. We expected a strong, positive
correlation between the two variables. The dual axis bar chart compares
season-level means, and it shows the two measures moving broadly
together: in general, seasons with a higher percentage of three pointers
made also have a higher ADJOE, meaning teams score more efficiently when
they are making threes at a high rate. The relationship is not perfect,
however. The year with the highest mean ADJOE was 2014, where teams
scored an average of 104.58 points per 100 possessions against a typical
opponent, yet the mean three point percentage for that year was only
34.29%, which was not the highest of any year. The association between
these two variables is therefore moderate rather than strong, suggesting
that scoring efficiency is more complex than three point percentage
alone. Other factors, such as turnover rate, free throw rate, and true
shooting percentage, also contribute to a team’s scoring ability.
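Because the bar chart only compares nine season-level means, the strength of the relationship is easier to judge at the team level. The sketch below computes a Pearson correlation between `3P_O` and `ADJOE`, overall and within each season. The function name `three_pt_adjoe_corr` is hypothetical, and the toy values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def three_pt_adjoe_corr(frame: pd.DataFrame, by_year: bool = False):
    """Pearson correlation between 3P% and ADJOE, overall or per season."""
    if by_year:
        return frame.groupby("YEAR")[["3P_O", "ADJOE"]].apply(
            lambda g: g["3P_O"].corr(g["ADJOE"])
        )
    return frame["3P_O"].corr(frame["ADJOE"])

# Toy rows for illustration only (not real data)
toy = pd.DataFrame({
    "YEAR":  [2021, 2021, 2021, 2022, 2022, 2022],
    "3P_O":  [30.0, 34.0, 38.0, 31.0, 35.0, 36.0],
    "ADJOE": [98.0, 104.0, 110.0, 100.0, 103.0, 108.0],
})
overall = three_pt_adjoe_corr(toy)
per_year = three_pt_adjoe_corr(toy, by_year=True)
print(overall)
print(per_year)
```

With the real data, `three_pt_adjoe_corr(df)` would give a single number to set against the "moderate" reading of the chart.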
# Extract champions sorted by year, then add them to a list
winners = df1[df1['postseason'] == 'Champions'][['year', 'team']].sort_values(by='year')
winners_list = winners['team'].tolist()
# Get the 3P_O values for each winner
winners_3p = df1.set_index(['year', 'team']).loc[winners.set_index(['year', 'team']).index]['3p_o'].tolist()
# get the mean 3p% of every MM team from 2013-2022 (2020 was cancelled)
threeprct_mean = round(df1['3p_o'].mean(),2)
# Vertical Bar Chart
years = [2013,2014,2015,2016,2017,2018,2019,2021,2022]
# Create labels that include both year and team name (to separate duplicate winners)
winners_labels = [f"{team} ({year})" for team, year in zip(winners_list, years)]
# Define colors based on whether the team's 3P% is above or below the mean
colors = ['blue' if pct >= threeprct_mean else 'red' for pct in winners_3p]
# Create the bar chart
plt.figure(figsize=(22, 14))
bars = plt.bar(winners_labels, winners_3p, color=colors, zorder=3) # Ensure bars are on top of gridlines
# Add the overall mean line
plt.axhline(y=threeprct_mean, color='black', linestyle='--', linewidth=2, zorder=4)
# Add text labels on top of each bar
for bar, pct in zip(bars, winners_3p):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, height+.15, f'{pct:.2f}',
             ha='center', va='bottom', fontsize=10, fontweight='bold', zorder=5)
# Labels and title
plt.xlabel("March Madness Winners from 2013 to 2022", fontsize=15)
plt.ylabel("Team 3 Point Percentage",fontsize=15)
plt.title("March Madness Winners and their 3 Point Percentages (2013-2022)",fontsize=20)
# Rotate x labels for better readability
plt.xticks(rotation=45, ha='right');
# Manually set the y-axis limit to create space for the legend at the top
plt.ylim(0, max(winners_3p) + 5);
# Manually set the y-ticks in increments of 5
plt.yticks(np.arange(0, max(winners_3p)+5, 5));
# Add the legend and adjust its position
plt.legend([plt.Line2D([0], [0], color='blue', lw=4),
plt.Line2D([0], [0], color='red', lw=4),
plt.Line2D([0], [0], color='black', linestyle='--', lw=2)],
['Above Mean', 'Below Mean', f'Mean of All MM Teams = {threeprct_mean:.2f}'],
loc='upper left', bbox_to_anchor=(0.0, 1), fontsize=15)
# Show the plot
plt.show()
The goal of the vertical bar chart was to compare the March Madness
winners’ three point percentages to the average of all March Madness
qualifiers. We want to know if elite three point shooting was a major
reason that these teams won the championship. In 7 of the 9 tournaments
analyzed, the winner’s three point percentage was above the mean of
35.71%. North Carolina in 2017 was slightly below the mean at 35.5%, and
Louisville is the only champion in this time period that shot well below
the mean, shooting 33.3% in 2013. In general, the March Madness winners
have shot better from beyond the arc as the years have passed, as shown
in 2018-2021 specifically. This upward trend over time might suggest
that shooting the three ball at a high clip is increasingly important in
having a chance at the NCAA title, although nine champions is a small
sample.
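One way to put a number on the upward trend is a least-squares slope of the champions' percentages against their order in time. This is a minimal sketch, not part of the original analysis: the helper name `linear_trend` is hypothetical, and the toy list is made up rather than taken from the chart.

```python
import numpy as np

def linear_trend(values):
    """Least-squares slope of values against their position (0, 1, 2, ...)."""
    x = np.arange(len(values))
    slope, _intercept = np.polyfit(x, values, deg=1)
    return slope

# Toy percentages for illustration only (not the champions' real values)
toy_pcts = [34.0, 35.0, 36.0, 37.0]
slope = linear_trend(toy_pcts)
print(round(slope, 2))  # 1.0 percentage point per tournament
```

Applied to the real data this would be `linear_trend(winners_3p)`. With only nine points the slope is a rough summary and is sensitive to any single champion.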
df3 = df1.copy()
# Multiple Line Plot
# Define years, skipping 2020, ensuring no gap in the x-axis
years = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2021, 2022]
x_positions = np.arange(len(years)) # Create a continuous index for plotting
# Define postseason rounds and their labels
postseason_rounds = {
"R64": "Round of 64",
"R32": "Round of 32",
"S16": "Sweet 16",
"E8": "Elite 8",
"F4": "Final 4",
"Champions": "Champions"
}
# Add net_rating column to df3
df3["net_rating"] = df3["adjoe"] - df3["adjde"]
# Dictionary to store mean net_rating values for each round
round_net_rating = {round_name: [] for round_name in postseason_rounds.keys()}
# Compute mean values for each year and round
for year in years:
    yearly_data = df3[df3["year"] == year]
    for round_code, round_label in postseason_rounds.items():
        mean_net_rating = yearly_data[yearly_data["postseason"] == round_code]["net_rating"].mean()
        round_net_rating[round_code].append(mean_net_rating)
# Create the figure
fig, ax = plt.subplots(figsize=(20, 13))
# Define color map for rounds
colors = {
"R64": "blue", "R32": "green", "S16": "purple",
"E8": "gray", "F4": "red", "Champions": "gold"
}
# Plot each postseason round as a line
for round_code, round_label in postseason_rounds.items():
    ax.plot(x_positions, round_net_rating[round_code], marker="o", linestyle="-",
            label=round_label, color=colors[round_code])
# Labels and title
ax.set_xlabel("Year",fontsize=15)
ax.set_ylabel("Net Rating",fontsize=15)
ax.set_title("NCAA Division 1 Basketball Net Rating by March Madness Round Reached",fontsize=20)
# Set x-ticks without a gap
ax.set_xticks(x_positions);
ax.set_xticklabels(years);
# Adjust y-axis range to start at 0
max_net_rating = max([max(values) for values in round_net_rating.values() if len(values) > 0])
ax.set_ylim(0, max_net_rating + 2);
# Set y-ticks in increments of 2
ax.set_yticks(np.arange(0, max_net_rating + 12, 2));
# Adjust layout to create whitespace at the top for the legend
fig.subplots_adjust(top=0.85)
# Legend adjusted to avoid overlapping with the "Champions" line
ax.legend(title="Postseason Round Reached", loc="upper left", fontsize=15)
# Grid for readability
ax.grid(True, linestyle="--", alpha=0.6)
# Show the plot
plt.show()
This multiple line plot looks at how net rating is related to the round
reached for March Madness teams. Net rating is arguably the single best
metric for evaluating how good a college basketball team is. It is
defined as adjusted offensive efficiency minus adjusted defensive
efficiency. Adjusted defensive efficiency is how many points a team
allows per 100 possessions, adjusting for the quality of the opponent.
The difference indicates how many more points a team scores than it
allows per 100 possessions (while factoring in the difficulty of the
opponents faced).
In the graph, the teams who did not win a game in the tournament (round of 64) had the lowest net rating every year compared to the teams who made it past the first round. With the exception of 2014, the champions had the highest net rating of all the groups each year. The middle rounds have net ratings that are much closer to each other. There are a few years where this does not hold true, but on average, the further a team advances, the better net rating they have. Net rating is clearly a major indicator for the strength of a team and their ability to win consecutive games in the tournament.
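The "further a team advances, the better its net rating" pattern can also be summarized with a rank correlation between net rating and round reached. The sketch below computes a Spearman correlation by ranking both series and taking the Pearson correlation of the ranks. The `ROUND_DEPTH` ordering and the function name are assumptions made for this example, and the toy rows are made up.

```python
import pandas as pd

# Assumed ordinal encoding of the dataset's postseason codes (hypothetical helper)
ROUND_DEPTH = {"R68": 0, "R64": 1, "R32": 2, "S16": 3,
               "E8": 4, "F4": 5, "2ND": 6, "Champions": 7}

def net_rating_vs_depth(frame: pd.DataFrame) -> float:
    """Spearman correlation between net rating and tournament depth."""
    net = frame["adjoe"] - frame["adjde"]
    depth = frame["postseason"].map(ROUND_DEPTH)
    # Spearman = Pearson correlation of the ranks (ties get average ranks)
    return net.rank().corr(depth.rank())

# Toy rows for illustration only (not real data)
toy = pd.DataFrame({
    "adjoe": [105.0, 110.0, 112.0, 118.0, 120.0],
    "adjde": [100.0, 98.0, 97.0, 92.0, 90.0],
    "postseason": ["R64", "R32", "R32", "E8", "Champions"],
})
rho = net_rating_vs_depth(toy)
print(round(rho, 3))
```

Running `net_rating_vs_depth(df3)` on the real data would give one number per dataset, or it could be applied per year with a groupby.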
df4 = df3.copy()
# Nested Pie Chart
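# Filter df4 for teams reaching at least the Final Four
# Note: the dataset codes the runner-up as "2ND" (see the pd.unique output above),
# so the lowercase "2nd" below matches no rows and the runner-ups are left out of the counts.
# Replace it with "2ND" to include them.
final_four_teams = df4[df4["postseason"].isin(["F4", "2nd", "Champions"])]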
# Count occurrences of each seed
seed_counts = final_four_teams["seed"].value_counts().sort_index()
seeds = seed_counts.index.astype(str)
# Count occurrences of each conference
conference_counts = final_four_teams["conf"].value_counts()
conferences = conference_counts.index.astype(str)
# Dictionary to map conference abbreviations to full names
conference_full_names = {
"Amer": "American", "B12": "Big 12", "B10": "Big 10", "P12": "Pac-12",
"SEC": "Southeastern", "ACC": "Atlantic Coast", "MWC": "Mountain West",
"WCC": "West Coast", "BE": "Big East", "A10": "Atlantic 10"
}
full_conferences = [conference_full_names.get(conf, conf) for conf in conferences]
# Define more vibrant and distinct colors
seed_colors = plt.cm.tab10(np.linspace(0, 1, len(seeds)))
conf_colors = plt.cm.tab20(np.linspace(0, 1, len(conferences)))
# Pie chart sizes
seed_sizes = seed_counts.values
conf_sizes = conference_counts.values
# Create nested pie chart
fig, ax = plt.subplots(figsize=(10, 10))
# Outer ring (Seeds)
wedgeprops = dict(width=0.3, edgecolor='w')
# autopct is omitted because the labels are drawn manually below,
# so ax.pie returns only the wedges and the (empty) label texts
outside_pie, _ = ax.pie(seed_sizes, labels=None, colors=seed_colors,
                        radius=1, wedgeprops=wedgeprops)
# Label each slice with seed and count
for wedge, seed, size in zip(outside_pie, seeds, seed_sizes):
    angle = (wedge.theta2 + wedge.theta1) / 2 # Compute the midpoint angle
    x = 1.1 * np.cos(np.radians(angle)) # Adjust text position outside the slice
    y = 1.1 * np.sin(np.radians(angle))
    ax.text(x, y, f"{seed} Seed\n{size} teams", ha='center', va='center',
            fontsize=9, fontweight='bold')
    # Place percentage inside the slice with black text
    x_inside = 0.85 * np.cos(np.radians(angle))
    y_inside = 0.85 * np.sin(np.radians(angle))
    ax.text(x_inside, y_inside, f"{size/sum(seed_sizes)*100:.1f}%", ha='center', va='center',
            fontsize=9, fontweight='bold', color='black')
# Inner ring (Conferences) with inside labels
wedgeprops_inner = dict(width=0.3, edgecolor='w')
inside_pie, _ = ax.pie(conf_sizes, labels=None, colors=conf_colors,
radius=0.7, wedgeprops=wedgeprops_inner)
# Manually place text inside the slices for conferences
for wedge, conf, size in zip(inside_pie, full_conferences, conf_sizes):
    angle = (wedge.theta2 + wedge.theta1) / 2 # Compute the midpoint angle
    x = 0.5 * np.cos(np.radians(angle)) # Adjust text position inside the inner ring
    y = 0.5 * np.sin(np.radians(angle))
    ax.text(x, y, f"{conf}\n{size} teams\n{size/sum(conf_sizes)*100:.1f}%", ha='center', va='center',
            fontsize=9, fontweight='bold')
# Add center hole with total team count, making it smaller
center_circle = plt.Circle((0, 0), 0.3, color='white')
ax.add_artist(center_circle)
ax.text(0, 0, f"Total Teams:\n{len(final_four_teams)}", ha='center', va='center', fontsize=14, fontweight='bold')
# Title
plt.title("NCAA March Madness Final Four Appearances by Seed and Conference (2013-2022)")
# Show plot
plt.show()
The graph shows how the teams that reached the Final Four are
distributed by seed and by conference; reaching the Final Four is a
major milestone for any college team. The shares are out of 27 teams,
since the nine runner-ups (coded 2ND in the data) are not matched by the
filter. For seeds, as we would expect, 1 and 2 seeds make up the
majority of teams who reach the Final 4, at a combined 62.9%.
Surprisingly, after the first 2 seeds, seeding is not highly predictive
of which teams end up in the Final 4. From 2013-2022, three 7 seeds
reached the Final 4, which is as many teams as 3, 4, and 5 seeds
combined. It is not too surprising that no teams past the 11 seed
reached the Final 4, as 12-16 seeds are usually considered underdogs in
the first round, and they would have to win at least 4 consecutive
games.
Regarding the conferences, Final 4 appearances are spread fairly evenly among the major conferences. The Big East and Atlantic Coast Conference hold the biggest shares at 18.5% each, which is 5 teams apiece. The SEC and Big 12 follow with 4 teams each, and the Big Ten rounds out the top 5 conferences with 3 teams. I was surprised that the Pac-12, Missouri Valley Conference, and American Athletic Conference all had 2 teams that appeared in the Final 4. The parity among conferences highlights the unpredictable nature of March Madness. However, when going by seeds, it is pretty evident that 1 and 2 seeds typically make up at least 2 of the Final 4 teams remaining in a given season.
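The shares quoted above are plain proportions of the filtered rows, so they can be checked without reading them off the pie chart. A minimal sketch, assuming a frame with the same lowercase column names as the filtered data; the helper name `share_by` is hypothetical and the toy seeds are made up.

```python
import pandas as pd

def share_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Percentage share of rows for each value of `column`, rounded to 0.1."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)

# Toy rows for illustration only (not real data)
toy = pd.DataFrame({"seed": [1, 1, 2, 7], "conf": ["ACC", "BE", "ACC", "SEC"]})
seed_share = share_by(toy, "seed")
print(seed_share)

# Combined share of 1 and 2 seeds in the toy data
top_two = seed_share.loc[[1, 2]].sum()
print(top_two)  # 75.0
```

For the real chart the calls would be `share_by(final_four_teams, "seed")` and `share_by(final_four_teams, "conf")`.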
# Heatmap
# Adjust postseason labels
postseason_labels = {
"R64": "Round of 64", "R32": "Round of 32", "S16": "Sweet 16",
"E8": "Elite 8", "F4": "Final 4", "2ND": "Runner-Up", "Champions": "Champions"
}
# Define number of tournament wins and games to subtract per round
# (the runner-up key must match the dataset's "2ND" code exactly)
postseason_adjustments = {
    "R64": (0, 1), "R32": (1, 2), "S16": (2, 3), "E8": (3, 4),
    "F4": (4, 5), "2ND": (5, 6), "Champions": (6, 6)
}
# Copy df1 to df5
df5 = df1.copy()
# Apply adjustments to compute regular season win percentage
df5["rswin%"] = df5.apply(
lambda row: (row["w"] - postseason_adjustments.get(row["postseason"], (0, 0))[0]) /
(row["g"] - postseason_adjustments.get(row["postseason"], (0, 0))[1]), axis=1)
# Replace postseason codes with full labels
df5["postseason"] = df5["postseason"].map(postseason_labels)
# Ensure postseason rounds are in the correct order
round_order = ["Round of 64", "Round of 32", "Sweet 16", "Elite 8", "Final 4", "Runner-Up", "Champions"]
# Create pivot table for heatmap
data_pivot = df5.groupby(["year", "postseason"])["rswin%"].mean().unstack()
# Reorder columns based on round order
data_pivot = data_pivot[round_order]
# Flip the order of the years
data_pivot = data_pivot.sort_index(ascending=False)
# Create heatmap with blue-to-red gradient
plt.figure(figsize=(10, 6))
ax = sns.heatmap(data_pivot, cmap="coolwarm", annot=True, fmt=".2f", linewidths=0.5, cbar_kws={'label': 'Regular Season Win Percentage'})
# Labels and title
plt.xlabel("March Madness Round Reached")
plt.ylabel("Year")
plt.title("Regular Season Win Percentage by March Madness Round (2013-2022)")
# Show plot
plt.show()
The heatmap shows the regular season win percentage (tournament games were removed to avoid the bias they would introduce) of March Madness teams grouped by year and the round reached. The graph shows us that within seasons, regular season win percentage does not appear to be a strong predictor of tournament success in the initial rounds. This is because the values are generally very close to each other, and sometimes even decrease for teams that reach later rounds. However, regular season win percentage does appear to matter for the runner-ups and champions, as one of those two groups always had the highest win percentage of any round for each year. When looking at how each round’s win percentage changed over time, there is no clear trend, as the values rise and fall from year to year. Relating the results to the last plot, we can say that having a high win percentage in the regular season sets teams up for a better seed, which could result in favorable matchups that set up deep tournament runs. Alternatively, the champions and runner-ups might have more tournament success not because of matchups due to seeding, but because they are more efficient on both sides of the ball (as mentioned in the net rating plot).
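The tournament-game adjustment can be illustrated on two rows shown in the data preview: Kansas in 2022 (34 wins in 40 games, Champions) and Gonzaga in 2022 (28 wins in 32 games, Sweet 16). The sketch below is a vectorized alternative to the row-wise `apply`; the names `TOURNEY_RECORD` and `regular_season_win_pct` are hypothetical, and it assumes the dataset's postseason codes, including "2ND" for the runner-up.

```python
import pandas as pd

# (tournament wins, tournament games) implied by the round reached
TOURNEY_RECORD = {
    "R64": (0, 1), "R32": (1, 2), "S16": (2, 3), "E8": (3, 4),
    "F4": (4, 5), "2ND": (5, 6), "Champions": (6, 6),
}

def regular_season_win_pct(frame: pd.DataFrame) -> pd.Series:
    """Win percentage after removing tournament wins and games."""
    wins_adj = frame["postseason"].map(lambda r: TOURNEY_RECORD.get(r, (0, 0))[0])
    games_adj = frame["postseason"].map(lambda r: TOURNEY_RECORD.get(r, (0, 0))[1])
    return (frame["w"] - wins_adj) / (frame["g"] - games_adj)

# Two rows taken from the data preview earlier in this report
sample = pd.DataFrame({
    "team": ["Kansas", "Gonzaga"],
    "g": [40, 32],
    "w": [34, 28],
    "postseason": ["Champions", "S16"],
})
rs_pct = regular_season_win_pct(sample)
print(rs_pct.round(3).tolist())  # [0.824, 0.897]
```

Kansas goes from 34/40 = 0.85 to 28/34 ≈ 0.824, and Gonzaga from 28/32 ≈ 0.875 to 26/29 ≈ 0.897, which shows that the adjustment can move a team's percentage in either direction.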
We looked at various factors and metrics that contribute to an NCAA basketball team’s success in March Madness. Since the champions had the highest net rating of any round in every year except 2014, I would consider net rating the most important metric in predicting the success of a team in the tournament. Over the years studied, three point shooting (specifically three point percentage) appears to have become increasingly important in winning a championship. However, teams that are not elite from beyond the arc can still win it all, as shown by Louisville, which won with elite defense and rebounding. Nearly 2/3 of the teams in this dataset that reached the Final 4 were 1 or 2 seeds, showing the strength of the high-end favorites and the potential impact of momentum. Conference is not an important predictor provided that the team is from a major conference such as the ACC. Regular season win percentage is only a relevant predictor of success for the runner-ups and champions, as the other rounds show little separation from one another. Clearly, while March Madness continues to be one of the most unpredictable tournaments in sports, there are some metrics that can serve as moderately reliable predictors of a team’s success.