Python Data Visualization Project

Introduction

The NCAA Division 1 Basketball Tournament, more commonly known as March Madness, is one of the most unpredictable sporting events. Upsets and Cinderella stories captivate fans every year. While top-seeded teams and “Power-5” conferences often perform well, lower-seeded and lesser-known teams frequently defy expectations, making it challenging to predict tournament success. This analysis explores key statistical factors (offensive efficiency, three-point percentage, regular season win percentage, and net rating) that may influence how far a team advances in the NCAA tournament. By analyzing historical data from past tournaments, we aim to identify metrics that provide a more accurate picture of a team’s potential for a deep run. Understanding these factors can offer valuable insights for analysts, fans, and bracket enthusiasts looking for an edge when they fill out their brackets each year.

This project uses Python to explore and visualize college basketball and March Madness data using a variety of metrics. I am especially interested in the factors associated with a deep tournament run, and in how those factors differ from season to season.

Dataset

The dataset was found on a Kaggle page titled “college basketball march madness data”. The data originates from another Kaggle dataset titled “college-basketball-dataset”, and was updated with data from “https://barttorvik.com/”. The resulting dataset is a CSV file named “alldataclean.csv”. It covers the 2013-2022 seasons, with no data for 2020 because the March Madness tournament was cancelled that year. The dataset includes variables such as team name, season/year, conference, March Madness seed, games played, wins, offensive three-point percentage, and adjusted offensive efficiency. The original dataset includes every Division 1 basketball team, including teams that did not qualify for March Madness. The filtered dataset that I use for the majority of my analysis includes only the teams that qualified for March Madness from 2013 to 2022. Each qualifier in a given year is a separate observation, meaning that a college such as Duke that qualified for the tournament in multiple years has multiple observations.

# Import Libraries
import matplotlib.pyplot as plt
import warnings
import pandas as pd
import numpy as np
import seaborn as sns
warnings.filterwarnings("ignore")

path = "C:/Users/Admin/Downloads/IS460/Python/Datafiles/"
filename = "alldataclean.csv"
# get rid of columns we are not interested in
df = pd.read_csv(path+filename, usecols = ['YEAR','TEAM','CONF','G','W','SEED','POSTSEASON','3P_O','ADJOE', 'ADJDE','EFG_O','ORB','ADJ_T','FTR','BARTHAG','TOR'])
pd.unique(df['POSTSEASON'])

## array(['S16', 'E8', 'Champions', 'R32', 'F4', 'R64', '2ND', nan, 'R68'],
##       dtype=object)

pd.unique(df['SEED'])

## array([ 1.,  5.,  3.,  2.,  4.,  6.,  8.,  9., 11., 10., 12., nan,  7.,
##        13., 15., 14., 16.])

# there should be fewer non-null values for SEED and POSTSEASON, as only 68 teams per year qualify for the March Madness tournament
df.notna().sum()

## TEAM          3160
## CONF          3160
## G             3160
## W             3160
## ADJOE         3160
## ADJDE         3160
## BARTHAG       3160
## EFG_O         3160
## TOR           3160
## ORB           3160
## FTR           3160
## 3P_O          3160
## ADJ_T         3160
## POSTSEASON     612
## SEED           612
## YEAR          3160
## dtype: int64

df

##                     TEAM  CONF   G   W  ...  ADJ_T  POSTSEASON  SEED  YEAR
## 0                Gonzaga   WCC  32  28  ...   72.6         S16   1.0  2022
## 1                Houston  Amer  38  32  ...   63.7          E8   5.0  2022
## 2                 Kansas   B12  40  34  ...   69.1   Champions   1.0  2022
## 3             Texas Tech   B12  37  27  ...   66.3         S16   3.0  2022
## 4                 Baylor   B12  34  27  ...   67.6         R32   1.0  2022
## ...                  ...   ...  ..  ..  ...    ...         ...   ...   ...
## 3155        Michigan St.   B10  35  26  ...   64.4         S16   3.0  2013
## 3156             Arizona   P12  35  27  ...   66.8         S16   6.0  2013
## 3157              Oregon   P12  37  28  ...   69.2         S16  12.0  2013
## 3158            La Salle   A10  34  24  ...   66.0         S16  13.0  2013
## 3159  Florida Gulf Coast  ASun  35  24  ...   69.1         S16  15.0  2013
## 
## [3160 rows x 16 columns]

df.describe()

##                  G            W  ...        SEED         YEAR
## count  3160.000000  3160.000000  ...  612.000000  3160.000000
## mean     30.427848    15.889557  ...    8.802288  2017.234494
## std       4.009704     6.616489  ...    4.674526     2.899457
## min       4.000000     0.000000  ...    1.000000  2013.000000
## 25%      29.000000    11.000000  ...    5.000000  2015.000000
## 50%      31.000000    15.500000  ...    9.000000  2017.000000
## 75%      33.000000    21.000000  ...   13.000000  2019.000000
## max      40.000000    38.000000  ...   16.000000  2022.000000
## 
## [8 rows x 13 columns]

Filtered Data Below (all 612 March Madness Qualifiers)

# Filtered DataFrame (only March Madness teams)
df1 = df.copy()
# remove the teams that did not make March Madness (no seed)
df1 = df1[df1['SEED'].notna()].reset_index(drop=True)
df1.columns = df1.columns.str.lower()
# change seeds to integers
df1['seed'] = df1['seed'].astype(int)

# create a win percentage column (wins / games, rounded to 2 decimals)
df1['win%'] = round(df1['w']/df1['g'],2)
df1

##                    team  conf   g   w  ...  postseason  seed  year  win%
## 0               Gonzaga   WCC  32  28  ...         S16     1  2022  0.88
## 1               Houston  Amer  38  32  ...          E8     5  2022  0.84
## 2                Kansas   B12  40  34  ...   Champions     1  2022  0.85
## 3            Texas Tech   B12  37  27  ...         S16     3  2022  0.73
## 4                Baylor   B12  34  27  ...         R32     1  2022  0.79
## ..                  ...   ...  ..  ..  ...         ...   ...   ...   ...
## 607        Michigan St.   B10  35  26  ...         S16     3  2013  0.74
## 608             Arizona   P12  35  27  ...         S16     6  2013  0.77
## 609              Oregon   P12  37  28  ...         S16    12  2013  0.76
## 610            La Salle   A10  34  24  ...         S16    13  2013  0.71
## 611  Florida Gulf Coast  ASun  35  24  ...         S16    15  2013  0.69
## 
## [612 rows x 17 columns]

df1.describe()

##                 g           w       adjoe  ...        seed         year        win%
## count  612.000000  612.000000  612.000000  ...  612.000000   612.000000  612.000000
## mean    33.390523   24.063725  111.158987  ...    8.802288  2017.222222    0.719788
## std      3.365772    4.477716    6.372912  ...    4.674526     2.899793    0.101500
## min     16.000000   12.000000   90.600000  ...    1.000000  2013.000000    0.360000
## 25%     32.000000   21.000000  107.000000  ...    5.000000  2015.000000    0.640000
## 50%     34.000000   24.000000  111.200000  ...    9.000000  2017.000000    0.715000
## 75%     35.000000   27.000000  115.600000  ...   13.000000  2019.000000    0.790000
## max     40.000000   38.000000  129.100000  ...   16.000000  2022.000000    1.000000
## 
## [8 rows x 14 columns]

There are 3160 observations and 16 features in the original dataset, while the filtered dataset has 612 observations. Summary statistics are shown above, both before and after filtering (the filtered data keeps only the Division 1 basketball teams that qualified for March Madness from 2013-2022). Teams qualify in one of two ways: an automatic bid earned by winning their conference, or an at-large bid. An at-large bid occurs when a team does not win its conference, but its resume is impressive enough for the selection committee to choose it for the tournament.
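
As a quick sanity check on the filtered row count, the arithmetic below multiplies the 68 qualifiers per tournament by the nine seasons in the data (2013-2022 without 2020). The constant and variable names are illustrative only and are not part of the original analysis.

```python
# Seasons covered by the dataset; 2020 is skipped because the tournament was cancelled
seasons = [year for year in range(2013, 2023) if year != 2020]

# 68 teams qualify for March Madness each year
TEAMS_PER_TOURNAMENT = 68

expected_qualifiers = len(seasons) * TEAMS_PER_TOURNAMENT
print(len(seasons), expected_qualifiers)  # 9 612
```

The result matches the 612 non-null SEED and POSTSEASON values reported above.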

Findings

The first tab looks at all NCAA Division 1 teams, including those that did not qualify for March Madness. Tabs 2-5 look only at the teams that qualified for March Madness between 2013 and 2022. The objective is to understand the factors that contribute to sustained success in the March Madness tournament.

Tab 1

Before diving into March Madness specific data, I wanted to get a general overview of how well all NCAA Division 1 basketball teams shoot from three, and what the average adjusted offensive efficiency rating (ADJOE) is. ADJOE is the number of points a team would score per 100 possessions against the average Division 1 opponent. The “adjusted” part of this efficiency metric means that it accounts for the quality of the opponents a team scores against.
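
To make the “points per 100 possessions” scale concrete, here is a minimal sketch of the raw, unadjusted version of the metric. The helper name and the sample game totals are hypothetical, and the opponent-quality adjustment that turns this into ADJOE is not reproduced here.

```python
def points_per_100_possessions(points: float, possessions: float) -> float:
    """Raw offensive efficiency: points scored per 100 possessions."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * points / possessions


# Hypothetical game: 75 points scored on 68 possessions
raw_efficiency = points_per_100_possessions(75, 68)
print(round(raw_efficiency, 1))  # 110.3
```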

Note: This is the only graph that includes all Division 1 Men’s Basketball teams, as the rest of the graphs focus on teams who qualified for March Madness from 2013-2022.

df2 = df.copy()
df2.rename(columns={'YEAR': 'Year'}, inplace=True)

# Dual Axis Bar Chart

# Define years, skipping 2020
years = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2021, 2022]

# Compute mean values for each year
mean_3P_O = df2[df2['Year'].isin(years)].groupby('Year')['3P_O'].mean()
mean_ADJOE = df2[df2['Year'].isin(years)].groupby('Year')['ADJOE'].mean()

# Define bar width and positions
bar_width = 0.4
x_indexes = np.arange(len(years))

# Create the figure and axes with increased size
fig, ax1 = plt.subplots(figsize=(20, 13))
ax2 = ax1.twinx()  # Create second y-axis

# Plot bar charts side by side
bars1 = ax1.bar(x_indexes - bar_width/2, mean_3P_O, bar_width, color='green', alpha=0.6, label="Mean 3P% (Left Axis)")
bars2 = ax2.bar(x_indexes + bar_width/2, mean_ADJOE, bar_width, color='gray', alpha=0.6, label="Mean ADJOE (Right Axis)")

# Set proper y-axis limits with extra space
ax1.set_ylim(0, (max(mean_3P_O) + 6));
ax2.set_ylim(0, (max(mean_ADJOE) + 8));  

# Set proper tick increments
ax1.set_yticks(np.arange(0, ax1.get_ylim()[1] + 1, 5));
ax2.set_yticks(np.arange(0, ax2.get_ylim()[1] + 1, 10));

# Labels and title
ax1.set_xlabel("Season")
ax1.set_ylabel("Mean Three Point Percentage", color='black',fontsize=15)
ax2.set_ylabel("Mean Adjusted Offensive Efficiency", color='black',fontsize=15)
ax1.set_title("NCAA Basketball 3 Point Percentage vs. Adjusted Offensive Efficiency",fontsize=20)

# Set x-ticks and labels
ax1.set_xticks(x_indexes);
ax1.set_xticklabels(years);

# Add values on top of each bar
for bar in bars1:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2, height + .3, f'{height:.2f}', ha='center', va='bottom', fontsize=10, fontweight='bold', color='black')

for bar in bars2:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2, height + .3, f'{height:.2f}', ha='center', va='bottom', fontsize=10, fontweight='bold', color='black')

# Improve readability
ax1.tick_params(axis='y', colors='black')
ax2.tick_params(axis='y', colors='black')

# Position legends above bars
ax1.legend(loc="upper left", bbox_to_anchor=(0, 1.1),fontsize=15)
ax2.legend(loc="upper right", bbox_to_anchor=(1, 1.1),fontsize=15)

# Show the plot
plt.show()

The goal of the first visualization is to understand how three-point percentage and ADJOE are related. We expected a strong, positive correlation between the two variables. The dual-axis bar chart compares season-level means, and it suggests a positive relationship: in general, seasons with a higher mean three-point percentage also have a higher mean ADJOE, meaning teams score more efficiently when they make threes at a high rate. The relationship is not perfect, however. The year with the highest mean ADJOE was 2014, when teams scored an average of 104.58 points per 100 possessions against a typical opponent, yet the mean three-point percentage that year was only 34.29%, which was not the highest of any year. The relationship therefore appears moderate rather than strong, suggesting that scoring efficiency depends on more than three-point percentage alone. Other factors, such as turnover rate, free throw rate, and true shooting percentage, also contribute to a team’s scoring ability.
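
Because the bar chart compares season-level means, a team-level correlation coefficient is a more direct way to gauge the relationship. The sketch below shows one way to compute it with pandas; the six rows are made-up values standing in for the real `3P_O` and `ADJOE` columns, so the resulting numbers say nothing about the actual data.

```python
import pandas as pd

# Hypothetical team-season rows (NOT taken from alldataclean.csv)
sample = pd.DataFrame({
    "YEAR": [2021, 2021, 2021, 2022, 2022, 2022],
    "3P_O": [31.0, 34.5, 37.2, 30.8, 35.1, 38.0],
    "ADJOE": [98.4, 104.0, 110.3, 97.9, 106.2, 112.5],
})

# Pearson correlation across all team-seasons
overall_r = sample["3P_O"].corr(sample["ADJOE"])

# The same coefficient computed separately within each season
by_year_r = {
    year: group["3P_O"].corr(group["ADJOE"])
    for year, group in sample.groupby("YEAR")
}

print(f"overall r = {overall_r:.3f}")
for year, r in by_year_r.items():
    print(f"{year}: r = {r:.3f}")
```

On the real data, the equivalent overall call would be `df['3P_O'].corr(df['ADJOE'])`.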

Tab 2

# Extract champions sorted by year, then add them to a list
winners = df1[df1['postseason'] == 'Champions'][['year', 'team']].sort_values(by='year')
winners_list = winners['team'].tolist()

# Get the 3P_O values for each winner
winners_3p = df1.set_index(['year', 'team']).loc[winners.set_index(['year', 'team']).index]['3p_o'].tolist()

# get the mean 3p% of every MM team from 2013-2022 (2020 was cancelled)
threeprct_mean = round(df1['3p_o'].mean(),2)

# Vertical Bar Chart


years = [2013,2014,2015,2016,2017,2018,2019,2021,2022]
# Create labels that include both year and team name (to separate duplicate winners)
winners_labels = [f"{team} ({year})" for team, year in zip(winners_list, years)]

# Define colors based on whether the team's 3P% is above or below the mean
colors = ['blue' if pct >= threeprct_mean else 'red' for pct in winners_3p]

# Create the bar chart
plt.figure(figsize=(22, 14))
bars = plt.bar(winners_labels, winners_3p, color=colors, zorder=3)  # Ensure bars are on top of gridlines

# Add the overall mean line
plt.axhline(y=threeprct_mean, color='black', linestyle='--', linewidth=2, zorder=4)

# Add text labels on top of each bar
for bar, pct in zip(bars, winners_3p):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, height+.15, f'{pct:.2f}', 
             ha='center', va='bottom', fontsize=10, fontweight='bold', zorder=5)

# Labels and title
plt.xlabel("March Madness Winners from 2013 to 2022", fontsize=15)
plt.ylabel("Team 3 Point Percentage",fontsize=15)
plt.title("March Madness Winners and their 3 Point Percentages (2013-2022)",fontsize=20)

# Rotate x labels for better readability
plt.xticks(rotation=45, ha='right');

# Manually set the y-axis limit to create space for the legend at the top
plt.ylim(0, max(winners_3p) + 5);

# Manually set the y-ticks in increments of 5
plt.yticks(np.arange(0, max(winners_3p)+5, 5));

# Add the legend and adjust its position
plt.legend([plt.Line2D([0], [0], color='blue', lw=4), 
            plt.Line2D([0], [0], color='red', lw=4),
            plt.Line2D([0], [0], color='black', linestyle='--', lw=2)], 
           ['Above Mean', 'Below Mean', f'Mean of All MM Teams = {threeprct_mean:.2f}'],
           loc='upper left', bbox_to_anchor=(0.0, 1), fontsize=15)

# Show the plot
plt.show()

The goal of the vertical bar chart was to compare the March Madness winners’ three-point percentages to the average of all March Madness qualifiers. We want to know if elite three-point shooting was a major reason that these teams won the championship. In 7 of the 9 tournaments analyzed (2020 was cancelled), the winner’s three-point percentage was above the mean of 35.71%. North Carolina in 2017 was slightly below the mean at 35.5%, and Louisville is the only champion in this period that shot well below it, at 33.3% in 2013. In general, the March Madness winners have shot better from beyond the arc as the years have passed, as shown in 2018-2021 specifically. This upward trend in the winners’ three-point percentage might suggest that shooting the three ball at a high clip is especially important in having a chance at the NCAA title.
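
The difference between “slightly below” and “well below” the mean can be made explicit by subtracting the tournament-wide mean from each champion's percentage. This small sketch uses only the three figures quoted above; the variable names are illustrative.

```python
# Mean 3P% of all March Madness qualifiers, as reported in the chart legend
MEAN_3P_PCT = 35.71

# The two champions reported as below the mean
below_mean_champions = {
    "Louisville (2013)": 33.3,
    "North Carolina (2017)": 35.5,
}

# Gap to the mean in percentage points (negative = below the mean)
gap_to_mean = {
    label: round(pct - MEAN_3P_PCT, 2)
    for label, pct in below_mean_champions.items()
}

for label, gap in gap_to_mean.items():
    print(f"{label}: {gap:+.2f} percentage points")
```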

Tab 3

df3 = df1.copy()

# Multiple Line Plot


# Define years, skipping 2020, ensuring no gap in the x-axis
years = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2021, 2022]  
x_positions = np.arange(len(years))  # Create a continuous index for plotting

# Define postseason rounds and their labels
postseason_rounds = {
    "R64": "Round of 64",
    "R32": "Round of 32",
    "S16": "Sweet 16",
    "E8": "Elite 8",
    "F4": "Final 4",
    "Champions": "Champions"
}

# Add net_rating column to df3
df3["net_rating"] = df3["adjoe"] - df3["adjde"]

# Dictionary to store mean net_rating values for each round
round_net_rating = {round_name: [] for round_name in postseason_rounds.keys()}

# Compute mean values for each year and round
for year in years:
    yearly_data = df3[df3["year"] == year]
    for round_code, round_label in postseason_rounds.items():
        mean_net_rating = yearly_data[yearly_data["postseason"] == round_code]["net_rating"].mean()
        round_net_rating[round_code].append(mean_net_rating)

# Create the figure
fig, ax = plt.subplots(figsize=(20, 13))

# Define color map for rounds
colors = {
    "R64": "blue", "R32": "green", "S16": "purple",
    "E8": "gray", "F4": "red", "Champions": "gold"
}

# Plot each postseason round as a line
for round_code, round_label in postseason_rounds.items():
    ax.plot(x_positions, round_net_rating[round_code], marker="o", linestyle="-", 
             label=round_label, color=colors[round_code])

# Labels and title
ax.set_xlabel("Year",fontsize=15)
ax.set_ylabel("Net Rating",fontsize=15)
ax.set_title("NCAA Division 1 Basketball Net Rating by March Madness Round Reached",fontsize=20)

# Set x-ticks without a gap
ax.set_xticks(x_positions);
ax.set_xticklabels(years);

# Adjust y-axis range to start at 0
max_net_rating = max([max(values) for values in round_net_rating.values() if len(values) > 0])
ax.set_ylim(0, max_net_rating + 2);

# Set y-ticks in increments of 2
ax.set_yticks(np.arange(0, max_net_rating + 12, 2));

# Adjust layout to create whitespace at the top for the legend
fig.subplots_adjust(top=0.85)

# Legend adjusted to avoid overlapping with the "Champions" line
ax.legend(title="Postseason Round Reached", loc="upper left", fontsize=15)

# Grid for readability
ax.grid(True, linestyle="--", alpha=0.6)

# Show the plot
plt.show()

This multiple line plot looks at how net rating is related to the round reached by March Madness teams. Net rating is arguably the single best metric for evaluating how good a college basketball team is. It is defined as adjusted offensive efficiency minus adjusted defensive efficiency, resulting in a net efficiency rating. Adjusted defensive efficiency is how many points a team allows per 100 possessions, adjusted for the quality of the opponent. The difference indicates how many more points a team scores than its opponents per 100 possessions on average (while factoring in the difficulty of the opponents faced).
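
As a worked example of the definition, the sketch below computes net rating for two hypothetical teams. The efficiency values are invented for illustration; in the analysis itself the inputs are the `adjoe` and `adjde` columns.

```python
def net_rating(adj_off_eff: float, adj_def_eff: float) -> float:
    """Net rating = adjusted offensive efficiency - adjusted defensive efficiency."""
    return adj_off_eff - adj_def_eff


# Hypothetical (offense, defense) efficiencies in points per 100 possessions
teams = {
    "Team A": (115.0, 92.5),   # strong offense, strong defense
    "Team B": (104.0, 101.5),  # roughly average on both ends
}

ratings = {name: net_rating(off, dfn) for name, (off, dfn) in teams.items()}
print(ratings)  # {'Team A': 22.5, 'Team B': 2.5}
```

A lower defensive efficiency is better, which is why allowing fewer points raises the net rating.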

In the graph, the teams who did not win a game in the tournament (round of 64) had the lowest net rating every year compared to the teams who made it past the first round. With the exception of 2014, the champions had the highest net rating of all the groups each year. The middle rounds have net ratings that are much closer to each other. There are a few years where this does not hold true, but on average, the further a team advances, the better net rating they have. Net rating is clearly a major indicator for the strength of a team and their ability to win consecutive games in the tournament.

Tab 4

df4 = df3.copy()
# Nested Pie Chart


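# Filter df4 for teams reaching at least the Final Four
# NOTE: the postseason column codes runners-up as "2ND" (uppercase), so the
# lowercase "2nd" below matches no rows and runners-up are left out of the chart
final_four_teams = df4[df4["postseason"].isin(["F4", "2nd", "Champions"])]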

# Count occurrences of each seed
seed_counts = final_four_teams["seed"].value_counts().sort_index()
seeds = seed_counts.index.astype(str)

# Count occurrences of each conference
conference_counts = final_four_teams["conf"].value_counts()
conferences = conference_counts.index.astype(str)

# Dictionary to map conference abbreviations to full names
conference_full_names = {
    "Amer": "American", "B12": "Big 12", "B10": "Big 10", "P12": "Pac-12",
    "SEC": "Southeastern", "ACC": "Atlantic Coast", "MWC": "Mountain West",
    "WCC": "West Coast", "BE": "Big East", "A10": "Atlantic 10"
}
full_conferences = [conference_full_names.get(conf, conf) for conf in conferences]

# Define more vibrant and distinct colors
seed_colors = plt.cm.tab10(np.linspace(0, 1, len(seeds)))
conf_colors = plt.cm.tab20(np.linspace(0, 1, len(conferences)))

# Pie chart sizes
seed_sizes = seed_counts.values
conf_sizes = conference_counts.values

# Create nested pie chart
fig, ax = plt.subplots(figsize=(10, 10))

# Outer ring (Seeds)
wedgeprops = dict(width=0.3, edgecolor='w')
outside_pie, texts, autotexts = ax.pie(seed_sizes, labels=None, colors=seed_colors, 
                                       radius=1, wedgeprops=wedgeprops, autopct='', pctdistance=0.85)

# Label each slice with seed and count
for wedge, seed, size in zip(outside_pie, seeds, seed_sizes):
    angle = (wedge.theta2 + wedge.theta1) / 2  # Compute the midpoint angle
    x = 1.1 * np.cos(np.radians(angle))  # Adjust text position outside the slice
    y = 1.1 * np.sin(np.radians(angle))
    ax.text(x, y, f"{seed} Seed\n{size} teams", ha='center', va='center', 
            fontsize=9, fontweight='bold')
    
    # Place percentage inside the slice with black text
    x_inside = 0.85 * np.cos(np.radians(angle))
    y_inside = 0.85 * np.sin(np.radians(angle))
    ax.text(x_inside, y_inside, f"{size/sum(seed_sizes)*100:.1f}%", ha='center', va='center', 
            fontsize=9, fontweight='bold', color='black')

# Inner ring (Conferences) with inside labels
wedgeprops_inner = dict(width=0.3, edgecolor='w')
inside_pie, _ = ax.pie(conf_sizes, labels=None, colors=conf_colors, 
                        radius=0.7, wedgeprops=wedgeprops_inner)

# Manually place text inside the slices for conferences
for wedge, conf, size in zip(inside_pie, full_conferences, conf_sizes):
    angle = (wedge.theta2 + wedge.theta1) / 2  # Compute the midpoint angle
    x = 0.5 * np.cos(np.radians(angle))  # Adjust text position inside the inner ring
    y = 0.5 * np.sin(np.radians(angle))
    ax.text(x, y, f"{conf}\n{size} teams\n{size/sum(conf_sizes)*100:.1f}%", ha='center', va='center', 
            fontsize=9, fontweight='bold')

# Add center hole labeled with the number of teams shown in the chart
center_circle = plt.Circle((0, 0), 0.3, color='white')
ax.add_artist(center_circle)
ax.text(0, 0, f"Total Teams:\n{len(final_four_teams)}", ha='center', va='center', fontsize=14, fontweight='bold')

# Title
plt.title("NCAA March Madness Final Four Appearances by Seed and Conference (2013-2022)")

# Show plot
plt.show()

The graph shows how the teams that reached the Final Four, a major milestone for any college team, are distributed by seed and by conference across 2013-2022. Because the filter’s “2nd” label does not match the data’s “2ND” code, the nine runners-up are not counted, so the shares are out of 27 teams. For seeds, as we would expect, 1 and 2 seeds make up the majority of these teams, at roughly 63%. Surprisingly, after the top 2 seeds, seeding is not highly predictive of which teams end up in the Final 4. From 2013-2022, three 7 seeds reached the Final Four, which is as many teams as the 3, 4, and 5 seeds combined. It is not too surprising that no team seeded below 11 reached the Final 4, as 12-16 seeds are usually considered underdogs in the first round, and they would have to win at least 4 consecutive games.

Regarding the conferences, Final Four appearances are spread fairly evenly among the major conferences. The Big East and Atlantic Coast Conference hold the biggest shares at 18.5% each, which is 5 teams apiece. The SEC and Big 12 follow with 4 teams each, and the Big 10 rounds out the top 5 conferences with 3 teams that appeared in the Final 4. I was surprised that the Pac-12, Missouri Valley Conference, and American Athletic Conference each had 2 teams appear in the Final 4. The parity among conferences highlights the unpredictable nature of March Madness. However, when going by seeds, it is pretty evident that 1 and 2 seeds typically make up at least 2 of the Final 4 teams in a given season.
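
The postseason codes printed earlier include “2ND” in uppercase, while the Final Four filter is written with “2nd”. A minimal sketch of how `Series.isin` treats the two spellings, using a made-up five-row series rather than the real data:

```python
import pandas as pd

# Hypothetical postseason codes, spelled the way the dataset spells them
postseason = pd.Series(["F4", "F4", "2ND", "Champions", "E8"])

# Filter as written in the analysis (lowercase "2nd")
as_written = postseason.isin(["F4", "2nd", "Champions"])

# Filter using the code exactly as it appears in the data
with_data_code = postseason.isin(["F4", "2ND", "Champions"])

print(int(as_written.sum()), int(with_data_code.sum()))  # 3 4
```

`isin` compares strings exactly, so the runner-up row is only matched when the case agrees with the data.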

Tab 5

# Heatmap

# Adjust postseason labels
postseason_labels = {
    "R64": "Round of 64", "R32": "Round of 32", "S16": "Sweet 16",
    "E8": "Elite 8", "F4": "Final 4", "2ND": "Runner-Up", "Champions": "Champions"
}

# Define number of tournament wins and games to subtract per round
# (keys must match the codes in the postseason column, where runners-up are "2ND")
postseason_adjustments = {
    "R64": (0, 1), "R32": (1, 2), "S16": (2, 3), "E8": (3, 4),
    "F4": (4, 5), "2ND": (5, 6), "Champions": (6, 6)
}

# Copy df1 to df5
df5 = df1.copy()

# Apply adjustments to compute regular season win percentage
df5["rswin%"] = df5.apply(
    lambda row: (row["w"] - postseason_adjustments.get(row["postseason"], (0, 0))[0]) /
                 (row["g"] - postseason_adjustments.get(row["postseason"], (0, 0))[1]), axis=1)

# Replace postseason codes with full labels
df5["postseason"] = df5["postseason"].map(postseason_labels)

# Ensure postseason rounds are in the correct order
round_order = ["Round of 64", "Round of 32", "Sweet 16", "Elite 8", "Final 4", "Runner-Up", "Champions"]

# Create pivot table for heatmap
data_pivot = df5.groupby(["year", "postseason"])["rswin%"].mean().unstack()

# Reorder columns based on round order
data_pivot = data_pivot[round_order]

# Flip the order of the years
data_pivot = data_pivot.sort_index(ascending=False)

# Create heatmap with blue-to-red gradient
plt.figure(figsize=(10, 6))
ax = sns.heatmap(data_pivot, cmap="coolwarm", annot=True, fmt=".2f", linewidths=0.5, cbar_kws={'label': 'Regular Season Win Percentage'})

# Labels and title
plt.xlabel("March Madness Round Reached")
plt.ylabel("Year")
plt.title("Regular Season Win Percentage by March Madness Round (2013-2022)")

# Show plot
plt.show()

The heatmap shows the regular season win percentage of March Madness teams, grouped by year and by the round reached (tournament games were removed so that deeper runs do not inflate the percentage). The graph shows that, within a season, regular season win percentage does not appear to be a strong predictor of tournament success in the initial rounds: the values are generally very close to each other, and sometimes even decrease for teams that reach later rounds. However, regular season win percentage does appear to matter for the runners-up and champions, as one of those two groups had the highest win percentage of any round in every year. Looking at how each round’s win percentage changed over time, there is no clear trend; the values rise and fall without a consistent pattern. Relating the results to the last plot, we can say that having a high win percentage in the regular season sets teams up for a better seed, which could result in favorable matchups that set up deep tournament runs. Alternatively, the champions and runners-up might have more tournament success not because of matchups due to seeding, but because they are more efficient on both sides of the ball (as discussed with the net rating plot).
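
The adjustment behind the heatmap can be illustrated with two rows shown in the data preview earlier: Kansas in 2022 (40 games, 34 wins, Champions) and Gonzaga in 2022 (32 games, 28 wins, Sweet 16). This is a simplified sketch; the function name is hypothetical, and the dictionary keys are written to match the codes in the data's postseason column.

```python
# (tournament wins, tournament games) to remove for each round reached
POSTSEASON_ADJUSTMENTS = {
    "R64": (0, 1), "R32": (1, 2), "S16": (2, 3), "E8": (3, 4),
    "F4": (4, 5), "2ND": (5, 6), "Champions": (6, 6),
}


def regular_season_win_pct(wins: int, games: int, postseason: str) -> float:
    """Win percentage after removing tournament wins and games."""
    # Unknown codes fall back to no adjustment, mirroring .get(..., (0, 0))
    t_wins, t_games = POSTSEASON_ADJUSTMENTS.get(postseason, (0, 0))
    return (wins - t_wins) / (games - t_games)


kansas = regular_season_win_pct(34, 40, "Champions")  # 28 / 34
gonzaga = regular_season_win_pct(28, 32, "S16")       # 26 / 29
print(round(kansas, 3), round(gonzaga, 3))  # 0.824 0.897
```

Without the adjustment, Kansas would show 34/40 = 0.85, so the six tournament wins would lift the champion's percentage.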

Conclusion

We looked at various factors and metrics that contribute to an NCAA basketball team’s success in March Madness. Since the champions had the highest net rating of any round in every season except 2014, I would consider net rating the most important metric in predicting a team’s success in the tournament. With each year, three-point shooting (specifically three-point percentage) appears to become increasingly important in winning a championship. However, teams that are not elite from beyond the arc can still win it all, as shown by Louisville, with elite defense and rebounding. Nearly 2/3 of the teams in this dataset that reached the Final 4 were 1 or 2 seeds, showing the strength of the high-end favorites and the potential impact of momentum. Conference is not an important predictor provided that the team is from a major conference such as the ACC. Regular season win percentage is only a relevant predictor of success for the runners-up and champions, as the values for the other groups fluctuate without a clear pattern. While March Madness continues to be one of the most unpredictable tournaments in sports, there are some metrics that can serve as moderately reliable predictors of a team’s success.