An Exploratory Analysis of Team Success and Failure (2013–2021)
The goal of this project is to explore the performance metrics of NCAA basketball teams from 2013 to 2021, as provided in the dataset from Kaggle. The dataset, contained in the cbb.csv file, includes key statistics such as wins (W), adjusted offensive efficiency (ADJOE), adjusted defensive efficiency (ADJDE), and postseason outcomes, among others, for multiple seasons (excluding 2020 due to COVID-19’s impact on postseason play).
The goal of this exploratory analysis is threefold:
Identify the statistical patterns that separate successful teams from underperforming ones.
Detect anomalies, such as programs with extreme statistical profiles.
Examine trends across seasons that may reflect changes in style of play or broader strategic evolutions.
These insights can inform sports analysts, coaches, and data scientists, offering perspectives on team strategy optimization and the predictive modeling of tournament performance.
I found the dataset on kaggle.com; it can also be accessed via this link:
I downloaded the dataset from Kaggle as a compressed zip archive and extracted cbb.csv, which contains the combined data from all 11 seasons.
Next, I load all the packages that I will use during this EDA.
library(tidyverse)
library(gganimate)
library(gifski)
library(magick)
library(knitr)
library(ggrepel)
A good practice is to start the analysis with basic descriptive statistics. This allows for a quick inspection of data irregularities and the identification of potential errors, such as missing values or unusual observations.
First, we will check whether our dataset is complete (i.e., contains no NA values) and whether the columns have the appropriate variable types.
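The step that reads cbb.csv into data is not shown. A minimal sketch of what such a step could look like, assuming base R's read.csv() is used (an assumption, although it would be consistent with the integer columns and the X2P_O-style column names in the output below). The miniature file and its rows are hypothetical:

```r
# Hypothetical miniature of cbb.csv, written to a temporary file so the
# example is self-contained. Team names and values are made up.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "TEAM,W,2P_O,POSTSEASON",
  "Team A,30,54.1,R64",
  "Team B,12,47.3,N/A",
  "Team C,20,50.2,NA"
), csv_path)

# read.csv() makes column names syntactically valid, so "2P_O" becomes "X2P_O".
toy <- read.csv(csv_path)
names(toy)

# The literal text NA is read as a missing value, while "N/A" stays a string.
table(toy$POSTSEASON, useNA = "ifany")
```

In the real analysis the call would presumably be something like data <- read.csv("cbb.csv"), with the path adjusted to wherever the file was extracted.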
glimpse(data) # rough summary of the data
Rows: 3,885
Columns: 24
$ TEAM <chr> "North Carolina", "Wisconsin", "Michigan", "Texas Tech", "G…
$ CONF <chr> "ACC", "B10", "B10", "B12", "WCC", "SEC", "B10", "ACC", "AC…
$ G <int> 40, 40, 40, 38, 39, 40, 38, 39, 38, 39, 40, 40, 40, 40, 36,…
$ W <int> 33, 36, 33, 31, 37, 29, 30, 35, 35, 33, 35, 36, 32, 35, 27,…
$ ADJOE <dbl> 123.3, 129.1, 114.4, 115.2, 117.8, 117.2, 121.5, 125.2, 123…
$ ADJDE <dbl> 94.9, 93.6, 90.4, 85.2, 86.3, 96.2, 93.7, 90.6, 89.9, 91.5,…
$ BARTHAG <dbl> 0.9531, 0.9758, 0.9375, 0.9696, 0.9728, 0.9062, 0.9522, 0.9…
$ EFG_O <dbl> 52.6, 54.8, 53.9, 53.5, 56.6, 49.9, 54.6, 56.6, 55.2, 51.7,…
$ EFG_D <dbl> 48.1, 47.7, 47.7, 43.0, 41.1, 46.0, 48.0, 46.5, 44.7, 48.1,…
$ TOR <dbl> 15.4, 12.4, 14.0, 17.7, 16.2, 18.1, 14.6, 16.3, 14.7, 16.2,…
$ TORD <dbl> 18.2, 15.8, 19.5, 22.8, 17.1, 16.1, 18.7, 18.6, 17.5, 18.6,…
$ ORB <dbl> 40.7, 32.1, 25.5, 27.4, 30.0, 42.0, 32.5, 35.8, 30.4, 41.3,…
$ DRB <dbl> 30.0, 23.7, 24.9, 28.7, 26.2, 29.7, 29.4, 30.2, 25.4, 25.0,…
$ FTR <dbl> 32.3, 36.2, 30.7, 32.9, 39.0, 51.8, 28.4, 39.8, 29.1, 34.3,…
$ FTRD <dbl> 30.4, 22.4, 30.0, 36.6, 26.9, 36.8, 22.7, 23.9, 26.3, 31.6,…
$ X2P_O <dbl> 53.9, 54.8, 54.7, 52.8, 56.3, 50.0, 53.4, 55.9, 52.5, 51.0,…
$ X2P_D <dbl> 44.6, 44.7, 46.8, 41.9, 40.0, 44.9, 47.6, 46.3, 45.7, 46.3,…
$ X3P_O <dbl> 32.7, 36.5, 35.2, 36.5, 38.2, 33.2, 37.9, 38.7, 39.5, 35.5,…
$ X3P_D <dbl> 36.2, 37.5, 33.2, 29.7, 29.0, 32.2, 32.6, 31.4, 28.9, 33.9,…
$ ADJ_T <dbl> 71.7, 59.3, 65.9, 67.5, 71.5, 65.9, 64.8, 66.4, 60.7, 72.8,…
$ WAB <dbl> 8.6, 11.3, 6.9, 7.0, 7.7, 3.9, 6.2, 10.7, 11.1, 8.4, 8.9, 1…
$ POSTSEASON <chr> "2ND", "2ND", "2ND", "2ND", "2ND", "2ND", "2ND", "Champions…
$ SEED <chr> "1", "1", "3", "3", "1", "8", "4", "1", "1", "1", "2", "1",…
$ YEAR <int> 2016, 2015, 2018, 2019, 2017, 2014, 2013, 2015, 2019, 2017,…
colSums(is.na(data)) # checking whether there are NAs in each column
      TEAM       CONF          G          W      ADJOE      ADJDE    BARTHAG
         0          0          0          0          0          0          0
     EFG_O      EFG_D        TOR       TORD        ORB        DRB        FTR
         0          0          0          0          0          0          0
      FTRD      X2P_O      X2P_D      X3P_O      X3P_D      ADJ_T        WAB
         0          0          0          0          0          0          0
POSTSEASON       SEED       YEAR
      1979       2258          0
The POSTSEASON and SEED columns contain NA values; let’s see what that means for us.
SEED - Seed in the NCAA March Madness Tournament. We can therefore reasonably assume that an NA in this column means the team did not qualify for the NCAA March Madness Tournament.
POSTSEASON - Round in which the given team was eliminated or in which its season ended.
But why does the number of NA values in the POSTSEASON column differ from that in the SEED column? Let’s check that out!
table_df_postseason <- as.data.frame(table(data['POSTSEASON'], useNA = "ifany" ))
table_df_postseason
   POSTSEASON Freq
1         2ND   11
2   Champions   11
3          E8   44
4          F4   22
5         N/A 1158
6         R32  176
7         R64  352
8         R68   44
9         S16   88
10       <NA> 1979
if (nrow(data['POSTSEASON']) == sum(table_df_postseason$Freq)) {
  cat("The table has the same amount of values as rows in main dataframe \n")
} ## checking whether the table has the correct amount of values
The table has the same amount of values as rows in main dataframe
table_df_seed <- as.data.frame(table(data['SEED'], useNA = "ifany"))
table_df_seed
   SEED Freq
1     1   44
2    10   46
3    11   62
4    12   46
5    13   45
6    14   44
7    15   44
8    16   66
9     2   44
10    3   45
11    4   43
12    5   44
13    6   44
14    7   44
15    8   44
16    9   43
17  N/A  879
18 <NA> 2258
if (nrow(data['SEED']) == sum(table_df_seed$Freq)) {
  cat("The table has the same amount of values as rows in main dataframe")
} ## checking whether the table has the correct amount of values
The table has the same amount of values as rows in main dataframe
Okay, so we have identified the problem: missing values are encoded in two ways, as a true NA and as the string "N/A". To make life easier in this EDA, we will replace all true NA values with the string "N/A".
data <- data %>%
mutate(POSTSEASON = replace_na(POSTSEASON, "N/A"),
SEED = replace_na(SEED, "N/A"))
table(data$POSTSEASON)
2ND Champions E8 F4 N/A R32 R64 R68
11 11 44 22 3137 176 352 44
S16
88
Nice, we quickly took care of that problem.
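To make it easier to compute correlations, we also want a numeric version of the POSTSEASON column, so we will add a new column, POSTSEASON_NUM, that maps each stage to a number (the lower the number, the deeper the run; 400 is used as a placeholder for teams with no postseason result).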
data <- data %>%
mutate(POSTSEASON_NUM = case_when(
POSTSEASON == "Champions" ~ 1,
POSTSEASON == "2ND" ~ 2,
POSTSEASON == "F4" ~ 4,
POSTSEASON == "E8" ~ 8,
POSTSEASON == "S16" ~ 16,
POSTSEASON == "R32" ~ 32,
POSTSEASON == "R64" ~ 64,
POSTSEASON == "R68" ~ 68,
POSTSEASON == "N/A" ~ 400,
))
Correlation analysis will allow us to identify the influence of various factors on the number of wins and the final position in the postseason. We will focus on detecting significant correlations between variables, which will allow for a better understanding of the mechanisms that influence a team’s final results.
A correlation matrix is a table that displays the correlation coefficients between the variables in a data set. It allows us to identify relationships between characteristics, such as whether more hours of study are associated with better exam results. Correlation values range from -1 to 1, where values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.
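Because POSTSEASON_NUM is a hand-made coding, any Pearson coefficient that involves it depends on the numbers chosen for each stage, for example the 400 assigned to teams with no postseason result. A rank-based (Spearman) correlation only uses the ordering of the codes. The sketch below illustrates the difference on simulated data; the strength score, the three-stage setup, and the alternative code of 100 are all hypothetical:

```r
set.seed(1)
strength <- rnorm(200)  # hypothetical team-strength score

# Assign a made-up postseason stage: stronger teams go deeper.
stage <- ifelse(strength > 1.5, "E8", ifelse(strength > 0.5, "R64", "N/A"))

# Two codings of the same outcomes; only the "no postseason" value differs.
code_400 <- c("E8" = 8, "R64" = 64, "N/A" = 400)[stage]
code_100 <- c("E8" = 8, "R64" = 64, "N/A" = 100)[stage]

# Pearson changes with the placeholder value ...
pearson_400 <- cor(strength, code_400)
pearson_100 <- cor(strength, code_100)

# ... while Spearman only depends on the ordering of the codes.
spearman_400 <- cor(strength, code_400, method = "spearman")
spearman_100 <- cor(strength, code_100, method = "spearman")

c(pearson_400 = pearson_400, pearson_100 = pearson_100,
  spearman_400 = spearman_400, spearman_100 = spearman_100)
```

The same idea could be applied to the real data by passing method = "spearman" to cor(), which may be worth comparing against the Pearson values for the POSTSEASON_NUM row.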
data %>%
  select(where(is.numeric)) %>% # keep numeric columns only
  cor() %>%
  ggcorrplot::ggcorrplot(hc.order = TRUE, type = 'lower',
                         method = 'square',
                         lab = TRUE,
                         lab_size = 2,
                         colors = c("red","white","green"),
                         title = "Correlation matrix of the dataset",
                         ggtheme = ggplot2::theme_classic(),
                         tl.cex = 8,
                         tl.srt = 45
  )
This correlation matrix is not really that readable; let’s try to keep only the strong correlations.
cor_mat <- cor(data %>% select(where(is.numeric)))
threshold <- 0.5
cor_mat[abs(cor_mat) < threshold] <- NA # blank out weak correlations
ggcorrplot::ggcorrplot(cor_mat,
                       type = "lower",
                       method = "square",
                       lab = TRUE,
                       lab_size = 2.5,
                       colors = c("red","white","green"),
                       title = "Strong Correlations Only",
                       ggtheme = ggplot2::theme_classic())
In our correlation matrix we can see:
a strong positive correlation (0.73) between ADJOE and W, which suggests that teams with a high ADJOE tend to win more games
a moderate negative correlation (-0.59) between EFG_D and W, which suggests that teams allowing a high Effective Field Goal Percentage tend to win fewer games
We will mostly be looking at what impacts wins, but we will also take a look at postseason position correlations.
Let’s see if these correlations actually show while visualizing data.
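When the plot becomes crowded, the strong pairs can also be listed as a small sorted table. A minimal sketch on simulated data, assuming the same 0.5 threshold as above; the toy data frame and its NOISE column are hypothetical and only borrow the column names W, ADJOE and EFG_D for readability:

```r
set.seed(42)
n <- 100

# Simulated stand-in for the numeric columns (values are made up).
toy <- data.frame(W = rnorm(n))
toy$ADJOE <- toy$W + rnorm(n, sd = 0.5)  # built to correlate positively with W
toy$EFG_D <- -toy$W + rnorm(n, sd = 1)   # built to correlate negatively with W
toy$NOISE <- rnorm(n)                    # unrelated column

toy_cor <- cor(toy)

# Keep each pair once (lower triangle) and only if |r| reaches the threshold.
keep <- abs(toy_cor) >= 0.5 & lower.tri(toy_cor)
idx <- which(keep, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(toy_cor)[idx[, "row"]],
  var2 = colnames(toy_cor)[idx[, "col"]],
  r = round(toy_cor[idx], 2)
)

# Strongest relationships first, regardless of sign.
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
strong_pairs
```

Applied to the real correlation matrix, the same steps would give one row per strong pair, which is easier to scan than a dense grid of labels.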
W and Adjusted Offensive Efficiency ADJOE
In our correlation matrix, the correlation between W and ADJOE was 0.73.
Let’s check whether the plot agrees.
ggplot(data, aes(x = W, y = ADJOE, color = factor(POSTSEASON))) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(
    title = "Scatter plot of Wins vs. Adjusted Offensive Efficiency",
    x = "Wins (W)",
    y = "Adjusted Offensive Efficiency (ADJOE)",
    color = "Postseason position"
  ) +
  theme_classic()
On the scatter plot, we can see that the correlation is consistent with the correlation matrix: the higher the ADJOE, the more wins a team tends to achieve. Additionally, in the upper-right part of the plot, we notice that many of the teams finishing in the top postseason stages are clustered together.
W and X2P_O Two-Point Shooting Percentage
In our correlation matrix, the correlation between W and X2P_O was 0.55.
Let’s see if that holds. Here I took a 2PT shooting percentage of 50% as the cutoff between good and poor shooting. (Via https://www.teamrankings.com/ncaa-basketball/stat/two-point-pct)
p <- ggplot(data, aes(x = W, y = X2P_O, color = factor(POSTSEASON))) +
  geom_point(alpha = 0.7, size = 2) +
  geom_hline(yintercept = 50, linetype = "dashed", color = "black", size = 1) +
  labs(
    title = "Wins vs. 2PT Shooting Percentage ({closest_state})",
    subtitle = "Season: {closest_state}",
    x = "Wins (W)",
    y = "2PT Shooting Percentage (X2P_O)",
    color = "Postseason Stage"
  ) +
  theme_classic() +
  transition_states(YEAR, transition_length = 2, state_length = 5) +
  ease_aes('linear')
animate(p, fps = 5, nframes = 300, width = 800, height = 600, units = "px", res = 72, renderer = gifski_renderer())
anim_save("wins_vs_ax2p_o.gif") # saving the gif on our computer
Looking at the teams with the most wins each season, we can clearly see that they sit above, or oscillate close to, the 50% line, which supports the correlation.
Furthermore, let us look at the share of teams (relative to the number of teams playing that year) that shot above the 50% cutoff. As we can see, it grew continuously throughout 2015-2019, and again through 2021-2024; 2020 is the exception because of COVID-19. This suggests that teams are now focusing on their two-point percentage more than ever.
# Calculate percentage of teams above 50% each year
teams_above_50 <- data %>%
  group_by(YEAR) %>%
  summarise(
    count_above_50 = sum(X2P_O > 50, na.rm = TRUE),
    total_teams = n(),
    .groups = "drop"
  ) %>%
  mutate(perc_above_50 = (count_above_50 / total_teams) * 100)
# Plot percentage over time
ggplot(teams_above_50, aes(x = YEAR, y = perc_above_50)) +
  geom_line(color = "blue", size = 1.2) +
  geom_point(color = "red", size = 2) +
  geom_text_repel(aes(label = sprintf("%.1f%%", perc_above_50)), size = 4, nudge_y = 0, nudge_x = -0.8, direction = "both") +
  labs(
    title = "Percentage of Teams with 2PT% > 50 by Season",
    x = "Season (Year)",
    y = "Percentage of Teams (%)"
  ) +
  scale_x_continuous(breaks = seq(min(teams_above_50$YEAR), max(teams_above_50$YEAR), by = 1)) +
  theme_classic()
W and X3P_O Three-Point Shooting Percentage
In our correlation matrix, the correlation between games won (W) and 3PT shooting percentage (X3P_O) was around 0.42.
Here I took (38%) 3PT Shooting Percentage as a cutoff between good shooting and not. (Via https://www.teamrankings.com/ncaa-basketball/stat/three-point-pct)
q <- ggplot(data, aes(x = W, y = X3P_O, color = factor(POSTSEASON))) +
  geom_point(alpha = 0.7, size = 2) +
  geom_hline(yintercept = 38, linetype = "dashed", color = "black", size = 1) +
  labs(
    title = "Wins vs. 3PT Shooting Percentage ({closest_state})",
    subtitle = "Season: {closest_state}",
    x = "Wins (W)",
    y = "3PT Shooting Percentage (X3P_O)",
    color = "Postseason Stage"
  ) +
  theme_classic() +
  transition_states(YEAR, transition_length = 2, state_length = 5) +
  ease_aes('linear')
animate(q, fps = 5, nframes = 300, width = 800, height = 600, units = "px", res = 72, renderer = gifski_renderer())
anim_save("wins_vs_ax3p_o.gif")
Taking a look at the data shows us that not many teams can be classified as elite shooting teams. This leads me to believe that the link between elite 3PT shooting and wins is weak, at least at the college level.
Looking at the percentage of teams that actually reach the 38% cutoff through the years shows how hard it is to be an elite 3PT shooting team, with fewer and fewer teams achieving it. It might also show that, over the years, defenses have stopped allowing such a volume of good shots.
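One way to put a number on a trend like this is to fit a straight line to the yearly percentages and read off the slope. A minimal sketch, assuming a table with one row per season; the data frame trend and every percentage in it are made up for illustration and are not the values from cbb.csv:

```r
# Hypothetical yearly shares of teams above a shooting cutoff (made-up numbers).
# 2020 is left out, mirroring the gap in the dataset.
trend <- data.frame(
  YEAR = c(2013:2019, 2021:2024),
  perc_above = c(12.0, 11.5, 10.8, 11.0, 9.6, 9.1, 8.7, 6.0, 5.5, 5.2, 4.8)
)

# Ordinary least squares: percentage as a linear function of the season.
fit <- lm(perc_above ~ YEAR, data = trend)

# The slope is the average change in percentage points per season.
slope <- coef(fit)[["YEAR"]]
slope
```

A negative slope would support the reading that fewer teams reach the cutoff each year. With only eleven seasons the estimate is rough, so it is better treated as a summary of the plot than as a formal test.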
# Calculate percentage of teams above 38% each year
teams_above_38 <- data %>%
  group_by(YEAR) %>%
  summarise(
    count_above_38 = sum(X3P_O > 38, na.rm = TRUE),
    total_teams = n(),
    .groups = "drop"
  ) %>%
  mutate(perc_above_38 = (count_above_38 / total_teams) * 100)
# Plot percentage over time
ggplot(teams_above_38, aes(x = YEAR, y = perc_above_38)) +
  geom_line(color = "blue", size = 1.2) +
  geom_point(color = "red", size = 2) +
  geom_text_repel(aes(label = sprintf("%.1f%%", perc_above_38)), size = 4, nudge_y = 0, nudge_x = -0.8, direction = "both") +
  labs(
    title = "Percentage of Teams with 3PT% > 38 by Season",
    x = "Season (Year)",
    y = "Percentage of Teams (%)"
  ) +
  scale_x_continuous(breaks = seq(min(teams_above_38$YEAR), max(teams_above_38$YEAR), by = 1)) +
  theme_classic()
Let us take a look at some of the teams that have the highest win percentage.
First, let’s calculate the win percentage.
data <- data %>%
  mutate(WIN_P = W/G)
sort(data$WIN_P, decreasing = TRUE) %>%
  head(10)
 [1] 1.0526316 1.0344828 1.0000000 1.0000000 1.0000000 0.9743590 0.9714286
 [8] 0.9677419 0.9583333 0.9583333
Notice how some of the WIN_P are above 1. Let us just ignore those cases.
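A WIN_P above 1 can only occur when a row records more wins than games, so these rows are data inconsistencies rather than real results. One way to list them and set them aside, sketched on a hypothetical three-row table whose team names and W and G values are made up for illustration:

```r
# Made-up rows: two of them record more wins than games played.
toy <- data.frame(
  TEAM = c("Team A", "Team B", "Team C"),
  G = c(30, 29, 19),
  W = c(25, 30, 20)
)
toy$WIN_P <- toy$W / toy$G

# Rows that cannot be right: more wins than games.
suspect <- toy[toy$W > toy$G, ]
suspect

# Rows that are kept for the rest of the analysis.
clean <- toy[toy$W <= toy$G, ]
max(clean$WIN_P)
```

Listing the suspect rows first makes it possible to check them against another source before deciding to drop them.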
Let us take a further look at the WIN_P column.
win_p_summary <- data %>%
  group_by(WIN_P) %>%
  summarise(n_teams = n(), .groups = "drop")
summary(data$WIN_P)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.3871  0.5312  0.5192  0.6500  1.0526
first_q <- 0.3871
third_q <- 0.6500
We store the first and third quartiles in variables. Let us take a look at the distribution of teams across WIN_P values.
# Create plot
ggplot(win_p_summary, aes(x = WIN_P, y = n_teams)) +
  geom_point(color = "#FF0066") + # Use points to show frequency
  theme_classic() +
  geom_vline(linetype = "dashed", color = "blue", xintercept = first_q) +
  geom_vline(linetype = "dashed", color = "blue", xintercept = third_q) +
  labs(x = "Win Percentage", y = "Number of Teams") +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1))
On the graph we also marked the first and third quartiles. The distribution looks roughly normal, with slight deviations, especially in the 0.7-0.8 range.
Searching the internet (mostly Reddit), I came to the conclusion that a good college season means a win percentage above 65%.
Look at that: 0.65 WIN_P is also our third quartile, which means that this cutoff selects roughly the top 25% of all team seasons. This reflects well upon the NCAA, suggesting a close league with a high level of competition.
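The quartiles and the share of team seasons at or above a cutoff can also be computed directly rather than read off the summary() printout. A minimal sketch on made-up win percentages (the vector toy_win_p is hypothetical):

```r
# Made-up win percentages for eight team seasons.
toy_win_p <- c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

# First and third quartiles (R's default quantile definition, type 7).
q <- quantile(toy_win_p, probs = c(0.25, 0.75))
first_q_toy <- q[["25%"]]
third_q_toy <- q[["75%"]]

# Share of team seasons at or above the third quartile.
share_top <- mean(toy_win_p >= third_q_toy)
share_top
```

Computing the cutoffs this way keeps them in sync with the data if rows are later removed, for example the seasons with WIN_P above 1.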
Let us now filter out those teams into a different table.
top_teams <- data %>%
  filter(WIN_P >= 0.65) %>%
  filter(WIN_P <= 1) # drop the inconsistent rows with WIN_P above 1
plot_teams_above_065 <- function(data, y_offset = 0.5, bar_color = "#FF0066", covid_year = 2020, covid_color = "gray50") {
  # Count the team seasons at or above the 0.65 cutoff in each year
  top_teams <- data %>%
    group_by(YEAR) %>%
    summarise(
      count_above_065 = sum(WIN_P >= 0.65, na.rm = TRUE),
      .groups = "drop"
    )
  # Add an empty bar for the COVID year if it is missing from the data
  if (!covid_year %in% top_teams$YEAR) {
    top_teams <- rbind(top_teams, data.frame(YEAR = covid_year, count_above_065 = 0))
  }
  top_teams$fill_group <- factor(ifelse(top_teams$YEAR == covid_year, "COVID", "Other"))
  ggplot(top_teams, aes(x = YEAR, y = count_above_065, fill = fill_group)) +
    geom_col() +
    geom_text(aes(label = count_above_065), vjust = -0.5, size = 3, color = "black") +
    labs(title = "Number of teams at or above 0.65 WIN_P",
         x = "Year",
         y = "Number of teams",
         caption = "2020 - 0, because of COVID-19") +
    scale_x_continuous(breaks = seq(min(top_teams$YEAR), max(top_teams$YEAR), by = 1)) +
    scale_fill_manual(values = c("COVID" = covid_color, "Other" = bar_color)) +
    theme_classic() +
    theme(legend.position = "none")
}
plot_teams_above_065(data)
Notice that over the years there has been a slight upward trend in the number of teams that achieved that 0.65 WIN_P.
Let us now see which teams most often had a good season.
team_counts_all <- data %>%
  group_by(TEAM) %>%
  summarise(count = n(), .groups = "drop")
team_counts_top <- top_teams %>%
  group_by(TEAM) %>%
  summarise(count = n(),
            avg_postseason_finish = mean(POSTSEASON_NUM, na.rm = TRUE))
top_10_by_count <- team_counts_top %>%
  arrange(desc(count)) %>%
  head(10)
count_summary <- team_counts_top %>%
  group_by(count) %>%
  summarise(n_teams = n(), .groups = "drop")
ggplot(count_summary, aes(x = count, y = n_teams)) +
  geom_col(fill = "#FF0066") +
  geom_text(aes(label = n_teams), vjust = -0.5, size = 3, color = "black") + # Add text labels
  theme_classic() +
  labs(x = "Count (Number of Appearances)", y = "Number of Teams") +
  scale_x_continuous(breaks = unique(count_summary$count))
Looking at the top teams that achieve that (0.65) win percentage we can see how many times each program (team) appears throughout the years. Let’s take a look at the teams that appeared the most.
I have chosen the ten teams that appeared most often and added their average postseason finish (the lower the value, the better the program's results over the years).
ggplot(top_10_by_count, aes(x = reorder(TEAM, avg_postseason_finish, decreasing = TRUE), y = avg_postseason_finish, fill = avg_postseason_finish)) +
  geom_col() +
  geom_text(aes(label = round(avg_postseason_finish, 0)), vjust = -0.5, size = 3, color = "black") +
  scale_fill_gradient(low = "green", high = "red") +
  theme_classic() +
  labs(title = "AVG postseason position", x = "Team", y = "AVG postseason position") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 7, color = "black"),
    legend.position = "none"
  )
As we could have expected, some of the greatest college teams are here at the top (Gonzaga, Duke, NCU); their low average postseason position indicates consistently deep postseason runs.
Belmont, on the other hand, is a team that consistently has a good regular season but fails to appear at the postseason stage.
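The average postseason position is strongly affected by the value 400 assigned to seasons without a postseason result, so it is worth reading alongside a second measure. A small worked example with two hypothetical programs (both finish vectors are made up):

```r
# Hypothetical POSTSEASON_NUM values over five good regular seasons.
one_title_four_misses  <- c(1, 400, 400, 400, 400)  # one championship, four seasons with no result
five_first_round_exits <- c(64, 64, 64, 64, 64)     # eliminated in the R64 stage every year

# The mean ranks the former champion far behind the perennial first-round exit.
mean(one_title_four_misses)   # 320.2
mean(five_first_round_exits)  # 64

# A complementary view: share of seasons with any postseason result.
mean(one_title_four_misses != 400)
mean(five_first_round_exits != 400)
```

Reporting the share of seasons with a postseason result next to the average (or using the median finish) would make profiles like Belmont's easier to tell apart from programs with rare but deep runs.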
Let’s take a look at the lower ranked teams.
First of all let’s look at our win percentage column that we made earlier.
summary(data$WIN_P)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.3871  0.5312  0.5192  0.6500  1.0526
Let us presume that a bad season will be below the first quartile, so underneath (0.3871) WIN_P.
plot_teams_below_039 <- function(data, y_offset = 0.5, bar_color = "#FF0066", covid_year = 2020, covid_color = "gray50") {
  # Count the team seasons at or below the first quartile (0.3871) in each year
  worst_teams <- data %>%
    group_by(YEAR) %>%
    summarise(
      count_below_039 = sum(WIN_P <= 0.3871, na.rm = TRUE),
      .groups = "drop"
    )
  # Add an empty bar for the COVID year if it is missing from the data
  if (!covid_year %in% worst_teams$YEAR) {
    worst_teams <- rbind(worst_teams, data.frame(YEAR = covid_year, count_below_039 = 0))
  }
  worst_teams$fill_group <- factor(ifelse(worst_teams$YEAR == covid_year, "COVID", "Other"))
  ggplot(worst_teams, aes(x = YEAR, y = count_below_039, fill = fill_group)) +
    geom_col() +
    geom_text(aes(label = count_below_039), vjust = -0.5, size = 3, color = "black") +
    labs(title = "Number of teams below 0.39 WIN_P",
         x = "Year",
         y = "Number of teams",
         caption = "2020 - 0, because of COVID-19") +
    scale_x_continuous(breaks = seq(min(worst_teams$YEAR), max(worst_teams$YEAR), by = 1)) +
    scale_fill_manual(values = c("COVID" = covid_color, "Other" = bar_color)) +
    theme_classic() +
    theme(legend.position = "none")
}
plot_teams_below_039(data)
Teams with win percentages below the first quartile (0.39) were categorized as underperformers. The number of such “struggling” programs has decreased slightly across the observed seasons, indicating a narrowing competitive gap in Division I basketball. This shift may reflect improved parity, better talent distribution, and rising program quality across conferences.
Through this exploratory data analysis of NCAA basketball team statistics from 2013 to 2024, several key patterns emerged regarding team success. The dataset revealed a strong positive correlation (0.73) between wins (W) and adjusted offensive efficiency (ADJOE), indicating that teams with superior offensive ratings tend to secure more victories and advance further in the postseason. A moderate correlation (0.55) was also observed with two-point shooting percentage (X2P_O), where teams exceeding the benchmark of 50% (a threshold aligned with top performers in recent seasons) consistently ranked among the winners. Over time, the proportion of teams achieving this 50% mark has increased, from around 40% in earlier years to over 60% by 2024, suggesting a growing emphasis on efficient interior scoring and close-range shot selection in modern college basketball strategies.
In contrast, the correlation between wins and three-point shooting percentage (X3P_O) was weaker (0.42), with few teams reaching the elite benchmark of 38%. The percentage of such high-accuracy three-point teams has declined sharply over the years, dropping from approximately 10-15% in the early 2010s to under 5% by 2024. This trend may be attributed to the NCAA’s extension of the three-point line in the 2019-20 season, which coincided with the lowest Division I-wide three-point shooting average on record (33.47%), as well as evolving defensive schemes that limit long-range opportunities.
Powerhouse programs such as Duke or Gonzaga repeatedly sustain both strong regular-season records and deep postseason runs, while others demonstrate regular-season strength without tournament validation. Competitive balance also appears to be improving: the decline in the number of chronically underperforming teams hints at broader parity, which keeps NCAA basketball a highly competitive environment.
Overall, these findings underscore the evolving nature of NCAA basketball, offering actionable insights for coaches, analysts, and fans aiming to understand the drivers of success in a competitive landscape.