Uncovering Patterns in NCAA Basketball Team Statistics

An Exploratory Analysis of Team Success and Failure (2013–2024)

Author: Jakub Zoldak

Affiliation: Lublin University of Technology

Published: September 9, 2025

Introduction

This project explores the performance metrics of NCAA basketball teams from 2013 to 2024, using a dataset from Kaggle. The dataset, contained in the cbb.csv file, includes key statistics such as wins (W), adjusted offensive efficiency (ADJOE), adjusted defensive efficiency (ADJDE), and postseason outcomes, among others, for 11 seasons (2020 is excluded because COVID-19 cancelled postseason play).

The goal of this exploratory analysis is threefold:

Identify the statistical patterns that separate successful teams from underperforming ones.
Detect anomalies, such as programs with extreme statistical profiles.
Examine trends across seasons that may reflect changes in style of play or broader strategic evolutions.

These insights can inform sports analysts, coaches, and data scientists, offering perspectives on team strategy optimization and the predictive modeling of tournament performance.

I found my dataset on kaggle.com. You can access it via this link:

https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset

Loading in our dataset + adding packages in R

I downloaded the dataset as a compressed zip from Kaggle and then extracted cbb.csv, which contains combined data from all 11 seasons.

I then loaded all the packages that I will use during this EDA, along with the dataset itself.

Code

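library(tidyverse)
library(gganimate)
library(gifski)
library(magick)
library(knitr)
library(ggrepel)

# Load the extracted file (assumes cbb.csv sits in the working directory)
data <- read.csv("cbb.csv")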

Look at the data and clean it

A good practice is to start the analysis with basic descriptive statistics. This allows for a quick inspection of data irregularities and the identification of potential errors, such as missing values or unusual observations.

First, we will check whether our dataset is complete (i.e., contains no NA values) and whether the columns have the appropriate variable types.

Code

glimpse(data) # rough summary of the data

Rows: 3,885
Columns: 24
$ TEAM       <chr> "North Carolina", "Wisconsin", "Michigan", "Texas Tech", "G…
$ CONF       <chr> "ACC", "B10", "B10", "B12", "WCC", "SEC", "B10", "ACC", "AC…
$ G          <int> 40, 40, 40, 38, 39, 40, 38, 39, 38, 39, 40, 40, 40, 40, 36,…
$ W          <int> 33, 36, 33, 31, 37, 29, 30, 35, 35, 33, 35, 36, 32, 35, 27,…
$ ADJOE      <dbl> 123.3, 129.1, 114.4, 115.2, 117.8, 117.2, 121.5, 125.2, 123…
$ ADJDE      <dbl> 94.9, 93.6, 90.4, 85.2, 86.3, 96.2, 93.7, 90.6, 89.9, 91.5,…
$ BARTHAG    <dbl> 0.9531, 0.9758, 0.9375, 0.9696, 0.9728, 0.9062, 0.9522, 0.9…
$ EFG_O      <dbl> 52.6, 54.8, 53.9, 53.5, 56.6, 49.9, 54.6, 56.6, 55.2, 51.7,…
$ EFG_D      <dbl> 48.1, 47.7, 47.7, 43.0, 41.1, 46.0, 48.0, 46.5, 44.7, 48.1,…
$ TOR        <dbl> 15.4, 12.4, 14.0, 17.7, 16.2, 18.1, 14.6, 16.3, 14.7, 16.2,…
$ TORD       <dbl> 18.2, 15.8, 19.5, 22.8, 17.1, 16.1, 18.7, 18.6, 17.5, 18.6,…
$ ORB        <dbl> 40.7, 32.1, 25.5, 27.4, 30.0, 42.0, 32.5, 35.8, 30.4, 41.3,…
$ DRB        <dbl> 30.0, 23.7, 24.9, 28.7, 26.2, 29.7, 29.4, 30.2, 25.4, 25.0,…
$ FTR        <dbl> 32.3, 36.2, 30.7, 32.9, 39.0, 51.8, 28.4, 39.8, 29.1, 34.3,…
$ FTRD       <dbl> 30.4, 22.4, 30.0, 36.6, 26.9, 36.8, 22.7, 23.9, 26.3, 31.6,…
$ X2P_O      <dbl> 53.9, 54.8, 54.7, 52.8, 56.3, 50.0, 53.4, 55.9, 52.5, 51.0,…
$ X2P_D      <dbl> 44.6, 44.7, 46.8, 41.9, 40.0, 44.9, 47.6, 46.3, 45.7, 46.3,…
$ X3P_O      <dbl> 32.7, 36.5, 35.2, 36.5, 38.2, 33.2, 37.9, 38.7, 39.5, 35.5,…
$ X3P_D      <dbl> 36.2, 37.5, 33.2, 29.7, 29.0, 32.2, 32.6, 31.4, 28.9, 33.9,…
$ ADJ_T      <dbl> 71.7, 59.3, 65.9, 67.5, 71.5, 65.9, 64.8, 66.4, 60.7, 72.8,…
$ WAB        <dbl> 8.6, 11.3, 6.9, 7.0, 7.7, 3.9, 6.2, 10.7, 11.1, 8.4, 8.9, 1…
$ POSTSEASON <chr> "2ND", "2ND", "2ND", "2ND", "2ND", "2ND", "2ND", "Champions…
$ SEED       <chr> "1", "1", "3", "3", "1", "8", "4", "1", "1", "1", "2", "1",…
$ YEAR       <int> 2016, 2015, 2018, 2019, 2017, 2014, 2013, 2015, 2019, 2017,…
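The X2P_O and X3P_O names in the output above look like the result of base R's name sanitizing rather than the file's own headers. The following is a minimal sketch, assuming the CSV headers are 2P_O and 3P_O (the conclusion refers to them that way) and that the file was read with read.csv(). The two-row CSV text and the team names are made up for illustration.

```r
# Hypothetical two-row extract; only the header names matter here.
csv_text <- "TEAM,2P_O,3P_O\nTeam A,55.9,38.7\nTeam B,56.3,38.2"

# The default check.names = TRUE runs make.names(), which prefixes "X"
# to column names that start with a digit.
sanitized <- read.csv(text = csv_text)
names(sanitized)

# check.names = FALSE keeps the headers exactly as written in the file.
original <- read.csv(text = csv_text, check.names = FALSE)
names(original)
```

With the sanitized names, columns can be referenced without backticks (data$X2P_O), which is why the rest of this analysis uses the X-prefixed form.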

Code

colSums(is.na(data)) # checking whether there are NAs in each column

      TEAM       CONF          G          W      ADJOE      ADJDE    BARTHAG 
         0          0          0          0          0          0          0 
     EFG_O      EFG_D        TOR       TORD        ORB        DRB        FTR 
         0          0          0          0          0          0          0 
      FTRD      X2P_O      X2P_D      X3P_O      X3P_D      ADJ_T        WAB 
         0          0          0          0          0          0          0 
POSTSEASON       SEED       YEAR 
      1979       2258          0

The POSTSEASON and SEED columns contain NA values. Let’s see what that means for us.

SEED - Seed in the NCAA March Madness Tournament. We can reasonably assume that an NA value in this column means that the team failed to qualify for the NCAA March Madness Tournament.

POSTSEASON - Round where the given team was eliminated or where their season ended.

Hmm, but why does the number of NA values in the POSTSEASON column differ from the SEED column? Let’s check that out!

Code

table_df_postseason <- as.data.frame(table(data['POSTSEASON'], useNA = "ifany" ))

table_df_postseason

   POSTSEASON Freq
1         2ND   11
2   Champions   11
3          E8   44
4          F4   22
5         N/A 1158
6         R32  176
7         R64  352
8         R68   44
9         S16   88
10       <NA> 1979

Code

if(nrow(data['POSTSEASON']) == sum(table_df_postseason$Freq)){
  cat("The table has the same amount of values as rows in main dataframe \n")
} ## checking whether the table has the correct amount of values

The table has the same amount of values as rows in main dataframe

Code

table_df_seed <- as.data.frame(table(data['SEED'], useNA = "ifany"))

table_df_seed

   SEED Freq
1     1   44
2    10   46
3    11   62
4    12   46
5    13   45
6    14   44
7    15   44
8    16   66
9     2   44
10    3   45
11    4   43
12    5   44
13    6   44
14    7   44
15    8   44
16    9   43
17  N/A  879
18 <NA> 2258

Code

if(nrow(data['SEED']) == sum(table_df_seed$Freq)){
   cat("The table has the same amount of values as rows in main dataframe")
} ## checking whether the table has the correct amount of values

The table has the same amount of values as rows in main dataframe

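Okay, so we have identified the problem: our data encodes missing values in two ways, as a true NA and as the string "N/A". In both columns the two encodings add up to the same total (1979 + 1158 = 3137 for POSTSEASON and 2258 + 879 = 3137 for SEED), so the counts agree once the encodings are merged. To make life easier in this EDA, we will replace all true NA values with the string "N/A".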

Code

data <- data %>%
  mutate(POSTSEASON = replace_na(POSTSEASON, "N/A"),
         SEED = replace_na(SEED, "N/A"))
table(data$POSTSEASON)


      2ND Champions        E8        F4       N/A       R32       R64       R68 
       11        11        44        22      3137       176       352        44 
      S16 
       88

Nice, we quickly took care of that problem.
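The two encodings can also be reconciled at import time. The sketch below is an assumption about how the mix arises (read.csv() treats the literal text NA as missing by default, but not N/A); the three-row CSV text is made up and is not taken from cbb.csv.

```r
# Hypothetical extract with both encodings of "no postseason".
csv_text <- "TEAM,POSTSEASON,SEED\nTeam A,R64,12\nTeam B,N/A,N/A\nTeam C,NA,NA"

# Default import: "NA" becomes a true NA, "N/A" stays an ordinary string.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
table(raw$POSTSEASON, useNA = "ifany")

# Declaring both spellings as missing yields a single encoding,
# and SEED can then be parsed as an integer column.
clean <- read.csv(text = csv_text, na.strings = c("NA", "N/A"),
                  stringsAsFactors = FALSE)
colSums(is.na(clean))
str(clean$SEED)
```

This goes in the opposite direction from the replace_na() step above (everything becomes a true NA instead of the string "N/A"). Which direction is more convenient depends on whether the columns are later used as categories or as numbers.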

Correlations are easier to compute on numbers, so we will add a numeric version of the POSTSEASON column called POSTSEASON_NUM. Lower values mean a deeper run, and teams without a postseason result get the placeholder value 400.

Code

data <- data %>%
  mutate(POSTSEASON_NUM = case_when(
    POSTSEASON == "Champions" ~ 1,
    POSTSEASON == "2ND"       ~ 2,
    POSTSEASON == "F4"        ~ 4,
    POSTSEASON == "E8"        ~ 8,
    POSTSEASON == "S16"       ~ 16,
    POSTSEASON == "R32"       ~ 32,
    POSTSEASON == "R64"       ~ 64,
    POSTSEASON == "R68"       ~ 68,
    POSTSEASON == "N/A"       ~ 400,
  ))
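The placeholder 400 is an arbitrary choice, and a Pearson correlation depends on it. As a hedged illustration, the sketch below compares Pearson and Spearman (rank-based) correlations under two different placeholders. The six-row toy data and the helper code_stage() are hypothetical and only mirror the coding scheme above.

```r
# Made-up seasons; not rows from cbb.csv.
toy <- data.frame(
  W = c(35, 30, 28, 25, 20, 12),
  POSTSEASON = c("Champions", "F4", "S16", "R64", "N/A", "N/A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: same mapping as above, with a configurable placeholder.
code_stage <- function(stage, na_code) {
  codes <- c("Champions" = 1, "2ND" = 2, "F4" = 4, "E8" = 8, "S16" = 16,
             "R32" = 32, "R64" = 64, "R68" = 68, "N/A" = na_code)
  unname(codes[stage])
}

num_400 <- code_stage(toy$POSTSEASON, 400)
num_100 <- code_stage(toy$POSTSEASON, 100)

# Pearson changes with the placeholder value...
pearson_400 <- cor(toy$W, num_400)
pearson_100 <- cor(toy$W, num_100)

# ...while Spearman only uses the ordering, so it stays the same.
spearman_400 <- cor(toy$W, num_400, method = "spearman")
spearman_100 <- cor(toy$W, num_100, method = "spearman")

c(pearson_400 = pearson_400, pearson_100 = pearson_100,
  spearman_400 = spearman_400, spearman_100 = spearman_100)
```

Correlations involving POSTSEASON_NUM in the matrix below are therefore best read as a rough direction rather than a precise strength.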

Correlation Analysis

Correlation analysis will allow us to identify the influence of various factors on the number of wins and the final position in the postseason. We will focus on detecting significant correlations between variables, which will allow for a better understanding of the mechanisms that influence a team’s final results.

Correlation matrix

A correlation matrix is a table that displays the correlation coefficients between pairs of variables in a data set. It allows us to identify relationships between characteristics, such as whether more hours of study are associated with better exam results. Correlation values range from -1 to 1, where:

1 indicates a perfect positive linear relationship,
-1 indicates a perfect negative linear relationship,
0 indicates no linear relationship.
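To make the coefficient concrete, here is a minimal sketch that computes the Pearson correlation by hand and compares it with cor(). It uses only the first eight W and ADJOE values visible in the glimpse() output above, so the result describes those eight rows and not the 0.73 reported for the full dataset.

```r
# First eight values of W and ADJOE as printed by glimpse(data).
w     <- c(33, 36, 33, 31, 37, 29, 30, 35)
adjoe <- c(123.3, 129.1, 114.4, 115.2, 117.8, 117.2, 121.5, 125.2)

# Pearson r: covariance of the centred values divided by the
# product of their spreads.
dw <- w - mean(w)
da <- adjoe - mean(adjoe)
r_manual <- sum(dw * da) / sqrt(sum(dw^2) * sum(da^2))

# Built-in equivalent (method = "pearson" is the default).
r_builtin <- cor(w, adjoe)

c(manual = r_manual, builtin = r_builtin)
```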

Code

data %>% 
  select(where(is.numeric)) %>%  # select_if() is superseded by where()
  cor() %>% 
  ggcorrplot::ggcorrplot(hc.order = TRUE, type = 'lower',
                         method = 'square',
                         lab = TRUE,
                         lab_size = 2,
                         colors = c("red","white","green"),
                         title = "Correlation matrix of the dataset",
                         ggtheme = ggplot2::theme_classic(),
                         tl.cex = 8,
                         tl.srt = 45
                         )

This correlation matrix is not very readable, so let’s filter it down to the strong correlations only.

Code

cor_mat <- cor(data %>% select(where(is.numeric)))

threshold <- 0.5
cor_mat[abs(cor_mat) < threshold] <- NA  

ggcorrplot::ggcorrplot(cor_mat,
                       type = "lower",
                       method = "square",
                       lab = TRUE,
                       lab_size = 2.5,
                       colors = c("red","white","green"),
                       title = "Strong Correlations Only",
                       ggtheme = ggplot2::theme_classic())

In our correlation matrix we can see:

a strong positive correlation (0.73) between ADJOE and W, which suggests that teams with a high ADJOE tend to win more games
a moderate negative correlation (-0.59) between EFG_D and W, which suggests that teams allowing a high effective field goal percentage tend to win fewer games
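Pairs like these can also be listed as a table instead of being read off the plot. The sketch below is one possible way to do it; strong_pairs() is a hypothetical helper, and the input is limited to the first eight rows of four columns visible in the glimpse() output, so the numbers will not match the full-data matrix.

```r
# First eight rows of four columns, as printed by glimpse(data).
toy <- data.frame(
  W     = c(33, 36, 33, 31, 37, 29, 30, 35),
  ADJOE = c(123.3, 129.1, 114.4, 115.2, 117.8, 117.2, 121.5, 125.2),
  ADJDE = c(94.9, 93.6, 90.4, 85.2, 86.3, 96.2, 93.7, 90.6),
  EFG_O = c(52.6, 54.8, 53.9, 53.5, 56.6, 49.9, 54.6, 56.6)
)

# Hypothetical helper: keep each variable pair once (upper triangle only)
# when the absolute correlation reaches the threshold.
strong_pairs <- function(cor_mat, threshold = 0.5) {
  keep <- upper.tri(cor_mat) & abs(cor_mat) >= threshold
  idx <- which(keep, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]  # strongest relationships first
}

pairs <- strong_pairs(cor(toy), threshold = 0.5)
pairs
```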

We will mostly be looking at what impacts wins, but we will also take a look at postseason position correlations.

Let’s see if these correlations actually show while visualizing data.

Correlation between number of games won `W` and Adjusted Offensive Efficiency `ADJOE`

In our correlation matrix, the correlation between W and ADJOE was 0.73.

Let’s check whether the scatter plot agrees.

Code

ggplot(data, aes(x = W, y = ADJOE, color = factor(POSTSEASON)))+
  geom_point(alpha = 0.6, size = 2)+
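  geom_smooth(aes(group = 1), method = "lm", color = "red", se = TRUE)+  # group = 1 fits one trend line for all teams instead of one per postseason stage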
  labs(
    title = "Scatter plot of Wins vs. Adjusted Offensive Efficiency",
    x = "Wins (W)",
    y = "Adjusted Offensive Efficiency (ADJOE)",
    color = "Postseason position"
  )+
  theme_classic()

On the scatter plot, we can see that the correlation is consistent with the correlation matrix: the higher the ADJOE, the more wins a team tends to achieve. Additionally, in the upper-right part of the plot, we notice that many of the teams finishing in the top postseason stages are clustered together.
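The red line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit, and for a single predictor its R-squared equals the squared correlation. A small sketch of that link, again using only the first eight W and ADJOE values from the glimpse() output (so the fitted numbers are not those of the full plot):

```r
# First eight values of W and ADJOE as printed by glimpse(data).
toy <- data.frame(
  W     = c(33, 36, 33, 31, 37, 29, 30, 35),
  ADJOE = c(123.3, 129.1, 114.4, 115.2, 117.8, 117.2, 121.5, 125.2)
)

# Same orientation as the plot: ADJOE on the y axis, W on the x axis.
fit <- lm(ADJOE ~ W, data = toy)

slope     <- unname(coef(fit)["W"])   # change in ADJOE per extra win
r_squared <- summary(fit)$r.squared   # share of variance explained
r         <- cor(toy$W, toy$ADJOE)

c(slope = slope, r_squared = r_squared, r_squared_from_cor = r^2)
```

Applied to the full data, a correlation of 0.73 would correspond to an R-squared of about 0.53, meaning wins account for roughly half of the variation in ADJOE in a straight-line fit.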

Correlation between number of games won `W` and `X2P_O` Two-Point Shooting Percentage

In our correlation matrix, the correlation between W and X2P_O was 0.55.

Let’s see if that holds. Here I took a 2PT shooting percentage of 50% as the cutoff between good and poor shooting (via https://www.teamrankings.com/ncaa-basketball/stat/two-point-pct).

Code

p <- ggplot(data, aes(x = W, y = X2P_O, color = factor(POSTSEASON))) +
  geom_point(alpha = 0.7, size = 2)+
  geom_hline(yintercept = 50, linetype = "dashed", color = "black", linewidth = 1)+  # linewidth replaces size for lines in ggplot2 >= 3.4
  labs(
    title = "Wins vs. 2PT Shooting Percentage  ({closest_state})",
    subtitle = "Season: {closest_state}",
    x = "Wins (W)",
    y = "2PT Shooting Percentage (X2P_O)",
    color = "Postseason Stage"
  ) +
  theme_classic() +
  transition_states(YEAR, transition_length = 2, state_length = 5) +
  ease_aes('linear')

animate(p,fps = 5, nframes = 300, width = 800, height = 600,units = "px", res = 72, renderer = gifski_renderer())

Code

anim_save("wins_vs_ax2p_o.gif") # save the most recently rendered animation to disk

Looking at the teams with the most wins each season, we can see that they sit above, or at least close to, the 50% line, which is consistent with the correlation.

Furthermore, let us look at the share of teams (relative to the number of teams playing that year) that shot above the 50% cutoff. As we can see, it grew steadily throughout 2015-2019 and again through 2021-2024. The 2020 season is absent from the dataset because of COVID-19, so we treat it as an exception. This suggests that teams are focusing on two-point efficiency now more than ever.

Code

# Calculate percentage of teams above 50% each year
teams_above_50 <- data %>%
  group_by(YEAR) %>%
  summarise(
    count_above_50 = sum(X2P_O > 50, na.rm = TRUE),
    total_teams = n(),
    .groups = "drop"
  ) %>%
  mutate(perc_above_50 = (count_above_50 / total_teams) * 100)

# Plot percentage over time
ggplot(teams_above_50, aes(x = YEAR, y = perc_above_50)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point(color = "red", size = 2) +
  geom_text_repel(aes(label = sprintf("%.1f%%", perc_above_50)), size = 4,nudge_y = 0,nudge_x = -0.8 ,direction = "both") +
  labs(
    title = "Percentage of Teams with 2PT% > 50 by Season",
    x = "Season (Year)",
    y = "Percentage of Teams (%)"
  )+ scale_x_continuous(breaks = seq(min(teams_above_50$YEAR),max(teams_above_50$YEAR),by = 1))+
  theme_classic()
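Because 2020 has no rows, a line chart connects 2019 directly to 2021. One way to make the gap explicit is to add the missing season as a row with NA, which plotting functions that break lines at missing values (base plot(type = "l") does this) will then render as a break. A minimal sketch, where the percentages are made up and only the year layout follows the dataset:

```r
# Made-up percentages for illustration only; 2020 is deliberately absent.
observed <- data.frame(
  YEAR = c(2013:2019, 2021:2024),
  perc_above_50 = c(20, 22, 21, 25, 28, 30, 33, 35, 37, 40, 42)
)

# Build the full run of seasons and left-join the observed values onto it.
all_years <- data.frame(YEAR = seq(min(observed$YEAR), max(observed$YEAR)))
with_gap <- merge(all_years, observed, by = "YEAR", all.x = TRUE)

# The 2020 row now exists and carries NA instead of being skipped.
with_gap[with_gap$YEAR %in% 2019:2021, ]
```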

Correlation between number of games won `W` and `X3P_O` Three-Point Shooting Percentage

In our correlation matrix, the correlation between games won (W) and 3PT shooting percentage (X3P_O) was around 0.42.

Here I took a 3PT shooting percentage of 38% as the cutoff between good and poor shooting (via https://www.teamrankings.com/ncaa-basketball/stat/three-point-pct).

Code

q <- ggplot(data, aes(x = W, y = X3P_O, color = factor(POSTSEASON))) +
  geom_point(alpha = 0.7, size = 2) +
  geom_hline(yintercept = 38, linetype = "dashed", color = "black", linewidth = 1)+  # linewidth replaces size for lines in ggplot2 >= 3.4
  labs(
    title = "Wins vs. 3PT Shooting Percentage  ({closest_state})",
    subtitle = "Season: {closest_state}",
    x = "Wins (W)",
    y = "3PT Shooting Percentage (X3P_O)",
    color = "Postseason Stage"
  ) +
  theme_classic() +
  transition_states(YEAR, transition_length = 2, state_length = 5) +
  ease_aes('linear')

animate(q,fps = 5, nframes = 300, width = 800, height = 600,units = "px", res = 72, renderer = gifski_renderer())

Code

anim_save("wins_vs_ax3p_o.gif")

Taking a look at the data shows that few teams can be classified as elite shooting teams. This leads me to believe that elite 3PT shooting is at best weakly related to wins, at least at the college level.

Looking at the percentage of teams that reach the 38% cutoff through the years shows how hard it is to be an elite 3PT shooting team, with fewer and fewer teams achieving it. It might also suggest that defenses have increasingly limited good three-point opportunities over the years.

Code

# Calculate percentage of teams above 38% each year
teams_above_38 <- data %>%
  group_by(YEAR) %>%
  summarise(
    count_above_38 = sum(X3P_O > 38, na.rm = TRUE),
    total_teams = n(),
    .groups = "drop"
  ) %>%
  mutate(perc_above_38 = (count_above_38 / total_teams) * 100)

# Plot percentage over time
ggplot(teams_above_38, aes(x = YEAR, y = perc_above_38)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point(color = "red", size = 2) +
  geom_text_repel(aes(label = sprintf("%.1f%%", perc_above_38)), size = 4,nudge_y = 0,nudge_x = -0.8 ,direction = "both") +
  labs(
    title = "Percentage of Teams with 3PT% > 38 by Season",
    x = "Season (Year)",
    y = "Percentage of Teams (%)"
  )+ scale_x_continuous(breaks = seq(min(teams_above_38$YEAR),max(teams_above_38$YEAR),by = 1))+
  theme_classic()
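The 2PT and 3PT summaries differ only in the column and the cutoff, so the calculation could be wrapped in one function. The sketch below shows the idea in base R; share_above() is a hypothetical helper and the six-row data frame is made up.

```r
# Made-up team seasons; not rows from cbb.csv.
toy <- data.frame(
  YEAR  = c(2013, 2013, 2014, 2014, 2014, 2014),
  X2P_O = c(51.0, 49.0, 52.0, 53.0, 48.0, 50.5),
  X3P_O = c(39.0, 33.0, 36.0, 38.5, 31.0, 35.0)
)

# Hypothetical helper: percentage of teams per season whose value in
# `column` is strictly above `cutoff`.
share_above <- function(df, column, cutoff) {
  above <- df[[column]] > cutoff
  pct <- tapply(above, df$YEAR, function(x) 100 * mean(x, na.rm = TRUE))
  data.frame(YEAR = as.integer(names(pct)), perc_above = as.numeric(pct))
}

two_pt   <- share_above(toy, "X2P_O", 50)
three_pt <- share_above(toy, "X3P_O", 38)
two_pt
three_pt
```

A single helper keeps variable names and comments from drifting apart between the two analyses.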

Analysis of winning teams

Let us take a look at some of the teams that have the highest win percentage.

Calculating win percentage

First, let’s calculate the win percentage.

Code

data <- data %>% 
  mutate(WIN_P = W/G)

sort(data$WIN_P, decreasing = T) %>% 
  head(10)

 [1] 1.0526316 1.0344828 1.0000000 1.0000000 1.0000000 0.9743590 0.9714286
 [8] 0.9677419 0.9583333 0.9583333

Notice how some of the WIN_P values are above 1. A team cannot win more games than it plays, so these rows most likely contain errors in W or G. We will ignore those cases.
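Before ignoring such rows, it can help to look at them. The sketch below flags seasons where W exceeds G. The teams and game counts are hypothetical, chosen so that the first two ratios reproduce the two values above 1 in the output (40/38 and 30/29); the actual rows in cbb.csv may differ.

```r
# Hypothetical rows; the first two give 1.0526316 and 1.0344828.
toy <- data.frame(
  TEAM = c("Team A", "Team B", "Team C"),
  G = c(38, 29, 30),
  W = c(40, 30, 15),
  stringsAsFactors = FALSE
)
toy$WIN_P <- toy$W / toy$G

# Rows that cannot be right: more wins than games played.
suspect <- toy[toy$W > toy$G, ]
suspect

# Rows that are safe to keep for the rest of the analysis.
valid <- toy[toy$W <= toy$G, ]
valid
```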

Let us take a further look at the WIN_P column.

Code

win_p_summary <- data %>%
  group_by(WIN_P) %>%
  summarise(n_teams = n(), .groups = "drop")

summary(data$WIN_P)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.3871  0.5312  0.5192  0.6500  1.0526

Code

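# Compute the quartiles instead of copying them from the printed summary
first_q <- unname(quantile(data$WIN_P, 0.25))  # about 0.3871
third_q <- unname(quantile(data$WIN_P, 0.75))  # about 0.6500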

We store the first and third quartiles in variables. Let us take a look at how teams are distributed across WIN_P values.

Code

# Create plot
ggplot(win_p_summary, aes(x = WIN_P, y = n_teams)) +
  geom_point(color = "#FF0066") +  # one point per distinct WIN_P value
  theme_classic() +
  geom_vline(linetype = "dashed",color = "blue",xintercept = first_q)+
  geom_vline(linetype = "dashed",color = "blue",xintercept = third_q)+
  labs(x = "Win Percentage", y = "Number of Teams") +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1))

On the graph we also marked the first and third quartiles. The distribution looks roughly normal, with some deviations, especially in the 0.7 to 0.8 range.

Searching through the internet (mostly Reddit), I came to the conclusion that a good college season means a win percentage above 65%. Look at that: 0.65 is also our third quartile, which means that teams meeting this bar make up roughly the top 25% of all team seasons. A threshold this selective suggests a high level of competition in the NCAA.

Teams above 0.65 `WIN_P`

Let us now filter out those teams into a different table.

Code

# Keep strong seasons and drop the impossible values above 1
top_teams <- data %>% 
  filter(WIN_P >= 0.65, WIN_P <= 1)


plot_teams_above_065 <- function(data, y_offset = 0.5, bar_color = "#FF0066", covid_year = 2020, covid_color = "gray50") {
  top_teams <- data %>%
    group_by(YEAR) %>%
    summarise(
      count_above_065 = sum(WIN_P >= 0.65, na.rm = TRUE),
      .groups = "drop"
    )
  
  if (!covid_year %in% top_teams$YEAR) {
    top_teams <- rbind(top_teams, data.frame(YEAR = covid_year, count_above_065 = 0))
  }
  
  top_teams$fill_group <- factor(ifelse(top_teams$YEAR == covid_year, "COVID", "Other"))
  
  ggplot(top_teams, aes(x = YEAR, y = count_above_065, fill = fill_group)) +
    geom_col() +
    geom_text(aes(label = count_above_065), vjust = -0.5, size = 3, color = "black") +
    labs(title = "Number of teams with WIN_P of 0.65 or higher",
         x = "Year",
         y = "Number of teams",
         caption = "2020 - 0, because of COVID-19") +
    scale_x_continuous(breaks = seq(min(top_teams$YEAR), max(top_teams$YEAR), by = 1)) +
    scale_fill_manual(values = c("COVID" = covid_color, "Other" = bar_color)) +
    theme_classic() +
    theme(legend.position = "none")
}

plot_teams_above_065(data)

Notice that throughout the years there has been a slight upward trend in the number of teams that achieved a WIN_P of 0.65 or higher.

Which teams are consistently good throughout the years

Let us now see which teams were good most often.

Code

team_counts_all <- data %>%
  group_by(TEAM) %>%
  summarise(count = n(), .groups = "drop")

team_counts_top <- top_teams %>% 
  group_by(TEAM) %>% 
  summarise(count = n(),
            avg_postseason_finish = mean(POSTSEASON_NUM, na.rm = TRUE))

top_10_by_count <- team_counts_top %>% 
  arrange(desc(count)) %>% 
  head(10)

count_summary <- team_counts_top %>%
  group_by(count) %>%
  summarise(n_teams = n(), .groups = "drop")

ggplot(count_summary, aes(x = count, y = n_teams)) +
  geom_col(fill = "#FF0066") +
  geom_text(aes(label = n_teams), vjust = -0.5, size = 3, color = "black") +  # Add text labels
  theme_classic() +
  labs(x = "Count (Number of Appearances)", y = "Number of Teams") +
  scale_x_continuous(breaks = unique(count_summary$count))

Looking at the teams that reach the 0.65 win percentage, we can see how many times each program (team) appears throughout the years. Let’s take a look at the teams that appeared the most.

I have chosen the ten teams that appeared the most and added their average postseason position (the lower the value, the better the program’s results throughout the years).

Code

ggplot(top_10_by_count, aes(x = reorder(TEAM, avg_postseason_finish, decreasing = TRUE), y = avg_postseason_finish, fill = avg_postseason_finish))+
  geom_col()+
  geom_text(aes(label = round(avg_postseason_finish, 0)), vjust = -0.5, size = 3, color = "black") +
  scale_fill_gradient(low = "green", high = "red") +
  theme_classic() +
  labs(title = "AVG postseason position", x = "Team", y = "AVG postseason position")+
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 7, color = "black"),
    legend.position = "none"
  )

As we could have expected, some of the greatest college teams are here at the top (Gonzaga, Duke, NCU). Their low average postseason position indicates consistently strong postseason results.

Belmont, on the other hand, is a team that consistently has a good regular season but fails to reach the postseason stage.
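Since a missed tournament is coded as 400, the plain average mixes two things: how often a team qualifies and how far it goes when it does. The sketch below separates them. The data and the helper summarise_team() are hypothetical, and this is an alternative view rather than the calculation used for the chart above.

```r
# Made-up strong seasons for two teams; 400 marks a missed tournament.
toy <- data.frame(
  TEAM = c("Team A", "Team A", "Team A", "Team B", "Team B", "Team B"),
  POSTSEASON_NUM = c(8, 16, 400, 400, 400, 64),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one summary row per team.
summarise_team <- function(df) {
  made <- df$POSTSEASON_NUM < 400
  data.frame(
    appearances = sum(made),             # seasons with a tournament result
    mean_all = mean(df$POSTSEASON_NUM),  # the average used in the chart
    mean_when_qualified =
      if (any(made)) mean(df$POSTSEASON_NUM[made]) else NA_real_
  )
}

by_team <- do.call(rbind, lapply(split(toy, toy$TEAM), summarise_team))
by_team
```

In this toy example a single missed tournament moves Team A's overall average from 12 to about 141, which shows how strongly the placeholder drives the averages.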

Analysis of losing teams

Let’s take a look at the lower ranked teams.

First of all let’s look at our win percentage column that we made earlier.

Code

summary(data$WIN_P)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.3871  0.5312  0.5192  0.6500  1.0526

Let us presume that a bad season is one at or below the first quartile, that is, a WIN_P of 0.3871 or less.

Code

plot_teams_below_039 <- function(data, y_offset = 0.5, bar_color = "#FF0066", covid_year = 2020, covid_color = "gray50") {
  # Count seasons at or below the first quartile (0.3871) per year
  worst_teams <- data %>%
    group_by(YEAR) %>%
    summarise(
      count_below_039 = sum(WIN_P <= 0.3871, na.rm = TRUE),
      .groups = "drop"
    )
  
  # Add an explicit zero row for the missing COVID season
  if (!covid_year %in% worst_teams$YEAR) {
    worst_teams <- rbind(worst_teams, data.frame(YEAR = covid_year, count_below_039 = 0))
  }
  
  worst_teams$fill_group <- factor(ifelse(worst_teams$YEAR == covid_year, "COVID", "Other"))
  
  ggplot(worst_teams, aes(x = YEAR, y = count_below_039, fill = fill_group)) +
    geom_col() +
    geom_text(aes(label = count_below_039), vjust = -0.5, size = 3, color = "black") +
    labs(title = "Number of teams with WIN_P of 0.39 or lower",
         x = "Year",
         y = "Number of teams",
         caption = "2020 - 0, because of COVID-19") +
    # Use worst_teams for both axis limits (the original read min() from the global top_teams)
    scale_x_continuous(breaks = seq(min(worst_teams$YEAR), max(worst_teams$YEAR), by = 1)) +
    scale_fill_manual(values = c("COVID" = covid_color, "Other" = bar_color)) +
    theme_classic() +
    theme(legend.position = "none")
}

plot_teams_below_039(data)

Teams with win percentages below the first quartile (0.39) were categorized as underperformers. The number of such “struggling” programs has decreased slightly across the observed seasons, indicating a narrowing competitive gap in Division I basketball. This shift may reflect improved parity, better talent distribution, and rising program quality across conferences.

Conclusion

Through this exploratory data analysis of NCAA basketball team statistics from 2013 to 2024, several key patterns emerged regarding team success. The dataset revealed a strong positive correlation (0.73) between wins (W) and adjusted offensive efficiency (ADJOE), indicating that teams with superior offensive ratings tend to secure more victories and advance further in the postseason. A moderate correlation (0.55) was also observed with two-point shooting percentage (X2P_O), where teams exceeding the benchmark of 50% (a threshold aligned with top performers in recent seasons) consistently ranked among the winners. Over time, the proportion of teams achieving this 50% mark has increased, from around 40% in earlier years to over 60% by 2024, suggesting a growing emphasis on efficient interior scoring and close-range shot selection in modern college basketball strategies.

In contrast, the correlation between wins and three-point shooting percentage (X3P_O) was weaker (0.42), with few teams reaching the elite benchmark of 38%. The percentage of such high-accuracy three-point teams has declined sharply over the years, dropping from approximately 10-15% in the early 2010s to under 5% by 2024. This trend may be attributed to the NCAA’s extension of the three-point line in the 2019-20 season, which coincided with the lowest Division I-wide three-point shooting average on record (33.47%), as well as evolving defensive schemes that limit long-range opportunities.

Powerhouse programs such as Duke and Gonzaga repeatedly sustain both strong regular-season records and deep postseason runs, while others demonstrate regular-season strength without tournament validation. Competitive balance also appears to be improving: the decline in the number of chronically underperforming teams hints at broader parity across Division I.

Overall, these findings underscore the evolving nature of NCAA basketball, offering actionable insights for coaches, analysts, and fans aiming to understand the drivers of success in a competitive landscape.
