Lawrence Summers_Sports_Analytics_and_Insights

Author

Lawrence Summers

Introduction

This data set covers every NCAA Division I men’s basketball team over the last 10 seasons (2013-2023). For each team it records the conference, the number of games played and won, and KPIs such as offensive rebound rate, effective field goal percentage on offense and defense, turnover percentage allowed (TOR) and committed (TORD), and two point and three point percentage on offense and defense. It also includes adjusted offensive and defensive ratings, which measure efficiency as points per possession multiplied by 100 possessions so that pace is taken out of the comparison, and adjusted tempo (ADJ_T), an estimate of the possessions per 40 minutes a team would have against a team that wants to play at an average Division I tempo. Finally, the data set records whether the team made the NCAA tournament (68 teams qualify) and how far it advanced (POSTSEASON).
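
To make the efficiency definition concrete, here is a minimal sketch in base R. The helper name `per_100` and the game totals are made up for illustration, and the calculation covers only the raw points-per-possession part; the opponent adjustment behind the ADJOE and ADJDE columns is not reproduced here.

```r
# Hypothetical helper (not part of the dataset or any package):
# raw points scored per 100 possessions, with no opponent adjustment.
per_100 <- function(points, possessions) {
  100 * points / possessions
}

# Made-up single-game totals for two teams playing at different paces.
fast_team <- per_100(points = 84, possessions = 75)
slow_team <- per_100(points = 70, possessions = 60)

round(c(fast = fast_team, slow = slow_team), 1)
```

In this toy case the slower team scores fewer total points but is more efficient per possession, which is the distinction the per-100 scaling is meant to expose.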

The tournament, commonly referred to as March Madness, is organized into four regions of 16 teams, placed according to seeding and strength of schedule. The field has expanded from the 64 teams that historically made the tournament to 68, and the extra teams are accommodated by “play in” games whose winners advance into the round of 64. In this dataset, the POSTSEASON column tells us how far each tournament team advanced: the play in game (R68), the first round (R64), the second round (R32), the Sweet 16 (S16), the Elite 8 (E8), the Final 4 (F4), the loser of the final (2ND), and the champion (Champions).

library(tidyverse)
library(kableExtra)
library(knitr)



college_bb <- read_csv("college_basketball.csv")

Each conference plays a conference tournament, and the winner receives an automatic bid to the NCAA tournament. The table below lists 35 conference labels; two of them (ind and Ind) mark independent teams, which leaves 33 conferences. Counting one automatic bid per conference, the remaining 35 of the 68 spots are at-large bids. (The exact split varies a little by season: the Great West, GWC, appears for only five team-seasons and did not carry an automatic bid.) An NCAA committee awards the at-large bids based on strength of schedule and national ranking. Traditionally the five power conferences (the ACC, Pac 12, Big 10, Big 12, and SEC) receive the majority of the at-large bids, while a smaller conference such as the Southern Conference typically gets only its one automatic bid. Our goal with this report is to explore the NCAA Division I men’s basketball data from the last 10 seasons, looking for trends that might tell us something about future tournaments.

Number_Of_Conferences <- table(college_bb$CONF)

# Count the number of conferences
num_conferences <- length(unique(college_bb$CONF))

# Convert the table to a data frame 
conference_df <- as.data.frame(Number_Of_Conferences)


colnames(conference_df) <- c("Conference", "Number_of_Teams")

# Add a row for the total number of conferences.
# Conference is a factor after as.data.frame(), so convert it to character first;
# otherwise the new label is not a valid level and prints as <NA>.
conference_df$Conference <- as.character(conference_df$Conference)
total_row <- c("Total Conferences", num_conferences)
conference_df <- rbind(conference_df, total_row)

# Print the table
print(conference_df)
          Conference Number_of_Teams
1                A10             142
2                ACC             147
3                 AE              92
4               Amer             100
5               ASun              95
6                B10             136
7                B12             100
8                 BE             108
9               BSky             114
10              BSth             111
11                BW              97
12               CAA             103
13              CUSA             137
14               GWC               5
15              Horz             102
16               ind               5
17               Ind               1
18               Ivy              72
19              MAAC             109
20               MAC             120
21              MEAC             115
22               MVC             102
23               MWC             108
24               NEC             101
25               OVC             116
26               P12             120
27               Pat              98
28                SB             117
29                SC             103
30               SEC             140
31              Slnd             120
32               Sum              89
33              SWAC             104
34               WAC              95
35               WCC              99
36 Total Conferences              35
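
Because the independents appear under two spellings (ind and Ind), the count of true conferences has to be worked out by hand. A minimal sketch of how the two labels could be merged before counting, using a made-up vector of labels rather than the real CONF column (the names `conf`, `conf_clean`, and `n_conferences` are illustrative):

```r
# Toy stand-in for college_bb$CONF; the values are made up.
conf <- c("ACC", "ind", "Ind", "SEC", "ind", "ACC")

# Treat any capitalization of "ind" as the same Independent label.
conf_clean <- ifelse(tolower(conf) == "ind", "Ind", conf)

# Team-seasons per label after merging.
table(conf_clean)

# Number of actual conferences, leaving the independents out.
n_conferences <- length(unique(conf_clean[conf_clean != "Ind"]))
n_conferences
```

Applied to the real column, the same two steps would give the conference count directly instead of subtracting the independent labels by eye.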

Champions

The first thing we are going to explore in the data is the champions. Are there any similarities in the champions’ statistics? Do conferences matter? Are there any repeat champions over the past 10 seasons?

Below we created a dataset including just the champions of the last 10 seasons, 2013 to 2023 (2020 is excluded because the tournament was canceled due to Covid-19). We then plotted which conferences won championships in those years.

Champions <- filter(college_bb, POSTSEASON == "Champions")
# tidyverse (loaded above) already attaches ggplot2 and dplyr
library(showtext)
library(xml2)
library(grid)
library(rsvg)
library(devtools)
library(ggimage)

# Lookup table of team logos by championship year (Champions2)
Champions2 <- data.frame(
  YEAR = c(2013, 2014, 2015, 2016, 2017, 2018, 2019, 2021, 2022, 2023),
  ImageURL = c(
    "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/louisville.svg",
    "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/uconn.svg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Duke_Athletics_logo.svg/300px-Duke_Athletics_logo.svg.png",
    "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/villanova.svg",
    "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/north-carolina.svg",
    "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/villanova.svg",
    "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/virginia.svg",
    "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/baylor.svg",
    "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/kansas.svg",
    "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/uconn.svg"
  )
)

#Merge logos into original dataset (Champions)
Champions <- inner_join(Champions, Champions2, by = "YEAR")


# Create the ggplot with images
ggplot(data = Champions) +
  geom_image(aes(x = YEAR, y = CONF, image = ImageURL), size = 0.1) +
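  # Set limits and breaks in one scale; a separate xlim() would be replaced by scale_x_continuous()
  scale_x_continuous(breaks = seq(2012, 2024, by = 1), limits = c(2012, 2024)) +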
  labs(
    title = "NCAA Champions Last 10 Years" ,
    x = "Year",
    y = "Conference",
    colour = "Conference"
  ) +
  theme_minimal()

Two observations stand out from the graph above. First, Villanova and UConn were the only repeat champions during this period, and the Big East appears to be the most dominant conference in the tournament. UConn won its 2014 title during a brief stint in the American Conference, but the program’s history is rooted in the Big East, and it had returned to the Big East by the time it won the 2023 title. Second, the ACC also shows its strength, with three different teams winning the tournament over the 10 year period. The Big 12 responded well after Covid-19, winning the two titles that followed the canceled 2020 tournament.


Champs_Over_Ten_Seasons <- filter(
  college_bb, 
  TEAM %in% c("Louisville", "Connecticut", "North Carolina", "Duke", "Villanova", "Virginia", "Baylor", "Kansas")
)

ggplot(data = Champs_Over_Ten_Seasons) +
  geom_boxplot(mapping = aes(x = TEAM, y = W)) +
  geom_hline(yintercept = median(Champs_Over_Ten_Seasons$W), color = "red", linetype = "dashed") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels

Champs_Over_Ten_Seasons$TEAM <- factor(Champs_Over_Ten_Seasons$TEAM, 
levels = c("Louisville", "Connecticut", "Duke", "Villanova" , "North Carolina" , "Virginia" , "Baylor" , "Kansas"))

Champs_Over_Ten_Seasons %>%
  ggplot(mapping = aes(x = YEAR, y = W, color = TEAM)) +
  geom_line() +
  geom_point() +
  ggtitle("NCAA Champions Regular Season Wins from 2013-2023") +
  labs(x = "Year", y = "Regular Season Wins", color = "Team") +  
  theme_minimal() +  
  facet_wrap(~TEAM, scales = "free_y") +  # Separate scatterplots for each team
  theme(strip.text = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

Champs_Over_Ten_Seasons$POSTSEASON <- factor(Champs_Over_Ten_Seasons$POSTSEASON, 
levels = c("R68", "R64", "R32", "S16" , "E8" , "F4" , "2ND" , "Champions"))

Champs_Over_Ten_Seasons %>%
  ggplot(mapping = aes(x = reorder(TEAM, W), fill = POSTSEASON)) +
  geom_bar(position = position_dodge(width = 0.8), width = 0.7) +  
  ggtitle("Postseason Performance of NCAA Champions") +
  labs(x = "Team", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 8))

Exploring the champions further, I wanted to see how those eight championship programs had performed over the whole ten year period. From the boxplot above, we can see that all of them are high level teams, with median wins at or above 20 per season. Villanova, Duke, and Kansas seem to be the most consistent teams. The 2021 season is a bit down for each team because the season was shortened, still, by Covid-19 restrictions. Louisville has been on a downward spiral since winning the title in 2013, which makes sense considering its Hall of Fame coach was fired amid an NCAA investigation into recruiting violations.

Even though all of these teams are still powerhouse schools, the second graph above (wins by year for each team) shows that their win totals are a bit erratic, although still high, and the way the NCAA works now has a lot to do with that. A five star recruit (a high level high school player) has to spend only one year at the college level before he can enter the NBA draft. So what often happens is that players come to these big name schools for a year, play really well, and then leave, and the continuity that used to be a trademark of college basketball has given way to more fluctuation. Even so, these powerhouse schools are able to recruit big name players year after year and stay around the 25 to 27 win threshold.

The final graph above shows that Connecticut (also referred to as UConn) is the most interesting team: although it has won two championships in the last 10 seasons, it has also missed the NCAA tournament in five of those seasons, and in the five tournaments it did make, it lost twice in the first round and once in the second round!

Villanova and Kansas seem to be the most consistent of the teams here, with Villanova missing the tournament only once and Kansas not at all. Kansas has lost in the second round five times in that ten year span, though, so if I were making predictions about Kansas in this year’s tournament, I would look closely at who their second round opponent might be.

Power Five

The “Power Five” conferences in basketball are considered to be the ACC, the Pac 12, the SEC, the Big 10, and the Big 12. Generally (and we will put this hypothesis to the test later with this dataset), these conferences tend to have the most teams in the tournament. From the champions graph above, we can also see that the Big East deserves consideration, since four of the last ten champions have come from that conference. The one champion from the American Conference (UConn in 2014) is historically a Big East program as well; it was in the American only because other teams had bolted to bigger conferences and left the Big East depleted. Since that 2014 National Championship, UConn has moved back into the Big East, where it won the 2023 National Championship. So the Big East isn’t considered a Power 5 conference anymore, but five of the last 10 national champions have come from what are widely known as Big East schools.

Before addressing whether the Big East belongs in that “Power 5,” we wanted to see if there were any connections or trends among the teams who won the championship. We start with the champions’ offensive and defensive efficiencies below:

Champions %>%
  ggplot(mapping = aes(x = YEAR)) +
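  geom_line(aes(y = ADJDE, color = "Defensive Efficiency")) +
  geom_line(aes(y = ADJOE, color = "Offensive Efficiency")) +
  geom_point(aes(y = ADJDE, color = "Defensive Efficiency")) +
  geom_point(aes(y = ADJOE, color = "Offensive Efficiency")) +
  # Use the same colors for lines and points so the legend matches both
  scale_color_manual(values = c("Defensive Efficiency" = "red", "Offensive Efficiency" = "green")) +
  geom_hline(yintercept = c(90, 120), linetype = "dashed", color = "grey") +  # Reference lines at 90 and 120
  # Set limits and breaks in one scale; a separate xlim() would be replaced by scale_x_continuous()
  scale_x_continuous(breaks = seq(2012, 2024, by = 2), limits = c(2012, 2024)) +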
  labs(
    y = "Efficiency",
    color = "Legend",
    title = "Defensive and Offensive Efficiency Over the Years"
  )

Here we can see that, for the most part, championship teams average around 120 points per 100 possessions (offensive efficiency) and give up around 90-95 points per 100 possessions (defensive efficiency). The graph shows that most of the time, as a team’s points go up, so, too, do the points it gives up. The 2014 UConn team is, by far, the least efficient of the champions, and we can see in the dataset that its pace (ADJ_T) is 64.8, the second slowest of any champion, meaning that it played a very slow, gritty type of game that is also historically synonymous with how Big East teams play. Louisville fits the same mold. Villanova stands out from most Big East teams because it was so good offensively, but it helped, too, that the two championship teams it produced during this period had seven future NBA players!

Below, I wanted to see how every tournament team measured up in offensive and defensive efficiency. From the graph below, it seems simple: if you want to have postseason success, you had better be in that top left quadrant! A few teams fall short of the threshold for scoring enough or defending well enough, but the two Elite 8 teams and the one Final 4 team in the top right quadrant each lost to a team in that top left quadrant, and a lot of that comes down to the winning teams being more efficient with each possession. So for forthcoming tournaments, these efficiencies are a very telling statistic.

Tourney_Teams <- filter(college_bb, POSTSEASON != "N/A")

Tourney_Teams$POSTSEASON <- factor(Tourney_Teams$POSTSEASON, 
levels = c("R68", "R64", "R32", "S16" , "E8" , "F4" , "2ND" , "Champions"))


ggplot(data = Tourney_Teams) +
  geom_point(mapping = aes(x = ADJDE, y = ADJOE , colour = POSTSEASON ,  size = POSTSEASON, alpha = POSTSEASON )) +
  scale_color_manual(values=c('grey','grey' , 'grey' , 'grey' , 'blue' , 'yellow' , 'green' , 'red')) +
  geom_vline(xintercept=100) +
  geom_hline(yintercept = 110) +
  ggtitle("The Efficiency Landscape") +
  xlab("Defensive Rating") +
  ylab("Offensive Rating")
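
The quadrants in this plot come from the two reference lines (ADJDE = 100 and ADJOE = 110). As a rough sketch, the same split can be turned into a label so the quadrants can be counted rather than eyeballed. The team names and ratings below are made up, and the `quadrant` column is an illustrative addition, not a column in the dataset.

```r
# Toy data with made-up ratings; the real input would be Tourney_Teams.
teams <- data.frame(
  TEAM  = c("Team A", "Team B", "Team C", "Team D"),
  ADJOE = c(118.2, 112.5, 104.0, 107.3),
  ADJDE = c(92.1, 101.4, 95.6, 103.8)
)

# Same cut points as the reference lines in the plot:
# x = ADJDE split at 100 (left is better), y = ADJOE split at 110 (up is better).
teams$quadrant <- with(teams, ifelse(
  ADJOE >= 110,
  ifelse(ADJDE <= 100, "top left", "top right"),
  ifelse(ADJDE <= 100, "bottom left", "bottom right")
))

table(teams$quadrant)
```

Cross-tabulating this label against POSTSEASON on the real data would show how many deep runs came from each quadrant.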

To get back to the rugged, gritty Big East, we wanted to see where that conference stood in terms of the Power 5 conferences. Each of the Power 5 conferences along with the Big East was put into a smaller dataset (Power_Five_Plus_Big_East).

From the graph and accompanying table below, we can see that the Big East can certainly stake its claim among the elite conferences in the country. The Big East has four titles and two other Final 4 appearances in the 10 year period, not to mention the championship that UConn won in 2014 as a part of the American Conference. The only conference that measures up to the Big East is the ACC, which has three championships, two 2ND place finishes, and three Final 4 participants, quite a feat that shows that conference’s dominance in the NCAA tournament. The Pac 12, Big 10, and SEC are all lacking titles in the last 10 years, even though the Big 10 sent the most teams to the tournament in that span (72).

Power_Five_Plus_Big_East <- filter(Tourney_Teams, CONF %in% c("ACC", "B10", "B12" , "P12" , "SEC" , "BE")) 




Power_Five_Plus_Big_East %>%
  ggplot(mapping = aes(x = CONF, fill = POSTSEASON)) +
  geom_bar(position = position_dodge(width = 0.8), width = 1.5) +
  annotate("text", x = "BE", y = 25, label = "6 teams in Final 4 with 4 Champions", vjust = -0.5, size = 3, color = "black") +
  annotate("segment", x = "BE", xend = "BE", y = 25, yend = 23, arrow = arrow(length = unit(0.3, "cm"))) +
  ggtitle("Is the Big East Power 5 Worthy?") +
  labs(x = "Conference", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

tournament_table <- table(Power_Five_Plus_Big_East$CONF, Power_Five_Plus_Big_East$POSTSEASON)

# Create a table with margins (totals)
conference_table_with_totals <- addmargins(tournament_table)

# Convert the table to a data frame
conference_table_df <- as.data.frame.matrix(conference_table_with_totals)


print(conference_table_df)
    R68 R64 R32 S16 E8 F4 2ND Champions Sum
ACC   1  20  14  13  9  3   2         3  65
B10   3  18  27  13  5  3   3         0  72
B12   0  24  19  11  6  2   1         2  65
BE    3  22  15   6  3  2   0         4  55
P12   2  15   7  13  5  2   0         0  44
SEC   2  15  16  10  6  4   1         0  54
Sum  11 114  98  66 34 16   7         9 355
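
Raw counts favor conferences that send more teams, so it can help to express deep runs as a share of each conference’s tournament appearances. The sketch below re-enters the F4, 2ND, Champions, and Sum columns from the table above by hand; the column names `second`, `bids`, `final_four_total`, `final_four_rate`, and `title_rate` are my own labels, not names from the dataset.

```r
# Counts copied from the printed table above (bids = the Sum column,
# which includes play in game appearances).
deep_runs <- data.frame(
  CONF      = c("ACC", "B10", "B12", "BE", "P12", "SEC"),
  F4        = c(3, 3, 2, 2, 2, 4),
  second    = c(2, 3, 1, 0, 0, 1),
  champions = c(3, 0, 2, 4, 0, 0),
  bids      = c(65, 72, 65, 55, 44, 54)
)

# Every team that lost in the Final 4, lost the final, or won it all reached the Final 4.
deep_runs$final_four_total <- with(deep_runs, F4 + second + champions)
deep_runs$final_four_rate  <- round(deep_runs$final_four_total / deep_runs$bids, 3)
deep_runs$title_rate       <- round(deep_runs$champions / deep_runs$bids, 3)

# Sort by title rate, highest first.
deep_runs[order(-deep_runs$title_rate), ]
```

On these numbers the Big East has the highest title rate (4 titles in 55 appearances), while the ACC has the most Final 4 teams (8 in 65 appearances).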

Does Defense Actually Win Championships?

There is an old adage in basketball that “offense sells tickets while defense wins championships,” so naturally we wanted to test that hypothesis. We are again looking only at the teams who made the NCAA tournament, because we know that they either won their conference championship or received an at-large bid, which means their body of work deemed them worthy (i.e., they were winning teams with respectable strength of schedule). In the graph below, we look at effective field goal percentage defense (EFG_D) to see whether it affected wins for a season, and then where the champions measured up against all of the other tournament participants. Baylor’s 2021 win total is a bit skewed because of the shortened Covid-19 season, but they still won 22 of 24 games, even though they allowed teams to shoot at a pretty high percentage. For the most part, the data shows that the stingiest teams usually win the most games. While winning regular season games doesn’t necessarily equate to doing well in the tournament, the better you perform in the regular season, the higher the seed you get, which gives your team a better chance against lower seeded opponents.

ggplot(data = Tourney_Teams) +
  geom_point(mapping = aes(x = EFG_D, y = W, colour = POSTSEASON)) +
  scale_color_manual(values = c('grey87', 'grey87', 'grey87', 'grey87', 'grey87', 'grey87', 'grey87', 'red')) +
  geom_smooth(mapping = aes(x = EFG_D, y = W), method = 'lm', se = FALSE, color = 'turquoise4') +
  theme_minimal() +
  geom_vline(xintercept = 47, size = 1.5) +
  labs(x = 'Effective Field Goal % Defense', y = 'Wins', title = 'Linear Regression') +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = 'bold')) +
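  # annotate() draws each mark once; geom_text()/geom_point() with fixed x and y would redraw it for every row
  annotate("text", x = 52, y = 19, label = "2021 Shortened Season due to Covid-19", size = 2, color = "black", vjust = -0.5) +
  annotate("point", x = 49.1, y = 22, size = 3, shape = 1, color = "red")  # Circle around the point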
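
The `geom_smooth(method = 'lm')` layer draws the fitted line but does not report its slope. A minimal sketch of fitting the same kind of model directly with `lm()` follows; the six data points are made up for illustration, and with the real data the `toy` data frame would be replaced by Tourney_Teams.

```r
# Made-up values standing in for the EFG_D and W columns.
toy <- data.frame(
  EFG_D = c(44.5, 46.0, 47.2, 48.8, 50.1, 51.5),
  W     = c(31, 29, 27, 26, 23, 22)
)

# Same model that geom_smooth(method = "lm") fits: wins as a linear function of EFG_D.
fit <- lm(W ~ EFG_D, data = toy)

coef(fit)               # intercept and slope (wins per point of EFG% allowed)
summary(fit)$r.squared  # share of the variance in wins explained by EFG_D
```

A negative slope here corresponds to the downward line in the plot: the higher the shooting percentage a team allows, the fewer games it tends to win.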

To further test the idea that defense helps win championships, I looked at the exact same type of graph, only I substituted Efg% defense with ADJ_T, which is the number of possessions a team would average if playing against an average Division I team. Defensive minded teams play slower, trying to control the pace of the game and keep scores lower because they feel that their defense can win them games down the stretch in close games.

The graph shows us that controlling pace doesn’t affect winning as much, and that getting defensive stops while still playing faster is the way the game is played now. Virginia, winners of the 2019 National Championship, are a bit of an outlier: they played at the slowest pace of any champion, still won at a very high percentage, and used that style to actually win a championship. It must be noted, though, that in 2018 Virginia was also the first number one seed EVER to lose to a 16 seed, with an ADJ_T of 60.5 (compared to 60.7 in their championship year). So sometimes, regardless of wins or stature, coming up against a team with a contrasting style in the tournament is deadly: Virginia lost to UMBC, who averaged 69 possessions per game (ADJ_T) and shot 38% from the three point line, and UMBC stunned Virginia!

library(ggplot2)
library(ggimage)


scatter_plot <- ggplot(data = Tourney_Teams) +
  geom_point(mapping = aes(x = ADJ_T, y = W, colour = POSTSEASON)) +
  scale_color_manual(values = c('grey87', 'grey87', 'grey87', 'grey87', 'grey87', 'grey87', 'grey87', 'red')) +
  geom_smooth(mapping = aes(x = ADJ_T, y = W), method = 'lm', se = FALSE, color = 'purple') +
  theme_minimal() +
  labs(x = 'Possessions', y = 'Wins', title = 'Linear Regression') +
  theme(plot.title = element_text(hjust = 0.5, size = 20, face = 'bold'))

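# URL of image
image_url <- "https://i.turner.ncaa.com/sites/default/files/images/logos/schools/bgd/virginia.svg"

# One-row data frame for the logo, so it is drawn once
# (inheriting Tourney_Teams would redraw it for every row)
logo_df <- data.frame(
  x_coord = 60.7,  # ADJ_T of the highlighted point
  y_coord = 35,    # wins of the highlighted point
  image_url = image_url
)

final_plot <- scatter_plot +
  geom_image(data = logo_df, mapping = aes(x = x_coord, y = y_coord, image = image_url), size = .08)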


print(final_plot)

Just for Fun

Davidson College 2022 season

Because I am a fan of college basketball and was also lucky enough to take part in the NCAA tournament as a player, I thought it would be fun to take a quick glance at my alma mater, Davidson College. Davidson currently plays in the Atlantic 10 conference, having left the Southern Conference after the 2014 season. In 2022, Davidson went 27-6, earned a 10 seed in the NCAA tournament, and was probably the best team the school had had since Steph Curry led them to the Elite 8 in 2008. I wanted to compare Davidson to the eventual champion that year, Kansas. The comparison is fitting because when Davidson lost in the Elite 8 in 2008, it lost to none other than Kansas, who went on to win that 2008 National Championship.

# Keep only the 2022 rows for the two teams (Davidson first, then Kansas)
comparison_data <- college_bb %>%
  filter(TEAM %in% c("Davidson", "Kansas"), YEAR == 2022) %>%
  arrange(TEAM)


library(knitr)

kable(comparison_data, format = "markdown")
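
| TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | TORD | ORB | DRB | FTR | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR |
|:--|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|:--|--:|--:|
| Davidson | A10 | 33 | 27 | 115.8 | 101.1 | 0.8258 | 55.6 | 49.5 | 14.8 | 16.7 | 23.6 | 22.4 | 32.5 | 28.0 | 54.2 | 48.9 | 38.5 | 33.6 | 64.6 | 2.1 | R64 | 10 | 2022 |
| Kansas | B12 | 40 | 34 | 119.8 | 91.3 | 0.9580 | 53.8 | 45.8 | 17.3 | 18.1 | 32.9 | 28.6 | 32.3 | 27.7 | 53.6 | 46.4 | 36.1 | 29.8 | 69.1 | 10.4 | Champions | 1 | 2022 |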

Even from this simple table, if the Davidson coaches had drawn this matchup in the tournament that year, they could say that Davidson is the better shooting team and plays at a slower tempo (ADJ_T) than Kansas. Kansas rebounds better and defends the three well, but if Davidson plays up to that speed, shoots it well, gets to the free throw line more, and somehow negates the rebounding disparity, then it has a chance against a juggernaut like Kansas. That’s exactly how smaller teams approach one-off games like those in the tournament. Davidson ended up losing to Michigan State in the first round of the 2022 tournament, 74-73. Michigan State, from the Big 10, is another Power Five school, and Davidson stayed in the game by shooting 40% from the three point line, committing only eight turnovers, and losing the rebounding battle by only three. Michigan State did go 11-15 from the free throw line compared to Davidson’s 7-12, but the statistics alone give a good idea of how teams can compete with bigger, stronger teams in the tournament.
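
The matchup reading above can be summarized as a set of Davidson-minus-Kansas differences. This is only a sketch: the values are typed in from the table above, `NET` (ADJOE minus ADJDE) is a label I am introducing rather than a column in the dataset, and `X3P_O` stands in for 3P_O because R names cannot start with a digit.

```r
# Values copied from the comparison table above.
matchup <- data.frame(
  TEAM  = c("Davidson", "Kansas"),
  ADJOE = c(115.8, 119.8),
  ADJDE = c(101.1, 91.3),
  EFG_O = c(55.6, 53.8),
  X3P_O = c(38.5, 36.1),
  ORB   = c(23.6, 32.9),
  ADJ_T = c(64.6, 69.1)
)

# Illustrative net rating: points scored minus points allowed per 100 possessions.
matchup$NET <- matchup$ADJOE - matchup$ADJDE

# Davidson minus Kansas; positive means Davidson has the higher number.
stats <- c("NET", "EFG_O", "X3P_O", "ORB", "ADJ_T")
edge  <- unlist(matchup[1, stats]) - unlist(matchup[2, stats])
round(edge, 1)
```

The signs line up with the paragraph above: Davidson is ahead on shooting (EFG_O and 3P_O) and behind on offensive rebounding, tempo, and overall net efficiency.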

Conclusion

In conclusion, this dataset has shown us that the bigger schools usually prevail in the end, but that even those Power Five schools struggle in the tournament. Defense still matters, but efficiency is a better indicator of how teams may match up with each other in the tournament. The data from the last 10 years tells us not to bet against the Big East and the ACC. At the moment, the Big East has three teams ranked in the top 15 in the country (UConn at 3, Marquette at 5, and Creighton at 12), and the ACC has North Carolina at number 9 and Duke at number 10. The 2024 NCAA Tournament starts on March 21st, and it will be interesting to see if any of these trends determine the ultimate winner this year. The fun part is always seeing if a “Cinderella” team from a smaller conference can somehow break through and be crowned champion.