March Madness

For this assignemnt I chose to analyze the March Madnesss dataset. I must admit that I’m not a follower of basketball, not even during March Madness. In fact, last year was my first time making a bracket for the tournament and I only made it because I was peer pressured by my coworkers. I showcased how little I knew by picking the top seeds for every single matchup– neeless to say I didn’t do very well. That being said, I’m not sure where to begin with this overwhelmingly large dataset, but I’ll give it my best effort.

Win/Loss Ratio: 2015-2019 Regular Season

I know that wins are a good thing in sports, and losing is not so good. Let’s start off with an overview of which teams have had a high Win:Loss ratio over the last 5 seasons. The exact numer of seasons is arbitrary, but I think recent years would be a better predictor of future performance compared to older years, so I chose the last 5. Here’s a table of the top 100 W/L ratios over the last 5 seasons:

team_win <- 
     AllGames %>% 
     group_by(t1_name, season) %>% 
     filter(t1_win == 1, season %in% c(2015, 2016, 2017, 2018, 2019)) %>% 
     summarise(wins = n()) %>% 
     arrange(t1_name)

names(team_win)[1] <- "name"

team_loss <- 
        AllGames %>% 
        group_by(t1_name, season) %>% 
        filter(t1_win == 0, season %in% c(2015, 2016, 2017, 2018, 2019)) %>% 
        summarise(losses = n()) %>% 
        arrange(t1_name)

names(team_loss)[1] <- "name"

AllGames_WL <- 
    left_join(team_win, team_loss) %>% 
    mutate(WL_ratio = wins/losses) %>% 
    arrange(desc(WL_ratio))

team_win2 <- 
     AllGames %>% 
     group_by(t1_name) %>% 
     filter(t1_win == 1, season %in% c(2015, 2016, 2017, 2018, 2019)) %>% 
     summarise(wins = n()) %>% 
     arrange(t1_name)

names(team_win2)[1] <- "name"

team_loss2 <- 
        AllGames %>% 
        group_by(t1_name) %>% 
        filter(t1_win == 0, season %in% c(2015, 2016, 2017, 2018, 2019)) %>% 
        summarise(losses = n()) %>% 
        arrange(t1_name)

names(team_loss2)[1] <- "name"

AllGames_WL2 <- 
    left_join(team_win2, team_loss2) %>% 
    mutate(WL_ratio = wins/losses) %>% 
    arrange(desc(WL_ratio))

rmarkdown::paged_table(AllGames_WL[1:100,], options = list(rows.print = 10))

Gonzaga appears takes the cake with the highest W/L ratio, 32. However, I fact-checked this observation by searching for Gonzaga’s 2016-2017 season record online and it didn’t fully match up (Wins: 37, Losses: 2; source: https://www.sports-reference.com/cbb/schools/gonzaga/2017.html). I spent some time searching for where the error may have been but I wasn’t able to find it. I checked a few other team records and I was consistently missing a few observations; we’ll just roll with it since this is what I have to work with.

The next question I’m thinking about is: Did a team just have one really good season? Virginia is only listed once in the top 10, and so I wonder if they had one good season or if they’re also a good team on average. I’ll remove the group_by(season), that way I can look at the average WL ratio for each team over the last 5 years.

rmarkdown::paged_table(AllGames_WL2[1:100,], options = list(rows.print = 10))

Gonzaga, Villanova, and Virginia still come out on top with average W/L ratios of 8.7058824, 6.3913043, and 5.48, respectively. These appear to be some of the most winning teams, even on average.

Tournament Win/Loss Record: 2015-2019

A few teams appear more than once in the table with the top 20 W/L records over the last 5 seasons. Do these high W/L ratios translate to high tournament performance? Let’s take a look:

# Tourney 
tourn_win <- 
        TournAll %>% 
        group_by(t1_name, season) %>% 
        filter(t1_win == 1, season %in% c(2015, 2016, 2017, 2018, 2019)) %>% 
        summarise(wins = n()) %>% 
        arrange(t1_name)

names(tourn_win)[1] <- "name"

tourn_loss <- 
        TournAll %>% 
        group_by(t1_name, season) %>% 
        filter(t1_win == 0, season %in% c(2015, 2016, 2017, 2018, 2019)) %>% 
        summarise(losses = n()) %>% 
        arrange(t1_name)

names(tourn_loss)[1] <- "name"

TournGames_WL <- 
    left_join(tourn_win, tourn_loss) %>% 
    arrange(desc(wins))

TournGames_WL$losses <- 
    TournGames_WL$losses %>% 
    replace(is.na(TournGames_WL$losses), 0)

plot_tourn_w <- 
TournAll %>% 
    group_by(t1_name) %>% 
    filter(t1_win == 1, season %in% c(2015, 2016, 2017, 2018, 2019),
           t1_name %in% c("Duke",
                          "North Carolina",
                          "Villanova",
                          "Virginia",
                          "Gonzaga",
                          "Michigan",
                          "North Carolina",
                          "Texas Tech",
                          "Wisconsin",
                          "Auburn",
                          "Kansas",
                          "Kentucky",
                          "Loyola-Chicago",
                          "Michigan St",
                          "Oklahoma",
                          "Oregon",
                          "South Carolina",
                          "Syracuse")) %>%
    select(t1_win, t1_name, season) %>% 
    mutate(outcome = as.factor("win")) 

plot_tourn_l <- 
TournAll %>% 
    group_by(t1_name) %>% 
    filter(t1_win == 0, season %in% c(2015, 2016, 2017, 2018, 2019), 
           t1_name %in% c("Duke",
                          "North Carolina",
                          "Villanova",
                          "Virginia",
                          "Gonzaga",
                          "Michigan",
                          "North Carolina",
                          "Texas Tech",
                          "Wisconsin",
                          "Auburn",
                          "Kansas",
                          "Kentucky",
                          "Loyola-Chicago",
                          "Michigan St",
                          "Oklahoma",
                          "Oregon",
                          "South Carolina",
                          "Syracuse")) %>%
    select(t1_win, t1_name, season) %>% 
    mutate(outcome = as.factor("loss")) 

plot_tourn_wl <- rbind(plot_tourn_w, plot_tourn_l)

g1 <- 
  ggplot(plot_tourn_wl, mapping = aes(x = t1_name, fill = outcome)) + 
        geom_bar(color = "black", width = 0.75) 
        
g1 <- g1 +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.title.x = element_blank()) + 
  ylab("Count") + 
  scale_fill_brewer(palette="Paired")

plot(g1)

Above is a graph displaying teams with >= 4 wins in a single tournament between 2015-2019; meaning that they must’ve done at least decently well in at least one recent tournament. The “count” is the total number of wins and losses over the last 5 years for those teams. If we compare the table and the graph we can get an idea of the teams with the top records during the regular season and then compare that with how they performed in tournaments. Interestingly, North Carolina has the highest cumulative number of wins over the last 5 years in tournaments but they aren’t even in the top 50 for regular season W/L records. Gonzaga’s, Villanova’s, Virginia’s and Wisconsin’s tournament performance all coincides pretty well with their respective W/L records, and Duke’s sort of does. Arizona and Houston had some of the top records during the regular season, but it apparently didn’t translate well to the tournament.

FGP and Wins

Past W/L ratios would certainly be a good predictor of future W/L ratios. What other predictors could potentially predict team wins? It would make sense that higher field goal percentages leads to more wins. Of course, it could also be true that higher field goal percentages are a result of shooting less baskets overall, and making a higher percentage of fewer shots. The relationship will likely give an idea of which of these scenarios is actually occuring.

fgpct <- function(greaterthan, lessthan) {
    
    x <- 
        AllGames %>%
        filter(t1_fgpct > greaterthan, t1_fgpct < lessthan, t1_win == 0) %>%
        select(t1_win, t1_name, t1_fgpct) %>% 
        summarise(count = n(), 
                  avg_pct = mean(t1_fgpct),
                  outcome = "loss")

    y <- 
        AllGames %>%
        filter(t1_fgpct > greaterthan, t1_fgpct < lessthan, t1_win == 1 ) %>% 
        select(t1_win, t1_name, t1_fgpct) %>% 
        summarise(count = n(),
                  avg_pct = mean(t1_fgpct),
                  outcome = "win") 

    z <- rbind(x,y)
    
    z <- 
        z %>% 
        summarise(win_prop = y$count/(x$count + y$count), 
                  avg_pct = mean(avg_pct)) %>% 
        select(win_prop, avg_pct)

}

fgpct_comb <- rbind(fgpct(10, 20),
                    fgpct(20, 25),
                    fgpct(25, 30),
                    fgpct(30, 35),
                    fgpct(35, 45),
                    fgpct(45, 50),
                    fgpct(50, 55),
                    fgpct(55, 60),
                    fgpct(60, 65),
                    fgpct(65, 70),
                    fgpct(70, 75),
                    fgpct(75, 80))

g2 <-
  ggplot(fgpct_comb, mapping = aes(x = avg_pct, y = win_prop)) + 
  geom_bar(stat = "identity", fill = "coral3") + 
  geom_smooth(col = "black", se = FALSE)

g2 <- g2 + 
  theme_bw() + 
  theme(panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank()) +
  xlab("Average FGPCT") + 
  ylab("Win Proportion")

plot(g2)

Above is a plot where win proportion is binned into field goal percentage groups. The count of wins and losses were totalled within a field goal percentage range of 5% (ie. 20-25%), and then a proportion for each group was calculated. As expected, there appears to be a consistent relationship between field goal percentage and wins.

Wins Throughout Season

Are wins and losses throughout a season “random”? In other words, do losses occur randomly throughout a season, or are they clustered together at a certain time? I’m not familiar with how basketball seasons work, and so I don’t know if the same teams play each other repeatedly at a certain point in the season or not. If they do, then this would of course affect the “randomness” of wins and losses obersved.

AllGames %>% 
    filter(season == 2019, 
           t1_name %in% c("Duke",
                          "North Carolina",
                          "Villanova",
                          "Virginia",
                          "Gonzaga",
                          "Michigan",
                          "North Carolina",
                          "Texas Tech",
                          "Wisconsin",
                          "Auburn",
                          "Kansas",
                          "Kentucky",
                          "Michigan St",
                          "Oklahoma",
                          "Oregon",
                          "South Carolina", 
                          "Syracuse")) %>% 
    select(daynum, t1_win, t1_name) %>% 
    arrange() %>% 
    ggplot(aes(x = daynum, y = t1_win)) + 
    geom_point() + 
    geom_abline(slope = 0, intercept = 0.5, color = "red") + 
    geom_line() + 
    facet_wrap(~t1_name) + 
    theme_bw() +
    xlab("Day Number") +
    theme(axis.text.y = element_blank(),
          axis.title.y = element_blank(),
          axis.ticks.y = element_blank(),
          panel.grid.major.y = element_blank(),
          panel.grid.minor.y = element_blank())

Above is a plot looking at when wins and losses occurred throughout the season (2019) for several of the top teams, and I’ll be the first to admit that it’s not the most aesthetically pleasing plot. There are a few interesting patterns to note. Michigan had zero losses throughout the first half followed by many losses in the second half. Were they simply on a win streak? Did they play harder teams? Did they lose motivation? Did a player get injured? Did the coaching staff change? Did they switch up game strategies? Michigan St had an overall good season, but they lost 3 games in a row about 2/3 of the way into the season. What gives? Were they playing the same team all 3 times? Were they on the road? Was a player injured? Could these patterns be explained by other variables in the dataset? Maybe I’ll find out later this semester.