knitr::opts_chunk$set(warning = FALSE, message = FALSE)
My question: Can we use a college basketball team's regular-season statistics to predict success in March Madness?
The two data sets give baseline and advanced statistics on D1 college basketball teams. Each contains a different set of statistics, and together they give a well-rounded picture of each team.
I found the idea for this Shiny app in an article and decided to replicate and enhance it. The original app displayed the stats but did no predictive modeling, and its results were not laid out well, so I attempted to improve the presentation and add a predictive model on top of the stats at hand. It was an interesting app, but I thought much more could be added to it.
My goal is to test different statistics to find which ones create the most wins, controlling for strength of schedule. I will need to test a variety of statistics in order to generate a quality model, and I plan to find the specific game box score stats that contribute to winning games. I will use many different datasets: one with each team's stats throughout the season, another with specific games and their results, and lastly a plethora of data sets containing a game log for each team. The teams I included are Alabama, Boston College, Cal Poly, Cleveland State, Gonzaga, Hawaii, Indiana State, Iona, LSU, Maryland, Northern Arizona, New Hampshire, Northwestern State, Penn State, Texas, UAB, UCLA, UNC, Utah, Vanderbilt, and Virginia. I would have used more, but I wanted to focus on the predictive model rather than wrangling 50+ data sets on more teams. I wish my data sets had more advanced metrics in them. Some can be calculated (such as points per possession), but others are too difficult to derive from what I have.
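The chunks below assume the usual tidyverse and tidymodels packages are attached. A minimal setup sketch (the exact setup chunk in the original document may differ):
library(tidyverse)   # read_csv, dplyr verbs, tidyr, ggplot2
library(readxl)      # read_excel
library(tidymodels)  # initial_split, decision_tree, linear_reg, yardstick metrics
library(scales)      # rescale
library(shiny)       # shinyApp
# rpart.plot is referenced below via rpart.plot::rpart.plot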
import1 <- read_csv('2020_sports_reference.csv') %>%
  mutate(
    conference = paste(W_conf, "-", L_conf),
    record = paste(Wins, "-", Losses),
    home = paste(W_home, "-", L_home),
    away = paste(W_away, "-", L_away),
    school = tolower(School),
    # per-game rates
    ppg = points_for/Games,
    `3pg` = `3P`/Games,
    FTpg = FT/Games,
    OREBpg = ORB/Games,
    REBpg = TRB/Games,
    ASTpg = AST/Games,
    STLpg = STL/Games,
    BLKpg = BLK/Games,
    TOVpg = TOV/Games,
    point_diff = points_for - points_against,
    ast_to_tov = AST/TOV,
    # standard possession estimate, then points per possession
    poss = round(FGA - ORB + TOV + (0.475 * FTA)),
    ppp = points_for/poss
  )
advanced <- read_csv('advanced2020.csv') %>%
mutate(school = tolower(school))
import1 <- left_join(import1, advanced, by = "school")
import1
## # A tibble: 353 × 58
## Rk School Games Wins Losses `W-L%` SRS SOS ...9 W_conf L_conf ...12
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <lgl>
## 1 1 Abile… 31 20 11 0.645 -2.87 -6.87 NA 15 5 NA
## 2 2 Air F… 32 12 20 0.375 -0.37 3.02 NA 5 13 NA
## 3 3 Akron 31 24 7 0.774 7.15 -0.4 NA 14 4 NA
## 4 4 Alaba… 30 8 22 0.267 -18.9 -8.85 NA 5 13 NA
## 5 5 Alaba… 32 19 13 0.594 -0.94 -1.64 NA 9 9 NA
## 6 6 Alaba… 32 8 24 0.25 -16.2 -6.53 NA 7 11 NA
## 7 7 Alaba… 31 16 15 0.516 11.1 8.12 NA 8 10 NA
## 8 8 Alban… 32 14 18 0.438 -8.78 -5.91 NA 7 9 NA
## 9 9 Alcor… 30 15 15 0.5 -14.6 -9.64 NA 11 7 NA
## 10 10 Ameri… 30 16 14 0.533 -4.37 -6.5 NA 12 6 NA
## # … with 343 more rows, and 46 more variables: W_home <dbl>, L_home <dbl>,
## # ...15 <lgl>, W_away <dbl>, L_away <dbl>, ...18 <lgl>, points_for <dbl>,
## # points_against <dbl>, ...21 <lgl>, MP <dbl>, FG <dbl>, FGA <dbl>,
## # FG% <dbl>, 3P <dbl>, 3PA <dbl>, 3P% <dbl>, FT <dbl>, FTA <dbl>, FT% <dbl>,
## # ORB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>,
## # conference <chr>, record <chr>, home <chr>, away <chr>, school <chr>,
## # ppg <dbl>, 3pg <dbl>, FTpg <dbl>, OREBpg <dbl>, REBpg <dbl>, ASTpg <dbl>, …
stats <- import1 %>%
select(school, record, SimpleRatingSystem, StrengthOfSchedule,
conference, ppg, 'FG%', `3pg`, `3P%`, FTpg, `FT%`, OREBpg, REBpg,
ASTpg, STLpg, BLKpg, point_diff, ast_to_tov, TOV, FGA, FTA, ORB, ppp, TOVpg) %>%
mutate(school = toupper(school)) %>%
arrange(school)
display_stats <- stats %>%
  rename(School = school, Record = record, `Simple Rating System` = SimpleRatingSystem,
         `Strength Of Schedule` = StrengthOfSchedule, Conference = conference, PPG = ppg,
         `Point Differential` = point_diff, `Assist-to-Turnover` = ast_to_tov,
         `Points per Possession` = ppp) %>%
  select(School, Record, `Simple Rating System`, `Strength Of Schedule`,
         Conference, PPG, `FG%`, `3P%`, FTpg, `FT%`, TOVpg, `Points per Possession`,
         OREBpg, REBpg, ASTpg, STLpg, BLKpg, `Point Differential`, `Assist-to-Turnover`, TOV, ORB)
import2 <- read_excel('ncaa basketball 2019-20.xlsx', sheet = 'Sheet1') %>%
  mutate(school = tolower(Team))
# normalize school names so they match the sports-reference keys used above
import2$school <- gsub("washingtonu", "washington", import2$school)
import2$school <- gsub("norfolkst", "norfolkstate", import2$school)
import2$school <- gsub("appalachianst", "appalachianstate", import2$school)
import2$school <- gsub("calsantabarbara", "uc-santabarbara", import2$school)
import2$school <- gsub("calsantabarb", "uc-santabarbara", import2$school)
import2$school <- gsub("usc", "southerncalifornia", import2$school)
import2$school <- gsub("e.washington", "easternwashington", import2$school)
import2$school <- gsub("vacommonwealth", "virginiacommonwealth", import2$school)
import2$school <- gsub("loyolachicago", "loyola(il)", import2$school)
import2$school <- gsub("houstonu", "houston", import2$school)
import2$school <- gsub("mountstmarys", "mountst.mary's", import2$school)
import2$school <- gsub("lsu", "louisianastate", import2$school)
import2$school <- gsub("ncgreensboro", "northcarolina-greensboro", import2$school)
import2$school <- gsub("byu", "brighamyoung", import2$school)
import2 <- import2 %>%
mutate(school = toupper(school))
lookup1 <- import2 %>%
  select(school, Final, Rot, Date) %>%
  # rotation numbers come in odd/even pairs, so a team's opponent is the adjacent number
  mutate(opponent_rot = ifelse(Rot %% 2 == 0, Rot - 1, Rot + 1)) %>%
  # 3-digit dates (e.g. 101 = Jan 1) fall in 2020; 4-digit dates (e.g. 1105 = Nov 5) in 2019
  mutate(game_date = paste(ifelse(nchar(Date) == 3, '2020', '2019'), Date, sep = '-'))
lookup2 <- import2 %>%
  mutate(opponent_rot = Rot,
         opponent_school = school,
         opponent_final = Final) %>%
  select(opponent_rot, opponent_school, Date, opponent_final) %>%
  mutate(game_date = paste(ifelse(nchar(Date) == 3, '2020', '2019'), Date, sep = '-'))
# base merge() joins on the shared columns: opponent_rot, Date, and game_date
games <- merge(lookup1, lookup2) %>%
  mutate(score = paste(Final, "-", opponent_final)) %>%
  select(game_date, school, Final, score, opponent_school, opponent_final) %>%
  mutate(win_or_loss = ifelse(Final > opponent_final, 'Win', 'Loss')) %>%
  select(-Final, -opponent_final)
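To see how the rotation-number pairing works, here is a toy illustration with hypothetical rotation numbers; each odd number is paired with the even number that follows it:
tibble(Rot = c(501, 502, 503, 504)) %>%
  mutate(opponent_rot = ifelse(Rot %% 2 == 0, Rot - 1, Rot + 1))
# 501 pairs with 502 and 503 with 504, so every game row can find its opponent's row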
This code was initially taken from the article I read, but I made edits to it as I went.
stats <- stats %>%
  # sep belongs outside into(); record strings look like "20 - 11"
  separate(record, into = c("wins", "losses"), sep = " - ") %>%
  mutate(wins = as.integer(wins),
         losses = as.integer(losses)) %>%
  rename(srs = SimpleRatingSystem,
         sos = StrengthOfSchedule)
stats %>%
ggplot(aes(x = sos, y = wins)) +
geom_point()
Interestingly, when plotting strength of schedule against wins, there is an apparent positive association. I would have expected little correlation: the best teams play harder schedules and have a tough time winning all their games, while bad teams play easy schedules and win more than expected. On reflection it still makes sense, as the hardest schedules are typically played by the best teams, who win most of their games anyway. This also makes me wonder whether the strength of schedule metric might be biased: the best teams' strength of schedule ratings are much higher than other teams', so strength of schedule alone might be a poor way to judge the quality of a team.
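To put a rough number on that association, one could compute the correlation directly (a quick sketch; the exact value depends on the data):
stats %>%
  summarize(r = cor(sos, wins, use = "complete.obs"))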
stats %>%
ggplot(aes(x = `FG%`, y = wins)) +
geom_point()
The positive association between field goal percentage and wins makes plenty of sense: teams that hit more of their shots will probably win more games.
stats %>%
ggplot(aes(x = `FT%`, y = wins)) +
geom_point()
The random scatter in free throw percentage vs. wins also makes sense. Free throws are a small portion of a basketball game, and having good or bad free throw shooters is rarely make-or-break.
stats %>%
ggplot(aes(y = wins, x = ppp)) +
geom_point()
This graph wasn't much of a surprise: teams that score more points per possession will likely have more wins.
stats %>%
ggplot(aes(y = wins, x = ast_to_tov)) +
geom_point()
There does appear to be a positive association between assist-to-turnover ratio and wins.
stats %>%
ggplot(aes(x = STLpg + BLKpg, y = wins)) +
geom_point()
There is a small positive association between steals-plus-blocks per game and wins, which makes sense. My data has no defensive rating statistic, so steals and blocks per game will have to serve as a proxy for defensive quality.
From the plots above, multiple stats appear to be associated with wins. I think it will be good to include strength of schedule, field goal percentage, points per possession, assist-to-turnover ratio, and steals and blocks.
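As a quick check before modeling, the candidate predictors' correlations with wins can be compared in one pass (a sketch, using steals plus blocks as the defense proxy):
stats %>%
  mutate(defense = STLpg + BLKpg) %>%
  summarize(across(c(sos, `FG%`, ppp, ast_to_tov, defense),
                   ~ cor(.x, wins, use = "complete.obs")))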
gamestats <- read_csv('gamestats.csv')
# initial_split samples randomly; calling set.seed() beforehand would make the split reproducible
gamestats_split <- initial_split(gamestats, prop = 2/3)
gamestats_train <- training(gamestats_split)
gamestats_test <- testing(gamestats_split)
decision_tree_fit <- decision_tree(mode = "regression") %>%
set_engine("rpart") %>%
fit(
point_diff ~ `FG%` + ast_to_tov + ppp + defense,
data = gamestats_train)
decision_tree_fit %>%
extract_fit_engine() %>%
rpart.plot::rpart.plot(roundint = FALSE)
metrics <- yardstick::metric_set(mae, mape, rsq_trad, rmse)
decision_tree_fit %>%
predict(gamestats_test) %>%
bind_cols(gamestats_test) %>%
metrics(truth = point_diff, estimate = .pred) %>%
select(-.estimator)
## # A tibble: 4 × 2
## .metric .estimate
## <chr> <dbl>
## 1 mae 9.20
## 2 mape 124.
## 3 rsq_trad 0.497
## 4 rmse 12.0
The mean absolute error is about 9.2, which means the model misses the point differential of a game by about 9 points on average. This isn't ideal at all, but it is something to keep in mind when viewing the results. The R-squared value is about 0.50, which means about half of the variance in the results can be attributed to my predictors.
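For context, it helps to compare against a naive baseline that always predicts the training-set mean point differential; the tree should beat this handily (a sketch; values depend on the random split):
gamestats_test %>%
  mutate(.pred = mean(gamestats_train$point_diff)) %>%
  metrics(truth = point_diff, estimate = .pred) %>%
  select(-.estimator)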
decision_tree_fit2 <- decision_tree(mode = "regression") %>%
set_engine("rpart") %>%
fit(
point_diff ~ ast_to_tov + defense,
data = gamestats_train)
decision_tree_fit2 %>%
extract_fit_engine() %>%
rpart.plot::rpart.plot(roundint = FALSE)
metrics <- yardstick::metric_set(mae, mape, rsq_trad, rmse)
decision_tree_fit2 %>%
predict(gamestats_test) %>%
bind_cols(gamestats_test) %>%
metrics(truth = point_diff, estimate = .pred) %>%
select(-.estimator)
## # A tibble: 4 × 2
## .metric .estimate
## <chr> <dbl>
## 1 mae 11.4
## 2 mape 128.
## 3 rsq_trad 0.247
## 4 rmse 14.6
The mean absolute error is about 11.4, which means the model misses the point differential of a game by about 11 points. This is a bit worse than the first decision tree. The R-squared value is about 0.25, which means about 25% of the variance in the results can be attributed to my predictors. This is very low, and I would have hoped for a higher R-squared.
decision_tree_fit3 <- decision_tree(mode = "regression") %>%
set_engine("rpart") %>%
fit(
point_diff ~ ppp,
data = gamestats_train)
decision_tree_fit3 %>%
extract_fit_engine() %>%
rpart.plot::rpart.plot(roundint = FALSE)
metrics <- yardstick::metric_set(mae, mape, rsq_trad, rmse)
decision_tree_fit3 %>%
predict(gamestats_test) %>%
bind_cols(gamestats_test) %>%
metrics(truth = point_diff, estimate = .pred) %>%
select(-.estimator)
## # A tibble: 4 × 2
## .metric .estimate
## <chr> <dbl>
## 1 mae 9.14
## 2 mape 122.
## 3 rsq_trad 0.513
## 4 rmse 11.8
I only included points per possession as a predictor in the third decision tree. Surprisingly, it had a mean absolute error of about 9.1, meaning the tree is off by about 9 points on average. The R-squared value is about 0.51, meaning about half of the variance in the response can be attributed to this single predictor.
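Rather than scanning three separate printouts, the trees' test-set metrics could be collected into a single table (a sketch):
list(tree1 = decision_tree_fit,
     tree2 = decision_tree_fit2,
     tree3 = decision_tree_fit3) %>%
  purrr::imap_dfr(~ predict(.x, gamestats_test) %>%
                    bind_cols(gamestats_test) %>%
                    metrics(truth = point_diff, estimate = .pred) %>%
                    mutate(model = .y))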
trained_linear_model <-
parsnip::linear_reg() %>%
fit(point_diff ~ `FG%` + ast_to_tov + ppp + defense,
data = gamestats_train)
trained_linear_model %>%
predict(gamestats_test) %>%
bind_cols(gamestats_test) %>%
metrics(truth = point_diff, estimate = .pred)
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 7.50
## 2 mape standard 109.
## 3 rsq_trad standard 0.660
## 4 rmse standard 9.84
I decided to try a linear model to see if I'd have any more luck. Fortunately, the mean absolute error dropped to about 7.5, and the R-squared value is about 0.66, higher than any of the tree models.
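A quick way to eyeball the fit is a predicted-vs-actual plot, where a perfect model would put every point on the diagonal (a sketch):
trained_linear_model %>%
  predict(gamestats_test) %>%
  bind_cols(gamestats_test) %>%
  ggplot(aes(x = point_diff, y = .pred)) +
  geom_point() +
  geom_abline()   # the y = x line; points on it are predicted exactly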
linear_mod <- lm(point_diff ~ `FG%` + ppp + defense + ast_to_tov, data = gamestats)
summary(linear_mod)
##
## Call:
## lm(formula = point_diff ~ `FG%` + ppp + defense + ast_to_tov,
## data = gamestats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.878 -7.110 -0.109 6.600 34.476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -86.0729 3.3700 -25.541 <2e-16 ***
## `FG%` 10.5677 10.6209 0.995 0.320
## ppp 71.2432 5.8472 12.184 <2e-16 ***
## defense 1.1286 0.1214 9.299 <2e-16 ***
## ast_to_tov -0.1600 0.7205 -0.222 0.824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.57 on 542 degrees of freedom
## Multiple R-squared: 0.5845, Adjusted R-squared: 0.5814
## F-statistic: 190.6 on 4 and 542 DF, p-value: < 2.2e-16
I fit this model on the full data set so I could grab coefficients to use later in the app. Note from the summary that ppp and defense are highly significant while `FG%` and ast_to_tov are not, but I kept all four since they match the predictors the app's power score uses.
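Rather than copying the estimates by hand into the scoring function below, the coefficients could be pulled programmatically, which would make the hard-coded numbers easier to update (a sketch):
coefs <- coef(linear_mod)
coefs[["ppp"]]   # e.g. 71.2432; coefs[["(Intercept)"]] gives -86.0729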
score_stats <- stats %>%
mutate(scaledwins = wins * (sos + 30),
poss = round(FGA - ORB + TOV + (0.475 * FTA))) %>%
mutate(ppp = (ppg * (wins + losses)/poss),
sos_scaled = rescale(sos, c(0.5,1.5)),
defense = STLpg + BLKpg)
app_stats <- stats
app_stats$school <- gsub(" ", "", app_stats$school, fixed = TRUE)
display_stats <- display_stats %>%
mutate(school = School)
display_stats$school <- gsub(" ", "", display_stats$school, fixed = TRUE)
score_stats$school <- gsub(" ", "", score_stats$school, fixed = TRUE)
rm(import1, import2, lookup1, lookup2)
# matchup
compare <- function(team1, team2){
team1_stats <- display_stats %>%
filter(school == team1) %>%
select(-school, -ORB, -TOV)
team2_stats <- display_stats %>%
filter(school == team2) %>%
select(-school, -ORB, -TOV)
x <- rbind(team1_stats, team2_stats)
x
}
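A quick usage sketch (team keys are uppercase with spaces removed, and both teams must appear in display_stats):
compare("GONZAGA", "ALABAMA")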
# linear power score, built from the coefficients of linear_mod above
score <- function(team1, team2){
  power <- function(team){
    score_stats %>%
      filter(school == team) %>%
      mutate(`Unadjusted Score` = (10.5677 * `FG%`) + (71.2432 * ppp) +
               (1.1286 * defense) + (-0.1600 * ast_to_tov) - 86.0729,
             `Adjusted Score` = sos_scaled * `Unadjusted Score`) %>%
      select(school, `Unadjusted Score`, `Adjusted Score`)
  }
  rbind(power(team1), power(team2))
}
# tree power scores: one predicted point differential per fitted tree, plus their average
tree_score <- function(team1, team2){
  one_team <- function(team){
    row <- score_stats %>% filter(school == team)
    tibble(school = team,
           TreeScore1 = predict(decision_tree_fit, row)$.pred,
           TreeScore2 = predict(decision_tree_fit2, row)$.pred,
           TreeScore3 = predict(decision_tree_fit3, row)$.pred) %>%
      mutate(`Average Score` = (TreeScore1 + TreeScore2 + TreeScore3) / 3)
  }
  rbind(one_team(team1), one_team(team2))
}
# a team's last ten games of the 2020 portion of the season
games_last_ten <- function(team) {
  games %>%
    filter(game_date > '2019-1231', school == team) %>%   # game_date is a string, so this keeps 2020 games
    arrange(desc(game_date)) %>%
    slice_head(n = 10)
}
# shiny app
shinyApp(
ui = fluidPage(
titlePanel("March Madness Team Comparison"),
selectInput("team1",
"Choose Team 1 (Scroll or Delete-and-Type)",
app_stats$school
),
selectInput("team2",
"Choose Team 2 (Scroll or Delete-and-Type)",
app_stats$school
),
titlePanel("Statistic Comparison"),
tableOutput("data1"),
titlePanel("Linear Model Score (Strength of Schedule Adjusted)"),
tableOutput("data2"),
titlePanel("Tree Score"),
tableOutput("data5"),
titlePanel("Team 1's last 10 games"),
tableOutput("data3"),
titlePanel("Team 2's last 10 games"),
tableOutput("data4"),
),
server = function(input, output) {
output$data1 <- renderTable({
compare(input$team1, input$team2)
}, rownames = TRUE)
output$data2 <- renderTable({
score(input$team1, input$team2)
}, rownames = TRUE)
output$data3 <- renderTable({
games_last_ten(input$team1)
}, rownames = TRUE)
output$data4 <- renderTable({
games_last_ten(input$team2)
}, rownames = TRUE)
output$data5 <- renderTable({
tree_score(input$team1, input$team2)
}, rownames = TRUE)
}
)
It is not easy to predict who will win games from any set of statistics. Sometimes it comes down to match-ups, and no model can account for human error: good players sometimes have bad games, which is hard to predict. Overall, I found points per possession to be a very good indicator of a team's success. That was my greatest takeaway, and I wonder why it isn't a more widely discussed statistic; it makes more sense to use than points per game.
In my first code chunk, multiple parts are hard-coded to a specific year, which makes my work a little less reproducible. Since March Madness only happens once a year, though, I don't think it would be too difficult to update the code as the years go by. Additionally, I used game logs from a single season for only about 20 teams; the models could change drastically if I included every NCAA D1 team, although that would be a very large data set.
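One small step toward reproducibility would be parameterizing the season rather than hard-coding it into file names (a sketch, assuming future files follow the same naming pattern):
season <- 2020
import1 <- read_csv(paste0(season, '_sports_reference.csv'))
advanced <- read_csv(paste0('advanced', season, '.csv'))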
I think this app could be better if it took into account historical data on March Madness and was able to predict a whole bracket.