Every spring, college basketball fans in the US converge to watch a single-elimination tournament of the 64 best teams to determine the national champion.
A guessing game has grown up around the tournament: a person picks the winner of every game and earns points based on how many picks are correct and the round in which each correct pick occurs. The catch is that the guesser has to submit an entire bracket up front and cannot wait to see the results of the first round before picking the second round, and so on.
This paper explores three strategies that a person with no knowledge of American college basketball could use to pick winners for their tournament entry.
* Pure-Naive - pick each winner at random
* Smart-Naive - always pick the better-seeded team in each game
* Historic - use historic brackets to predict future results
The result was that the Smart-Naive approach performed significantly better than both the Historic and Pure-Naive entries.
When 1000 entries were generated with each approach and the average scores were graphed across all years, Smart-Naive is clearly better, and the year-on-year results of Smart-Naive and Historic are highly correlated.
Note that the Historic entries never do better than Smart-Naive, on average, in any year.
Chart of the average score of 1000 entries for each strategy over all years
This data is hosted on Kaggle. I use only a subset of the data because my approach is not based on the individual colleges, only on the rankings (seeds) that are pre-assigned to them.
Kaggle runs an annual competition around this tournament, but the purpose of that competition is different from the hypothesis that I am testing here.
If someone were playing the March Madness game (a.k.a. picking brackets), this is the bracket they would be given to fill out.
2018 NCAA March Madness starting brackets
You will notice that it is highly structured.
There are four regions that I’ve labelled A, B, C, and D.
Each team in each region is ranked (seeded) from 1 to 16 based on its perceived strength, with 1 being the strongest.
The teams in each region play each other until there is a champion.
The champions of regions A & B play each other, and the champions of regions C & D play each other, in round 5. The winners of those two games meet in round 6 to decide the overall champion.
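For reference, the first-round matchups within each region follow a fixed pattern (the same pairing used later in the bracket-generating code): 1 plays 16, 8 plays 9, 5 plays 12, 4 plays 13, 6 plays 11, 3 plays 14, 7 plays 10, and 2 plays 15. A minimal sketch:
# First-round pairings within a region
round_1_pairs <- rbind(top    = c(1, 8, 5, 4, 6, 3, 7, 2),
                       bottom = c(16, 9, 12, 13, 11, 14, 10, 15))
colSums(round_1_pairs)   # every pairing's seeds sum to 17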
Each correct pick in round 1 is worth 1 point.
Each correct pick in round 2 is worth 2 points.
Each correct pick in round 3 is worth 4 points.
Each correct pick in round 4 is worth 8 points.
Each correct pick in round 5 is worth 16 points.
Each correct pick in round 6 is worth 32 points.
Points in the later rounds become very difficult to earn. For example, you may correctly predict two round-1 winners that then play each other in round 2; if you pick that round-2 game incorrectly, the “points-pipeline” for that branch is shut off, with no more points possible from it.
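To make the stakes concrete, here is a minimal sketch (my addition) of the points available in each round under this scheme; every round is worth 32 points in total, for a maximum of 192.
# Fewer games each round, but double the value per correct pick
games  <- c(32, 16, 8, 4, 2, 1)   # correct picks possible in rounds 1-6
values <- c(1, 2, 4, 8, 16, 32)   # points per correct pick
games * values        # 32 32 32 32 32 32
sum(games * values)   # 192, the maximum possible score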
This approach randomly selects a winner for each game. It is expected to perform very poorly because it gives a team seeded #16 the same chance of beating a #1 seed in the first round as the reverse. Until 2018, that upset had never happened.
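For a rough sense of scale (a sketch, not part of the original analysis): a coin-flip pick gives the #16 seed the same 50% chance as the #1 seed, and there are 63 games in the bracket, so the chance of a pure-naive entry being perfect is vanishingly small.
sample(c(1, 16), 1)   # either seed "wins" the 1-vs-16 game with probability 0.5
0.5^63                # ~1.1e-19, the chance of a perfect random bracket (63 games)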
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
data_dir <- "input"
# Import tourney seeds dataset
tourney_seeds <- read_csv(paste0(data_dir, "/TourneySeeds.csv")) %>%
rename(season = Season) %>%
rename(seed = Seed) %>%
rename(team = Team)
## Parsed with column specification:
## cols(
## Season = col_integer(),
## Seed = col_character(),
## Team = col_integer()
## )
# Import the seasons info
seasons <- read_csv(paste0(data_dir, "/Seasons.csv")) %>%
mutate(day_zero = mdy(Dayzero)) %>%
rename(season = Season) %>%
select(season, day_zero)
## Parsed with column specification:
## cols(
## Season = col_integer(),
## Dayzero = col_character(),
## Regionw = col_character(),
## Regionx = col_character(),
## Regiony = col_character(),
## Regionz = col_character()
## )
# Dayzero indicates the baseline date that is referenced in other data sets
# Read in tournament results
tourney_compact <- read_csv(paste0(data_dir, "/TourneyCompactResults.csv")) %>%
rename(season = Season) %>%
rename(day_num = Daynum) %>%
rename(winning_team = Wteam) %>%
select(season, day_num, winning_team)
## Parsed with column specification:
## cols(
## Season = col_integer(),
## Daynum = col_integer(),
## Wteam = col_integer(),
## Wscore = col_integer(),
## Lteam = col_integer(),
## Lscore = col_integer(),
## Wloc = col_character(),
## Numot = col_integer()
## )
# The round number is mysteriously missing from the results data, so this column
# will be used to assign round numbers to the 63 main-draw games (in date order).
round_df = tibble(round = c(rep(1, 32), rep(2, 16), rep(3, 8), rep(4, 4), rep(5, 2), 6))
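A quick sanity check (my addition) that these labels line up with the 63 main-draw games:
nrow(round_df)          # 63 games in the main draw
table(round_df$round)   # 32, 16, 8, 4, 2, 1 winners in rounds 1 through 6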
The data is not in a format well-suited for predictions based on seeds so I’ll have to do a little wrangling to get it into a format that can be used.
Note: I have turned many of these processes into functions that feed their output into the next process that uses the data as input. Some of the functions are lengthy but must be kept intact for this paper to generate. My recommendation is to understand the general approach first and then dive into the function syntax where you may have questions.
This takes all of the historic results and stores them for future processing.
This function has a couple of parameters where you can request either a dataframe or a list, and which rounds you want.
## Sequence through seasons generating 4 bracket dataframes for each season
# The id is assigned based on year and bracket that it came from
get_all_results <- function(return_type = c("df", "list"),
rounds = c("all", "1-4", "5-6")) {
# Resolve the multi-choice defaults to a single selection
return_type <- match.arg(return_type)
rounds <- match.arg(rounds)
tc_w_seeds_list <- list()
for (l in 1:(nrow(seasons) - 1)) {
# Get the season and day_zero to work with in this loop
this_season <- unlist(seasons[l, 1])
this_day_zero <- seasons[l, 2]
# Add date-played
tc_w_date_played <- tourney_compact %>%
filter(season == this_season) %>%
mutate(date_played = this_day_zero$day_zero + day_num)
# Which teams were assigned to each seed this season?
this_season_seeds <- tourney_seeds %>%
filter(season == this_season)
# Join seeds to main data
tc_w_region_seeds <- tc_w_date_played %>%
left_join(this_season_seeds, by = c("winning_team" = "team")) %>%
select(-season.y) %>%
rename(season = season.x) %>%
rename(region_seed = seed) %>%
select(season, date_played, region_seed)
# Some preliminary games are shown in the data.
# These are always the first games and are insignificant to our modelling.
# Remove preliminary games
t <- nrow(tc_w_region_seeds) - 63
if (t > 0) {
s <- seq(1, t)
tc_prelim_trim <- tc_w_region_seeds[-s,]
} else {
tc_prelim_trim <- tc_w_region_seeds
}
# Some teams from the prelim games won a game in the main tourney.
# They are indicated by a suffix of a or b.
# Remove the suffix because no other seeds with that ID will appear.
tc_w_suffix_trimmed <- tc_prelim_trim %>%
mutate(region_seed = str_replace(region_seed, "[ab]", ""))
# Results are already in date order, so bind the round numbers to the df
tc_w_rounds <- tc_w_suffix_trimmed %>%
bind_cols(round_df) %>%
select(season, round, region_seed)
# Break out the region and rename to A, B, C, and D
tc_w_regions <- tc_w_rounds %>%
mutate(region_seed = str_replace_all(region_seed, c(
"W" = "A",
"X" = "B",
"Y" = "C",
"Z" = "D"
))) %>%
mutate(region = str_extract(region_seed, "[ABCD]"))
# Break out the seed and drop region_seed
# We'll add region_seed back in a better format
tc_w_seeds <- tc_w_regions %>%
mutate(seed = str_extract(region_seed, "[01][0-9]")) %>%
mutate(seed = as.integer(seed)) %>%
select(-region_seed) %>%
mutate(region_seed = paste0(region, "-", seed))
# Filter based on rounds selected
if(rounds == "all") {
tc_w_seeds_filtered <- tc_w_seeds
} else if(rounds == "1-4") {
tc_w_seeds_filtered <- tc_w_seeds %>%
filter(round <= 4)
} else if(rounds == "5-6") {
tc_w_seeds_filtered <- tc_w_seeds %>%
filter(round >= 5)
} else {
print("Please include the rounds parameter of all, 1-4, or 5-6")
}
tc_w_seeds_list[[l]] <- tc_w_seeds_filtered
}
tc_w_seeds_df <- do.call(rbind, tc_w_seeds_list)
if(return_type == "df") {
return(tc_w_seeds_df)
} else if(return_type == "list") {
return(tc_w_seeds_list)
} else {
print("Please specify a return type of df or list")
}
}
After many attempts to work with this data in dataframes, I decided that lists were better suited to it.
Note: I probably worked more with lists and functions in this experiment than in all of my previous analysis combined.
rounds_1_4 <- get_all_results(return_type = "list", rounds = "1-4")
Each record only shows the winning region_seeds in each round. This will be used later to score predictions vs actuals.
rounds_1_4[[1]]
## # A tibble: 60 x 5
## season round region seed region_seed
## <int> <dbl> <chr> <int> <chr>
## 1 1985 1. B 9 B-9
## 2 1985 1. D 11 D-11
## 3 1985 1. A 1 A-1
## 4 1985 1. C 9 C-9
## 5 1985 1. D 3 D-3
## 6 1985 1. B 12 B-12
## 7 1985 1. C 5 C-5
## 8 1985 1. A 4 A-4
## 9 1985 1. D 2 D-2
## 10 1985 1. D 7 D-7
## # ... with 50 more rows
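Each season's element should hold 60 winners for rounds 1-4 (32 + 16 + 8 + 4); a quick check (my addition):
map_int(rounds_1_4, nrow)   # 60 rows for each season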
This function gets the actual results of a single season’s tournament.
# Put actuals for a year into same format as predictions
get_reference_season <- function(my_season = 1985) {
actual_1_4 <- get_all_results(return_type = "df", rounds = "1-4")
actual_5_6 <- get_all_results(return_type = "df", rounds = "5-6")
final_results <- rbind(actual_1_4, actual_5_6)
this_season <- final_results %>%
filter(season == my_season)
rounds <- list()
for (i in 1:6) {
rounds[[i+1]] <- this_season %>%
filter(round == i) %>%
select(region_seed) %>%
arrange(region_seed) %>%
unlist()
}
rounds[[1]] <- unique(this_season$season)
for(j in 1:7) {
attributes(rounds[[j]]) <- NULL
}
names(rounds) <- c("id", "round_1", "round_2", "round_3",
"round_4", "round_5", "round_6")
return(rounds)
}
In this case I intentionally chose 2007 because a #1 ranked team won the whole tournament. This will create some score variance in the smart-naive scores later.
reference_season <- get_reference_season(my_season = 2007)
reference_season
## $id
## [1] 2007
##
## $round_1
## [1] "A-1" "A-2" "A-3" "A-4" "A-5" "A-6" "A-7" "A-9" "B-1" "B-2"
## [11] "B-3" "B-4" "B-5" "B-6" "B-7" "B-9" "C-1" "C-11" "C-2" "C-3"
## [21] "C-4" "C-5" "C-7" "C-9" "D-1" "D-11" "D-2" "D-3" "D-4" "D-5"
## [31] "D-7" "D-8"
##
## $round_2
## [1] "A-1" "A-2" "A-5" "A-6" "B-1" "B-2" "B-3" "B-5" "C-1" "C-3" "C-5"
## [12] "C-7" "D-1" "D-2" "D-3" "D-4"
##
## $round_3
## [1] "A-1" "A-2" "B-1" "B-2" "C-1" "C-3" "D-1" "D-2"
##
## $round_4
## [1] "A-2" "B-1" "C-1" "D-2"
##
## $round_5
## [1] "B-1" "C-1"
##
## $round_6
## [1] "C-1"
This function defaults to creating 1 entry, but any number can be specified.
It is very verbose, sometimes intentionally and sometimes not.
Winners are randomized at every step of the way.
Results are expected to be very poor.
We’ll start using randomization at this point so I’ll set the seed for reproducibility.
set.seed(317)
## Function code
create_pure_naive_entries <- function(num_entries = 1) {
final_entries <- list()
#=============================
# ROUND 1 SETUP
#============================
# 8 games with team 1 playing team 2, pairwise
for (n in 1:num_entries) {
final_df <- tibble()
# These seeds always play each other, pairwise, in the first round.
team_1_seed <- c(1, 8, 5, 4, 6, 3, 7, 2)
team_2_seed <- c(16, 9, 12, 13, 11, 14, 10, 15)
# Sequence through regions and generate random winners
# Generate round 1 winners
round_results_list <- list()
this_region <- c(A = 1,
B = 2,
C = 3,
D = 4)
for (r in this_region) {
winner <- as.integer()
for (i in 1:8) {
winner[i] <- sample(c(team_1_seed[i], team_2_seed[i]), 1)
#print(winner[i])
}
round_results_list[[r]] <- tibble(
round = 1,
region = names(this_region[r]),
seed = winner
)
}
round_df <- do.call(rbind, round_results_list)
final_df <- round_df
#========================
# SETUP ROUND 2
#========================
# In order to traverse the bracket correctly I determine team_1 and team_2 this way...
region_round_list <- list()
round_results_list <- list()
for (r in this_region) {
region_round_list[[r]] <- round_df %>%
filter(region == names(this_region[r])) %>%
select(seed) %>%
unlist()
# Get every other record and assign to team 1 and team 2
team_1_seed <- region_round_list[[r]][c(1, 3, 5, 7)]
team_2_seed <- region_round_list[[r]][c(2, 4, 6, 8)]
winner <- as.integer()
for (i in 1:4) {
winner[i] <- sample(c(team_1_seed[i], team_2_seed[i]), 1)
}
#this_round <- rep(2, 4)
round_results_list[[r]] <- tibble(
round = 2,
region = names(this_region[r]),
seed = winner
)
}
round_df <- do.call(rbind, round_results_list)
final_df <- rbind(final_df, round_df)
#=====================================
# SETUP ROUND 3
#====================================
region_round_list <- list()
round_results_list <- list()
for (r in this_region) {
region_round_list[[r]] <- round_df %>%
filter(region == names(this_region[r])) %>%
select(seed) %>%
unlist()
team_1_seed <- region_round_list[[r]][c(1, 3)]
team_2_seed <- region_round_list[[r]][c(2, 4)]
winner <- as.integer()
for (i in 1:2) {
winner[i] <- sample(c(team_1_seed[i], team_2_seed[i]), 1)
}
#this_round <- rep(3, 2)
round_results_list[[r]] <- tibble(
round = 3,
region = names(this_region[r]),
seed = winner
)
}
round_df <- do.call(rbind, round_results_list)
final_df <- rbind(final_df, round_df)
#===============================
# SETUP ROUND 4
#===============================
# Determine regional champions
#this_round <- 4
region_round_list <- list()
round_results_list <- list()
for (r in this_region) {
region_round_list[[r]] <- round_df %>%
filter(region == names(this_region[r])) %>%
select(seed) %>%
unlist()
team_1_seed <- region_round_list[[r]][1]
team_2_seed <- region_round_list[[r]][2]
winner <- as.integer()
winner[r] <- sample(c(team_1_seed, team_2_seed), 1)
round_results_list[[r]] <- tibble(
round = 4,
region = names(this_region[r]),
seed = winner[r]
)
}
round_df <- do.call(rbind, round_results_list)
final_df <- rbind(final_df, round_df)
#=============================================
# SETUP ROUND 5
#=============================================
# Round 5 has the champions of region A & B playing each other
# and the champions of C & D playing each other.
# We need to append the region to the seed now.
round_results_list <- list()
final_df_2 <- final_df %>%
mutate(region_seed = paste0(region, "-", seed))
ab_game <- final_df_2 %>%
filter(round == 4) %>%
filter(region %in% c("A", "B")) %>%
select(region_seed) %>%
sample_n(1) %>%
unlist()
this_seed <- str_replace(ab_game, "^[ABCD]-", "")
round_results_list[[1]] <- tibble(
round = 5,
region = "AB",
seed = as.integer(str_replace(ab_game, "^[ABCD]-", "")),
region_seed = ab_game
)
cd_game <- final_df_2 %>%
filter(round == 4) %>%
filter(region %in% c("C", "D")) %>%
select(region_seed) %>%
sample_n(1) %>%
unlist()
round_results_list[[2]] <- tibble(
round = 5,
region = "CD",
seed = as.integer(str_replace(cd_game, "^[ABCD]-", "")),
region_seed = cd_game
)
round_df <- do.call(rbind, round_results_list)
final_df_2 <- rbind(final_df_2, round_df)
#===========================
# SETUP ROUND 6
#===========================
round_results_list <- list()
abcd_game <- final_df_2 %>%
filter(round == 5) %>%
select(region_seed) %>%
sample_n(1) %>%
unlist()
round_results_list[[1]] <- tibble(
round = 6,
region = "ABCD",
seed = as.integer(str_replace(abcd_game, "^[ABCD]-", "")),
region_seed = abcd_game
)
round_df <- do.call(rbind, round_results_list)
final_df_2 <- rbind(final_df_2, round_df)
#==================================
# CONVERT TO FORMAT USED BY SCORER
#==================================
# entry_num = n
#final_entry <- list()
round_results_list <- list()
for (i in 1:6) {
round_results_list[[i]] <- final_df_2 %>%
filter(round == i) %>%
select(region_seed) %>%
unlist()
#print(round_results_list[[i]])
}
for (j in 1:6) {
attributes(round_results_list[[j]]) <- NULL
}
final_entries[[n]] <- list(
id = n,
round_1 = round_results_list[[1]],
round_2 = round_results_list[[2]],
round_3 = round_results_list[[3]],
round_4 = round_results_list[[4]],
round_5 = round_results_list[[5]],
round_6 = round_results_list[[6]]
)
}
return(final_entries)
}
my_pure_naive_entries <- create_pure_naive_entries(1000)
This is what the first entry looks like.
my_pure_naive_entries[[1]]
## $id
## [1] 1
##
## $round_1
## [1] "A-1" "A-8" "A-12" "A-13" "A-11" "A-3" "A-7" "A-15" "B-1" "B-8"
## [11] "B-12" "B-13" "B-11" "B-14" "B-10" "B-15" "C-16" "C-8" "C-5" "C-13"
## [21] "C-11" "C-14" "C-10" "C-2" "D-1" "D-9" "D-12" "D-4" "D-11" "D-14"
## [31] "D-10" "D-2"
##
## $round_2
## [1] "A-8" "A-13" "A-3" "A-7" "B-8" "B-12" "B-14" "B-10" "C-8" "C-13"
## [11] "C-11" "C-2" "D-1" "D-12" "D-11" "D-2"
##
## $round_3
## [1] "A-13" "A-7" "B-12" "B-14" "C-13" "C-2" "D-12" "D-11"
##
## $round_4
## [1] "A-13" "B-14" "C-2" "D-12"
##
## $round_5
## [1] "B-14" "D-12"
##
## $round_6
## [1] "B-14"
In this case the id is the entry number rather than the reference year we saw in the previous function.
The rest of the format is exactly the same as the reference-year format: a list of winning predictions/actuals for each round.
Later, the scoring engine will count the elements in the intersection of each predicted round and the corresponding actual round and assign points accordingly.
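A minimal sketch of that scoring idea, using hypothetical picks rather than data from the analysis:
predicted <- c("A-1", "A-8", "B-1", "C-5")   # hypothetical round-1 picks
actual    <- c("A-1", "A-2", "B-1", "C-9")   # hypothetical round-1 winners
length(intersect(predicted, actual)) * 1     # 2 correct picks x 1 point = 2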
The scoring engine assigns points based on the number of correct selections in each round.
Almost everyone uses the same scoring system in their game:
Round 1 win = 1 point
Round 2 win = 2 points
Round 3 win = 4 points
Round 4 win = 8 points
Round 5 win = 16 points
Round 6 win = 32 points
I accomplish this by counting the number of elements in each round that are in the corresponding reference season and applying a multiplier. AUTHOR’S NOTE: In high school I thought I would never use Set Theory in real life, and here I am ;-)
The inputs are:
* Entry list
* Reference Season results
score_entries <- function(my_entries, reference_season) {
entry_results_list <- list()
for (i in 1:length(my_entries)) {
entry_results_list[[i]] <- tibble(
entry_num = i,
rnd_1_pts = length(intersect(
my_entries[[i]]$round_1, reference_season$round_1
)),
rnd_2_pts = length(intersect(
my_entries[[i]]$round_2, reference_season$round_2
)) * 2,
rnd_3_pts = length(intersect(
my_entries[[i]]$round_3, reference_season$round_3
)) * 4,
rnd_4_pts = length(intersect(
my_entries[[i]]$round_4, reference_season$round_4
)) * 8,
rnd_5_pts = length(intersect(
my_entries[[i]]$round_5, reference_season$round_5
)) * 16,
rnd_6_pts = length(intersect(
my_entries[[i]]$round_6, reference_season$round_6
)) * 32,
total_points = sum(
rnd_1_pts,
rnd_2_pts,
rnd_3_pts,
rnd_4_pts,
rnd_5_pts,
rnd_6_pts
))
}
# Combine the per-entry rows once, after the loop (same result, less rework)
entry_results_final <- do.call(rbind, entry_results_list)
return(entry_results_final)
}
my_pure_naive_results <- score_entries(my_pure_naive_entries, reference_season)
summary(my_pure_naive_results$total_points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 23.00 28.00 30.63 36.00 103.00
sd(my_pure_naive_results$total_points)
## [1] 11.82051
The vertical line represents the mean.
model_type <- c("pure-naive", "smart_naive", "historic")
ggplot(my_pure_naive_results, aes(x=total_points, fill = model_type[1], color = model_type[1])) +
geom_density(aes(alpha = 0.5)) +
geom_vline(aes(xintercept = mean(my_pure_naive_results$total_points),
color = model_type[1])) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "none")
A definite right skew, similar to a Poisson distribution.
In order to understand this method I want to re-acquaint you with the brackets that the game participant must fill out in full before the tournament starts.
2018 NCAA March Madness starting brackets
Remember that the Smart-Naive person has no knowledge of US college basketball but uses the information provided to make selections.
In this case they notice that in each region someone has gone to the trouble of ranking the teams from 1 to 16. That person probably knows a lot more about college basketball than they do. They also understand that the #1 ranked team has the highest perceived strength and the #16 team the lowest.
They reason that they should always pick the higher-ranked team in each game, which causes the first four rounds to always play out the same way.
Round 1 winners: #1-8
Round 2 winners: #1-4
Round 3 winners: #1-2
Round 4 winner: #1
Since the Smart-Naive person has no additional information to work with in round 5, they randomly select the winner of the region A vs. B game and the region C vs. D game. They also randomly select the round 6 winner.
create_smart_naive_entries <- function(num_entries = 1) {
# Every region in rounds 1-4 is the same since the smart-naive person always
# selects the lowest seed number, because that seed is assumed to have the most strength
# Setup rounds 1 - 4
sn_round_1 <- c(paste0("A-", 1:8),
paste0("B-", 1:8),
paste0("C-", 1:8),
paste0("D-", 1:8))
sn_round_2 <- c(paste0("A-", 1:4),
paste0("B-", 1:4),
paste0("C-", 1:4),
paste0("D-", 1:4))
sn_round_3 <- c(paste0("A-", 1:2),
paste0("B-", 1:2),
paste0("C-", 1:2),
paste0("D-", 1:2))
sn_round_4 <- c(paste0("A-", 1),
paste0("B-", 1),
paste0("C-", 1),
paste0("D-", 1))
sn_list <- list()
for (n in 1:num_entries) {
sn_list[[n]] <- list(
id = n,
round_1 = sn_round_1,
round_2 = sn_round_2,
round_3 = sn_round_3,
round_4 = sn_round_4
)
# Randomize round 5; since they are both seed #1
sn_round_5_ab <- sample(c("A-1", "B-1"), 1)
sn_round_5_cd <- sample(c("C-1", "D-1"), 1)
sn_list[[n]]$round_5 <- c(sn_round_5_ab, sn_round_5_cd)
# Randomize round 6; since they are both seed #1
sn_list[[n]]$round_6 <-
sample(c(sn_round_5_ab, sn_round_5_cd), 1)
}
return(sn_list)
}
my_smart_naive_entries <- create_smart_naive_entries(1000)
my_smart_naive_results <- score_entries(my_smart_naive_entries, reference_season)
summary(my_smart_naive_results$total_points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 93.0 93.0 109.0 116.9 141.0 157.0
sd(my_smart_naive_results$total_points)
## [1] 21.18821
These are pretty incredible results, which indicates that a lot of the top seeds won in the later rounds of this season. This season may be an outlier; we’ll see when I compare across all seasons.
Remember that the mean score of the Pure-Naive entries was 30.63.
#model_type <- c("pure-naive", "smart_naive", "historic")
ggplot(my_smart_naive_results, aes(x=total_points, fill = model_type[2], color = model_type[2])) +
geom_density(aes(alpha = 0.5)) +
geom_vline(aes(xintercept = mean(my_smart_naive_results$total_points),
color = model_type[2])) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "none")
Interesting that there are five distinct humps. Every Smart-Naive entry shares the same rounds 1-4, so entries differ only in their two round-5 picks and their round-6 pick, which leaves only five possible totals for this season.
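A quick check of that claim (my addition): all Smart-Naive entries score the same 93 points in rounds 1-4 for 2007 (the Min in the summary above), so enumerating the eight possible round 5/6 pick combinations against the 2007 reference shows exactly five possible totals.
base_pts <- 93   # rounds 1-4 score shared by every smart-naive entry in 2007
totals <- c()
for (ab in c("A-1", "B-1")) {
  for (cd in c("C-1", "D-1")) {
    for (champ in c(ab, cd)) {
      pts_5 <- 16 * length(intersect(c(ab, cd), reference_season$round_5))
      pts_6 <- 32 * length(intersect(champ, reference_season$round_6))
      totals <- c(totals, base_pts + pts_5 + pts_6)
    }
  }
}
sort(unique(totals))   # 93 109 125 141 157 -- one score per hump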
This approach is based on the theory that past performance can be used to predict future results.
Note: This should not be considered machine learning since there is no mechanism for the algorithm to get better.
The function below is very long. Its major pieces are:
* build a library of historic regional results (rounds 1-4) for every season and region,
* sample four of those historic regional brackets (with replacement) and relabel them A-D to form rounds 1-4 of an entry, and
* pick the round 5 and round 6 winners by weighting each regional champion by how often its seed has historically won at that stage (a simplified sketch of this weighting follows, before the full function).
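Here is that simplified sketch of the round 5/6 weighting, with hypothetical win counts (x_wins and y_wins are made-up numbers, not taken from the data):
x_wins <- 10   # hypothetical: times seed X has won at this stage historically
y_wins <- 5    # hypothetical: times seed Y has won at this stage
# Pick X with probability proportional to its share of historical wins
if (runif(1) <= x_wins / (x_wins + y_wins)) "X" else "Y"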
create_historic_entries <- function(num_entries = 1) {
all_results <- get_all_results(return_type = "df", rounds = "1-4")
bracket_list <- list()
master_list <- list()
season_list <- list()
# Create incrementer for master_list
j <- 1
for (this_season in unique(all_results$season)) {
season_results <- all_results %>%
filter(season == this_season)
for (i in 1:4) {
if (i == 1) {
this_region = "A"
} else if (i == 2) {
this_region = "B"
} else if (i == 3) {
this_region = "C"
} else if (i == 4) {
this_region = "D"
}
region_results <- season_results %>%
filter(region == this_region)
round_list <- list()
id <- all_results %>%
filter(season == this_season) %>%
filter(region == this_region) %>%
mutate(id = paste0(season, "-", this_region)) %>%
select(id) %>%
unique() %>%
unlist()
round_list[[1]] <- all_results %>%
filter(season == this_season) %>%
filter(round == 1) %>%
filter(region == this_region) %>%
select(seed) %>%
arrange(seed)
round_list[[2]] <- all_results %>%
filter(season == this_season) %>%
filter(round == 2) %>%
filter(region == this_region) %>%
select(seed) %>%
arrange(seed)
round_list[[3]] <- all_results %>%
filter(season == this_season) %>%
filter(round == 3) %>%
filter(region == this_region) %>%
select(seed) %>%
arrange(seed)
round_list[[4]] <- all_results %>%
filter(season == this_season) %>%
filter(round == 4) %>%
filter(region == this_region) %>%
select(seed) %>%
arrange(seed)
region_list <- list(
id = id,
round_1 = round_list[[1]]$seed,
round_2 = round_list[[2]]$seed,
round_3 = round_list[[3]]$seed,
round_4 = round_list[[4]]$seed
)
master_list[[j]] <- region_list
j = j + 1
}
}
round_5_6 <- get_all_results(return_type = "df", rounds = "5-6")
all_seeds <- tibble(seed = 1:16)
round_5_stats <- round_5_6 %>%
filter(round == 5) %>%
group_by(seed) %>%
summarize(number = n()) %>%
right_join(all_seeds, by = "seed") %>%
replace_na(list(number = 0))
round_6_stats <- round_5_6 %>%
filter(round == 6) %>%
group_by(seed) %>%
summarize(number = n()) %>%
right_join(all_seeds, by = "seed") %>%
replace_na(list(number = 0))
#==============================
# ROUNDS 1-4
#==============================
entry_list <- list()
for (j in 1:num_entries) {
regional_bracket_list <- list()
# Get four historic brackets by sampling the master list
regional_bracket_list <-
sample(master_list, 4, replace = TRUE)
# For each element we need to append the region code: A, B, C, or D
for (i in 1:4) {
if (i == 1) {
this_region = "A"
} else if (i == 2) {
this_region = "B"
} else if (i == 3) {
this_region = "C"
} else if (i == 4) {
this_region = "D"
}
regional_bracket_list[[i]]$round_1 <- paste0(this_region,
"-",
regional_bracket_list[[i]]$round_1)
regional_bracket_list[[i]]$round_2 <- paste0(this_region,
"-",
regional_bracket_list[[i]]$round_2)
regional_bracket_list[[i]]$round_3 <- paste0(this_region,
"-",
regional_bracket_list[[i]]$round_3)
regional_bracket_list[[i]]$round_4 <- paste0(this_region,
"-",
regional_bracket_list[[i]]$round_4)
}
# Condense to a single list for the final entry
entry_num <- j
this_round_1 <- regional_bracket_list %>%
map("round_1") %>%
unlist()
this_round_2 <- regional_bracket_list %>%
map("round_2") %>%
unlist()
this_round_3 <- regional_bracket_list %>%
map("round_3") %>%
unlist()
this_round_4 <- regional_bracket_list %>%
map("round_4") %>%
unlist()
final_entry <- list(
id = entry_num,
round_1 = this_round_1,
round_2 = this_round_2,
round_3 = this_round_3,
round_4 = this_round_4
)
#===============================
# ROUND 5
#===============================
a_winner <- unlist(regional_bracket_list[[1]][5])
b_winner <- unlist(regional_bracket_list[[2]][5])
c_winner <- unlist(regional_bracket_list[[3]][5])
d_winner <- unlist(regional_bracket_list[[4]][5])
a_df <- tibble(
ref_bracket = unlist(regional_bracket_list[[1]][1]),
winner = unlist(regional_bracket_list[[1]][5]),
seed = as.integer(str_replace(winner, "^[ABCD]-", "")),
region = str_extract(winner, "^[ABCD]")
)
b_df <- tibble(
ref_bracket = unlist(regional_bracket_list[[2]][1]),
winner = unlist(regional_bracket_list[[2]][5]),
seed = as.integer(str_replace(winner, "^[ABCD]-", "")),
region = str_extract(winner, "^[ABCD]")
)
c_df <- tibble(
ref_bracket = unlist(regional_bracket_list[[3]][1]),
winner = unlist(regional_bracket_list[[3]][5]),
seed = as.integer(str_replace(winner, "^[ABCD]-", "")),
region = str_extract(winner, "^[ABCD]")
)
d_df <- tibble(
ref_bracket = unlist(regional_bracket_list[[4]][1]),
winner = unlist(regional_bracket_list[[4]][5]),
seed = as.integer(str_replace(winner, "^[ABCD]-", "")),
region = str_extract(winner, "^[ABCD]")
)
ab_round_5 <- rbind(a_df, b_df)
# Does A win; this is binary so if A doesn't win then B wins
# join in round 5 stats
ab_final <- ab_round_5 %>%
left_join(round_5_stats, by = "seed")
if (ab_final$number[1] == ab_final$number[2]) {
ab_winner <- sample(c(ab_final$winner[1], ab_final$winner[2]), 1)
} else if (ab_final$number[1] == 0) {
ab_winner <- ab_final$winner[2]
} else if (ab_final$number[2] == 0) {
ab_winner <- ab_final$winner[1]
} else {
a_num <- ab_final$number[1]
b_num <- ab_final$number[2]
a_pct <- a_num / (a_num + b_num)
if (runif(1) <= a_pct) {
ab_winner <- ab_final$winner[1]
} else {
ab_winner <- ab_final$winner[2]
}
}
ab_winner_record <- ab_final %>%
filter(winner == ab_winner) %>%
select(-number)
# Do the same for C & D
cd_round_5 <- rbind(c_df, d_df)
cd_final <- cd_round_5 %>%
left_join(round_5_stats, by = "seed")
if (cd_final$number[1] == cd_final$number[2]) {
cd_winner <- sample(c(cd_final$winner[1], cd_final$winner[2]), 1)
} else if (cd_final$number[1] == 0) {
cd_winner <- cd_final$winner[2]
} else if (cd_final$number[2] == 0) {
cd_winner <- cd_final$winner[1]
} else {
c_num <- cd_final$number[1]
d_num <- cd_final$number[2]
c_pct <- c_num / (c_num + d_num)
if (runif(1) <= c_pct) {
cd_winner <- cd_final$winner[1]
} else {
cd_winner <- cd_final$winner[2]
}
}
cd_winner_record <- cd_final %>%
filter(winner == cd_winner) %>%
select(-number)
# Append round 5 winners to final entry
final_entry$round_5 <- c(ab_winner_record$winner,
cd_winner_record$winner)
#==================================
# ROUND 6
#==================================
abcd_final <- rbind(ab_winner_record, cd_winner_record) %>%
left_join(round_6_stats, by = "seed")
# Check for special conditions
if (abcd_final$number[1] == abcd_final$number[2]) {
abcd_winner <-
sample(c(abcd_final$winner[1], abcd_final$winner[2]), 1)
} else if (abcd_final$number[1] == 0) {
abcd_winner <- abcd_final$winner[2]
} else if (abcd_final$number[2] == 0) {
abcd_winner <- abcd_final$winner[1]
} else {
ab_num <- abcd_final$number[1]
cd_num <- abcd_final$number[2]
ab_pct <- ab_num / (ab_num + cd_num)
if (runif(1) <= ab_pct) {
abcd_winner <- abcd_final$winner[1]
} else {
abcd_winner <- abcd_final$winner[2]
}
}
abcd_winner_record <- abcd_final %>%
filter(winner == abcd_winner) %>%
select(-number)
# Append round 6 winner to final_entry
final_entry$round_6 <- abcd_winner_record$winner
entry_list[[j]] <- final_entry
}
return(entry_list)
}
my_historic_entries <- create_historic_entries(1000)
my_historic_results <- score_entries(my_historic_entries, reference_season)
summary(my_historic_results$total_points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.00 62.00 77.00 83.08 100.00 168.00
sd(my_historic_results$total_points)
## [1] 27.34004
WOW!! A max of 168 out of 192 total points possible. That is a very high score.
model_type <- c("pure-naive", "smart_naive", "historic")
ggplot(my_historic_results, aes(x=total_points, fill = model_type[3], color = model_type[3])) +
geom_density(aes(alpha = 0.5)) +
geom_vline(aes(xintercept = mean(my_historic_results$total_points),
color = model_type[3])) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "none")
Right-skewed, with a mean of 83.08.
t.test(my_smart_naive_results$total_points, my_historic_results$total_points)
##
## Welch Two Sample t-test
##
## data: my_smart_naive_results$total_points and my_historic_results$total_points
## t = 30.883, df = 1880.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 31.63479 35.92521
## sample estimates:
## mean of x mean of y
## 116.856 83.076
Yes, the difference is significant. The p-value is well below 0.05 so for this year the Smart-Naive approach is significantly better than the Historic approach.
The data includes results since 1985, which gives us 32 seasons (1985-2016) to test whether, on average, there really is a significant difference between the two methods.
all_entries <- list(pure_naive = my_pure_naive_entries,
smart_naive = my_smart_naive_entries,
historic_entries = my_historic_entries)
This function gathers all of the historical results into a single list.
# Create function to build list of the results of every season
create_reference_season_list <- function() {
reference_season_list <- list()
#start_time <- Sys.time()
j <- 1
for(i in 1985:2016) {
#print(paste("Building", i, "season ..."))
reference_season_list[[j]] <- get_reference_season(my_season = i)
#print(paste("Season", i , "complete"))
j <- j + 1
}
#end_time <- Sys.time()
#print(end_time - start_time)
return(reference_season_list)
}
reference_season_list <- create_reference_season_list()
reference_season_list[[1]]
## $id
## [1] 1985
##
## $round_1
## [1] "A-1" "A-2" "A-3" "A-4" "A-5" "A-6" "A-7" "A-8" "B-1" "B-11"
## [11] "B-12" "B-2" "B-3" "B-4" "B-7" "B-9" "C-1" "C-11" "C-2" "C-3"
## [21] "C-4" "C-5" "C-7" "C-9" "D-1" "D-11" "D-13" "D-2" "D-3" "D-5"
## [31] "D-7" "D-8"
##
## $round_2
## [1] "A-1" "A-2" "A-3" "A-4" "B-1" "B-12" "B-3" "B-7" "C-1" "C-11"
## [11] "C-2" "C-5" "D-11" "D-2" "D-5" "D-8"
##
## $round_3
## [1] "A-1" "A-2" "B-1" "B-3" "C-1" "C-2" "D-2" "D-8"
##
## $round_4
## [1] "A-1" "B-1" "C-2" "D-8"
##
## $round_5
## [1] "A-1" "D-8"
##
## $round_6
## [1] "D-8"
score_entry_lists <- function(my_entries) {
#start_time <- Sys.time()
#print(paste("Start time is:", start_time))
all_seasons_score_list <- list()
j <- 1
for (i in seq_along(reference_season_list)) {
all_scores_list <- list()
this_season <- reference_season_list[[i]]$id
#start_time <- Sys.time()
#print(paste("Pure-naive for", this_season, "started ..."))
my_pure_naive_scores <-
score_entries(my_entries[[1]], reference_season_list[[i]])
#end_time <- Sys.time()
#print("COMPLETE")
#print(end_time - start_time)
#start_time <- Sys.time()
#print(paste("Smart-naive for", this_season, "started ..."))
my_smart_naive_scores <-
score_entries(my_entries[[2]], reference_season_list[[i]])
# end_time <- Sys.time()
# print("COMPLETE")
# print(end_time - start_time)
#
# start_time <- Sys.time()
# print(paste("Historic for", this_season, "started ..."))
my_historic_scores <-
score_entries(my_entries[[3]], reference_season_list[[i]])
# end_time <- Sys.time()
# print("COMPLETE")
# print(end_time - start_time)
all_scores_list <- list(
pure_naive = my_pure_naive_scores,
smart_naive = my_smart_naive_scores,
historic_entries = my_historic_scores
)
all_scores_list[[1]]$season <- this_season
all_scores_list[[2]]$season <- this_season
all_scores_list[[3]]$season <- this_season
all_scores_list[[1]]$model <- "pure-naive"
all_scores_list[[2]]$model <- "smart-naive"
all_scores_list[[3]]$model <- "historic"
all_seasons_score_list[[j]] <- all_scores_list
j <- j + 1
}
all_seasons_score_final_list <- do.call(rbind, all_seasons_score_list)
all_seasons_score_final_df <- do.call(rbind, all_seasons_score_final_list)
#end_time <- Sys.time()
#print(end_time - start_time)
return(all_seasons_score_final_df)
}
This function takes a long time to run, so I load a saved version of its output (the results won’t cache across knits).
scored_entries <- score_entry_lists(all_entries)
scored_entries <- readRDS("scored_entries_v3.RDS")
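For reference, a sketch of how that cache can be produced (run once after a full, slow scoring pass; this is what the readRDS() call above relies on):
# scored_entries <- score_entry_lists(all_entries)   # slow: 3 models x 1000 entries x 32 seasons
# saveRDS(scored_entries, "scored_entries_v3.RDS")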
This function summarizes the output of the score_entry_lists() function.
summarize_scores <- function(my_scores) {
summarized_scores <- my_scores %>%
group_by(season, model) %>%
summarize(
avg_total_pts = mean(total_points),
avg_round_1 = mean(rnd_1_pts),
avg_round_2 = mean(rnd_2_pts),
avg_round_3 = mean(rnd_3_pts),
avg_round_4 = mean(rnd_4_pts),
avg_round_5 = mean(rnd_5_pts),
avg_round_6 = mean(rnd_6_pts)
) %>%
ungroup()
return(summarized_scores)
}
my_summarized_scores <- summarize_scores(scored_entries)
sample_n(my_summarized_scores, 10)
## # A tibble: 10 x 9
## season model avg_total_pts avg_round_1 avg_round_2 avg_round_3
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2007 smart-naive 117. 27.0 22.0 28.0
## 2 2004 pure-naive 31.2 15.8 7.85 4.06
## 3 1987 pure-naive 31.2 16.0 7.99 4.08
## 4 1986 pure-naive 30.7 16.0 7.88 3.93
## 5 1996 smart-naive 103. 23.0 24.0 24.0
## 6 2005 smart-naive 92.0 24.0 16.0 12.0
## 7 2010 smart-naive 78.0 22.0 16.0 16.0
## 8 1997 smart-naive 94.9 25.0 18.0 20.0
## 9 2009 historic 76.0 21.5 19.0 16.5
## 10 2015 pure-naive 32.2 15.9 8.05 3.97
## # ... with 3 more variables: avg_round_4 <dbl>, avg_round_5 <dbl>,
## # avg_round_6 <dbl>
my_summarized_scores %>%
group_by(model) %>%
summarize(model_avg = mean(avg_total_pts), model_sd = sd(avg_total_pts)) %>%
arrange(-model_avg)
## # A tibble: 3 x 3
## model model_avg model_sd
## <chr> <dbl> <dbl>
## 1 smart-naive 87.4 16.5
## 2 historic 66.4 7.69
## 3 pure-naive 31.5 0.597
smart_naive <- my_summarized_scores %>%
filter(model == "smart-naive")
historic <- my_summarized_scores %>%
filter(model == "historic")
t.test(smart_naive$avg_total_pts, historic$avg_total_pts)
##
## Welch Two Sample t-test
##
## data: smart_naive$avg_total_pts and historic$avg_total_pts
## t = 6.532, df = 43.817, p-value = 5.667e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 14.54988 27.53662
## sample estimates:
## mean of x mean of y
## 87.44275 66.39950
Yes, the difference is significant. The p-value again is far below 0.05 so we know that smart-naive is significantly better than historic year-on-year.
ggplot(scored_entries, aes(x=total_points, fill = model, color = model)) +
geom_density(aes(alpha = 0.5)) +
scale_color_brewer(palette = "Set1")
This chart clearly shows the performance of each model. Each model is significantly different from the others.
scored_entries %>%
arrange(-total_points) %>%
select(season, model, entry_num, total_points) %>%
head(10)
## # A tibble: 10 x 4
## season model entry_num total_points
## <int> <chr> <int> <dbl>
## 1 2007 historic 81 171.
## 2 1993 historic 538 163.
## 3 2008 smart-naive 13 162.
## 4 2008 smart-naive 23 162.
## 5 2008 smart-naive 31 162.
## 6 2008 smart-naive 50 162.
## 7 2008 smart-naive 56 162.
## 8 2008 smart-naive 83 162.
## 9 2008 smart-naive 104 162.
## 10 2008 smart-naive 109 162.
The top 2 entries are Historic and the rest of the top 10 are Smart-Naive entries from 2008. The committee that seeded the teams must have gotten it right in 2008.
There appears to be more upside potential with the Historic selections but on average Historic performs significantly worse than Smart-Naive.
The 2018 tournament just concluded, and the winner was a #1 seed, so Smart-Naive may have an advantage.
The 2018 tournament also had a couple of unusual events that made a perfect bracket nearly impossible:
* In Region A the #1 ranked team lost to the #16 ranked team in the first round. That had never happened before.
* Also in Region A, the #11 ranked team won the region. That has happened only three times in the 32 years (128 regional brackets) of data since 1985; the sketch below checks this against the data.
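A sketch (using the helper defined earlier) to check that claim by counting how often each seed won its region in the 1985-2016 data:
# Regional champions (round-4 winners) by seed across 1985-2016
get_all_results(return_type = "df", rounds = "1-4") %>%
  filter(round == 4) %>%
  count(seed, sort = TRUE)   # seed 11 should appear 3 times, per the bullet above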
Let’s see if smart-naive would win again in 2018.
round_1 <- c("A-16", "A-9", "A-5", "A-13", "A-11", "A-3", "A-7", "A-2",
"B-1", "B-9", "B-5", "B-4", "B-6", "B-3", "B-7", "B-2",
"C-1", "C-9", "C-5", "C-13", "C-6", "C-3", "C-10", "C-2",
"D-1", "D-8", "D-5", "D-4", "D-11", "D-3", "D-7", "D-2")
round_2 <- c("A-9", "A-5", "A-11", "A-7",
"B-9", "B-4", "B-3", "B-7",
"C-1", "C-5", "C-3", "C-2",
"D-1", "D-5", "D-11", "D-2")
round_3 <- c("A-9", "A-11",
"B-9", "B-3",
"C-1", "C-3",
"D-1", "D-2")
round_4 <- c("A-11",
"B-3",
"C-1",
"D-1")
round_5 <- c("B-3", "C-1")
round_6 <- "C-1"
reference_year <- list(id = 2018,
round_1 = round_1,
round_2 = round_2,
round_3 = round_3,
round_4 = round_4,
round_5 = round_5,
round_6 = round_6)
# Score each strategy's existing entries against the 2018 reference bracket built above
results_2018_pure_naive <- score_entries(my_pure_naive_entries, reference_year) %>%
mutate(model = "pure-naive")
results_2018_smart_naive <- score_entries(my_smart_naive_entries, reference_year) %>%
mutate(model = "smart-naive")
results_2018_historic <- score_entries(my_historic_entries, reference_year) %>%
mutate(model = "historic")
all_2018_results <- bind_rows(results_2018_pure_naive,
results_2018_smart_naive,
results_2018_historic)
all_2018_results %>%
group_by(model) %>%
summarize(model_avg = mean(total_points), model_sd = sd(total_points)) %>%
arrange(-model_avg)
## # A tibble: 3 x 3
## model model_avg model_sd
## <chr> <dbl> <dbl>
## 1 smart-naive 117. 21.2
## 2 historic 83.1 27.3
## 3 pure-naive 30.6 11.8
Again, Smart-Naive won by a large margin.
smart_naive <- all_2018_results %>%
filter(model == "smart-naive")
historic <- all_2018_results %>%
filter(model == "historic")
t.test(smart_naive$total_points, historic$total_points)
##
## Welch Two Sample t-test
##
## data: smart_naive$total_points and historic$total_points
## t = 30.883, df = 1880.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 31.63479 35.92521
## sample estimates:
## mean of x mean of y
## 116.856 83.076
Yes, the difference is significant. The p-value again is far below 0.05 so we know that smart-naive is significantly better than historic for 2018.
all_2018_results %>%
select(model, entry_num, total_points) %>%
arrange(-total_points) %>%
top_n(10)
## Selecting by total_points
## # A tibble: 126 x 3
## model entry_num total_points
## <chr> <int> <dbl>
## 1 historic 227 168.
## 2 historic 528 166.
## 3 historic 212 161.
## 4 historic 503 161.
## 5 historic 589 158.
## 6 historic 858 158.
## 7 smart-naive 1 157.
## 8 smart-naive 3 157.
## 9 smart-naive 4 157.
## 10 smart-naive 15 157.
## # ... with 116 more rows
Yes, Historic again produced the top individual entries in 2018.
ggplot(all_2018_results, aes(x=total_points, fill = model, color = model)) +
geom_density(aes(alpha = 0.5)) +
scale_color_brewer(palette = "Set1")
Again, we see distinct humps for Smart-Naive, which indicates that a #1 seed from one of the regions won the tournament.
Next year, when someone puts a tournament sheet in front of you, just pick all of the higher-ranked teams; you’ll save a lot of time and may even finish in 1st place, tied with everyone else using the Smart-Naive strategy.