Every spring, college basketball fans in the US converge to watch a single-elimination tournament of the 64 best teams to determine the national champion.
A guessing game has grown up around the tournament: a person picks the winner of every game and earns points based on how many picks are correct and the round in which each correct pick occurs. The catch is that the guesser has to submit an entire bracket up front and cannot wait to see the results of the first round before picking the second round, and so on.
This paper explores three strategies that a person with no knowledge of American college basketball could use to pick winners for their tournament entry.
* Pure-Naive - pick each winner at random
* Smart-Naive - always pick the better-seeded team in each game
* Historic - use historic brackets to predict future results
The result was that the Smart-Naive approach performed significantly better than both the Historic and Pure-Naive entries.
When 1000 entries were generated with each approach and the average scores were graphed across all years, Smart-Naive is clearly better, and the year-on-year results of Smart-Naive and Historic are highly correlated.
Note that the Historic entries never do better than Smart-Naive, on average, in any year.
Chart of the average score of 1000 entries for each strategy over all years
This data is hosted on Kaggle. I use only a subset of the data because my approach is not based on the individual colleges, only on the rankings (seeds) that are pre-assigned to them.
Kaggle runs an annual competition around this tournament, but the purpose of that competition is different from the hypothesis that I am testing here.
If someone were playing the March Madness game (a.k.a. picking brackets), this is the bracket they would be given to fill out.
2018 NCAA March Madness starting brackets
You will notice that it is highly structured.
There are four regions that I’ve labelled A, B, C, and D.
Each team in each region is ranked (seeded) from 1 to 16 based on its perceived strength, with 1 being the strongest.
The teams in each region play each other until there is a champion.
The champions of regions A & B play each other, and the champions of regions C & D play each other, in round 5. The winners of those two games meet in round 6 to decide the overall champion.
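For reference, the first-round matchups within each region follow a fixed pattern (the same pairing used later in the bracket-generating code): 1 plays 16, 8 plays 9, 5 plays 12, 4 plays 13, 6 plays 11, 3 plays 14, 7 plays 10, and 2 plays 15. A minimal sketch:
# First-round pairings within a region
round_1_pairs <- rbind(top    = c(1, 8, 5, 4, 6, 3, 7, 2),
                       bottom = c(16, 9, 12, 13, 11, 14, 10, 15))
colSums(round_1_pairs)   # every pairing's seeds sum to 17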
Each correct pick in round 1 is worth 1 point.
Each correct pick in round 2 is worth 2 points.
Each correct pick in round 3 is worth 4 points.
Each correct pick in round 4 is worth 8 points.
Each correct pick in round 5 is worth 16 points.
Each correct pick in round 6 is worth 32 points.
Points in the later rounds become very difficult to earn. For example, you may correctly predict two round-1 winners that then play each other in round 2; if you pick that round-2 game incorrectly, the “points-pipeline” for that branch is shut off, with no more points possible from it.
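To make the stakes concrete, here is a minimal sketch (my addition) of the points available in each round under this scheme; every round is worth 32 points in total, for a maximum of 192.
# Fewer games each round, but double the value per correct pick
games  <- c(32, 16, 8, 4, 2, 1)   # correct picks possible in rounds 1-6
values <- c(1, 2, 4, 8, 16, 32)   # points per correct pick
games * values        # 32 32 32 32 32 32
sum(games * values)   # 192, the maximum possible score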
This approach randomly selects a winner for each game. It is expected to perform very poorly because it gives a team seeded #16 the same chance of beating a #1 seed in the first round as the reverse. Until 2018, that upset had never happened.
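For a rough sense of scale (a sketch, not part of the original analysis): a coin-flip pick gives the #16 seed the same 50% chance as the #1 seed, and there are 63 games in the bracket, so the chance of a pure-naive entry being perfect is vanishingly small.
sample(c(1, 16), 1)   # either seed "wins" the 1-vs-16 game with probability 0.5
0.5^63                # ~1.1e-19, the chance of a perfect random bracket (63 games)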
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
data_dir <- "input"
# Import tourney seeds dataset
tourney_seeds <- read_csv(paste0(data_dir, "/TourneySeeds.csv")) %>%
rename(season = Season) %>%
rename(seed = Seed) %>%
rename(team = Team)
## Parsed with column specification:
## cols(
## Season = col_integer(),
## Seed = col_character(),
## Team = col_integer()
## )
# Import the seasons info
seasons <- read_csv(paste0(data_dir, "/Seasons.csv")) %>%
mutate(day_zero = mdy(Dayzero)) %>%
rename(season = Season) %>%
select(season, day_zero)
## Parsed with column specification:
## cols(
## Season = col_integer(),
## Dayzero = col_character(),
## Regionw = col_character(),
## Regionx = col_character(),
## Regiony = col_character(),
## Regionz = col_character()
## )
# Dayzero indicates the baseline date that is referenced in other data sets
# Read in tournament results
tourney_compact <- read_csv(paste0(data_dir, "/TourneyCompactResults.csv")) %>%
rename(season = Season) %>%
rename(day_num = Daynum) %>%
rename(winning_team = Wteam) %>%
select(season, day_num, winning_team)
## Parsed with column specification:
## cols(
## Season = col_integer(),
## Daynum = col_integer(),
## Wteam = col_integer(),
## Wscore = col_integer(),
## Lteam = col_integer(),
## Lscore = col_integer(),
## Wloc = col_character(),
## Numot = col_integer()
## )
# The round number is mysteriously missing from the results data, so this column
# will be used to assign round numbers to the 63 main-draw games (in date order).
round_df = tibble(round = c(rep(1, 32), rep(2, 16), rep(3, 8), rep(4, 4), rep(5, 2), 6))
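A quick sanity check (my addition) that these labels line up with the 63 main-draw games:
nrow(round_df)          # 63 games in the main draw
table(round_df$round)   # 32, 16, 8, 4, 2, 1 winners in rounds 1 through 6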
The data is not in a format well-suited for predictions based on seeds so I’ll have to do a little wrangling to get it into a format that can be used.
Note: I have turned many of these processes into functions that feed their output into the next process that uses the data as input. Some of the functions are lengthy but must be kept intact for this paper to generate. My recommendation is to understand the general approach first and then dive into the function syntax where you may have questions.
This takes all of the historic results and stores them for future processing.
This function has a couple of parameters where you can request either a dataframe or a list, and which rounds you want.
## Sequence through seasons generating 4 bracket dataframes for each season
# The id is assigned based on year and bracket that it came from
get_all_results <- function(return_type = c("df", "list"),
rounds = c("all", "1-4", "5-6")) {
# Resolve the multi-choice defaults to a single selection
return_type <- match.arg(return_type)
rounds <- match.arg(rounds)
tc_w_seeds_list <- list()
for (l in 1:(nrow(seasons) - 1)) {
# Get the season and day_zero to work with in this loop
this_season <- unlist(seasons[l, 1])
this_day_zero <- seasons[l, 2]
# Add date-played
tc_w_date_played <- tourney_compact %>%
filter(season == this_season) %>%
mutate(date_played = this_day_zero$day_zero + day_num)
# Which teams were assigned to each seed this season?
this_season_seeds <- tourney_seeds %>%
filter(season == this_season)
# Join seeds to main data
tc_w_region_seeds <- tc_w_date_played %>%
left_join(this_season_seeds, by = c("winning_team" = "team")) %>%
select(-season.y) %>%
rename(season = season.x) %>%
rename(region_seed = seed) %>%
select(season, date_played, region_seed)
# Some preliminary games are shown in the data.
# These are always the first games and are insignificant to our modelling.
# Remove preliminary games
t <- nrow(tc_w_region_seeds) - 63
if (t > 0) {
s <- seq(1, t)
tc_prelim_trim <- tc_w_region_seeds[-s,]
} else {
tc_prelim_trim <- tc_w_region_seeds
}
# Some teams from the prelim games won a game in the main tourney.
# They are indicated by a suffix of a or b.
# Remove the suffix because no other seeds with that ID will appear.
tc_w_suffix_trimmed <- tc_prelim_trim %>%
mutate(region_seed = str_replace(region_seed, "[ab]", ""))
# Results are already in date order, so bind the round numbers to the df
tc_w_rounds <- tc_w_suffix_trimmed %>%
bind_cols(round_df) %>%
select(season, round, region_seed)
# Break out the region and rename to A, B, C, and D
tc_w_regions <- tc_w_rounds %>%
mutate(region_seed = str_replace_all(region_seed, c(
"W" = "A",
"X" = "B",
"Y" = "C",
"Z" = "D"
))) %>%
mutate(region = str_extract(region_seed, "[ABCD]"))
# Break out the seed and drop region_seed
# We'll add region_seed back in a better format
tc_w_seeds <- tc_w_regions %>%
mutate(seed = str_extract(region_seed, "[01][0-9]")) %>%
mutate(seed = as.integer(seed)) %>%
select(-region_seed) %>%
mutate(region_seed = paste0(region, "-", seed))
# Filter based on rounds selected
if(rounds == "all") {
tc_w_seeds_filtered <- tc_w_seeds
} else if(rounds == "1-4") {
tc_w_seeds_filtered <- tc_w_seeds %>%
filter(round <= 4)
} else if(rounds == "5-6") {
tc_w_seeds_filtered <- tc_w_seeds %>%
filter(round >= 5)
} else {
print("Please include the rounds parameter of all, 1-4, or 5-6")
}
tc_w_seeds_list[[l]] <- tc_w_seeds_filtered
}
tc_w_seeds_df <- do.call(rbind, tc_w_seeds_list)
if(return_type == "df") {
return(tc_w_seeds_df)
} else if(return_type == "list") {
return(tc_w_seeds_list)
} else {
print("Please specify a return type of df or list")
}
}
After many attempts to work with this data in dataframes, I decided that lists were better suited to it.
Note: I probably worked more with lists and functions in this experiment than in all of my previous analysis combined.
rounds_1_4 <- get_all_results(return_type = "list", rounds = "1-4")
Each record only shows the winning region_seeds in each round. This will be used later to score predictions vs actuals.
rounds_1_4[[1]]
## # A tibble: 60 x 5
## season round region seed region_seed
## <int> <dbl> <chr> <int> <chr>
## 1 1985 1. B 9 B-9
## 2 1985 1. D 11 D-11
## 3 1985 1. A 1 A-1
## 4 1985 1. C 9 C-9
## 5 1985 1. D 3 D-3
## 6 1985 1. B 12 B-12
## 7 1985 1. C 5 C-5
## 8 1985 1. A 4 A-4
## 9 1985 1. D 2 D-2
## 10 1985 1. D 7 D-7
## # ... with 50 more rows
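Each season's element should hold 60 winners for rounds 1-4 (32 + 16 + 8 + 4); a quick check (my addition):
map_int(rounds_1_4, nrow)   # 60 rows for each season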
This function gets the actual results of a single season’s tournament.
# Put actuals for a year into same format as predictions
get_reference_season <- function(my_season = 1985) {
actual_1_4 <- get_all_results(return_type = "df", rounds = "1-4")
actual_5_6 <- get_all_results(return_type = "df", rounds = "5-6")
final_results <- rbind(actual_1_4, actual_5_6)
this_season <- final_results %>%
filter(season == my_season)
rounds <- list()
for (i in 1:6) {
rounds[[i+1]] <- this_season %>%
filter(round == i) %>%
select(region_seed) %>%
arrange(region_seed) %>%
unlist()
}
rounds[[1]] <- unique(this_season$season)
for(j in 1:7) {
attributes(rounds[[j]]) <- NULL
}
names(rounds) <- c("id", "round_1", "round_2", "round_3",
"round_4", "round_5", "round_6")
return(rounds)
}
In this case I intentionally chose 2007 because a #1 ranked team won the whole tournament. This will create some score variance in the smart-naive scores later.
reference_season <- get_reference_season(my_season = 2007)
reference_season
## $id
## [1] 2007
##
## $round_1
## [1] "A-1" "A-2" "A-3" "A-4" "A-5" "A-6" "A-7" "A-9" "B-1" "B-2"
## [11] "B-3" "B-4" "B-5" "B-6" "B-7" "B-9" "C-1" "C-11" "C-2" "C-3"
## [21] "C-4" "C-5" "C-7" "C-9" "D-1" "D-11" "D-2" "D-3" "D-4" "D-5"
## [31] "D-7" "D-8"
##
## $round_2
## [1] "A-1" "A-2" "A-5" "A-6" "B-1" "B-2" "B-3" "B-5" "C-1" "C-3" "C-5"
## [12] "C-7" "D-1" "D-2" "D-3" "D-4"
##
## $round_3
## [1] "A-1" "A-2" "B-1" "B-2" "C-1" "C-3" "D-1" "D-2"
##
## $round_4
## [1] "A-2" "B-1" "C-1" "D-2"
##
## $round_5
## [1] "B-1" "C-1"
##
## $round_6
## [1] "C-1"
This function defaults to creating 1 entry, but any number can be specified.
It is very verbose, sometimes intentionally and sometimes not.
Winners are randomized at every step of the way.
Results are expected to be very poor.
We’ll start using randomization at this point so I’ll set the seed for reproducibility.
set.seed(317)
## Function code
create_pure_naive_entries <- function(num_entries = 1) {
final_entries <- list()
#=============================
# ROUND 1 SETUP
#============================
# 8 games with team 1 playing team 2, pairwise
for (n in 1:num_entries) {
final_df <- tibble()
# These seeds always play each other, pairwise, in the first round.
team_1_seed <- c(1, 8, 5, 4, 6, 3, 7, 2)
team_2_seed <- c(16, 9, 12, 13, 11, 14, 10, 15)
# Sequence through regions and generate random winners
# Generate round 1 winners
round_results_list <- list()
this_region <- c(A = 1,
B = 2,
C = 3,
D = 4)
for (r in this_region) {
winner <- as.integer()
for (i in 1:8) {
winner[i] <- sample(c(team_1_seed[i], team_2_seed[i]), 1)
#print(winner[i])
}
round_results_list[[r]] <- tibble(
round = 1,
region = names(this_region[r]),
seed = winner
)
}
round_df <- do.call(rbind, round_results_list)
final_df <- round_df
#========================
# SETUP ROUND 2
#========================
# In order to traverse the bracket correctly I determine team_1 and team_2 this way...
region_round_list <- list()
round_results_list <- list()
for (r in this_region) {
region_round_list[[r]] <- round_df %>%
filter(region == names(this_region[r])) %>%
select(seed) %>%
unlist()
# Get every other record and assign to team 1 and team 2
team_1_seed <- region_round_list[[r]][c(1, 3, 5, 7)]
team_2_seed <- region_round_list[[r]][c(2, 4, 6, 8)]
winner <- as.integer()
for (i in 1:4) {
winner[i] <- sample(c(team_1_seed[i], team_2_seed[i]), 1)
}
#this_round <- rep(2, 4)
round_results_list[[r]] <- tibble(
round = 2,
region = names(this_region[r]),
seed = winner
)
}
round_df <- do.call(rbind, round_results_list)
final_df <- rbind(final_df, round_df)
#=====================================
# SETUP ROUND 3
#====================================
region_round_list <- list()
round_results_list <- list()
for (r in this_region) {
region_round_list[[r]] <- round_df %>%
filter(region == names(this_region[r])) %>%
select(seed) %>%
unlist()
team_1_seed <- region_round_list[[r]][c(1, 3)]
team_2_seed <- region_round_list[[r]][c(2, 4)]
winner <- as.integer()
for (i in 1:2) {
winner[i] <- sample(c(team_1_seed[i], team_2_seed[i]), 1)
}
#this_round <- rep(3, 2)
round_results_list[[r]] <- tibble(
round = 3,
region = names(this_region[r]),
seed = winner
)
}
round_df <- do.call(rbind, round_results_list)
final_df <- rbind(final_df, round_df)
#===============================
# SETUP ROUND 4
#===============================
# Determine regional champions
#this_round <- 4
region_round_list <- list()
round_results_list <- list()
for (r in this_region) {
region_round_list[[r]] <- round_df %>%
filter(region == names(this_region[r])) %>%
select(seed) %>%
unlist()
team_1_seed <- region_round_list[[r]][1]
team_2_seed <- region_round_list[[r]][2]
winner <- as.integer()
winner[r] <- sample(c(team_1_seed, team_2_seed), 1)
round_results_list[[r]] <- tibble(
round = 4,
region = names(this_region[r]),
seed = winner[r]
)
}
round_df <- do.call(rbind, round_results_list)
final_df <- rbind(final_df, round_df)
#=============================================
# SETUP ROUND 5
#=============================================
# Round 5 has the champions of region A & B playing each other
# and the champions of C & D playing each other.
# We need to append the region to the seed now.
round_results_list <- list()
final_df_2 <- final_df %>%
mutate(region_seed = paste0(region, "-", seed))
ab_game <- final_df_2 %>%
filter(round == 4) %>%
filter(region %in% c("A", "B")) %>%
select(region_seed) %>%
sample_n(1) %>%
unlist()
this_seed <- str_replace(ab_game, "^[ABCD]-", "")
round_results_list[[1]] <- tibble(
round = 5,
region = "AB",
seed = as.integer(str_replace(ab_game, "^[ABCD]-", "")),
region_seed = ab_game
)
cd_game <- final_df_2 %>%
filter(round == 4) %>%
filter(region %in% c("C", "D")) %>%
select(region_seed) %>%
sample_n(1) %>%
unlist()
round_results_list[[2]] <- tibble(
round = 5,
region = "CD",
seed = as.integer(str_replace(cd_game, "^[ABCD]-", "")),
region_seed = cd_game
)
round_df <- do.call(rbind, round_results_list)
final_df_2 <- rbind(final_df_2, round_df)
#===========================
# SETUP ROUND 6
#===========================
round_results_list <- list()
abcd_game <- final_df_2 %>%
filter(round == 5) %>%
select(region_seed) %>%
sample_n(1) %>%
unlist()
round_results_list[[1]] <- tibble(
round = 6,
region = "ABCD",
seed = as.integer(str_replace(abcd_game, "^[ABCD]-", "")),
region_seed = abcd_game
)
round_df <- do.call(rbind, round_results_list)
final_df_2 <- rbind(final_df_2, round_df)
#==================================
# CONVERT TO FORMAT USED BY SCORER
#==================================
# entry_num = n
#final_entry <- list()
round_results_list <- list()
for (i in 1:6) {
round_results_list[[i]] <- final_df_2 %>%
filter(round == i) %>%
select(region_seed) %>%
unlist()
#print(round_results_list[[i]])
}
for (j in 1:6) {
attributes(round_results_list[[j]]) <- NULL
}
final_entries[[n]] <- list(
id = n,
round_1 = round_results_list[[1]],
round_2 = round_results_list[[2]],
round_3 = round_results_list[[3]],
round_4 = round_results_list[[4]],
round_5 = round_results_list[[5]],
round_6 = round_results_list[[6]]
)
}
return(final_entries)
}
my_pure_naive_entries <- create_pure_naive_entries(1000)
This is what the first entry looks like.
my_pure_naive_entries[[1]]
## $id
## [1] 1
##
## $round_1
## [1] "A-1" "A-8" "A-12" "A-13" "A-11" "A-3" "A-7" "A-15" "B-1" "B-8"
## [11] "B-12" "B-13" "B-11" "B-14" "B-10" "B-15" "C-16" "C-8" "C-5" "C-13"
## [21] "C-11" "C-14" "C-10" "C-2" "D-1" "D-9" "D-12" "D-4" "D-11" "D-14"
## [31] "D-10" "D-2"
##
## $round_2
## [1] "A-8" "A-13" "A-3" "A-7" "B-8" "B-12" "B-14" "B-10" "C-8" "C-13"
## [11] "C-11" "C-2" "D-1" "D-12" "D-11" "D-2"
##
## $round_3
## [1] "A-13" "A-7" "B-12" "B-14" "C-13" "C-2" "D-12" "D-11"
##
## $round_4
## [1] "A-13" "B-14" "C-2" "D-12"
##
## $round_5
## [1] "B-14" "D-12"
##
## $round_6
## [1] "B-14"
In this case the id is the entry number rather than the reference year we saw in the previous function.
The rest of the format is exactly the same as the reference-year format: a list of winning predictions/actuals for each round.
Later, the scoring engine will count the elements in the intersection of each predicted round and the corresponding actual round and assign points accordingly.
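A minimal sketch of that scoring idea, using hypothetical picks rather than data from the analysis:
predicted <- c("A-1", "A-8", "B-1", "C-5")   # hypothetical round-1 picks
actual    <- c("A-1", "A-2", "B-1", "C-9")   # hypothetical round-1 winners
length(intersect(predicted, actual)) * 1     # 2 correct picks x 1 point = 2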
The scoring engine assigns points based on the number of correct selections in each round.
Almost everyone uses the same scoring system in their game:
Round 1 win = 1 point
Round 2 win = 2 points
Round 3 win = 4 points
Round 4 win = 8 points
Round 5 win = 16 points
Round 6 win = 32 points
I accomplish this by counting the number of elements in each round that are in the corresponding reference season and applying a multiplier. AUTHOR’S NOTE: In high school I thought I would never use Set Theory in real life, and here I am ;-)
The inputs are:
* Entry list
* Reference Season results
score_entries <- function(my_entries, reference_season) {
entry_results_list <- list()
for (i in 1:length(my_entries)) {
entry_results_list[[i]] <- tibble(
entry_num = i,
rnd_1_pts = length(intersect(
my_entries[[i]]$round_1, reference_season$round_1
)),
rnd_2_pts = length(intersect(
my_entries[[i]]$round_2, reference_season$round_2
)) * 2,
rnd_3_pts = length(intersect(
my_entries[[i]]$round_3, reference_season$round_3
)) * 4,
rnd_4_pts = length(intersect(
my_entries[[i]]$round_4, reference_season$round_4
)) * 8,
rnd_5_pts = length(intersect(
my_entries[[i]]$round_5, reference_season$round_5
)) * 16,
rnd_6_pts = length(intersect(
my_entries[[i]]$round_6, reference_season$round_6
)) * 32,
total_points = sum(
rnd_1_pts,
rnd_2_pts,
rnd_3_pts,
rnd_4_pts,
rnd_5_pts,
rnd_6_pts
))
}
# Combine the per-entry rows once, after the loop (same result, less rework)
entry_results_final <- do.call(rbind, entry_results_list)
return(entry_results_final)
}
my_pure_naive_results <- score_entries(my_pure_naive_entries, reference_season)
summary(my_pure_naive_results$total_points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 23.00 28.00 30.63 36.00 103.00
sd(my_pure_naive_results$total_points)
## [1] 11.82051
The vertical line represents the mean.
model_type <- c("pure-naive", "smart_naive", "historic")
ggplot(my_pure_naive_results, aes(x=total_points, fill = model_type[1], color = model_type[1])) +
geom_density(aes(alpha = 0.5)) +
geom_vline(aes(xintercept = mean(my_pure_naive_results$total_points),
color = model_type[1])) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "none")
A definite right skew, similar to a Poisson distribution.
In order to understand this method I want to re-acquaint you with the brackets that the game participant must fill out in full before the tournament starts.
2018 NCAA March Madness starting brackets
Remember that the Smart-Naive person has no knowledge of US college basketball but uses the information provided to make selections.
In this case they notice that in each region someone has gone to the trouble of ranking the teams from 1 to 16. That person probably knows a lot more about college basketball than they do. They also understand that the #1 ranked team has the highest perceived strength and the #16 team the lowest.
They reason that they should always pick the higher-ranked team in each game, which causes the first four rounds to always play out the same way.
Round 1 winners: #1-8
Round 2 winners: #1-4
Round 3 winners: #1-2
Round 4 winner: #1
Since the Smart-Naive person has no additional information to work with in round 5, they randomly select the winner of the region A vs. B game and the region C vs. D game. They also randomly select the round 6 winner.
create_smart_naive_entries <- function(num_entries = 1) {
# Every region in rounds 1-4 is the same since the smart-naive person always
# selects the lowest seed number, because that seed is assumed to have the most strength
# Setup rounds 1 - 4
sn_round_1 <- c(paste0("A-", 1:8),
paste0("B-", 1:8),
paste0("C-", 1:8),
paste0("D-", 1:8))
sn_round_2 <- c(paste0("A-", 1:4),
paste0("B-", 1:4),
paste0("C-", 1:4),
paste0("D-", 1:4))
sn_round_3 <- c(paste0("A-", 1:2),
paste0("B-", 1:2),
paste0("C-", 1:2),
paste0("D-", 1:2))
sn_round_4 <- c(paste0("A-", 1),
paste0("B-", 1),
paste0("C-", 1),
paste0("D-", 1))
sn_list <- list()
for (n in 1:num_entries) {
sn_list[[n]] <- list(
id = n,
round_1 = sn_round_1,
round_2 = sn_round_2,
round_3 = sn_round_3,
round_4 = sn_round_4
)
# Randomize round 5; since they are both seed #1
sn_round_5_ab <- sample(c("A-1", "B-1"), 1)
sn_round_5_cd <- sample(c("C-1", "D-1"), 1)
sn_list[[n]]$round_5 <- c(sn_round_5_ab, sn_round_5_cd)
# Randomize round 6; since they are both seed #1
sn_list[[n]]$round_6 <-
sample(c(sn_round_5_ab, sn_round_5_cd), 1)
}
return(sn_list)
}
my_smart_naive_entries <- create_smart_naive_entries(1000)
my_smart_naive_results <- score_entries(my_smart_naive_entries, reference_season)
summary(my_smart_naive_results$total_points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 93.0 93.0 109.0 116.9 141.0 157.0
sd(my_smart_naive_results$total_points)
## [1] 21.18821
These are pretty incredible results, which indicates that a lot of the top seeds won in the later rounds of this season. This season may be an outlier; we’ll see when I compare across all seasons.
Remember that the mean score of the Pure-Naive entries was 30.63.
#model_type <- c("pure-naive", "smart_naive", "historic")
ggplot(my_smart_naive_results, aes(x=total_points, fill = model_type[2], color = model_type[2])) +
geom_density(aes(alpha = 0.5)) +
geom_vline(aes(xintercept = mean(my_smart_naive_results$total_points),
color = model_type[2])) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "none")
Interesting that there are five distinct humps. Every Smart-Naive entry shares the same rounds 1-4, so entries differ only in their two round-5 picks and their round-6 pick, which leaves only five possible totals for this season.
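A quick check of that claim (my addition): all Smart-Naive entries score the same 93 points in rounds 1-4 for 2007 (the Min in the summary above), so enumerating the eight possible round 5/6 pick combinations against the 2007 reference shows exactly five possible totals.
base_pts <- 93   # rounds 1-4 score shared by every smart-naive entry in 2007
totals <- c()
for (ab in c("A-1", "B-1")) {
  for (cd in c("C-1", "D-1")) {
    for (champ in c(ab, cd)) {
      pts_5 <- 16 * length(intersect(c(ab, cd), reference_season$round_5))
      pts_6 <- 32 * length(intersect(champ, reference_season$round_6))
      totals <- c(totals, base_pts + pts_5 + pts_6)
    }
  }
}
sort(unique(totals))   # 93 109 125 141 157 -- one score per hump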
This approach is based on the theory that past performance can be used to predict future results.
Note: This should not be considered machine learning since there is no mechanism for the algorithm to get better.
The function below is very long. Its major pieces are:
* build a library of historic regional results (rounds 1-4) for every season and region,
* sample four of those historic regional brackets (with replacement) and relabel them A-D to form rounds 1-4 of an entry, and
* pick the round 5 and round 6 winners by weighting each regional champion by how often its seed has historically won at that stage (a simplified sketch of this weighting follows, before the full function).
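Here is that simplified sketch of the round 5/6 weighting, with hypothetical win counts (x_wins and y_wins are made-up numbers, not taken from the data):
x_wins <- 10   # hypothetical: times seed X has won at this stage historically
y_wins <- 5    # hypothetical: times seed Y has won at this stage
# Pick X with probability proportional to its share of historical wins
if (runif(1) <= x_wins / (x_wins + y_wins)) "X" else "Y"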
create_historic_entries <- function(num_entries = 1) {
all_results <- get_all_results(return_type = "df", rounds = "1-4")
bracket_list <- list()
master_list <- list()
season_list <- list()
# Create incrementer for master_list
j <- 1
for (this_season in unique(all_results$season)) {
season_results <- all_results %>%
filter(season == this_season)
for (i in 1:4) {
if (i == 1) {
this_region = "A"
} else if (i == 2) {
this_region = "B"
} else if (i == 3) {
this_region = "C"
} else if (i == 4) {
this_region = "D"
}
region_results <- season_results %>%
filter(region == this_region)
round_list <- list()
id <- all_results %>%
filter(season == this_season) %>%
filter(region == this_region) %>%
mutate(id = paste0(season, "-", this_region)) %>%
select(id) %>%
unique() %>%
unlist()
round_list[[1]] <- all_results %>%
filter(season == this_season) %>%
filter(round == 1) %>%
filter(region == this_region) %>%
select(seed) %>%
arrange(seed)
round_list[[2]] <- all_results %>%
filter(season == this_season) %>%
filter(round == 2) %>%
filter(region == this_region) %>%
select(seed) %>%
arrange(seed)
round_list[[3]] <- all_results %>%
filter(season == this_season) %>%
filter(round == 3) %>%
filter(region == this_region) %>%
select(seed) %>%
arrange(seed)
round_list[[4]] <- all_results %>%
filter(season == this_season) %>%
filter(round == 4) %>%
filter(region == this_region) %>%
select(seed) %>%
arrange(seed)
region_list <- list(
id = id,
round_1 = round_list[[1]]$seed,
round_2 = round_list[[2]]$seed,
round_3 = round_list[[3]]$seed,
round_4 = round_list[[4]]$seed
)
master_list[[j]] <- region_list
j = j + 1
}
}
round_5_6 <- get_all_results(return_type = "df", rounds = "5-6")
all_seeds <- tibble(seed = 1:16)
round_5_stats <- round_5_6 %>%
filter(round == 5) %>%
group_by(seed) %>%
summarize(number = n()) %>%
right_join(all_seeds, by = "seed") %>%
replace_na(list(number = 0))
round_6_stats <- round_5_6 %>%
filter(round == 6) %>%
group_by(seed) %>%
summarize(number = n()) %>%
right_join(all_seeds, by = "seed") %>%
replace_na(list(number = 0))
#==============================
# ROUNDS 1-4
#==============================
entry_list <- list()
for (j in 1:num_entries) {
regional_bracket_list <- list()
# Get four historic brackets by sampling the master list
regional_bracket_list <-
sample(master_list, 4, replace = TRUE)
# For each element we need to append the region code: A, B, C, or D
for (i in 1:4) {
if (i == 1) {
this_region = "A"
} else if (i == 2) {
this_region = "B"
} else if (i == 3) {
this_region = "C"
} else if (i == 4) {
this_region = "D"
}
regional_bracket_list[[i]]$round_1 <- paste0(this_region,
"-",
regional_bracket_list[[i]]$round_1)
regional_bracket_list[[i]]$round_2 <- paste0(this_region,
"-",
regional_bracket_list[[i]]$round_2)
regional_bracket_list[[i]]$round_3 <- paste0(this_region,
"-",
regional_bracket_list[[i]]$round_3)
regional_bracket_list[[i]]$round_4 <- paste0(this_region,
"-",
regional_bracket_list[[i]]$round_4)
}
# Condense to a single list for the final entry
entry_num <- j
this_round_1 <- regional_bracket_list %>%
map("round_1") %>%
unlist()
this_round_2 <- regional_bracket_list %>%
map("round_2") %>%
unlist()
this_round_3 <- regional_bracket_list %>%
map("round_3") %>%
unlist()
this_round_4 <- regional_bracket_list %>%
map("round_4") %>%
unlist()
final_entry <- list(
id = entry_num,
round_1 = this_round_1,
round_2 = this_round_2,
round_3 = this_round_3,
round_4 = this_round_4
)
#===============================
# ROUND 5
#===============================
a_winner <- unlist(regional_bracket_list[[1]][5])
b_winner <- unlist(regional_bracket_list[[2]][5])
c_winner <- unlist(regional_bracket_list[[3]][5])
d_winner <- unlist(regional_bracket_list[[4]][5])
a_df <- tibble(
ref_bracket = unlist(regional_bracket_list[[1]][1]),
winner = unlist(regional_bracket_list[[1]][5]),
seed = as.integer(str_replace(winner, "^[ABCD]-", "")),
region = str_extract(winner, "^[ABCD]")
)
b_df <- tibble(
ref_bracket = unlist(regional_bracket_list[[2]][1]),
winner = unlist(regional_bracket_list[[2]][5]),
seed = as.integer(str_replace(winner, "^[ABCD]-", "")),
region = str_extract(winner, "^[ABCD]")
)
c_df <- tibble(
ref_bracket = unlist(regional_bracket_list[[3]][1]),
winner = unlist(regional_bracket_list[[3]][5]),
seed = as.integer(str_replace(winner, "^[ABCD]-", "")),
region = str_extract(winner, "^[ABCD]")
)
d_df <- tibble(
ref_bracket = unlist(regional_bracket_list[[4]][1]),
winner = unlist(regional_bracket_list[[4]][5]),
seed = as.integer(str_replace(winner, "^[ABCD]-", "")),
region = str_extract(winner, "^[ABCD]")
)
ab_round_5 <- rbind(a_df, b_df)
# Does A win; this is binary so if A doesn't win then B wins
# join in round 5 stats
ab_final <- ab_round_5 %>%
left_join(round_5_stats, by = "seed")
if (ab_final$number[1] == ab_final$number[2]) {
ab_winner <- sample(c(ab_final$winner[1], ab_final$winner[2]), 1)
} else if (ab_final$number[1] == 0) {
ab_winner <- ab_final$winner[2]
} else if (ab_final$number[2] == 0) {
ab_winner <- ab_final$winner[1]
} else {
a_num <- ab_final$number[1]
b_num <- ab_final$number[2]
a_pct <- a_num / (a_num + b_num)
if (runif(1) <= a_pct) {
ab_winner <- ab_final$winner[1]
} else {
ab_winner <- ab_final$winner[2]
}
}
ab_winner_record <- ab_final %>%
filter(winner == ab_winner) %>%
select(-number)
# Do the same for C & D
cd_round_5 <- rbind(c_df, d_df)
cd_final <- cd_round_5 %>%
left_join(round_5_stats, by = "seed")
if (cd_final$number[1] == cd_final$number[2]) {
cd_winner <- sample(c(cd_final$winner[1], cd_final$winner[2]), 1)
} else if (cd_final$number[1] == 0) {
cd_winner <- cd_final$winner[2]
} else if (cd_final$number[2] == 0) {
cd_winner <- cd_final$winner[1]
} else {
c_num <- cd_final$number[1]
d_num <- cd_final$number[2]
c_pct <- c_num / (c_num + d_num)
if (runif(1) <= c_pct) {
cd_winner <- cd_final$winner[1]
} else {
cd_winner <- cd_final$winner[2]
}
}
cd_winner_record <- cd_final %>%
filter(winner == cd_winner) %>%
select(-number)
# Append round 5 winners to final entry
final_entry$round_5 <- c(ab_winner_record$winner,
cd_winner_record$winner)
#==================================
# ROUND 6
#==================================
abcd_final <- rbind(ab_winner_record, cd_winner_record) %>%
left_join(round_6_stats, by = "seed")
# Check for special conditions
if (abcd_final$number[1] == abcd_final$number[2]) {
abcd_winner <-
sample(c(abcd_final$winner[1], abcd_final$winner[2]), 1)
} else if (abcd_final$number[1] == 0) {
abcd_winner <- abcd_final$winner[2]
} else if (abcd_final$number[2] == 0) {
abcd_winner <- abcd_final$winner[1]
} else {
ab_num <- abcd_final$number[1]
cd_num <- abcd_final$number[2]
ab_pct <- ab_num / (ab_num + cd_num)
if (runif(1) <= ab_pct) {
abcd_winner <- abcd_final$winner[1]
} else {
abcd_winner <- abcd_final$winner[2]
}
}
abcd_winner_record <- abcd_final %>%
filter(winner == abcd_winner) %>%
select(-number)
# Append round 6 winner to final_entry
final_entry$round_6 <- abcd_winner_record$winner
entry_list[[j]] <- final_entry
}
return(entry_list)
}
my_historic_entries <- create_historic_entries(1000)
my_historic_results <- score_entries(my_historic_entries, reference_season)
summary(my_historic_results$total_points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.00 62.00 77.00 83.08 100.00 168.00
sd(my_historic_results$total_points)
## [1] 27.34004
WOW!! A max of 168 out of 192 total points possible. That is a very high score.
model_type <- c("pure-naive", "smart_naive", "historic")
ggplot(my_historic_results, aes(x=total_points, fill = model_type[3], color = model_type[3])) +
geom_density(aes(alpha = 0.5)) +
geom_vline(aes(xintercept = mean(my_historic_results$total_points),
color = model_type[3])) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "none")
Right-skewed, with a mean of 83.08.
t.test(my_smart_naive_results$total_points, my_historic_results$total_points)
##
## Welch Two Sample t-test
##
## data: my_smart_naive_results$total_points and my_historic_results$total_points
## t = 30.883, df = 1880.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 31.63479 35.92521
## sample estimates:
## mean of x mean of y
## 116.856 83.076
Yes, the difference is significant. The p-value is well below 0.05 so for this year the Smart-Naive approach is significantly better than the Historic approach.
The data includes results since 1985, which gives us 32 seasons (1985-2016) to test whether, on average, there really is a significant difference between the two methods.
all_entries <- list(pure_naive = my_pure_naive_entries,
smart_naive = my_smart_naive_entries,
historic_entries = my_historic_entries)
This function gathers all of the historical results into a single list.
# Create function to build list of the results of every season
create_reference_season_list <- function() {
reference_season_list <- list()
#start_time <- Sys.time()
j <- 1
for(i in 1985:2016) {
#print(paste("Building", i, "season ..."))
reference_season_list[[j]] <- get_reference_season(my_season = i)
#print(paste("Season", i , "complete"))
j <- j + 1
}
#end_time <- Sys.time()
#print(end_time - start_time)
return(reference_season_list)
}
reference_season_list <- create_reference_season_list()
reference_season_list[[1]]
## $id
## [1] 1985
##
## $round_1
## [1] "A-1" "A-2" "A-3" "A-4" "A-5" "A-6" "A-7" "A-8" "B-1" "B-11"
## [11] "B-12" "B-2" "B-3" "B-4" "B-7" "B-9" "C-1" "C-11" "C-2" "C-3"
## [21] "C-4" "C-5" "C-7" "C-9" "D-1" "D-11" "D-13" "D-2" "D-3" "D-5"
## [31] "D-7" "D-8"
##
## $round_2
## [1] "A-1" "A-2" "A-3" "A-4" "B-1" "B-12" "B-3" "B-7" "C-1" "C-11"
## [11] "C-2" "C-5" "D-11" "D-2" "D-5" "D-8"
##
## $round_3
## [1] "A-1" "A-2" "B-1" "B-3" "C-1" "C-2" "D-2" "D-8"
##
## $round_4
## [1] "A-1" "B-1" "C-2" "D-8"
##
## $round_5
## [1] "A-1" "D-8"
##
## $round_6
## [1] "D-8"
score_entry_lists <- function(my_entries) {
#start_time <- Sys.time()
#print(paste("Start time is:", start_time))
all_seasons_score_list <- list()
j <- 1
for (i in seq_along(reference_season_list)) {
all_scores_list <- list()
this_season <- reference_season_list[[i]]$id
#start_time <- Sys.time()
#print(paste("Pure-naive for", this_season, "started ..."))
my_pure_naive_scores <-
score_entries(my_entries[[1]], reference_season_list[[i]])
#end_time <- Sys.time()
#print("COMPLETE")
#print(end_time - start_time)
#start_time <- Sys.time()
#print(paste("Smart-naive for", this_season, "started ..."))
my_smart_naive_scores <-
score_entries(my_entries[[2]], reference_season_list[[i]])
# end_time <- Sys.time()
# print("COMPLETE")
# print(end_time - start_time)
#
# start_time <- Sys.time()
# print(paste("Historic for", this_season, "started ..."))
my_historic_scores <-
score_entries(my_entries[[3]], reference_season_list[[i]])
# end_time <- Sys.time()
# print("COMPLETE")
# print(end_time - start_time)
all_scores_list <- list(
pure_naive = my_pure_naive_scores,
smart_naive = my_smart_naive_scores,
historic_entries = my_historic_scores
)
all_scores_list[[1]]$season <- this_season
all_scores_list[[2]]$season <- this_season
all_scores_list[[3]]$season <- this_season
all_scores_list[[1]]$model <- "pure-naive"
all_scores_list[[2]]$model <- "smart-naive"
all_scores_list[[3]]$model <- "historic"
all_seasons_score_list[[j]] <- all_scores_list
j <- j + 1
}
all_seasons_score_final_list <- do.call(rbind, all_seasons_score_list)
all_seasons_score_final_df <- do.call(rbind, all_seasons_score_final_list)
#end_time <- Sys.time()
#print(end_time - start_time)
return(all_seasons_score_final_df)
}
This function takes a long time to run, so I load a saved version of its output (the results won’t cache across knits).
scored_entries <- score_entry_lists(all_entries)
scored_entries <- readRDS("scored_entries_v3.RDS")
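For reference, a sketch of how that cache can be produced (run once after a full, slow scoring pass; this is what the readRDS() call above relies on):
# scored_entries <- score_entry_lists(all_entries)   # slow: 3 models x 1000 entries x 32 seasons
# saveRDS(scored_entries, "scored_entries_v3.RDS")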
This function summarizes the output of the score_entry_lists() function.
summarize_scores <- function(my_scores) {
summarized_scores <- my_scores %>%
group_by(season, model) %>%
summarize(
avg_total_pts = mean(total_points),
avg_round_1 = mean(rnd_1_pts),
avg_round_2 = mean(rnd_2_pts),
avg_round_3 = mean(rnd_3_pts),
avg_round_4 = mean(rnd_4_pts),
avg_round_5 = mean(rnd_5_pts),
avg_round_6 = mean(rnd_6_pts)
) %>%
ungroup()
return(summarized_scores)
}
my_summarized_scores <- summarize_scores(scored_entries)
sample_n(my_summarized_scores, 10)
## # A tibble: 10 x 9
## season model avg_total_pts avg_round_1 avg_round_2 avg_round_3
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2007 smart-naive 117. 27.0 22.0 28.0
## 2 2004 pure-naive 31.2 15.8 7.85 4.06
## 3 1987 pure-naive 31.2 16.0 7.99 4.08
## 4 1986 pure-naive 30.7 16.0 7.88 3.93
## 5 1996 smart-naive 103. 23.0 24.0 24.0
## 6 2005 smart-naive 92.0 24.0 16.0 12.0
## 7 2010 smart-naive 78.0 22.0 16.0 16.0
## 8 1997 smart-naive 94.9 25.0 18.0 20.0
## 9 2009 historic 76.0 21.5 19.0 16.5
## 10 2015 pure-naive 32.2 15.9 8.05 3.97
## # ... with 3 more variables: avg_round_4 <dbl>, avg_round_5 <dbl>,
## # avg_round_6 <dbl>
my_summarized_scores %>%
group_by(model) %>%
summarize(model_avg = mean(avg_total_pts), model_sd = sd(avg_total_pts)) %>%
arrange(-model_avg)
## # A tibble: 3 x 3
## model model_avg model_sd
## <chr> <dbl> <dbl>
## 1 smart-naive 87.4 16.5
## 2 historic 66.4 7.69
## 3 pure-naive 31.5 0.597
smart_naive <- my_summarized_scores %>%
filter(model == "smart-naive")
historic <- my_summarized_scores %>%
filter(model == "historic")
t.test(smart_naive$avg_total_pts, historic$avg_total_pts)
##
## Welch Two Sample t-test
##
## data: smart_naive$avg_total_pts and historic$avg_total_pts
## t = 6.532, df = 43.817, p-value = 5.667e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 14.54988 27.53662
## sample estimates:
## mean of x mean of y
## 87.44275 66.39950
Yes, the difference is significant. The p-value again is far below 0.05 so we know that smart-naive is significantly better than historic year-on-year.
ggplot(scored_entries, aes(x=total_points, fill = model, color = model)) +
geom_density(aes(alpha = 0.5)) +
scale_color_brewer(palette = "Set1")
This chart clearly shows the performance of each model. Each model is significantly different from the others.
scored_entries %>%
arrange(-total_points) %>%
select(season, model, entry_num, total_points) %>%
head(10)
## # A tibble: 10 x 4
## season model entry_num total_points
## <int> <chr> <int> <dbl>
## 1 2007 historic 81 171.
## 2 1993 historic 538 163.
## 3 2008 smart-naive 13 162.
## 4 2008 smart-naive 23 162.
## 5 2008 smart-naive 31 162.
## 6 2008 smart-naive 50 162.
## 7 2008 smart-naive 56 162.
## 8 2008 smart-naive 83 162.
## 9 2008 smart-naive 104 162.
## 10 2008 smart-naive 109 162.
The top 2 entries are Historic and the rest of the top 10 are Smart-Naive entries from 2008. The committee that seeded the teams must have gotten it right in 2008.
There appears to be more upside potential with the Historic selections but on average Historic performs significantly worse than Smart-Naive.
The 2018 tournament just concluded, and the winner was a #1 seed, so Smart-Naive may have an advantage.
The 2018 tournament also had a couple of unusual events that made a perfect bracket nearly impossible:
* In Region A the #1 ranked team lost to the #16 ranked team in the first round. That had never happened before.
* Also in Region A, the #11 ranked team won the region. That has happened only three times in the 32 years (128 regional brackets) of data since 1985; the sketch below checks this against the data.
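A sketch (using the helper defined earlier) to check that claim by counting how often each seed won its region in the 1985-2016 data:
# Regional champions (round-4 winners) by seed across 1985-2016
get_all_results(return_type = "df", rounds = "1-4") %>%
  filter(round == 4) %>%
  count(seed, sort = TRUE)   # seed 11 should appear 3 times, per the bullet above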
Let’s see if smart-naive would win again in 2018.
round_1 <- c("A-16", "A-9", "A-5", "A-13", "A-11", "A-3", "A-7", "A-2",
"B-1", "B-9", "B-5", "B-4", "B-6", "B-3", "B-7", "B-2",
"C-1", "C-9", "C-5", "C-13", "C-6", "C-3", "C-10", "C-2",
"D-1", "D-8", "D-5", "D-4", "D-11", "D-3", "D-7", "D-2")
round_2 <- c("A-9", "A-5", "A-11", "A-7",
"B-9", "B-4", "B-3", "B-7",
"C-1", "C-5", "C-3", "C-2",
"D-1", "D-5", "D-11", "D-2")
round_3 <- c("A-9", "A-11",
"B-9", "B-3",
"C-1", "C-3",
"D-1", "D-2")
round_4 <- c("A-11",
"B-3",
"C-1",
"D-1")
round_5 <- c("B-3", "C-1")
round_6 <- "C-1"
reference_year <- list(id = 2018,
round_1 = round_1,
round_2 = round_2,
round_3 = round_3,
round_4 = round_4,
round_5 = round_5,
round_6 = round_6)
# Score each strategy's existing entries against the 2018 reference bracket built above
results_2018_pure_naive <- score_entries(my_pure_naive_entries, reference_year) %>%
mutate(model = "pure-naive")
results_2018_smart_naive <- score_entries(my_smart_naive_entries, reference_year) %>%
mutate(model = "smart-naive")
results_2018_historic <- score_entries(my_historic_entries, reference_year) %>%
mutate(model = "historic")
all_2018_results <- bind_rows(results_2018_pure_naive,
results_2018_smart_naive,
results_2018_historic)
all_2018_results %>%
group_by(model) %>%
summarize(model_avg = mean(total_points), model_sd = sd(total_points)) %>%
arrange(-model_avg)
## # A tibble: 3 x 3
## model model_avg model_sd
## <chr> <dbl> <dbl>
## 1 smart-naive 117. 21.2
## 2 historic 83.1 27.3
## 3 pure-naive 30.6 11.8
Again, Smart-Naive won by a large margin.
smart_naive <- all_2018_results %>%
filter(model == "smart-naive")
historic <- all_2018_results %>%
filter(model == "historic")
t.test(smart_naive$total_points, historic$total_points)
##
## Welch Two Sample t-test
##
## data: smart_naive$total_points and historic$total_points
## t = 30.883, df = 1880.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 31.63479 35.92521
## sample estimates:
## mean of x mean of y
## 116.856 83.076
Yes, the difference is significant. The p-value again is far below 0.05 so we know that smart-naive is significantly better than historic for 2018.
all_2018_results %>%
select(model, entry_num, total_points) %>%
arrange(-total_points) %>%
top_n(10)
## Selecting by total_points
## # A tibble: 126 x 3
## model entry_num total_points
## <chr> <int> <dbl>
## 1 historic 227 168.
## 2 historic 528 166.
## 3 historic 212 161.
## 4 historic 503 161.
## 5 historic 589 158.
## 6 historic 858 158.
## 7 smart-naive 1 157.
## 8 smart-naive 3 157.
## 9 smart-naive 4 157.
## 10 smart-naive 15 157.
## # ... with 116 more rows
Yes, Historic again produced the top individual entries in 2018.
ggplot(all_2018_results, aes(x=total_points, fill = model, color = model)) +
geom_density(aes(alpha = 0.5)) +
scale_color_brewer(palette = "Set1")
Again, we see distinct humps for Smart-Naive, which indicates that a #1 seed from one of the regions won the tournament.
Next year, when someone puts a tournament sheet in front of you, just pick all of the higher-ranked teams; you’ll save a lot of time and may even finish in 1st place, tied with everyone else using the Smart-Naive strategy.