In this report, we’ll start with the aggregate submission predictions for the 2018 tournament (which we created in a previous lab) and learn how to alter them in several ways.

library(kaggleNCAA); library(dplyr)
dat <- parseBracket('/home/jcross/MarchMadness/data/aggregate_submission.csv', w=0) 

dat <- dat %>% filter(season == 2018)

head(dat)

Adding Randomness

One way of modifying a set of predictions is to add random noise.

If we add randomness directly to probabilities, we risk forecasting something outside of [0,1]. Instead, we’ll convert our probabilities into log odds, add randomness to the log odds, and then convert back into probabilities.

In the code below, I’ll add random noise centered on 0 with a standard deviation of 0.5 to the log odds. You can change this code to add more or less random noise to your predictions.

dat <- dat %>% mutate(logodds = log(pred/(1-pred)),                          # probability -> log odds
                      logodds_plus_noise = logodds + rnorm(n(), 0, 0.5),      # noise: mean 0, sd 0.5
                      pred_plus_noise = exp(logodds_plus_noise)/(1 + exp(logodds_plus_noise)))  # back to probability

head(dat)

library(ggplot2)
dat %>% ggplot(aes(pred, pred_plus_noise))+geom_point()
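To see why the detour through log odds matters, here is a minimal sketch on three made-up probabilities. The noise values are fixed by hand (standing in for rnorm draws) so the result is repeatable, and the base R functions qlogis() and plogis() are used as shorthand for the log-odds formula and its inverse.

```r
# Toy probabilities (made up for illustration; not taken from the lab data)
p <- c(0.02, 0.50, 0.98)

# Hand-picked noise values standing in for rnorm(length(p), 0, 0.5)
noise <- c(-0.6, 0.1, 0.6)

# Adding noise directly to the probabilities can leave [0,1]
naive <- p + noise

# qlogis(p) is log(p/(1-p)); plogis(x) is exp(x)/(1+exp(x))
via_logodds <- plogis(qlogis(p) + noise)

print(data.frame(p, naive, via_logodds))
```

The naive column contains values below 0 and above 1, while the via_logodds column stays strictly between 0 and 1.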

Picking a Winner

Let’s say that I want to pick Duke as a winner. First, I’ll find Duke’s team number.

teams <- read.csv('/home/jcross/MarchMadness/data/Teams.csv') 
teams %>% filter(TeamName=="Duke")
# or just use View(teams)

Then, whenever Duke is team #1, we’ll change the predicted probability to 1 and whenever Duke is team #2, we’ll change the predicted probability to 0.

dat <- dat %>% mutate(pred_plus_pick = ifelse(teamid_1 == 1181, 1, pred),
                pred_plus_pick = ifelse(teamid_2 == 1181, 0, pred_plus_pick)
                )

dat %>% ggplot(aes(pred, pred_plus_pick))+geom_point()
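If you plan to force several different teams, the two ifelse() steps can be wrapped in a small helper. This is only a sketch: force_winner() is a hypothetical name, written in base R, and the toy rows below are made up (1181 is Duke, as above; the other IDs and probabilities are placeholders).

```r
# Hypothetical helper: force `team` to win every game it appears in.
# pred is the probability that teamid_1 beats teamid_2.
force_winner <- function(pred, teamid_1, teamid_2, team) {
  pred <- ifelse(teamid_1 == team, 1, pred)  # team listed first: certain win
  ifelse(teamid_2 == team, 0, pred)          # team listed second: teamid_1 certain to lose
}

# Made-up match-ups for illustration
toy <- data.frame(teamid_1 = c(1104, 1181, 1104),
                  teamid_2 = c(1181, 1246, 1246),
                  pred     = c(0.40, 0.55, 0.50))

toy$pred_plus_pick <- force_winner(toy$pred, toy$teamid_1, toy$teamid_2, team = 1181)
print(toy)
```

Rows that do not involve team 1181 keep their original prediction.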

Picking Two Winners

Maybe I think that Duke will win all of its games until it faces Kentucky, and then Kentucky will defeat it and win everything from there. I can code this with:

teams %>% filter(TeamName=="Kentucky")

dat <- dat %>% mutate(pred_plus_pick = ifelse(teamid_1 == 1181, 1, pred),
                pred_plus_pick = ifelse(teamid_2 == 1181, 0, pred_plus_pick),
                pred_plus_pick = ifelse(teamid_1 == 1246, 1, pred_plus_pick),
                pred_plus_pick = ifelse(teamid_2 == 1246, 0, pred_plus_pick)
                )

dat %>% ggplot(aes(pred, pred_plus_pick))+geom_point()
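The order of these overrides matters: later lines overwrite earlier ones, so the Kentucky lines must come after the Duke lines for Kentucky to win the head-to-head game. The base R sketch below checks this on made-up rows (1181 is Duke and 1246 is Kentucky, as above; the other IDs and probabilities are placeholders).

```r
# Made-up match-ups; the first row is the Duke vs. Kentucky game
toy <- data.frame(teamid_1 = c(1181, 1104, 1181, 1112, 1104),
                  teamid_2 = c(1246, 1181, 1300, 1246, 1112),
                  pred     = c(0.55, 0.40, 0.80, 0.45, 0.50))

# Duke first, then Kentucky: Kentucky's override is applied last
duke_then_uk <- toy$pred
duke_then_uk <- ifelse(toy$teamid_1 == 1181, 1, duke_then_uk)
duke_then_uk <- ifelse(toy$teamid_2 == 1181, 0, duke_then_uk)
duke_then_uk <- ifelse(toy$teamid_1 == 1246, 1, duke_then_uk)
duke_then_uk <- ifelse(toy$teamid_2 == 1246, 0, duke_then_uk)

# Kentucky first, then Duke: Duke's override is applied last
uk_then_duke <- toy$pred
uk_then_duke <- ifelse(toy$teamid_1 == 1246, 1, uk_then_duke)
uk_then_duke <- ifelse(toy$teamid_2 == 1246, 0, uk_then_duke)
uk_then_duke <- ifelse(toy$teamid_1 == 1181, 1, uk_then_duke)
uk_then_duke <- ifelse(toy$teamid_2 == 1181, 0, uk_then_duke)

print(cbind(toy, duke_then_uk, uk_then_duke))
```

Only the first row differs between the two orderings: it is 0 (Kentucky wins) when Kentucky is applied last and 1 (Duke wins) when Duke is applied last.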

Picking Round One Games

Maybe I want my gambles to come early by picking one or more first-round games. The advantage of gambling on first-round games is that, unlike all other games, I know for certain that these games will take place. To do this, I need to identify which games are in which rounds. I’ll do that by loading the “all slots” data from our data folder.

load('/home/jcross/MarchMadness/data/all_slots.rda') 
head(all_slots)

dat <- left_join(dat, all_slots, by=c("women", "season", "teamid_1", "teamid_2"))

head(dat)

Now I can find all of the close first-round games (removing those where either team had to play in):

dat %>% filter(round==1, teamid_1_playedin == 0, teamid_2_playedin == 0, pred > 0.4, pred < 0.6) %>% arrange(pred)

I’m going to gamble and guess at the outcomes of the two closest match-ups:

dat <- dat %>% mutate(pred_plus_pick = ifelse(teamid_1==1199 & teamid_2 ==1281, 1, pred), #picking 1199 to win
               pred_plus_pick = ifelse(teamid_1==1166 & teamid_2 ==1243, 0, pred_plus_pick) #picking 1243 to win
               )
dat %>% ggplot(aes(pred, pred_plus_pick))+geom_point()

Picking the West Team to Win it All

The 2018 tournament was split into four regions: the South, East, West and Midwest.

In real life, Michigan won the West, and we can find them in our data:

teams %>% filter(TeamName=="Michigan") # we see that they are team number 1276
dat %>% filter(round==1, teamid_1 == 1276 | teamid_2 == 1276)

We see that Michigan was seeded “Z03”. This means that the West was the “Z” region and that Michigan was the 3 seed.

Let’s extract the region from each seed:

dat<- dat %>% 
    mutate(region_1 = gsub("[0-9+a-z]", "", seed_1),
           region_2 = gsub("[0-9+a-z]", "", seed_2)
           ) 
head(dat)
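As a quick check of what that pattern does, here is a small sketch on a few example seed strings. The strings are illustrative (a region letter, a two-digit seed, and sometimes a lowercase play-in suffix); the substr() alternative and the seed_number column are additions for comparison, not part of the lab code.

```r
# Example seed strings (illustrative only)
seeds <- c("Z03", "W16a", "X11b", "Y01")

# The pattern from the lab: delete digits, "+" and lowercase letters
region_gsub <- gsub("[0-9+a-z]", "", seeds)

# Alternative: the region is simply the first character
region_substr <- substr(seeds, 1, 1)

# Numeric seed: delete everything that is not a digit
seed_number <- as.integer(gsub("[^0-9]", "", seeds))

print(data.frame(seeds, region_gsub, region_substr, seed_number))
```

Both approaches give the same region letters for these strings.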

Now, to pick the West winner (whoever it is) I’ll pick the Z region team to win whenever it’s playing a non-Z region team. (Note: Equivalently, we could pick Z region teams to win all round 5 and round 6 games.)

dat <- dat %>% mutate(pred_plus_pick = ifelse(region_1 == "Z" & region_2 != "Z", 1, pred),
               pred_plus_pick = ifelse(region_2 == "Z" & region_1 != "Z", 0, pred_plus_pick)
               )
dat %>% ggplot(aes(pred, pred_plus_pick))+geom_point()

Challenge and Report: Create 10 Brackets

After you’ve altered your bracket you will need to make sure that it’s back in the Kaggle-appropriate format and then write it to a .csv file.

submission_form <- dat %>% mutate(Id = paste(season, teamid_1, teamid_2, sep="_")) %>% select(Id, Pred=pred_plus_pick)

head(submission_form)

write.csv(submission_form, "west_must_win.csv", row.names=FALSE)
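Before handing in a file, it can be worth checking that it still has the expected shape. The sketch below runs a few sanity checks on a made-up two-row submission; the four-digit pattern for the season and team IDs is an assumption based on the IDs seen in this lab.

```r
# Made-up submission rows in the Id / Pred format built above
sub <- data.frame(Id   = c("2018_1181_1246", "2018_1104_1181"),
                  Pred = c(0, 0.4),
                  stringsAsFactors = FALSE)

# Assumed Id format: season_teamid1_teamid2, each made of four digits
ok_id   <- grepl("^[0-9]{4}_[0-9]{4}_[0-9]{4}$", sub$Id)

# Every prediction must be a non-missing probability
ok_pred <- !is.na(sub$Pred) & sub$Pred >= 0 & sub$Pred <= 1

# No match-up should appear twice
no_dups <- anyDuplicated(sub$Id) == 0

stopifnot(all(ok_id), all(ok_pred), no_dups)
```

If any check fails, stopifnot() stops with an error that names the failing condition.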

Challenge:

Create 10 named sets of predictions for the 2018 Men’s Tournament. I will simulate this tournament 1000 times using the probabilities in “aggregate_submission.csv” and score each of your 10 brackets for each of these tournaments.

This is an odd challenge! You know the probabilities of each team winning each game. Your goal is to win as many of the 1000 simulations as possible. In the case of ties, competitors will each get a fraction of a win (for instance, if three submissions tie for first, they will each get one third of a victory).
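The tie rule can be sketched in a few lines of base R. Everything here is an assumption for illustration: fractional_wins() is a hypothetical helper, the scores are made up, and whether lower or higher scores are better depends on the scoring rule actually used, so it is left as an argument.

```r
# Hypothetical helper: rows are simulated tournaments, columns are submissions.
# Returns each submission's total wins, splitting tied wins equally.
fractional_wins <- function(scores, lower_is_better = TRUE) {
  if (!lower_is_better) scores <- -scores
  best    <- apply(scores, 1, min)      # best score in each simulation
  is_best <- scores == best             # TRUE where a submission ties for best
  share   <- is_best / rowSums(is_best) # split each win among the tied submissions
  colSums(share)
}

# Made-up scores: a three-way tie in simulation 1, "b" alone wins simulation 2
scores <- matrix(c(0.5, 0.5, 0.5,
                   0.7, 0.4, 0.6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("a", "b", "c")))

wins <- fractional_wins(scores)
print(wins)
```

In this toy case each submission gets one third of a win from the first simulation, and "b" gets one more full win from the second, so the wins always sum to the number of simulations.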

You also need to write a report with a (brief) description of each of your 10 sets of predictions so that we can learn from our class results. Your report should also include a description (again, this can be brief) of your strategy.