In the late 1970’s Bill James, now a famous baseball writer and historian, was working night shifts a security guard at a pork and beans cannery. He, much like us, was interested in building a model to predict the probability that one team would beat another team given the relative qualities of the two teams.

James reasoned that with:

$$p_{a,b}$$ defined as the probability that Team A will beat team B, $$p_a$$ as Team A’s true winning percentage (the probability that they would beat an average team) and, $$p_b$$ as Team B’s true winning percentage,

the following six statements must be true:

1. $$p_{a,a} = 0.5$$ (note that this rule concerns two teams of equal quality not a team playing itself)
2. $$p_{a,.5} = a$$
3. If $$a > b$$ then $$p_{a,b} > 0.5$$ and if $$a < b$$ then $$p_{a,b} < 0.5$$
4. If $$b < 0.5$$ then $$p_{a,b} > a$$ and if $$b > 0.5$$ then $$p_{a,b} < a$$
5. $$0 \leq p_{a,b} \leq 1$$ and if $$0 < a < 1$$ then $$p_{a,0}=1$$ and $$p_{a,1}=0.$$
6. $$p_{a,b} + p_{b,a} = 1$$

He realized that no linear combination of team qualities would do the trick and produced and published the following formula, dubbed the “log5” formula despite the fact that is has no immediately obvious connection to logarithms:

$p_{a,b} = \frac {p_a - p_a \cdot p_b}{p_a + p_b - 2 \cdot p_a \cdot p_b}$

It turns out that James had rediscovered a probability model that been created by R.A. Bradley and M.E Terry in 1952 who had themselves rediscovered a model published by Ernst Zermelo in 1929. Or, as notable sports statistician and fellow March Machine Learning Competitor octonion puts it, “Everyone Uses the Same Damn Rating System”.

Loath to be left out, let’s use this log5/Bradley-Terry/Zermelo model to predict the outcomes of NCAA tournament games.

# 1. Seeds Based Logistic Regression

Bill James’ formula can be rearranged to give a formula for the odds of Team A emerging victorious:

$\frac{p_{a,b}}{1-p_{a,b}} = \frac {p_a - p_a \cdot p_b}{p_b - p_a \cdot p_b} = \frac {p_a}{1 - p_a} \cdot \frac {1-p_b}{p_b}$

or to give the log odds of Team A winning:

$log(\frac{p_{a,b}}{1-p_{a,b}}) = log(\frac {p_a}{1 - p_a}) - log( \frac {p_b}{1-p_b} )$

What James had found was that while the probability of Team A beating Team B could not be expressed as a linear combination of team qualities, the log odds of Team A’s chance of victory could be.

This suggests that our regression model should predict the log odds of the outcome (a win or a loss) rather than the outcome itself. In statistics lingo, a linear model that predicts a function of the outcome varible is called a “generalized” linear model and the function that you choose is called the “link function”. The log odds function, $$log(\frac{y}{1-y})$$, is known as the logistic link and a regression model that uses the logistic link is known as logistic regression.

Before using a Bradley-Terry model based on estimated team qualities, we can create a simple seeds-based model using logistic regression. To get our data in the proper format we can follow the procedure outlined in the “getting started” script up to the section where we created a simple linear model but this time we will use the glm (for generalized linear model) function in place of the lm function and specify that we would like to use the logistic link.

library(dplyr); library(data.table); library(reshape2)

data.table=FALSE)

#                                     data.table=FALSE)

TourneySeeds <- TourneySeeds %>%
mutate(SeedNum = gsub("[A-Z+a-z]", "", Seed)) %>% select(Season, TeamID, SeedNum)

games.to.predict <- cbind(SampleSubmission$ID, colsplit(SampleSubmission$ID, pattern = "_", names = c('season', 'team1', 'team2')))

temp <- left_join(games.to.predict, TourneySeeds, by=c("season"="Season", "team1"="TeamID"))
games.to.predict <- left_join(temp, TourneySeeds, by=c("season"="Season", "team2"="TeamID"))
colnames(games.to.predict)[c(1,5:6)] <- c("Id", "team1seed", "team2seed")
games.to.predict <- games.to.predict %>% mutate(team1seed = as.numeric(team1seed), team2seed = as.numeric(team2seed))

temp <- left_join(as.data.frame(TourneyCompactResults), TourneySeeds, by=c("Season", "WTeamID"="TeamID"))
compact.results <- left_join(temp, TourneySeeds, by=c("Season", "LTeamID"="TeamID"))

set1 <- compact.results %>% select(SeedNum.x, SeedNum.y) %>% mutate(result=1)
set2 <- compact.results %>% select(SeedNum.y, SeedNum.x) %>% mutate(result=0)
colnames(set1) <- c("team1seed", "team2seed", "team1win")
colnames(set2) <- c("team1seed", "team2seed", "team1win")
full.set <- rbind(set1, set2)
full.set <- full.set %>% mutate(team1seed = as.numeric(team1seed), team2seed = as.numeric(team2seed))

## calculated the difference in seeds

full.set <- full.set %>% mutate(seed_diff = team2seed-team1seed)
games.to.predict <- games.to.predict %>% mutate(seed_diff = team2seed-team1seed)
m.logistic.seed.diff <- glm(team1win~ seed_diff, data=full.set, family=binomial())
coef(m.logistic.seed.diff)
##   (Intercept)     seed_diff
## -1.253236e-16  1.666295e-01

The coefficients of this model (which predicts the log odds of a team 1 win) are hard to interpret. However, if we exponentiate these coefficients (raise $$e$$ to the power of our coefficients) we can see how they affect the odds of succes.

$log(\frac{p}{1-p}) = \beta_0 + \beta_1 \cdot seed\ diff$

becomes:

$\frac{p}{1-p} = e^{\beta_0} \cdot (e^{\beta_1})^{seed\ diff}$

exp(coef(m.logistic.seed.diff))
## (Intercept)   seed_diff
##    1.000000    1.181317

In our case, simply:

$\frac{p}{1-p} = 1.00 \cdot 1.183^{seed\ diff}$

In other words, our model predict that team 1’s odds of victory are 1 when team 2 has the same seed. If team 1 is a 5-seed and team2 is an 8-seed…

$\frac{p}{1-p} = 1.00 \cdot 1.183^{8-5} \approx 1.66$

our model predicts that team 1 is 1.66 times more likely to win, or, put another way, that it has

$\frac{1.66}{1+1.66} \approx 0.62$

a $$62\%$$ chance of victory. We can use this logistic seeds based model to make predictions for the entire test set as follows:

games.to.predict$Pred <- predict(m.logistic.seed.diff, games.to.predict, type="response") write.csv(games.to.predict %>% select(Id, Pred), 'logistic_seed_submission.csv', row.names=FALSE) # 2. A Bradley-Terry Model to Estimate Team Quality We can also use logistic regression along with regular season results to estimate the quality of every team and then use those estimated team qualities to predict tournament results. First, we’ll create a data.frame where the outcome takes on values of 0 or 1 depending on whether team 1 won the game. Each game occupies two rows in this data.frame with each opponnent taking its turn as team1. Using the Wloc variable, we also create a home variable which takes on the values $$-1$$ when a team is playing an away game, $$0$$ for a game played at a neutral site and $$+1$$ for a game played at home. head(RegularSeasonCompactResults) ## Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT ## 1 1985 20 1228 81 1328 64 N 0 ## 2 1985 25 1106 77 1354 70 H 0 ## 3 1985 25 1112 63 1223 56 H 0 ## 4 1985 25 1165 70 1432 54 H 0 ## 5 1985 25 1192 86 1447 74 H 0 ## 6 1985 25 1218 79 1337 78 H 0 RegularSeasonCompactResults <- RegularSeasonCompactResults %>% mutate(home = case_when( WLoc == "N" ~ 0, WLoc == "H" ~ 1, WLoc == "A" ~ -1, TRUE ~ 0 )) season <- 2015 #as an example sub1 <- RegularSeasonCompactResults %>% filter(Season==season) %>% mutate(team1=as.factor(WTeamID), team2=as.factor(LTeamID), outcome=1) %>% select(team1, team2, home, outcome) sub2 <- RegularSeasonCompactResults %>% filter(Season==season) %>% mutate(team1=as.factor(LTeamID), team2=as.factor(WTeamID), home=-1*home, outcome=0) %>% select(team1, team2, home, outcome) reg.results <- rbind(sub1, sub2) Using this data set we could simply predict the outcome (whether team1 won) based on the identities of team1 and team2 and home field advantage. Something like: glm(outcome ~ home + team1 + team2, data = reg.results, family = binomial) This, however, would have the effect of taking every team’s observed quality and the observed qualities of their opponents at face value. In truth, we know that the most successful teams are not quite as good as their results and the least succesful not as bad as theirs. Even if all teams were of equal quality, some teams would have won more games than others. We can better estimate team strengths by “shrinking” teams’ observed qualities towards the mean. We can do this by using a mixed effects model and including team1 and team2 in the model as random effects. library(lme4) mbt <- glmer(outcome ~ home + (1 | team1) + (1 | team2), data = reg.results, family = binomial) exp(coef(mbt)$team1$home) ##  1.937445 Once again our coeffecients are hard to interpret but we can gain insight by exponentiating them. We see that home field advantage nearly doubles a team’s odds of winning and that, according to our model, when two evenly matched teams face off the home team has a $$\frac{1.937}{1+1.937} \approx 0.66 \%$$ chance of victory. We can also look at the top 10 teams from last season, according to our model: re <- ranef(mbt)$team1

teamquality <-  data.frame(TeamID= as.numeric(row.names(re)),quality=exp(re[,"(Intercept)"]))

left_join(teamquality %>% top_n(10, quality) %>% arrange(desc(quality)), Teams %>% select(TeamID, TeamName))
## Joining, by = "TeamID"
##    TeamID   quality      TeamName
## 1    1246 34.334707      Kentucky
## 2    1437 19.603919     Villanova
## 3    1438 17.986260      Virginia
## 4    1181 17.734007          Duke
## 5    1458 16.161561     Wisconsin
## 6    1211 15.434632       Gonzaga
## 7    1112 14.570116       Arizona
## 8    1323 11.298356    Notre Dame
## 9    1320 10.190826 Northern Iowa
## 10   1242  9.410988        Kansas

Kentucky stands out far above the rest with a quality of 34.33, meaning that Kentucky has a $$\frac{34.33}{1+34.33} \approx 97 \%$$ chance of beating an average team and a $$\frac{(34.33/19.6)}{1+(34.33/19.6)} \approx 64\%$$ chance of beating Villanova, the second highest rated team.

Kentucky’s high rating is consistent with betting odds entering the 2015 tournament.

We can use this model to create ratings for each team for each season and use them to predict tournament results as follows:

games.to.predict <- cbind(SampleSubmission$ID, colsplit(SampleSubmission$ID, pattern = "_", names = c('season', 'team1', 'team2')))
colnames(games.to.predict) <- "ID"
games.to.predict$home <- 0 games.to.predict$Pred <- 0.5

for (pred.season in 2014:2018){
print(pred.season)
sub1 <- RegularSeasonCompactResults %>% filter(Season==pred.season) %>% mutate(team1=as.factor(WTeamID), team2=as.factor(LTeamID), outcome=1) %>% select(team1, team2, home, outcome)

sub2 <- RegularSeasonCompactResults %>% filter(Season==pred.season) %>% mutate(team1=as.factor(LTeamID), team2=as.factor(WTeamID), home=-1*home, outcome=0) %>% select(team1, team2, home, outcome)

reg.results <- rbind(sub1, sub2)

mbt <- glmer(outcome ~ home +  (1 | team1) + (1 | team2), data = reg.results, family = binomial)

games.to.predict[games.to.predict$season==pred.season,"Pred"]<- predict(mbt, newdata=games.to.predict[games.to.predict$season==pred.season,], type="response")
}
##  2014
##  2015
##  2016
##  2017
##  2018
write.csv(games.to.predict %>% select(ID, Pred), 'bradleyterry_submission.csv', row.names=FALSE)