One shortcoming of the Bradley-Terry team ratings (which were fit by logistic regression on wins and losses alone) is that they entirely ignore margins of victory. In what follows, we'll build a model that rates teams based on point differentials and also predicts point differentials. We can then use the residuals of this model to turn these predicted point differentials into projected win probabilities.
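Before building anything, here is that conversion step in miniature, with made-up numbers: if we predict a team to win by 4 points and our margin predictions carry a 10-point standard error, the win probability is just the chance that the realized margin lands above zero.
pnorm(4, mean=0, sd=10)  # a 4-point favorite with 10 points of error wins ~66% of the time
## [1] 0.6554217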
# point to the shared package libraries on the course server
.libPaths(c("/home/rstudioshared", "/home/rstudioshared/packages", "/home/rstudioshared/shared_files/packages"))
library(data.table); library(dplyr); library(reshape)
load('/home/rstudioshared/shared_files/data/march_madness/MMdata.RData')

# recode the winner's location (H/A/N) as a home-court indicator from the
# winning team's perspective: 1 = home, -1 = away, 0 = neutral
RegularSeasonCompactResults$home <- RegularSeasonCompactResults$Wloc
levels(RegularSeasonCompactResults$home) <- list("-1"="A", "1"="H", "0"="N")
RegularSeasonCompactResults$home <-
  as.numeric(as.character(RegularSeasonCompactResults$home))

library(lme4)

# split each submission Id ("season_team1_team2") into its components;
# tournament games are played on neutral courts, so home = 0
games.to.predict <- cbind(SampleSubmission$Id,
                          colsplit(SampleSubmission$Id, split = "_",
                                   names = c('season', 'team1', 'team2')))
colnames(games.to.predict)[1] <- "Id"
games.to.predict$home <- 0
Our data wrangling looks much like it did for the Bradley-Terry model except that we now also compute the point differential for each game:
season <- 2015

# each game appears twice: once from the winner's perspective (sub1) and
# once from the loser's (sub2), with home and ptdiff flipped accordingly
sub1 <- RegularSeasonCompactResults %>%
  filter(Season==season) %>%
  mutate(team1=as.factor(Wteam), team2=as.factor(Lteam), outcome=1, ptdiff=Wscore-Lscore) %>%
  select(team1, team2, home, outcome, ptdiff)
sub2 <- RegularSeasonCompactResults %>%
  filter(Season==season) %>%
  mutate(team1=as.factor(Lteam), team2=as.factor(Wteam), home=-1*home, outcome=0, ptdiff=Lscore-Wscore) %>%
  select(team1, team2, home, outcome, ptdiff)
reg.results <- rbind(sub1, sub2)
Once again we fit a mixed effects model, but this time we use linear regression to predict the point differential rather than logistic regression to predict the log odds of victory:
m.ptdiff <- lmer(ptdiff ~ home + (1 | team1) + (1 | team2), data = reg.results)

# the team1 random intercepts serve as the team quality ratings
re <- ranef(m.ptdiff)$team1
teamquality <- data.frame(Team_Id = as.numeric(row.names(re)),
                          quality = re[,"(Intercept)"])

# ten highest rated teams
left_join(teamquality %>% top_n(10, quality) %>% arrange(desc(quality)), Teams)
## Team_Id quality Team_Name
## 1 1246 25.47505 Kentucky
## 2 1458 21.68493 Wisconsin
## 3 1112 21.53109 Arizona
## 4 1181 20.80873 Duke
## 5 1437 20.25855 Villanova
## 6 1211 19.57700 Gonzaga
## 7 1438 19.51281 Virginia
## 8 1428 18.31339 Utah
## 9 1314 17.89928 North Carolina
## 10 1326 17.21508 Ohio St
The highest rated team is, again, Kentucky, and here the interpretation of the ratings is straightforward: Kentucky would be expected to beat the average team by 25.5 points on a neutral court, and to beat Wisconsin, the second highest rated team, by \(25.5 - 21.7 = 3.8\) points. But what is Kentucky's chance of winning that game?
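As a quick check, that head-to-head margin can be pulled directly from the teamquality data frame built above (the Team_Ids come from the table):
ky  <- teamquality$quality[teamquality$Team_Id == 1246]  # Kentucky
wis <- teamquality$quality[teamquality$Team_Id == 1458]  # Wisconsin
ky - wis  # about 3.8 points
## [1] 3.79012
To estimate Kentucky's win percentage, we can look at the residuals of our model.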
hist(resid(m.ptdiff))    # residuals look roughly bell-shaped
qqnorm(resid(m.ptdiff))  # and fall close to the normal quantile line
summary(m.ptdiff)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ptdiff ~ home + (1 | team1) + (1 | team2)
## Data: reg.results
##
## REML criterion at convergence: 82113.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.2747 -0.6406 0.0000 0.6406 4.2747
##
## Random effects:
## Groups Name Variance Std.Dev.
## team1 (Intercept) 70.0 8.367
## team2 (Intercept) 70.0 8.367
## Residual 102.6 10.128
## Number of obs: 10708, groups: team1, 351; team2, 351
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.090e-10 6.392e-01 0.00
## home 3.458e+00 1.094e-01 31.59
##
## Correlation of Fixed Effects:
## (Intr)
## home 0.000
The residuals appear to be normally distributed and, based on the model summary, we can expect our point differential predictions to have a root mean square error (RMSE) of about 10.1 points (the residual standard deviation). In order for Kentucky to lose to Wisconsin, our prediction would have to be 3.8 points too optimistic; put another way, we'd need a result that is \(\frac{25.5 - 21.7}{10.1} = 0.38\) standard deviations or more below average (from Kentucky's point of view).
pnorm(3.8, mean=0, sd=10.1)
## [1] 0.6466299
Using the pnorm function, we estimate that Kentucky has a 64.7% chance of victory in a game in which they are favored by 3.8 points, assuming our predicted margins have an RMSE of 10.1 points. To make predictions for all of the games in the test set, we fit these ratings separately for each season:
for (season in 2012:2015){
  # stack each season's games from both perspectives, as before
  sub1 <- RegularSeasonCompactResults %>%
    filter(Season==season) %>%
    mutate(team1=as.factor(Wteam), team2=as.factor(Lteam), outcome=1, ptdiff=Wscore-Lscore) %>%
    select(team1, team2, home, outcome, ptdiff)
  sub2 <- RegularSeasonCompactResults %>%
    filter(Season==season) %>%
    mutate(team1=as.factor(Lteam), team2=as.factor(Wteam), home=-1*home, outcome=0, ptdiff=Lscore-Wscore) %>%
    select(team1, team2, home, outcome, ptdiff)
  reg.results <- rbind(sub1, sub2)

  # fit the margin model, predict margins for that season's test games, and
  # convert them to win probabilities using the residual RMSE of 10.1 points
  m.ptdiff <- lmer(ptdiff ~ home + (1 | team1) + (1 | team2), data = reg.results)
  pred.pt.diffs <- predict(m.ptdiff, games.to.predict[games.to.predict$season==season,],
                           type="response")
  games.to.predict[games.to.predict$season==season,"Pred"] <- pnorm(pred.pt.diffs, mean=0, sd=10.1)
}
# submission file for the point differential model on its own
write.csv(games.to.predict %>% select(Id, Pred), 'ptdiff_submission.csv', row.names=FALSE)
We can also tweak the code in several ways. Shown below are three possible tweaks: 1) compressing the point differential in games that got out of hand, so that a 30-point victory is seen as only slightly more impressive than a 20-point victory; 2) fitting a weighted regression in which games closer to the tournament get more weight (with weights of \(.99^{132 - Daynum}\), a game on day 132, the last day of the regular season, gets weight 1, while a game 100 days earlier gets weight \(0.99^{100} \approx 0.37\)); and 3) increasing the assumed error in the point differentials (from 10.1 points to 10.5, in this case), since team qualities may change by tournament time.
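Here is the first tweak in isolation (cap_margin is just an illustrative name; the loop below applies the same expression inline): margins beyond 15 points are compressed onto a square-root scale.
cap_margin <- function(d) ifelse(abs(d) > 15, sign(d)*(15 + sqrt(abs(d) - 15)), d)
cap_margin(c(5, 20, 30))  # a 30-point win barely outranks a 20-point win
## [1]  5.00000 17.23607 18.87298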
for (season in 2012:2015){
  sub1 <- RegularSeasonCompactResults %>%
    filter(Season==season) %>%
    mutate(team1=as.factor(Wteam), team2=as.factor(Lteam), outcome=1, ptdiff=Wscore-Lscore, dw=.99^(132-Daynum)) %>%
    select(team1, team2, home, outcome, ptdiff, dw)
  sub2 <- RegularSeasonCompactResults %>%
    filter(Season==season) %>%
    mutate(team1=as.factor(Lteam), team2=as.factor(Wteam), home=-1*home, outcome=0, ptdiff=Lscore-Wscore, dw=.99^(132-Daynum)) %>%
    select(team1, team2, home, outcome, ptdiff, dw)
  reg.results <- rbind(sub1, sub2)

  # tweak 1: compress margins beyond 15 points
  reg.results <- reg.results %>%
    mutate(ptdiff = ifelse(abs(ptdiff) > 15, sign(ptdiff)*(15 + sqrt(abs(ptdiff) - 15)), ptdiff))

  # tweak 2: weight games closer to the tournament more heavily via dw
  m.ptdiff <- lmer(ptdiff ~ home + (1 | team1) + (1 | team2), data = reg.results, weights = dw)
  pred.pt.diffs <- predict(m.ptdiff, games.to.predict[games.to.predict$season==season,],
                           type="response")

  # tweak 3: assume a larger error (10.5 points) when converting to probabilities
  games.to.predict[games.to.predict$season==season,"point_diff_pred"] <- pnorm(pred.pt.diffs, mean=0, sd=10.5)
}
We can also combine these forecasts with the Bradley-Terry forecasts and the seed benchmark forecasts to see whether a small ensemble of forecasts performs better than any of the individual models:
for (season in 2012:2015){
  sub1 <- RegularSeasonCompactResults %>%
    filter(Season==season) %>%
    mutate(team1=as.factor(Wteam), team2=as.factor(Lteam), outcome=1, ptdiff=Wscore-Lscore) %>%
    select(team1, team2, home, outcome, ptdiff)
  sub2 <- RegularSeasonCompactResults %>%
    filter(Season==season) %>%
    mutate(team1=as.factor(Lteam), team2=as.factor(Wteam), home=-1*home, outcome=0, ptdiff=Lscore-Wscore) %>%
    select(team1, team2, home, outcome, ptdiff)
  reg.results <- rbind(sub1, sub2)

  # refit the Bradley-Terry model (logistic mixed effects) for each season
  m.bt <- glmer(outcome ~ home + (1 | team1) + (1 | team2), data = reg.results, family=binomial())
  games.to.predict[games.to.predict$season==season,"bt_pred"] <-
    predict(m.bt, games.to.predict[games.to.predict$season==season,], type="response")
}
# attach each team's tournament seed (SeedNum.x/.y become team1seed/team2seed)
temp <- left_join(games.to.predict, TourneySeeds, by=c("season"="Season", "team1"="Team"))
games.to.predict <- left_join(temp, TourneySeeds, by=c("season"="Season", "team2"="Team"))
colnames(games.to.predict)[which(names(games.to.predict) %in% c("SeedNum.x", "SeedNum.y"))] <- c("team1seed", "team2seed")
games.to.predict <- games.to.predict %>%
  mutate(team1seed = as.numeric(team1seed), team2seed = as.numeric(team2seed))

# seed benchmark: start at 50% and add 3 percentage points per seed of difference
games.to.predict <- games.to.predict %>% mutate(seed.pred = 0.5 + .03*(team2seed - team1seed))
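For example, a 1 seed facing a 16 seed gets a seed benchmark prediction of \(0.5 + 0.03 \times (16 - 1) = 0.95\), while two equally seeded teams get 0.5.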
Here we’ll give 40% of the weight to the seed benchmark, 30% to the Bradley-Terry model and 30% to the point differential model.
games.to.predict <- games.to.predict %>% mutate(Pred = .4*seed.pred + .3*bt_pred + .3*point_diff_pred)
write.csv(games.to.predict %>% select(Id, Pred), 'aggregate_submission.csv', row.names=FALSE)
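To judge whether the ensemble actually beats its components, we would score each prediction column against the realized tournament results using the competition's log loss metric. A minimal sketch, assuming a vector outcomes of realized results (1 if team1 won, 0 otherwise) aligned row-for-row with games.to.predict:
# log loss: heavily penalizes confident predictions that turn out wrong
log_loss <- function(pred, outcome){
  -mean(outcome*log(pred) + (1 - outcome)*log(1 - pred))
}
# e.g., compare log_loss(games.to.predict$Pred, outcomes) against
# log_loss(games.to.predict$bt_pred, outcomes) and the other component models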