Bradley Terry models in R are nothing new - in fact there are some very good packages already in existence (like BradleyTerry2) that can be used to run these models with a minimal amount of elbow grease. That being said, sometimes it’s fun (and instructive) to take a shot at building a simple model from the ground up. Let’s give it a shot together using simple game score data from the NHL. We’ll use our model to estimate team strength and then forecast a hypothetical matchup. Let’s dive in.
First let’s load some game result data from the 2018-2019 NHL season as a .csv file:
library(knitr)
data <- read.csv('hockeydata.csv')
kable(head(data), caption = "2018-2019 NHL Game Result Data")
| Date | Visitor | AG | Home | HG |
|---|---|---|---|---|
| 2018-10-03 | Anaheim Ducks | 5 | San Jose Sharks | 2 |
| 2018-10-03 | Montreal Canadiens | 2 | Toronto Maple Leafs | 3 |
| 2018-10-03 | Calgary Flames | 2 | Vancouver Canucks | 5 |
| 2018-10-03 | Boston Bruins | 0 | Washington Capitals | 7 |
| 2018-10-04 | Boston Bruins | 4 | Buffalo Sabres | 0 |
| 2018-10-04 | New York Islanders | 2 | Carolina Hurricanes | 1 |
Now let’s make some new variables and clean the variable names up a bit. First we’ll create a margin of victory [h_mov] variable from the home team’s perspective, and then create a binary result variable [h_win] where the home team records a win [1] or a loss [0] based on the positive or negative value of the margin of victory. Then we’ll rename the variables using lowercase letters and select the data that we want for our new cleaned up dataframe.
library(tidyverse)
bt.games <- data %>%
mutate(h_mov = HG - AG, h_win = ifelse(h_mov > 0,1,0)) %>%
select(date = Date,
home = Home,
away = Visitor,
h_score = HG,
a_score = AG,
h_mov,
h_win)
kable(head(bt.games))
| date | home | away | h_score | a_score | h_mov | h_win |
|---|---|---|---|---|---|---|
| 2018-10-03 | San Jose Sharks | Anaheim Ducks | 2 | 5 | -3 | 0 |
| 2018-10-03 | Toronto Maple Leafs | Montreal Canadiens | 3 | 2 | 1 | 1 |
| 2018-10-03 | Vancouver Canucks | Calgary Flames | 5 | 2 | 3 | 1 |
| 2018-10-03 | Washington Capitals | Boston Bruins | 7 | 0 | 7 | 1 |
| 2018-10-04 | Buffalo Sabres | Boston Bruins | 0 | 4 | -4 | 0 |
| 2018-10-04 | Carolina Hurricanes | New York Islanders | 1 | 2 | -1 | 0 |
Next we need to create a model formula. For a Bradley Terry model we’ll want to use a logistic function that combines the home rating, away rating, and home ice advantage. There are other ways to build this, but this method closely follows the Excel version of this model from my book Statistical Sports Models in Excel Volume 1, so I think it will be approachable for those trying to make the leap from Excel to R.
bt.home_forecast <- function(home_rating,
visitor_rating,
homeice_adv){
1/(1+exp(-(homeice_adv + home_rating - visitor_rating))) #Logistic Function
}
Next we create a vector of dummy team ratings and start all teams off with a rating of 1.000. We’ll later optimize the ratings using maximum likelihood estimation, but this gives us a starting point to work from.
teams <- as.vector(sort(unique(bt.games$home)))
bt.ratings <- rep(1.000,32) #Starting Rating is 1.000, 31 teams + hia
names(bt.ratings) <- c(teams, 'hia') #'hia' is our home ice advantage
Now we need to create a log likelihood function which we can use to optimize the team ratings in a way that makes the observed game results most probable. This function compares the estimated win probability to the binary result variable [h_win] and returns the estimated probability if [h_win] is 1, and 1 minus the estimated probability if h_win] is 0. The function then takes the natural logarithm of the resulting probabilities and sums them together to return a log likelihood as [ll.final].
ll <- function(bt.ratings) {
bt.games %>%
mutate(forecast = bt.home_forecast(bt.ratings[home],
bt.ratings[away],
bt.ratings[32])) %>% #32 is the home ice advantage
mutate(result.f = ifelse(h_win==1,forecast, (1-forecast))) %>% # The result function
summarise(ll.final = sum(log(result.f))) %>% #sum of the log likelihood
pull(ll.final)
}
Using Optim in R, we can replicate the non-linear Solver add-on from Excel. We input the team ratings [bt.ratings] as the parameters, the log likelihood function [ll] as the function to optimize, and then make the fnscale equal to -1. This is important to remember, as Optim will attempt to minimize a given function otherwise.
bt.optim <- optim(bt.ratings, ll,
method = "BFGS", #Broyden–Fletcher–Goldfarb–Shanno algorithm - a non-linear optimization
control = list(fnscale=-1)) #[-1] in fnscale necessary for maximization of function
bt.optim.ratings <- bt.optim$par #Isolate team rating estimates
bt.optim.ratings
## Anaheim Ducks Arizona Coyotes Boston Bruins
## 0.6897626 0.8817131 1.4004137
## Buffalo Sabres Calgary Flames Carolina Hurricanes
## 0.6137645 1.4175853 1.2509437
## Chicago Blackhawks Colorado Avalanche Columbus Blue Jackets
## 0.7533807 0.8507915 1.2998748
## Dallas Stars Detroit Red Wings Edmonton Oilers
## 1.0926134 0.5632553 0.6936003
## Florida Panthers Los Angeles Kings Minnesota Wild
## 0.7628435 0.4932208 0.8064469
## Montreal Canadiens Nashville Predators New Jersey Devils
## 1.1523788 1.2887784 0.5146795
## New York Islanders New York Rangers Ottawa Senators
## 1.3499992 0.5672938 0.4107100
## Philadelphia Flyers Pittsburgh Penguins San Jose Sharks
## 0.8143463 1.1538429 1.2197812
## St. Louis Blues Tampa Bay Lightning Toronto Maple Leafs
## 1.1881748 2.1196881 1.2509553
## Vancouver Canucks Vegas Golden Knights Washington Capitals
## 0.6962435 1.0724194 1.3500038
## Winnipeg Jets hia
## 1.2804951 0.1562270
We can also use ggplot to make a quick visual of the estimated team strengths:
library(ggplot2)
plot.ratings <- stack(bt.optim.ratings) %>% #Convert named vector into dataframe with labels
select(team = ind,
rating = values)
plot.ratings$team <- as.character(plot.ratings$team)
plot.ratings %>%
arrange(rating) %>% #Sorting plot by logistic rating strength using dplyr
mutate(team=factor(team, levels=team)) %>%
ggplot( aes(x=rating, y=team)) +
geom_point(aes(size=2, colour=rating))+
labs(y="Team",
x="Logistic Strength Rating")
To forecast a future game we can create a function that combines our freshly derived team strength ratings with the model structure we used earlier. We’ll code the function so that teams are entered as [“home team”, “away team”].
xbt.forecast <- function(home, away){
1/(1+exp(-(bt.optim.ratings['hia'] + bt.optim.ratings[home] - bt.optim.ratings[away])))
}
Entering two teams into our newly created function produces the desired forecast:
xbt <- xbt.forecast("Tampa Bay Lightning", "Anaheim Ducks")
names(xbt) <- NULL #Remove names from the returned probability
round(xbt,4)#Round output probability to 4 decimal points
## [1] 0.8301
Sanity Check for Home Ice Advantage
As a quick sanity check, we can also forecast the same team playing itself to determine if home ice advantage appears reasonable. Typical home ice advantage is ~3-4%. Let’s try that now with Tampa Bay:
xbt <- xbt.forecast("Tampa Bay Lightning", "Tampa Bay Lightning")
names(xbt) <- NULL #Remove names from the returned probability
round(xbt,4) #Round output probability to 4 decimal points
## [1] 0.539