Sports Data Science
James reasoned that with \( p_{a,b} \) defined as the probability that Team A will beat Team B, \( a \) as Team A's true winning percentage (the probability that they would beat an average team), and \( b \) as Team B's true winning percentage, the following six statements must be true:
\( p_{a,a} = 0.5 \) (note that this rule concerns two teams of equal quality, not a team playing itself)
\( p_{a,0.5} = a \)
If \( a > b \) then \( p_{a,b} > 0.5 \) and if \( a < b \) then \( p_{a,b} < 0.5 \)
If \( b < 0.5 \) then \( p_{a,b} > a \) and if \( b > 0.5 \) then \( p_{a,b} < a \)
\( 0 \leq p_{a,b} \leq 1 \) and if \( 0 < a < 1 \) then \( p_{a,0}=1 \) and \( p_{a,1}=0. \)
\( p_{a,b} + p_{b,a} = 1 \)
He realized that no linear combination of team qualities, i.e. no formula of the form
\[ p_{a,b} = \beta_0 + \beta_1 \cdot a + \beta_2 \cdot b, \]
would do the trick, and instead produced and published the following formula, dubbed the “log5” formula despite the fact that it has no immediately obvious connection to logarithms:
\[ p_{a,b} = \frac{a - a \cdot b}{a + b - 2 \cdot a \cdot b} \]
For example, suppose Team A's true winning percentage is \( a = 0.7 \) and Team B's is \( b = 0.4 \):
\[ p_{a,b} = \frac{a - a \cdot b}{a + b - 2 \cdot a \cdot b} \]
\[ p_{a,b} = \frac {0.7 - 0.7 \cdot 0.4}{0.7 + 0.4 - 2 \cdot 0.7 \cdot 0.4} \]
\[ p_{a,b} = \frac {0.7 - 0.28}{1.10 - 0.56} = \frac {0.42}{0.54} \]
\[ p_{a,b} = \frac{7}{9} = 0.\bar{7} \]
log5 <- function(a, b){
  # Bill James' log5 formula: probability that a team with true winning
  # percentage a beats a team with true winning percentage b
  (a - a*b)/(a + b - 2*a*b)
}
log5(0.7, 0.4)
[1] 0.7777778
log5(0.7, 0.5)
[1] 0.7
log5(0.7, 1)
[1] 0
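The remaining rules are just as easy to spot-check numerically; here is a quick sketch (not part of the original code) using the same function:

log5(0.6, 0.6)                     # two equally good teams -> 0.5
[1] 0.5
log5(0.7, 0.4) + log5(0.4, 0.7)    # the two complementary probabilities sum to 1
[1] 1
log5(0.7, 0.4) > 0.5               # a > b implies p > 0.5
[1] TRUE
log5(0.7, 0.4) > 0.7               # b < 0.5 implies p > a
[1] TRUE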
It turns out that James had rediscovered a probability model that had been created by R. A. Bradley and M. E. Terry in 1952, who had themselves rediscovered a model published by Ernst Zermelo in 1929.
Bill James' formula can be rearranged to give a formula for the odds of Team A emerging victorious:
\[ \frac{p_{a,b}}{1-p_{a,b}} = \frac{a - a \cdot b}{b - a \cdot b} = \frac{a}{1 - a} \cdot \frac{1-b}{b} \]
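As a quick numerical sanity check (a sketch, reusing the log5() function and the a = 0.7, b = 0.4 example from above):

p <- log5(0.7, 0.4)
p/(1 - p)                          # odds that Team A beats Team B
[1] 3.5
(0.7/(1 - 0.7)) * ((1 - 0.4)/0.4)  # Team A's odds times the reciprocal of Team B's odds
[1] 3.5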
or to give the log odds of Team A winning:
\[ \log\left(\frac{p_{a,b}}{1-p_{a,b}}\right) = \log\left(\frac{a}{1 - a}\right) - \log\left(\frac{b}{1-b}\right) \]
What James had found was that while the probability of Team A beating Team B could not be expressed as a linear combination of team qualities, the log odds of Team A's chance of victory could be.
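R's qlogis() computes exactly this log odds (logit) transformation, so the additive structure is easy to verify (again just a sketch, not part of the original analysis):

qlogis(log5(0.7, 0.4))       # log odds of Team A beating Team B
[1] 1.252763
qlogis(0.7) - qlogis(0.4)    # Team A's log odds minus Team B's log odds
[1] 1.252763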
This suggests that our regression model should predict the log odds of the outcome (a win or a loss) rather than the outcome itself. In statistical parlance, a model that predicts a function of the (expected) outcome variable is called a “generalized” linear model, and the function that you choose is called the “link function”. The log odds function, \( \log(\frac{y}{1-y}) \), is known as the logit (or logistic) link, and a regression model that uses it is known as logistic regression.
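In R the logit link and its inverse are available as qlogis() and plogis(), and family = binomial() in glm() uses the logit link by default. A minimal illustration:

y <- 0.75
qlogis(y)          # the logit link: log(y/(1 - y))
[1] 1.098612
plogis(qlogis(y))  # plogis() inverts the link, mapping log odds back to a probability
[1] 0.75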
# linear regression: predict the win/loss indicator directly from the Elo gap and home indicator
lm(won.game~I(elo.start-opp.elo.start)+home, data=results$df %>% filter(season>=2004))
| Coefficient | Estimate |
|---|---|
| (Intercept) | 0.504071 |
| I(elo.start - opp.elo.start) | 0.001171 |
| home | 0.068794 |
What does this model predict when a 1700 team plays a 1300 team at home?
0.504071 + 0.001171*400 + 0.068794 = 1.041, a “probability” greater than 1, which is impossible.
For the Elo formula, the difference is 400 + 55 (from the HFA) = 455.
Elo says: \[ \frac{1}{1 + 10^{\frac{-455}{400}}} = 0.932 = 93.2 \% \]
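Both calculations are easy to reproduce in R (a sketch that simply plugs in the numbers above):

0.504071 + 0.001171*400 + 0.068794   # linear model: a "probability" above 1
[1] 1.041265
round(1/(1 + 10^(-455/400)), 3)      # Elo expected win probability for a 455-point edge
[1] 0.932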
# logistic regression: same predictors, but now modelling the log odds of winning
glm(won.game~I(elo.start-opp.elo.start)+home, family=binomial(), data=results$df %>% filter(season>=2004))
| Coefficient | Estimate |
|---|---|
| (Intercept) | 0.018577 |
| I(elo.start - opp.elo.start) | 0.005484 |
| home | 0.319901 |
What does this model predict when a 1700 team plays a 1300 team at home?
log(odds) = 0.018577 + 0.005484*400 + 0.319901 = 2.532
exp(2.532) = 12.579
12.579/13.579 = 92.6%, which is close to the 93.2% given by the Elo formula.
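The back-transformation from log odds to a probability is exactly what plogis() does, so the whole calculation can be done in one go (a sketch using the printed coefficients):

lo <- 0.018577 + 0.005484*400 + 0.319901   # predicted log odds for this match-up
round(plogis(lo), 3)                       # inverse logit: probability of winning
[1] 0.926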