Sports Data Science
James reasoned that with \( p_{a,b} \) defined as the probability that Team A will beat Team B, \( a \) as Team A's true winning percentage (the probability that they would beat an average team), and \( b \) as Team B's true winning percentage, the following six statements must be true:
\( p_{a,a} = 0.5 \) (note that this rule concerns two teams of equal quality, not a team playing itself)
\( p_{a,0.5} = a \)
If \( a > b \) then \( p_{a,b} > 0.5 \) and if \( a < b \) then \( p_{a,b} < 0.5 \)
If \( b < 0.5 \) then \( p_{a,b} > a \) and if \( b > 0.5 \) then \( p_{a,b} < a \)
\( 0 \leq p_{a,b} \leq 1 \) and if \( 0 < a < 1 \) then \( p_{a,0}=1 \) and \( p_{a,1}=0. \)
\( p_{a,b} + p_{b,a} = 1 \)
He realized that no linear combination of team qualities, i.e. no formula of the form
\[ p_{a,b} = \beta_0 + \beta_1 \cdot a + \beta_2 \cdot b, \]
would do the trick, and instead produced and published the following formula, dubbed the “log5” formula despite the fact that it has no immediately obvious connection to logarithms:
\[ p_{a,b} = \frac{a - a \cdot b}{a + b - 2 \cdot a \cdot b} \]
For example, suppose Team A's true winning percentage is \( a = 0.7 \) and Team B's is \( b = 0.4 \):
\[ p_{a,b} = \frac{a - a \cdot b}{a + b - 2 \cdot a \cdot b} \]
\[ p_{a,b} = \frac {0.7 - 0.7 \cdot 0.4}{0.7 + 0.4 - 2 \cdot 0.7 \cdot 0.4} \]
\[ p_{a,b} = \frac {0.7 - 0.28}{1.10 - 0.56} = \frac {0.42}{0.54} \]
\[ p_{a,b} = \frac{7}{9} = 0.\bar{7} \]
log5 <- function(a, b){
  # Bill James' log5 formula: probability that a team with true winning
  # percentage a beats a team with true winning percentage b
  (a - a*b)/(a + b - 2*a*b)
}
log5(0.7, 0.4)
[1] 0.7777778
log5(0.7, 0.5)
[1] 0.7
log5(0.7, 1)
[1] 0
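The remaining rules are just as easy to spot-check numerically; here is a quick sketch (not part of the original code) using the same function:

log5(0.6, 0.6)                     # two equally good teams -> 0.5
[1] 0.5
log5(0.7, 0.4) + log5(0.4, 0.7)    # the two complementary probabilities sum to 1
[1] 1
log5(0.7, 0.4) > 0.5               # a > b implies p > 0.5
[1] TRUE
log5(0.7, 0.4) > 0.7               # b < 0.5 implies p > a
[1] TRUE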
It turns out that James had rediscovered a probability model that had been created by R. A. Bradley and M. E. Terry in 1952, who had themselves rediscovered a model published by Ernst Zermelo in 1929.
Bill James' formula can be rearranged to give a formula for the odds of Team A emerging victorious:
\[ \frac{p_{a,b}}{1-p_{a,b}} = \frac{a - a \cdot b}{b - a \cdot b} = \frac{a}{1 - a} \cdot \frac{1-b}{b} \]
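As a quick numerical sanity check (a sketch, reusing the log5() function and the a = 0.7, b = 0.4 example from above):

p <- log5(0.7, 0.4)
p/(1 - p)                          # odds that Team A beats Team B
[1] 3.5
(0.7/(1 - 0.7)) * ((1 - 0.4)/0.4)  # Team A's odds times the reciprocal of Team B's odds
[1] 3.5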
or to give the log odds of Team A winning:
\[ \log\left(\frac{p_{a,b}}{1-p_{a,b}}\right) = \log\left(\frac{a}{1 - a}\right) - \log\left(\frac{b}{1-b}\right) \]
What James had found was that while the probability of Team A beating Team B could not be expressed as a linear combination of team qualities, the log odds of Team A's chance of victory could be.
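R's qlogis() computes exactly this log odds (logit) transformation, so the additive structure is easy to verify (again just a sketch, not part of the original analysis):

qlogis(log5(0.7, 0.4))       # log odds of Team A beating Team B
[1] 1.252763
qlogis(0.7) - qlogis(0.4)    # Team A's log odds minus Team B's log odds
[1] 1.252763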
This suggests that our regression model should predict the log odds of the outcome (a win or a loss) rather than the outcome itself. In statistical parlance, a model that predicts a function of the (expected) outcome variable is called a “generalized” linear model, and the function that you choose is called the “link function”. The log odds function, \( \log(\frac{y}{1-y}) \), is known as the logit (or logistic) link, and a regression model that uses it is known as logistic regression.
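In R the logit link and its inverse are available as qlogis() and plogis(), and family = binomial() in glm() uses the logit link by default. A minimal illustration:

y <- 0.75
qlogis(y)          # the logit link: log(y/(1 - y))
[1] 1.098612
plogis(qlogis(y))  # plogis() inverts the link, mapping log odds back to a probability
[1] 0.75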
# linear regression: predict the win/loss indicator directly from the Elo gap and home indicator
lm(won.game~I(elo.start-opp.elo.start)+home, data=results$df %>% filter(season>=2004))
| Coefficient | Estimate |
|---|---|
| (Intercept) | 0.504071 |
| I(elo.start - opp.elo.start) | 0.001171 |
| home | 0.068794 |
What does this model predict when a 1700 team plays a 1300 team at home?
0.504071 + 0.001171*400 + 0.068794 = 1.041, a “probability” greater than 1, which is impossible.
For the Elo formula, the difference is 400 + 55 (from the HFA) = 455.
Elo says: \[ \frac{1}{1 + 10^{\frac{-455}{400}}} = 0.932 = 93.2 \% \]
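Both calculations are easy to reproduce in R (a sketch that simply plugs in the numbers above):

0.504071 + 0.001171*400 + 0.068794   # linear model: a "probability" above 1
[1] 1.041265
round(1/(1 + 10^(-455/400)), 3)      # Elo expected win probability for a 455-point edge
[1] 0.932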
# logistic regression: same predictors, but now modelling the log odds of winning
glm(won.game~I(elo.start-opp.elo.start)+home, family=binomial(), data=results$df %>% filter(season>=2004))
| Coefficient | Estimate |
|---|---|
| (Intercept) | 0.018577 |
| I(elo.start - opp.elo.start) | 0.005484 |
| home | 0.319901 |
What does this model predict when a 1700 team plays a 1300 team at home?
log(odds) = 0.018577 + 0.005484*400 + 0.319901 = 2.532
exp(2.532) = 12.579
12.579/13.579 = 92.6%, which is close to the 93.2% given by the Elo formula.
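The back-transformation from log odds to a probability is exactly what plogis() does, so the whole calculation can be done in one go (a sketch using the printed coefficients):

lo <- 0.018577 + 0.005484*400 + 0.319901   # predicted log odds for this match-up
round(plogis(lo), 3)                       # inverse logit: probability of winning
[1] 0.926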