Let’s take another look at the Hibbs data but this time we’ll try to predict the chance of a candidate winning the popular vote.
# won_pop = 1 if the incumbent party won the popular vote, 0 otherwise
hibbs = hibbs %>% mutate(won_pop = ifelse(vote >= 50, 1, 0))
We can add a best fit line…
m = lm(won_pop ~ growth, data=hibbs)
summary(m)
## 
## Call:
## lm(formula = won_pop ~ growth, data = hibbs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78700 -0.28250  0.01001  0.34950  0.62700 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.18301    0.19167   0.955   0.3559  
## growth       0.20000    0.08228   2.431   0.0291 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4447 on 14 degrees of freedom
## Multiple R-squared:  0.2968, Adjusted R-squared:  0.2465 
## F-statistic: 5.908 on 1 and 14 DF,  p-value: 0.02911
\[won\_pop = 0.183 + 0.2 \cdot growth\]
But this line predicts a “probability” greater than 1 whenever

\[ 0.183 + 0.2\cdot growth > 1\]
\[ 0.2 \cdot growth > 0.817\]
\[ growth > 4.085\]
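To see the problem concretely (a quick check using the lm fit m from above):

predict(m, newdata = data.frame(growth = c(0, 2, 4.5)))  # last value exceeds 1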
We predict the log odds of winning…
\[odds = \frac{probability}{1-probability}\]
\[ 0 \le odds < \infty \]
\[ log\ odds = log\left(\frac{probability}{1-probability}\right) \]
\[ -\infty < log\ odds < \infty \]
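For example, a 75% chance of winning corresponds to odds of 3 (three wins per loss) and log odds of about 1.1:

p = 0.75
p / (1 - p)        # odds: 3
log(p / (1 - p))   # log odds: 1.0986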
Just add a g to lm to make glm (from linear model to “generalized” linear model). Here the family is “binomial” (“successes” and “failures”):
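m = glm(won_pop ~ growth, family="binomial", data=hibbs)
summary(m)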
## 
## Call:
## glm(formula = won_pop ~ growth, family = "binomial", data = hibbs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8679  -0.7647   0.3773   0.8479   1.4466  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -1.608      1.089  -1.476   0.1401  
## growth         1.046      0.544   1.924   0.0544 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21.930  on 15  degrees of freedom
## Residual deviance: 16.581  on 14  degrees of freedom
## AIC: 20.581
## 
## Number of Fisher Scoring iterations: 4
\[log\ odds\ winning = -1.608 + 1.046 \cdot growth\]
The incumbent is favored (log odds above 0) when
\[ -1.608 + 1.046 \cdot growth > 0 \]
\[ 1.046 \cdot growth > 1.608\]
\[ growth > 1.54 \]
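The same break-even point comes straight from the fitted coefficients:

-coef(m)[1] / coef(m)[2]   # growth where the log odds cross 0, about 1.54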
\[log\ odds\ winning = -1.608 + 1.046 \cdot growth\]
\[odds\ winning = e^{-1.608 + 1.046 \cdot growth}\]
\[odds\ winning = e^{-1.608} \cdot e^{1.046 \cdot growth}\]
\[odds\ winning = e^{-1.608} \cdot (e^{1.046})^{growth}\]
\[odds\ winning = 0.2003 \cdot 2.85^{growth}\]
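(These numbers come from exponentiating the coefficients in R:)

exp(coef(m))   # 0.2003 for the intercept, 2.85 for growth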
How do we make sense of this?
\[odds\ winning = 0.2003 \cdot 2.85^{growth}\]
With 0% growth the incumbent party has 0.2 odds of winning (meaning 1 win for every 5 losses). For every additional 1% growth the incumbent’s odds of winning are multiplied by 2.85!
\[ prob\ winning = \frac{0.2003 \cdot 2.85^{growth}}{ 1 + 0.2003 \cdot 2.85^{growth} }\]
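Rather than computing this by hand, predict() with type = "response" does the odds-to-probability conversion for us (using the glm fit m from above):

predict(m, newdata = data.frame(growth = c(0, 2, 4)), type = "response")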
##   Team Year GameMinute Kicker Distance ScoreDiff Grass Success
## 1  PHI 2005          3  Akers       49         0 FALSE       0
## 2  PHI 2005         29  Akers       49        -7 FALSE       0
## 3  PHI 2005         51  Akers       44        -7 FALSE       1
## 4  PHI 2005         14  Akers       43        14  TRUE       0
## 5  PHI 2005         60  Akers       23         0  TRUE       1
## 6  PHI 2005         39  Akers       34        -3  TRUE       1
##      Team                Year        GameMinute       Kicker         
##  Length:11187       Min.   :2005   Min.   : 1.00   Length:11187      
##  Class :character   1st Qu.:2007   1st Qu.:19.00   Class :character  
##  Mode  :character   Median :2010   Median :30.00   Mode  :character  
##                     Mean   :2010   Mean   :32.74                     
##                     3rd Qu.:2013   3rd Qu.:46.00                     
##                     Max.   :2015   Max.   :77.00                     
##     Distance       ScoreDiff           Grass            Success      
##  Min.   :18.0   Min.   :-45.0000   Mode :logical   Min.   :0.0000   
##  1st Qu.:28.0   1st Qu.: -4.0000   FALSE:5053      1st Qu.:1.0000   
##  Median :37.0   Median :  0.0000   TRUE :6134      Median :1.0000   
##  Mean   :36.9   Mean   :  0.5843                   Mean   :0.8327   
##  3rd Qu.:45.0   3rd Qu.:  6.0000                   3rd Qu.:1.0000   
##  Max.   :76.0   Max.   : 48.0000                   Max.   :1.0000   
kickers %>%
  group_by(Distance) %>%
  summarize(rate = mean(Success), n = length(Success)) %>%
  ggplot(aes(Distance, rate, size = n)) +
  geom_point()
Oh no!!!
m.logistic <- glm(Success ~ Distance, data=kickers, family="binomial")
coef(m.logistic)
## (Intercept) Distance ## 5.7246196 -0.1026151
\[log\ odds\ success = 5.72 - 0.1026 \cdot Distance\]
\[ odds\ success = e^{5.72 - 0.1026 \cdot Distance}\]
\[ odds\ success = e^{5.72} \cdot (e^{-0.1026})^{Distance} \]
\[ odds\ success = 304.9 \cdot 0.9025^{Distance}\]
The odds are even when
\[304.9 \cdot 0.9025^{Distance} = 1\]
i.e. when
\[ 0.9025^{Distance} = \frac{1}{304.9} = 0.003279764\]
log(1/304.9, base=0.9025)
## [1] 55.75762
So the odds of making a field goal are even at about 56 yards.
The odds get multiplied by 0.9 with every extra yard, or cut in half (roughly) every 7 yards.
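A quick sanity check on that rule of thumb:

0.9025^7   # about 0.49, so the odds roughly halve over 7 extra yards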
Use the following code to find the effect of kicking on grass relative to kicking on turf.
m.logistic <- glm(Success ~ Distance + Grass, data=kickers, family="binomial")
summary(m.logistic)
## 
## Call:
## glm(formula = Success ~ Distance + Grass, family = "binomial", 
##     data = kickers)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7542   0.2486   0.4028   0.6511   1.5849  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.825574   0.141694  41.114  < 2e-16 ***
## Distance    -0.102782   0.003139 -32.742  < 2e-16 ***
## GrassTRUE   -0.168174   0.054718  -3.073  0.00212 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10105.0  on 11186  degrees of freedom
## Residual deviance:  8738.9  on 11184  degrees of freedom
## AIC: 8744.9
## 
## Number of Fisher Scoring iterations: 5
coef(m.logistic)
## (Intercept) Distance GrassTRUE ## 5.8255740 -0.1027818 -0.1681737
\[log\ odds\ success = 5.826 -0.103 \cdot Distance -0.168 \cdot Grass\]
exp(coef(m.logistic))
## (Intercept) Distance GrassTRUE ## 338.8555700 0.9023238 0.8452070
\[odds\ success = 338.9 \cdot 0.902^{Distance} \cdot 0.845^{Grass}\]
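To put that 0.845 grass multiplier on the probability scale, we can compare grass to turf at a fixed distance (45 yards is just an illustrative choice):

predict(m.logistic,
        newdata = data.frame(Distance = 45, Grass = c(TRUE, FALSE)),
        type = "response")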
Does Grass affect field goals?
Red is Turf
m.logistic <- glm(Success ~ Distance + Grass + Year, data=kickers, family="binomial")
summary(m.logistic)
## 
## Call:
## glm(formula = Success ~ Distance + Grass + Year, family = "binomial", 
##     data = kickers)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7683   0.2456   0.3975   0.6371   1.6056  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.042e+02  1.732e+01  -6.017 1.78e-09 ***
## Distance    -1.047e-01  3.175e-03 -32.976  < 2e-16 ***
## GrassTRUE   -1.549e-01  5.488e-02  -2.823  0.00476 ** 
## Year         5.479e-02  8.626e-03   6.352 2.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10105.0  on 11186  degrees of freedom
## Residual deviance:  8698.3  on 11183  degrees of freedom
## AIC: 8706.3
## 
## Number of Fisher Scoring iterations: 5
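The Year coefficient suggests kickers have been improving: holding distance and surface fixed, the odds of success get multiplied by exp(0.0548), about 1.056, i.e. roughly 5.6% better odds per year.

exp(coef(m.logistic)["Year"])   # ~1.056, the odds multiplier per year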
Imagine a simpler world where kicks miss for one of two reasons:
They weren’t long enough
They missed left or right
I’ll further pretend (for now) that these are independent.
\[P(success) = P(long\ enough)\cdot P(good\ angle)\]
Let’s pretend that at Distance = 0 all kicks are long enough, and that P(long enough) decreases with distance but never drops below zero.
\[P(long\ enough) = \frac{1}{1 + e^{a + b\cdot Distance}}\] where I don’t know what a and b are.
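To get a feel for this curve we can plot it for guessed values (a = -9 and b = 0.15 are illustrative only; they reappear below as starting values for the fit):

# illustrative guesses, not fitted values
curve(1 / (1 + exp(-9 + 0.15 * x)), from = 0, to = 80,
      xlab = "Distance", ylab = "P(long enough)")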
The crossbar is about 6 meters wide, so if kickers aim for the middle they can miss in either direction by up to 3 meters and still make the kick.
Let’s express that acceptable miss as an angle:
\[Acceptable\ Miss\ Angle = sin(\frac{3}{Distance})\]
(Strictly the angle is \(sin^{-1}(\frac{3}{Distance})\), but for angles this small \(sin(x) \approx sin^{-1}(x) \approx x\), so we’ll use \(sin\) to match the code below.)
Let’s assume that kickers’ angles are normally distributed around dead center with a standard deviation of \(\sigma\).
Then the probability of missing left or right is the chance of a z-score more extreme than the acceptable miss angle divided by \(\sigma\).
In other words, if the acceptable miss angle is \(10^{\circ}\) and a kicker has a standard deviation of \(5^{\circ}\) then the chance of a good angle is:
pnorm(10/5) - pnorm(-10/5)
## [1] 0.9544997
2*pnorm(10/5) - 1
## [1] 0.9544997
\[ P(good\ angle) = 2*pnorm(\frac{Largest\ Acceptable\ Miss\ Angle}{\sigma}) - 1 \]
\[Largest\ Acceptable\ Miss\ Angle = sin(\frac{3}{Distance})\]
\[P(good\ angle) = 2*pnorm(\frac{sin(\frac{3}{Distance})}{\sigma}) - 1\]
\[P(success) = P(long\ enough)\cdot P(good\ angle)\]
\[P(success) = \frac{2*pnorm(\frac{sin(\frac{3}{Distance})}{\sigma}) - 1}{1 + e^{a + b\cdot Distance}}\]
We just need to use the data to solve for a, b, and \(\sigma\)!
success_at_distance = kickers %>%
  group_by(Distance) %>%
  summarize(rate = mean(Success), n = length(Success))
m = nls(rate ~ (2*pnorm(sin(3/Distance)/s) - 1) /
          (1 + exp(a + b*Distance)),
        start = list(s = pi/36, a = -9, b = 0.15),
        data = success_at_distance,
        weights = n)
summary(m)
## 
## Formula: rate ~ (2 * pnorm(sin(3/Distance)/s) - 1)/(1 + exp(a + b * Distance))
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## s  5.501e-02  8.479e-04  64.880  < 2e-16 ***
## a -1.386e+01  1.665e+00  -8.327 5.22e-11 ***
## b  2.288e-01  2.993e-02   7.642 5.97e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3416 on 50 degrees of freedom
## 
## Number of iterations to convergence: 6 
## Achieved convergence tolerance: 7.891e-06
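The fitted \(\sigma\) is an angle in radians; converting to degrees, a typical kicker’s directional standard deviation is a bit over 3 degrees:

0.05501 * 180 / pi   # fitted sigma, about 3.15 degrees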
(with the old logistic regression in red)
P(Good Angle) in blue; P(Far Enough) in green.
Maybe kickers’ angles aren’t normally distributed. Are there more mishits/outliers than we’d expect from a normal distribution? If so, maybe we need a distribution with fatter tails. What about a t-distribution with 3 degrees of freedom?
What if angles are more unpredictable for longer kicks? This could happen if kickers are less accurate when they try to kick harder, or because footballs slice or hook.
\[ P(good\ angle) = 2\cdot pt(\frac{sin(\frac{3}{Distance})}{\sigma_1+\sigma_2\cdot Distance}, df=3)-1\]

## The New Model
m2 = nls(rate ~ (2*pt(sin(3/Distance)/(s + s2*Distance), df=3) - 1) /
           (1 + exp(a + b*Distance)),
         start = list(s = 5.501e-02, a = -1.386e+01, b = 2.288e-01, s2 = 0),
         data = success_at_distance,
         weights = n)
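To check whether the extra parameter earns its keep, we can inspect the new fit and compare the two models (a rough comparison, since both are weighted least-squares approximations):

summary(m2)    # estimates for s, s2, a, b
AIC(m, m2)     # the lower AIC indicates the better trade-off of fit and complexity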