Let’s take another look at the Hibbs data but this time we’ll try to predict the chance of a candidate winning the popular vote.
# won_pop = 1 if the incumbent party won the popular vote, 0 otherwise
hibbs = hibbs %>% mutate(won_pop = ifelse(vote >= 50, 1, 0))
We can add a best fit line…
m = lm(won_pop ~ growth, data=hibbs)
summary(m)
## 
## Call:
## lm(formula = won_pop ~ growth, data = hibbs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78700 -0.28250  0.01001  0.34950  0.62700 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.18301    0.19167   0.955   0.3559  
## growth       0.20000    0.08228   2.431   0.0291 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4447 on 14 degrees of freedom
## Multiple R-squared:  0.2968, Adjusted R-squared:  0.2465 
## F-statistic: 5.908 on 1 and 14 DF,  p-value: 0.02911
\[won\_pop = 0.183 + 0.2 \cdot growth\]
But this line predicts a “probability” greater than 1 whenever

\[ 0.183 + 0.2\cdot growth > 1\]
\[ 0.2 \cdot growth > 0.817\]
\[ growth > 4.085\]
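To see the problem concretely (a quick check using the lm fit m from above):

predict(m, newdata = data.frame(growth = c(0, 2, 4.5)))  # last value exceeds 1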
We predict the log odds of winning…
\[odds = \frac{probability}{1-probability}\]
\[ 0 \le odds < \infty \]
\[ log\ odds = log\left(\frac{probability}{1-probability}\right) \]
\[ -\infty < log\ odds < \infty \]
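For example, a 75% chance of winning corresponds to odds of 3 (three wins per loss) and log odds of about 1.1:

p = 0.75
p / (1 - p)        # odds: 3
log(p / (1 - p))   # log odds: 1.0986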
Just add a g to lm to make glm (from linear model to “generalized” linear model). Here the family is “binomial” (“successes” and “failures”):
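m = glm(won_pop ~ growth, family="binomial", data=hibbs)
summary(m)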
## 
## Call:
## glm(formula = won_pop ~ growth, family = "binomial", data = hibbs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8679  -0.7647   0.3773   0.8479   1.4466  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -1.608      1.089  -1.476   0.1401  
## growth         1.046      0.544   1.924   0.0544 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21.930  on 15  degrees of freedom
## Residual deviance: 16.581  on 14  degrees of freedom
## AIC: 20.581
## 
## Number of Fisher Scoring iterations: 4
\[log\ odds\ winning = -1.608 + 1.046 \cdot growth\]
The incumbent is favored (log odds above 0) when
\[ -1.608 + 1.046 \cdot growth > 0 \]
\[ 1.046 \cdot growth > 1.608\]
\[ growth > 1.54 \]
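The same break-even point comes straight from the fitted coefficients:

-coef(m)[1] / coef(m)[2]   # growth where the log odds cross 0, about 1.54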
\[log\ odds\ winning = -1.608 + 1.046 \cdot growth\]
\[odds\ winning = e^{-1.608 + 1.046 \cdot growth}\]
\[odds\ winning = e^{-1.608} \cdot e^{1.046 \cdot growth}\]
\[odds\ winning = e^{-1.608} \cdot (e^{1.046})^{growth}\]
\[odds\ winning = 0.2003 \cdot 2.85^{growth}\]
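(These numbers come from exponentiating the coefficients in R:)

exp(coef(m))   # 0.2003 for the intercept, 2.85 for growth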
How do we make sense of this?
\[odds\ winning = 0.2003 \cdot 2.85^{growth}\]
With 0% growth the incumbent party has 0.2 odds of winning (meaning 1 win for every 5 losses). For every additional 1% growth the incumbent’s odds of winning are multiplied by 2.85!
\[ prob\ winning = \frac{0.2003 \cdot 2.85^{growth}}{ 1 + 0.2003 \cdot 2.85^{growth} }\]
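Rather than computing this by hand, predict() with type = "response" does the odds-to-probability conversion for us (using the glm fit m from above):

predict(m, newdata = data.frame(growth = c(0, 2, 4)), type = "response")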
##   Team Year GameMinute Kicker Distance ScoreDiff Grass Success
## 1  PHI 2005          3  Akers       49         0 FALSE       0
## 2  PHI 2005         29  Akers       49        -7 FALSE       0
## 3  PHI 2005         51  Akers       44        -7 FALSE       1
## 4  PHI 2005         14  Akers       43        14  TRUE       0
## 5  PHI 2005         60  Akers       23         0  TRUE       1
## 6  PHI 2005         39  Akers       34        -3  TRUE       1
##      Team                Year        GameMinute       Kicker         
##  Length:11187       Min.   :2005   Min.   : 1.00   Length:11187      
##  Class :character   1st Qu.:2007   1st Qu.:19.00   Class :character  
##  Mode  :character   Median :2010   Median :30.00   Mode  :character  
##                     Mean   :2010   Mean   :32.74                     
##                     3rd Qu.:2013   3rd Qu.:46.00                     
##                     Max.   :2015   Max.   :77.00                     
##     Distance       ScoreDiff           Grass            Success      
##  Min.   :18.0   Min.   :-45.0000   Mode :logical   Min.   :0.0000   
##  1st Qu.:28.0   1st Qu.: -4.0000   FALSE:5053      1st Qu.:1.0000   
##  Median :37.0   Median :  0.0000   TRUE :6134      Median :1.0000   
##  Mean   :36.9   Mean   :  0.5843                   Mean   :0.8327   
##  3rd Qu.:45.0   3rd Qu.:  6.0000                   3rd Qu.:1.0000   
##  Max.   :76.0   Max.   : 48.0000                   Max.   :1.0000   
kickers %>%
  group_by(Distance) %>%
  summarize(rate = mean(Success), n = length(Success)) %>%
  ggplot(aes(Distance, rate, size = n)) +
  geom_point()
Oh no!!!
m.logistic <- glm(Success ~ Distance, data=kickers, family="binomial")
coef(m.logistic)
## (Intercept) Distance ## 5.7246196 -0.1026151
\[log\ odds\ success = 5.72 - 0.1026 \cdot Distance\]
\[ odds\ success = e^{5.72 - 0.1026 \cdot Distance}\]
\[ odds\ success = e^{5.72} \cdot (e^{-0.1026})^{Distance} \]
\[ odds\ success = 304.9 \cdot 0.9025^{Distance}\]
The odds are even when
\[304.9 \cdot 0.9025^{Distance} = 1\]
i.e. when
\[ 0.9025^{Distance} = \frac{1}{304.9} = 0.003279764\]
log(1/304.9, base=0.9025)
## [1] 55.75762
So the odds of making a field goal are even at about 56 yards.
The odds get multiplied by 0.9 with every extra yard, or cut in half (roughly) every 7 yards.
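A quick sanity check on that rule of thumb:

0.9025^7   # about 0.49, so the odds roughly halve over 7 extra yards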
Use the following code to find the effect of kicking on grass relative to kicking on turf.
m.logistic <- glm(Success ~ Distance + Grass, data=kickers, family="binomial")
summary(m.logistic)
## 
## Call:
## glm(formula = Success ~ Distance + Grass, family = "binomial", 
##     data = kickers)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7542   0.2486   0.4028   0.6511   1.5849  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.825574   0.141694  41.114  < 2e-16 ***
## Distance    -0.102782   0.003139 -32.742  < 2e-16 ***
## GrassTRUE   -0.168174   0.054718  -3.073  0.00212 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10105.0  on 11186  degrees of freedom
## Residual deviance:  8738.9  on 11184  degrees of freedom
## AIC: 8744.9
## 
## Number of Fisher Scoring iterations: 5
coef(m.logistic)
## (Intercept) Distance GrassTRUE ## 5.8255740 -0.1027818 -0.1681737
\[log\ odds\ success = 5.826 -0.103 \cdot Distance -0.168 \cdot Grass\]
exp(coef(m.logistic))
## (Intercept) Distance GrassTRUE ## 338.8555700 0.9023238 0.8452070
\[odds\ success = 338.9 \cdot 0.902^{Distance} \cdot 0.845^{Grass}\]
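To put that 0.845 grass multiplier on the probability scale, we can compare grass to turf at a fixed distance (45 yards is just an illustrative choice):

predict(m.logistic,
        newdata = data.frame(Distance = 45, Grass = c(TRUE, FALSE)),
        type = "response")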
Does Grass affect field goals?
Red is Turf
m.logistic <- glm(Success ~ Distance + Grass + Year, data=kickers, family="binomial")
summary(m.logistic)
## 
## Call:
## glm(formula = Success ~ Distance + Grass + Year, family = "binomial", 
##     data = kickers)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7683   0.2456   0.3975   0.6371   1.6056  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.042e+02  1.732e+01  -6.017 1.78e-09 ***
## Distance    -1.047e-01  3.175e-03 -32.976  < 2e-16 ***
## GrassTRUE   -1.549e-01  5.488e-02  -2.823  0.00476 ** 
## Year         5.479e-02  8.626e-03   6.352 2.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10105.0  on 11186  degrees of freedom
## Residual deviance:  8698.3  on 11183  degrees of freedom
## AIC: 8706.3
## 
## Number of Fisher Scoring iterations: 5
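The Year coefficient suggests kickers have been improving: holding distance and surface fixed, the odds of success get multiplied by exp(0.0548), about 1.056, i.e. roughly 5.6% better odds per year.

exp(coef(m.logistic)["Year"])   # ~1.056, the odds multiplier per year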
Imagine a simpler world where kicks miss for one of two reasons:
They weren’t long enough
They missed left or right
I’ll further pretend (for now) that these are independent.
\[P(success) = P(long\ enough)\cdot P(good\ angle)\]
Let’s pretend that at Distance = 0 all kicks are long enough, and that P(long enough) decreases with distance but never drops below zero.
\[P(long\ enough) = \frac{1}{1 + e^{a + b\cdot Distance}}\] where I don’t know what a and b are.
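To get a feel for this curve we can plot it for guessed values (a = -9 and b = 0.15 are illustrative only; they reappear below as starting values for the fit):

# illustrative guesses, not fitted values
curve(1 / (1 + exp(-9 + 0.15 * x)), from = 0, to = 80,
      xlab = "Distance", ylab = "P(long enough)")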
The crossbar is about 6 meters wide, so if kickers aim for the middle they can miss in either direction by up to 3 meters and still make the kick.
Let’s express that acceptable miss as an angle:
\[Acceptable\ Miss\ Angle = sin(\frac{3}{Distance})\]
(Strictly the angle is \(sin^{-1}(\frac{3}{Distance})\), but for angles this small \(sin(x) \approx sin^{-1}(x) \approx x\), so we’ll use \(sin\) to match the code below.)
Let’s assume that kickers’ angles are normally distributed around dead center with a standard deviation of \(\sigma\).
Then the probability of missing left or right is the chance of a z-score more extreme than the acceptable miss angle divided by \(\sigma\).
In other words, if the acceptable miss angle is \(10^{\circ}\) and a kicker has a standard deviation of \(5^{\circ}\) then the chance of a good angle is:
pnorm(10/5) - pnorm(-10/5)
## [1] 0.9544997
2*pnorm(10/5) - 1
## [1] 0.9544997
\[ P(good\ angle) = 2*pnorm(\frac{Largest\ Acceptable\ Miss\ Angle}{\sigma}) - 1 \]
\[Largest\ Acceptable\ Miss\ Angle = sin(\frac{3}{Distance})\]
\[P(good\ angle) = 2*pnorm(\frac{sin(\frac{3}{Distance})}{\sigma}) - 1\]
\[P(success) = P(long\ enough)\cdot P(good\ angle)\]
\[P(success) = \frac{2*pnorm(\frac{sin(\frac{3}{Distance})}{\sigma}) - 1}{1 + e^{a + b\cdot Distance}}\]
We just need to use the data to solve for a, b, and \(\sigma\)!
success_at_distance = kickers %>%
  group_by(Distance) %>%
  summarize(rate = mean(Success), n = length(Success))
m = nls(rate ~ (2*pnorm(sin(3/Distance)/s) - 1) /
          (1 + exp(a + b*Distance)),
        start = list(s = pi/36, a = -9, b = 0.15),
        data = success_at_distance,
        weights = n)
summary(m)
## 
## Formula: rate ~ (2 * pnorm(sin(3/Distance)/s) - 1)/(1 + exp(a + b * Distance))
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## s  5.501e-02  8.479e-04  64.880  < 2e-16 ***
## a -1.386e+01  1.665e+00  -8.327 5.22e-11 ***
## b  2.288e-01  2.993e-02   7.642 5.97e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3416 on 50 degrees of freedom
## 
## Number of iterations to convergence: 6 
## Achieved convergence tolerance: 7.891e-06
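The fitted \(\sigma\) is an angle in radians; converting to degrees, a typical kicker’s directional standard deviation is a bit over 3 degrees:

0.05501 * 180 / pi   # fitted sigma, about 3.15 degrees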
(with the old logistic regression in red)
P(Good Angle) in blue; P(Far Enough) in green.
Maybe kickers’ angles aren’t normally distributed. Are there more mishits/outliers than we’d expect from a normal distribution? If so, maybe we need a distribution with fatter tails. What about a t-distribution with 3 degrees of freedom?
What if angles are more unpredictable for longer kicks? This could happen if kickers are less accurate when they try to kick harder, or because footballs slice or hook.
\[ P(good\ angle) = 2\cdot pt(\frac{sin(\frac{3}{Distance})}{\sigma_1+\sigma_2\cdot Distance}, df=3)-1\]

## The New Model
m2 = nls(rate ~ (2*pt(sin(3/Distance)/(s + s2*Distance), df=3) - 1) /
           (1 + exp(a + b*Distance)),
         start = list(s = 5.501e-02, a = -1.386e+01, b = 2.288e-01, s2 = 0),
         data = success_at_distance,
         weights = n)
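To check whether the extra parameter earns its keep, we can inspect the new fit and compare the two models (a rough comparison, since both are weighted least-squares approximations):

summary(m2)    # estimates for s, s2, a, b
AIC(m, m2)     # the lower AIC indicates the better trade-off of fit and complexity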