Friday April 19, 2013 Stats 155 Class Notes

The ideas behind logistic regression

Straight-line models of probability are too stiff — go outside of 0-1 interval
Residual is problematic when dealing with observations that are Yes/No.
- Suppose the model probability of an event is 0.2 and it happens. What's the residual?
- Suppose the model probability of an event is 0.1 and it happens. What's the residual? How different should it be from the residual for a model probability of 0.2?
- What should the residual be if the model probability were 0.000001? Certainly it should be much bigger than for a model value of 0.1
We fit models by optimizing the “likelihood” — the probability of the observations given the model. There is a quantity called the “residual deviance” that is calculated from the likelihood that is analogous to the residual variance. There's a nice mathematical theory of this, which we won't be covering in this course.

Linear Modeling of Groups

Calculates the groupwise mean, which gives a probability if the response is 0/1.

w = fetchData("whickham.csv")

## Retrieving from http://www.mosaic-web.org/go/datasets/whickham.csv

# Create a zero-one survived variable
w = transform(w, survived = outcome == "Alive")
# Model survival by smoking
mod = lm(survived ~ smoker, data = w)
f = makeFun(mod)
f(smoker = c("Yes", "No"))

##      1      2 
## 0.7612 0.6858

Now to calculate the probability of the outcome according to the model. This is really a conditional probability, p(outcome|model). Work through this by hand, then translate into a number for the entire group.

w = transform(w, prob = ifelse(survived, f(smoker), 1 - f(smoker)))

Now we have a probability for each outcome. Assuming that the outcomes are independent, the probability of the whole set of outcomes is the product of all the outcomes.

with(w, prod(prob))

## [1] 0

Although this looks like zero, it's really not — it's just a very small number. Computer round-off has come into play. To avoid this problem, and for other, theoretical reasons, one works with the log-likelihood: the logarithm of the likelihood.

with(w, sum(log(prob)))

## [1] -775.6

This approach is fine for categorical variables, but it doesn't work when there is more than one variable or when there are quantitative explanatory variables. The reason is that the model value can go outside the range 0-1.

mod2 = lm(survived ~ age, data = w)
f2 = makeFun(mod2)
w = transform(w, prob2 = ifelse(survived, f2(age), 1 - f2(age)))
with(w, range(prob2))

## [1] -0.1874  1.1874

Fitting logistic models

Constructing a yes/no variable with ==
glm and the family="binomial"
The summary report

Confounding, covariates, and Logistic Regression

The Whickham data.

This is drawn from a cohort study of nurses in Whickhamshire, England. Nurses were asked many questions about their lifestyle, diet, etc. Then they were revisited 20 years later. This small data set just records their age at the first interview, whether they identified themselves as smokers, and whether they were still alive at the 20 year follow-up.

First, let's look at the relationship between age and probability of being dead at the follow-up. We'll compare a straightforward linear model with a logistic regression model:

m = lm(survived ~ age, data = w)
m2 = glm(survived ~ age, data = w, family = "binomial")
f = makeFun(m)
f2 = makeFun(m2)
plotFun(f(age) ~ age, age.lim = c(20, 100))
plotFun(f2(age) ~ age, add = TRUE, col = "red", lwd = 3)

plot of chunk unnamed-chunk-5

Note that the models agree on the probability of surviving for women in their middle ages.

At the extremes, the linear model is silly. It has probability outside the range 0 to 1.
It doesn't work so well on the inside either, since the linear function can't be steeper in the center without being absurdly high near the edges.

Interpreting the coefficients

coef(summary(m))

##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)  1.47880  0.0277403   53.31  0.000e+00
## age         -0.01619  0.0005542  -29.21 6.517e-145

The coefficient on age tells how the probability changes with an increase of one year in age. This simple linear model says that the change in probability is constant across the ages. (We asked it to say that: it's a linear model!)

Linear Model: the probability of dying increases by 1.6 percentage points each year.
Logistic Model: the odds of dying increases by 12% per year. That's not percentage points, but a multiplier of 0.12. In other words, the odds of dying increases exponentially.
- QUESTION: This means that the odds reaches \( \infty \). What probability corresponds to an odds of \( \infty \)?

95% confidence interval on the rate of odds increase

exp(log(0.122) + 2 * c(-1, 1) * 0.00694)

## [1] 0.1203 0.1237

Smoking and Death

mod0 = lm(survived ~ smoker, data = w)
mod1 = glm(survived ~ smoker, data = w, family = "binomial")
coef(summary(mod0))

##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)  0.68579    0.01657  41.400 2.440e-240
## smokerYes    0.07538    0.02489   3.028  2.507e-03

coef(summary(mod1))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)   0.7805    0.07962   9.803 1.096e-22
## smokerYes     0.3786    0.12566   3.013 2.590e-03

xyplot(fitted(mod1) ~ fitted(mod0))

plot of chunk unnamed-chunk-8

Interpret the two models as probabilities. Translate the coefficients in the logistic model into probabilities in the second. They are essentially the same think. Both indicate that a smoker is significantly less likely to be dead.

Now include the covariate of age, perhaps with an interaction of age and smoker**:

moda = lm(survived ~ smoker * age, data = w)
modb = glm(survived ~ smoker * age, data = w, family = "binomial")
coef(summary(moda))

##                Estimate Std. Error t value   Pr(>|t|)
## (Intercept)    1.506301  0.0359446  41.906 3.630e-244
## smokerYes     -0.082244  0.0574953  -1.430  1.528e-01
## age           -0.016848  0.0006887 -24.466 3.492e-109
## smokerYes:age  0.002013  0.0011737   1.715  8.659e-02

coef(summary(modb))

##               Estimate Std. Error z value  Pr(>|z|)
## (Intercept)    8.16923   0.606600  13.467 2.438e-41
## smokerYes     -1.45784   0.837232  -1.741 8.164e-02
## age           -0.13323   0.009953 -13.386 7.278e-41
## smokerYes:age  0.02223   0.014495   1.534 1.250e-01

xyplot(fitted(modb) ~ fitted(moda))

plot of chunk unnamed-chunk-9

Notice how the linear model admits probabilities that are smaller than zero.

Why not use a polynomial model with `lm()`

Polynomials go the wrong way: their tails head off to infinity. The use of odds and the logarithm essentially solve the problem of getting a function that behaves nicely.

Introduction to R/Markdown

Activity**: Modeling Barry Bonds in 2001.

bonds = fetchData("bonds2001.csv")

## Retrieving from http://www.mosaic-web.org/go/datasets/bonds2001.csv

Source**: Journal of Statistics Education. Bonds data submitted by Jerome P. Reiter, Institute of Statistics and Decision Sciences, Duke University, Box 90251, Durham, NC 27708, jerry@stat.duke.edu

Student task**: Build a model of whether Bonds gets on base as a function of whatever other variables appear appropriate to you. Write it up in R/Markdown, including some nice graphic and an interpretation of your results.

Here's one model**:

mod1 = glm(reachesBase == 0 ~ outs + onFirst + onSecond + onThird + plateWithinGame, 
    data = bonds, family = "binomial")
summary(mod1)

## 
## Call:
## glm(formula = reachesBase == 0 ~ outs + onFirst + onSecond + 
##     onThird + plateWithinGame, family = "binomial", data = bonds)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.491  -1.136  -0.803   1.153   1.669  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.0796     0.2315   -0.34  0.73082    
## outs             -0.1368     0.1016   -1.35  0.17842    
## onFirst          -0.0984     0.1714   -0.57  0.56610    
## onSecond         -0.7418     0.2246   -3.30  0.00096 ***
## onThird          -0.2954     0.2923   -1.01  0.31222    
## plateWithinGame   0.1273     0.0589    2.16  0.03070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 897.82  on 647  degrees of freedom
## Residual deviance: 877.21  on 642  degrees of freedom
## AIC: 889.2
## 
## Number of Fisher Scoring iterations: 4

Evidently, Bonds was significantly more likely to get on base if there was another player on second base (onSecond), and taking into account his general downward trend over the course of a game (plateWithinGame)

Variables**:

plateNo Plate appearance number.
gameNo: Number of the game in the season.
plateWithinGame: Number of the plate appearance within the game.
inSF: Equals one for games in San Francisco and equals zero otherwise.
onFirst: Equals one when there is a runner on first base when Bonds appears and equals zero otherwise.
onSecond: Equals one when there is a runner on second base when Bonds appears and equals zero otherwise.
onThird: Equals one when there is a runner on third base when Bonds appears and equals zero otherwise.
outs: Number of outs in inning when Bonds appears.
inning: Inning of plate appearance.
runs: Number of runs scored by Giants in the inning after first pitch to Bonds.
walk: Equals one if Bonds walks and equals zero otherwise.
intentionalWalk: Equals one if Bonds walks intentionally and equals zero otherwise.
reachesBase:
- Equals zero if Bonds does not reach base.
- Equals one if Bonds reaches first base on a single or error.
- Equals two if Bonds reaches second base on a double or error.
- Equals three if Bonds reaches third base on a triple or error.
- Equals four if Bonds hits a home run.
- Equals five if Bonds walks or is hit by a pitch.
ERA: Opposing pitchers' career earned run average as of the end of the 2000 season.
scoreGiants: Giants score just before first pitch to Bonds
scoreOpponent: Opposing team's score just before first pitch to Bonds.