Lecture 23: Logistic regression

Linear modeling of groups

Calculates the groupwise mean, which gives a probability if the response is 0/1.

data("Whickham")
# Create a zero-one survived variable
Whickham$survived = Whickham$outcome=="Alive"
# Model survival by smoking
mod = lm(survived ~ smoker, data=Whickham)
f = makeFun(mod)
f(smoker=c("Yes","No"))
##         1         2 
## 0.7611684 0.6857923

Now to calculate the probability of the outcome according to the model. This is really a conditional probability, p(outcome|model).

Whickham$prob = ifelse(Whickham$survived, f(Whickham$smoker), 1-f(Whickham$smoker))

Now we have a probability for each outcome. Assuming that the outcomes are independent, the probability of the whole set of outcomes is the product of all the outcomes.

with(Whickham, prod(prob))
## [1] 0

Although this looks like zero, it’s really not — it’s just a very small number. Computer round-off has come into play. To avoid this problem, and for other, theoretical reasons, one works with the log-likelihood: the logarithm of the likelihood.

with(Whickham, sum(log(prob)))
## [1] -775.5584

This approach is fine for categorical variables, but it doesn’t work when there is more than one variable or when there are quantitative explanatory variables. The reason is that the model value can go outside the range 0-1.

mod2 = lm(survived ~ age, data=Whickham)
f2 = makeFun(mod2)
Whickham$prob2 = ifelse(Whickham$survived, f2(Whickham$age), 1-f2(Whickham$age))
with(Whickham, range(prob2))
## [1] -0.1873843  1.1873843

Fitting logistic models

  • Constructing a yes/no variable with ==
  • glm and the family="binomial"
  • The summary report

Confounding, covariates, and logistic regression

The Whickham data

This is drawn from a cohort study of nurses in Whickhamshire, England. Nurses were asked many questions about their lifestyle, diet, etc. Then they were revisited 20 years later. This small data set just records their age at the first interview, whether they identified themselves as smokers, and whether they were still alive at the 20 year follow-up.

First, let’s look at the relationship between age and probability of being dead at the follow-up. We’ll compare a straightforward linear model with a logistic regression model:

m = lm(survived ~ age, data=Whickham)
m2 = glm(survived ~ age, data=Whickham, family="binomial")

f = makeFun(m)
f2 = makeFun(m2)

plotFun(f(age) ~ age, age.lim=c(20,100))

plotFun(f2(age) ~ age, add=TRUE, col="red", lwd=3)

Note that the models agree on the probability of surviving for women in their middle ages. * At the extremes, the linear model is silly. It has probability outside the range 0 to 1. * It doesn’t work so well on the inside either, since the linear function can’t be steeper in the center without being absurdly high near the edges.

Interpreting the coefficients

coef(summary(m))
##                Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)  1.47879798 0.0277403309  53.30859  0.000000e+00
## age         -0.01618965 0.0005542372 -29.21069 6.516797e-145

The coefficient on age tells how the probability changes with an increase of one year in age. This simple linear model says that the change in probability is constant across the ages. (We asked it to say that: it’s a linear model!)

coef(summary(m2))
##               Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept)  7.4031257 0.403521694  18.34629 3.534427e-75
## age         -0.1218615 0.006941204 -17.55624 5.328854e-69
  • Linear Model: the probability of dying increases by 1.6 percentage points each year.
  • Logistic Model: the odds of dying increases by 12% per year. That’s not percentage points, but a multiplier of 0.12. In other words, the odds of dying increases exponentially!

95% confidence interval on the rate of odds increase

exp(-0.122 + 2*c(-1,1)*0.00694)
## [1] 0.8729474 0.8975199

Smoking and Death

mod0 = lm(survived ~ smoker, data=Whickham)
mod1 = glm(survived ~ smoker, data=Whickham, family="binomial")

coef(summary(mod0))
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 0.68579235 0.01656519 41.399609 2.440451e-240
## smokerYes   0.07537604 0.02489044  3.028313  2.507251e-03
coef(summary(mod1))
##              Estimate Std. Error  z value     Pr(>|z|)
## (Intercept) 0.7805208  0.0796232 9.802681 1.096363e-22
## smokerYes   0.3785750  0.1256639 3.012599 2.590206e-03
xyplot(fitted(mod1) ~ fitted(mod0))

Interpret the two models as probabilities. Translate the coefficients in the logistic model into probabilities in the second:

tally(predict(mod1))
## First argument should be a formula... But I'll try to guess what you meant
## 
## 0.780520810767942  1.15909583691192 
##               732               582
pred = predict(mod1)
pred = exp(pred) / (1+exp(pred))
# are these all pretty much the same?
all.equal(pred, fitted(mod0), fitted(mod1))
## [1] TRUE

They are essentially the same thing! Both indicate that a smoker is significantly less likely to be dead.

Now include the covariate of age, perhaps with an interaction of age and smoker:

moda = lm(survived ~ smoker*age, data=Whickham)
modb = glm(survived ~ smoker*age, data=Whickham, family="binomial")

coef(summary(moda))
##                   Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)    1.506301282 0.0359445984  41.906193 3.630461e-244
## smokerYes     -0.082244146 0.0574952597  -1.430451  1.528261e-01
## age           -0.016848422 0.0006886548 -24.465701 3.492310e-109
## smokerYes:age  0.002012765 0.0011736610   1.714946  8.659167e-02
coef(summary(modb))
##                  Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)    8.16923147 0.606599881  13.467249 2.437817e-41
## smokerYes     -1.45784304 0.837231919  -1.741265 8.163705e-02
## age           -0.13323143 0.009952868 -13.386236 7.277701e-41
## smokerYes:age  0.02223495 0.014494677   1.534008 1.250277e-01
xyplot(fitted(modb)~fitted(moda))

Notice how the linear model admits probabilities that are smaller than zero (and/or larger than one).

Why not use a polynomial model with lm()

Polynomials go the wrong way: their tails head off to infinity. The use of odds and the logarithm essentially solve the problem of getting a function that behaves nicely.

Reference

This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.