Logistic regression

** Regression with “binary data” **

What kinds of things are “Binary”

  • 0 / 1
  • either / or
  • yes / no
  • alive / dead
  • sick / healthy
  • flowered / didn’t flower
  • eaten by deer / not eaten by deer
  • caught something in trap / didn’t catch something
  • shot a deer/didn’t shoot a deer

In R, you can have binary data coded as 0s and 1s, or two categorical variables, “alive” vs. “dead”.





Two types of binary data

Depending on how a study was conducted or how an experiemnt carried out, binary data usually gets colelcted in one of two ways.

Binary data type 1:

(# of “events”)/(# of “trials”)

Often, a fixed number of things were observed, and a certain number of times something happend. This data is in the form (# of “events”)/(# of “trials”)

  • (# of seeds germinated) / (# of seeds planted)
  • (# of dead tadpoles) / (# exposed to pesticide)
  • (# of eaten plants) / (# of plants in forest)
  • (# of deer shot) / (# of hunters in forest)


Requires that the denominator is known

These data are usually easily aggregated in a table

Often used when you have a categorical predictor

Often used for experiments that generate binary data


Comments:

In each case a denominator, the number of trials, can be defined, and the number of events counted. It can be tempting to just use the counts, but including the number of trials is important.

A major assumption of the models used to analyze binary data is that each and every trial and each and every event that happens are independent. Ideally, this means they take place in a seperate, isolated location and what happens in one trial does not impact the other trials. So, if you are looking at seed germination, ideally you would germinate each seed in a seperate pot. If you germinate all of your seeds in a single pot, they are not indepednent. If you are assesing the toxicity of a pesticide on an insect, each insect would be exposed to the chemical in a seperate trial. If all of your insects are housed in a single container and are all sprayed at the same time, they are not indepenent.

Several people in class have this kind of data. In some cases their trials are independent; in others, they are not.





Binary data type 2:

Did an event happen? ~ predictor variable

  • Did you shoot a deer? ~ distance from forest edge
  • Did the Trillium flower? ~ size of the plant
  • Was the Trillium eaten? ~ number of deer in forest
  • Did the person get the flu? ~ concentration of vaccine given


These data are usually hard to aggregate because each “trial”" (a plant that might flower, or that might be eaten) is associated with a unique predictor variable (size of the plant)


A few people in class might have this kind of data. Shows up often in field studies.





Binary data Example 1: (# of “events”)/(# of “trials”)


Experiment: effect of cold exposure on fish

Whitlock & Shulter logistic regression example page 568

## Load data from page 568

### Original data has a sample size of 40
df1.n40 <- data.frame(died.YN = c(11,24,29,38),
                  lived.YN = c(29,16,11,2),
                  exposure.min = c(3,8,12,18))

df1.n40$total.fish <- df1.n40$died.YN +df1.n40$lived.YN

#reduce same size to 20
#make change to data for illustration purposes
df1.n20 <- df1.n40
df1.n20$died.YN <- round(df1.n20$died.YN/2,0)
df1.n20$lived.YN <- round(df1.n20$lived.YN/2,0)
df1.n20$total.fish <- df1.n20$died.YN +df1.n20$lived.YN





Look at dataframe

df1.n20
##   died.YN lived.YN exposure.min total.fish
## 1       6       14            3         20
## 2      12        8            8         20
## 3      14        6           12         20
## 4      19        1           18         20

Common way to think about binary data: percentages

  • Binary data often converted to percentage
  • Divide # of events by # of trials
  • This can cause problems



Calculate perentage mortality for fish data

percent.died <- df1.n20$died.YN/df1.n20$total.fish

df1.n20 <- cbind(percent.died, df1.n20)



Look at new percentage data

df1.n20
##   percent.died died.YN lived.YN exposure.min total.fish
## 1         0.30       6       14            3         20
## 2         0.60      12        8            8         20
## 3         0.70      14        6           12         20
## 4         0.95      19        1           18         20





The problem with percentages from binary data

Problem 1: they hid the sample size

  • What if I said “50% of people who stand in front of a microrwave while warming up their coffee get cancer, and I have data to prove it.”
  • What would be a key feature of this study? … Sample size!
  • What if my data had n = 2? (1 got cancer)
  • What if my data had n = 100? (50 got cancer)
  • Both data have caner rate = 50%
  • Which dataset is more reliable?





Problem 2: people who use them often incorrectly calcualte confidence intervals (CIs)

Model percent.died vs. exposure to cold

Use standard regression w/ lm()

lm.percent <- lm(percent.died ~ exposure.min, 
                 data = df1.n20)





Look at model output

Use summary(…)

summary(lm.percent)
## 
## Call:
## lm(formula = percent.died ~ exposure.min, data = df1.n20)
## 
## Residuals:
##        1        2        3        4 
## -0.03354  0.05683 -0.01087 -0.01242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.207764   0.050907   4.081   0.0551 .
## exposure.min 0.041925   0.004377   9.578   0.0107 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0481 on 2 degrees of freedom
## Multiple R-squared:  0.9787, Adjusted R-squared:  0.968 
## F-statistic: 91.74 on 1 and 2 DF,  p-value: 0.01073


  • As exposure time increase, the percentage of fish that die increases.
  • What number tells you this?





Plot the data

newdat <- data.frame(exposure.min = 3:20)
ci.out <- predict(lm.percent,
                  interval = "confidence",
                  newdata = newdat)





Plot the data with confidence intervals

par(mfrow = c(1,1))
plot(percent.died ~ exposure.min, data = df1.n20, 
     pch = 18, cex = 2,
     xlim = c(3, 19),
     ylim = c(0,1.125))

Ok, as exposure to cold increases, mortality increase





Plot the percentage data w/ the regression line

par(mfrow = c(1,1))
plot(percent.died ~ exposure.min, data = df1.n20, 
     pch = 18, cex = 2,
     xlim = c(3, 19),
     ylim = c(0,1.125))
abline(lm.percent, col = 2, lwd = 2)

Nice regression line showing trend




Plot the data w/regression line & 95% CI

par(mfrow = c(1,1))
plot(percent.died ~ exposure.min, data = df1.n20, 
     pch = 18, cex = 2,
     xlim = c(3, 19),
     ylim = c(0,1.125))
abline(lm.percent, col = 2, lwd = 2)
points(ci.out[,"lwr"] ~ newdat$exposure.min, 
       type = "l")
points(ci.out[,"upr"] ~ newdat$exposure.min, type = "l")

Fancy confidnece interval around trend. Nice, but…





Plot the data w/regression line, 95% CI, and refernece line for percent.died = 1

plot(percent.died ~ exposure.min, data = df1.n20, 
     pch = 18, cex = 2,
     xlim = c(3, 19),
     ylim = c(0,1.125))
abline(lm.percent, col = 2, lwd = 2)
points(ci.out[,"lwr"] ~ newdat$exposure.min, 
       type = "l")
points(ci.out[,"upr"] ~ newdat$exposure.min, type = "l")
abline(h = 1, col = 2, lwd = 2, lty = 2)

  • What does it mean the the confidence interval crosses 1.0?
  • Subtle point: if your model can make confidence intervals for data points that don’t make sense, something is wrong with the model!




Correct analysis of binary data

A correct analysis uses a model that

1)Includes information on the sample size

2)Properly models the binary nature of the data


This requires “logistic regression”

This is a type of “generalized linear model”



Coding a logistic regression GLM

  • uses R’s glm() function
  • requires a statement of the “family” of the model
  • here family = “binomial”
  • Also need to indicate sample size (n)
  • use “weights =” argument
  • here, weights = total.fish





Code to run the logistic regression model model

glm1 <- glm(percent.died ~ exposure.min, 
            data = df1.n20,
            
            #model family:
            family = "binomial",
            
            #sample size
            weights = total.fish)




Logistic regression GLM output

Use summary() command just as for regular regression

summary(glm1)
## 
## Call:
## glm(formula = percent.died ~ exposure.min, family = "binomial", 
##     data = df1.n20, weights = total.fish)
## 
## Deviance Residuals: 
##        1         2         3         4  
##  0.01735   0.30995  -0.62850   0.45772  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.5280     0.5613  -2.722 0.006479 ** 
## exposure.min   0.2241     0.0579   3.870 0.000109 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21.04467  on 3  degrees of freedom
## Residual deviance:  0.70089  on 2  degrees of freedom
## AIC: 16.691
## 
## Number of Fisher Scoring iterations: 4
  • This looks almost the sames as for regular regression
  • BUT - the “slope” parameter for a GLM only tells you the overall direction of a trend (positive or negative)
  • The slope is on a special “logistic regression” scale




Plot output of model

  • Logistic regression output is on a special scale
  • Why this is is beyond this class: it has to do with logs :(
  • R’s “predict” function converts from this “special” to a percentage/probability scale
  • This special scale is called the “link” scale
  • The normal scale we can interpret is the “response” scale
  • That is, the scale that the original response variable was measured on

Get predictions from our GLM

#Predictions from GLM
newdat <- data.frame(exposure.min = 3:20)
se.outlink <- predict(glm1,
                  se.fit = T,
                  type = "link",
                  newdata = newdat)

library(arm)
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
## 
## Attaching package: 'lme4'
## The following object is masked from 'package:stats':
## 
##     sigma
## 
## arm (Version 1.8-6, built: 2015-7-7)
## Working directory is C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lecture/Unit3_regression/week_after_thanksgiving
glm.cis <- data.frame(predictions = invlogit(se.outlink$fit),
                      lwr = invlogit(se.outlink$fit - 1.96*se.outlink$se.fit),
                      upr = invlogit(se.outlink$fit + 1.96*se.outlink$se.fit))

glm.cis <- cbind(glm.cis,newdat)





Plot the output of the GLM:

Data w/regression line
plot(percent.died ~ exposure.min, data = df1.n20, 
     pch = 18, cex = 2,
     xlim = c(3, 19),
     ylim = c(0,1.125))
points(predictions ~ exposure.min, 
       type = "l", data  = glm.cis, col = 2, lwd = 2)




Plot the output of the GLM:

Data, line, confidence intervals
plot(percent.died ~ exposure.min, data = df1.n20, 
     pch = 18, cex = 2,
     xlim = c(3, 19),
     ylim = c(0,1.125))
points(predictions ~ exposure.min, 
       type = "l", data  = glm.cis, col = 2, lwd = 2)
points(upr ~ exposure.min, 
       type = "l", data  = glm.cis, col = 2)
points(lwr ~ exposure.min, 
       type = "l", data  = glm.cis, col = 2)




Plot the output of the GLM:

Data, line, CIs, and reference line at 1.0
plot(percent.died ~ exposure.min, data = df1.n20, 
     pch = 18, cex = 2,
     xlim = c(3, 19),
     ylim = c(0,1.125))
points(predictions ~ exposure.min, 
       type = "l", data  = glm.cis, col = 2, lwd = 2)
points(upr ~ exposure.min, 
       type = "l", data  = glm.cis, col = 2)
points(lwr ~ exposure.min, 
       type = "l", data  = glm.cis, col = 2)

abline(h = 1, col = 2, lwd = 2, lty = 2)




Binary data Example 1: Did event happen? ~ predictor

Fish survival ~ fish size

Say we did this experiment at just the longest exposure time (20 minutes) and want to know what the relationship is between size and mortality?

Make up some data

size <- seq(10,100,length.out = 50)
died.YN <- c(1,1,1,1,1,1,1,1,1,1,1,1,
             1,0,1,1,1,0,0,1,0,1,0,1,
             1,1,0,0,0,0,1,0,0,0,1,0,
             1,0,0,1,0,0,0,0,0,1,0,0,
             0,0)

df3 <- data.frame(died.YN = died.YN, 
                  size = size)

Plot data

  • 1 = died
  • 0 = didn’t die (alive)
  • size varies continuosly from 5 cm to 100 cm
plot(died.YN ~ size, data = df3)





Model data

  • Model with glm()
  • family = binomial
  • DO NOT put anything in for sample size
  • that is, don’t use anything for weight arguement


glm2 <- glm(died.YN ~ size, data = df3,
            
            # family of modle
            family = "binomial")




summary(glm2)
## 
## Call:
## glm(formula = died.YN ~ size, family = "binomial", data = df3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8127  -0.7531   0.3369   0.6730   2.1434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.51383    1.01460   3.463 0.000534 ***
## size        -0.06157    0.01685  -3.655 0.000257 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 69.235  on 49  degrees of freedom
## Residual deviance: 48.266  on 48  degrees of freedom
## AIC: 52.266
## 
## Number of Fisher Scoring iterations: 4
  • The negative value means that as fish get bigger, the probablity that they die decreases
  • As before, these numbers that come from glm() cannot be directly interpretted
  • We can interpret the p-value for the slope as we normally would




Plot with regression line

predictions <- predict(glm2,type = "response")
plot(died.YN ~ size, data = df3)
lines(predictions ~ df3$size, type = "l", col = 2, lwd = 2)