**Regression with “binary data”**
In R, binary data can be coded as 0s and 1s or as a categorical variable with two levels, e.g. “alive” vs. “dead”.
Depending on how a study was conducted or how an experiment was carried out, binary data usually get collected in one of two ways.
Often, a fixed number of things were observed and something happened a certain number of times. These data take the form (# of “events”)/(# of “trials”).
Other times, the data are hard to aggregate because each “trial” (a plant that might flower, or that might be eaten) is associated with its own value of a predictor variable (the size of the plant), so each observation stays as a single 0 or 1.
A few people in class might have this kind of data; it shows up often in field studies.
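A minimal sketch of the two shapes these data take (made-up numbers; toy.grouped and toy.individual are hypothetical names, not the fish data below):

# Shape 1: grouped, (# of events) out of (# of trials) per row
toy.grouped <- data.frame(events = c(2, 7, 9),
                          trials = c(10, 10, 10),
                          dose   = c(1, 2, 3))
# Shape 2: individual, one 0/1 outcome per row,
# each with its own predictor value
toy.individual <- data.frame(flowered.YN = c(1, 0, 1, 1, 0),
                             plant.size  = c(42, 17, 55, 60, 23))

Both shapes can be analyzed with glm(..., family = binomial), as this handout shows later: grouped data use the proportion as the response plus the trials as weights; individual data use the raw 0/1 response.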
Whitlock & Schluter logistic regression example, page 568
## Load data from page 568
### Original data has a sample size of 40
df1.n40 <- data.frame(died.YN = c(11,24,29,38),
lived.YN = c(29,16,11,2),
exposure.min = c(3,8,12,18))
df1.n40$total.fish <- df1.n40$died.YN + df1.n40$lived.YN
# reduce sample size to 20
# make a change to the data for illustration purposes
df1.n20 <- df1.n40
df1.n20$died.YN <- round(df1.n20$died.YN/2,0)
df1.n20$lived.YN <- round(df1.n20$lived.YN/2,0)
df1.n20$total.fish <- df1.n20$died.YN + df1.n20$lived.YN
df1.n20
## died.YN lived.YN exposure.min total.fish
## 1 6 14 3 20
## 2 12 8 8 20
## 3 14 6 12 20
## 4 19 1 18 20
Calculate percent mortality for the fish data
percent.died <- df1.n20$died.YN/df1.n20$total.fish
df1.n20 <- cbind(percent.died, df1.n20)
Look at new percentage data
df1.n20
## percent.died died.YN lived.YN exposure.min total.fish
## 1 0.30 6 14 3 20
## 2 0.60 12 8 8 20
## 3 0.70 14 6 12 20
## 4 0.95 19 1 18 20
Use standard regression w/ lm()
lm.percent <- lm(percent.died ~ exposure.min,
data = df1.n20)
Use summary(…)
summary(lm.percent)
##
## Call:
## lm(formula = percent.died ~ exposure.min, data = df1.n20)
##
## Residuals:
## 1 2 3 4
## -0.03354 0.05683 -0.01087 -0.01242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.207764 0.050907 4.081 0.0551 .
## exposure.min 0.041925 0.004377 9.578 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0481 on 2 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.968
## F-statistic: 91.74 on 1 and 2 DF, p-value: 0.01073
newdat <- data.frame(exposure.min = 3:20)
ci.out <- predict(lm.percent,
interval = "confidence",
newdata = newdat)
par(mfrow = c(1,1))
plot(percent.died ~ exposure.min, data = df1.n20,
pch = 18, cex = 2,
xlim = c(3, 19),
ylim = c(0,1.125))
OK: as exposure to cold increases, mortality increases.
par(mfrow = c(1,1))
plot(percent.died ~ exposure.min, data = df1.n20,
pch = 18, cex = 2,
xlim = c(3, 19),
ylim = c(0,1.125))
abline(lm.percent, col = 2, lwd = 2)
Nice regression line showing trend
par(mfrow = c(1,1))
plot(percent.died ~ exposure.min, data = df1.n20,
pch = 18, cex = 2,
xlim = c(3, 19),
ylim = c(0,1.125))
abline(lm.percent, col = 2, lwd = 2)
points(ci.out[,"lwr"] ~ newdat$exposure.min,
type = "l")
points(ci.out[,"upr"] ~ newdat$exposure.min, type = "l")
Fancy confidence interval around the trend. Nice, but…
plot(percent.died ~ exposure.min, data = df1.n20,
pch = 18, cex = 2,
xlim = c(3, 19),
ylim = c(0,1.125))
abline(lm.percent, col = 2, lwd = 2)
points(ci.out[,"lwr"] ~ newdat$exposure.min,
type = "l")
points(ci.out[,"upr"] ~ newdat$exposure.min, type = "l")
abline(h = 1, col = 2, lwd = 2, lty = 2)
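The dashed line at 1.0 marks the problem: a straight line is unbounded, so for a long enough exposure the fitted mortality exceeds 100%. A quick check at a value just past the observed range (the exact prediction, about 1.05 here, follows from the coefficients above):

# the straight-line fit predicts impossible mortality at 20 minutes
predict(lm.percent, newdata = data.frame(exposure.min = 20))
## ~1.05: more than 100% mortality, which is impossible

A generalized linear model (GLM) with a binomial family fixes this by modeling the log-odds of mortality, which keeps fitted probabilities between 0 and 1.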
glm1 <- glm(percent.died ~ exposure.min,
            data = df1.n20,
            # model family
            family = "binomial",
            # number of trials behind each proportion
            weights = total.fish)
Use the summary() command just as for regular regression
summary(glm1)
##
## Call:
## glm(formula = percent.died ~ exposure.min, family = "binomial",
## data = df1.n20, weights = total.fish)
##
## Deviance Residuals:
## 1 2 3 4
## 0.01735 0.30995 -0.62850 0.45772
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5280 0.5613 -2.722 0.006479 **
## exposure.min 0.2241 0.0579 3.870 0.000109 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21.04467 on 3 degrees of freedom
## Residual deviance: 0.70089 on 2 degrees of freedom
## AIC: 16.691
##
## Number of Fisher Scoring iterations: 4
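The coefficients are on the logit (log-odds) scale, so they are easiest to read after back-transforming. A minimal sketch, using the coefficients from the output above:

# each extra minute of exposure multiplies the odds of death
# by exp(0.2241), i.e. about 1.25
exp(coef(glm1)["exposure.min"])
# predicted mortality at 10 minutes: back-transform with the
# inverse logit (plogis() in base R); about 0.67 here
plogis(coef(glm1)["(Intercept)"] + coef(glm1)["exposure.min"]*10)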
Get predictions from our GLM
#Predictions from GLM
newdat <- data.frame(exposure.min = 3:20)
se.outlink <- predict(glm1,
se.fit = T,
type = "link",
newdata = newdat)
library(arm)
# back-transform fit +/- 1.96*SE from the logit (link) scale to probabilities
glm.cis <- data.frame(predictions = invlogit(se.outlink$fit),
                      lwr = invlogit(se.outlink$fit - 1.96*se.outlink$se.fit),
                      upr = invlogit(se.outlink$fit + 1.96*se.outlink$se.fit))
glm.cis <- cbind(glm.cis,newdat)
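Note that the interval is built on the link (logit) scale and only then back-transformed, which keeps both bounds inside (0, 1). If the arm package is not available, base R's plogis() is the same inverse-logit function; a sketch (lwr.alt and upr.alt are hypothetical names):

# plogis() is base R's inverse logit, equivalent to arm::invlogit()
lwr.alt <- plogis(se.outlink$fit - 1.96*se.outlink$se.fit)
upr.alt <- plogis(se.outlink$fit + 1.96*se.outlink$se.fit)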
plot(percent.died ~ exposure.min, data = df1.n20,
pch = 18, cex = 2,
xlim = c(3, 19),
ylim = c(0,1.125))
points(predictions ~ exposure.min,
type = "l", data = glm.cis, col = 2, lwd = 2)
plot(percent.died ~ exposure.min, data = df1.n20,
pch = 18, cex = 2,
xlim = c(3, 19),
ylim = c(0,1.125))
points(predictions ~ exposure.min,
type = "l", data = glm.cis, col = 2, lwd = 2)
points(upr ~ exposure.min,
type = "l", data = glm.cis, col = 2)
points(lwr ~ exposure.min,
type = "l", data = glm.cis, col = 2)
plot(percent.died ~ exposure.min, data = df1.n20,
pch = 18, cex = 2,
xlim = c(3, 19),
ylim = c(0,1.125))
points(predictions ~ exposure.min,
type = "l", data = glm.cis, col = 2, lwd = 2)
points(upr ~ exposure.min,
type = "l", data = glm.cis, col = 2)
points(lwr ~ exposure.min,
type = "l", data = glm.cis, col = 2)
abline(h = 1, col = 2, lwd = 2, lty = 2)
Say we did this experiment at just the longest exposure time (20 minutes): what is the relationship between size and mortality?
Make up some data
size <- seq(10,100,length.out = 50)
died.YN <- c(1,1,1,1,1,1,1,1,1,1,1,1,
1,0,1,1,1,0,0,1,0,1,0,1,
1,1,0,0,0,0,1,0,0,0,1,0,
1,0,0,1,0,0,0,0,0,1,0,0,
0,0)
df3 <- data.frame(died.YN = died.YN,
size = size)
plot(died.YN ~ size, data = df3)
glm2 <- glm(died.YN ~ size, data = df3,
            # model family
            family = "binomial")
summary(glm2)
##
## Call:
## glm(formula = died.YN ~ size, family = "binomial", data = df3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8127 -0.7531 0.3369 0.6730 2.1434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.51383 1.01460 3.463 0.000534 ***
## size -0.06157 0.01685 -3.655 0.000257 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 69.235 on 49 degrees of freedom
## Residual deviance: 48.266 on 48 degrees of freedom
## AIC: 52.266
##
## Number of Fisher Scoring iterations: 4
predictions <- predict(glm2,type = "response")
plot(died.YN ~ size, data = df3)
lines(predictions ~ df3$size, type = "l", col = 2, lwd = 2)
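A handy summary of a fit like this is the predictor value at which predicted mortality is 50% (the “LD50” in toxicology). On the logit scale this is where b0 + b1*size = 0, so size = -b0/b1; a quick sketch with this fit:

# size at which predicted mortality crosses 50%
b <- coef(glm2)
unname(-b[1]/b[2]) # about 57, given the coefficients above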
Comments:
In each case a denominator (the number of trials) can be defined and the number of events counted. It can be tempting to just use the counts, but including the number of trials is important, as the sketch below illustrates.
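To see why the trials matter, compare the n = 20 fit above to the same model run on the original n = 40 data: the proportions are nearly the same, but the standard errors shrink by roughly a factor of sqrt(2). A sketch (glm.n40 is a hypothetical name):

# same proportions, twice the trials per group
df1.n40$percent.died <- df1.n40$died.YN / df1.n40$total.fish
glm.n40 <- glm(percent.died ~ exposure.min, data = df1.n40,
               family = "binomial", weights = total.fish)
summary(glm.n40)$coefficients # SEs roughly 1/sqrt(2) of the n = 20 fit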
A major assumption of the models used to analyze binary data is that every trial, and every event that happens, is independent. Ideally, this means each trial takes place in a separate, isolated location, and what happens in one trial does not affect the other trials. So, if you are looking at seed germination, ideally you would germinate each seed in a separate pot; if you germinate all of your seeds in a single pot, they are not independent. If you are assessing the toxicity of a pesticide on an insect, each insect would be exposed to the chemical in a separate trial; if all of your insects are housed in a single container and are all sprayed at the same time, they are not independent.
Several people in class have this kind of data. In some cases their trials are independent; in others, they are not.