M. Drew LaMar
April 20, 2016
“Statistics may be defined as 'a body of methods for making wise decisions in the face of uncertainty.'”
- W.A. Wallis
Definition:
Logistic regression is nonlinear regression developed for a categorical response variable with two levels.
Caption: Mortality of guppies in relation to duration of exposure to a temperature of 5\( ^{\circ} \) C. (Estimating probability of an event)
Properties:
The following curve is fit in logistic regression:
\[ \textrm{log-odds}(Y) = \alpha + \beta X, \]
where
\[ \textrm{log-odds}(Y) = \ln\left[\frac{Y}{1-Y}\right] \]
and \( \alpha, \beta \) are population parameters.
Note: \( \textrm{Log-odds}(Y) \) is called the logit link function.
The estimates \( a \) and \( b \) for \( \alpha \) and \( \beta \), respectively, describe the linear relationship between \( X \) and the predicted log odds of \( Y \), i.e.
\[ \textrm{log-odds}(\hat{Y}) = a + b X. \]
To obtain the original predicted values, we need to solve for \( \hat{Y} \), obtaining
\[ \hat{Y} = \frac{e^{a+bX}}{1+e^{a+bX}} = \frac{1}{1 + e^{-(a + bX)}}. \]
This is a sigmoidal (S-shaped) function!! When \( b > 0 \), function is increasing. When \( b < 0 \), function is decreasing. When \( b = 0 \), function is constant.
The threat of bioterrorism makes it necessary to quantify the risk of exposure to infectious agents such as anthrax (Bacillus anthracis). Hans (2002) measured the mortality of rhesus monkeys in an exposure chamber to aerosolized anthrax spores of varying concentrations.
Look at data
anthrax <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17q35AnthraxMortality.csv")
str(anthrax)
'data.frame': 9 obs. of 3 variables:
$ anthraxConcentration: int 29300 32100 45300 57300 64800 67000 100000 125000 166000
$ survived : int 7 4 3 2 3 5 0 1 0
$ died : int 1 4 5 6 5 3 8 7 8
Discuss: What format is this data in, tidy or messy?
From wide table to long
# Create survived data frame
survived <- with(anthrax, data.frame(anthraxConcentration = rep(anthraxConcentration, survived), mortality = rep(1, sum(survived))))
# Create died data frame
died <- with(anthrax, data.frame(anthraxConcentration = rep(anthraxConcentration, died), mortality = rep(0, sum(died))))
# Bind them together
anthrax_long <- rbind(died, survived)
From wide table to long
str(anthrax_long)
'data.frame': 72 obs. of 2 variables:
$ anthraxConcentration: int 29300 32100 32100 32100 32100 45300 45300 45300 45300 45300 ...
$ mortality : num 0 0 0 0 0 0 0 0 0 0 ...
Show data
plot(jitter(mortality, factor = 0.1) ~ jitter(anthraxConcentration, factor = 6), data = anthrax_long, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Anthrax concentration (spores/l)", ylab="Mortality")
Show data
Generalized linear models
anthraxGlm <- glm(mortality ~ anthraxConcentration, data = anthrax_long, family = binomial(link = logit))
glm
stands for generalized linear model.
Notice the family = binomial(link = logit)
argument.
Add regression curve
# Plot data
plot(jitter(mortality, factor = 0.1) ~ jitter(anthraxConcentration, factor = 6), data = anthrax_long, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Anthrax concentration (spores/l)", ylab="Mortality")
# Plot fit
xpts <- seq(min(anthrax_long$anthraxConcentration), max(anthrax_long$anthraxConcentration), length.out = 100)
ypts <- predict(anthraxGlm, newdata = data.frame(anthraxConcentration = xpts), type = "response")
lines(xpts, ypts)
Add regression curve
Table of parameter estimates
summary(anthraxGlm)
Table of parameter estimates
Call:
glm(formula = mortality ~ anthraxConcentration, family = binomial(link = logit),
data = anthrax_long)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4750 -0.9292 -0.3420 0.9161 2.3951
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.745e+00 6.921e-01 2.521 0.01171 *
anthraxConcentration -3.643e-05 1.119e-05 -3.255 0.00113 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 92.982 on 71 degrees of freedom
Residual deviance: 73.962 on 70 degrees of freedom
AIC: 77.962
Number of Fisher Scoring iterations: 5
Note: \( P \)-values are not to be trusted!!! These report Wald statistics, which we know are inaccurate.
Better to create an analysis of deviance table.
anova(anthraxGlm, test = "Chi")
This will test the null hypothesis that anthrax concentration has no effect on mortality (i.e. \( \beta_{0} = 0 \)). See the book for details.
Analysis of Deviance Table
Model: binomial, link: logit
Response: mortality
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 71 92.982
anthraxConcentration 1 19.02 70 73.962 1.293e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Definition: a
mathematical model is a mathematical representation of the relationship between a response variable \( Y \) and one or more explanatory variables.
Definition: A
general linear model is a statistical linear model that generalizes regression to multiple explanatory variables, some possibly categorical.
For a general linear model, we have:
Multiple response variables = multivariate. We will only explore 1 response variable.
Notationally, if we have two explanatory variables \( X_{1} \) and \( X_{2} \) and one response variable \( Y \), a general linear model is of the form
\[ Y = \alpha + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{12}X_{1}X_{2} + \epsilon, \]
where \( \epsilon \sim N(0,\sigma^{2}) \).
This can also be written without referring to slopes as
\[ \begin{align} \textrm{RESPONSE} = & \textrm{CONSTANT} + \textrm{VARIABLE1} + \\ & \textrm{VARIABLE2} + \\ & \textrm{VARIABLE2*VARIABLE2} \end{align} \]
Linear model | Other name | Example study design |
---|---|---|
\( Y = \mu + X \) | Linear regression | Dose-response |
\( Y = \mu + A \) | One-way ANOVA | Completely randomized |
\( Y = \mu + A + b \) | Two-way ANOVA, no replication |
Randomized block |
Key:
Linear model | Other name | Example study design |
---|---|---|
\( Y = \mu + A + B + A*B \) | Two-way, fixed- effects ANOVA |
Factorial experiment |
\( Y = \mu + A + b + A*b \) | Two-way, mixed- effects ANOVA |
Factorial experiment |
\( Y = \mu + X + A \) | Analysis of covariance (ANCOVA) |
Observational study |
\( Y = \mu + X_{1} + X_{2} + X_{1}*X_{2} \) | Multiple regression | Dose-response |