Multiple explanatory variables

M. Drew LaMar
April 20, 2016

“Statistics may be defined as 'a body of methods for making wise decisions in the face of uncertainty.'”

- W.A. Wallis

Class announcements

All HWs for Friday exam have solutions on Blackboard
All HWs (up through HW #9) have been graded and “returned”
Final Homework #10 is on Blackboard. ~~Graded by completion only.~~ Due on Friday, April 29, 11:59 pm.
Lab today and tomorrow is for you to ask individual questions, work on the HW for next week, and/or discuss individual/group projects.
Reminder: Exam #3 is ~~this Friday.~~ Bring a calculator.

Final example: Logistic regression

Definition: Logistic regression is nonlinear regression developed for a categorical response variable with two levels.

Caption: Mortality of guppies in relation to duration of exposure to a temperature of 5\( ^{\circ} \) C. (Estimating probability of an event)

Logistic regression

Properties:

Outcomes at every \( X \) have a binomial distribution (not a normal distribution).
Probability of an event is given by corresponding predicted value on the regression curve.
To correct for differences in variance of residuals at different \( X \) values, residuals are weighted by estimated variance obtained from binomial distribution.

Logistic regression

The following curve is fit in logistic regression:

\[ \textrm{log-odds}(Y) = \alpha + \beta X, \]

where

\[ \textrm{log-odds}(Y) = \ln\left[\frac{Y}{1-Y}\right] \]

and \( \alpha, \beta \) are population parameters.

Note: \( \textrm{Log-odds}(Y) \) is called the logit link function.

Logistic regression

The estimates \( a \) and \( b \) for \( \alpha \) and \( \beta \), respectively, describe the linear relationship between \( X \) and the predicted log odds of \( Y \), i.e.

\[ \textrm{log-odds}(\hat{Y}) = a + b X. \]

To obtain the original predicted values, we need to solve for \( \hat{Y} \), obtaining

\[ \hat{Y} = \frac{e^{a+bX}}{1+e^{a+bX}} = \frac{1}{1 + e^{-(a + bX)}}. \]

This is a sigmoidal (S-shaped) function!! When \( b > 0 \), function is increasing. When \( b < 0 \), function is decreasing. When \( b = 0 \), function is constant.

Example: Logistic regression

The threat of bioterrorism makes it necessary to quantify the risk of exposure to infectious agents such as anthrax (Bacillus anthracis). Hans (2002) measured the mortality of rhesus monkeys in an exposure chamber to aerosolized anthrax spores of varying concentrations.

Example: Logistic regression

Look at data

anthrax <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17q35AnthraxMortality.csv")
str(anthrax)

'data.frame':   9 obs. of  3 variables:
 $ anthraxConcentration: int  29300 32100 45300 57300 64800 67000 100000 125000 166000
 $ survived            : int  7 4 3 2 3 5 0 1 0
 $ died                : int  1 4 5 6 5 3 8 7 8

Discuss: What format is this data in, tidy or messy?

Example: Logistic regression

From wide table to long

# Create survived data frame
survived <- with(anthrax, data.frame(anthraxConcentration = rep(anthraxConcentration, survived), mortality = rep(1, sum(survived))))

# Create died data frame
died <- with(anthrax, data.frame(anthraxConcentration = rep(anthraxConcentration, died), mortality = rep(0, sum(died))))

# Bind them together
anthrax_long <- rbind(died, survived)

Example: Logistic regression

From wide table to long

str(anthrax_long)

'data.frame':   72 obs. of  2 variables:
 $ anthraxConcentration: int  29300 32100 32100 32100 32100 45300 45300 45300 45300 45300 ...
 $ mortality           : num  0 0 0 0 0 0 0 0 0 0 ...

Example: Logistic regression

Show data

plot(jitter(mortality, factor = 0.1) ~ jitter(anthraxConcentration, factor = 6), data = anthrax_long, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Anthrax concentration (spores/l)", ylab="Mortality")

Example: Logistic regression

Show data

plot of chunk unnamed-chunk-5

Example: Logistic regression

Generalized linear models

anthraxGlm <- glm(mortality ~ anthraxConcentration, data = anthrax_long, family = binomial(link = logit))

glm stands for generalized linear model.

Notice the family = binomial(link = logit) argument.

Example: Logistic regression

Add regression curve

# Plot data
plot(jitter(mortality, factor = 0.1) ~ jitter(anthraxConcentration, factor = 6), data = anthrax_long, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Anthrax concentration (spores/l)", ylab="Mortality")

# Plot fit
xpts <- seq(min(anthrax_long$anthraxConcentration), max(anthrax_long$anthraxConcentration), length.out = 100)
ypts <- predict(anthraxGlm, newdata = data.frame(anthraxConcentration = xpts), type = "response")
lines(xpts, ypts)

Example: Logistic regression

Add regression curve

plot of chunk unnamed-chunk-8

Example: Logistic regression

Table of parameter estimates

summary(anthraxGlm)

Example: Logistic regression

Table of parameter estimates


Call:
glm(formula = mortality ~ anthraxConcentration, family = binomial(link = logit), 
    data = anthrax_long)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4750  -0.9292  -0.3420   0.9161   2.3951  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)   
(Intercept)           1.745e+00  6.921e-01   2.521  0.01171 * 
anthraxConcentration -3.643e-05  1.119e-05  -3.255  0.00113 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 92.982  on 71  degrees of freedom
Residual deviance: 73.962  on 70  degrees of freedom
AIC: 77.962

Number of Fisher Scoring iterations: 5

Example: Logistic regression

Note: \( P \)-values are not to be trusted!!! These report Wald statistics, which we know are inaccurate.

Better to create an analysis of deviance table.

anova(anthraxGlm, test = "Chi")

This will test the null hypothesis that anthrax concentration has no effect on mortality (i.e. \( \beta_{0} = 0 \)). See the book for details.

Example: Logistic regression

Analysis of Deviance Table

Model: binomial, link: logit

Response: mortality

Terms added sequentially (first to last)

                     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                    71     92.982              
anthraxConcentration  1    19.02        70     73.962 1.293e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is a model?

Definition: a mathematical model is a mathematical representation of the relationship between a response variable \( Y \) and one or more explanatory variables.

Definition: A general linear model is a statistical linear model that generalizes regression to multiple explanatory variables, some possibly categorical.

For a general linear model, we have:

Explanatory: Multiple; Numerical or categorical
Response: Multiple; Numerical

Multiple response variables = multivariate. We will only explore 1 response variable.

General linear models

Notationally, if we have two explanatory variables \( X_{1} \) and \( X_{2} \) and one response variable \( Y \), a general linear model is of the form

\[ Y = \alpha + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{12}X_{1}X_{2} + \epsilon, \]

where \( \epsilon \sim N(0,\sigma^{2}) \).

This can also be written without referring to slopes as

\[ \begin{align} \textrm{RESPONSE} = & \textrm{CONSTANT} + \textrm{VARIABLE1} + \\ & \textrm{VARIABLE2} + \\ & \textrm{VARIABLE2*VARIABLE2} \end{align} \]

General linear models: Types

Linear model	Other name	Example study design
\( Y = \mu + X \)	Linear regression	Dose-response
\( Y = \mu + A \)	One-way ANOVA	Completely randomized
\( Y = \mu + A + b \)	Two-way ANOVA, no replication	Randomized block

Key:

\( Y \): [Response] Numerical
\( X \): [Explanatory] Numerical explanatory
\( A,B \): [Explanatory] Categorical, fixed-effect
\( b \): [Explanatory] Categorical, random-effect
\( \mu \): Constant (mean or intercept)

General linear models: Types (cont'd)

Linear model	Other name	Example study design
\( Y = \mu + A + B + A*B \)	Two-way, fixed- effects ANOVA	Factorial experiment
\( Y = \mu + A + b + A*b \)	Two-way, mixed- effects ANOVA	Factorial experiment
\( Y = \mu + X + A \)	Analysis of covariance (ANCOVA)	Observational study
\( Y = \mu + X_{1} + X_{2} + X_{1}*X_{2} \)	Multiple regression	Dose-response