M. Drew LaMar
April 12, 2017
There are two other types of “confidence intervals” of interest:
(biteAnova <- anova(biteRegression))
Analysis of Variance Table
Response: territory.area
Df Sum Sq Mean Sq F value Pr(>F)
bite 1 194.37 194.374 5.8012 0.03934 *
Residuals 9 301.55 33.506
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tests null hypothesis \( H_{0}: \beta = 0 \).
Definition: If the assumptions of linear regression are met, then the sampling distribution of \( b \) is a normal distribution having a mean equal to \( \beta \) and a standard error estimated from data as \( \mathrm{SE}_{b} \).
\[ H_{0}: \beta = \beta_{0}\\
H_{A}: \beta \neq \beta_{0} \]
\( F \)-test using ANOVA tests \( \beta_{0}=0 \) only, but a \( t \)-test can be used when \( \beta_{0}\neq 0 \)!!!
Test statistic: \[ t = \frac{b-\beta_{0}}{\mathrm{SE}_{b}} \]
Degree of freedom: \[ df = n-2 \]
You can get the results of a \( t \)-test on slope in R for \( \beta_{0}=0 \) by using the summary
command on a regression results from lm
:
summary(biteRegression)
Call:
lm(formula = territory.area ~ bite, data = biteData)
Residuals:
Min 1Q Median 3Q Max
-7.0472 -4.5101 -0.5504 3.6689 10.2237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.539 23.513 -1.341 0.2127
bite 11.677 4.848 2.409 0.0393 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared: 0.3919, Adjusted R-squared: 0.3244
F-statistic: 5.801 on 1 and 9 DF, p-value: 0.03934
For a \( t \)-test with \( H_{0}: \beta_{0} \neq 0 \), go manual.
First, get estimate.
biteRegression$coefficients
(Intercept) bite
-31.53929 11.67730
(b <- biteRegression$coefficients['bite'])
bite
11.6773
For a \( t \)-test with \( H_{0}: \beta_{0} \neq 0 \), go manual.
Second, get standard error.
(coef <- summary(biteRegression)$coefficients)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.53929 23.512646 -1.341375 0.2126627
bite 11.67730 4.848226 2.408571 0.0393411
(sderr <- coef['bite','Std. Error'])
[1] 4.848226
For a \( t \)-test with \( H_{0}: \beta_{0} \neq 0 \), go manual.
Third, compute test statistic.
\[ \begin{align} t & = \frac{b - \beta_{0}}{\mathrm{SE}_{b}} \\ & = \frac{11.6772959 - \beta_{0}}{4.8482259} \end{align} \]
Then continue with usual hypothesis test steps by computing \( P \)-value of test statistic with \( df = n-2 \).
We discussed using residual plots to check the second and third point above.
Outliers can also have large effects on linear fits.
Visualizing the data will help spot nonlinearities.
Transformations can make some nonlinear relationships linear.
Transformations can also correct for non-equal variances (just like in ANOVA).
In this example, the response variable for the untransformed data on the left was transformed to the right using a square-root transformation.
The problem with nonlinear regression? An infinite number of possible nonlinear functions to fit!
Book covers:
We will just discuss asymptotic curve fitting.
Consider the relationship between population growth rate of a phytoplankton in culture and the concentration of iron in the medium.
The right panel shows a fit with the Michaelis-Menten curve
\[ Y = \frac{aX}{b+X}, \]
with \( a \) and \( b \) parameters.
phytoplankton <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f8_1IronAndPhytoplanktonGrowth.csv"))
head(phytoplankton)
ironConcentration phytoGrowthRate
1 0.01096478 0.00
2 0.02398833 0.50
3 0.02818383 0.35
4 0.07244360 1.04
5 0.07585776 0.97
6 0.11481536 1.19
One can perform a nonlinear regression using nonlinear least squares. This can be accomplished in R as follows with the package nls
.
phytoCurve <- nls(phytoGrowthRate ~ a*ironConcentration / ( b+ironConcentration),
data = phytoplankton,
list(a = 1, b = 1))
The last argument is a list containing initial values for the parameters (initial guesses at the best fit).
To get information on the fit, we can use the summary
command as before in the linear case.
summary(phytoCurve)
Formula: phytoGrowthRate ~ a * ironConcentration/(b + ironConcentration)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1.94164 0.06802 28.545 1.14e-11 ***
b 0.07833 0.01120 6.994 2.29e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1133 on 11 degrees of freedom
Number of iterations to convergence: 6
Achieved convergence tolerance: 6.672e-06
Definition:
Logistic regression is nonlinear regression developed for a categorical response variable with two levels.
Caption: Mortality of guppies in relation to duration of exposure to a temperature of 5\( ^{\circ} \) C. (Estimating probability of an event)
Properties:
The following curve is fit in logistic regression:
\[ \textrm{log-odds}(Y) = \alpha + \beta X, \]
where
\[ \textrm{log-odds}(Y) = \ln\left[\frac{Y}{1-Y}\right] \]
and \( \alpha, \beta \) are population parameters.
Note: \( \textrm{Log-odds}(Y) \) is called the logit link function.
The estimates \( a \) and \( b \) for \( \alpha \) and \( \beta \), respectively, describe the linear relationship between \( X \) and the predicted log odds of \( Y \), i.e.
\[ \textrm{log-odds}(\hat{Y}) = a + b X. \]
To obtain the original predicted values, we need to solve for \( \hat{Y} \), obtaining
\[ \hat{Y} = \frac{e^{a+bX}}{1+e^{a+bX}} = \frac{1}{1 + e^{-(a + bX)}}. \]
This is a sigmoidal (S-shaped) function!! When \( b > 0 \), function is increasing. When \( b < 0 \), function is decreasing. When \( b = 0 \), function is constant.
The threat of bioterrorism makes it necessary to quantify the risk of exposure to infectious agents such as anthrax (Bacillus anthracis). Hans (2002) measured the mortality of rhesus monkeys in an exposure chamber to aerosolized anthrax spores of varying concentrations.
Look at data
anthrax <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17q35AnthraxMortality.csv")
str(anthrax)
'data.frame': 9 obs. of 3 variables:
$ anthraxConcentration: int 29300 32100 45300 57300 64800 67000 100000 125000 166000
$ survived : int 7 4 3 2 3 5 0 1 0
$ died : int 1 4 5 6 5 3 8 7 8
Discuss: What format is this data in, tidy or messy?
From wide table to long
# Create survived data frame
survived <- with(anthrax,
data.frame(anthraxConcentration = rep(anthraxConcentration, survived),
mortality = rep(1, sum(survived))))
# Create died data frame
died <- with(anthrax,
data.frame(anthraxConcentration = rep(anthraxConcentration, died),
mortality = rep(0, sum(died))))
# Bind them together
anthrax_long <- rbind(died, survived)
From wide table to long
str(anthrax_long)
'data.frame': 72 obs. of 2 variables:
$ anthraxConcentration: int 29300 32100 32100 32100 32100 45300 45300 45300 45300 45300 ...
$ mortality : num 0 0 0 0 0 0 0 0 0 0 ...
Show data
plot(jitter(mortality,
factor = 0.1) ~
jitter(anthraxConcentration,
factor = 6),
data = anthrax_long,
pch=16,
cex=1.5,
cex.lab=1.5,
col="firebrick",
xlab="Anthrax concentration (spores/l)",
ylab="Mortality")
Show data
Generalized linear models
anthraxGlm <- glm(mortality ~ anthraxConcentration,
data = anthrax_long,
family = binomial(link = logit))
glm
stands for generalized linear model.
Notice the family = binomial(link = logit)
argument.
Add regression curve
# Plot data
plot(jitter(mortality, factor = 0.1) ~ jitter(anthraxConcentration, factor = 6), data = anthrax_long, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Anthrax concentration (spores/l)", ylab="Mortality")
# Plot fit
xpts <- seq(min(anthrax_long$anthraxConcentration),
max(anthrax_long$anthraxConcentration),
length.out = 100)
ypts <- predict(anthraxGlm,
newdata = data.frame(anthraxConcentration = xpts),
type = "response")
lines(xpts, ypts)
Add regression curve
Table of parameter estimates
summary(anthraxGlm)
Table of parameter estimates
Call:
glm(formula = mortality ~ anthraxConcentration, family = binomial(link = logit),
data = anthrax_long)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4750 -0.9292 -0.3420 0.9161 2.3951
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.745e+00 6.921e-01 2.521 0.01171 *
anthraxConcentration -3.643e-05 1.119e-05 -3.255 0.00113 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 92.982 on 71 degrees of freedom
Residual deviance: 73.962 on 70 degrees of freedom
AIC: 77.962
Number of Fisher Scoring iterations: 5
Note: \( P \)-values are not to be trusted!!! These report Wald statistics, which we know are inaccurate.
Better to create an analysis of deviance table.
anova(anthraxGlm, test = "Chi")
This will test the null hypothesis that anthrax concentration has no effect on mortality (i.e. \( \beta_{0} = 0 \)). See the book for details.
Analysis of Deviance Table
Model: binomial, link: logit
Response: mortality
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 71 92.982
anthraxConcentration 1 19.02 70 73.962 1.293e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1