M. Drew LaMar
December 6, 2019
Definition: The
variance of the residuals is known as \( \mathrm{MS}_{\mathrm{residuals}} \) and is given by
\[ \mathrm{MS}_{\mathrm{residuals}} = \frac{\sum_{i}\left(Y_{i}-\hat{Y}_{i}\right)^2}{n-2}. \]
This is also known asresidual mean square .
\( \mathrm{MS}_{\mathrm{residuals}} \) is the average variance in the data around the regression line. This is completely analogous to the error mean square in ANOVA.
\( \mathrm{MS}_{\mathrm{residuals}} \) has \( n-2 \) degrees of freedom since we needed to estimate slope and intercept for the predicted values.
Use anova
command on biteRegression
anova(biteRegression)
Analysis of Variance Table
Response: territory.area
Df Sum Sq Mean Sq F value Pr(>F)
bite 1 194.37 194.374 5.8012 0.03934 *
Residuals 9 301.55 33.506
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual mean square is 33.506.
Can calculate residual mean square manually: \[ \mathrm{MS}_{\mathrm{residuals}} = \frac{\sum_{i}\left(Y_{i}-\hat{Y}_{i}\right)^2}{n-2}. \]
n <- nrow(biteData)
(MSresid <- ((n-1)/(n-2))*var(biteData$res))
[1] 33.50573
Same answer as ANOVA gave!!!
Two sources of variation:
In the ANOVA table, we can compute the \( F \)-ratio
\[ F = \frac{\mathrm{MS}_{\mathrm{regression}}}{\mathrm{MS}_{\mathrm{residual}}} \]
which is a test on the slope with \( H_{0}: \beta = 0 \).
Analysis of Variance Table
Response: territory.area
Df Sum Sq Mean Sq F value Pr(>F)
bite 1 194.37 194.374 5.8012 0.03934 *
Residuals 9 301.55 33.506
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\( P \)-value is less than 0.05, so we can reject the null hypothesis that the slope \( \beta = 0 \).
We can measure how well the line “fits” the data by estimating the \( R^{2} \) value using ANOVA, i.e.
\[ R^{2} = \frac{\mathrm{SS}_{\mathrm{regression}}}{\mathrm{SS}_{\mathrm{total}}}. \]
This also can be said to measure the fraction of variation in \( Y \) that is “explained” by \( X \).
Basic idea is:
For the lizard example,
biteRegSummary <- summary(biteRegression)
biteRegSummary$r.squared
[1] 0.3919418
Thus, 39% of the variation in territory area is explained by bite force.
# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))
# Scatter plot
plot(territory.area ~ bite, pch=16, col="firebrick", cex=1.5, cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)), data=biteData)
# Add in the best-fit line
abline(biteRegression, lwd=3)
# Text
text(5, 30.5, expression(R^{2} ~ "=" ~ 0.39), cex=1.5)
text(5.14, 31.7, expression(y ~ "=" -31.54 + 11.68), cex=1.5)
summary(biteRegression)
Call:
lm(formula = territory.area ~ bite, data = biteData)
Residuals:
Min 1Q Median 3Q Max
-7.0472 -4.5101 -0.5504 3.6689 10.2237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.539 23.513 -1.341 0.2127
bite 11.677 4.848 2.409 0.0393 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared: 0.3919, Adjusted R-squared: 0.3244
F-statistic: 5.801 on 1 and 9 DF, p-value: 0.03934
Definition: If the assumptions of linear regression are met, then the sampling distribution of \( b \) is a normal distribution having a mean equal to \( \beta \) and a standard error estimated from data as
\[ \mathrm{SE}_{b} = \sqrt{\frac{\mathrm{MS}_{\mathrm{residual}}}{\sum_{i}(X_{i}-\bar{X})^2}}. \]
Definition: A confidence interval for the parameter \( \beta \) is given by
\[ b - t_{\alpha(2),df}\mathrm{SE}_{b} < \beta < b + t_{\alpha(2),df}\mathrm{SE}_{b}, \]
with \( df = n-2 \).
There are two other types of “confidence intervals” of interest:
(biteAnova <- anova(biteRegression))
Analysis of Variance Table
Response: territory.area
Df Sum Sq Mean Sq F value Pr(>F)
bite 1 194.37 194.374 5.8012 0.03934 *
Residuals 9 301.55 33.506
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tests null hypothesis \( H_{0}: \beta = 0 \).
Definition: If the assumptions of linear regression are met, then the sampling distribution of \( b \) is a normal distribution having a mean equal to \( \beta \) and a standard error estimated from data as \( \mathrm{SE}_{b} \).
\[ H_{0}: \beta = \beta_{0}\\
H_{A}: \beta \neq \beta_{0} \]
\( F \)-test using ANOVA tests \( \beta_{0}=0 \) only, but a \( t \)-test can be used when \( \beta_{0}\neq 0 \)!!!
Test statistic: \[ t = \frac{b-\beta_{0}}{\mathrm{SE}_{b}} \]
Degree of freedom: \[ df = n-2 \]
You can get the results of a \( t \)-test on slope in R for \( \beta_{0}=0 \) by using the summary
command on a regression results from lm
:
summary(biteRegression)
Call:
lm(formula = territory.area ~ bite, data = biteData)
Residuals:
Min 1Q Median 3Q Max
-7.0472 -4.5101 -0.5504 3.6689 10.2237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.539 23.513 -1.341 0.2127
bite 11.677 4.848 2.409 0.0393 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared: 0.3919, Adjusted R-squared: 0.3244
F-statistic: 5.801 on 1 and 9 DF, p-value: 0.03934
For a \( t \)-test with \( H_{0}: \beta_{0} \neq 0 \), go manual.
First, get estimate.
biteRegression$coefficients
(Intercept) bite
-31.53929 11.67730
(b <- biteRegression$coefficients['bite'])
bite
11.6773
For a \( t \)-test with \( H_{0}: \beta_{0} \neq 0 \), go manual.
Second, get standard error.
(coef <- summary(biteRegression)$coefficients)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.53929 23.512646 -1.341375 0.2126627
bite 11.67730 4.848226 2.408571 0.0393411
(sderr <- coef['bite','Std. Error'])
[1] 4.848226
For a \( t \)-test with \( H_{0}: \beta_{0} \neq 0 \), go manual.
Third, compute test statistic.
\[ \begin{align} t & = \frac{b - \beta_{0}}{\mathrm{SE}_{b}} \\ & = \frac{11.6772959 - \beta_{0}}{4.8482259} \end{align} \]
Then continue with usual hypothesis test steps by computing \( P \)-value of test statistic with \( df = n-2 \).
We discussed using residual plots to check the second and third point above.
Outliers can also have large effects on linear fits.
Visualizing the data will help spot nonlinearities.
Transformations can make some nonlinear relationships linear.
Transformations can also correct for non-equal variances (just like in ANOVA).
In this example, the response variable for the untransformed data on the left was transformed to the right using a square-root transformation.
The problem with nonlinear regression? An infinite number of possible nonlinear functions to fit!
Book covers:
We will just discuss asymptotic curve fitting.
Consider the relationship between population growth rate of a phytoplankton in culture and the concentration of iron in the medium.
The right panel shows a fit with the Michaelis-Menten curve
\[ Y = \frac{aX}{b+X}, \]
with \( a \) and \( b \) parameters.
phytoplankton <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f8_1IronAndPhytoplanktonGrowth.csv"))
head(phytoplankton)
ironConcentration phytoGrowthRate
1 0.01096478 0.00
2 0.02398833 0.50
3 0.02818383 0.35
4 0.07244360 1.04
5 0.07585776 0.97
6 0.11481536 1.19
One can perform a nonlinear regression using nonlinear least squares. This can be accomplished in R as follows with the package nls
.
phytoCurve <- nls(phytoGrowthRate ~ a*ironConcentration / ( b+ironConcentration),
data = phytoplankton,
list(a = 1, b = 1))
The last argument is a list containing initial values for the parameters (initial guesses at the best fit).
To get information on the fit, we can use the summary
command as before in the linear case.
summary(phytoCurve)
Formula: phytoGrowthRate ~ a * ironConcentration/(b + ironConcentration)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1.94164 0.06802 28.545 1.14e-11 ***
b 0.07833 0.01120 6.994 2.29e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1133 on 11 degrees of freedom
Number of iterations to convergence: 6
Achieved convergence tolerance: 6.672e-06
Definition:
Logistic regression is nonlinear regression developed for a categorical response variable with two levels.
Caption: Mortality of guppies in relation to duration of exposure to a temperature of 5\( ^{\circ} \) C. (Estimating probability of an event)
Properties:
The following curve is fit in logistic regression:
\[ \textrm{log-odds}(Y) = \alpha + \beta X, \]
where
\[ \textrm{log-odds}(Y) = \ln\left[\frac{Y}{1-Y}\right] \]
and \( \alpha, \beta \) are population parameters.
Note: \( \textrm{Log-odds}(Y) \) is called the logit link function.
The estimates \( a \) and \( b \) for \( \alpha \) and \( \beta \), respectively, describe the linear relationship between \( X \) and the predicted log odds of \( Y \), i.e.
\[ \textrm{log-odds}(\hat{Y}) = a + b X. \]
To obtain the original predicted values, we need to solve for \( \hat{Y} \), obtaining
\[ \hat{Y} = \frac{e^{a+bX}}{1+e^{a+bX}} = \frac{1}{1 + e^{-(a + bX)}}. \]
This is a sigmoidal (S-shaped) function!! When \( b > 0 \), function is increasing. When \( b < 0 \), function is decreasing. When \( b = 0 \), function is constant.
The threat of bioterrorism makes it necessary to quantify the risk of exposure to infectious agents such as anthrax (Bacillus anthracis). Hans (2002) measured the mortality of rhesus monkeys in an exposure chamber to aerosolized anthrax spores of varying concentrations.
Look at data
anthrax <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17q35AnthraxMortality.csv")
str(anthrax)
'data.frame': 9 obs. of 3 variables:
$ anthraxConcentration: int 29300 32100 45300 57300 64800 67000 100000 125000 166000
$ survived : int 7 4 3 2 3 5 0 1 0
$ died : int 1 4 5 6 5 3 8 7 8
Discuss: What format is this data in, tidy or messy?
From wide table to long
# Create survived data frame
survived <- with(anthrax,
data.frame(anthraxConcentration = rep(anthraxConcentration, survived),
mortality = rep(1, sum(survived))))
# Create died data frame
died <- with(anthrax,
data.frame(anthraxConcentration = rep(anthraxConcentration, died),
mortality = rep(0, sum(died))))
# Bind them together
anthrax_long <- rbind(died, survived)
From wide table to long
str(anthrax_long)
'data.frame': 72 obs. of 2 variables:
$ anthraxConcentration: int 29300 32100 32100 32100 32100 45300 45300 45300 45300 45300 ...
$ mortality : num 0 0 0 0 0 0 0 0 0 0 ...
Show data
plot(jitter(mortality,
factor = 0.1) ~
jitter(anthraxConcentration,
factor = 6),
data = anthrax_long,
pch=16,
cex=1.5,
cex.lab=1.5,
col="firebrick",
xlab="Anthrax concentration (spores/l)",
ylab="Mortality")
Show data
Generalized linear models
anthraxGlm <- glm(mortality ~ anthraxConcentration,
data = anthrax_long,
family = binomial(link = logit))
glm
stands for generalized linear model.
Notice the family = binomial(link = logit)
argument.
Add regression curve
# Plot data
plot(jitter(mortality, factor = 0.1) ~ jitter(anthraxConcentration, factor = 6), data = anthrax_long, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Anthrax concentration (spores/l)", ylab="Mortality")
# Plot fit
xpts <- seq(min(anthrax_long$anthraxConcentration),
max(anthrax_long$anthraxConcentration),
length.out = 100)
ypts <- predict(anthraxGlm,
newdata = data.frame(anthraxConcentration = xpts),
type = "response")
lines(xpts, ypts)
Add regression curve
Table of parameter estimates
summary(anthraxGlm)
Table of parameter estimates
Call:
glm(formula = mortality ~ anthraxConcentration, family = binomial(link = logit),
data = anthrax_long)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4750 -0.9292 -0.3420 0.9161 2.3951
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.745e+00 6.921e-01 2.521 0.01171 *
anthraxConcentration -3.643e-05 1.119e-05 -3.255 0.00113 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 92.982 on 71 degrees of freedom
Residual deviance: 73.962 on 70 degrees of freedom
AIC: 77.962
Number of Fisher Scoring iterations: 5
Note: \( P \)-values are not to be trusted!!! These report Wald statistics, which we know are inaccurate.
Better to create an analysis of deviance table.
anova(anthraxGlm, test = "Chi")
This will test the null hypothesis that anthrax concentration has no effect on mortality (i.e. \( \beta_{0} = 0 \)). See the book for details.
Analysis of Deviance Table
Model: binomial, link: logit
Response: mortality
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 71 92.982
anthraxConcentration 1 19.02 70 73.962 1.293e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1