Regression (Part 2)

M. Drew LaMar
November 9, 2020

Predicted values and residuals

Definition: The predicted value of \( Y \) (denoted \( \hat{Y} \), or \( \mu_{Y\, |\, X} \)) from a regression line estimates the mean value of \( Y \) for all individuals having a given value of \( X \).

Definition: Residuals measure the scatter of points above and below the least-squares regression line, and are denoted by

\[ r_{i} = Y_{i} - \hat{Y}_{i}, \]
where \( \hat{Y}_{i} = a + bX_{i} \).

Predicted values and residuals

Prediction values

We can predict what the mean value of \( Y \) is for values of the explanatory variable \( X \) not represented in our data, as long as we are within the range of values of the data.

The function predict accomplishes this, and even gives us a standard error for our estimate.

(pred_5.1 <- predict(biteRegression, data.frame(bite = 5.1), se.fit = TRUE))
$fit
       1 
28.01492 

$se.fit
[1] 2.163259

$df
[1] 9

$residual.scale
[1] 5.788413

Prediction values

plot of chunk unnamed-chunk-3

Prediction values - Extrapolation

Definition: Extrapolation is the prediction of the value of a response variable outside the range of \( X \)-values in the data.

Regression should not be used to predict the value of the response values for an \( X \)-value that lies well outside the range of the data.

Residual plot

Definition: a residual plot is a scatter plot of the residuals \( (Y_{i}-\hat{Y}_{i}) \) against the \( X_{i} \), the values of the explanatory variable.

plot of chunk unnamed-chunk-4

Residual plots

# Get residuals from regression output
biteData$res = resid(biteRegression)

# Plot residuals
plot(res ~ bite, data=biteData, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Bite force (N)", ylab="Residuals")

# Add a horizontal line at zero
abline(h=0, lty=2)

Residual plots to check assumptions

hist(biteData$res)

plot of chunk unnamed-chunk-6

qqnorm(biteData$res)

plot of chunk unnamed-chunk-7

Variance of the residuals

Definition: The variance of the residuals is known as \( \mathrm{MS}_{\mathrm{residuals}} \) and is given by

\[ \mathrm{MS}_{\mathrm{residuals}} = \frac{\sum_{i}\left(Y_{i}-\hat{Y}_{i}\right)^2}{n-2}. \]
This is also known as residual mean square.

\( \mathrm{MS}_{\mathrm{residuals}} \) is the average variance in the data around the regression line. This is completely analogous to the error mean square in ANOVA.

\( \mathrm{MS}_{\mathrm{residuals}} \) has \( n-2 \) degrees of freedom since we needed to estimate slope and intercept for the predicted values.

Linear regression vs ANOVA

Use anova command on biteRegression

anova(biteRegression)
Analysis of Variance Table

Response: territory.area
          Df Sum Sq Mean Sq F value  Pr(>F)  
bite       1 194.37 194.374  5.8012 0.03934 *
Residuals  9 301.55  33.506                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual mean square is 33.506.

Linear regression vs ANOVA

Can calculate residual mean square manually: \[ \mathrm{MS}_{\mathrm{residuals}} = \frac{\sum_{i}\left(Y_{i}-\hat{Y}_{i}\right)^2}{n-2}. \]

n <- nrow(biteData)
(MSresid <- ((n-1)/(n-2))*var(biteData$res))
[1] 33.50573

Same answer as ANOVA gave!!!

Linear regression vs ANOVA

Two sources of variation:

  • Residual part: Deviation between \( Y_{i} \) and predicted values \( \hat{Y}_{i} \) (analogous to error component in ANOVA).
  • Regression part: Deviation between predicted values \( \hat{Y}_{i} \) and grand mean \( \bar{Y} \) (analogous to groups component in ANOVA).

In the ANOVA table, we can compute the \( F \)-ratio

\[ F = \frac{\mathrm{MS}_{\mathrm{regression}}}{\mathrm{MS}_{\mathrm{residual}}} \]

which is a test on the slope with \( H_{0}: \beta = 0 \).

Linear regression vs ANOVA

Analysis of Variance Table

Response: territory.area
          Df Sum Sq Mean Sq F value  Pr(>F)  
bite       1 194.37 194.374  5.8012 0.03934 *
Residuals  9 301.55  33.506                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\( P \)-value is less than 0.05, so we can reject the null hypothesis that the slope \( \beta = 0 \).

Variation explained by explanatory variable

We can measure how well the line “fits” the data by estimating the \( R^{2} \) value using ANOVA, i.e.

\[ R^{2} = \frac{\mathrm{SS}_{\mathrm{regression}}}{\mathrm{SS}_{\mathrm{total}}}. \]

This also can be said to measure the fraction of variation in \( Y \) that is “explained” by \( X \).

Variation explained by explanatory variable

Basic idea is:

  • If \( R^{2} \) is close to 1, then \( X \) is explaining most of the variation in \( Y \), and any other variation which could be caused by other sources is negligible in comparison.
  • If \( R^{2} \) is close to 0, then \( X \) is explaining very little of the variation in \( Y \), and the remaining variation is caused by other sources not accounted for or measured in the system of study.

Variation explained by explanatory variable

For the lizard example,

biteRegSummary <- summary(biteRegression)
biteRegSummary$r.squared
[1] 0.3919418

Thus, 39% of the variation in territory area is explained by bite force.

Annotate plot

# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))

# Scatter plot
plot(territory.area ~ bite, pch=16, col="firebrick", cex=1.5, cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)), data=biteData)

# Add in the best-fit line
abline(biteRegression, lwd=3)

# Text
text(5, 30.5, expression(R^{2} ~ "=" ~ 0.39), cex=1.5)
text(5.14, 31.7, expression(y ~ "=" -31.54 + 11.68), cex=1.5)

Annotate plot

plot of chunk unnamed-chunk-14

Annotate plot (ggplot2)

ggplot(data = biteData,
       aes(x = bite, y = territory.area)) +
  geom_point(color = "firebrick") +
  geom_smooth(method = "lm", fill = NA) +
  annotate("text", x = 5, y = 30.5, 
           label = expression(R^{2} ~ "=" ~ 0.39),
           size = 5.0) +
  annotate("text", x = 5, y = 31.7, 
           label = expression(y ~ "=" -31.54 + 11.68),
           size = 5.0)

Annotate plot (ggplot2)

plot of chunk unnamed-chunk-15

Summary of a regression in R

summary(biteRegression)

Summary of a regression in R


Call:
lm(formula = territory.area ~ bite, data = biteData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0472 -4.5101 -0.5504  3.6689 10.2237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -31.539     23.513  -1.341   0.2127  
bite          11.677      4.848   2.409   0.0393 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared:  0.3919,    Adjusted R-squared:  0.3244 
F-statistic: 5.801 on 1 and 9 DF,  p-value: 0.03934

Estimating uncertainty in slope

Definition: If the assumptions of linear regression are met, then the sampling distribution of \( b \) is a normal distribution having a mean equal to \( \beta \) and a standard error estimated from data as

\[ \mathrm{SE}_{b} = \sqrt{\frac{\mathrm{MS}_{\mathrm{residual}}}{\sum_{i}(X_{i}-\bar{X})^2}}. \]

Estimating uncertainty in slope

Definition: A confidence interval for the parameter \( \beta \) is given by

\[ b - t_{\alpha(2),df}\mathrm{SE}_{b} < \beta < b + t_{\alpha(2),df}\mathrm{SE}_{b}, \]
with   \( df = n-2 \).

Confidence bands vs prediction intervals

There are two other types of “confidence intervals” of interest:

  • Confidence bands: Confidence intervals of mean \( Y \) values for fixed \( X \) (i.e. confidence intervals for specific points on regression line)
  • Prediction intervals: \( (1-\alpha)\% \) of the probability distribution of \( Y \) values for fixed \( X \).

Confidence bands

plot of chunk unnamed-chunk-18

95% confidence interval for \( \mu_{Y | X=\mu_{X}} \).

Confidence bands

plot of chunk unnamed-chunk-19

95% confidence interval for \( \mu_{Y | X=5.1} \).

Prediction intervals

plot of chunk unnamed-chunk-20

95% of the data lies between the prediction intervals.

Hypothesis tests about slope

(biteAnova <- anova(biteRegression))
Analysis of Variance Table

Response: territory.area
          Df Sum Sq Mean Sq F value  Pr(>F)  
bite       1 194.37 194.374  5.8012 0.03934 *
Residuals  9 301.55  33.506                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tests null hypothesis \( H_{0}: \beta = 0 \).

Hypothesis tests about slope

Definition: If the assumptions of linear regression are met, then the sampling distribution of \( b \) is a normal distribution having a mean equal to \( \beta \) and a standard error estimated from data as \( \mathrm{SE}_{b} \).


\[ H_{0}: \beta = \beta_{0}\\ H_{A}: \beta \neq \beta_{0} \]

\( F \)-test using ANOVA tests \( \beta_{0}=0 \) only, but a \( t \)-test can be used when \( \beta_{0}\neq 0 \)!!!

Hypothesis tests about slope

Test statistic: \[ t = \frac{b-\beta_{0}}{\mathrm{SE}_{b}} \]

Degree of freedom: \[ df = n-2 \]

You can get the results of a \( t \)-test on slope in R for \( \beta_{0}=0 \) by using the summary command on a regression results from lm:

summary(biteRegression)

Hypothesis tests about slope


Call:
lm(formula = territory.area ~ bite, data = biteData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0472 -4.5101 -0.5504  3.6689 10.2237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -31.539     23.513  -1.341   0.2127  
bite          11.677      4.848   2.409   0.0393 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared:  0.3919,    Adjusted R-squared:  0.3244 
F-statistic: 5.801 on 1 and 9 DF,  p-value: 0.03934

Hypothesis tests about slope

For a \( t \)-test with   \( H_{0}: \beta_{0} \neq 0 \), go manual.

First, get estimate.

biteRegression$coefficients
(Intercept)        bite 
  -31.53929    11.67730 
(b <- biteRegression$coefficients['bite'])
   bite 
11.6773 

Hypothesis tests about slope

For a \( t \)-test with   \( H_{0}: \beta_{0} \neq 0 \), go manual.

Second, get standard error.

(coef <- summary(biteRegression)$coefficients)
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept) -31.53929  23.512646 -1.341375 0.2126627
bite         11.67730   4.848226  2.408571 0.0393411
(sderr <- coef['bite','Std. Error'])
[1] 4.848226

Hypothesis tests about slope

For a \( t \)-test with   \( H_{0}: \beta_{0} \neq 0 \), go manual.

Third, compute test statistic.

\[ \begin{align} t & = \frac{b - \beta_{0}}{\mathrm{SE}_{b}} \\ & = \frac{11.6772959 - \beta_{0}}{4.8482259} \end{align} \]

Then continue with usual hypothesis test steps by computing \( P \)-value of test statistic with \( df = n-2 \).

Linear regression assumptions (revisited)

  • Linearity
  • \( Y \)-values are normally distributed
  • Variance of \( Y \)-values the same for all \( X \)
  • \( Y \)-values a random sample from sub-population with corresponding \( X \)-values.

We discussed using residual plots to check the second and third point above.

Linear regression assumptions (revisited)

Outliers can also have large effects on linear fits.

Linear regression assumptions (revisited)

Visualizing the data will help spot nonlinearities.

Data transformations

Transformations can make some nonlinear relationships linear.

Data transformations

Transformations can also correct for non-equal variances (just like in ANOVA).

In this example, the response variable for the untransformed data on the left was transformed to the right using a square-root transformation.

Measurement error

  • Measurement error in \( Y \) increases variance of residuals, but has no effect on the estimate of the slope (does effect precision of estimate, however).
  • Measurement error in \( X \) affects both residual error and estimate of slope (underestimates).

Nonlinear regression

The problem with nonlinear regression? An infinite number of possible nonlinear functions to fit!

Book covers:

  • Asymptotic curve fitting
  • Quadratic curve fitting
  • Formula-free curve fitting

We will just discuss asymptotic curve fitting.

Asymptotic curve fitting

Consider the relationship between population growth rate of a phytoplankton in culture and the concentration of iron in the medium.

  • Left panel is overfitting
  • Left panel fit will perform poorly for prediction
  • Left panel fit has poor biological motivation

Asymptotic curve fitting

The right panel shows a fit with the Michaelis-Menten curve

\[ Y = \frac{aX}{b+X}, \]
with \( a \) and \( b \) parameters.

Asymptotic curve fitting

phytoplankton <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f8_1IronAndPhytoplanktonGrowth.csv"))
head(phytoplankton)
  ironConcentration phytoGrowthRate
1        0.01096478            0.00
2        0.02398833            0.50
3        0.02818383            0.35
4        0.07244360            1.04
5        0.07585776            0.97
6        0.11481536            1.19

Asymptotic curve fitting

One can perform a nonlinear regression using nonlinear least squares. This can be accomplished in R as follows with the package nls.

phytoCurve <- nls(phytoGrowthRate ~ a*ironConcentration / ( b+ironConcentration), 
                  data = phytoplankton, 
                  list(a = 1, b = 1))

The last argument is a list containing initial values for the parameters (initial guesses at the best fit).

Asymptotic curve fitting

To get information on the fit, we can use the summary command as before in the linear case.

summary(phytoCurve)

Formula: phytoGrowthRate ~ a * ironConcentration/(b + ironConcentration)

Parameters:
  Estimate Std. Error t value Pr(>|t|)    
a  1.94164    0.06802  28.545 1.14e-11 ***
b  0.07833    0.01120   6.994 2.29e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1133 on 11 degrees of freedom

Number of iterations to convergence: 6 
Achieved convergence tolerance: 6.672e-06

Final example: Logistic regression

Definition: Logistic regression is nonlinear regression developed for a categorical response variable with two levels.

Caption: Mortality of guppies in relation to duration of exposure to a temperature of 5\( ^{\circ} \) C. (Estimating probability of an event)

Logistic regression

Properties:

  • Outcomes at every \( X \) have a binomial distribution (not a normal distribution).
  • Probability of an event is given by corresponding predicted value on the regression curve.
  • To correct for differences in variance of residuals at different \( X \) values, residuals are weighted by estimated variance obtained from binomial distribution.

Logistic regression

The following curve is fit in logistic regression:

\[ \textrm{log-odds}(Y) = \alpha + \beta X, \]

where

\[ \textrm{log-odds}(Y) = \ln\left[\frac{Y}{1-Y}\right] \]

and \( \alpha, \beta \) are population parameters.

Note: \( \textrm{Log-odds}(Y) \) is called the logit link function.

Logistic regression

The estimates \( a \) and \( b \) for \( \alpha \) and \( \beta \), respectively, describe the linear relationship between \( X \) and the predicted log odds of \( Y \), i.e.

\[ \textrm{log-odds}(\hat{Y}) = a + b X. \]

To obtain the original predicted values, we need to solve for \( \hat{Y} \), obtaining

\[ \hat{Y} = \frac{e^{a+bX}}{1+e^{a+bX}} = \frac{1}{1 + e^{-(a + bX)}}. \]

This is a sigmoidal (S-shaped) function!! When \( b > 0 \), function is increasing. When \( b < 0 \), function is decreasing. When \( b = 0 \), function is constant.

Example: Logistic regression

The threat of bioterrorism makes it necessary to quantify the risk of exposure to infectious agents such as anthrax (Bacillus anthracis). Hans (2002) measured the mortality of rhesus monkeys in an exposure chamber to aerosolized anthrax spores of varying concentrations.

Example: Logistic regression

Look at data

anthrax <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17q35AnthraxMortality.csv")
str(anthrax)
'data.frame':   9 obs. of  3 variables:
 $ anthraxConcentration: int  29300 32100 45300 57300 64800 67000 100000 125000 166000
 $ survived            : int  7 4 3 2 3 5 0 1 0
 $ died                : int  1 4 5 6 5 3 8 7 8

Discuss: What format is this data in, tidy or messy?

Example: Logistic regression

From wide table to long

# Create survived data frame
survived <- with(anthrax, 
                 data.frame(anthraxConcentration = rep(anthraxConcentration, survived), 
                            mortality = rep(1, sum(survived))))

# Create died data frame
died <- with(anthrax, 
             data.frame(anthraxConcentration = rep(anthraxConcentration, died), 
                        mortality = rep(0, sum(died))))

# Bind them together
anthrax_long <- rbind(died, survived)

Example: Logistic regression

From wide table to long

str(anthrax_long)
'data.frame':   72 obs. of  2 variables:
 $ anthraxConcentration: int  29300 32100 32100 32100 32100 45300 45300 45300 45300 45300 ...
 $ mortality           : num  0 0 0 0 0 0 0 0 0 0 ...

Example: Logistic regression

Show data

plot(jitter(mortality, 
            factor = 0.1) ~ 
     jitter(anthraxConcentration, 
            factor = 6), 
     data = anthrax_long, 
     pch=16, 
     cex=1.5, 
     cex.lab=1.5, 
     col="firebrick", 
     xlab="Anthrax concentration (spores/l)", 
     ylab="Mortality")

Example: Logistic regression

Show data

plot of chunk unnamed-chunk-32

Example: Logistic regression

Generalized linear models

anthraxGlm <- glm(mortality ~ anthraxConcentration, 
                  data = anthrax_long, 
                  family = binomial(link = logit))

glm stands for generalized linear model.

Notice the family = binomial(link = logit) argument.

Example: Logistic regression

Add regression curve

# Plot data
plot(jitter(mortality, factor = 0.1) ~ jitter(anthraxConcentration, factor = 6), data = anthrax_long, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Anthrax concentration (spores/l)", ylab="Mortality")

# Plot fit
xpts <- seq(min(anthrax_long$anthraxConcentration), 
            max(anthrax_long$anthraxConcentration), 
            length.out = 100)
ypts <- predict(anthraxGlm, 
                newdata = data.frame(anthraxConcentration = xpts), 
                type = "response")
lines(xpts, ypts)

Example: Logistic regression

Add regression curve

plot of chunk unnamed-chunk-35

Example: Logistic regression

Table of parameter estimates

summary(anthraxGlm)

Example: Logistic regression

Table of parameter estimates


Call:
glm(formula = mortality ~ anthraxConcentration, family = binomial(link = logit), 
    data = anthrax_long)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4750  -0.9292  -0.3420   0.9161   2.3951  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)   
(Intercept)           1.745e+00  6.921e-01   2.521  0.01171 * 
anthraxConcentration -3.643e-05  1.119e-05  -3.255  0.00113 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 92.982  on 71  degrees of freedom
Residual deviance: 73.962  on 70  degrees of freedom
AIC: 77.962

Number of Fisher Scoring iterations: 5

Example: Logistic regression

Note: \( P \)-values are not to be trusted!!! These report Wald statistics, which we know are inaccurate.

Better to create an analysis of deviance table.

anova(anthraxGlm, test = "Chi")

This will test the null hypothesis that anthrax concentration has no effect on mortality (i.e. \( \beta_{0} = 0 \)). See the book for details.

Example: Logistic regression

Analysis of Deviance Table

Model: binomial, link: logit

Response: mortality

Terms added sequentially (first to last)

                     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                    71     92.982              
anthraxConcentration  1    19.02        70     73.962 1.293e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1