Linear Regression

Alban Guillaumet, Troy University

Objectives

  • Introduction to Linear Regression

  • Other types of regression include non-linear and logistic regression -> another time!

Regression

Definition: Regression is the method used to predict values of one numerical variable (response) from values of another (explanatory).

Note: Regression differs from correlation, which measures the strength and direction of an association between two numerical variables without distinguishing explanatory from response variables.

Linear regression

Definition: Linear regression draws a straight line through the data to predict the response variable from the explanatory variable.

Linear regression

Modern humans emerged from Africa ~ 60 KYA, and our ancestors lost some genetic variation at each step as they spread to new lands.

Linear regression

The line fitted to the data is the regression line. It can be used to predict the genetic diversity of a local human population (the response variable), even for a locale not included in the study, based on its distance from East Africa (the explanatory variable).

Linear regression

The slope of the regression line indicates how much \( Y \) changes per unit change in \( X \). In this case, humans lose 0.076 units of genetic diversity for every 10,000 km of distance from East Africa.

Formula for the line

Definition: For the population, the regression line is

\[ Y = \alpha + \beta X, \]
where \( \alpha \) (the intercept) and \( \beta \) (the slope) are population parameters.

Definition: For a sample, the regression line is

\[ Y = a + b X, \]
where \( a \) and \( b \) are estimates of \( \alpha \) and \( \beta \), respectively.

Graph of the line

  • \( a \): intercept
  • \( b \): slope

Assumptions of linear regression

  • At each value of \( X \), there is a population of possible \( Y \)-values whose mean lies on the true regression line (the linearity assumption).

Assumptions of linear regression

  • At each value of \( X \), the \( Y \)-measurements represent a random sample from the population of possible \( Y \)-values.
  • At each value of \( X \), the distribution of possible \( Y \)-values is normal.
  • The variance of \( Y \)-values is the same at all values of \( X \).

Important!

Technically, the linear regression equation is

\[ \mu_{Y\, |\, X=X^{*}} = \alpha + \beta X^{*}, \]

where \( \mu_{Y\, |\, X=X^{*}} \) is the mean of \( Y \) in the sub-population with \( X=X^{*} \).

We are predicting the mean of \( Y \) given \( X \).

How do you find the "best fit" line?

Many straight lines can be drawn through a scatter of points, so how do we find the 'best' one?

How do you find the "best fit" line?

Method of least squares

Definition: The least-squares regression line is the line for which the sum of all the squared deviations in \( Y \) is smallest.

How do you find the "best fit" line?

The method of least-squares leads to the following estimates for intercept and slope:

\[ \begin{align} b & = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i}(X_{i}-\bar{X})^2} \\ a & = \bar{Y}-b\bar{X} \end{align} \]

Note:

\[ b = \frac{\mathrm{Cov}(X,Y)}{s_{X}^2} = r\frac{s_{Y}}{s_{X}}, \]

where \( r \) is the correlation coefficient!
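
A quick numerical check of this identity in R (a minimal sketch, assuming the biteData lizard data from the example below is loaded):

# Both expressions give the same slope
cov(biteData$bite, biteData$territory.area) / var(biteData$bite)
cor(biteData$bite, biteData$territory.area) * sd(biteData$territory.area) / sd(biteData$bite)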

Example: Biting lizards

Practice Problem #12

Male lizards in the species Crotaphytus collaris use their jaws as weapons during territorial interactions. Lappin and Husak (2005) tested whether weapon performance (bite force) predicted territory size in this species.

Example: Biting lizards

[Figure: scatter plot of territory area (m²) vs. bite force (N)]

Example: Biting lizards

Long way: FYI only

\[ b = \frac{\mathrm{Cov}(X,Y)}{s_{X}^2} \]

# Slope
(b <- cov(biteData$bite, biteData$territory.area)/var(biteData$bite))
[1] 11.6773

Example: Biting lizards

Long way: FYI only

\[ a = \bar{Y}-b\bar{X} \]

# Intercept
(a <- mean(biteData$territory.area) - b * mean(biteData$bite))
[1] -31.53929

How do you interpret the negative intercept?

Example: Biting lizards

Faster: Use lm

(biteRegression <- lm(territory.area ~ bite, data = biteData))

Call:
lm(formula = territory.area ~ bite, data = biteData)

Coefficients:
(Intercept)         bite  
     -31.54        11.68  

Remember: lm was used for ANOVA too!

Example: Biting lizards

Bonus! With lm, we can add the regression line to the plot.

# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))

# Scatter plot
plot(territory.area ~ bite, data = biteData,
     pch = 16, col = "forestgreen", cex = 2, bty = "n", cex.lab = 1.5,
     xlab = "Bite force (N)", ylab = expression("Territory area" ~ (m^2)))

# Add in the best-fit line
abline(biteRegression, col="forestgreen", lty = 2, lwd=3)

Example: Biting lizards

Bonus! With lm, we can add the regression line to the plot.

[Figure: scatter plot with the fitted least-squares regression line]

Predicted values and residuals

Definition: The predicted value of \( Y \) (denoted \( \hat{Y} \), or \( \mu_{Y\, |\, X} \)) from a regression line estimates the mean value of \( Y \) for all individuals having a given value of \( X \).

Definition: Residuals measure the scatter of points above and below the least-squares regression line, and are denoted by

\[ r_{i} = Y_{i} - \hat{Y}_{i}, \]
where \( \hat{Y}_{i} = a + bX_{i} \).
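
A quick check in R that these definitions match lm's output (a minimal sketch using the biteRegression fit from the example):

# Residuals by hand vs. R's resid() function
Yhat <- coef(biteRegression)[1] + coef(biteRegression)[2] * biteData$bite
resByHand <- biteData$territory.area - Yhat
all.equal(unname(resByHand), unname(resid(biteRegression)))  # TRUE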

Predicted values and residuals

Predicted values

We can predict what the mean value of \( Y \) is for values of the explanatory variable \( X \) not represented in our data, as long as we are within the range of values of the data.

Predicted values

The function predict.lm accomplishes this, and even gives us a standard error for our estimate.

(pred_5.1 <- predict.lm(biteRegression, data.frame(bite = 5.1), se.fit = TRUE))
$fit
       1 
28.01492 

$se.fit
[1] 2.163259

$df
[1] 9

$residual.scale
[1] 5.788413
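
The pieces returned above can be assembled into a confidence interval for the mean; a minimal sketch:

# 95% CI for the mean territory area at bite = 5.1:
# fit +/- t * se.fit, with the t multiplier on df = 9
pred_5.1$fit + c(-1, 1) * qt(0.975, df = pred_5.1$df) * pred_5.1$se.fit
# roughly 23.1 to 32.9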

Predicted value + standard error

[Figure: regression line with the predicted value at bite = 5.1 and its standard error]

Predicted values - Extrapolation

Definition: Extrapolation is the prediction of the value of a response variable outside the range of \( X \)-values in the data.

Regression should not be used to predict the value of the response variable for an \( X \)-value that lies well outside the range of the data.

Summary

Definition: Regression is the method used to predict values of one numerical variable (response) from values of another (explanatory).

Use the R function lm to perform a linear regression, i.e. estimate the intercept and slope of a regression line, and the function abline to add the regression line to a plot.

Use the R function predict.lm to predict what the mean value of \( Y \) is for values of the explanatory variable \( X \) not represented in our data, as long as we are within the range of values of the data.

Linear regression assumptions

  • Random samples
  • Linearity
  • Residuals are normally distributed (\( Y \)-values are normally distributed for all \( X \))
  • Equal variance of residuals (Variance of \( Y \)-values the same for all \( X \))

Residual plot

Definition: A residual plot is a scatter plot of the residuals \( (Y_{i}-\hat{Y}_{i}) \) against \( X_{i} \), the values of the explanatory variable.

[Figure: residual plot for the lizard regression]

Residual plots

# Get residuals from regression output
biteData$res <- resid(biteRegression)

# Plot residuals
plot(res ~ bite, data = biteData,
     pch = 16, cex = 2, bty = "n", cex.lab = 1.5, col = "orange",
     xlab = "Bite force (N)", ylab = "Residuals")

# Add a horizontal line at zero
abline(h=0, lty=2)

Residual plots to check assumptions

If the assumptions of normality and equal variance of residuals are met, we expect:

  • A roughly symmetric cloud of points above and below the horizontal line at zero, with a higher density of points close to the line.

  • Approximately equal variance of points above and below the line at all values of \( X \).

In addition, we expect little noticeable curvature as we move from left to right along the \( x \)-axis (linearity).

Residual plots to check assumptions

Normality of residuals

hist(biteData$res)

[Figure: histogram of the residuals]

shapiro.test(biteData$res)

    Shapiro-Wilk normality test

data:  biteData$res
W = 0.95404, p-value = 0.6959

Linear regression assumptions

Visualizing the data will help spot nonlinearities.

Data transformations

Transformations can make some nonlinear relationships linear.

Data transformations

Transformations can also correct for non-equal variances (just like in ANOVA).

In this example, the response variable for the untransformed data on the left was transformed to the right using a square-root transformation.
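
As a minimal sketch (illustrative only; the slide's figure comes from a different dataset), the lizard regression could be refit with a square-root transformed response:

# Refit with a square-root transformed response (illustrative only)
fitSqrt <- lm(sqrt(territory.area) ~ bite, data = biteData)
summary(fitSqrt)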

Linear regression assumptions

Outliers can also have large effects on linear fits.

Linear regression assumptions

  • If an outlier is present, compare the regression line produced with and without the outlier (see the sketch after this list).

  • If the outlier has a large effect, alternative approaches may be tried, such as data transformation and permutation tests.
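
A minimal sketch of that comparison, dropping the point with the largest absolute residual (an illustrative rule; in practice, identify the outlier by inspection):

# Refit without the most extreme point and compare the coefficients
out <- which.max(abs(resid(biteRegression)))
fitDrop <- lm(territory.area ~ bite, data = biteData[-out, ])

coef(biteRegression)  # slope with all points
coef(fitDrop)         # a large change suggests an influential point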

Objectives (2)

  • Make hypothesis tests about the slope of a regression line

  • Define three types of 'confidence intervals' of interest:

    • for the slope estimate
    • prediction of the mean Y for a given X (confidence bands)
    • prediction of an individual Y-value for a given X (prediction intervals).

Estimating uncertainty in slope

Definition: If the assumptions of linear regression are met, then the sampling distribution of \( b \) is a normal distribution having a mean equal to \( \beta \) and a standard error estimated from data as

\[ \mathrm{SE}_{b} = \sqrt{\frac{\mathrm{MS}_{\mathrm{residual}}}{\sum_{i}(X_{i}-\bar{X})^2}}. \]

Variance of the residuals

Definition: The variance of the residuals is known as \( \mathrm{MS}_{\mathrm{residuals}} \) and is given by

\[ \mathrm{MS}_{\mathrm{residuals}} = \frac{\sum_{i}\left(Y_{i}-\hat{Y}_{i}\right)^2}{n-2}. \]
This is also known as the residual mean square.

\( \mathrm{MS}_{\mathrm{residuals}} \) is completely analogous to the error mean square in ANOVA.

\( \mathrm{MS}_{\mathrm{residuals}} \) has \( n-2 \) degrees of freedom because two parameters, the slope and the intercept, were estimated to obtain the predicted values.
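
Both quantities can be verified by hand; a minimal sketch using the regression fitted earlier:

# MS_residuals and SE_b by hand, checked against the lm output
n <- nrow(biteData)                          # n = 11
MSres <- sum(resid(biteRegression)^2) / (n - 2)
MSres                                        # 33.506, the anova() Mean Sq

SEb <- sqrt(MSres / sum((biteData$bite - mean(biteData$bite))^2))
SEb                                          # 4.848, the Std. Error for bite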

Hypothesis tests about slope

Definition: If the assumptions of linear regression are met, then the sampling distribution of \( b \) is a normal distribution having a mean equal to \( \beta \) and a standard error estimated from data as \( \mathrm{SE}_{b} \).


\[ H_{0}: \beta = \beta_{0}\\ H_{A}: \beta \neq \beta_{0} \]

Test statistic: \[ t = \frac{b-\beta_{0}}{\mathrm{SE}_{b}} \]

Degrees of freedom: \[ df = n-2 \]

Hypothesis tests about slope

You can get the results of a \( t \)-test on the slope in R for \( \beta_{0}=0 \) by using the summary command on the regression result from lm:

summary(biteRegression)

Hypothesis tests about slope


Call:
lm(formula = territory.area ~ bite, data = biteData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0472 -4.5101 -0.5504  3.6689 10.2237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -31.539     23.513  -1.341   0.2127  
bite          11.677      4.848   2.409   0.0393 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared:  0.3919,    Adjusted R-squared:  0.3244 
F-statistic: 5.801 on 1 and 9 DF,  p-value: 0.03934

Hypothesis tests about slope

For a \( t \)-test with \( H_{0}: \beta = \beta_{0} \) where \( \beta_{0} \neq 0 \), go manual.

First, get estimate.

biteRegression$coefficients
(Intercept)        bite 
  -31.53929    11.67730 
(b <- biteRegression$coefficients['bite'])
   bite 
11.6773 

Hypothesis tests about slope

For a \( t \)-test with \( H_{0}: \beta = \beta_{0} \) where \( \beta_{0} \neq 0 \), go manual.

Second, get standard error.

(coef <- summary(biteRegression)$coefficients)
             Estimate Std. Error   t value  Pr(>|t|)
(Intercept) -31.53929  23.512646 -1.341375 0.2126627
bite         11.67730   4.848226  2.408571 0.0393411
(sderr <- coef['bite','Std. Error'])
[1] 4.848226

Hypothesis tests about slope

For a \( t \)-test with \( H_{0}: \beta = \beta_{0} \) where \( \beta_{0} \neq 0 \), go manual.

Third, compute test statistic.

\[ \begin{align} t & = \frac{b - \beta_{0}}{\mathrm{SE}_{b}} \\ & = \frac{11.6772959 - \beta_{0}}{4.8482259} \end{align} \]

Then continue with the usual hypothesis-test steps by computing the \( P \)-value of the test statistic with \( df = n-2 \), as sketched below.
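
A minimal sketch of those final steps, for an illustrative null value \( \beta_{0} = 10 \) (not from the study; b and sderr were computed above):

beta0 <- 10
tstat <- (b - beta0) / sderr                   # test statistic, about 0.35

2 * pt(-abs(tstat), df = nrow(biteData) - 2)   # two-sided P-value, about 0.74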

Linear regression vs ANOVA

Two sources of variation:

  • Residual part: Deviation between \( Y_{i} \) and predicted values \( \hat{Y}_{i} \) (analogous to error component in ANOVA).
  • Regression part: Deviation between predicted values \( \hat{Y}_{i} \) and grand mean \( \bar{Y} \) (analogous to groups component in ANOVA).

In the ANOVA table, we can compute the \( F \)-ratio

\[ F = \frac{\mathrm{MS}_{\mathrm{regression}}}{\mathrm{MS}_{\mathrm{residual}}} \]

which is a test on the slope with \( H_{0}: \beta = 0 \).

Linear regression vs ANOVA

Use the anova command on biteRegression:

(biteAnova <- anova(biteRegression))
Analysis of Variance Table

Response: territory.area
          Df Sum Sq Mean Sq F value  Pr(>F)  
bite       1 194.37 194.374  5.8012 0.03934 *
Residuals  9 301.55  33.506                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tests null hypothesis \( H_{0}: \beta = 0 \).
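
Note the connection to the earlier \( t \)-test: with one explanatory variable, the \( F \)-ratio is the square of the \( t \)-statistic for the slope, as a quick check confirms.

# F equals t squared, with the same P-value (0.0393)
summary(biteRegression)$coefficients["bite", "t value"]^2   # 2.4086^2 = 5.801
biteAnova$"F value"[1]                                      # 5.801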

Variation explained by explanatory variable

We can measure how well the line “fits” the data by estimating the \( R^{2} \) value using the relationship:

\[ R^{2} = \frac{\mathrm{SS}_{\mathrm{regression}}}{\mathrm{SS}_{\mathrm{total}}}. \]

This also can be said to measure the fraction of variation in \( Y \) that is “explained” by \( X \).

Variation explained by explanatory variable

The basic idea is:

  • If \( R^{2} \) is close to 1, then \( X \) is explaining most of the variation in \( Y \), and any other variation which could be caused by other sources is negligible in comparison.
  • If \( R^{2} \) is close to 0, then \( X \) is explaining very little of the variation in \( Y \), and the remaining variation is caused by other sources not accounted for or measured in the system of study.

Variation explained by explanatory variable

For the lizard example,

biteRegSummary <- summary(biteRegression)
biteRegSummary$r.squared
[1] 0.3919418
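
The same value can be recovered from the ANOVA sums of squares computed earlier:

# R-squared = SS_regression / SS_total
SSreg <- biteAnova$"Sum Sq"[1]     # 194.37
SSres <- biteAnova$"Sum Sq"[2]     # 301.55
SSreg / (SSreg + SSres)            # 0.392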

Annotate plot

# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))

# Scatter plot
plot(territory.area ~ bite, data = biteData,
     pch = 16, col = "forestgreen", cex = 2, bty = "n", cex.lab = 1.5,
     xlab = "Bite force (N)", ylab = expression("Territory area" ~ (m^2)))

# Add in the best-fit line
abline(biteRegression, col="forestgreen", lty = 2, lwd=3)

# Text annotations: R-squared and the regression equation
text(5.11, 31.5, expression(paste(R^2, " = ", 0.39)), cex=1.5)
text(5.11, 32.9, expression(paste("y = ", -31.54 + 11.68, italic(x))), cex=1.5)

Annotate plot

[Figure: annotated scatter plot with regression line, equation, and R²]

Estimating uncertainty in slope

Definition: A confidence interval for the parameter \( \beta \) is given by

\[ b - t_{\alpha(2),df}\mathrm{SE}_{b} < \beta < b + t_{\alpha(2),df}\mathrm{SE}_{b}, \]
with   \( df = n-2 \).
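
In R, the built-in confint function computes this interval; a minimal sketch checking it against the formula (b and sderr as computed above):

# 95% CI for the slope: confint() vs. the formula
confint(biteRegression, "bite", level = 0.95)

b + c(-1, 1) * qt(0.975, df = 9) * sderr   # same interval, about (0.71, 22.6)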

Confidence bands vs prediction intervals

  • Confidence bands: predict the mean \( Y \) for a given \( X \) (i.e., confidence intervals for the mean of \( Y \) at a fixed \( X \)).
  • Prediction intervals: predict an individual \( Y \)-value (i.e., the prediction interval contains \( 100(1-\alpha)\% \) of the probability distribution of \( Y \)-values at a fixed \( X \)); see the sketch below.
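
A minimal sketch contrasting the two at bite = 5.1, using the regression fitted earlier:

# The prediction interval is wider than the confidence interval because
# it also accounts for the scatter of individual Y-values around the mean
predict(biteRegression, newdata = data.frame(bite = 5.1), interval = "confidence")
predict(biteRegression, newdata = data.frame(bite = 5.1), interval = "prediction")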

Confidence bands

# Get x values
xpt <- seq(min(biteData$bite), max(biteData$bite), length.out = 100)

# Call predict function
ypt <- data.frame( predict.lm(biteRegression, newdata = data.frame(bite = xpt), interval = "confidence") )

# Draw the confidence bands 
lines(ypt$lwr ~ xpt, lwd = 2, lty = 3, col="forestgreen")
lines(ypt$upr ~ xpt, lwd = 2, lty = 3, col="forestgreen")

Confidence bands

[Figure: regression line with 95% confidence bands]

Confidence bands

[Figure: confidence bands, highlighting the interval at \( X = 5.1 \)]

95% confidence interval for \( \mu_{Y | X=5.1} \).

Confidence bands

[Figure: confidence bands, highlighting the interval at \( X = \mu_{X} \)]

95% confidence interval for \( \mu_{Y | X=\mu_{X}} \).

Prediction intervals

# Get x values
xpt <- seq(min(biteData$bite), max(biteData$bite), length.out = 100)

# Call predict function
ypt <- data.frame( predict(biteRegression, newdata = data.frame(bite = xpt), interval = "prediction", level = 0.95) )

# Draw the prediction intervals
lines(ypt$lwr ~ xpt, lwd = 2, lty = 4, col="blue")
lines(ypt$upr ~ xpt, lwd = 2, lty = 4, col="blue")

Prediction intervals

[Figure: regression line with 95% prediction intervals]

Summary

  • The results of a \( t \)-test on the slope for \( \beta_{0}=0 \) can be obtained in R by using the summary command on the regression result from lm.

  • Confidence bands: predict the mean \( Y \) for a given \( X \). They can be obtained by setting the argument interval = "confidence" in the predict.lm R function.

  • Prediction intervals: predict an individual \( Y \)-value. They can be obtained by setting the argument interval = "prediction" in the predict.lm R function.