Alban Guillaumet, Troy University
Introduction to Linear Regression
Other types of regression include non-linear and logistic regression -> another time!
Definition:
Regression is the method used topredict values of one numerical variable (response) from values of another (explanatory).
Note: Difference with correlation, measuring the strength and direction of an association between numerical variables, without a distinction between explanatory and response variables.
Definition:
Linear regression draws a straight line through the data topredict the response variable from the explanatory variable.
Modern humans emerged from Africa ~ 60 KYA, and our ancestors lost some genetic variation at each step as they spread to new lands.
The line fitted to the data is the regression line. It can be used to predict the genetic diversity of a local human pop. (the response variable), even for a locale not included in the study, based on its distance from East Africa (the explanatory variable).
The slope of the regression line indicates how much \( Y \) changes per unit change in \( X \). In this case, humans lose 0.076 units of genetic diversity with every 10,000 km distance from East Africa.
Definition: For the
population , the regression line is
\[ Y = \alpha + \beta X, \]
where \( \alpha \) (theintercept ) and \( \beta \) (theslope ) are population parameters.
Definition: For a
sample , the regression line is
\[ Y = a + b X, \]
where \( a \) and \( b \) are estimates of \( \alpha \) and \( \beta \), respectively.
Technically, the linear regression equation is
\[ \mu_{Y\, |\, X=X^{*}} = \alpha + \beta X^{*}, \]
were \( \mu_{Y\, |\, X=X^{*}} \) is the mean of \( Y \) in the sub-population with \( X=X^{*} \)
We are predicting the mean of Y given X.
Many straight lines can be drawn through a scatter of points, so how do we find the 'best' one?
Method of least squares
Definition: The
least-squares regression line is the line for which the sum of all thesquared deviations in \( Y \) is smallest.
The method of least-squares leads to the following estimates for intercept and slope:
\[ \begin{align} b & = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i}(X_{i}-\bar{X})^2} \\ a & = \bar{Y}-b\bar{X} \end{align} \]
Note:
\[ b = \frac{\mathrm{Covariance(X,Y)}}{s_{X}^2} = r\frac{s_{Y}}{s_{X}}, \]
where \( r \) is the correlation coefficient!
Practice Problem #12
Male lizards in the species Crotaphytus collaris use their jaws as weapons during territorial interactions. Lappin and Husak (2005) tested whether weapon performance (bite force) predicted territory size in this species.
Long way: FYI only
\[ b = \frac{\mathrm{Covariance(X,Y)}}{s_{X}^2} \]
# Slope
(b <- cov(biteData$bite, biteData$territory.area)/var(biteData$bite))
[1] 11.6773
Long way: FYI only
\[ a = \bar{Y}-b\bar{X} \]
# Intercept
(a <- mean(biteData$territory.area) - b * mean(biteData$bite))
[1] -31.53929
How do you interpret the negative intercept?
Faster: Use lm…
(biteRegression <- lm(territory.area ~ bite, data = biteData))
Call:
lm(formula = territory.area ~ bite, data = biteData)
Coefficients:
(Intercept) bite
-31.54 11.68
Remember: lm was used for ANOVA too!
Bonus! With lm, we can add the regression line to plot.
# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))
# Scatter plot
plot(territory.area ~ bite, data = biteData, pch=16, col="forestgreen", cex=2, bty = "n", cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)))
# Add in the best-fit line
abline(biteRegression, col="forestgreen", lty = 2, lwd=3)
Bonus! With lm, we can add the regression line to plot.
Definition: The
predicted value of \( Y \) (denoted \( \hat{Y} \), or \( \mu_{Y\, |\, X} \)) from a regression line estimates the mean value of \( Y \) for all individuals having a given value of \( X \).
Definition:
Residuals measure the scatter of points above and below the least-squares regression line, and are denoted by
\[ r_{i} = Y_{i} - \hat{Y}_{i}, \]
where \( \hat{Y}_{i} = a + bX_{i} \).
We can predict what the mean value of \( Y \) is for values of the explanatory variable \( X \) not represented in our data, as long as we are within the range of values of the data.
The function predict.lm accomplishes this, and even gives us a standard error for our estimate.
(pred_5.1 <- predict.lm(biteRegression, data.frame(bite = 5.1), se.fit = TRUE))
$fit
1
28.01492
$se.fit
[1] 2.163259
$df
[1] 9
$residual.scale
[1] 5.788413
Definition:
Extrapolation is the prediction of the value of a response variable outside the range of \( X \)-values in the data.
Regression should not be used to predict the value of the response values for an \( X \)-value that lies well outside the range of the data.
Definition:
Regression is the method used topredict values of one numerical variable (response) from values of another (explanatory).
Use the R function
lm to perform a linear regression, i.e. estimate the intercept and slope of a regression line, and the functionabline to add the regression line to a plot.
Use the R function
predict.lm to predict what the mean value of \( Y \) is for values of the explanatory variable \( X \) not represented in our data, as long as we are within the range of values of the data.
Definition: a
residual plot is a scatter plot of the residuals \( (Y_{i}-\hat{Y}_{i}) \) against the \( X_{i} \), the values of the explanatory variable.
# Get residuals from regression output
biteData$res = resid(biteRegression)
# Plot residuals
plot(res ~ bite, data=biteData, pch=16, cex=2, bty = "n", cex.lab=1.5, col="orange", xlab="Bite force (N)", ylab="Residuals")
# Add a horizontal line at zero
abline(h=0, lty=2)
If the assumptions of normality and equal variance of residuals are met, we expect:
a roughly symmetric cloud of points above and below the horizontal line at zero, with a higher density of points close to the line
Approximately equal variance of points above and below the line at all values of \( X \).
In addition, we expect little noticeable curvature as we move from left to right along the \( x \)-axis (linearity).
hist(biteData$res)
shapiro.test(biteData$res)
Shapiro-Wilk normality test
data: biteData$res
W = 0.95404, p-value = 0.6959
Visualizing the data will help spot nonlinearities.
Transformations can make some nonlinear relationships linear.
Transformations can also correct for non-equal variances (just like in ANOVA).
In this example, the response variable for the untransformed data on the left was transformed to the right using a square-root transformation.
Outliers can also have large effects on linear fits.
If an outlier is present, compare the regression line produced with and without the outlier.
If the outlier has a large effect, alternative approaches may be tried, such as data transformation and permutation tests.
Make hypothesis tests about the slope of a regression line
Define three types of 'confidence intervals' of interest:
Definition: If the assumptions of linear regression are met, then the sampling distribution of \( b \) is a normal distribution having a mean equal to \( \beta \) and a standard error estimated from data as
\[ \mathrm{SE}_{b} = \sqrt{\frac{\mathrm{MS}_{\mathrm{residual}}}{\sum_{i}(X_{i}-\bar{X})^2}}. \]
Definition: The
variance of the residuals is known as \( \mathrm{MS}_{\mathrm{residuals}} \) and is given by
\[ \mathrm{MS}_{\mathrm{residuals}} = \frac{\sum_{i}\left(Y_{i}-\hat{Y}_{i}\right)^2}{n-2}. \]
This is also known asresidual mean square .
\( \mathrm{MS}_{\mathrm{residuals}} \) is completely analogous to the error mean square in ANOVA.
\( \mathrm{MS}_{\mathrm{residuals}} \) has \( n-2 \) degrees of freedom since we needed to estimate slope and intercept for the predicted values.
Definition: If the assumptions of linear regression are met, then the sampling distribution of \( b \) is a normal distribution having a mean equal to \( \beta \) and a standard error estimated from data as \( \mathrm{SE}_{b} \).
\[ H_{0}: \beta = \beta_{0}\\
H_{A}: \beta \neq \beta_{0} \]
Test statistic: \[ t = \frac{b-\beta_{0}}{\mathrm{SE}_{b}} \]
Degree of freedom: \[ df = n-2 \]
You can get the results of a \( t \)-test on slope in R for \( \beta_{0}=0 \) by using the summary command on a regression results from lm:
summary(biteRegression)
Call:
lm(formula = territory.area ~ bite, data = biteData)
Residuals:
Min 1Q Median 3Q Max
-7.0472 -4.5101 -0.5504 3.6689 10.2237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.539 23.513 -1.341 0.2127
bite 11.677 4.848 2.409 0.0393 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared: 0.3919, Adjusted R-squared: 0.3244
F-statistic: 5.801 on 1 and 9 DF, p-value: 0.03934
For a \( t \)-test with \( H_{0}: \beta_{0} \neq 0 \), go manual.
First, get estimate.
biteRegression$coefficients
(Intercept) bite
-31.53929 11.67730
(b <- biteRegression$coefficients['bite'])
bite
11.6773
For a \( t \)-test with \( H_{0}: \beta_{0} \neq 0 \), go manual.
Second, get standard error.
(coef <- summary(biteRegression)$coefficients)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.53929 23.512646 -1.341375 0.2126627
bite 11.67730 4.848226 2.408571 0.0393411
(sderr <- coef['bite','Std. Error'])
[1] 4.848226
For a \( t \)-test with \( H_{0}: \beta_{0} \neq 0 \), go manual.
Third, compute test statistic.
\[ \begin{align} t & = \frac{b - \beta_{0}}{\mathrm{SE}_{b}} \\ & = \frac{11.6772959 - \beta_{0}}{4.8482259} \end{align} \]
Then continue with usual hypothesis test steps by computing \( P \)-value of test statistic with \( df = n-2 \).
Two sources of variation:
In the ANOVA table, we can compute the \( F \)-ratio
\[ F = \frac{\mathrm{MS}_{\mathrm{regression}}}{\mathrm{MS}_{\mathrm{residual}}} \]
which is a test on the slope with \( H_{0}: \beta = 0 \).
Use anova command on biteRegression
(biteAnova <- anova(biteRegression))
Analysis of Variance Table
Response: territory.area
Df Sum Sq Mean Sq F value Pr(>F)
bite 1 194.37 194.374 5.8012 0.03934 *
Residuals 9 301.55 33.506
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tests null hypothesis \( H_{0}: \beta = 0 \).
We can measure how well the line “fits” the data by estimating the \( R^{2} \) value using the relationship:
\[ R^{2} = \frac{\mathrm{SS}_{\mathrm{regression}}}{\mathrm{SS}_{\mathrm{total}}}. \]
This also can be said to measure the fraction of variation in \( Y \) that is “explained” by \( X \).
Basic idea is:
For the lizard example,
biteRegSummary <- summary(biteRegression)
biteRegSummary$r.squared
[1] 0.3919418
# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))
# Scatter plot
plot(territory.area ~ bite, pch=16, col="forestgreen", cex=2, bty = "n", cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)),data=biteData)
# Add in the best-fit line
abline(biteRegression, col="forestgreen", lty = 2, lwd=3)
# Text
text(5.11, 31.5, expression(paste(R^2,"= ", 0.39)), cex=1.5)
text(5.11, 32.9, expression(paste("y = ", -31.54 + 11.68)), cex=1.5)
Definition: A confidence interval for the parameter \( \beta \) is given by
\[ b - t_{\alpha(2),df}\mathrm{SE}_{b} < \beta < b + t_{\alpha(2),df}\mathrm{SE}_{b}, \]
with \( df = n-2 \).
# Get x values
xpt <- seq(min(biteData$bite), max(biteData$bite), length.out = 100)
# Call predict function
ypt <- data.frame( predict.lm(biteRegression, newdata = data.frame(bite = xpt), interval = "confidence") )
# Draw the confidence bands
lines(ypt$lwr ~ xpt, lwd = 2, lty = 3, col="forestgreen")
lines(ypt$upr ~ xpt, lwd = 2, lty = 3, col="forestgreen")
# Get x values
xpt <- seq(min(biteData$bite), max(biteData$bite), length.out = 100)
# Call predict function
ypt <- data.frame( predict(biteRegression, newdata = data.frame(bite = xpt), interval = "prediction", level = 0.95) )
# Draw the confidence bands
lines(ypt$lwr ~ xpt, lwd = 2, lty = 4, col="blue")
lines(ypt$upr ~ xpt, lwd = 2, lty = 4, col="blue")
The results of a \( t \)-test on slope for \( \beta_{0}=0 \) can be obtained in R by using the summary command on a regression results from lm.
Confidence bands: predicts the mean \( Y \) for a given \( X \). They can be obtained by setting the argument interval = "confidence" in the predict.lm R function.
Prediction intervals: predicts an individual \( Y \)-value. They can be obtained by setting the argument interval = "prediction" in the predict.lm R function.