Simple linear regression models describe the effect that a particular variable, called the explanatory variable, might have on the value of a continuous outcome variable, called the response variable.
The explanatory variable may be continuous, discrete, or categorical.
library(knitr)
kable(trees)
| Girth | Height | Volume |
|---|---|---|
| 8.3 | 70 | 10.3 |
| 8.6 | 65 | 10.3 |
| 8.8 | 63 | 10.2 |
| 10.5 | 72 | 16.4 |
| 10.7 | 81 | 18.8 |
| 10.8 | 83 | 19.7 |
| 11.0 | 66 | 15.6 |
| 11.0 | 75 | 18.2 |
| 11.1 | 80 | 22.6 |
| 11.2 | 75 | 19.9 |
| 11.3 | 79 | 24.2 |
| 11.4 | 76 | 21.0 |
| 11.4 | 76 | 21.4 |
| 11.7 | 69 | 21.3 |
| 12.0 | 75 | 19.1 |
| 12.9 | 74 | 22.2 |
| 12.9 | 85 | 33.8 |
| 13.3 | 86 | 27.4 |
| 13.7 | 71 | 25.7 |
| 13.8 | 64 | 24.9 |
| 14.0 | 78 | 34.5 |
| 14.2 | 80 | 31.7 |
| 14.5 | 74 | 36.3 |
| 16.0 | 72 | 38.3 |
| 16.3 | 77 | 42.6 |
| 17.3 | 81 | 55.4 |
| 17.5 | 82 | 55.7 |
| 17.9 | 80 | 58.3 |
| 18.0 | 80 | 51.5 |
| 18.0 | 80 | 51.0 |
| 20.6 | 87 | 77.0 |
plot(Volume ~ Height, data=trees)
plot(Volume ~ Girth, data=trees)
cov(trees$Volume, trees$Height)
## [1] 62.66
cor(trees$Volume, trees$Height)
## [1] 0.59824965
cov(trees$Volume, trees$Girth)
## [1] 49.888118
cor(trees$Volume, trees$Girth)
## [1] 0.96711937
plot(Volume ~ Height, data=trees)
abline(lm(Volume ~ Height, data=trees))
plot(Volume ~ Girth, data=trees)
abline(lm(Volume ~ Girth, data=trees))
The purpose of a linear regression model is to come up with a function that estimates the mean of one variable given a particular value of another variable.
The simple linear regression model states that the value of a response is expressed as the following equation:
\[ Y|X = \beta_0 + \beta_1 X + \epsilon.\]
The validity of the conclusions you can draw based on the model in is critically dependent on the assumptions made about \(\epsilon\), which represents random error. These are defined as follows:
The regression coefficients or parameters are:
The value denoted by \(\beta_0\) is called the intercept. This is interpreted as the expected value of the response variable when the predictor is zero.
The value denoted by \(\beta_1\) is called the slope. This is interpreted as the change in the mean response for each one-unit increase in the predictor.
Looking at our example:
lm(Volume ~ Height, data=trees)
##
## Call:
## lm(formula = Volume ~ Height, data = trees)
##
## Coefficients:
## (Intercept) Height
## -87.1236 1.5433
lm(Volume ~ Girth, data=trees)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Coefficients:
## (Intercept) Girth
## -36.9435 5.0659
The data observations are used to estimate the regression line. This is referred to as fitting the linear model. The fitted regression line is:
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1}x\] The parameter estimates are obtained by:
\[ \hat{\beta_1} = \rho_{xy} \frac{s_y}{s_x} \] \[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\]
In simple linear regression there’s a natural question that should always be asked: Is there statistical evidence to support the presence of a relationship between the predictor and the response?
Simple linear regression coefficients, when estimated using least-squares, follow a t-distribution with \(n − 2\) degrees of freedom.
In particular we are mostlly interested in testing:
\[ H_0: \beta_ 1 = 0 \] \[ H_A: \beta_ 1 \ne 0 \] As with any hypothesis test, the smaller the p-value, the stronger the evidence against \(H_0.\)
We can assess the goodness of fit of our model by using the coefficient of determination, more often called Multiple R-squared or \(R^2.\)
For simple linear regression this equals the square of the correlation coefficient.
\[ R^2 = \hat{\rho}^2.\] This value is interpreted as the proportion of the variation in the response variable Y that is being explained by the predictor variable X.
Let’s revisit our example:
summary(lm(Volume ~ Height, data=trees))
##
## Call:
## lm(formula = Volume ~ Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2744 -9.8944 -2.8941 12.0675 29.8522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.12361 29.27312 -2.9762 0.0058347 **
## Height 1.54335 0.38387 4.0205 0.0003784 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.397 on 29 degrees of freedom
## Multiple R-squared: 0.3579, Adjusted R-squared: 0.33576
## F-statistic: 16.164 on 1 and 29 DF, p-value: 0.00037838
summary(lm(Volume ~ Girth, data=trees))
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.06536 -3.10670 0.15197 3.49477 9.58682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.94346 3.36514 -10.978 7.621e-12 ***
## Girth 5.06586 0.24738 20.478 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.93532, Adjusted R-squared: 0.93309
## F-statistic: 419.36 on 1 and 29 DF, p-value: < 2.22e-16
Many times, the objective of fitting a regression is to make predictions of the response for possible values of X. To do this, you simply plug in (to the fitted model equation) the value of \(x\) you’re interested in.
Confidence intervals and prediction intervals can be calculated as well.
Using our examples, can you predict the volume of timber that an 80 ft tree would give?
A prediction is referred to as interpolation if the \(x\) value you specify falls within the range of your observed data; extrapolation is when the \(x\) value of interest lies outside this range.
The results of simple linear regression SHOULD NOT be used to extrapolate outside the values of \(x.\)
Would it be ok to predict the volume of timber for a 60 ft tree?
We can also use a discrete or categorical explanatory variable, made up of k distinct groups or levels, to model the mean response.
Suppose your predictor variable is categorical, with only two possible levels (binary; k = 2) and observations coded either 0 or 1. Let’s look at how this would translate into our model:
\[ Y|X = \beta_0 + \beta_1 X + \epsilon.\] After we fit the model we would have:
\[\hat{y} = \left\{ \begin{array}{c l} \hat{\beta_0} + \hat{\beta_1} & if \quad X=1 \\ \hat{\beta_0} & if \quad X=0 \end{array}\right.\]
The interpretation of the model parameters, \(\beta_0\) and \(\beta_1\), isn’t really one of an “intercept” and a “slope” anymore.
Instead, it’s better to think of them as being something like two intercepts, where \(\beta_0\) provides the baseline or reference value of the response when \(X = 0\) and \(\beta_1\) represents the additive effect (or difference) on the mean response if \(X = 1.\)
Let’s look at an example:
summary(lm(mpg ~ factor(am), data=mtcars))
##
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.39231 -3.09231 -0.29737 3.24393 9.50769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.1474 1.1246 15.2475 1.134e-15 ***
## factor(am)1 7.2449 1.7644 4.1061 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.33846
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.00028502
summary(lm(mpg ~ factor(vs), data=mtcars))
##
## Call:
## lm(formula = mpg ~ factor(vs), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7571 -3.0821 -1.2667 2.8280 9.3833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.6167 1.0797 15.3899 8.847e-16 ***
## factor(vs)1 7.9405 1.6324 4.8644 3.416e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.5808 on 30 degrees of freedom
## Multiple R-squared: 0.44095, Adjusted R-squared: 0.42231
## F-statistic: 23.662 on 1 and 30 DF, p-value: 3.4159e-05
When we have more than two k levels in the predictor, we convert everything into dummy coding. This is the procedure used to create several binary variables from a categorical variable like X.
Most statistical analysis software typically takes care of this process.
Let’s look at an example:
summary(lm(mpg ~ factor(cyl), data=mtcars))
##
## Call:
## lm(formula = mpg ~ factor(cyl), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.26364 -1.83571 0.02857 1.38929 7.23636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.6636 0.9718 27.4373 < 2.2e-16 ***
## factor(cyl)6 -6.9208 1.5583 -4.4411 0.0001195 ***
## factor(cyl)8 -11.5636 1.2986 -8.9045 8.568e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.2231 on 29 degrees of freedom
## Multiple R-squared: 0.73246, Adjusted R-squared: 0.71401
## F-statistic: 39.698 on 2 and 29 DF, p-value: 4.9789e-09
Multiple linear regression is a straightforward generalization of the single-predictor models. An important idea here is that correlation does not imply causation.
Take once again the simple linear regression model:
\[ Y = \beta_0 + \beta_1 X + \epsilon.\]
If we add more predictors to this model we would simply write as:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon.\] You can think of the relationship between response and predictors as a multidimensional plane or surface.
When it comes to output and interpretation, working with multiple explanatory variables follows the same rules as before:
Any numeric-continuous variables have a slope coefficient that provides a “per-unit-change” quantity.
Any k-group categorical variables (factors) are dummy coded and provide k − 1 intercepts.
Let’s look at some examples:
summary(lm(Volume ~ Height + Girth, data=trees))
##
## Call:
## lm(formula = Volume ~ Height + Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.40648 -2.64933 -0.28758 2.20032 8.48470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.98766 8.63823 -6.7129 2.75e-07 ***
## Height 0.33925 0.13015 2.6066 0.01449 *
## Girth 4.70816 0.26426 17.8161 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.8818 on 28 degrees of freedom
## Multiple R-squared: 0.94795, Adjusted R-squared: 0.94423
## F-statistic: 254.97 on 2 and 28 DF, p-value: < 2.22e-16
summary(lm(mpg ~ hp + factor(am), data=mtcars))
##
## Call:
## lm(formula = mpg ~ hp + factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.38434 -2.26423 0.13659 1.69682 5.86571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.5849137 1.4250943 18.6548 < 2.2e-16 ***
## hp -0.0588878 0.0078567 -7.4952 2.92e-08 ***
## factor(am)1 5.2770853 1.0795406 4.8883 3.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.9092 on 29 degrees of freedom
## Multiple R-squared: 0.78203, Adjusted R-squared: 0.767
## F-statistic: 52.024 on 2 and 29 DF, p-value: 2.5505e-10
summary(lm(mpg ~ factor(vs) + factor(am), data=mtcars))
##
## Call:
## lm(formula = mpg ~ factor(vs) + factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.19048 -2.59881 0.22222 2.73155 6.30952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.59444 0.92615 15.7582 9.352e-16 ***
## factor(vs)1 6.92937 1.26213 5.4902 6.501e-06 ***
## factor(am)1 6.06667 1.27484 4.7588 4.958e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.4913 on 29 degrees of freedom
## Multiple R-squared: 0.68608, Adjusted R-squared: 0.66443
## F-statistic: 31.69 on 2 and 29 DF, p-value: 5.0562e-08