SIMPLE LINEAR REGRESSION

Simple linear regression models describe the effect that a particular variable, called the explanatory variable, might have on the value of a continuous outcome variable, called the response variable.

The explanatory variable may be continuous, discrete, or categorical.

An Example of a Linear Relationship

library(knitr)
kable(trees)
Girth Height Volume
8.3 70 10.3
8.6 65 10.3
8.8 63 10.2
10.5 72 16.4
10.7 81 18.8
10.8 83 19.7
11.0 66 15.6
11.0 75 18.2
11.1 80 22.6
11.2 75 19.9
11.3 79 24.2
11.4 76 21.0
11.4 76 21.4
11.7 69 21.3
12.0 75 19.1
12.9 74 22.2
12.9 85 33.8
13.3 86 27.4
13.7 71 25.7
13.8 64 24.9
14.0 78 34.5
14.2 80 31.7
14.5 74 36.3
16.0 72 38.3
16.3 77 42.6
17.3 81 55.4
17.5 82 55.7
17.9 80 58.3
18.0 80 51.5
18.0 80 51.0
20.6 87 77.0
plot(Volume ~ Height, data=trees)

plot(Volume ~ Girth, data=trees)

cov(trees$Volume, trees$Height)
## [1] 62.66
cor(trees$Volume, trees$Height)
## [1] 0.59824965
cov(trees$Volume, trees$Girth)
## [1] 49.888118
cor(trees$Volume, trees$Girth)
## [1] 0.96711937
plot(Volume ~ Height, data=trees)
abline(lm(Volume ~ Height, data=trees))

plot(Volume ~ Girth, data=trees)
abline(lm(Volume ~ Girth, data=trees))

General Concepts

The purpose of a linear regression model is to come up with a function that estimates the mean of one variable given a particular value of another variable.

Definition of the Model

The simple linear regression model states that the value of a response is expressed as the following equation:

\[ Y|X = \beta_0 + \beta_1 X + \epsilon.\]

Residual Assumptions

The validity of the conclusions you can draw based on the model in is critically dependent on the assumptions made about \(\epsilon\), which represents random error. These are defined as follows:

  • The value of \(\epsilon\) is assumed to be normally distributed such that \(\epsilon \sim N(0, \sigma).\)
  • That \(\epsilon\) is has a mean of zero.
  • The variance of \(\epsilon\), \(\sigma^2\), is constant.

Parameters

The regression coefficients or parameters are:

  • The value denoted by \(\beta_0\) is called the intercept. This is interpreted as the expected value of the response variable when the predictor is zero.

  • The value denoted by \(\beta_1\) is called the slope. This is interpreted as the change in the mean response for each one-unit increase in the predictor.

Looking at our example:

lm(Volume ~ Height, data=trees)
## 
## Call:
## lm(formula = Volume ~ Height, data = trees)
## 
## Coefficients:
## (Intercept)       Height  
##    -87.1236       1.5433
lm(Volume ~ Girth, data=trees)
## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Coefficients:
## (Intercept)        Girth  
##    -36.9435       5.0659

Estimating the Intercept and Slope Parameters

The data observations are used to estimate the regression line. This is referred to as fitting the linear model. The fitted regression line is:

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1}x\] The parameter estimates are obtained by:

\[ \hat{\beta_1} = \rho_{xy} \frac{s_y}{s_x} \] \[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\]

Statistical Inference

In simple linear regression there’s a natural question that should always be asked: Is there statistical evidence to support the presence of a relationship between the predictor and the response?

Regression Coefficient Significance Tests

Simple linear regression coefficients, when estimated using least-squares, follow a t-distribution with \(n − 2\) degrees of freedom.

In particular we are mostlly interested in testing:

\[ H_0: \beta_ 1 = 0 \] \[ H_A: \beta_ 1 \ne 0 \] As with any hypothesis test, the smaller the p-value, the stronger the evidence against \(H_0.\)

Coefficient of Determination

We can assess the goodness of fit of our model by using the coefficient of determination, more often called Multiple R-squared or \(R^2.\)

For simple linear regression this equals the square of the correlation coefficient.

\[ R^2 = \hat{\rho}^2.\] This value is interpreted as the proportion of the variation in the response variable Y that is being explained by the predictor variable X.

Let’s revisit our example:

summary(lm(Volume ~ Height, data=trees))
## 
## Call:
## lm(formula = Volume ~ Height, data = trees)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.2744  -9.8944  -2.8941  12.0675  29.8522 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -87.12361   29.27312 -2.9762 0.0058347 ** 
## Height        1.54335    0.38387  4.0205 0.0003784 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.397 on 29 degrees of freedom
## Multiple R-squared:  0.3579, Adjusted R-squared:  0.33576 
## F-statistic: 16.164 on 1 and 29 DF,  p-value: 0.00037838
summary(lm(Volume ~ Girth, data=trees))
## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8.06536 -3.10670  0.15197  3.49477  9.58682 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -36.94346    3.36514 -10.978 7.621e-12 ***
## Girth         5.06586    0.24738  20.478 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.93532,    Adjusted R-squared:  0.93309 
## F-statistic: 419.36 on 1 and 29 DF,  p-value: < 2.22e-16

Prediction

Many times, the objective of fitting a regression is to make predictions of the response for possible values of X. To do this, you simply plug in (to the fitted model equation) the value of \(x\) you’re interested in.

Confidence intervals and prediction intervals can be calculated as well.

Using our examples, can you predict the volume of timber that an 80 ft tree would give?

Interpolation vs. Extrapolation

A prediction is referred to as interpolation if the \(x\) value you specify falls within the range of your observed data; extrapolation is when the \(x\) value of interest lies outside this range.

The results of simple linear regression SHOULD NOT be used to extrapolate outside the values of \(x.\)

Would it be ok to predict the volume of timber for a 60 ft tree?

Understanding Categorical Predictors

We can also use a discrete or categorical explanatory variable, made up of k distinct groups or levels, to model the mean response.

Binary Variables: k = 2

Suppose your predictor variable is categorical, with only two possible levels (binary; k = 2) and observations coded either 0 or 1. Let’s look at how this would translate into our model:

\[ Y|X = \beta_0 + \beta_1 X + \epsilon.\] After we fit the model we would have:

\[\hat{y} = \left\{ \begin{array}{c l} \hat{\beta_0} + \hat{\beta_1} & if \quad X=1 \\ \hat{\beta_0} & if \quad X=0 \end{array}\right.\]

The interpretation of the model parameters, \(\beta_0\) and \(\beta_1\), isn’t really one of an “intercept” and a “slope” anymore.

Instead, it’s better to think of them as being something like two intercepts, where \(\beta_0\) provides the baseline or reference value of the response when \(X = 0\) and \(\beta_1\) represents the additive effect (or difference) on the mean response if \(X = 1.\)

Let’s look at an example:

summary(lm(mpg ~ factor(am), data=mtcars))
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9.39231 -3.09231 -0.29737  3.24393  9.50769 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  17.1474     1.1246 15.2475 1.134e-15 ***
## factor(am)1   7.2449     1.7644  4.1061  0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.33846 
## F-statistic:  16.86 on 1 and 30 DF,  p-value: 0.00028502
summary(lm(mpg ~ factor(vs), data=mtcars))
## 
## Call:
## lm(formula = mpg ~ factor(vs), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7571 -3.0821 -1.2667  2.8280  9.3833 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  16.6167     1.0797 15.3899 8.847e-16 ***
## factor(vs)1   7.9405     1.6324  4.8644 3.416e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.5808 on 30 degrees of freedom
## Multiple R-squared:  0.44095,    Adjusted R-squared:  0.42231 
## F-statistic: 23.662 on 1 and 30 DF,  p-value: 3.4159e-05

Multilevel Variables: k > 2

When we have more than two k levels in the predictor, we convert everything into dummy coding. This is the procedure used to create several binary variables from a categorical variable like X.

Most statistical analysis software typically takes care of this process.

Let’s look at an example:

summary(lm(mpg ~ factor(cyl), data=mtcars))
## 
## Call:
## lm(formula = mpg ~ factor(cyl), data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5.26364 -1.83571  0.02857  1.38929  7.23636 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   26.6636     0.9718 27.4373 < 2.2e-16 ***
## factor(cyl)6  -6.9208     1.5583 -4.4411 0.0001195 ***
## factor(cyl)8 -11.5636     1.2986 -8.9045 8.568e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.2231 on 29 degrees of freedom
## Multiple R-squared:  0.73246,    Adjusted R-squared:  0.71401 
## F-statistic: 39.698 on 2 and 29 DF,  p-value: 4.9789e-09

MULTIPLE LINEAR REGRESSION

Multiple linear regression is a straightforward generalization of the single-predictor models. An important idea here is that correlation does not imply causation.

Extending the Simple Model to a Multiple Model

Take once again the simple linear regression model:

\[ Y = \beta_0 + \beta_1 X + \epsilon.\]

If we add more predictors to this model we would simply write as:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon.\] You can think of the relationship between response and predictors as a multidimensional plane or surface.

Interpreting Marginal Effects

When it comes to output and interpretation, working with multiple explanatory variables follows the same rules as before:

  • Any numeric-continuous variables have a slope coefficient that provides a “per-unit-change” quantity.

  • Any k-group categorical variables (factors) are dummy coded and provide k − 1 intercepts.

Let’s look at some examples:

summary(lm(Volume ~ Height + Girth, data=trees))
## 
## Call:
## lm(formula = Volume ~ Height + Girth, data = trees)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6.40648 -2.64933 -0.28758  2.20032  8.48470 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -57.98766    8.63823 -6.7129  2.75e-07 ***
## Height        0.33925    0.13015  2.6066   0.01449 *  
## Girth         4.70816    0.26426 17.8161 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.8818 on 28 degrees of freedom
## Multiple R-squared:  0.94795,    Adjusted R-squared:  0.94423 
## F-statistic: 254.97 on 2 and 28 DF,  p-value: < 2.22e-16
summary(lm(mpg ~ hp + factor(am), data=mtcars))
## 
## Call:
## lm(formula = mpg ~ hp + factor(am), data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4.38434 -2.26423  0.13659  1.69682  5.86571 
## 
## Coefficients:
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 26.5849137  1.4250943 18.6548 < 2.2e-16 ***
## hp          -0.0588878  0.0078567 -7.4952  2.92e-08 ***
## factor(am)1  5.2770853  1.0795406  4.8883  3.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.9092 on 29 degrees of freedom
## Multiple R-squared:  0.78203,    Adjusted R-squared:   0.767 
## F-statistic: 52.024 on 2 and 29 DF,  p-value: 2.5505e-10
summary(lm(mpg ~ factor(vs) + factor(am), data=mtcars))
## 
## Call:
## lm(formula = mpg ~ factor(vs) + factor(am), data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6.19048 -2.59881  0.22222  2.73155  6.30952 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 14.59444    0.92615 15.7582 9.352e-16 ***
## factor(vs)1  6.92937    1.26213  5.4902 6.501e-06 ***
## factor(am)1  6.06667    1.27484  4.7588 4.958e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.4913 on 29 degrees of freedom
## Multiple R-squared:  0.68608,    Adjusted R-squared:  0.66443 
## F-statistic:  31.69 on 2 and 29 DF,  p-value: 5.0562e-08