Introduction to Linear Regression with R

Introduction

This post introduces linear regression with an emphasis on prediction rather than inference. We will use the Ames housing data, which we split into ames_train and ames_test sets below.

Load Packages

# Helper packages
library(dplyr)    # for data manipulation
library(ggplot2)  # for awesome graphics

# Modeling packages
library(caret)    # for cross-validation, etc.
library(rsample)  # for resampling procedures
library(h2o)      # for resampling and model training

# Model interpretability packages
library(vip)      # variable importance

# Access data
ames <- AmesHousing::make_ames()
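
The make_ames() function returns the fully processed version of the Ames housing data. A quick dimension check shows what we are working with (this should report 2,930 home sales and 81 variables):

dim(ames)
[1] 2930   81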

Simple Linear Regression

Simple linear regression (SLR) assumes that the statistical relationship between two continuous variables (say \(X\) and \(Y\)) is (at least approximately) linear:

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]

for \(i = 1, 2, \dots, n\), where

\(Y_i\) represents the \(i\)-th response value,

\(X_i\) represents the \(i\)-th feature value,

\(\beta_0\) and \(\beta_1\) are fixed but unknown constants (commonly referred to as coefficients or parameters) that represent the intercept and slope of the regression line, respectively,

\(\epsilon_i\) represents noise or random error.

We'll assume that the errors are normally distributed with mean zero and constant variance \(\sigma^2\); that is, \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).

Since the random errors are centered around zero (i.e., \(E(\epsilon_i) = 0\)), linear regression is really a problem of estimating a conditional mean:

\[ E(Y_i|X_i) = \beta_0 + \beta_1 X_i \]

For brevity, we often drop the conditional piece and simply write \(E(Y)\) in place of \(E(Y|X)\).

Consequently, the interpretation of the coefficients is in terms of the average, or mean response. The intercept \(\beta_0\) represents the average response value when \(X=0\) (it is often not meaningful or of interest and is sometimes referred to as a bias term). The slope \(\beta_1\) represents the increase in the average response per one-unit increase in \(X\) (i.e., it is a rate of change).

Estimation

Ideally, we want estimates of \(\beta_0\) and \(\beta_1\) that give us the "best fitting" line. The most common approach for finding the best fitting line is to use the method of least squares (LS) estimation; this form of linear regression is often referred to as ordinary least squares (OLS) regression. There are multiple ways to measure "best fitting", but the LS criterion finds the "best fitting" line by minimizing the residual sum of squares (\(RSS\)):

\[ RSS = \sum_{i=1}^n \left[ Y_i - \left( \beta_0 + \beta_1 X_i \right) \right]^2 = \sum_{i=1}^n \left( Y_i - \beta_0 - \beta_1 X_i \right)^2 \]
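
To make this criterion concrete, here is a minimal R sketch of computing the RSS for a candidate pair of coefficients (the function rss and its arguments b0, b1, x, and y are illustrative placeholders, not objects used elsewhere in this post):

# Residual sum of squares for a candidate intercept (b0) and slope (b1),
# where x and y are numeric vectors of feature and response values
rss <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

OLS chooses the pair of coefficients that makes this quantity as small as possible; lm() computes that minimizer directly from a closed-form solution rather than by numerical search.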

The LS estimates of \(\beta_0\) and \(\beta_1\) are denoted as \(\hat \beta_0\) and \(\hat \beta_1\), respectively. Once obtained, we can generate predicted values, say at \(X = X_{new}\), using the estimated regression equation:

\[ \hat{Y}_{new} = \hat{\beta}_0 + \hat{\beta}_1 X_{new} \]

where \(\hat{Y}_{new}\) is the estimated mean response at \(X = X_{new}\).

With the Ames housing data, suppose we wanted to model a linear relationship between the total above ground living space of a home (Gr_Liv_Area) and sale price (Sale_Price).

To fit an OLS regression model in R, we first split the data into training and test sets using stratified sampling and then use the lm() function:

# Stratified sampling with the rsample package
set.seed(123)
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train  <- training(split)
ames_test   <- testing(split)

# Fit a simple linear regression model to the training data
model1 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames_train)

If we plot the fitted model (model1) against the training data, the points represent the observed values of Sale_Price and the line represents the fitted regression. The vertical distances between each point and the line are the individual errors, called residuals, associated with each observation. The OLS criterion above identifies the "best fitting" line as the one that minimizes the sum of squares of these residuals.
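
With model1 in hand, we can apply the estimated regression equation to predict the mean sale price at a new value of Gr_Liv_Area. As a minimal example, for a home with 2,000 square feet of above ground living space:

# Predicted mean sale price at Gr_Liv_Area = 2000
predict(model1, newdata = data.frame(Gr_Liv_Area = 2000))

Given the coefficient estimates reported in the summary below, this works out to roughly \(15938.173 + 109.667 \times 2000 \approx 235272\) dollars.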

The coef() function extracts the estimated coefficients from the model. We can also use summary() to get a more detailed report of the model results.
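
For example:

# Extract the estimated intercept and slope; the values match the
# Estimate column of the summary() output below
coef(model1)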

summary(model1) 

Call:
lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-474682  -30794   -1678   23353  328183 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15938.173   3851.853   4.138 3.65e-05 ***
Gr_Liv_Area   109.667      2.421  45.303  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56790 on 2047 degrees of freedom
Multiple R-squared:  0.5007,    Adjusted R-squared:  0.5004 
F-statistic:  2052 on 1 and 2047 DF,  p-value: < 2.2e-16
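
The residual standard error reported above is an estimate of \(\sigma\). To extract it programmatically, the sigma() function from the stats package works on a fitted lm object (a minimal sketch):

# Estimate of sigma (the residual standard error) and its square
sigma(model1)    # about 56790, per the summary output above
sigma(model1)^2  # estimated error variance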

Inference