Introduction

Linear regression models allow us to determine whether changes in one variable are associated with changes in another variable. In other words, if I make a 1-unit change in \(X\), how much does \(Y\) change? In fact, linear regression is similar to the algebraic equation for a simple line (\(Y = mX + b\), where \(m\) is the slope, \(X\) is the variable that is changing, and \(b\) is the Y-intercept). In biostatistics, we use linear regression models to test the association between two or more variables when the outcome is a continuous data type.

There are several conditions that need to be satisfied in order for us to use the results from a linear regression model. These include:

Linearity: the relationship between \(X\) and \(Y\) is linear
Independence: the observations are independent of one another
Normality: the residuals of the model are normally distributed
Homoscedasticity: the residuals have constant variance across the values of \(X\)

However, the linear regression model is pretty robust to violations of these assumptions. Hence its popularity. Moreover, it is very easy to interpret, as this article will demonstrate.

Simple linear regression

Typical notations of the linear regression include:

\(Y_i\) denotes the outcome variable for individual \(i\)
\(\beta_0\) denotes the Y-intercept (the expected value of \(Y\) when \(X = 0\))
\(\beta_1\) denotes the change in \(Y\) due to a 1-unit change in \(X_1\)
\(X_{1i}\) denotes the first \(X\) variable for individual \(i\)
\(\epsilon\) denotes the error term

The structural form of a linear regression model:

\(Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon\)

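To make the structural form concrete, here is a minimal simulation sketch (the coefficient values 5 and 2 are arbitrary choices for illustration, not from any real data): we generate data from this model and recover \(\beta_0\) and \(\beta_1\) with lm():

#### Simulate data from Y = beta0 + beta1*X + error and recover the betas
set.seed(1234)                     # for reproducibility
n <- 100                           # sample size
x <- rnorm(n, mean = 10, sd = 2)   # predictor X
y <- 5 + 2 * x + rnorm(n)          # beta0 = 5, beta1 = 2, epsilon ~ N(0, 1)
coef(lm(y ~ x))                    # estimates should be close to 5 and 2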

Using some example data (mtcars), let’s evaluate the association between the car’s weight (wt) and fuel efficiency (mpg). We set up the linear regression model so that the outcome is mpg and wt is the main predictor of interest. The structural form of this model is:

\(mpg_i = \beta_0 + \beta_1 wt_{i} + \epsilon\)


We can use the R command lm() to perform this regression:

#### Clear the environment
rm(list = ls())

#### Load the libraries
library("ggplot2")
library("predict3d")

#### Load Data
data1 <- mtcars
head(data1)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
### Linear regression model (Y = mpg, X = wt)
linear.model1 <- lm(mpg ~ wt, data = data1)
summary(linear.model1)
## 
## Call:
## lm(formula = mpg ~ wt, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

We can generate the 95% confidence intervals (CI) for the parameters using the confint() command.

confint(linear.model1)
##                 2.5 %    97.5 %
## (Intercept) 33.450500 41.119753
## wt          -6.486308 -4.202635

Based on these outputs, a 1-unit increase in wt was associated with a 5.34-unit decrease in mpg (95% CI: -6.49, -4.20), which was statistically significant at the 5% alpha level.
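
We can verify this interpretation with the base R predict() command; the weights of 3 and 4 below (i.e., 3000 and 4000 lbs, since wt is in units of 1000 lbs) are arbitrary values chosen to illustrate a 1-unit change:

#### Predicted mpg at wt = 3 versus wt = 4 (a 1-unit change in wt)
predict(linear.model1, newdata = data.frame(wt = c(3, 4)))
# From the printed coefficients, 37.2851 - 5.3445*3 and 37.2851 - 5.3445*4
# give about 21.25 and 15.91; the difference is the wt coefficient, -5.34.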

We can visualize the association between wt and mpg:

#### Visualize the linear regression models
ggPredict(linear.model1, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5)

ggPredict(linear.model1, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)

ggPredict(linear.model1, digits = 1, show.point = TRUE, show.error = TRUE, se = TRUE, xpos = 0.5)
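
As an aside, for a single-predictor model the same fitted line can be drawn in plain ggplot2, since geom_smooth(method = "lm") fits the identical simple regression:

#### Equivalent plot with ggplot2 alone
ggplot(data1, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)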

Multiple variable linear regression model

For a linear regression model with multiple \(X\)s, we just add another \(\beta\) coefficient and \(X\):

\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon\)


\(\beta_2\) denotes the change in \(Y\) due to a 1-unit change in \(X_2\), holding the other variables constant
\(X_{2i}\) denotes the second \(X\) variable for individual \(i\)


Multivariable linear regression models allow us to focus on one variable’s impact on the outcome while “controlling” for the effects of the other variables. Sometimes we use the term “adjusting” for the other covariates instead of “controlling,” but neither term by itself explains what the linear regression is doing with these other variables. When we say “control,” we are trying to separate the effects of the other variables from the variable of interest. In other words, we want to see what the effect of the variable of interest is among a group of people with otherwise similar characteristics.
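
A quick way to see what “controlling” does (a preview of the output below) is to compare the wt coefficient before and after cyl enters the model:

#### Compare the wt coefficient with and without cyl in the model
coef(lm(mpg ~ wt, data = data1))["wt"]        # unadjusted: about -5.34
coef(lm(mpg ~ wt + cyl, data = data1))["wt"]  # adjusted for cyl: about -3.19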

Using the mtcars data, let’s test the association between the weight of the car (wt) and fuel efficiency (mpg) controlling for engine cylinder size (cyl).

\(mpg_i = \beta_0 + \beta_1 wt_{i} + \beta_2 cyl_{i} + \epsilon\)


We can use the R command lm() to perform this regression, but add cyl to the equation:

### Linear regression model (Y = mpg, X = wt + cyl)
linear.model2 <- lm(mpg ~ wt + cyl, data = data1)
summary(linear.model2)
## 
## Call:
## lm(formula = mpg ~ wt + cyl, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2893 -1.5512 -0.4684  1.5743  6.1004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185 
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12
confint(linear.model2)
##                 2.5 %     97.5 %
## (Intercept) 36.178725 43.1937976
## wt          -4.739020 -1.6429245
## cyl         -2.355928 -0.6596622

Based on the linear regression output, a 1-unit increase in the weight of the car (wt) was associated with a 3.19-unit decrease in mpg (95% CI: -4.74, -1.64), controlling for the cylinder size (cyl).
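
A sketch with predict() again makes the adjusted interpretation concrete; wt = 3 versus 4 and cyl = 6 are arbitrary illustrative values, with cyl held constant:

#### Predicted mpg for a 1-unit change in wt, holding cyl fixed at 6
predict(linear.model2, newdata = data.frame(wt = c(3, 4), cyl = 6))
# From the printed coefficients, 39.6863 - 3.1910*wt - 1.5078*6 gives
# about 21.07 and 17.88; the difference is the wt coefficient, -3.19.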

We can visualize the association between wt and mpg for different cyl groups:

#### Visualize the linear regression models
ggPredict(linear.model2, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5)

ggPredict(linear.model2, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5, facet.modx = TRUE, show.error = TRUE)

ggPredict(linear.model2, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)

ggPredict(linear.model2, digits = 1, pred = cyl, show.point = TRUE, se = TRUE, xpos = 0.5)
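
If you want the same additive-model lines in plain ggplot2, one option (a sketch; the grid object grid1 is just an illustrative name) is to predict over a grid and draw one line per cyl value. Note that geom_smooth(method = "lm") by group would instead fit a separate regression per group, which is not the additive model above.

#### Fitted lines from the additive model, one per cyl group
grid1 <- expand.grid(wt  = seq(min(data1$wt), max(data1$wt), length.out = 50),
                     cyl = c(4, 6, 8))
grid1$mpg <- predict(linear.model2, newdata = grid1)
ggplot(data1, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_line(data = grid1)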

Conclusions

Linear regression models are very useful for testing the association between multiple variables. In these examples, we went through the syntax of the lm() command to run the linear regression model, the confint() command to generate the 95% confidence intervals, and the predict3d package, a great package for visualizing these coefficients. Now that we have generated some of the visuals for the linear regression, we will need to look at the residuals and assess whether the assumptions hold (we’ll visit these issues in a future article).
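
As a quick preview of that residual check, base R’s plot() method for lm objects produces the standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage):

#### Quick residual diagnostics (a preview; more in a future article)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(linear.model2)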