Introduction

Linear regression models allow us to determine whether changes in one variable are associated with changes in another variable. In other words, if I make a 1-unit change in \(X\), how much does \(Y\) change? In fact, linear regression is similar to the algebraic equation for a simple line (\(Y = mX + b\), where \(m\) is the slope, \(X\) is the variable that is changing, and \(b\) is the Y-intercept). In biostatistics, we use linear regression models to test the association between two or more variables when the outcome is a continuous data type.

There are several conditions that need to be satisfied in order for us to use the results from a linear regression model. These include:

Linearity: the relationship between \(X\) and \(Y\) is linear
Independence: the observations are independent of one another
Normality: the residuals of the model are normally distributed
Homoscedasticity: the residuals have constant variance across the values of \(X\)

However, the linear regression model is pretty robust to violations of these assumptions. Hence its popularity. Moreover, it is very easy to interpret, as this article will demonstrate.

Simple linear regression

Typical notations of the linear regression include:

\(Y_i\) denotes the outcome variable for individual \(i\)
\(\beta_0\) denotes the Y-intercept (the expected value of \(Y\) when \(X = 0\))
\(\beta_1\) denotes the change in \(Y\) due to a 1-unit change in \(X_1\)
\(X_{1i}\) denotes the first \(X\) variable for individual \(i\)
\(\epsilon\) denotes the error term

The structural form of a linear regression model:

\(Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon\)

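To make the structural form concrete, here is a minimal simulation sketch (the coefficient values 5 and 2 are arbitrary choices for illustration, not from any real data): we generate data from this model and recover \(\beta_0\) and \(\beta_1\) with lm():

#### Simulate data from Y = beta0 + beta1*X + error and recover the betas
set.seed(1234)                     # for reproducibility
n <- 100                           # sample size
x <- rnorm(n, mean = 10, sd = 2)   # predictor X
y <- 5 + 2 * x + rnorm(n)          # beta0 = 5, beta1 = 2, epsilon ~ N(0, 1)
coef(lm(y ~ x))                    # estimates should be close to 5 and 2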

Using some example data (mtcars), let’s evaluate the association between the car’s weight (wt) and fuel efficiency (mpg). We set up the linear regression model so that the outcome is mpg and wt is the main predictor of interest. The structural form of this model is:

\(mpg_i = \beta_0 + \beta_1 wt_{i} + \epsilon\)


We can use the R command lm() to perform this regression:

#### Clear the environment
rm(list = ls())

#### Load the libraries
library("ggplot2")
library("predict3d")

#### Load Data
data1 <- mtcars
head(data1)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
### Linear regression model (Y = mpg, X = wt)
linear.model1 <- lm(mpg ~ wt, data = data1)
summary(linear.model1)
## 
## Call:
## lm(formula = mpg ~ wt, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

We can generate the 95% confidence intervals (CI) for the parameters using the confint() command.

confint(linear.model1)
##                 2.5 %    97.5 %
## (Intercept) 33.450500 41.119753
## wt          -6.486308 -4.202635

Based on these outputs, a 1-unit increase in wt was associated with a 5.34-unit decrease in mpg (95% CI: -6.49, -4.20), which was statistically significant at the 5% alpha level.
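
We can verify this interpretation with the base R predict() command; the weights of 3 and 4 below (i.e., 3000 and 4000 lbs, since wt is in units of 1000 lbs) are arbitrary values chosen to illustrate a 1-unit change:

#### Predicted mpg at wt = 3 versus wt = 4 (a 1-unit change in wt)
predict(linear.model1, newdata = data.frame(wt = c(3, 4)))
# From the printed coefficients, 37.2851 - 5.3445*3 and 37.2851 - 5.3445*4
# give about 21.25 and 15.91; the difference is the wt coefficient, -5.34.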

We can visualize the association between wt and mpg:

#### Visualize the linear regression models
ggPredict(linear.model1, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5)

ggPredict(linear.model1, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)

ggPredict(linear.model1, digits = 1, show.point = TRUE, show.error = TRUE, se = TRUE, xpos = 0.5)
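
As an aside, for a single-predictor model the same fitted line can be drawn in plain ggplot2, since geom_smooth(method = "lm") fits the identical simple regression:

#### Equivalent plot with ggplot2 alone
ggplot(data1, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)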

Multiple variable linear regression model

For a linear regression model with multiple \(X\)s, we just add another \(\beta\) coefficient and \(X\):

\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon\)


\(\beta_2\) denotes the change in \(Y\) due to a 1-unit change in \(X_2\), holding the other variables constant
\(X_{2i}\) denotes the second \(X\) variable for individual \(i\)


Multivariable linear regression models allow us to focus on one variable’s impact on the outcome while “controlling” for the effects of the other variables. Sometimes we use the term “adjusting” for the other covariates instead of “controlling,” but neither term by itself explains what the linear regression is doing with these other variables. When we say “control,” we are trying to separate the effects of the other variables from the variable of interest. In other words, we want to see what the effect of the variable of interest is among a group of people with otherwise similar characteristics.
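
A quick way to see what “controlling” does (a preview of the output below) is to compare the wt coefficient before and after cyl enters the model:

#### Compare the wt coefficient with and without cyl in the model
coef(lm(mpg ~ wt, data = data1))["wt"]        # unadjusted: about -5.34
coef(lm(mpg ~ wt + cyl, data = data1))["wt"]  # adjusted for cyl: about -3.19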

Using the mtcars data, let’s test the association between the weight of the car (wt) and fuel efficiency (mpg) controlling for engine cylinder size (cyl).

\(mpg_i = \beta_0 + \beta_1 wt_{i} + \beta_2 cyl_{i} + \epsilon\)


We can use the R command lm() to perform this regression, but add cyl to the equation:

### Linear regression model (Y = mpg, X = wt + cyl)
linear.model2 <- lm(mpg ~ wt + cyl, data = data1)
summary(linear.model2)
## 
## Call:
## lm(formula = mpg ~ wt + cyl, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2893 -1.5512 -0.4684  1.5743  6.1004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185 
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12
confint(linear.model2)
##                 2.5 %     97.5 %
## (Intercept) 36.178725 43.1937976
## wt          -4.739020 -1.6429245
## cyl         -2.355928 -0.6596622

Based on the linear regression output, a 1-unit increase in the weight of the car (wt) was associated with a 3.19-unit decrease in mpg (95% CI: -4.74, -1.64), controlling for the cylinder size (cyl).
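
A sketch with predict() again makes the adjusted interpretation concrete; wt = 3 versus 4 and cyl = 6 are arbitrary illustrative values, with cyl held constant:

#### Predicted mpg for a 1-unit change in wt, holding cyl fixed at 6
predict(linear.model2, newdata = data.frame(wt = c(3, 4), cyl = 6))
# From the printed coefficients, 39.6863 - 3.1910*wt - 1.5078*6 gives
# about 21.07 and 17.88; the difference is the wt coefficient, -3.19.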

We can visualize the association between wt and mpg for different cyl groups:

#### Visualize the linear regression models
ggPredict(linear.model2, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5)

ggPredict(linear.model2, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5, facet.modx = TRUE, show.error = TRUE)

ggPredict(linear.model2, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)

ggPredict(linear.model2, digits = 1, pred = cyl, show.point = TRUE, se = TRUE, xpos = 0.5)
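
If you want the same additive-model lines in plain ggplot2, one option (a sketch; the grid object grid1 is just an illustrative name) is to predict over a grid and draw one line per cyl value. Note that geom_smooth(method = "lm") by group would instead fit a separate regression per group, which is not the additive model above.

#### Fitted lines from the additive model, one per cyl group
grid1 <- expand.grid(wt  = seq(min(data1$wt), max(data1$wt), length.out = 50),
                     cyl = c(4, 6, 8))
grid1$mpg <- predict(linear.model2, newdata = grid1)
ggplot(data1, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_line(data = grid1)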

Conclusions

Linear regression models are very useful for testing the association between multiple variables. In these examples, we went through the syntax of the lm() command to run the linear regression model, the confint() command to generate the 95% confidence intervals, and the predict3d package, a great package for visualizing these coefficients. Now that we have generated some of the visuals for the linear regression, we will need to look at the residuals and assess whether the assumptions hold (we’ll visit these issues in a future article).
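
As a quick preview of that residual check, base R’s plot() method for lm objects produces the standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage):

#### Quick residual diagnostics (a preview; more in a future article)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(linear.model2)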