Linear regression models allow us to determine if changing the values on a variable is associated with the values of another variable. In other words, if I make a 1-unit change in \(X\), how much does Y change? In fact, linear regression is similar to the algebraic equation for a simple line (\(Y = mx + b\), where \(m\) is the slope, \(X\) is the parameter that is changing, and \(b\) is the Y-intercept). In biostatistics, we use linear regression models to test the association between two or more variables where the outcome is a continuous data type.
There are several conditions that need to be satisfied in order for us to use the results from a linear regression model. These include:
However, the linear regression model is pretty robust to violations of these assumptions. Hence, it’s popularity. Moreover, it is very easy to interpret as this article will demonstrate.
Typical notations of the linear regression include:
The structural form of a linear regression model:
Using some example data (mtcars
), let’s evaluate the association between the car’s weight (wt
) and fuel efficiency (mpg
). We set up the linear regression model so that the outcome is mpg
and wt
is the main predictor of interest. The structural of this model is:
We can use the R command lm()
to perform this regression:
#### CLear the environment
rm(list = ls())
#### Load the libraries
library("ggplot2")
library("predict3d")
#### Load Data
data1 <- mtcars
head(data1)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
### Linear regression model (Y = mpg, X = wt)
linear.model1 <- lm(mpg ~ wt, data = data1)
summary(linear.model1)
##
## Call:
## lm(formula = mpg ~ wt, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
We can generate the 95% confidence intervals (CI) for the parameters using confint()
command.
confint(linear.model1)
## 2.5 % 97.5 %
## (Intercept) 33.450500 41.119753
## wt -6.486308 -4.202635
Based on these outputs, a 1-unit increase in wt
was associated with a 5.34-unit decrease in mpg
(95% CI: -6.49, -4.20), which was statistically significant at the 5% alpha level.
We can visualize the association between wt
and mpg
:
#### Visualize the linear regression models
ggPredict(linear.model1, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5)
ggPredict(linear.model1, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)
ggPredict(linear.model1, digits = 1, show.point = TRUE, show.error = TRUE, se = TRUE, xpos = 0.5)
For a linear regression model with multiple \(X\)s, we just add another \(\beta\) coefficient and \(X\):
Multivariable linear regression models allow us to focus on one variable’s impact on the outcome while “controlling” for the effects of the other variables. Sometimes, we use the term “adjusting” for the other covariates instead of “controlling,” but these terms are not helpful in understanding what the linear regression is doing with these other variables. When we say “control” we’re trying to mitigate the effects of other variables form the variable of interest. In other words, we want to see what the effects are for a similar group of people with the same characteristics.
Using the mtcars
data, let’s test the association between the weight of the car (wt
) and fuel efficiency (mpg
) controlling for engine cylinder size (cyl
).
We can use the R command lm()
to perform this regression, but add cyl
to the equation:
### Linear regression model (Y = mpg, X = wt + cyl)
linear.model2 <- lm(mpg ~ wt + cyl, data = data1)
summary(linear.model2)
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
confint(linear.model2)
## 2.5 % 97.5 %
## (Intercept) 36.178725 43.1937976
## wt -4.739020 -1.6429245
## cyl -2.355928 -0.6596622
Based on the linear regression output, the weight of the car was associated with a 3.19-unit decrease in the mpg
controlling for the cylinder size (95% CI: -4.74, -1.64).
We can visualize the association between wt and mpg for different cyl
groups:
#### Visualize the linear regression models
ggPredict(linear.model2, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5)
ggPredict(linear.model2, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5, facet.modx = TRUE, show.error = TRUE)
ggPredict(linear.model2, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)
ggPredict(linear.model2, digits = 1, pred = cyl, show.point = TRUE, se = TRUE, xpos = 0.5)
Linear regression models are very useful to test the association between multiple variables. In these examples, we go through the syntax using the lm()
command to run the linear regression model, the confint()
command to generate the 95% confidence intervals, and visualize the results using the predict3d
package, a great package for visualizing these coefficients. Now that we have generated some of the visuals for the linear regression, we will need to look at the residuals and assess whether the assumptions hold (we’ll visit these issues in a future article).