Linear regression models the relationship between a scalar response (dependent variable) and one or more explanatory variables (independent variables) using linear predictor functions whose unknown model parameters are estimated from the data.
“Linear” here means linear in the parameters, not necessarily that the relationship is a straight line.
In linear regression, the observations (red) are assumed to be the result of random deviations (green) from an underlying relationship (blue) between a dependent variable (y) and an independent variable (x).
Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways.
Ordinary least squares (OLS) is a method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function by minimizing the sum of the squared differences between the observed values of the dependent variable and those predicted by the linear function of the independent variables (i.e., the green lines in the figure above). Mathematically, this can be expressed by a relatively simple formula and solved analytically.
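In standard matrix notation (a textbook result, not part of the original text), with design matrix X and response vector y, the OLS estimate minimizes the residual sum of squares and has the closed-form solution

$$\hat{\beta} = \underset{\beta}{\arg\min}\, \lVert y - X\beta \rVert^2 = (X^\top X)^{-1} X^\top y$$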
# Generate some fake data for two groups, A and B
N <- 15
Aslope <- 1
Bslope <- 3
Aintercept <- 0
Bintercept <- -3
Nseq <- seq_len(N)
Adat <- data.frame(group = "A",
                   x = Nseq,
                   y = Nseq * Aslope + Aintercept)
Bdat <- data.frame(group = "B",
                   x = Nseq,
                   y = Nseq * Bslope + Bintercept)
dat <- rbind(Adat, Bdat)
# Add some random noise
dat$y <- dat$y + rnorm(nrow(dat)) * 2
ggplot(dat, aes(x, y, color = group)) + geom_point(size = 4)
# "y is a linear function of x"
m1 <- lm(y ~ x, data = dat)
summary(m1)
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.854 -6.111 -1.013 7.052 13.596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0034 3.1558 -0.318 0.753
## x 1.9582 0.3471 5.642 4.82e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.214 on 28 degrees of freedom
## Multiple R-squared: 0.532, Adjusted R-squared: 0.5153
## F-statistic: 31.83 on 1 and 28 DF, p-value: 4.818e-06
The model has an intercept of -1.003, which is not significantly different from zero (as the p-value is 0.753).
The model’s best-fit slope is 1.958. The p-value for the slope is very small, which is strong evidence against the “null hypothesis” that there is no relationship between x and y. Because there’s only a single predictor in the model, the p-value for the slope is exactly the same as the p-value for the overall model (the F-statistic at the bottom).
In other words, there’s a strong and statistically significant correlation between these two variables, explaining about 52% of the variability in y (from the adjusted R²).
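As a quick check of the “solved analytically” claim above, the closed-form solution can be computed by hand and compared with lm (a sketch; X and beta_hat are our names, and lm itself uses a more numerically stable QR decomposition, though the results agree):
X <- model.matrix(m1)                              # 30 x 2 design matrix: intercept and x
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% dat$y   # (X'X)^-1 X'y
drop(beta_hat)                                     # same numbers as...
coef(m1)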
dat$m1 <- predict(m1)
ggplot(dat, aes(x, y, color = group)) +
  geom_point(size = 4) +
  geom_line(aes(y = m1), linetype = 2, color = "black")
Unfortunately, if we plot the model residuals (the difference between the actual y values and the predicted values), it’s obvious that this model is not satisfactory.
ggplot(dat, aes(m1, y - m1, color = group)) +
  xlab(expression(Predicted~value~(hat(y)))) +
  ylab(expression(Residual~(y-hat(y)))) +
  geom_point(size = 4) +
  geom_smooth(aes(group = 1), se = FALSE)
(The blue loess smoother simply makes it easier to see trends.)
The problem is that the residuals of a linear model should be homoscedastic: roughly normally distributed with a mean of zero and constant variance. We can test for constant variance using the Breusch-Pagan test, whose statistic follows a chi-squared (χ²) distribution:
library(olsrr)
ols_test_breusch_pagan(m1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------
## Response : y
## Variables: fitted values of y
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 10.41012
## Prob > Chi2 = 0.001253263
So this model has heteroscedastic residuals; more specifically, it overpredicts group A and underpredicts group B.
We can do better!
We know we have two groups in our data, A and B; maybe observations from these groups are systematically different in some way, i.e. there’s a constant difference between them?
# "y is a linear function of x plus some constant group-dependent offset"
m2 <- lm(y ~ x + group, data = dat)
summary(m2)
##
## Call:
## lm(formula = y ~ x + group, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7465 -3.9688 0.7215 3.7605 7.6018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3108 2.1590 -3.386 0.00219 **
## x 1.9582 0.2145 9.130 9.63e-10 ***
## groupB 12.6148 1.8533 6.807 2.60e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.075 on 27 degrees of freedom
## Multiple R-squared: 0.8277, Adjusted R-squared: 0.8149
## F-statistic: 64.85 on 2 and 27 DF, p-value: 4.901e-11
Let’s take a close look at this fitted model.
The model has an overall intercept of -7.311, which is significantly different from zero (as the p-value is 0.002). This is the intercept for the A group.
The intercept for the B group is -7.311 + 12.615 = 5.304. This is significantly different from the A group’s intercept, because the p-value for groupB is 2.60e-07 (i.e. it’s very small).
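The same arithmetic can be done directly from the fitted coefficients (a minimal sketch using base R’s coef):
coef(m2)                                            # (Intercept), x, groupB
coef(m2)[["(Intercept)"]] + coef(m2)[["groupB"]]    # implied intercept for group B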
The model’s slope is 1.958. The p-value for the slope is very small, strong evidence against the “null hypothesis” that there is no overall relationship between x and y. In other words, the slope is statistically different from zero.
Remember, in our model specification above we did not allow group to affect the slope parameter.
We’ve made progress! This new model explains 81% of the variability in y (a huge improvement) and has a much lower residual standard error (5.08, down from 8.21).
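The residual plot for m2 isn’t reproduced here, but it can be made the same way as the one for m1 above (a sketch; dat$m2 is our name for the predicted values):
dat$m2 <- predict(m2)
ggplot(dat, aes(m2, y - m2, color = group)) +
  xlab(expression(Predicted~value~(hat(y)))) +
  ylab(expression(Residual~(y-hat(y)))) +
  geom_point(size = 4) +
  geom_smooth(aes(group = 1), se = FALSE)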
The residual plot is still pretty weird looking, because we’re underpredicting A (and overpredicting B) at low x values, and overpredicting A (and underpredicting B) at high x values.
If we allow group to interact with x, then we are fitting a model in which the slope (but not the intercept) changes with group:
# "y is a linear function of x whose slope is group-dependent"
m3 <- lm(y ~ x : group, data = dat)
summary(m3)
##
## Call:
## lm(formula = y ~ x:group, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9668 -1.2410 0.0058 1.6262 4.1989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0034 0.8376 -1.198 0.241
## x:groupA 1.1157 0.1020 10.938 2.01e-11 ***
## x:groupB 2.8008 0.1020 27.460 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.18 on 27 degrees of freedom
## Multiple R-squared: 0.9682, Adjusted R-squared: 0.9659
## F-statistic: 411.1 on 2 and 27 DF, p-value: < 2.2e-16
The model has a single intercept of -1.003, which is not significantly different from zero.
The best-fit slope for the A group is 1.116, and the B group slope estimate is 2.801. Both are significantly different from zero.
This third model explains 97% of the variability in y — another big jump.
This…this is a pretty good looking model. It explains almost all the variability in the data, and the residual plot looks good.
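(The residual plot isn’t reproduced here. As a sketch, the fitted lines can be drawn the same way as for m1, with dat$m3 as our name for the predictions:)
dat$m3 <- predict(m3)
ggplot(dat, aes(x, y, color = group)) +
  geom_point(size = 4) +
  geom_line(aes(y = m3), linetype = 2)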
Should we try a model with group affecting both slope and intercept? It depends on our understanding of the system we’re trying to model. Does it make sense for these models to have a common intercept? In many cases, probably not: the groups would be expected to be completely different.
# "y is a group-dependent linear function of x"
m4 <- lm(y ~ x * group, data = dat)
summary(m4)
##
## Call:
## lm(formula = y ~ x * group, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5294 -1.3547 0.2663 1.2635 3.7656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9156 1.0835 0.845 0.4058
## x 0.9299 0.1192 7.804 2.81e-08 ***
## groupB -3.8381 1.5323 -2.505 0.0188 *
## x:groupB 2.0566 0.1685 12.204 2.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.994 on 26 degrees of freedom
## Multiple R-squared: 0.9744, Adjusted R-squared: 0.9714
## F-statistic: 329.7 on 3 and 26 DF, p-value: < 2.2e-16
Our final model. How did OLS do?
| Term | True value | Estimate |
|---|---|---|
| A intercept | 0 | 0.916 |
| B intercept | -3 | 0.916 - 3.838 = -2.922 |
| A slope | 1 | 0.93 |
| B slope | 3 | 0.93 + 2.057 = 2.987 |
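A quick, hedged check: each true value should usually fall inside the corresponding 95% confidence interval. (Note that for groupB and x:groupB the true values are offsets relative to group A, i.e. -3 and 2.)
confint(m4)   # 95% confidence intervals for (Intercept), x, groupB, and x:groupB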
All the terms in our final model above are statistically significant. If a term isn’t significant, however, we’d generally like to remove it from the model. This can be done by hand or using functions such as MASS::stepAIC or olsrr::ols_step_both_p.
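For example, stepwise selection by AIC might look like this (a sketch; m4_step is our name):
library(MASS)
m4_step <- stepAIC(m4, direction = "both")   # prints the terms considered at each step
summary(m4_step)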
An important principle of model-building is parsimony: we’d like the simplest, most explainable model. In particular, we don’t want to overfit: that would leave us with an overly specific model that doesn’t generalize to new data.
# Fit a six-term polynomial (x + x^2 + x^3 + ...)
m4poly <- lm(y ~ poly(x, 6) * group, data = dat)
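One quick, hedged way to compare this deliberately overfit model with m4 is AIC, which rewards goodness of fit but penalizes the extra parameters (lower is better):
AIC(m4, m4poly)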
Above we used ggplot2 to make residual plots ‘by hand’, but we can also just use the plot function to make a residual plot, as well as visualize other model diagnostics:
cars_model <- lm(dist ~ speed, data = cars)
plot(cars_model, which = 1)
By default plot.lm produces four key plots:
plot(cars_model)
These are:
1. Residuals vs Fitted, which shows non-linearity and changing variance;
2. Normal Q-Q, which checks whether the residuals are approximately normally distributed;
3. Scale-Location, another view of residual spread versus fitted values;
4. Residuals vs Leverage, which highlights influential observations (with Cook’s distance contours).
Some problems, heteroscedasticity in particular, may be addressable by transforming the data so that we can model it successfully using a linear model.
For example, look at the Puromycin dataset and fit a basic one-term linear model. Obviously it doesn’t work very well, because the data don’t fall in a straight line:
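A minimal sketch of that basic fit (pm1 and the plotting code are ours, not from the original):
pm1 <- lm(rate ~ conc, data = Puromycin)   # straight-line fit ignoring state
ggplot(Puromycin, aes(conc, rate, color = state)) +
  geom_point(size = 4) +
  geom_line(aes(y = predict(pm1), group = 1), linetype = 2, color = "black")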
Try a polynomial (this is still a linear model; see above):
pm2 <- lm(rate ~ poly(conc, 2) * state, data = Puromycin)
Use a transformation:
pm3 <- lm(rate ~ log(conc) * state, data = Puromycin)
The other, and probably best, option would be to use a nonlinear model for these data.
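For example, a minimal nonlinear sketch for the treated group, using the self-starting Michaelis–Menten model SSmicmen() from the stats package (pm4 is our name, not from the original):
pm4 <- nls(rate ~ SSmicmen(conc, Vm, K), data = Puromycin, subset = state == "treated")
summary(pm4)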
The repository for this document is here.
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] olsrr_0.5.3 ggplot2_3.3.5
##
## loaded via a namespace (and not attached):
## [1] nortest_1.0-4 tidyselect_1.1.1 xfun_0.24 bslib_0.2.5.1
## [5] purrr_0.3.4 splines_4.1.2 haven_2.4.1 lattice_0.20-45
## [9] carData_3.0-4 colorspace_2.0-2 vctrs_0.3.8 generics_0.1.0
## [13] htmltools_0.5.1.1 yaml_2.2.1 mgcv_1.8-38 utf8_1.2.2
## [17] rlang_0.4.12 jquerylib_0.1.4 pillar_1.6.4 foreign_0.8-81
## [21] glue_1.5.0 withr_2.4.2 readxl_1.3.1 lifecycle_1.0.1
## [25] stringr_1.4.0 cellranger_1.1.0 munsell_0.5.0 gtable_0.3.0
## [29] zip_2.1.1 evaluate_0.14 labeling_0.4.2 knitr_1.33
## [33] rio_0.5.26 forcats_0.5.1 curl_4.3.2 fansi_0.5.0
## [37] highr_0.9 Rcpp_1.0.7 scales_1.1.1 jsonlite_1.7.2
## [41] abind_1.4-5 farver_2.1.0 gridExtra_2.3 hms_1.1.0
## [45] digest_0.6.28 stringi_1.7.5 openxlsx_4.2.3 dplyr_1.0.6
## [49] grid_4.1.2 tools_4.1.2 goftest_1.2-3 magrittr_2.0.1
## [53] sass_0.4.0 tibble_3.1.6 crayon_1.4.2 car_3.0-10
## [57] pkgconfig_2.0.3 ellipsis_0.3.2 Matrix_1.3-4 data.table_1.14.0
## [61] rmarkdown_2.8 R6_2.5.1 nlme_3.1-153 compiler_4.1.2