Linear models and ordinary least squares regression

Linear regression models the relationship between a scalar response (dependent variable) and one or more explanatory variables (independent variables) using linear predictor functions whose unknown model parameters are estimated from the data.

“Linear” here means linear in the parameters, not necessarily that the relationship is a straight line.

Figure: in linear regression, the observations (red) are assumed to be the result of random deviations (green) from an underlying relationship (blue) between a dependent variable (y) and an independent variable (x).

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways.

Ordinary least squares (OLS) is a method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function by minimizing the sum of the squared differences between the observed values of the dependent variable and the values predicted by the linear function of the independent variables (i.e., the lengths of the green lines in the figure above). Mathematically, this can be expressed by a relatively simple formula and solved analytically.
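
If we write the model in matrix form, with response vector y, design matrix X, and coefficient vector β, that formula is the familiar least-squares solution (a brief aside for reference; lm() computes this for us, via a QR decomposition rather than by inverting X⊤X directly):

$$\hat{\beta} \;=\; \operatorname*{arg\,min}_{\beta} \lVert y - X\beta \rVert^2 \;=\; (X^\top X)^{-1} X^\top y$$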

Fitting and assessing linear models

# Generate some fake data for two groups, A and B
N <- 15
Aslope <- 1
Bslope <- 3
Aintercept <- 0
Bintercept <- -3
Nseq <- seq_len(N)
Adat <- data.frame(group = "A",
                   x = Nseq,
                   y = Nseq * Aslope + Aintercept)
Bdat <- data.frame(group = "B",
                   x = Nseq,
                   y = Nseq * Bslope + Bintercept)
dat <- rbind(Adat, Bdat)

# Add some random noise (note: no seed is set, so exact values differ between runs)
dat$y <- dat$y + rnorm(nrow(dat)) * 2

library(ggplot2)
ggplot(dat, aes(x, y, color = group)) + geom_point(size = 4)

Simplest model (no groups)

# "y is a linear function of x"
m1 <- lm(y ~ x, data = dat)
summary(m1)
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.854  -6.111  -1.013   7.052  13.596 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.0034     3.1558  -0.318    0.753    
## x             1.9582     0.3471   5.642 4.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.214 on 28 degrees of freedom
## Multiple R-squared:  0.532,  Adjusted R-squared:  0.5153 
## F-statistic: 31.83 on 1 and 28 DF,  p-value: 4.818e-06

The model has an intercept of -1.003, which is not significantly different from zero (as the p-value is 0.753).

The model’s best-fit slope is 1.958. The p-value for the slope is very small, providing strong evidence against the “null hypothesis” that there is no relationship between x and y. Because there’s only a single predictor in the model, the p-value for the slope is exactly the same as the p-value for the overall model (the F-statistic line at the bottom of the summary).

In other words, there’s a strong and statistically significant relationship between these two variables, explaining about 52% (the adjusted R²) of the variability in y.
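
If you’d rather pull these numbers out of the fitted model programmatically than read them off the printed summary, standard accessor functions work on the m1 object (a quick sketch, not part of the original code):

# Estimated coefficients as a named vector
coef(m1)

# 95% confidence intervals for the intercept and slope
confint(m1)

# The full coefficient table: estimates, standard errors, t values, p-values
summary(m1)$coefficients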

dat$m1 <- predict(m1)
ggplot(dat, aes(x, y, color = group)) +
  geom_point(size = 4) +
  geom_line(aes(y = m1), linetype = 2, color = "black")

Unfortunately, if we plot the model residuals (the differences between the actual y values and the predicted values), it’s obvious that this model is not satisfactory.

ggplot(dat, aes(m1, y - m1, color = group)) + 
  xlab(expression(Predicted~value~(hat(y)))) + 
  ylab(expression(Residual~(y-hat(y)))) +
  geom_point(size = 4) + 
  geom_smooth(aes(group = 1), se = FALSE)

(The blue loess smoother simply makes it easier to see trends.)

The problem is that the residuals of a linear model should be homoscedastic: normally distributed with a mean of zero and roughly constant variance. We can test for constant variance using the Breusch-Pagan test, whose test statistic follows a chi-squared (χ²) distribution:

library(olsrr)
ols_test_breusch_pagan(m1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##             Data              
##  -----------------------------
##  Response : y 
##  Variables: fitted values of y 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    10.41012 
##  Prob > Chi2   =    0.001253263

So this model has heteroscedastic residuals: the small p-value means we reject the null hypothesis of constant variance. More specifically, it overpredicts group A and underpredicts group B.

We can do better!

Group as a factor that affects intercept only

We know we have two groups in our data, A and B; maybe observations from these groups are systematically different in some way, i.e. there’s a constant difference between them?

# "y is a linear function of x plus some constant group-dependent offset"
m2 <- lm(y ~ x + group, data = dat)
summary(m2)
## 
## Call:
## lm(formula = y ~ x + group, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7465 -3.9688  0.7215  3.7605  7.6018 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.3108     2.1590  -3.386  0.00219 ** 
## x             1.9582     0.2145   9.130 9.63e-10 ***
## groupB       12.6148     1.8533   6.807 2.60e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.075 on 27 degrees of freedom
## Multiple R-squared:  0.8277, Adjusted R-squared:  0.8149 
## F-statistic: 64.85 on 2 and 27 DF,  p-value: 4.901e-11

Let’s take a close look at this fitted model.

The model has an overall intercept of -7.311, which is significantly different from zero (as the p-value is 0.002). This is the intercept for the A group.

The intercept for the B group is -7.311 + 12.615 = 5.304. This is significantly different from the A group intercept, because the p-value for groupB is 2.6 × 10⁻⁷ (i.e. it’s very small).
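
As a quick sanity check (a sketch; these lines aren’t in the original code), the group B intercept can be computed directly from the fitted coefficients:

# Group B's intercept = overall intercept + the groupB offset
coef(m2)["(Intercept)"] + coef(m2)["groupB"]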

The model’s slope is 1.958. The p-value for the slope is very small, providing strong evidence against the “null hypothesis” that there is no overall relationship between x and y. In other words, the slope is statistically different from zero.

Remember, in our model specification above we did not allow group to affect the slope parameter.

We’ve made progress! This new model explains 81% of the variability in y, a huge improvement, and has a much lower residual standard error (5.08 versus 8.21 for the first model).
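
The residual plot for this model isn’t reproduced here, but it can be made by mirroring the residual-plot code used for m1 above (a sketch):

dat$m2 <- predict(m2)
ggplot(dat, aes(m2, y - m2, color = group)) +
  xlab(expression(Predicted~value~(hat(y)))) +
  ylab(expression(Residual~(y-hat(y)))) +
  geom_point(size = 4) +
  geom_smooth(aes(group = 1), se = FALSE)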

The residual plot is still pretty weird looking, because we’re underpredicting A (and overpredicting B) at low x values, and overpredicting A (and underpredicting B) at high x.

Group as a factor that affects slope only

If we allow group to interact with x, then we are fitting a model in which slope (but not intercept) changes with group:

# "y is a linear function of x whose slope is group-dependent"
m3 <- lm(y ~ x : group, data = dat)
summary(m3)
## 
## Call:
## lm(formula = y ~ x:group, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9668 -1.2410  0.0058  1.6262  4.1989 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.0034     0.8376  -1.198    0.241    
## x:groupA      1.1157     0.1020  10.938 2.01e-11 ***
## x:groupB      2.8008     0.1020  27.460  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.18 on 27 degrees of freedom
## Multiple R-squared:  0.9682, Adjusted R-squared:  0.9659 
## F-statistic: 411.1 on 2 and 27 DF,  p-value: < 2.2e-16

The model has a single intercept of -1.003, which is not significantly different from zero.

The best-fit slope for the A group is 1.116, and the B group slope estimate is 2.801. Both are significantly different from zero.

This third model explains 97% of the variability in y — another big jump.

This…this is a pretty good looking model. It explains almost all the variability in the data, and the residual plot looks good.

Group as a factor affecting both slope and intercept

Should we try a model with group affecting both slope and intercept? It depends on our understanding of the system we’re trying to model. Does it make sense for these models to have a common intercept? In many cases, probably not: the groups would be expected to be completely different.

# "y is a group-dependent linear function of x"
m4 <- lm(y ~ x * group, data = dat)
summary(m4)
## 
## Call:
## lm(formula = y ~ x * group, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5294 -1.3547  0.2663  1.2635  3.7656 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9156     1.0835   0.845   0.4058    
## x             0.9299     0.1192   7.804 2.81e-08 ***
## groupB       -3.8381     1.5323  -2.505   0.0188 *  
## x:groupB      2.0566     0.1685  12.204 2.88e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.994 on 26 degrees of freedom
## Multiple R-squared:  0.9744, Adjusted R-squared:  0.9714 
## F-statistic: 329.7 on 3 and 26 DF,  p-value: < 2.2e-16

Our final model. How did OLS do?

Term          True value   Estimate
A intercept    0           0.916
B intercept   -3           0.916 + -3.838 = -2.922
A slope        1           0.93
B slope        3           0.93 + 2.057 = 2.987
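
Was the group-specific intercept in m4 worth adding, compared to m3? Since m3 is nested within m4 (m3’s fitted values are a special case of m4’s), the two can be compared formally; this is a sketch, not part of the original analysis:

# F-test comparing the nested models, plus an AIC comparison
anova(m3, m4)
AIC(m3, m4)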

Overfitting and removing terms

All the terms in our final model above are statistically significant. If a term isn’t significant, however, we’d generally like to remove it from the model. This can be done by hand or using functions such as MASS::stepAIC or olsrr::ols_step_both_p.

An important principle of model-building is parsimony: we’d like the simplest, most explainable model. In particular, we don’t want to overfit: that would leave us with an overly specific model that doesn’t generalize to new data.

# Fit a six-term polynomial (x + x^2 + ... + x^6), interacting with group
m4poly <- lm(y ~ poly(x, 6) * group, data = dat)
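
How would we catch that this polynomial model is overfit? One approach (a sketch; neither call appears in the original analysis) is to compare it to m4 with AIC, which penalizes the extra parameters, or to let stepwise selection prune terms that don’t earn their keep:

# Lower AIC is better; the extra polynomial terms must justify their cost
AIC(m4, m4poly)

# Stepwise term selection by AIC, using the MASS function mentioned above
MASS::stepAIC(m4poly, trace = FALSE)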

Assessing models

Above we used ggplot2 to make residual plots ‘by hand’, but we can also just call the base plot() function on a fitted model to make a residual plot, as well as to visualize other model diagnostics:

cars_model <- lm(dist ~ speed, data = cars)
plot(cars_model, which = 1)

By default plot.lm produces four key plots:

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(cars_model)

These are:

  • Top left: residual plot. Residuals should have constant variance.
  • Top right: normal Q-Q plot. This lets you assess whether the residuals are normally distributed, as they should be.
  • Bottom left: scale-location plot. This shows whether residuals are spread equally along the ranges of predictor variables.
  • Bottom right: residuals versus leverage plot. This shows standardized residuals against leverage, with Cook’s distance contours, to detect whether any individual points have substantial influence on the regression results.

Transformations

Some problems, in particular heteroscedasticity or nonlinearity, may be addressable by transforming the data so that we can model it successfully using a linear model.

For example, look at the Puromycin dataset and fit a basic one-term linear model. Obviously it doesn’t work very well, because the data don’t fall in a straight line:
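
The original code and figure aren’t shown here; a minimal sketch of that naive fit, using the rate and conc columns of the built-in Puromycin data, might look like this:

# A naive straight-line fit of reaction rate versus concentration
pm1 <- lm(rate ~ conc, data = Puromycin)

# Plot the data with the fitted line overlaid
Puromycin$pm1 <- predict(pm1)
ggplot(Puromycin, aes(conc, rate, color = state)) +
  geom_point(size = 4) +
  geom_line(aes(y = pm1), linetype = 2, color = "black")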

Try a polynomial (this is still a linear model; see above):

pm2 <- lm(rate ~ poly(conc, 2) * state, data = Puromycin)

Use a transformation:

pm3 <- lm(rate ~ log(conc) * state, data = Puromycin)

The other, and probably best, option would be to use a nonlinear model for these data.
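
For instance, these data follow the classic Michaelis-Menten saturation curve, which can be fitted directly with nls() (a sketch, not part of the original analysis; the starting values are rough guesses):

# Michaelis-Menten model for the "treated" cells: rate = Vmax * conc / (K + conc)
pm_nls <- nls(rate ~ Vmax * conc / (K + conc),
              data = Puromycin,
              subset = state == "treated",
              start = list(Vmax = 200, K = 0.05))
summary(pm_nls)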

The End

The repository for this document is here.

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] olsrr_0.5.3   ggplot2_3.3.5
## 
## loaded via a namespace (and not attached):
##  [1] nortest_1.0-4     tidyselect_1.1.1  xfun_0.24         bslib_0.2.5.1    
##  [5] purrr_0.3.4       splines_4.1.2     haven_2.4.1       lattice_0.20-45  
##  [9] carData_3.0-4     colorspace_2.0-2  vctrs_0.3.8       generics_0.1.0   
## [13] htmltools_0.5.1.1 yaml_2.2.1        mgcv_1.8-38       utf8_1.2.2       
## [17] rlang_0.4.12      jquerylib_0.1.4   pillar_1.6.4      foreign_0.8-81   
## [21] glue_1.5.0        withr_2.4.2       readxl_1.3.1      lifecycle_1.0.1  
## [25] stringr_1.4.0     cellranger_1.1.0  munsell_0.5.0     gtable_0.3.0     
## [29] zip_2.1.1         evaluate_0.14     labeling_0.4.2    knitr_1.33       
## [33] rio_0.5.26        forcats_0.5.1     curl_4.3.2        fansi_0.5.0      
## [37] highr_0.9         Rcpp_1.0.7        scales_1.1.1      jsonlite_1.7.2   
## [41] abind_1.4-5       farver_2.1.0      gridExtra_2.3     hms_1.1.0        
## [45] digest_0.6.28     stringi_1.7.5     openxlsx_4.2.3    dplyr_1.0.6      
## [49] grid_4.1.2        tools_4.1.2       goftest_1.2-3     magrittr_2.0.1   
## [53] sass_0.4.0        tibble_3.1.6      crayon_1.4.2      car_3.0-10       
## [57] pkgconfig_2.0.3   ellipsis_0.3.2    Matrix_1.3-4      data.table_1.14.0
## [61] rmarkdown_2.8     R6_2.5.1          nlme_3.1-153      compiler_4.1.2