Linear Regression I - Ordinary Least Squares

Author

Dr Andrew Dalby

Introduction

This is another post about conventional wisdom and why we are actually doing it wrong.

I have taught statistics for 25 years, and for almost all of that time I have been doing it wrong, or at least some of the time. I teach undergraduates in the life sciences and I do not want to go into the technical aspects of statistics; it is much more a practical course.

When I cover linear regression the emphasis is on the diagnostics of whether you actually have a good model. What I have ignored is that there are different types of linear regression: we stick to Ordinary Least Squares (OLS) and ignore the rest. Now that I have more time I want to go back and do linear regression properly.

I still want it to be accessible to life scientists, and I want to avoid too much pedantry, but I want it to be done correctly.

This first part covers ordinary least squares. My first problem was finding data that actually meets the requirements for OLS. Almost everyone uses this method for their data, but a lot of the time it isn't the right method to use. If you look at the top 10 public datasets for linear regression, none of them should be analysed with OLS! Yet almost everyone who downloads them will use this method.

Why did I only cover OLS before? We use SPSS for teaching, and it does not make other methods easy to use; you can, but it gets a bit technical. With Excel it is even worse. I generally try to avoid Excel for anything, but it does do OLS pretty well, in a way that avoids needing to know anything technical; however, it lacks the diagnostic tools.

Now I am using R, and in R I can use all of the linear regression tools, although even R does not highlight that OLS is not the only possibility, and is in many cases the wrong one.

When Should I Use Ordinary Least Squares?

This is the key point that I have ignored in teaching up until now. I have just used the same linear regression model for all data that I wanted to fit a regression line to. This is WRONG.

Ordinary least squares should only be used with data where the x-values are known and measured without error, and where the y-values are predicted by the x-values.
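
The reason is that OLS minimises only the vertical distances from the points to the fitted line, so all of the error is assumed to be in y. As a minimal sketch, with made-up data, the closed-form OLS estimates can be computed directly and agree with lm():

# A minimal sketch of the OLS closed-form estimates, using made-up data.
# OLS minimises the sum of squared vertical residuals, which gives
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
x <- c(1, 2, 3, 4, 5)             # x-values, assumed to be error-free
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # y-values, measured with error
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
c(intercept = intercept, slope = slope)
coef(lm(y ~ x))                   # the same estimates from lm()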

This is the case in most experiments, where we set a series of treatments and then measure the effect of these treatments on an outcome variable.

A simple example of when you should use OLS is a standard curve, for example in gel electrophoresis, where you have a ladder of known molecular weights and you measure the distance that they travel. There will be error/uncertainty in the distance travelled, but the molecular weights are known exactly.

In contrast, in an observational study there will usually be uncertainty in the measurements on both axes. In that case you should use major (principal) axis regression instead.

An example of this would be the original use of regression by Galton on children's and parents' heights.
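
As a sketch of what that looks like in R: the lmodel2 package (one option; smatr is another) fits major axis regression alongside OLS. The heights below are invented purely for illustration and are not Galton's data.

# A sketch of major axis (model II) regression, assuming the lmodel2
# package is installed: install.packages("lmodel2").
# These heights are invented for illustration, not Galton's data.
library(lmodel2)
parent <- c(64, 65, 66, 67, 68, 69, 70, 71, 72)
child <- c(65.1, 64.8, 66.7, 67.2, 67.9, 69.5, 69.8, 70.9, 72.3)
heights <- data.frame(parent, child)
fit.ma <- lmodel2(child ~ parent, data = heights)
fit.ma  # reports OLS, major axis (MA) and standardised major axis (SMA) fits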

If you have a very specific cohort design asking a very specific question, so that you have removed all possible variation from the x-axis, then you might be justified in using OLS for an observational study, but I am struggling to think of an example as I write this.

Examples of Standard Curves

For the first example I am going to fit a regression model to a standard curve for the concentration of zinc ions.

The table of data is as follows:

Zinc        0      1      2      3      4      5      6      7      8      9      10
Absorbance  0.000  0.082  0.174  0.257  0.340  0.408  0.463  0.511  0.543  0.561  0.575

I can put this data into R and then plot the scatterplot of the data and fit a line to the plot.

library(ggplot2)

# Enter the standard curve data
Zinc <- c(0:10)
Absorbance <- c(0, 0.082, 0.174, 0.257, 0.340, 0.408, 0.463, 0.511, 0.543, 0.561, 0.575)
standard <- data.frame(Zinc, Absorbance)

# Scatterplot of the data with a fitted OLS line
ggplot(standard, aes(x = Zinc, y = Absorbance)) +
  geom_point() +
  geom_smooth(method = lm)

This is clearly not linear and so I cannot fit a linear regression. I need to transform the data to make it linear. To do that I am going to take the logarithm of the concentration. Note that you need to remove the concentration 0 value as you cannot calculate the logarithm of 0.

library(dplyr)
library(ggplot2)

# Add the log-transformed concentration
standard <- standard %>%
  mutate(logZn = log(Zinc))

# log(0) is -Inf, so replace infinite values with NA
standard[sapply(standard, is.infinite)] <- NA

ggplot(standard, aes(x = logZn, y = Absorbance)) +
  geom_point() +
  geom_smooth(method = lm)

It is still not as linear as I would like, and has a slight S-shape, but there is not enough data to fit a non-linear model.

To fit the linear regression in R you use the linear model (lm) function.

fit.standard <- lm(Absorbance ~ logZn, standard)

This creates an R object containing not only the regression model but also the diagnostics for the model. You can then display the model and plot the graphs of the diagnostics. Here I have just plotted the residuals and not the other plots.

summary(fit.standard)

Call:
lm(formula = Absorbance ~ logZn, data = standard)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.038272 -0.018443  0.002318  0.015789  0.043163 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.03884    0.01961    1.98    0.083 .  
logZn        0.23342    0.01179   19.79 4.42e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02593 on 8 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:   0.98, Adjusted R-squared:  0.9775 
F-statistic: 391.7 on 1 and 8 DF,  p-value: 4.422e-08

plot(fit.standard$residuals)
abline(h = 0, col = "red", lty = 3)
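
If you do want the other plots, the plot method for lm objects produces the four standard diagnostic plots directly:

# The plot method for lm objects gives the four standard diagnostics:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
par(mfrow = c(2, 2))
plot(fit.standard)
par(mfrow = c(1, 1))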

The model has a high R-squared value, indicating a good fit to the data. The coefficient for logZn is highly significant, and the log of concentration has a significant effect on absorbance.

The residuals show a clear S-shaped curve, suggesting that the linear model is still not the best fit.
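
One way to check this impression is to add a quadratic term and compare the nested models with an F-test; this is an exploratory sketch rather than something I would normally report for a standard curve.

# Sketch: does a quadratic term in logZn improve on the linear fit?
fit.quad <- lm(Absorbance ~ logZn + I(logZn^2), data = standard)
anova(fit.standard, fit.quad)  # F-test comparing the nested models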

You should also report the confidence intervals for the model parameters, not just the point estimates.

library(broom)
confint(fit.standard)
                   2.5 %     97.5 %
(Intercept) -0.006383623 0.08405794
logZn        0.206222143 0.26061209
tidy(fit.standard)
# A tibble: 2 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)   0.0388    0.0196      1.98 0.0830      
2 logZn         0.233     0.0118     19.8  0.0000000442
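
The point of a standard curve is to estimate an unknown concentration from a measured absorbance, which means inverting the fitted model. As a sketch, using a hypothetical measured absorbance of 0.3:

# Sketch: invert Absorbance = b0 + b1 * log(Zinc) to estimate an
# unknown concentration. The absorbance of 0.3 is hypothetical.
b <- coef(fit.standard)
A <- 0.3
exp((A - b[[1]]) / b[[2]])  # estimated zinc concentration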

As a second example I have a standard curve from gel electrophoresis of DNA, where the sizes of the DNA molecules in the ladder are known.

DNA size (bp)  3000  2500  2000  1500  1000  750  500  250
Distance (mm)  10    12    13    15    17    19   21   26

library(ggplot2)
DNA.size <- c(3000,2500,2000,1500,1000,750,500,250)
Distance <- c(10,12,13,15,17,19,21,26)
ladder <- data.frame(DNA.size,Distance)
ggplot(ladder, aes(x=DNA.size, y=Distance))+
  geom_point()+
  geom_smooth(method=lm)

Again this is not linear and we need to take the logarithm of the DNA fragment size.

library(dplyr)
library(ggplot2)

# Add the log-transformed fragment size
ladder <- ladder %>%
  mutate(logDNA = log(DNA.size))

ggplot(ladder, aes(x = logDNA, y = Distance)) +
  geom_point() +
  geom_smooth(method = lm)

That is a much better linear fit.

library(broom)
fit.ladder <- lm(Distance ~ logDNA, ladder)
summary(fit.ladder)

Call:
lm(formula = Distance ~ logDNA, data = ladder)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46061 -0.25991  0.04787  0.28169  0.43531 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.6722     1.1433   52.19 3.32e-09 ***
logDNA       -6.1487     0.1623  -37.90 2.25e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3667 on 6 degrees of freedom
Multiple R-squared:  0.9958,    Adjusted R-squared:  0.9951 
F-statistic:  1436 on 1 and 6 DF,  p-value: 2.254e-08
tidy(fit.ladder)
# A tibble: 2 × 5
  term        estimate std.error statistic       p.value
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)    59.7      1.14       52.2 0.00000000332
2 logDNA         -6.15     0.162     -37.9 0.0000000225 
confint(fit.ladder)
                2.5 %    97.5 %
(Intercept) 56.874624 62.469863
logDNA      -6.545703 -5.751657
ladder.post <- augment(fit.ladder, ladder)
ggplot(ladder.post, aes(logDNA, .resid)) + 
  geom_point() + 
  geom_hline(yintercept = 0, color = "red", linetype='dashed') + 
  labs(y = "Residuals", title = "Residual plot for the gel migration model.")

ggplot(ladder.post, aes(sample = .std.resid)) + 
  geom_qq() +
  geom_qq_line(color="#4b0082")

In this case the residuals do not show any obvious pattern, and the QQ-plot of the residuals shows that they are normally distributed. The confidence intervals are also quite narrow, and the interval for the gradient is far away from 0.
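
And again, the reason for fitting this curve is to estimate the size of an unknown fragment from its migration distance by inverting the model. As a sketch, using a hypothetical distance of 18 mm:

# Sketch: invert Distance = b0 + b1 * log(DNA.size) to estimate the
# size of an unknown fragment. The 18 mm distance is hypothetical.
b <- coef(fit.ladder)
d <- 18
exp((d - b[[1]]) / b[[2]])  # estimated fragment size in base pairs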