A series of experiments was conducted in ventilated, temperature-controlled growth chambers. Three temperature regimes (low, medium and high) were used to simulate the diurnal maximum-minimum temperatures that may occur during the fall in the California annual grasslands. Plants were harvested periodically, dried and weighed. The experiment was reported in: Guerrero, F.P. and Williams, W.A. 1975. Influence of temperature on the growth of Erodium botrys and Trifolium subterraneum. Crop Science 15:553-556.
IGNORE THE TEMPERATURE TREATMENTS FOR NOW!
The main goal of the experiment (for the purpose of this example) is to determine the relative growth rate of filaree. A second goal is to estimate mean log(plant mass) for plants of age 10, 22 and 40 days, and to predict individual log(plant mass) for a plant aged 10, 22 or 40 days. The final goal is to make a publication-quality graph showing the scatterplot, fitted line, and confidence and prediction regions.
When the log of mass of plants is regressed on plant age (or length of growth period), the slope of the line equals the relative growth rate during the exponential growth phase.
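This follows from the exponential growth model itself; a brief sketch, with \(W_t\) for plant mass at age \(t\), \(W_0\) for initial mass and \(r\) for the relative growth rate (notation added here, not from the original):
\[W_t = W_0 \, e^{r t} \quad \Rightarrow \quad \ln W_t = \ln W_0 + r \, t\]
so regressing \(\ln W_t\) on \(t\) gives an intercept that estimates \(\ln W_0\) and a slope that estimates \(r\).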
Although we can do simple linear regression and most of the testing, predictions and estimations with the base packages of R, we use several packages to better format the output and to produce publication-quality graphs. tidyverse:: gives access to most packages for tidy computing and coding. Specifically we use:
magrittr::%>%, readr::read_csv(), dplyr::mutate(), dplyr::select(), dplyr::bind_cols(), car::scatterplot(), performance::check_model(), knitr::kable(), kableExtra::kable_styling(), ggplot2::ggplot(), ggplot2::geom_line() and ggplot2::geom_point().
# Load packages
library(tidyverse)
library(performance)
library(car)
library(knitr)
library(kableExtra)
The data are in the file filaree.csv. This file is also in the corresponding folder of Section3 in the SLR topic of our RStudio Cloud workspace. The variables (columns) in the data are: temp (temperature regime: low, medium or high), days (plant age or length of growth period, in days) and ln_wt (natural log of plant mass).
# Read data
filaree <- read_csv("filaree.csv")
## Rows: 45 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): temp
## dbl (2): days, ln_wt
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(filaree)
## spec_tbl_df [45 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ temp : chr [1:45] "low" "low" "low" "low" ...
## $ days : num [1:45] 9 9 9 15 15 15 22 22 22 30 ...
## $ ln_wt: num [1:45] 2.76 2.83 2.81 3.41 3.66 ...
## - attr(*, "spec")=
## .. cols(
## .. temp = col_character(),
## .. days = col_double(),
## .. ln_wt = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The structure of the data frame shows that temp is a vector of characters, and days and ln_wt are vectors of numeric (double) values.
Always start with an exploration of the data. This will detect many mistakes that happen during data entry or formatting. It also gives us a general idea of what is going on. It is important NOT to use this step to generate hypotheses to be tested with the same data. If you do that, all your p-values will be more meaningless than usual.
# Explore data
summary(filaree)
## temp days ln_wt
## Length:45 Min. : 7.0 Min. :2.763
## Class :character 1st Qu.:14.0 1st Qu.:3.661
## Mode :character Median :22.0 Median :4.697
## Mean :22.4 Mean :4.425
## 3rd Qu.:30.0 3rd Qu.:5.250
## Max. :37.0 Max. :5.895
plot(ln_wt ~ days,
data = filaree
) # Note: several plants measured each date
scatterplot(ln_wt ~ days, data = filaree) # Some bells and whistles from car::
The plots show that several plants were measured each day. Measurements in this experiment were destructive; thus, each plant was measured only once. If the same plant had been measured more than once, we would need a mixed model, which is the subject of later topics (LMM1 and LMM2).
A simple linear regression has one quantitative (interval) response or dependent variable and one quantitative predictor or independent variable (also known informally as an x-variable). This model can be written in two different ways that mean the same thing:
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \\ \epsilon_i \sim N(0,\sigma^2)\]
and
\[Y_i \sim N(\mu_i, \sigma^2) \\ \mu_i = \beta_0 + \beta_1 X_i \]
Both of these imply that the distributions of Y for any two values of X are independent, normal and have equal variance. The model has three unknown parameters: the intercept \(\beta_0\), the slope \(\beta_1\) and the error variance \(\sigma^2\).
We use data to estimate those parameters. Always keep in mind that the numbers you see in the output are estimates. Estimates themselves are random variables: if we get another sample, the estimated values will be different. For estimated parameters we use symbols with hats: \(\hat{\beta}_0\), \(\hat{\beta}_1\) and \(\hat{\sigma}^2\).
# Fit simple linear regression model
flr_m1 <- lm(ln_wt ~ days, # R philosophy: nothing shown unless requested
data = filaree
)
Note that R did exactly as asked: it fitted the model and put the result into a new object. In order to see anything, you have to request it directly.
Before we look at the results and get too excited, we should check whether we are meeting the assumptions of the model. It is VERY disappointing to find the results we wanted and later find out that assumptions are violated and the “beautiful” results are invalid! And it is DANGEROUS, because true to our human nature, we will do everything possible to validate the invalid results.
We check the assumptions here to be thorough, but the subject is fully developed in Topic 5: Beyond Linearity, Homoscedasticity and Independence.
# Check assumptions
check_model(flr_m1) # Comment but ignore departures
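If you prefer base R, the plot() method for lm objects produces a comparable set of diagnostic plots (residuals vs. fitted values, normal Q-Q, scale-location and leverage); a minimal sketch:
# Base R alternative: four diagnostic plots for the fitted model
op <- par(mfrow = c(2, 2)) # arrange the plots in a 2 x 2 grid
plot(flr_m1)
par(op) # restore the previous graphics settings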
Except for the first one, the assumptions are encapsulated in the model equations. In decreasing order of importance (Gelman and Hill, 2007), they are:
Validity means that the data are appropriate for the model scope and purpose, and that all relevant predictors are in the model.
Additivity means that predictors do not interact or have multiplicative effects on the response.
Linearity means that the actual relationship is a straight line.
Additivity and linearity refer to the structural (or “signal” as opposed to “noise”) part of the model.
Independence of errors means that knowing one or many errors does not tell us anything about the specific value of another error.
Homogeneity of variance means that the variance of observations about the mean of any group is always the same. If we repeated the sampling many times and looked at the variance of each observation across samples, that variance would be the same for all observations.
Normality means that the errors come from a normal distribution, or at least that it is appropriate to model them as having a normal distribution.
In this specific regression example we detect: lack of linearity, lack of fit not fixable with “curvy” formulas (most grave), and a touch of non-normality. We ignore these for the purpose of the example. In a future section we will fully address the problems, which are completely predictable using basic principles and knowledge about the experiment.
Well, this may be a bit of a surprise, given that we are in SLR and not in ANOVA. It should not be. For now, check out what Gelman (2018) wrote:
“Analysis of variance (ANOVA) represents a set of models that can be fit to data, and also a set of methods for summarizing an existing fitted model. We first consider ANOVA as it applies to classical linear models (the context for which it was originally devised; Fisher, 1925) and then discuss how ANOVA has been extended to generalized linear models and multilevel models.”
Gelman A. (2018) Variance, Analysis Of. In: Macmillan Publishers Ltd (eds) The New Palgrave Dictionary of Economics. Palgrave Macmillan, London. https://doi.org/10.1057/978-1-349-95189-5_2402
We will explore those ideas in the ANOVA topic.
# ANOVA
(flr_m1_aov <- anova(flr_m1))
## Analysis of Variance Table
##
## Response: ln_wt
## Df Sum Sq Mean Sq F value Pr(>F)
## days 1 35.296 35.296 345.84 < 2.2e-16 ***
## Residuals 43 4.389 0.102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
kable(flr_m1_aov, # Fancier looking
digits = c(0, 2, 3, 1, 5),
caption = "Analysis of variance of a simple linear regression."
) %>%
kableExtra::kable_styling(full_width = FALSE)
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| days | 1 | 35.30 | 35.296 | 345.8 | 0 |
| Residuals | 43 | 4.39 | 0.102 | NA | NA |
The total variation in ln_wt, as represented by the corresponding total sum of squares, is partitioned into variation explained by plant age (days) and unexplained or residual variation. A brief note about error or random variation: from one point of view, there is nothing “random” about the size of any plant; it is just that some part of the variation is due to all conditions other than plant age. (If this sounds intriguing, read about Pierre-Simon de Laplace’s demon and Bruno de Finetti’s “Probability does not exist”.)
There is one degree of freedom associated with the slope of the model and 43 degrees of freedom for the residuals. The remaining degree of freedom is associated with the intercept or the overall mean, whichever you prefer. If the assumptions are correct and the slope is actually zero (plants not growing), then the ratio of the mean square of the regression (days, 35.3) to the mean square of the residuals (the best estimate of the error variance \(\sigma^2\), 0.102) should have an F distribution with 1 df in the numerator and 43 df in the denominator. The column labeled Pr(>F) shows the probability of observing an F value as large as the one observed, or larger, from such a distribution. The probability is lower than 0.05, so we reject the hypothesis that the slope is zero. Convoluted logic!!
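The F statistic and its p-value can be reproduced by hand from the printed mean squares and stats::pf(); a quick check (values copied from the table above):
# F ratio and p-value computed from the printed mean squares
35.296 / 0.102 # ratio of mean squares; ~346 (small difference due to rounding)
pf(345.84, df1 = 1, df2 = 43, lower.tail = FALSE) # essentially zero, matches Pr(>F)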
We can see our estimated coefficients with additional information by using the summary function on the model object.
# Estimated coefficients, etc.
summary(flr_m1)
##
## Call:
## lm(formula = ln_wt ~ days, data = filaree)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55511 -0.29095 -0.04595 0.26581 0.57601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.448672 0.116450 21.03 <2e-16 ***
## days 0.088223 0.004744 18.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3195 on 43 degrees of freedom
## Multiple R-squared: 0.8894, Adjusted R-squared: 0.8868
## F-statistic: 345.8 on 1 and 43 DF, p-value: < 2.2e-16
summary(flr_m1)$coefficients %>%
kable(
digits = c(3, 4, 2, 4)
) %>%
kable_styling(full_width = FALSE)
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 2.449 | 0.1164 | 21.03 | 0 |
| days | 0.088 | 0.0047 | 18.60 | 0 |
The output shows several useful estimates and quantities calculated:
The intercept is estimated to be 2.45 (in log-gram units). Its standard error is 0.116 (same units).
The slope (or relative growth rate) is 0.088 \(day^{-1}\) with SE = 0.005 \(day^{-1}\).
The t value column is the quotient of the estimate over the SE. This column and the Pr(>|t|) constitute tests of the null hypotheses that the coefficients are zero. But we were not trying to test that.
Residual standard error: 0.3195 is the square root of the MSE from the anova table and is the standard deviation of the errors.
Multiple R-squared: 0.8894 is the proportion of the total SS represented by the explained SS, i.e. 35.296/(35.296 + 4.389).
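Both of these quantities can be recovered from the ANOVA table shown earlier; a quick check:
# Residual standard error and R-squared from the sums of squares
sqrt(4.389 / 43)          # sqrt(MSE) = residual standard error, ~0.3195
35.296 / (35.296 + 4.389) # explained SS / total SS = R-squared, ~0.889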
We can get a confidence interval for the estimated coefficients by using stats::confint(). The default is a level = 0.95 confidence.
# Confidence interval for RGR
confint(flr_m1) # the line labeled "days" has the slope
## 2.5 % 97.5 %
## (Intercept) 2.21382918 2.6835153
## days 0.07865589 0.0977904
confint(flr_m1) %>%
kable(
digits = c(2, 2),
caption = "Confidence intervals for intercept and slope. The slope is an estimate of relative growth rate."
) %>%
kable_styling(full_width = FALSE)
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 2.21 | 2.68 |
| days | 0.08 | 0.10 |
This estimation answers the following question: what is the average ln_wt of a million plants of age x?
This is an estimation of a parameter. The parameter is the mean of the distribution of ln_wt for a given fixed value of x or plant age. Once X = x is given and fixed, the estimated ln_wt has a normal sampling distribution about \(\mu = \beta_0 + \beta_1 x\). The variance of this distribution depends on sample size and results from the fact that the intercept and slope are not known with certainty. If we knew those parameters with certainty, we would be able to estimate mean ln_wt for any age with certainty.
As sample size increases, the uncertainty (variance) of the estimated mean tends to zero. The variance of an estimated parameter is reducible by increasing sample size.
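For a single predictor, the standard expression for this variance makes both points explicit (a sketch in added notation: \(n\) is the sample size and \(\bar{x}\) the mean of the observed ages):
\[Var(\hat{\mu}_x) = \sigma^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right)\]
Both terms shrink as \(n\) grows, and the second term explains why the standard errors below differ with plant age: estimated means far from the average age are less certain.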
# Estimation of mean log_wt at 10, 22 and 40 days
days4est_pred <- data.frame(days = c(10, 22, 40)) # X values for prediction, etc.
predict(flr_m1, # This _estimates_ parameter mean of Y given X
newdata = days4est_pred,
interval = "confidence",
se.fit = TRUE
)
## $fit
## fit lwr upr
## 1 3.330904 3.178267 3.483541
## 2 4.389581 4.293463 4.485699
## 3 5.977598 5.783750 6.171446
##
## $se.fit
## 1 2 3
## 0.07568675 0.04766119 0.09612177
##
## $df
## [1] 43
##
## $residual.scale
## [1] 0.3194674
When the argument se.fit = TRUE is provided, the result of predict() is a list, which is evident from the $ symbols printed. Estimates are named fit and have different standard errors (se.fit), and therefore different widths of the confidence intervals. The confidence interval extremes are labeled lwr and upr. residual.scale is the square root of the MSE, i.e., the estimated standard deviation of the errors.
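The limits themselves are just fit ± t × se.fit, with the t quantile taken from a distribution with 43 df. A quick check (the object name est_10_22_40 is introduced here for convenience):
# Reconstruct the confidence limits from fit, se.fit and the t distribution
est_10_22_40 <- predict(flr_m1,
  newdata = days4est_pred,
  interval = "confidence",
  se.fit = TRUE
)
t_crit <- qt(0.975, df = est_10_22_40$df) # two-sided 95% critical value
est_10_22_40$fit[, "fit"] - t_crit * est_10_22_40$se.fit # reproduces lwr
est_10_22_40$fit[, "fit"] + t_crit * est_10_22_40$se.fit # reproduces upr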
The prediction answers the following question: here is a randomly selected plant that is x days old; guess its ln_wt just based on its age.
This is a prediction of a single value of a random variable, the ln_wt of an unobserved plant of a given age. In addition to the uncertainty about the mean of the distribution, which was present in the estimation above, we have the inherent variance of the random draw from a distribution. The value predicted is the same as the estimate, because the mean and the mode are the same for a normal distribution, but the variance of the prediction is larger than that of the estimation. Even if we knew the parameters with certainty, we would not know the exact ln_wt of a randomly selected plant (unless we sample, kill and measure it).
As sample size increases the uncertainty (variance) of the prediction tends to \(\sigma^2\), not to zero. That variance of the population of individual plants is a property of the population and you cannot change it by changing sample size. You could try to change it by controlling sources of variation in the experiment … but that would be ANOTHER population!
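In the same notation as above, the prediction variance adds the irreducible error variance to the variance of the estimated mean, which is why it cannot shrink below \(\sigma^2\) (\(\tilde{Y}_x\) denotes the predicted value; notation added here):
\[Var(\tilde{Y}_x) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right)\]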
# Prediction of log_wt of one plant at 10, 22 and 40 days
predict(flr_m1, # This _predicts_ the random value Y for a plant given X
newdata = days4est_pred,
interval = "prediction",
se.fit = TRUE
)
## $fit
## fit lwr upr
## 1 3.330904 2.668802 3.993005
## 2 4.389581 3.738183 5.040979
## 3 5.977598 5.304800 6.650396
##
## $se.fit
## 1 2 3
## 0.07568675 0.04766119 0.09612177
##
## $df
## [1] 43
##
## $residual.scale
## [1] 0.3194674
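The prediction limits can be reconstructed the same way, except that the standard error now combines se.fit with residual.scale; a quick check (the object name pred_10_22_40 is introduced here):
# Reconstruct the prediction limits: se.fit and residual.scale combined
pred_10_22_40 <- predict(flr_m1,
  newdata = days4est_pred,
  interval = "prediction",
  se.fit = TRUE
)
se_pred <- sqrt(pred_10_22_40$se.fit^2 + pred_10_22_40$residual.scale^2)
t_crit <- qt(0.975, df = pred_10_22_40$df)
pred_10_22_40$fit[, "fit"] - t_crit * se_pred # reproduces lwr
pred_10_22_40$fit[, "fit"] + t_crit * se_pred # reproduces upr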
In my experience, most people were never told about, or never thought about, the difference between estimating a parameter and making a prediction. For some reason, the difference can be tricky to understand. If you encounter difficulty, you are not alone. I also skipped that page of the book until I had to teach it. But what really brought it home was when I had to make estimations and predictions for hire. When you say something about a specific grower, orchard, cow, person, etc. based on samples that did not include that individual, you are usually a LOT LESS CERTAIN than when you say something about the populations to which those individuals belong.
A related but not identical issue is the “ecological fallacy,” whereby conclusions that hold at the population level are wrongly applied to individuals. Hopefully we will have a chance to go over this, probably when we talk about mixed models.
First we need to create a table of data with sufficient points to plot the lines. For the regression line we only need a couple of points, but the confidence and prediction regions have curved (hyperbolic) boundaries, so we need enough points to portray the curves accurately.
The predict() function needs a data frame with columns for all variables present in the right hand side of the model formula. In simple linear regression there is only one variable, which in the filaree model is called days.
# Plot results with confidence and prediction regions
days_4plot <- data.frame(days = 5:40) # X values for prediction, etc.
est_ci <- predict(flr_m1, # Calculate limits for confidence ribbon
newdata = days_4plot,
interval = "confidence"
) %>%
as.data.frame() %>%
set_names(c("ln_wt", "lo_ci", "hi_ci"))
pred_ci <- predict(flr_m1, # Calculate limits for prediction ribbon
newdata = days_4plot,
interval = "prediction"
) %>%
as.data.frame() %>%
dplyr::select(lwr, upr) %>%
set_names(c("lo_pi", "hi_pi"))
est_mass_4plot <- days_4plot %>%
bind_cols(est_ci) %>%
bind_cols(pred_ci) %>%
as_tibble()
est_mass_4plot
## # A tibble: 36 × 6
## days ln_wt lo_ci hi_ci lo_pi hi_pi
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 2.89 2.70 3.08 2.22 3.56
## 2 6 2.98 2.79 3.16 2.31 3.65
## 3 7 3.07 2.89 3.24 2.40 3.73
## 4 8 3.15 2.99 3.32 2.49 3.82
## 5 9 3.24 3.08 3.40 2.58 3.91
## 6 10 3.33 3.18 3.48 2.67 3.99
## 7 11 3.42 3.27 3.56 2.76 4.08
## 8 12 3.51 3.37 3.65 2.85 4.17
## 9 13 3.60 3.46 3.73 2.94 4.25
## 10 14 3.68 3.56 3.81 3.03 4.34
## # … with 26 more rows
Finally, we use the data to make the plot.
ggplot(
data = est_mass_4plot,
aes(x = days)
) +
geom_ribbon(aes(
ymin = lo_ci,
ymax = hi_ci
),
fill = "skyblue",
alpha = .75
) +
geom_ribbon(aes(
ymin = lo_pi,
ymax = hi_pi
),
fill = "skyblue",
alpha = .4
) +
geom_line(aes(y = ln_wt)) +
geom_point(
data = filaree,
aes(
y = ln_wt,
x = days
)
)
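The figure above still has the default axis labels. One possible step toward the publication-quality figure listed in the goals is to add informative labels and a cleaner theme; a sketch (the label text and theme choice are suggestions, not from the original analysis):
# Possible cosmetic refinements for publication
last_plot() +
  labs(
    x = "Plant age (days)",
    y = "ln(plant mass)"
  ) +
  theme_classic()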