Linear model with constant interval data

Preliminaries

Is there a relationship between plant growth rate and soil moisture content?

For this exercise you will need the dataset plant_growth_rate.csv. This should be in your RStuff/data folder. If not, get it from the edenR/data folder in the Postgrad. Research Methods notebook on Teams.

Suppose in an experiment we have measured plant growth rates on soils of varying moisture content. We wish to see if there is a dependence of growth rate on moisture content. If this dependence is linear, we want to know by how much growth rate changes for each unit (ie by 1 unit of whatever it is measured in) change in soil moisture content.

Start a new notebook

Start a new R Notebook and save it into your RStuff/scripts folder. Delete everything beneath the yaml

The `yaml` section

What is the yaml? It is the section at the very top of the notebook, topped and tailed by three dashes ---. Remember that you should never delete it entirely but can either ignore it (everything will still work if you do) or add to it in all sorts of ways. Things that you do to the yaml will mostly affect what kind of output you get and what it looks like once you render your .rmd script. So, you only need to worry about the yaml if that is something you want to be finicky about.

At the very least though, we might want to alter the yaml section to include author: <your name> and date: < the date> lines. Also, change the title to something sensible.

The set-up chunk

Next, include this set-up chunk - every one of your scripts will include this, or a version of it:

```{r,echo=FALSE}
knitr::opts_chunk$set(message=FALSE,warning=FALSE)
```

For this chunk we have used the chunk option echo=FALSE since we prefer not to see it in any rendered version of the script that we might make. This set-up chunk allows you to set chunk options that will apply to all the other chunks, so that you do not have to write them in for each chunk. If you need to override these general settings, then you can just write in the option you want in the chunks that need them.

A list of the most commonly used chunk options can be found on the R Markdown cheat sheet, and a comprehensive of options is available here.

Here, we are suppressing warnings and messages from appearing in our knitted document.

From here on, include all the code chunks presented, but remember to top them with ```{r} and and tail them with ```, as shown to you in the chunk above. Run each chunk as you go, to make sure they work. Remember that a quick way to type the top and tail is Ctrl-Alt-I, or Cmd-Alt-I if you are using a Mac.

Load packages

library(tidyverse)
library(here)
library(ggfortify) # you may need to install this one - do it in the console window.
library(kableExtra)

Read in the data

plant_gr<-read_csv(here("data","plant_growth_rate.csv"))
glimpse(plant_gr)

How many variables (columns) are there? How many observations (rows)? Is the data tidy and do you remember what that term means in this context? If not, go back to the tidying data exercise.

Exploratory plot

Once the data is in the shape we want it, we can plot it.

Given the nature of the data, a plot of it requires a scatter plot. Let’s do a rough and ready one using ggplot2:

plant_gr %>%
  ggplot(aes(x=soil.moisture.content,y=plant.growth.rate)) +
  geom_point()+
  #xlim(0,2)+
  #ylim(0,60)+
  ylab ("Plant growth rate (mm/week)")+
  theme_bw()

Does this look sensible? Meaning, is it roughly what you expect, given what you know about the topic?
Does it seem plausible that there is a linear relationship between moisture content and growth rate? This might not even plausibly be true, but if it is, then we might consider investigating how to characterise that relationship. Eh? A line has an intercept and a gradient. What values do these have?
Can you estimate the intercept and gradient from this plot?

Remember that it is rare for a linear model to completely capture the relationship between two variables. There may well be other variables that affect the variation of one or other of them and also there will be natural and/or measurement variation. However a linear model is simple to implement and in many cases will capture enough of the releationship between two variables for it to be useful.

Fit a linear model to the data

To find the actual gradient and intercept, we can fit a linear model to the data We use the function lm() to do this, and save the result to an object called model_pgr

model_pgr<-lm(plant.growth.rate~soil.moisture.content,data=plant_gr)

Check assumptions

The linear model requires that

there is no systematic departure from linearity in the data
the residuals are normally distributed
there is equal variance (scatter) over all predicted values of the response variable.

Its outputs can also be unduly influenced by outliers in the data set, which may have undue influence, or ‘leverage’.

We need to check for all these things. The autoplot() function from the ggfortify package is a handy way to do all these things in one go. It uses as its first argument the model_pgr object we have just created.

autoplot(model_pgr,smooth.colour=NA) + theme_bw()

top left: there does not appear to be any systematic variation in the residuals
top right: This is a “quantile-quantile” or qq-plot. The dots are the residuals and the dotted line is the expectation under normality.
bottom left: This examines the assumption of equal variance. There is no obvious pattern here.
bottom right: This examines whether any individual points have unde ‘leverage’.

On all counts, this dataset looks good to go.

Interpretation of the model

anova(model_pgr)

The F value here tells us the ratio of variance explained by the explanatory variable to the leftover variance. It is very large, and leads to the tiny p value we get. Remember that the p-value is the probability that you would get an F-value as big or bigger than the one we got if the null hypothesis (that there is no relationship between plant growth rate and soil moisture content) were true. We infer that our large F value is very unlikely to be due to chance.

summary(model_pgr)

## 
## Call:
## lm(formula = plant.growth.rate ~ soil.moisture.content, data = plant_gr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9089 -3.0747  0.2261  2.6567  8.9406 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             19.348      1.283   15.08   <2e-16 ***
## soil.moisture.content   12.750      1.021   12.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.019 on 48 degrees of freedom
## Multiple R-squared:  0.7648, Adjusted R-squared:  0.7599 
## F-statistic: 156.1 on 1 and 48 DF,  p-value: < 2.2e-16

From this output we see the estimate of the intercept and gradient. The p-values for each give us the probability that we would have seen values as big as these if the null hypothesis had been true, under which both would be equal to zero.

The adjusted R² value gives the proportion of the variation in the dependent variable (plant growth rate) that is explained by variation in the independent variable (soul moisture content). Thus it is always between 0 and 1. A value of 0.76 is quite high for ecological data.

What kind of relationship between plant growth rate and soil moisture content do you think would give adjusted R² values of 0 or 1?

(The ‘adjusted’ term here refers to the fact that R² tends to rise if more explanatory variables are used, so that one needs to compensate for that when choosing between different models. Use of the ‘adjusted’ R² does that.)

How would you report this result?

Back to the plot

We can add a regression line to our plot:

plant_gr %>% 
  ggplot(aes(x=soil.moisture.content,y=plant.growth.rate))+
  geom_point()+
  geom_smooth(method="lm",se=FALSE)+
  #xlim(0,2)+
  #ylim(0,60)+
  ylab ("Plant growth rate (mm/week)")+
  theme_bw()

Exercises

Start a new R Notebook, and save it in your scripts folder as regression2.Rmd

We will practise fitting linear models to different data sets, where here we will always have just one explanatory variable ie y ~ x₁ rather than y ~ x₁ + x₂ + …., let alone anything more complicated.

In each case we will first plot the data to see if there plausibly is a linear relationship between the variables, then if so we will fit a linear model. The null hypothesis of this model is that there is no relationship between x and y.

We will use the anova(our.model) command to find out from the F value and p value for this if there is evidence to reject the null hypothesis.

If so, we then go on to find the estimated intercept and gradient of the best-fit line. The p values for these allow us to reject (or not) the null hypothesis that both intercept and gradient are zero.

Finally, we will see from the R² value how much of the variation in y is explained by variation in x

knitr::opts_chunk$set(message=FALSE,warning=FALSE)

library(tidyverse)
library(here)
library(ggfortify)

Tree data: volume vs height

filepath<-here("data","trees.csv")
trees<-read_csv(filepath)

trees %>%
  ggplot(aes(x=HEIGHT,y=VOLUME))+
  geom_point()+
  geom_smooth(se=FALSE,method='lm',fullrange=TRUE)+
  scale_x_continuous(limits=c(60,90))+
  scale_y_continuous(limits=c(0,80))+
  labs(x='Height (m)',y=expression(paste('Volume (',m^3,')')))+
  theme_bw()

Linear model for volumes as a function of height

trees.model<-lm(VOLUME~HEIGHT,data=trees)

Check the validity of the model

autoplot(trees.model) + theme_bw()

The qq-plot is more or less linear, but there is some systematic variation in the residuals. We proceed with caution!

ANOVA

To find the overall significance of the linear model

anova(trees.model)

Summary

To find the intercept, gradient and R² values

summary(trees.model)

## 
## Call:
## lm(formula = VOLUME ~ HEIGHT, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.274  -9.894  -2.894  12.068  29.852 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.1236    29.2731  -2.976 0.005835 ** 
## HEIGHT        1.5433     0.3839   4.021 0.000378 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.4 on 29 degrees of freedom
## Multiple R-squared:  0.3579, Adjusted R-squared:  0.3358 
## F-statistic: 16.16 on 1 and 29 DF,  p-value: 0.0003784

We see here a strong relationship (as denoted by the t and F values, and hence the p-values) but also a sloppy relationship, as denoted by the low R² value. Much of the variability of volume is not accounted for by tree height. So while tree height definitely does seem to affect the volume of wood, so also must other factors.

Conclusions from a regression analysis

You can end up with one of four possibilities:

strong relationship with little scatter
strong relationship with lots of scatter
weak relationship with little scatter
weak relationship with lots of scatter

Let us consider two of these:

A strong relationship with little scatter

We look at some seed data where we compare seed weight with plant density.

filepath<-here("data","seeds.csv")
seeds<-read_csv(filepath)

seeds %>%
  ggplot(aes(x=PLANDEN,y=SEEDWGHT))+
  geom_point()+
  geom_smooth(se=FALSE,method='lm',fullrange=TRUE)+
  scale_x_continuous(limits=c(50,200))+
  scale_y_continuous(limits=c(150,300))+
  labs(x='Plant density',y='Seed weight')+
  theme_bw()

seeds.model<-lm(SEEDWGHT~PLANDEN,data=seeds)

autoplot(seeds.model) + theme_bw()

Not looking too bad

anova(seeds.model)

summary(seeds.model)

## 
## Call:
## lm(formula = SEEDWGHT ~ PLANDEN, data = seeds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.0581  -7.5618   0.0293   6.3702  17.1909 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 311.89777    8.57377   36.38  < 2e-16 ***
## PLANDEN      -0.68773    0.06515  -10.56 3.85e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.731 on 18 degrees of freedom
## Multiple R-squared:  0.8609, Adjusted R-squared:  0.8532 
## F-statistic: 111.4 on 1 and 18 DF,  p-value: 3.855e-09

We have a low p-value (\(F_{1,18} = 111.45, p<0.0005\)) implies that PLANDEN affects seed weight. The high \(R^2\), means that we have a tight relationship - PLANDEN appears to be the main determinant of seed weight.

Weak relationship with lots of noise

Here we look at the maths and english scores of some students. How much of the maths score can be explained by the english score?

filepath<-here("data","scores.csv")
scores<-read_csv(filepath)

scores %>%
  ggplot(aes(x=ESSAYS,y=MATHS))+
  geom_point()+
  geom_smooth(se=FALSE,method='lm',fullrange=TRUE)+
  scale_x_continuous(limits=c(50,90))+
  scale_y_continuous(limits=c(50,90))+
  labs(x='Essay score',y='Maths score')+
  theme_bw()

There plausibly is a linear relationship here, but it is hard to tell with so few data

scores.model<-lm(MATHS~ESSAYS,data=scores)

autoplot(scores.model) + theme_bw()

Not great, but again, hard to tell with so few data.

anova(scores.model)

summary(scores.model)

## 
## Call:
## lm(formula = MATHS ~ ESSAYS, data = scores)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4714  -3.8226   0.9571   3.8321  13.9810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  27.5667    22.2630   1.238   0.2507  
## ESSAYS        0.6548     0.3154   2.076   0.0715 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.14 on 8 degrees of freedom
## Multiple R-squared:  0.3502, Adjusted R-squared:  0.2689 
## F-statistic: 4.311 on 1 and 8 DF,  p-value: 0.07153

The plot suggests there may be some trend, but there is not much evidence for this (\(F_{1,8} = 4.31, p=0.072\)) and only 35% of all variance has been explained. A low \(t\) and F-ratio indicates that the evidence of a relationship is poor and that little variability is explained by fitting the line.

Small datasets and pet theories

Lastly, some data comparing some aspect of two species of rodent.

filepath<-here("data","rodents.csv")
rodents<-read_csv(filepath)

rodents %>%
  ggplot(aes(x=SPECIES1,y=SPECIES2))+
  geom_point()+
  geom_smooth(se=FALSE,method='lm',fullrange=TRUE)+
  scale_x_continuous(limits=c(2,10))+
  scale_y_continuous(limits=c(6,12))+
  labs(x='Species 1',y='Species 2')+
  theme_bw()

rodents.model<-lm(SPECIES2~SPECIES1,data=rodents)

anova(rodents.model)

summary(rodents.model)
## 
## Call:
## lm(formula = SPECIES2 ~ SPECIES1, data = rodents)
## 
## Residuals:
##        1        2        3        4        5 
##  0.59740  1.93506 -0.96104 -0.06494 -1.50649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  14.5195     1.7731   8.189  0.00381 **
## SPECIES1     -0.7792     0.2811  -2.773  0.06942 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.56 on 3 degrees of freedom
## Multiple R-squared:  0.7193, Adjusted R-squared:  0.6257 
## F-statistic: 7.687 on 1 and 3 DF,  p-value: 0.06942

72% of the variance is explained, but we fail to achieve significance. This can happen with small datasets - a high proportion of the variance in the dataset is explained, but owing to the small sample size there is insufficient evidence to draw firm conclusions about relationships in the population.

A p-value near significance might merit further investigation, perhaps by gathering more data, but as it stands we could not reject the null hypothesis of there being no relationship in the distribution of the two species.

The plot suggests that if there is any relationship, it may be non-linear in nature!

Multivariate example

Calculation of parameters for the statistical linear model proposed to predict the volume of trees as function of their diameter and height.

Create a new notebook wit the usual two first chunks:

```{r global-options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, warning=FALSE, message=FALSE)
```

library(tidyverse)
library(here)
library(car)  
library(cowplot)
library(gridExtra)
library(ggfortify)

#Reading the data

filepath<-here("data","TreeVol.csv")
trees_alom <- read_csv(filepath)
names (trees_alom)

## [1] "diameter" "height"   "volume"

Fitting the model (or estimating the parameters \(beta_j\) of model):

\[\text{volume}_i= \beta_0 + \beta_1\times \text{diameter}_i + \beta_2 \times \text{height}_i + e_i\]

The car library allows the calculation of Type II Sum of Squares; and without regard to the order in which terms are included in the model

Please notice that Anova(), anova(), and aov() can all do analysis of variance, but they use different approaches, and their output can lead to very different interpretations. We use all three here (to show that they can all be useful), but for an introduction to some of the issues, see Fox, J. (2008) Applied Regression Analysis and Generalized Linear Models, Second Edition. Sage; or Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition, Sage. John Fox wrote and maintains the car library.

vol_dh <- lm (volume ~ diameter + height, data = trees_alom)
summary (vol_dh)

## 
## Call:
## lm(formula = volume ~ diameter + height, data = trees_alom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2224 -1.3994 -0.0429  0.8784  5.7161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.8077     5.6488  -6.339 7.41e-06 ***
## diameter     34.0993     3.9788   8.570 1.41e-07 ***
## height        0.3204     0.0754   4.249 0.000541 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.121 on 17 degrees of freedom
## Multiple R-squared:  0.8837, Adjusted R-squared:   0.87 
## F-statistic: 64.58 on 2 and 17 DF,  p-value: 1.143e-08

Anova (vol_dh)

Checking if we can fit a simpler model:

The high significance of the three coefficients suggests to retain both predictor (independent) variables. The ANOVA also suggests a significant effect of all the predictor variables. However, to be sure, let’s eliminate each variable stepwise.

vol_D <- update (vol_dh, ~ . -height)
summary (vol_D)
## 
## Call:
## lm(formula = volume ~ diameter, data = trees_alom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4943 -1.7973  0.0089  0.7810  8.4323 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.235      4.994  -3.451  0.00285 ** 
## diameter      39.630      5.247   7.553 5.51e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.959 on 18 degrees of freedom
## Multiple R-squared:  0.7601, Adjusted R-squared:  0.7468 
## F-statistic: 57.04 on 1 and 18 DF,  p-value: 5.506e-07
anova (vol_D)

vol <- update (vol_D, ~ . -diameter)
summary (vol)
## 
## Call:
## lm(formula = volume ~ 1, data = trees_alom)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9.95  -2.40   0.30   2.85  13.65 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.150      1.315   15.32 3.79e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.882 on 19 degrees of freedom
anova (vol)

vol_H <- update (vol_dh, ~ . -diameter)
summary (vol_H)
## 
## Call:
## lm(formula = volume ~ height, data = trees_alom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5898 -2.9069 -0.3852  1.6887 10.2010 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -19.3368    11.9073  -1.624  0.12177   
## height        0.5318     0.1597   3.329  0.00373 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.754 on 18 degrees of freedom
## Multiple R-squared:  0.3811, Adjusted R-squared:  0.3468 
## F-statistic: 11.09 on 1 and 18 DF,  p-value: 0.00373
anova (vol_H)

To see how the AIC changes when deleting variables:

The Akaiki Information Criterion is a measure of the amount of information lost by a model.

We can calculate it for the various models we have tried and see which one has the lowet AIC. That would be our choice of best model.

AIC (vol_dh, vol_D, vol_H, vol)

Clearly, the saturated model has the lower AIC. We recall the anova table for the model we keep which is the first model fitted that includes both diameter and height as predictors

anova (vol_dh)

Notice that a more complex model can be proposed that considers an interaction (multiplicative effect) between both predictor (independent) variables.

vol_dh.interac <- lm (volume ~ diameter * height, data = trees_alom)
summary (vol_dh.interac)

## 
## Call:
## lm(formula = volume ~ diameter * height, data = trees_alom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2243 -1.1416  0.0915  0.8044  5.1193 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -6.78368   38.55627  -0.176    0.863
## diameter         4.31176   39.33948   0.110    0.914
## height          -0.09712    0.55381  -0.175    0.863
## diameter:height  0.42774    0.56193   0.761    0.458
## 
## Residual standard error: 2.147 on 16 degrees of freedom
## Multiple R-squared:  0.8877, Adjusted R-squared:  0.8667 
## F-statistic: 42.18 on 3 and 16 DF,  p-value: 7.984e-08

Anova (vol_dh.interac)

The effect of an interaction term is not significant. Deleting the interaction term takes us back to the previous chosen model