This exercise is partly taken from Beckerman, Childs and Petchey, Chapter 5
For this exercise you will need the dataset plant_growth_rate.csv. This should be in your RStuff/data folder. If not, get it from the edenR/data folder in the Postgrad. Research Methods notebook on Teams.
Suppose in an experiment we have measured plant growth rates on soils of varying moisture content. We wish to see if there is a dependence of growth rate on moisture content. If this dependence is linear, we want to know by how much growth rate changes for each unit (ie by 1 unit of whatever it is measured in) change in soil moisture content.
Start a new R Notebook and save it into your RStuff/scripts folder. Delete everything beneath the yaml
yaml sectionWhat is the yaml? It is the section at the very top of the notebook, topped and tailed by three dashes ---. Remember that you should never delete it entirely but can either ignore it (everything will still work if you do) or add to it in all sorts of ways. Things that you do to the yaml will mostly affect what kind of output you get and what it looks like once you render your .rmd script. So, you only need to worry about the yaml if that is something you want to be finicky about.
At the very least though, we might want to alter the yaml section to include author: <your name> and date: < the date> lines. Also, change the title to something sensible.
Next, include this set-up chunk - every one of your scripts will include this, or a version of it:
```{r,echo=FALSE}
knitr::opts_chunk$set(message=FALSE,warning=FALSE)
```
For this chunk we have used the chunk option echo=FALSE since we prefer not to see it in any rendered version of the script that we might make. This set-up chunk allows you to set chunk options that will apply to all the other chunks, so that you do not have to write them in for each chunk. If you need to override these general settings, then you can just write in the option you want in the chunks that need them.
A list of the most commonly used chunk options can be found on the R Markdown cheat sheet, and a comprehensive of options is available here.
Here, we are suppressing warnings and messages from appearing in our knitted document.
From here on, include all the code chunks presented, but remember to top them with ```{r} and and tail them with ```, as shown to you in the chunk above. Run each chunk as you go, to make sure they work. Remember that a quick way to type the top and tail is Ctrl-Alt-I, or Cmd-Alt-I if you are using a Mac.
library(tidyverse)
library(here)
library(ggfortify) # you may need to install this one - do it in the console window.
library(kableExtra)
plant_gr<-read_csv(here("data","plant_growth_rate.csv"))
glimpse(plant_gr)
How many variables (columns) are there? How many observations (rows)? Is the data tidy and do you remember what that term means in this context? If not, go back to the tidying data exercise.
Once the data is in the shape we want it, we can plot it.
Given the nature of the data, a plot of it requires a scatter plot. Let’s do a rough and ready one using ggplot2:
plant_gr %>%
ggplot(aes(x=soil.moisture.content,y=plant.growth.rate)) +
geom_point()+
#xlim(0,2)+
#ylim(0,60)+
ylab ("Plant growth rate (mm/week)")+
theme_bw()
Remember that it is rare for a linear model to completely capture the relationship between two variables. There may well be other variables that affect the variation of one or other of them and also there will be natural and/or measurement variation. However a linear model is simple to implement and in many cases will capture enough of the releationship between two variables for it to be useful.
To find the actual gradient and intercept, we can fit a linear model to the data We use the function lm() to do this, and save the result to an object called model_pgr
model_pgr<-lm(plant.growth.rate~soil.moisture.content,data=plant_gr)
The linear model requires that
Its outputs can also be unduly influenced by outliers in the data set, which may have undue influence, or ‘leverage’.
We need to check for all these things. The autoplot() function from the ggfortify package is a handy way to do all these things in one go. It uses as its first argument the model_pgr object we have just created.
autoplot(model_pgr,smooth.colour=NA) + theme_bw()
On all counts, this dataset looks good to go.
anova(model_pgr)
The F value here tells us the ratio of variance explained by the explanatory variable to the leftover variance. It is very large, and leads to the tiny p value we get. Remember that the p-value is the probability that you would get an F-value as big or bigger than the one we got if the null hypothesis (that there is no relationship between plant growth rate and soil moisture content) were true. We infer that our large F value is very unlikely to be due to chance.
summary(model_pgr)
##
## Call:
## lm(formula = plant.growth.rate ~ soil.moisture.content, data = plant_gr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9089 -3.0747 0.2261 2.6567 8.9406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.348 1.283 15.08 <2e-16 ***
## soil.moisture.content 12.750 1.021 12.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.019 on 48 degrees of freedom
## Multiple R-squared: 0.7648, Adjusted R-squared: 0.7599
## F-statistic: 156.1 on 1 and 48 DF, p-value: < 2.2e-16
From this output we see the estimate of the intercept and gradient. The p-values for each give us the probability that we would have seen values as big as these if the null hypothesis had been true, under which both would be equal to zero.
The adjusted R2 value gives the proportion of the variation in the dependent variable (plant growth rate) that is explained by variation in the independent variable (soul moisture content). Thus it is always between 0 and 1. A value of 0.76 is quite high for ecological data.
What kind of relationship between plant growth rate and soil moisture content do you think would give adjusted R2 values of 0 or 1?
(The ‘adjusted’ term here refers to the fact that R2 tends to rise if more explanatory variables are used, so that one needs to compensate for that when choosing between different models. Use of the ‘adjusted’ R2 does that.)
How would you report this result?
We can add a regression line to our plot:
plant_gr %>%
ggplot(aes(x=soil.moisture.content,y=plant.growth.rate))+
geom_point()+
geom_smooth(method="lm",se=FALSE)+
#xlim(0,2)+
#ylim(0,60)+
ylab ("Plant growth rate (mm/week)")+
theme_bw()
Start a new R Notebook, and save it in your scripts folder as regression2.Rmd
We will practise fitting linear models to different data sets, where here we will always have just one explanatory variable ie y ~ x1 rather than y ~ x1 + x2 + …., let alone anything more complicated.
In each case we will first plot the data to see if there plausibly is a linear relationship between the variables, then if so we will fit a linear model. The null hypothesis of this model is that there is no relationship between x and y.
We will use the anova(our.model) command to find out from the F value and p value for this if there is evidence to reject the null hypothesis.
If so, we then go on to find the estimated intercept and gradient of the best-fit line. The p values for these allow us to reject (or not) the null hypothesis that both intercept and gradient are zero.
Finally, we will see from the R2 value how much of the variation in y is explained by variation in x
knitr::opts_chunk$set(message=FALSE,warning=FALSE)
library(tidyverse)
library(here)
library(ggfortify)
filepath<-here("data","trees.csv")
trees<-read_csv(filepath)
trees %>%
ggplot(aes(x=HEIGHT,y=VOLUME))+
geom_point()+
geom_smooth(se=FALSE,method='lm',fullrange=TRUE)+
scale_x_continuous(limits=c(60,90))+
scale_y_continuous(limits=c(0,80))+
labs(x='Height (m)',y=expression(paste('Volume (',m^3,')')))+
theme_bw()
trees.model<-lm(VOLUME~HEIGHT,data=trees)
autoplot(trees.model) + theme_bw()
The qq-plot is more or less linear, but there is some systematic variation in the residuals. We proceed with caution!
To find the overall significance of the linear model
anova(trees.model)
To find the intercept, gradient and R2 values
summary(trees.model)
##
## Call:
## lm(formula = VOLUME ~ HEIGHT, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.274 -9.894 -2.894 12.068 29.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.1236 29.2731 -2.976 0.005835 **
## HEIGHT 1.5433 0.3839 4.021 0.000378 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 29 degrees of freedom
## Multiple R-squared: 0.3579, Adjusted R-squared: 0.3358
## F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784
We see here a strong relationship (as denoted by the t and F values, and hence the p-values) but also a sloppy relationship, as denoted by the low R2 value. Much of the variability of volume is not accounted for by tree height. So while tree height definitely does seem to affect the volume of wood, so also must other factors.
You can end up with one of four possibilities:
Let us consider two of these:
We look at some seed data where we compare seed weight with plant density.
filepath<-here("data","seeds.csv")
seeds<-read_csv(filepath)
seeds %>%
ggplot(aes(x=PLANDEN,y=SEEDWGHT))+
geom_point()+
geom_smooth(se=FALSE,method='lm',fullrange=TRUE)+
scale_x_continuous(limits=c(50,200))+
scale_y_continuous(limits=c(150,300))+
labs(x='Plant density',y='Seed weight')+
theme_bw()
seeds.model<-lm(SEEDWGHT~PLANDEN,data=seeds)
autoplot(seeds.model) + theme_bw()
Not looking too bad
anova(seeds.model)
summary(seeds.model)
##
## Call:
## lm(formula = SEEDWGHT ~ PLANDEN, data = seeds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.0581 -7.5618 0.0293 6.3702 17.1909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 311.89777 8.57377 36.38 < 2e-16 ***
## PLANDEN -0.68773 0.06515 -10.56 3.85e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.731 on 18 degrees of freedom
## Multiple R-squared: 0.8609, Adjusted R-squared: 0.8532
## F-statistic: 111.4 on 1 and 18 DF, p-value: 3.855e-09
We have a low p-value (\(F_{1,18} = 111.45, p<0.0005\)) implies that PLANDEN affects seed weight. The high \(R^2\), means that we have a tight relationship - PLANDEN appears to be the main determinant of seed weight.
Here we look at the maths and english scores of some students. How much of the maths score can be explained by the english score?
filepath<-here("data","scores.csv")
scores<-read_csv(filepath)
scores %>%
ggplot(aes(x=ESSAYS,y=MATHS))+
geom_point()+
geom_smooth(se=FALSE,method='lm',fullrange=TRUE)+
scale_x_continuous(limits=c(50,90))+
scale_y_continuous(limits=c(50,90))+
labs(x='Essay score',y='Maths score')+
theme_bw()
There plausibly is a linear relationship here, but it is hard to tell with so few data
scores.model<-lm(MATHS~ESSAYS,data=scores)
autoplot(scores.model) + theme_bw()
Not great, but again, hard to tell with so few data.
anova(scores.model)
summary(scores.model)
##
## Call:
## lm(formula = MATHS ~ ESSAYS, data = scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4714 -3.8226 0.9571 3.8321 13.9810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.5667 22.2630 1.238 0.2507
## ESSAYS 0.6548 0.3154 2.076 0.0715 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.14 on 8 degrees of freedom
## Multiple R-squared: 0.3502, Adjusted R-squared: 0.2689
## F-statistic: 4.311 on 1 and 8 DF, p-value: 0.07153
The plot suggests there may be some trend, but there is not much evidence for this (\(F_{1,8} = 4.31, p=0.072\)) and only 35% of all variance has been explained. A low \(t\) and F-ratio indicates that the evidence of a relationship is poor and that little variability is explained by fitting the line.
Lastly, some data comparing some aspect of two species of rodent.
filepath<-here("data","rodents.csv")
rodents<-read_csv(filepath)
rodents %>%
ggplot(aes(x=SPECIES1,y=SPECIES2))+
geom_point()+
geom_smooth(se=FALSE,method='lm',fullrange=TRUE)+
scale_x_continuous(limits=c(2,10))+
scale_y_continuous(limits=c(6,12))+
labs(x='Species 1',y='Species 2')+
theme_bw()
rodents.model<-lm(SPECIES2~SPECIES1,data=rodents)
anova(rodents.model)
summary(rodents.model)
##
## Call:
## lm(formula = SPECIES2 ~ SPECIES1, data = rodents)
##
## Residuals:
## 1 2 3 4 5
## 0.59740 1.93506 -0.96104 -0.06494 -1.50649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.5195 1.7731 8.189 0.00381 **
## SPECIES1 -0.7792 0.2811 -2.773 0.06942 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.56 on 3 degrees of freedom
## Multiple R-squared: 0.7193, Adjusted R-squared: 0.6257
## F-statistic: 7.687 on 1 and 3 DF, p-value: 0.06942
72% of the variance is explained, but we fail to achieve significance. This can happen with small datasets - a high proportion of the variance in the dataset is explained, but owing to the small sample size there is insufficient evidence to draw firm conclusions about relationships in the population.
A p-value near significance might merit further investigation, perhaps by gathering more data, but as it stands we could not reject the null hypothesis of there being no relationship in the distribution of the two species.
The plot suggests that if there is any relationship, it may be non-linear in nature!
Calculation of parameters for the statistical linear model proposed to predict the volume of trees as function of their diameter and height.
Create a new notebook wit the usual two first chunks:
```{r global-options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, warning=FALSE, message=FALSE)
```
library(tidyverse)
library(here)
library(car)
library(cowplot)
library(gridExtra)
library(ggfortify)
#Reading the data
filepath<-here("data","TreeVol.csv")
trees_alom <- read_csv(filepath)
names (trees_alom)
## [1] "diameter" "height" "volume"
Fitting the model (or estimating the parameters \(beta_j\) of model):
\[\text{volume}_i= \beta_0 + \beta_1\times \text{diameter}_i + \beta_2 \times \text{height}_i + e_i\]
The car library allows the calculation of Type II Sum of Squares; and without regard to the order in which terms are included in the model
Please notice that Anova(), anova(), and aov() can all do analysis of variance, but they use different approaches, and their output can lead to very different interpretations. We use all three here (to show that they can all be useful), but for an introduction to some of the issues, see Fox, J. (2008) Applied Regression Analysis and Generalized Linear Models, Second Edition. Sage; or Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition, Sage. John Fox wrote and maintains the car library.
vol_dh <- lm (volume ~ diameter + height, data = trees_alom)
summary (vol_dh)
##
## Call:
## lm(formula = volume ~ diameter + height, data = trees_alom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2224 -1.3994 -0.0429 0.8784 5.7161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.8077 5.6488 -6.339 7.41e-06 ***
## diameter 34.0993 3.9788 8.570 1.41e-07 ***
## height 0.3204 0.0754 4.249 0.000541 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.121 on 17 degrees of freedom
## Multiple R-squared: 0.8837, Adjusted R-squared: 0.87
## F-statistic: 64.58 on 2 and 17 DF, p-value: 1.143e-08
Anova (vol_dh)
The high significance of the three coefficients suggests to retain both predictor (independent) variables. The ANOVA also suggests a significant effect of all the predictor variables. However, to be sure, let’s eliminate each variable stepwise.
vol_D <- update (vol_dh, ~ . -height)
summary (vol_D)
##
## Call:
## lm(formula = volume ~ diameter, data = trees_alom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4943 -1.7973 0.0089 0.7810 8.4323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.235 4.994 -3.451 0.00285 **
## diameter 39.630 5.247 7.553 5.51e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.959 on 18 degrees of freedom
## Multiple R-squared: 0.7601, Adjusted R-squared: 0.7468
## F-statistic: 57.04 on 1 and 18 DF, p-value: 5.506e-07
anova (vol_D)
vol <- update (vol_D, ~ . -diameter)
summary (vol)
##
## Call:
## lm(formula = volume ~ 1, data = trees_alom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.95 -2.40 0.30 2.85 13.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.150 1.315 15.32 3.79e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.882 on 19 degrees of freedom
anova (vol)
vol_H <- update (vol_dh, ~ . -diameter)
summary (vol_H)
##
## Call:
## lm(formula = volume ~ height, data = trees_alom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5898 -2.9069 -0.3852 1.6887 10.2010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.3368 11.9073 -1.624 0.12177
## height 0.5318 0.1597 3.329 0.00373 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.754 on 18 degrees of freedom
## Multiple R-squared: 0.3811, Adjusted R-squared: 0.3468
## F-statistic: 11.09 on 1 and 18 DF, p-value: 0.00373
anova (vol_H)
The Akaiki Information Criterion is a measure of the amount of information lost by a model.
We can calculate it for the various models we have tried and see which one has the lowet AIC. That would be our choice of best model.
AIC (vol_dh, vol_D, vol_H, vol)
Clearly, the saturated model has the lower AIC. We recall the anova table for the model we keep which is the first model fitted that includes both diameter and height as predictors
anova (vol_dh)
Notice that a more complex model can be proposed that considers an interaction (multiplicative effect) between both predictor (independent) variables.
vol_dh.interac <- lm (volume ~ diameter * height, data = trees_alom)
summary (vol_dh.interac)
##
## Call:
## lm(formula = volume ~ diameter * height, data = trees_alom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2243 -1.1416 0.0915 0.8044 5.1193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.78368 38.55627 -0.176 0.863
## diameter 4.31176 39.33948 0.110 0.914
## height -0.09712 0.55381 -0.175 0.863
## diameter:height 0.42774 0.56193 0.761 0.458
##
## Residual standard error: 2.147 on 16 degrees of freedom
## Multiple R-squared: 0.8877, Adjusted R-squared: 0.8667
## F-statistic: 42.18 on 3 and 16 DF, p-value: 7.984e-08
Anova (vol_dh.interac)
The effect of an interaction term is not significant. Deleting the interaction term takes us back to the previous chosen model