Changes to this lab will be announced via email and the course FaceBook page CalU EcoStats.
For more labs and tutorials see my WordPress Site lo.brow.R
** These data come from:** Meredith et al 1991 Repeated measures experiments in forestry: focus on analysis of response curves. Can. J. For. Res.
The size of the dataframe
dat.orig <- read.csv(file = "data_orig.csv")
dim(dat.orig)
## [1] 67 6
head(dat.orig)
## X conc.AL ht.wk.1 ht.wk.2 ht.wk.3 ht.wk.4
## 1 1 0 60 62 78 104
## 2 2 0 41 50 60 60
## 3 3 0 85 97 115 120
## 4 4 0 88 87 90 80
## 5 5 0 66 65 80 95
## 6 6 0 106 100 133 172
The size of the dataframe
data.long <- read.csv(file = "data_long.csv")
dim(data.long)
## [1] 268 4
head(data.long)
## X height week conc.AL
## 1 1 60 1 0
## 2 2 41 1 0
## 3 3 85 1 0
## 4 4 88 1 0
## 5 5 66 1 0
## 6 6 106 1 0
plot(height ~ week,
data = data.long,
main = "Seedling growth: Height ~ week"
)
You can change … * color using col = 2 (or another number) * symbol with pch = … * x axis label w/ xlab = * y axis label w/ ylab =
We fill fit a “null” and alternative model.
“Null”" Model
model.null <- lm(height ~ 1,
data = data.long)
“Alternative”" Model
model.alt <- lm(height ~ week,
data = data.long)
anova(model.null,
model.alt)
## Analysis of Variance Table
##
## Model 1: height ~ 1
## Model 2: height ~ week
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 267 964926
## 2 266 907135 1 57791 16.946 5.135e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F statistic and p-value are very important quantities that need to be reported in a paper.
TASK: Write the F-statistc and p-value for test on the worksheet.
NOTE: Because we sampled the sample plant multiple tiems this p-value is WAY too small and our sample size is WAY too big. This is a form of psuedoreplication.
par(mfrow = c(1,2))
plot(height ~ week, data = data.long,
main = "Height ~ 1"
)
abline(model.null, col = 2, lwd = 3)
plot(height ~ week,
data = data.long,
main = "Height ~ week"
)
abline(model.alt, col = 2, lwd = 3)
Add annotations: You can add annotations using the “text” command. Try running this text(y = 300, x = 0, “sample text”,pos = 4)
Use the summary command to get teh parameters
summary(model.alt)
##
## Call:
## lm(formula = height ~ week, data = data.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.58 -38.58 -14.44 22.98 204.43
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.037 8.738 7.558 6.64e-13 ***
## week 13.134 3.191 4.117 5.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.4 on 266 degrees of freedom
## Multiple R-squared: 0.05989, Adjusted R-squared: 0.05636
## F-statistic: 16.95 on 1 and 266 DF, p-value: 5.135e-05
TASK: * Write the model equation on the worksheet * First write it in words: “height ~ intercept + slope*week“.
* Then write it with the numbers (properly rounded) included
TASK: * Write down the “adjusted R^2” * Is the model telling us much?
Use the resid() command
my.resid <- resid(model.alt)
This gives insight into whether the assumption of normality is met
par(mfrow = c(1,1))
hist(my.resid)
par(mfrow = c(1,1))
plot(model.alt, 2)
Also called “Fitted values”. Mathematicall they are called “y.hat”
my.fitted <- fitted(model.alt)
Fitted values (aka predictions) are obtained by plugging in your observations (x values) into the model and calculating the y value.
This tests the assumption of “constant variance”
plot(my.resid ~ my.fitted)
abline(h = 0, col = 2, lwd = 2)
The fanning shape of the residual is a very bad sign that the assumtpon of constant variance has been violated.
We can fix some problems with normality and non-cosntant variance via transformation.
model.alt.log <- lm( log(height) ~ week, data = data.long)
Make a histogram of the residuals
my.resid.log <- resid(model.alt.log)
hist(my.resid.log)
More normal!
qqplot
plot(model.alt.log, 2)
Points fall closer to the line!
par(mfrow = c(1,2), mar = c(1,1,4,1))
plot(model.alt, 2, main = "origingal")
plot(model.alt.log, 2, main = "log trans")
Plot residuals vs. fitted (predicted values)
par(mfrow = c(1,1))
my.fitted.log <- fitted(model.alt.log)
plot(my.resid.log ~ my.fitted.log)
abline(h = 0, col = 2, lwd = 2)
Not so fan-shapped any more
This is a lot of stuff, but all of it is important! NOte that the slope can be considered a type of “efffect size” (more specifically, an un-standardized effect size). The standard error of the slope tells us how precisely we have estimated.
I can get all this info I need summary(model.alt.log)
I would write these results like this: “There was a strong relationship (F = ____, df = 1,266, p = ____ the height of sugar maple seedlings and time (slope = ____, SE = _____, R^2 = _____)”
TASK: On the worksheet complete this sentence with the ppropirate info.
These results look pretty impressive when we read them. The p-value is small, the slope is positive, etc. However, when we get to the R^2 value, what do we learn? The model, even though its p-value is very small, is not really telling us much of why the data is so variable. We’re explaining one aspect of the variability in the data (variation over time) but 95% of the variation [ (1 - 0.05)*100 ] remains unexplained
These data are from an experiment where plants were treated with aluminum, from 0 to 600 (grams?). We can see if there is an impact of the AL on growth
plot(height ~ conc.AL, data = data.long)
This plot isn’t very helpful because the process of growth over time is the main biological process. The effect of AL is to change the growth rate.
Regression allows us to look at multiple factors at the same time. We can therefore look at the impact of time on growth and AL on growth. A full modeling of this process will take some work; I’ll jsut introduce the most basic elemetns now.
When you include multiple predictor variables (x variables) in a model its called multiple regression. We can do this in R like this:
model.alt.2 <- lm(height ~ week + conc.AL,
data = data.long)
Note the “+” sign between week and conc.AL
We can see whether how original “alt” model, with just the week varibble, compares to a multiple linear regression model with both week and AL. We do this with the anova() command as we have done for comparisons between null and alt models
anova(model.alt,
model.alt.2)
## Analysis of Variance Table
##
## Model 1: height ~ week
## Model 2: height ~ week + conc.AL
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 266 907135
## 2 265 884739 1 22396 6.7081 0.01013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TASK: Write the F-statistc and p-value for test on the worksheet for the multiple regression.
Get the parameters using the summary() command
summary(model.alt.2)
##
## Call:
## lm(formula = height ~ week + conc.AL, data = data.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.72 -37.34 -14.02 21.98 195.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.17612 9.49051 8.027 3.27e-14 ***
## week 13.13433 3.15691 4.161 4.29e-05 ***
## conc.AL -0.03996 0.01543 -2.590 0.0101 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.78 on 265 degrees of freedom
## Multiple R-squared: 0.0831, Adjusted R-squared: 0.07618
## F-statistic: 12.01 on 2 and 265 DF, p-value: 1.018e-05
There are three lines for the parameters. We now have one intercepts and TWO DIFFERNT slope values.
TASK: Write the word formula and mathematical formula for the multiple regression model.
The word formula is: height ~ intercept + slope1conc.AL + slope2week
Adding more predictor variables allows us to explain more of what is going on in a dataset. R^2 will ALWAYS go down when we add a new variable
summary(model.alt.2)
##
## Call:
## lm(formula = height ~ week + conc.AL, data = data.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.72 -37.34 -14.02 21.98 195.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.17612 9.49051 8.027 3.27e-14 ***
## week 13.13433 3.15691 4.161 4.29e-05 ***
## conc.AL -0.03996 0.01543 -2.590 0.0101 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.78 on 265 degrees of freedom
## Multiple R-squared: 0.0831, Adjusted R-squared: 0.07618
## F-statistic: 12.01 on 2 and 265 DF, p-value: 1.018e-05
TASK: Write R^2 for the multiple regression model.
Let’s go back to the original “wide” dataset
head(dat.orig)
## X conc.AL ht.wk.1 ht.wk.2 ht.wk.3 ht.wk.4
## 1 1 0 60 62 78 104
## 2 2 0 41 50 60 60
## 3 3 0 85 97 115 120
## 4 4 0 88 87 90 80
## 5 5 0 66 65 80 95
## 6 6 0 106 100 133 172
plot(ht.wk.4 ~ conc.AL,
data = dat.orig)
Build the model
model.wk.4 <- lm(ht.wk.4 ~ conc.AL,
data = dat.orig)
** TASK:** Make a graph of the model plotted on the data. This requires the plot() and abline() command. The code was used above.
Look at the results
summary(model.wk.4)
##
## Call:
## lm(formula = ht.wk.4 ~ conc.AL, data = dat.orig)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.91 -46.88 -21.85 25.83 208.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 141.11722 14.10602 10.004 8.77e-15 ***
## conc.AL -0.08211 0.04129 -1.989 0.051 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.32 on 65 degrees of freedom
## Multiple R-squared: 0.05735, Adjusted R-squared: 0.04285
## F-statistic: 3.955 on 1 and 65 DF, p-value: 0.05096
** TASK ** Write a sentence that summarizes this model
We can get the residuals vs. fitted diagnostic plot in a single step below using “plot(model.wk.4, which = 1)”. We get the qqplot with “which = 2”.
par(mfrow = c(1,2))
plot(model.wk.4, which = 1) #residuals vs. fitted
plot(model.wk.4, which = 2) #qqplot
** TASK:** Print off this diagnostic plot and write several sentences about what it tells you.
A log transformation might improve the model. Implement this and check the residuals. Note that you can transform the data within the model funciton as “log(ht.wk.4)”.
** TASK:** Write the code needed for the transformation onto the worksheet.
To do a real analysis of this data we need to tell our regression model that we are measuring the same plant multiple times
library(reshape2)
data.orig$plant.ID <- 1:dim(data.orig)[1]
data.melt2 <- melt(data = data.orig,
id.vars = c("conc.AL","plant.ID"))
names(data.melt2)[3:4] <- c("week", "height")
data.melt2$week <- as.numeric(gsub("ht.wk.",
"",
data.melt2$week))
library(ggplot2)
qplot(y = height,
x = week,
data = data.melt2,
group = plant.ID,
color = plant.ID,
geom = c("point")) +
geom_smooth(method = "lm", se = F) + theme_bw()
library(ggplot2)
qplot(y = height,
x = week,
data = data.melt2,
group = plant.ID,
color = plant.ID,
geom = c("point"), facets = . ~ conc.AL)+ theme_bw()
library(ggplot2)
qplot(y = height,
x = week,
data = data.melt2,
group = plant.ID,
color = plant.ID,
geom = c("point"), facets = . ~ conc.AL) +
geom_smooth(method = "lm", se = F) + theme_bw()
library(ggplot2)
qplot(y = height,
x = week,
data = data.melt2,
geom = c("point"), facets = . ~ conc.AL) +
geom_smooth(method = "lm", se = F) + theme_bw()
library(lme4)
lmer.ints <- lmer(height ~ week +
(1|plant.ID),
data = data.melt2)
Each plant gets its own slope
library(lme4)
data.melt2$week.scaled <- scale(data.melt2$week)
lmer.ints <- lmer(height ~ week.scaled +
(week.scaled|plant.ID),
data = data.melt2)