Load libraries.
library(gridExtra)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggfortify)
# Packages for raincloud plots
library(ggdist) # for stat_halfeye()
library(ggbeeswarm) # for nicer jitter
# Packages for modeling
library(lme4)
library(car)
library(lmerTest)
Load Newts dataset and check the data.
# Read the "Newts" sheet
newts <- read.csv("/Users/joylynngoh/Files/Uni/Y4/ES3307/Assignment 2/Assignment 2 data/Newts.csv")
# Quick check of data structure
head(newts)
## SVL Crest Pond Date LSVL LCREST
## 1 108.9 10.12 4 27 2.037 1.005
## 2 94.7 5.62 7 37 1.976 0.750
## 3 96.9 2.00 7 28 1.986 0.302
## 4 110.1 15.04 10 24 2.042 1.177
## 5 103.7 3.88 6 9 2.016 0.589
## 6 99.3 4.45 8 12 1.997 0.648
str(newts)
## 'data.frame': 120 obs. of 6 variables:
## $ SVL : num 108.9 94.7 96.9 110.1 103.7 ...
## $ Crest : num 10.12 5.62 2 15.04 3.88 ...
## $ Pond : int 4 7 7 10 6 8 7 6 7 2 ...
## $ Date : int 27 37 28 24 9 12 32 16 15 25 ...
## $ LSVL : num 2.04 1.98 1.99 2.04 2.02 ...
## $ LCREST: num 1.005 0.75 0.302 1.177 0.589 ...
# code to check for any NA data.
colSums(is.na(newts[c("SVL", "Crest", "Pond", "Date", "LSVL", "LCREST")]))
## SVL Crest Pond Date LSVL LCREST
## 33 33 33 33 33 33
sum(rowSums(is.na(newts)) == ncol(newts))
## [1] 33
The actual structure of the data is as below:
LSVL - log of snout-vent length in mm; continuous variable.
LCREST - log of the height of the dorsal crest in mm; continuous variable.
POND - A code from 1 to 10 for the pond at which the male was
captured; categorical variable. NB. R has classified this column as
numerical data. Use as.factor() to categorise it.
DATE - The day of the study on which the male was measured; ordinal variable (order of days matter because the sampling period was during the breeding season, so the day of measurement might matter. NB. R has classified this column as numerical data, so I choose to keep it that way so that it is analysed as a covariate.
There are 33 rows with NA values (ie no data). Remove these rows and
factorise Pond.
newtsClean <- na.omit(newts)
newtsClean$Pond <- as.factor(newtsClean$Pond)
# Raincloud plot
ggplot(newtsClean, aes(x = Pond, y = LCREST, fill = Pond)) +
# set half-violin
stat_halfeye(adjust = 0.6, width = 0.6, justification = -0.25,
slab_alpha = 0.6, .width = 0, point_interval = NULL) +
# add boxplot
geom_boxplot(width = 0.15, outlier.shape = NA,
position = position_nudge(x = 0.15)) +
# add rain points
geom_quasirandom(width = 0.05, alpha = 0.7, size = 1.8, color = "grey10",
position = position_nudge(x = 0.25)) +
guides(fill = "none") +
labs(x = "Pond", y = "Log Dorsal Crest Height (mm)",
title = "Fig 1A: Relationship between LCREST and LSVL by Pond") +
theme_minimal()
# Scatterplot with regression line
ggplot(newtsClean, aes(x = LSVL, y = LCREST)) +
geom_point(aes(color = factor(Pond))) +
geom_smooth(method = "lm", color = "black") +
labs(x = "Log snout-vent length (mm)", y = "Log Dorsal Crest Height (mm)",
color = "Pond",
title = "Fig 1B: Scatterplot of LCREST against LSVL") +
theme_minimal()
In general, dorsal crest seems to increase slightly with body size
(Fig. 1B). LCREST doesn’t seem to vary too much between
ponds (Fig. 1A) but the spread of data by ponds varies quite a lot: some
ponds like 1 and 9 have a smaller distribution (and pond 1 only has 3
data points) while other ponds such as 7 and 10 have many more data
points and a larger distribution.
Here, the hypothesis is that there is a relationship between the body size of a newt and the size of its dorsal crest.
options(show.signif.stars = FALSE)
mod1.1 <- lm(formula = LCREST ~ LSVL, data = newtsClean)
summary(mod1.1)
##
## Call:
## lm(formula = LCREST ~ LSVL, data = newtsClean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54974 -0.15546 0.00396 0.16713 0.52165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.3806 1.5013 -6.248 1.59e-08
## LSVL 5.0870 0.7516 6.768 1.58e-09
##
## Residual standard error: 0.2284 on 85 degrees of freedom
## Multiple R-squared: 0.3502, Adjusted R-squared: 0.3425
## F-statistic: 45.81 on 1 and 85 DF, p-value: 1.579e-09
\(F_{1, 85} = 45.81, P < 0.0001\). From these results, body size does seem to have a statistically significant effect on dorsal crest height. Plot residual diagnostics to check if the model fits the data well.
autoplot(lm(newtsClean$LCREST ~ newtsClean$LSVL),which = c(1:3, 5),
ncol = 2, label.size = 3, label.repel = TRUE) +
theme_minimal()
Briefly,
Residuals vs Fitted: Residuals are equally spread around a roughly horizontal line with no distinct patterns. There does not seem to be non-linear relationships. Assumption of linearity is met.
Q-Q Plot: Residuals follow the normal distribution line for the most part, and are thus normally distributed. Assumption of normality is met.
Scale-location: Residuals are spread equally along the range of fitted values with roughly horizontal line, although it rises upwards slightly around fitted value 0.9 to 1.1. Variance is mostly homogenous. Assumption of homoscedasticity is met.
Residuals vs Leverage: the Residuals 41 and 29 identified as potentially influential cases but none of the points fall outside Cook’s distance lines.
Overall, the diagnostic plot indicates a well-fitted model ,
LCREST~LSVL.
By nature of the experimental design, Pond is a blocking
factor and should be kept in the model. However, referencing Fig. 1A,
Pond does not seem to have a significant effect on
LCREST, since the median does not differ by much between
ponds. I will confirm this by including Pond as a random
effect.
mod1.2 <- lmer(formula = LCREST ~ LSVL + (1|Pond), data = newtsClean)
summary(mod1.2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: LCREST ~ LSVL + (1 | Pond)
## Data: newtsClean
##
## REML criterion at convergence: -7.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.40704 -0.68067 0.01732 0.73178 2.28405
##
## Random effects:
## Groups Name Variance Std.Dev.
## Pond (Intercept) 0.00000 0.0000
## Residual 0.05216 0.2284
## Number of obs: 87, groups: Pond, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -9.3806 1.5013 85.0000 -6.248 1.59e-08
## LSVL 5.0870 0.7516 85.0000 6.768 1.58e-09
##
## Correlation of Fixed Effects:
## (Intr)
## LSVL -1.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
From the model, the variance of Pond is 0, therefore it
also explains 0% of the total variance after the variance explained by
fixed effects of LSVL. Furthermore, the model results indicate that the
boundary fit is singular, ie. the variance of the effects are close to
zero. Based on these, I would not include Pond as a
random blocking factor.
For practice, I will also include Pond as a fixed
effect.
mod1.3 <- lm(formula = LCREST ~ LSVL + Pond, data = newtsClean)
mod1.4 <- lm(formula = LCREST ~ Pond + LSVL, data = newtsClean)
anova(mod1.3)
## Analysis of Variance Table
##
## Response: LCREST
## Df Sum Sq Mean Sq F value Pr(>F)
## LSVL 1 2.3894 2.38939 43.3078 5.35e-09
## Pond 9 0.2406 0.02674 0.4846 0.8807
## Residuals 76 4.1931 0.05517
anova(mod1.4)
## Analysis of Variance Table
##
## Response: LCREST
## Df Sum Sq Mean Sq F value Pr(>F)
## Pond 9 0.3252 0.03613 0.6549 0.7466
## LSVL 1 2.3048 2.30483 41.7751 8.84e-09
## Residuals 76 4.1931 0.05517
When type I model is run, switching Pond and
LSVL makes no difference on the effect of Pond
on LCREST. With LSVL placed first, \(F_{1, 76} (Pond) = 0.485, P < 0.881\);
with Pond placed first, \(F_{1,76}(Pond) = 0.655, P = 0.747\). The
conclusion is that LSVL is still the most influential
factor to dorsal crest height, but to acocunt for Pond as a
blocking factor, LSVL should be placed last, or I can also
run a type II ANOVA.
I assume that temporal variation is a fixed effect, since as
mentioned in 1A, the sampling period was during the
breeding season. I also keep Pond as a fixed blocking
factor, as concluded from 1C.
Run a type II ANOVA SS to investigate the effect of LSVL
on Date, on LCREST when placed last in the
model.
mod1.4 <- lm(formula = LCREST~LSVL+Date+Pond, data = newtsClean)
Anova(mod1.4, type = "2")
## Anova Table (Type II tests)
##
## Response: LCREST
## Sum Sq Df F value Pr(>F)
## LSVL 2.3257 1 42.0896 8.336e-09
## Date 0.0490 1 0.8859 0.3496
## Pond 0.2222 9 0.4467 0.9050
## Residuals 4.1441 75
The type II model shows that when each of the variables are placed
last in the model, Date and Pond have no
statistically significant effect on LCREST.
Date: \(F_{1,75} = 0.886,
P = 0.350\)
Pond: \(F_{9,75} = 0.447,
P = 0.905\)
LSVL: \(F_{1,75} = 42.09,
P < 0.001\)
As with 1C, only LSVL has a statistically significant
effect on LCREST. Thus, I would use a type II ANOVA with
all 3 variables.
Load Newts dataset and check the data.
# Read the "Newts" sheet
sheep <- read.csv("/Users/joylynngoh/Files/Uni/Y4/ES3307/Assignment 2/Assignment 2 data/sheep.csv")
# Quick check of data structure
head(sheep)
## Duration nlookups sex sheep sheepn obsper luprate Sex2 sheep2 avluprate
## 1 39 3 1 1 1 1 0.076923 1 1 0.13270
## 2 40 7 1 1 1 2 0.175000 1 2 0.23485
## 3 39 3 1 1 1 3 0.076923 1 3 0.16753
## 4 34 7 1 1 1 4 0.205882 2 4 0.19490
## 5 31 6 1 1 1 5 0.193548 2 5 0.21700
## 6 37 5 1 1 1 6 0.135135 2 6 0.32280
str(sheep)
## 'data.frame': 120 obs. of 10 variables:
## $ Duration : int 39 40 39 34 31 37 38 36 36 37 ...
## $ nlookups : int 3 7 3 7 6 5 3 4 2 5 ...
## $ sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ sheep : int 1 1 1 1 1 1 1 1 1 1 ...
## $ sheepn : int 1 1 1 1 1 1 1 1 1 1 ...
## $ obsper : int 1 2 3 4 5 6 7 8 9 10 ...
## $ luprate : num 0.0769 0.175 0.0769 0.2059 0.1935 ...
## $ Sex2 : int 1 1 1 2 2 2 NA NA NA NA ...
## $ sheep2 : int 1 2 3 4 5 6 NA NA NA NA ...
## $ avluprate: num 0.133 0.235 0.168 0.195 0.217 ...
The actual structure of the data is as below:
DURATION of feeding time in minutes; continuous data.
NLOOKUPS - The number of lookups; count data.
SEX - Coded as 1 for female and 2 for male; categorical variable.
NB. R has classified this column as integer data. Use
as.factor() to categorise it.
SHEEP - Coded as 1 to 6; categorical variable. NB. R has
classified this column as integer data. Use as.factor() to
categorise it.
OBSPER - The number of the observation period from 1 to 20;
categorical variable. NB. R has classified this column as integer data.
Use as.factor() to categorise it.
LUPRATE is NLOOKUPS/DURATION; proportion variable.
sheep <- sheep %>%
mutate(across(c(sex, sheep, obsper, Sex2, sheep2), as.factor))
str(sheep)
## 'data.frame': 120 obs. of 10 variables:
## $ Duration : int 39 40 39 34 31 37 38 36 36 37 ...
## $ nlookups : int 3 7 3 7 6 5 3 4 2 5 ...
## $ sex : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sheep : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sheepn : int 1 1 1 1 1 1 1 1 1 1 ...
## $ obsper : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ luprate : num 0.0769 0.175 0.0769 0.2059 0.1935 ...
## $ Sex2 : Factor w/ 2 levels "1","2": 1 1 1 2 2 2 NA NA NA NA ...
## $ sheep2 : Factor w/ 6 levels "1","2","3","4",..: 1 2 3 4 5 6 NA NA NA NA ...
## $ avluprate: num 0.133 0.235 0.168 0.195 0.217 ...
Hypothesis: there is a statistically significant difference in the lookup rates of male and female sheep.
mod2.1 <- lm(formula = luprate~obsper + sex, data = sheep)
summary(mod2.1)
##
## Call:
## lm(formula = luprate ~ obsper + sex, data = sheep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10565 -0.04618 -0.01251 0.03791 0.14655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.148210 0.027093 5.470 3.38e-07
## obsper2 0.020979 0.037392 0.561 0.57602
## obsper3 -0.023988 0.037392 -0.642 0.52266
## obsper4 0.049120 0.037392 1.314 0.19200
## obsper5 0.101339 0.037392 2.710 0.00793
## obsper6 0.020712 0.037392 0.554 0.58089
## obsper7 -0.024353 0.037392 -0.651 0.51637
## obsper8 0.022641 0.037392 0.606 0.54623
## obsper9 -0.018448 0.037392 -0.493 0.62284
## obsper10 0.056422 0.037392 1.509 0.13449
## obsper11 0.031622 0.037392 0.846 0.39976
## obsper12 -0.034530 0.037392 -0.923 0.35800
## obsper13 0.072037 0.037392 1.927 0.05690
## obsper14 0.005201 0.037392 0.139 0.88966
## obsper15 0.046917 0.037392 1.255 0.21252
## obsper16 0.085124 0.037392 2.277 0.02497
## obsper17 0.008613 0.037392 0.230 0.81829
## obsper18 0.049815 0.037392 1.332 0.18583
## obsper19 0.028000 0.037392 0.749 0.45573
## obsper20 0.105948 0.037392 2.833 0.00558
## sex2 0.066537 0.011824 5.627 1.71e-07
##
## Residual standard error: 0.06476 on 99 degrees of freedom
## Multiple R-squared: 0.4388, Adjusted R-squared: 0.3255
## F-statistic: 3.871 on 20 and 99 DF, p-value: 3.403e-06
The model indicates that both observation period and sex have statistically significant effects on lookup rate:
obsper: \(F_{19, 99} =
2.41, P = 0.003\)
sex: \(F_{1, 99} = 41.67,
P < 0.0001\)
Plot residual plots to check suitability of model.
autoplot(lm(sheep$luprate ~ sheep$obsper + sheep$sex),which = c(1:3, 5),
ncol = 2, label.size = 3, label.repel = TRUE) +
theme_minimal()
Residuals vs Fitted: No non-linear relationships for the most part. Roughly horizontal line, equally spread residuals.
Q-Q plot: tails lengthen inwards at the tails. Data seems non-normal, which could indicate that the model is not the best fit for the data.
Scale-location: Variance looks roughly homogenous.
Residuals vs factor levels: No outliers cross Cook’s distance lines.
Generally, 3 data points seem to be outliers (106, 110, 113) but they do not seem too far off from the rest of the data points.
To confirm these results, I will also plot LUPRATE
against obsper and sex to check if any trends
are visually clear.
ggplot(sheep) +
geom_boxplot(aes(x = sex, y = luprate, fill = sex), color = "black", size = 0.25) +
facet_wrap(~obsper, ncol = 10) +
scale_fill_discrete(name = "Sex", labels = c("1" = "Female", "2" = "Male")) +
labs(x = "Observation period", y = "Look-up rate (number/duration)",
title = "Fig. 2A: Lookup rate between sexes grouped by observation period") +
theme_minimal()
From Fig. 2, there doesn’t seem to be a clear trend between
observation period and look-up rate. obsper might appear to
be statistically significant in the model because of pseudoreplication
and overinflation of degrees of freedom. Including obsper
in the model as a fixed effect might not be the best to actually explain
the effect of sex on lookup rate.
The main issue with this analysis is that it assumes each data point taken from each sheep is independent (thus the total n from the model results adds up to 99 + 19 + 1 + 1 = 120 data points), which is not true. In actuality, there are only 6 sheep, or 6 independent samples. Thus, the model is pseudoreplicated. The analysis should account for pseudoreplication by treating sheep as a blocking factor and observation period as another blocking factor.
I will use a mixed model to account for pseudoreplication by
accounting for the individual sheep as a random blocking factor, and for
observation period as a temporal random blocking factor. According to
the experimental design, I understand that each of the 6 sheep were
observed at the same time within each hour, which means each
sheep was crossed with each obsper.
mod2.2 <- lmer(formula = luprate~sex + (1|sheep) + (1|obsper), data = sheep)
summary(mod2.2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: luprate ~ sex + (1 | sheep) + (1 | obsper)
## Data: sheep
##
## REML criterion at convergence: -388.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.90722 -0.58194 -0.06396 0.58800 2.11301
##
## Random effects:
## Groups Name Variance Std.Dev.
## obsper (Intercept) 0.001472 0.03837
## sheep (Intercept) 0.003621 0.06018
## Residual 0.001268 0.03561
## Number of obs: 120, groups: obsper, 20; sheep, 6
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.17837 0.03608 4.48935 4.944 0.00576
## sex2 0.06654 0.04956 4.00000 1.342 0.25056
##
## Correlation of Fixed Effects:
## (Intr)
## sex2 -0.687
Based on the random effects results, both obsper and
sheep account for some of the variance in the model. It
seems that sheep accounts for more of the variation than
obsper. We could consider placing sheep as a
fixed effect, but if it’s not the main focus of the analysis then it
should be left as a random effect. These random factors aside, sex
doesn’t have a statistically significant effect on lookup rate: \(P = 0.251\).
Load weight dataset and check the data.
# Read the "weight" sheet
weight <- read.csv("/Users/joylynngoh/Files/Uni/Y4/ES3307/Assignment 2/Assignment 2 data/weight.csv")
# Quick check of data structure
head(weight)
## FOREARM HT WT
## 1 4.868 159.4 70.69
## 2 4.671 163.9 63.93
## 3 2.995 158.1 60.39
## 4 3.544 148.1 56.68
## 5 1.719 155.1 56.61
## 6 6.297 153.9 63.72
str(weight)
## 'data.frame': 39 obs. of 3 variables:
## $ FOREARM: num 4.87 4.67 3 3.54 1.72 ...
## $ HT : num 159 164 158 148 155 ...
## $ WT : num 70.7 63.9 60.4 56.7 56.6 ...
The actual structure of the data is as below:
FOREARM - thickness of skinfold in the forearm in mm; continuous variable.
HT - height of individual in cm; continuous variable.
WT - weight of individual in kg; continuous variable.
htFA <- ggplot(data = weight, aes(x = HT, y = FOREARM)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal() +
ggtitle("Fig. 3A: Relationship between height and forearm skinfold thickness ")
wtFA <- ggplot(data = weight, aes(x = WT, y = FOREARM)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal() +
ggtitle("Fig. 3B: Relationship between weight and forearm skinfold thickness")
# Arrange side by side
grid.arrange(htFA, wtFA, nrow = 2)
From Fig. 3A, there seems to be little correlation
between height and forearm skinfold thickness. The points are loosely
scattered around the line of best fit. Fig. 3B shows
slightly more obvious positive correlation between weight and skinfold
thickness (thickness seems to increase with weight), although points are
also not closely-clustered around the line of best fit.
mod3.1 <- lm(formula = FOREARM~HT, data = weight)
mod3.2 <- lm(formula = FOREARM~WT, data = weight)
anova(mod3.1)
## Analysis of Variance Table
##
## Response: FOREARM
## Df Sum Sq Mean Sq F value Pr(>F)
## HT 1 0.944 0.9439 0.1754 0.6778
## Residuals 37 199.094 5.3809
anova(mod3.2)
## Analysis of Variance Table
##
## Response: FOREARM
## Df Sum Sq Mean Sq F value Pr(>F)
## WT 1 59.137 59.137 15.529 0.000347
## Residuals 37 140.901 3.808
Reporting results from ANOVA:
Height as the only predictor: \(F_{1, 37} = 0.175, P = 0.678\).
Weight as the only predictor: \(F_{1, 37} = 59.14, P < 0.001\).
Weight is a much better predictor of obesity if used alone, since its p-value is smaller than significance level, while p-value of height is way above significance level. From the F-ratio, I can also see that the variation explained by weight is much greater than the variation explained by its residuals, while the variation explained by height seems to be less than the variation explained by its residuals.
# Using type I SS but switch the order of HT and WT
mod3.3 <- lm(formula = FOREARM~HT + WT, data = weight)
mod3.4 <- lm(formula = FOREARM~WT + HT, data = weight)
anova(mod3.3)
## Analysis of Variance Table
##
## Response: FOREARM
## Df Sum Sq Mean Sq F value Pr(>F)
## HT 1 0.944 0.944 0.2926 0.5919
## WT 1 82.970 82.970 25.7220 1.207e-05
## Residuals 36 116.124 3.226
anova(mod3.4)
## Analysis of Variance Table
##
## Response: FOREARM
## Df Sum Sq Mean Sq F value Pr(>F)
## WT 1 59.137 59.137 18.3334 0.0001314
## HT 1 24.777 24.777 7.6813 0.0087755
## Residuals 36 116.124 3.226
# Using type II SS
Anova(mod3.3, type = "2")
## Anova Table (Type II tests)
##
## Response: FOREARM
## Sum Sq Df F value Pr(>F)
## HT 24.777 1 7.6813 0.008775
## WT 82.970 1 25.7220 1.207e-05
## Residuals 116.124 36
Switching HT and WT can help to test if
HT and WT are correlated. When HT
is placed first in a type I test, it does not statistically
significantly contribute to the model (\(F_{1,36} = 0.293, P = 0.592\)). However,
when HT is placed last (ie. the proportion of
WT variation is accounted for first), its effect becomes
statistically significant (\(F_{1,36} = 7.681,
P = 0.009\)); these results are the same as reported by the type
II test. Thus, placing HT first in the model detracts from
the informativeness of its true main effect because WT has
not been accounted for yet, while WT remains statistically
significant whether it is placed first or last (although its p-value
does change slightly).
When a type II test is run, both HT and WT
are shown to be statistically significant variables (\(F_{1,36}(WT) = 25.72, P < 0.001\)). This
is because the variation of the other variable has been accounted for.
However, it does not inform me which predictor might be stronger. On the
other hand, running a type I test shows that WT might thus
be a stronger predictor of FOREARM than HT,
but both are not mutually exclusive effects and their effects on
FOREARM might be correlated and interact to some
extent.
cor(weight$HT, weight$WT)
## [1] 0.6293559
Indeed, height and weight seems to be moderately correlated, as expected.
I have answered the question in the part above. Putting
HT first, I find that it does not statistically
significantly contribute to the model, but when WT is put
first, HT becomes statistically significant.
Load blooms dataset and check the data.
# Read the "blooms" sheet
blooms <- read.csv("/Users/joylynngoh/Files/Uni/Y4/ES3307/Assignment 2/Assignment 2 data/blooms.csv")
# Quick check of data structure
head(blooms)
## SQBLOOMS BED WATER SHADE X SQ2 B2 W2 S2
## 1 4.359 1 1 1 NA 4.359 1 1 1
## 2 3.317 1 1 2 NA 3.317 1 1 2
## 3 3.606 1 1 3 NA 3.606 1 1 3
## 4 4.123 1 1 4 NA 4.123 1 1 4
## 5 4.472 1 2 1 NA 4.472 1 2 1
## 6 4.583 1 2 2 NA 4.583 1 2 2
str(blooms)
## 'data.frame': 36 obs. of 9 variables:
## $ SQBLOOMS: num 4.36 3.32 3.61 4.12 4.47 ...
## $ BED : int 1 1 1 1 1 1 1 1 1 1 ...
## $ WATER : int 1 1 1 1 2 2 2 2 3 3 ...
## $ SHADE : int 1 2 3 4 1 2 3 4 1 2 ...
## $ X : logi NA NA NA NA NA NA ...
## $ SQ2 : num 4.36 3.32 3.61 4.12 4.47 ...
## $ B2 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ W2 : int 1 1 1 1 2 2 2 2 3 3 ...
## $ S2 : int 1 2 3 4 1 2 3 4 1 2 ...
The actual structure of the data is as below:
SQBLOOMS - square-root of the number of blooms; continuous variable.
BED and B2 - bed number; factor variable. NB. R has classified
this column as numerical data. Use as.factor() to
categorise it.
WATER and W2 - number of times watered per week; factor variable.
NB. R has classified this column as numerical data. Use
as.factor() to categorise it.
SHADE and S2 - level of shade given; factor variable. NB. R has
classified this column as numerical data. Use as.factor()
to categorise it.
blooms <- blooms %>%
mutate(across(c(BED, WATER, SHADE, B2, W2, S2), as.factor))
ggplot(data = blooms) +
geom_boxplot(aes(x = SHADE, y = SQBLOOMS, fill = SHADE),
color = "black", width = 0.5) +
facet_wrap(~WATER, strip.position = "bottom") +
labs(x = "Number of times watered per week", y = "Number of blooms",
fill = "Shade level",
title = "Fig. 4A: Number of blooms against level of shade and water") +
theme_minimal() +
theme(axis.text.x = element_blank())
There is no clear consistent trend across the level of water and shade from Fig. 4A.
# Type I SS
mod4.1 <- lm(formula = SQBLOOMS~BED + WATER + SHADE, data = blooms)
anova(mod4.1)
## Analysis of Variance Table
##
## Response: SQBLOOMS
## Df Sum Sq Mean Sq F value Pr(>F)
## BED 2 4.1323 2.06614 9.4570 0.0007277
## WATER 2 3.7153 1.85767 8.5029 0.0013016
## SHADE 3 1.6465 0.54882 2.5120 0.0789451
## Residuals 28 6.1173 0.21848
# Type II SS
Anova(mod4.1, type = "2")
## Anova Table (Type II tests)
##
## Response: SQBLOOMS
## Sum Sq Df F value Pr(>F)
## BED 4.1323 2 9.4570 0.0007277
## WATER 3.7153 2 8.5029 0.0013016
## SHADE 1.6465 3 2.5120 0.0789451
## Residuals 6.1173 28
The type I ANOVA results for each factor are:
BED: \(F_{2, 28} = 9.457,
P < 0.001\). BED is a statistically significant
factor.
WATER: \(F_{2,28} = 8.503,
P = 0.001\). WATER is a statistically significant
factor.
SHADE: \(F_{3,28} = 2.512,
P = 0.079\). SHADE is not a statistically
significant factor.
Using type II SS gives the same P-values and sums of squares (and F-ratios) as running a type I SS. This shows that the analysis is orthogonal. We can also check by counting the number of samples (beds) per water and shade level.
blooms %>%
count(WATER, SHADE)
## WATER SHADE n
## 1 1 1 3
## 2 1 2 3
## 3 1 3 3
## 4 1 4 3
## 5 2 1 3
## 6 2 2 3
## 7 2 3 3
## 8 2 4 3
## 9 3 1 3
## 10 3 2 3
## 11 3 3 3
## 12 3 4 3
From the table, we can see that there are 3 beds per
WATER and SHADE factor levels. Thus, the
design is orthogonal.
He then wondered whether it had been worth treating the beds as blocks, or whether future experiments could be fully randomised across beds. So he repeated the analysis without using BED as a blocking factor.
mod4.2 <- lm(formula = SQBLOOMS ~ WATER + SHADE, data = blooms)
anova(mod4.2)
## Analysis of Variance Table
##
## Response: SQBLOOMS
## Df Sum Sq Mean Sq F value Pr(>F)
## WATER 2 3.7153 1.85767 5.4373 0.009661
## SHADE 3 1.6465 0.54882 1.6064 0.208609
## Residuals 30 10.2496 0.34165
summary(mod4.2)
##
## Call:
## lm(formula = SQBLOOMS ~ WATER + SHADE, data = blooms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.68236 -0.32844 0.02464 0.35897 1.10197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7145 0.2386 15.566 6.52e-16
## WATER2 0.7842 0.2386 3.286 0.00259
## WATER3 0.4489 0.2386 1.881 0.06967
## SHADE2 0.1969 0.2755 0.715 0.48041
## SHADE3 -0.2157 0.2755 -0.783 0.43994
## SHADE4 -0.3673 0.2755 -1.333 0.19252
##
## Residual standard error: 0.5845 on 30 degrees of freedom
## Multiple R-squared: 0.3435, Adjusted R-squared: 0.234
## F-statistic: 3.139 on 5 and 30 DF, p-value: 0.0214
The type I results when BED is removed is:
WATER: \(F_{1,33} = 5.437,
P = 0.010\). WATER is a statistically significant
factor.
SHADE: \(F_{1,33} = 1.606,
P = 0.209\). SHADE is not a statistically
significant factor.
Comparing the model coefficients here to table 1, sums of squares and
p-values of WATER and SHADE remained the same,
but the residual sums of squares have increased to \(SS_{1,33}(Residuals) = 13.37\). Likewise,
standard error of residuals has increased from 0.608 to 0.637
(presumably, this error was accounted for by BED in the
previous model. This shows that it was worthwhile including
BED as a blocking factor, since including it reduced
residuals, and importantly, because the experimental design blocked
measurements by bed.
He then discovered that a visitor had picked carnations from three of his experimental plots. He decided that because the final bloom numbers from these plots were inaccurate, he would omit them from his analysis. So he produced a new, shorter dataset with variables SQ2, B2, W2 and S2.
##4D. Fit the model SQ2 = B2 + W2 + S2. Which parts of the output are different in this third model, but were the same in the first two, and why?
mod4.3 <- lm(formula = SQ2 ~ B2 + W2 + S2, data = blooms)
anova(mod4.3)
## Analysis of Variance Table
##
## Response: SQ2
## Df Sum Sq Mean Sq F value Pr(>F)
## B2 2 2.7626 1.38129 8.3790 0.001641
## W2 2 5.0793 2.53966 15.4057 4.367e-05
## S2 3 0.8072 0.26906 1.6321 0.207156
## Residuals 25 4.1213 0.16485
Anova(mod4.3, type = "2")
## Anova Table (Type II tests)
##
## Response: SQ2
## Sum Sq Df F value Pr(>F)
## B2 2.6490 2 8.0346 0.00202
## W2 4.6764 2 14.1836 7.644e-05
## S2 0.8072 3 1.6321 0.20716
## Residuals 4.1213 25
The analysis is no longer orthogonal. Unlike in 4B, type I and II
results are different. However, what is consistent is that
WATER is still statistically significant.