Load libraries.

library(gridExtra)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggfortify)

# Packages for raincloud plots
library(ggdist)      # for stat_halfeye()
library(ggbeeswarm)  # for nicer jitter

# Packages for modeling
library(lme4)
library(car)
library(lmerTest)

Question 1: Blocking

Load Newts dataset and check the data.

# Read the "Newts" sheet
newts <- read.csv("/Users/joylynngoh/Files/Uni/Y4/ES3307/Assignment 2/Assignment 2 data/Newts.csv")


# Quick check of data structure
head(newts)
##     SVL Crest Pond Date  LSVL LCREST
## 1 108.9 10.12    4   27 2.037  1.005
## 2  94.7  5.62    7   37 1.976  0.750
## 3  96.9  2.00    7   28 1.986  0.302
## 4 110.1 15.04   10   24 2.042  1.177
## 5 103.7  3.88    6    9 2.016  0.589
## 6  99.3  4.45    8   12 1.997  0.648
str(newts)
## 'data.frame':    120 obs. of  6 variables:
##  $ SVL   : num  108.9 94.7 96.9 110.1 103.7 ...
##  $ Crest : num  10.12 5.62 2 15.04 3.88 ...
##  $ Pond  : int  4 7 7 10 6 8 7 6 7 2 ...
##  $ Date  : int  27 37 28 24 9 12 32 16 15 25 ...
##  $ LSVL  : num  2.04 1.98 1.99 2.04 2.02 ...
##  $ LCREST: num  1.005 0.75 0.302 1.177 0.589 ...
# Check for NA values in each column, and count rows that are entirely NA
colSums(is.na(newts[c("SVL", "Crest", "Pond", "Date", "LSVL", "LCREST")]))
##    SVL  Crest   Pond   Date   LSVL LCREST 
##     33     33     33     33     33     33
sum(rowSums(is.na(newts)) == ncol(newts))
## [1] 33

The actual structure of the data is as below:

  • LSVL - log of snout-vent length in mm; continuous variable.

  • LCREST - log of the height of the dorsal crest in mm; continuous variable.

  • POND - A code from 1 to 10 for the pond at which the male was captured; categorical variable. NB. R has classified this column as integer data. Use as.factor() to categorise it.

  • DATE - The day of the study on which the male was measured; ordinal variable (the order of days matters because the sampling period was during the breeding season, so the day of measurement might matter). NB. R has classified this column as integer data; I keep it numeric so that it is analysed as a covariate.

There are 33 rows consisting entirely of NA values (i.e. no data). Remove these rows and factorise Pond.

newtsClean <- na.omit(newts)
newtsClean$Pond <- as.factor(newtsClean$Pond)

1A. Show graphically how the size of the dorsal crest (LCREST) varies with the body size of the newts (LSVL) and POND.

# Raincloud plot
ggplot(newtsClean, aes(x = Pond, y = LCREST, fill = Pond)) +
  # set half-violin
  stat_halfeye(adjust = 0.6, width = 0.6, justification = -0.25, 
               slab_alpha = 0.6, .width = 0, point_interval = NULL) +
  # add boxplot
  geom_boxplot(width = 0.15, outlier.shape = NA, 
               position = position_nudge(x = 0.15)) +
  # add rain points
  geom_quasirandom(width = 0.05, alpha = 0.7, size = 1.8, color = "grey10",
                   position = position_nudge(x = 0.25)) +
  guides(fill = "none") +
  labs(x = "Pond", y = "Log Dorsal Crest Height (mm)",
       title = "Fig 1A: Relationship between LCREST and LSVL by Pond") +
  theme_minimal()

# Scatterplot with regression line
ggplot(newtsClean, aes(x = LSVL, y = LCREST)) +
  geom_point(aes(color = factor(Pond))) +
  geom_smooth(method = "lm", color = "black") +
  labs(x = "Log snout-vent length (mm)", y = "Log Dorsal Crest Height (mm)",
       color = "Pond",
       title = "Fig 1B: Scatterplot of LCREST against LSVL") +
  theme_minimal()

In general, dorsal crest height seems to increase slightly with body size (Fig. 1B). LCREST does not seem to vary much between ponds (Fig. 1A), but the spread of the data varies considerably by pond: some ponds, such as 1 and 9, have narrower distributions (pond 1 has only 3 data points), while others, such as 7 and 10, have many more data points and a wider spread.
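
To back up the point about uneven sampling across ponds (a quick check I would add here; output not shown), the number of newts measured in each pond can be counted:

# Hedged check (not part of the original output): sample size per pond
newtsClean %>% count(Pond)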

1B. Analyse this dataset to investigate if the size of the dorsal crest (LCREST) reflects the body size of the newts (LSVL).

Here, the hypothesis is that there is a relationship between the body size of a newt and the size of its dorsal crest.

options(show.signif.stars = FALSE)
mod1.1 <- lm(formula = LCREST ~ LSVL, data = newtsClean)
summary(mod1.1)
## 
## Call:
## lm(formula = LCREST ~ LSVL, data = newtsClean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54974 -0.15546  0.00396  0.16713  0.52165 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -9.3806     1.5013  -6.248 1.59e-08
## LSVL          5.0870     0.7516   6.768 1.58e-09
## 
## Residual standard error: 0.2284 on 85 degrees of freedom
## Multiple R-squared:  0.3502, Adjusted R-squared:  0.3425 
## F-statistic: 45.81 on 1 and 85 DF,  p-value: 1.579e-09

\(F_{1, 85} = 45.81, P < 0.0001\). From these results, body size does seem to have a statistically significant effect on dorsal crest height. Plot residual diagnostics to check if the model fits the data well.

autoplot(mod1.1, which = c(1:3, 5),
         ncol = 2, label.size = 3, label.repel = TRUE) + 
  theme_minimal()

Briefly,

  1. Residuals vs Fitted: Residuals are evenly spread around a roughly horizontal line with no distinct patterns, and there is no sign of a non-linear relationship. The assumption of linearity is met.

  2. Q-Q Plot: Residuals follow the normal distribution line for the most part, and are thus normally distributed. Assumption of normality is met.

  3. Scale-location: Residuals are spread evenly along the range of fitted values around a roughly horizontal line, although the line rises slightly between fitted values of about 0.9 and 1.1. Variance is mostly homogeneous, so the assumption of homoscedasticity is met.

  4. Residuals vs Leverage: Residuals 41 and 29 are flagged as potentially influential cases, but none of the points fall outside the Cook’s distance lines.

Overall, the diagnostic plots indicate that the model LCREST ~ LSVL fits the data well.

1C. Run an analysis with POND as a blocking factor. Does POND seem to matter in this case? Should you keep it in the model?

By nature of the experimental design, Pond is a blocking factor and should be kept in the model. However, referencing Fig. 1A, Pond does not seem to have a strong effect on LCREST, since the medians do not differ much between ponds. I will confirm this by including Pond as a random effect.

mod1.2 <- lmer(formula = LCREST ~ LSVL + (1|Pond), data = newtsClean)
summary(mod1.2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: LCREST ~ LSVL + (1 | Pond)
##    Data: newtsClean
## 
## REML criterion at convergence: -7.7
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.40704 -0.68067  0.01732  0.73178  2.28405 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Pond     (Intercept) 0.00000  0.0000  
##  Residual             0.05216  0.2284  
## Number of obs: 87, groups:  Pond, 10
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)
## (Intercept)  -9.3806     1.5013 85.0000  -6.248 1.59e-08
## LSVL          5.0870     0.7516 85.0000   6.768 1.58e-09
## 
## Correlation of Fixed Effects:
##      (Intr)
## LSVL -1.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

In this model the estimated variance for Pond is 0, so Pond explains none of the variance remaining after the fixed effect of LSVL. The output also flags a singular fit, i.e. the random-effect variance is estimated at its boundary of zero. Based on this, I would not include Pond as a random blocking factor.
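
As a cross-check (a minimal sketch, not part of the original analysis; output not shown), lmerTest provides ranova(), a likelihood-ratio test for dropping the random Pond intercept; with a variance estimate of zero this test should come out non-significant.

# Hedged cross-check: likelihood-ratio test for removing (1 | Pond)
ranova(mod1.2)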

For practice, I will also include Pond as a fixed effect.

mod1.3 <- lm(formula = LCREST ~ LSVL + Pond, data = newtsClean)
mod1.4 <- lm(formula = LCREST ~ Pond + LSVL, data = newtsClean)
anova(mod1.3)
## Analysis of Variance Table
## 
## Response: LCREST
##           Df Sum Sq Mean Sq F value   Pr(>F)
## LSVL       1 2.3894 2.38939 43.3078 5.35e-09
## Pond       9 0.2406 0.02674  0.4846   0.8807
## Residuals 76 4.1931 0.05517
anova(mod1.4)
## Analysis of Variance Table
## 
## Response: LCREST
##           Df Sum Sq Mean Sq F value   Pr(>F)
## Pond       9 0.3252 0.03613  0.6549   0.7466
## LSVL       1 2.3048 2.30483 41.7751 8.84e-09
## Residuals 76 4.1931 0.05517

When a type I (sequential) ANOVA is run, switching the order of Pond and LSVL does not change the conclusion about Pond. With LSVL placed first, \(F_{9,76}(Pond) = 0.485, P = 0.881\); with Pond placed first, \(F_{9,76}(Pond) = 0.655, P = 0.747\). Either way, LSVL remains the most influential predictor of dorsal crest height. To account for Pond as a blocking factor, Pond should be entered first (i.e. LSVL placed last), or a type II ANOVA can be run instead.
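
For completeness (a hedged sketch; output not shown), the order dependence of type I SS can be avoided by testing each term as if it were entered last, using a type II ANOVA:

# Hedged sketch: order-independent (type II) tests of LSVL and Pond
Anova(mod1.3, type = "2")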

1D. To account for temporal variation, run an analysis with DATE as a covariate in the model.

I treat temporal variation (Date) as a fixed covariate since, as noted in the data description, the sampling period fell within the breeding season. I also keep Pond as a fixed blocking factor, as concluded in 1C.

Run a type II ANOVA to investigate the effects of LSVL, Date, and Pond on LCREST, with each term assessed as if it were placed last in the model.

mod1.5 <- lm(formula = LCREST ~ LSVL + Date + Pond, data = newtsClean)
Anova(mod1.5, type = "2")
## Anova Table (Type II tests)
## 
## Response: LCREST
##           Sum Sq Df F value    Pr(>F)
## LSVL      2.3257  1 42.0896 8.336e-09
## Date      0.0490  1  0.8859    0.3496
## Pond      0.2222  9  0.4467    0.9050
## Residuals 4.1441 75

The type II ANOVA shows that, when each variable is assessed as if it were placed last in the model, Date and Pond have no statistically significant effect on LCREST.

  • Date: \(F_{1,75} = 0.886, P = 0.350\)

  • Pond: \(F_{9,75} = 0.447, P = 0.905\)

  • LSVL: \(F_{1,75} = 42.09, P < 0.001\)

As with 1C, only LSVL has a statistically significant effect on LCREST. Thus, I would use a type II ANOVA with all 3 variables.


Question 2: Pseudoreplication

Load sheep dataset and check the data.

# Read the "Newts" sheet
sheep <- read.csv("/Users/joylynngoh/Files/Uni/Y4/ES3307/Assignment 2/Assignment 2 data/sheep.csv")


# Quick check of data structure
head(sheep)
##   Duration nlookups sex sheep sheepn obsper  luprate Sex2 sheep2 avluprate
## 1       39        3   1     1      1      1 0.076923    1      1   0.13270
## 2       40        7   1     1      1      2 0.175000    1      2   0.23485
## 3       39        3   1     1      1      3 0.076923    1      3   0.16753
## 4       34        7   1     1      1      4 0.205882    2      4   0.19490
## 5       31        6   1     1      1      5 0.193548    2      5   0.21700
## 6       37        5   1     1      1      6 0.135135    2      6   0.32280
str(sheep)
## 'data.frame':    120 obs. of  10 variables:
##  $ Duration : int  39 40 39 34 31 37 38 36 36 37 ...
##  $ nlookups : int  3 7 3 7 6 5 3 4 2 5 ...
##  $ sex      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ sheep    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ sheepn   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ obsper   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ luprate  : num  0.0769 0.175 0.0769 0.2059 0.1935 ...
##  $ Sex2     : int  1 1 1 2 2 2 NA NA NA NA ...
##  $ sheep2   : int  1 2 3 4 5 6 NA NA NA NA ...
##  $ avluprate: num  0.133 0.235 0.168 0.195 0.217 ...

The actual structure of the data is as below:

  • DURATION - duration of feeding time in minutes; continuous variable.

  • NLOOKUPS - The number of lookups; count data.

  • SEX - Coded as 1 for female and 2 for male; categorical variable. NB. R has classified this column as integer data. Use as.factor() to categorise it.

  • SHEEP - Coded as 1 to 6; categorical variable. NB. R has classified this column as integer data. Use as.factor() to categorise it.

  • OBSPER - The number of the observation period from 1 to 20; categorical variable. NB. R has classified this column as integer data. Use as.factor() to categorise it.

  • LUPRATE - NLOOKUPS/DURATION, i.e. the rate of lookups per minute; rate variable.

sheep <- sheep %>%
  mutate(across(c(sex, sheep, obsper, Sex2, sheep2), as.factor))

str(sheep)
## 'data.frame':    120 obs. of  10 variables:
##  $ Duration : int  39 40 39 34 31 37 38 36 36 37 ...
##  $ nlookups : int  3 7 3 7 6 5 3 4 2 5 ...
##  $ sex      : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sheep    : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sheepn   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ obsper   : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ luprate  : num  0.0769 0.175 0.0769 0.2059 0.1935 ...
##  $ Sex2     : Factor w/ 2 levels "1","2": 1 1 1 2 2 2 NA NA NA NA ...
##  $ sheep2   : Factor w/ 6 levels "1","2","3","4",..: 1 2 3 4 5 6 NA NA NA NA ...
##  $ avluprate: num  0.133 0.235 0.168 0.195 0.217 ...

2A. Fit a linear model of LUPRATE = OBSPER + SEX. Criticise this analysis.

Hypothesis: there is a statistically significant difference in the lookup rates of male and female sheep.

mod2.1 <- lm(formula = luprate~obsper + sex, data = sheep)
summary(mod2.1)
## 
## Call:
## lm(formula = luprate ~ obsper + sex, data = sheep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10565 -0.04618 -0.01251  0.03791  0.14655 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.148210   0.027093   5.470 3.38e-07
## obsper2      0.020979   0.037392   0.561  0.57602
## obsper3     -0.023988   0.037392  -0.642  0.52266
## obsper4      0.049120   0.037392   1.314  0.19200
## obsper5      0.101339   0.037392   2.710  0.00793
## obsper6      0.020712   0.037392   0.554  0.58089
## obsper7     -0.024353   0.037392  -0.651  0.51637
## obsper8      0.022641   0.037392   0.606  0.54623
## obsper9     -0.018448   0.037392  -0.493  0.62284
## obsper10     0.056422   0.037392   1.509  0.13449
## obsper11     0.031622   0.037392   0.846  0.39976
## obsper12    -0.034530   0.037392  -0.923  0.35800
## obsper13     0.072037   0.037392   1.927  0.05690
## obsper14     0.005201   0.037392   0.139  0.88966
## obsper15     0.046917   0.037392   1.255  0.21252
## obsper16     0.085124   0.037392   2.277  0.02497
## obsper17     0.008613   0.037392   0.230  0.81829
## obsper18     0.049815   0.037392   1.332  0.18583
## obsper19     0.028000   0.037392   0.749  0.45573
## obsper20     0.105948   0.037392   2.833  0.00558
## sex2         0.066537   0.011824   5.627 1.71e-07
## 
## Residual standard error: 0.06476 on 99 degrees of freedom
## Multiple R-squared:  0.4388, Adjusted R-squared:  0.3255 
## F-statistic: 3.871 on 20 and 99 DF,  p-value: 3.403e-06

An ANOVA of this model (output not shown) indicates that both observation period and sex have statistically significant effects on lookup rate:

  • obsper: \(F_{19, 99} = 2.41, P = 0.003\)

  • sex: \(F_{1, 99} = 31.66, P < 0.0001\) (equal to \(t^2\) for sex2 in the summary above)

Plot residual plots to check suitability of model.

autoplot(mod2.1, which = c(1:3, 5),
         ncol = 2, label.size = 3, label.repel = TRUE) + 
  theme_minimal()

  1. Residuals vs Fitted: no obvious non-linear pattern; the line is roughly horizontal and the residuals are fairly evenly spread.

  2. Q-Q plot: the residuals deviate from the reference line at both tails. The residuals look somewhat non-normal, which could indicate that the model is not the best fit for the data.

  3. Scale-location: variance looks roughly homogeneous.

  4. Residuals vs factor levels: no points cross the Cook’s distance lines.

Generally, three data points (106, 110, 113) look like mild outliers, but they are not far from the rest of the data.

To confirm these results, I will also plot LUPRATE against obsper and sex to check if any trends are visually clear.

ggplot(sheep) +
  geom_boxplot(aes(x = sex, y = luprate, fill = sex), color = "black", size = 0.25) +
  facet_wrap(~obsper, ncol = 10) +
  scale_fill_discrete(name = "Sex", labels = c("1" = "Female", "2" = "Male")) +
  labs(x = "Observation period", y = "Look-up rate (number/duration)", 
       title = "Fig. 2A: Lookup rate between sexes grouped by observation period") +
  theme_minimal()

From Fig. 2A, there does not seem to be a clear trend between observation period and look-up rate. obsper might appear statistically significant in the model because of pseudoreplication and the resulting inflation of degrees of freedom. Including obsper as a fixed effect is therefore probably not the best way to isolate the effect of sex on lookup rate.

The main issue with this analysis is that it treats every measurement from every sheep as independent (the degrees of freedom in the model add up to 99 + 19 + 1 + 1 = 120 data points), which is not true. In reality there are only 6 sheep, i.e. 6 independent samples, so the model is pseudoreplicated. The analysis should account for this by treating sheep as a blocking factor and observation period as another blocking factor.
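
As a quick hedged check of the number of truly independent units (not part of the original output):

# Hedged check: 6 individual sheep vs 120 rows of data
n_distinct(sheep$sheep)
nrow(sheep)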

2B. Suggest and execute a more appropriate analysis.

I will use a mixed model to account for pseudoreplication, treating the individual sheep as a random blocking factor and observation period as a temporal random blocking factor. From the experimental design, I understand that all 6 sheep were observed at the same time within each hour, so sheep is crossed with obsper.
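
As a hedged check of this crossed structure (output not shown), a cross-tabulation of sheep by observation period should contain exactly one observation per cell:

# Hedged check: one observation per sheep per observation period
xtabs(~ sheep + obsper, data = sheep)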

mod2.2 <- lmer(formula = luprate~sex + (1|sheep) + (1|obsper), data = sheep)
summary(mod2.2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: luprate ~ sex + (1 | sheep) + (1 | obsper)
##    Data: sheep
## 
## REML criterion at convergence: -388.4
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.90722 -0.58194 -0.06396  0.58800  2.11301 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  obsper   (Intercept) 0.001472 0.03837 
##  sheep    (Intercept) 0.003621 0.06018 
##  Residual             0.001268 0.03561 
## Number of obs: 120, groups:  obsper, 20; sheep, 6
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)
## (Intercept)  0.17837    0.03608 4.48935   4.944  0.00576
## sex2         0.06654    0.04956 4.00000   1.342  0.25056
## 
## Correlation of Fixed Effects:
##      (Intr)
## sex2 -0.687

Based on the random-effects results, both obsper and sheep account for some of the variance in the model, with sheep accounting for more of it than obsper. We could consider treating sheep as a fixed effect, but since it is not the focus of the analysis it is better left as a random effect. With these random factors accounted for, sex does not have a statistically significant effect on lookup rate (\(t_{4} = 1.34, P = 0.251\)).
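
To quantify this variance partitioning (a hedged sketch; values not recomputed here), the variance components can be extracted from the fitted model and expressed as proportions of the total:

# Hedged sketch: proportion of total variance from obsper, sheep, and residual
vc <- as.data.frame(VarCorr(mod2.2))
vc$prop <- vc$vcov / sum(vc$vcov)
vc[, c("grp", "vcov", "prop")]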


Question 3: Sums of squares

Load weight dataset and check the data.

# Read the "weight" sheet
weight <- read.csv("/Users/joylynngoh/Files/Uni/Y4/ES3307/Assignment 2/Assignment 2 data/weight.csv")


# Quick check of data structure
head(weight)
##   FOREARM    HT    WT
## 1   4.868 159.4 70.69
## 2   4.671 163.9 63.93
## 3   2.995 158.1 60.39
## 4   3.544 148.1 56.68
## 5   1.719 155.1 56.61
## 6   6.297 153.9 63.72
str(weight)
## 'data.frame':    39 obs. of  3 variables:
##  $ FOREARM: num  4.87 4.67 3 3.54 1.72 ...
##  $ HT     : num  159 164 158 148 155 ...
##  $ WT     : num  70.7 63.9 60.4 56.7 56.6 ...

The actual structure of the data is as below:

  • FOREARM - thickness of skinfold in the forearm in mm; continuous variable.

  • HT - height of individual in cm; continuous variable.

  • WT - weight of individual in kg; continuous variable.

3A. Using two separate plots, show graphically how FOREARM varies with HT and WT.

htFA <- ggplot(data = weight, aes(x = HT, y = FOREARM)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  ggtitle("Fig. 3A: Relationship between height and forearm skinfold thickness ")  

wtFA <- ggplot(data = weight, aes(x = WT, y = FOREARM)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  ggtitle("Fig. 3B: Relationship between weight and forearm skinfold thickness")  

# Arrange the two plots one above the other
grid.arrange(htFA, wtFA, nrow = 2)

From Fig. 3A, there seems to be little correlation between height and forearm skinfold thickness; the points are loosely scattered around the line of best fit. Fig. 3B shows a slightly clearer positive correlation between weight and skinfold thickness (thickness seems to increase with weight), although the points are also not closely clustered around the line of best fit.

3B. Taking FOREARM as the response variable, which of the two explanatory variables, HT and WT, is the better predictor of obesity when used alone in a linear model?

mod3.1 <- lm(formula = FOREARM~HT, data = weight)
mod3.2 <- lm(formula = FOREARM~WT, data = weight)

anova(mod3.1)
## Analysis of Variance Table
## 
## Response: FOREARM
##           Df  Sum Sq Mean Sq F value Pr(>F)
## HT         1   0.944  0.9439  0.1754 0.6778
## Residuals 37 199.094  5.3809
anova(mod3.2)
## Analysis of Variance Table
## 
## Response: FOREARM
##           Df  Sum Sq Mean Sq F value   Pr(>F)
## WT         1  59.137  59.137  15.529 0.000347
## Residuals 37 140.901   3.808

Reporting results from ANOVA:

  • Height as the only predictor: \(F_{1, 37} = 0.175, P = 0.678\).

  • Weight as the only predictor: \(F_{1, 37} = 15.53, P < 0.001\).

Weight is a much better predictor of obesity when used alone: its p-value is well below the significance level, while the p-value for height is well above it. The F-ratios tell the same story: the variation explained by weight is much greater than its residual variation, whereas the variation explained by height is smaller than its residual variation.
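
To compare the two single-predictor models more directly (a hedged sketch; output not shown), their R-squared values and AIC could also be inspected:

# Hedged comparison of the two single-predictor models
summary(mod3.1)$r.squared  # FOREARM ~ HT
summary(mod3.2)$r.squared  # FOREARM ~ WT
AIC(mod3.1, mod3.2)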

3C. If both HT and WT are included as main effects (no interactions) in a linear model, do they increase or detract from each others’ informativeness, and why?

# Using type I SS but switch the order of HT and WT
mod3.3 <- lm(formula = FOREARM~HT + WT, data = weight)
mod3.4 <- lm(formula = FOREARM~WT + HT, data = weight)

anova(mod3.3)
## Analysis of Variance Table
## 
## Response: FOREARM
##           Df  Sum Sq Mean Sq F value    Pr(>F)
## HT         1   0.944   0.944  0.2926    0.5919
## WT         1  82.970  82.970 25.7220 1.207e-05
## Residuals 36 116.124   3.226
anova(mod3.4)
## Analysis of Variance Table
## 
## Response: FOREARM
##           Df  Sum Sq Mean Sq F value    Pr(>F)
## WT         1  59.137  59.137 18.3334 0.0001314
## HT         1  24.777  24.777  7.6813 0.0087755
## Residuals 36 116.124   3.226
# Using type II SS
Anova(mod3.3, type = "2")
## Anova Table (Type II tests)
## 
## Response: FOREARM
##            Sum Sq Df F value    Pr(>F)
## HT         24.777  1  7.6813  0.008775
## WT         82.970  1 25.7220 1.207e-05
## Residuals 116.124 36

Switching the order of HT and WT in a type I (sequential) ANOVA reveals whether the two predictors share explained variation. When HT is placed first, it does not contribute statistically significantly to the model (\(F_{1,36} = 0.293, P = 0.592\)). However, when HT is placed last (i.e. the variation explained by WT is accounted for first), its effect becomes statistically significant (\(F_{1,36} = 7.681, P = 0.009\)); these results match the type II test. Thus, placing HT first understates its true main effect because WT has not yet been accounted for, while WT remains statistically significant whether it is placed first or last (although its F-ratio and p-value change somewhat).

When a type II test is run, both HT and WT are statistically significant (\(F_{1,36}(HT) = 7.68, P = 0.009\); \(F_{1,36}(WT) = 25.72, P < 0.001\)), because each is assessed after the variation explained by the other has been accounted for. However, this does not by itself indicate which predictor is stronger. The type I tests suggest that WT is the stronger predictor of FOREARM, but the two effects are not mutually exclusive: HT and WT are correlated, so they share some of the explained variation.

cor(weight$HT, weight$WT)
## [1] 0.6293559

Indeed, height and weight are moderately correlated (r = 0.63), as expected.
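
A related hedged check (output not shown) is the variance inflation factor for the two-predictor model, which quantifies how much this correlation inflates the coefficient standard errors:

# Hedged check: variance inflation factors for FOREARM ~ HT + WT
vif(mod3.3)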

3D. What happens when you put HT vs WT first in the models?

This was covered in 3C: when HT is put first, it does not contribute statistically significantly to the model, but when WT is put first, HT becomes statistically significant.


Question 4: Orthogonality

Load blooms dataset and check the data.

# Read the "blooms" sheet
blooms <- read.csv("/Users/joylynngoh/Files/Uni/Y4/ES3307/Assignment 2/Assignment 2 data/blooms.csv")


# Quick check of data structure
head(blooms)
##   SQBLOOMS BED WATER SHADE  X   SQ2 B2 W2 S2
## 1    4.359   1     1     1 NA 4.359  1  1  1
## 2    3.317   1     1     2 NA 3.317  1  1  2
## 3    3.606   1     1     3 NA 3.606  1  1  3
## 4    4.123   1     1     4 NA 4.123  1  1  4
## 5    4.472   1     2     1 NA 4.472  1  2  1
## 6    4.583   1     2     2 NA 4.583  1  2  2
str(blooms)
## 'data.frame':    36 obs. of  9 variables:
##  $ SQBLOOMS: num  4.36 3.32 3.61 4.12 4.47 ...
##  $ BED     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ WATER   : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ SHADE   : int  1 2 3 4 1 2 3 4 1 2 ...
##  $ X       : logi  NA NA NA NA NA NA ...
##  $ SQ2     : num  4.36 3.32 3.61 4.12 4.47 ...
##  $ B2      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ W2      : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ S2      : int  1 2 3 4 1 2 3 4 1 2 ...

The actual structure of the data is as below:

  • SQBLOOMS - square-root of the number of blooms; continuous variable.

  • BED and B2 - bed number; factor variable. NB. R has classified this column as numerical data. Use as.factor() to categorise it.

  • WATER and W2 - number of times watered per week; factor variable. NB. R has classified this column as numerical data. Use as.factor() to categorise it.

  • SHADE and S2 - level of shade given; factor variable. NB. R has classified this column as numerical data. Use as.factor() to categorise it.

blooms <- blooms %>%
  mutate(across(c(BED, WATER, SHADE, B2, W2, S2), as.factor))

4A. Show graphically how the number of blooms varies with level of water and level of shade.

ggplot(data = blooms) +
  geom_boxplot(aes(x = SHADE, y = SQBLOOMS, fill = SHADE),
               color = "black", width = 0.5) + 
  facet_wrap(~WATER, strip.position = "bottom") +
  labs(x = "Number of times watered per week", y = "Number of blooms", 
       fill = "Shade level",
       title = "Fig. 4A: Number of blooms against level of shade and water") +
  theme_minimal() +
  theme(axis.text.x  = element_blank())

Fig. 4A shows no clear, consistent trend in blooms across the water and shade levels.
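
As a hedged numerical companion to Fig. 4A (not part of the original output), the mean square-root bloom count per WATER and SHADE combination can be tabulated:

# Hedged summary: mean SQBLOOMS for each water/shade combination
blooms %>%
  group_by(WATER, SHADE) %>%
  summarise(mean_sqblooms = mean(SQBLOOMS), .groups = "drop")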

4B. Fit a linear model of SQBLOOMS = BED + WATER + SHADE. Is the analysis orthogonal?

# Type I SS
mod4.1 <- lm(formula = SQBLOOMS~BED + WATER + SHADE, data = blooms)
anova(mod4.1)
## Analysis of Variance Table
## 
## Response: SQBLOOMS
##           Df Sum Sq Mean Sq F value    Pr(>F)
## BED        2 4.1323 2.06614  9.4570 0.0007277
## WATER      2 3.7153 1.85767  8.5029 0.0013016
## SHADE      3 1.6465 0.54882  2.5120 0.0789451
## Residuals 28 6.1173 0.21848
# Type II SS
Anova(mod4.1, type = "2")
## Anova Table (Type II tests)
## 
## Response: SQBLOOMS
##           Sum Sq Df F value    Pr(>F)
## BED       4.1323  2  9.4570 0.0007277
## WATER     3.7153  2  8.5029 0.0013016
## SHADE     1.6465  3  2.5120 0.0789451
## Residuals 6.1173 28

The type I ANOVA results for each factor are:

  • BED: \(F_{2, 28} = 9.457, P < 0.001\). BED is a statistically significant factor.

  • WATER: \(F_{2,28} = 8.503, P = 0.001\). WATER is a statistically significant factor.

  • SHADE: \(F_{3,28} = 2.512, P = 0.079\). SHADE is not a statistically significant factor.

Using type II SS gives the same sums of squares, F-ratios, and P-values as the type I analysis, which shows that the analysis is orthogonal. We can also check this by counting the number of observations per combination of WATER and SHADE levels.

blooms %>%
  count(WATER, SHADE)
##    WATER SHADE n
## 1      1     1 3
## 2      1     2 3
## 3      1     3 3
## 4      1     4 3
## 5      2     1 3
## 6      2     2 3
## 7      2     3 3
## 8      2     4 3
## 9      3     1 3
## 10     3     2 3
## 11     3     3 3
## 12     3     4 3

From the table, we can see that there are 3 observations (one plot in each of the 3 beds) for every WATER × SHADE combination. The design is balanced and therefore orthogonal.
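
The same hedged check can be extended to BED (output not shown), since orthogonality of the full model also requires BED to be balanced against WATER and SHADE:

# Hedged check: each bed should contain every water and shade level equally often
with(blooms, table(BED, WATER))
with(blooms, table(BED, SHADE))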

He then wondered whether it had been worth treating the beds as blocks, or whether future experiments could be fully randomised across beds. So he repeated the analysis without using BED as a blocking factor.

4C. Fit a linear model of SQBLOOMS = WATER + SHADE. How were the residual error and sums of squares affected in this analysis? Was it worthwhile blocking for BED? If so, why?

mod4.2 <- lm(formula = SQBLOOMS ~ WATER + SHADE, data = blooms)
anova(mod4.2)
## Analysis of Variance Table
## 
## Response: SQBLOOMS
##           Df  Sum Sq Mean Sq F value   Pr(>F)
## WATER      2  3.7153 1.85767  5.4373 0.009661
## SHADE      3  1.6465 0.54882  1.6064 0.208609
## Residuals 30 10.2496 0.34165
summary(mod4.2)
## 
## Call:
## lm(formula = SQBLOOMS ~ WATER + SHADE, data = blooms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68236 -0.32844  0.02464  0.35897  1.10197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.7145     0.2386  15.566 6.52e-16
## WATER2        0.7842     0.2386   3.286  0.00259
## WATER3        0.4489     0.2386   1.881  0.06967
## SHADE2        0.1969     0.2755   0.715  0.48041
## SHADE3       -0.2157     0.2755  -0.783  0.43994
## SHADE4       -0.3673     0.2755  -1.333  0.19252
## 
## Residual standard error: 0.5845 on 30 degrees of freedom
## Multiple R-squared:  0.3435, Adjusted R-squared:  0.234 
## F-statistic: 3.139 on 5 and 30 DF,  p-value: 0.0214

The type I results when BED is removed are:

  • WATER: \(F_{2,30} = 5.437, P = 0.010\). WATER is a statistically significant factor.

  • SHADE: \(F_{3,30} = 1.606, P = 0.209\). SHADE is not a statistically significant factor.

Comparing these results to the ANOVA table in 4B, the sums of squares for WATER and SHADE remain the same, but the residual sum of squares has increased from 6.12 (28 df) to 10.25 (30 df), so their F-ratios shrink and their p-values rise. Likewise, the residual standard error has increased from about 0.47 to 0.58; presumably this variation was absorbed by BED in the previous model. This shows that it was worthwhile including BED as a blocking factor, since including it reduced the residual error and, importantly, because the experimental design blocked measurements by bed.
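
As an additional hedged check (output not shown), the blocked and unblocked models can be compared with a nested F-test, which asks whether adding BED significantly reduces the residual sum of squares:

# Hedged check: does blocking by BED significantly reduce residual error?
anova(mod4.2, mod4.1)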

He then discovered that a visitor had picked carnations from three of his experimental plots. He decided that because the final bloom numbers from these plots were inaccurate, he would omit them from his analysis. So he produced a new, shorter dataset with variables SQ2, B2, W2 and S2.

4D. Fit the model SQ2 = B2 + W2 + S2. Which parts of the output are different in this third model, but were the same in the first two, and why?

mod4.3 <- lm(formula = SQ2 ~ B2 + W2 + S2, data = blooms)
anova(mod4.3)
## Analysis of Variance Table
## 
## Response: SQ2
##           Df Sum Sq Mean Sq F value    Pr(>F)
## B2         2 2.7626 1.38129  8.3790  0.001641
## W2         2 5.0793 2.53966 15.4057 4.367e-05
## S2         3 0.8072 0.26906  1.6321  0.207156
## Residuals 25 4.1213 0.16485
Anova(mod4.3, type = "2")
## Anova Table (Type II tests)
## 
## Response: SQ2
##           Sum Sq Df F value    Pr(>F)
## B2        2.6490  2  8.0346   0.00202
## W2        4.6764  2 14.1836 7.644e-05
## S2        0.8072  3  1.6321   0.20716
## Residuals 4.1213 25

With three plots removed, the design is no longer balanced, so the analysis is no longer orthogonal. Unlike in 4B, the type I and type II results now differ: the sums of squares, F-ratios, and p-values for B2 and W2 depend on which other terms have already been fitted, and only S2 (which is adjusted for all other terms in both analyses) gives identical results. What does remain consistent is that WATER (W2) is still statistically significant.
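
A minimal hedged check of why orthogonality was lost (assuming the omitted plots are recorded as NA in SQ2) is to count the remaining observations per W2 and S2 combination; unequal cell counts confirm the design is no longer balanced:

# Hedged check: cell counts in the reduced dataset
blooms %>%
  filter(!is.na(SQ2)) %>%
  count(W2, S2)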