Question 1 = Blocking

looking at Newts dataset:

library(tidyverse)
library(palmerpenguins)
library(ggfortify)
library(dplyr)

newts<-read.csv("C:/Users/paul/OneDrive/Documents/ES3307/Newts.csv",header=TRUE)
head(newts)
##     SVL Crest Pond Date  LSVL LCREST
## 1 108.9 10.12    4   27 2.037  1.005
## 2  94.7  5.62    7   37 1.976  0.750
## 3  96.9  2.00    7   28 1.986  0.302
## 4 110.1 15.04   10   24 2.042  1.177
## 5 103.7  3.88    6    9 2.016  0.589
## 6  99.3  4.45    8   12 1.997  0.648
str(newts)
## 'data.frame':    87 obs. of  6 variables:
##  $ SVL   : num  108.9 94.7 96.9 110.1 103.7 ...
##  $ Crest : num  10.12 5.62 2 15.04 3.88 ...
##  $ Pond  : int  4 7 7 10 6 8 7 6 7 2 ...
##  $ Date  : int  27 37 28 24 9 12 32 16 15 25 ...
##  $ LSVL  : num  2.04 1.98 1.99 2.04 2.02 ...
##  $ LCREST: num  1.005 0.75 0.302 1.177 0.589 ...
summary(newts)
##       SVL             Crest             Pond            Date      
##  Min.   : 87.60   Min.   : 1.320   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 92.90   1st Qu.: 3.765   1st Qu.: 4.00   1st Qu.:11.50  
##  Median : 99.10   Median : 5.900   Median : 7.00   Median :22.00  
##  Mean   : 99.64   Mean   : 7.439   Mean   : 6.08   Mean   :21.99  
##  3rd Qu.:105.75   3rd Qu.: 9.065   3rd Qu.: 8.00   3rd Qu.:32.00  
##  Max.   :112.40   Max.   :25.940   Max.   :10.00   Max.   :45.00  
##       LSVL           LCREST      
##  Min.   :1.942   Min.   :0.1200  
##  1st Qu.:1.968   1st Qu.:0.5755  
##  Median :1.996   Median :0.7710  
##  Mean   :1.997   Mean   :0.7792  
##  3rd Qu.:2.024   3rd Qu.:0.9575  
##  Max.   :2.051   Max.   :1.4140

1A. Show graphically how the size of the dorsal crest (LCREST) varies with the body size of the newts (LSVL) and POND. (Hint: Check your data structure to see if your variables are correctly specified i.e. continuous vs categorical.

library(ggplot2)

newts$Pond<-as.factor(newts$Pond)

ggplot(newts, aes(x = LSVL, y = LCREST, shape = Pond, color = Pond)) +
  geom_point(alpha = 2, size = 2) +   
  labs(
    x = "Body size (log snout-vent length, mm)",
    y = "Dorsal crest size (log crest height, mm)",
    title = "Scatter plot showing Dorsal crest size vs body size across ponds",
    shape = "Pond",
    color = "Pond"
  ) +
  scale_shape_manual(values = 0:9) +  
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

1B. Analyse this dataset to investigate if the size of the dorsal crest (LCREST) reflects the body size of the newts (LSVL).

m1<-lm(LCREST~LSVL,newts)
par(mfrow=c(2,2))
plot(m1)

The diagnostic plots suggest that the data follows the assumptions of the linear model relatively well. The residuals vs Fitted plot shows an even distribution of residuals suggesting homogeneity of variance.The QQ residuals plot demonstrates most of the residuals in alignment with the model, apart from two outliers, suggesting normally distributed data. The scale-location plot shows an even spread of residuals suggesting homogeneity of residuals.The Residuals vs Leverage plot suggest that the outliers are not sufficiant enough to significantly influence the results as they donot exceed the cook’s distance. Therefore, an ANOVA test can be done.

anova(m1)
## Analysis of Variance Table
## 
## Response: LCREST
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## LSVL       1 2.3894 2.38939  45.808 1.579e-09 ***
## Residuals 85 4.4337 0.05216                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA showed that there is a statistically significant difference in the size of the dorsal crest (LCREST) based on the body size of the newts (LSVL) (F1,85 = 45.8, p<0.001)

1C. Run an analysis with POND as a blocking factor. Does POND seem to matter in this case? Should you keep it in the model?

m2<-lm(LCREST~LSVL*Pond,data=newts)
anova(m2)
## Analysis of Variance Table
## 
## Response: LCREST
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## LSVL       1 2.3894 2.38939 42.6712 1.031e-08 ***
## Pond       9 0.2406 0.02674  0.4775    0.8848    
## LSVL:Pond  9 0.4414 0.04904  0.8759    0.5510    
## Residuals 67 3.7517 0.05600                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There doesn’t seem to be a statistically significant interaction effect of Pond as a blocking factor on LSVL (F9,67 = 0.876, p=0.551), however it still accounts for 17.9% of the variation caused by LSVL, suggesting that interaction between some of the ponds must be present.

ggplot(newts, aes(x = LSVL, y = LCREST, color = Pond)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, aes(linetype = Pond)) +  # regression per pond
  labs(
    x = "Body size (log snout-vent length, mm)",
    y = "Dorsal crest size (log height, mm)",
    title = "Interaction between body size and pond on crest size",
    color = "Pond",
    linetype = "Pond"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

The plot above shows an interaction of pond as a blocking factor in certain ponds but not enough to be considered a statistically significant interaction, suggesting that pond should not be included in the model.

1D. To account for temporal variation, run an analysis with DATE as a covariate in the model. (Hint: think about where DATE should be placed in the model.)

m3<-lm(LCREST~Date,data=newts)
anova(m3)
## Analysis of Variance Table
## 
## Response: LCREST
##           Df Sum Sq  Mean Sq F value Pr(>F)
## Date       1 0.0310 0.031014  0.3881  0.535
## Residuals 85 6.7921 0.079907

There is no statistically significant interaction between LCREST and the covariate Date (anova: F1,85 = 0.388, p = 0.535) and we have already demonstrated that the covariate Pond has no significant interaction. As a result of there being no significant interactions, we will run a type 2 SS

library(car)
m5<-lm(LCREST~Date+LSVL,data=newts)
Anova(m5, type="2")
## Anova Table (Type II tests)
## 
## Response: LCREST
##           Sum Sq Df F value    Pr(>F)    
## Date      0.0674  1  1.2972     0.258    
## LSVL      2.4258  1 46.6682 1.242e-09 ***
## Residuals 4.3663 84                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a significant difference between LCREST (Anova type 2: F1,84= 40.1, p<0.001) and LSVL (F1,84 = 46.7, p<0.001) based on the date. As a result, the date covariate has a significant interaction in the model and should be kept.

Question 2 - Pseudoreplication

2A. Fit a linear model of LUPRATE = OBSPER + SEX. Criticise this analysis.

sheep<-read.csv("C:/Users/paul/OneDrive/Documents/ES3307/sheep.csv",header=TRUE)
head(sheep)
##   Duration nlookups sex sheep sheepn obsper  luprate Sex2 sheep2 avluprate
## 1       39        3   1     1      1      1 0.076923    1      1   0.13270
## 2       40        7   1     1      1      2 0.175000    1      2   0.23485
## 3       39        3   1     1      1      3 0.076923    1      3   0.16753
## 4       34        7   1     1      1      4 0.205882    2      4   0.19490
## 5       31        6   1     1      1      5 0.193548    2      5   0.21700
## 6       37        5   1     1      1      6 0.135135    2      6   0.32280
str(sheep)
## 'data.frame':    120 obs. of  10 variables:
##  $ Duration : int  39 40 39 34 31 37 38 36 36 37 ...
##  $ nlookups : int  3 7 3 7 6 5 3 4 2 5 ...
##  $ sex      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ sheep    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ sheepn   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ obsper   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ luprate  : num  0.0769 0.175 0.0769 0.2059 0.1935 ...
##  $ Sex2     : int  1 1 1 2 2 2 NA NA NA NA ...
##  $ sheep2   : int  1 2 3 4 5 6 NA NA NA NA ...
##  $ avluprate: num  0.133 0.235 0.168 0.195 0.217 ...
summary(sheep)
##     Duration        nlookups          sex          sheep         sheepn 
##  Min.   :27.00   Min.   : 2.00   Min.   :1.0   Min.   :1.0   Min.   :1  
##  1st Qu.:34.75   1st Qu.: 6.00   1st Qu.:1.0   1st Qu.:2.0   1st Qu.:1  
##  Median :38.00   Median : 7.00   Median :1.5   Median :3.5   Median :2  
##  Mean   :38.09   Mean   : 7.85   Mean   :1.5   Mean   :3.5   Mean   :2  
##  3rd Qu.:41.00   3rd Qu.:10.00   3rd Qu.:2.0   3rd Qu.:5.0   3rd Qu.:3  
##  Max.   :53.00   Max.   :14.00   Max.   :2.0   Max.   :6.0   Max.   :3  
##                                                                         
##      obsper         luprate             Sex2         sheep2    
##  Min.   : 1.00   Min.   :0.05556   Min.   :1.0   Min.   :1.00  
##  1st Qu.: 5.75   1st Qu.:0.15790   1st Qu.:1.0   1st Qu.:2.25  
##  Median :10.50   Median :0.20521   Median :1.5   Median :3.50  
##  Mean   :10.50   Mean   :0.21164   Mean   :1.5   Mean   :3.50  
##  3rd Qu.:15.25   3rd Qu.:0.25659   3rd Qu.:2.0   3rd Qu.:4.75  
##  Max.   :20.00   Max.   :0.43333   Max.   :2.0   Max.   :6.00  
##                                    NA's   :114   NA's   :114   
##    avluprate     
##  Min.   :0.1327  
##  1st Qu.:0.1744  
##  Median :0.2059  
##  Mean   :0.2116  
##  3rd Qu.:0.2304  
##  Max.   :0.3228  
##  NA's   :114
sm1<-lm(luprate~obsper+sex, data=sheep)
summary(sm1)
## 
## Call:
## lm(formula = luprate ~ obsper + sex, data = sheep)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.142030 -0.053607 -0.007543  0.045107  0.182505 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.086954   0.023550   3.692 0.000339 ***
## obsper      0.002369   0.001119   2.117 0.036386 *  
## sex         0.066537   0.012907   5.155 1.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0707 on 117 degrees of freedom
## Multiple R-squared:  0.2098, Adjusted R-squared:  0.1962 
## F-statistic: 15.53 on 2 and 117 DF,  p-value: 1.045e-06

This anova suggests that there is a significant difference between look up rates in male and female sheep (F2,117 = 15.53, p<0.001). However, there is obvious pseudo replication in this experimental design by including OBSPER in the model, which ultimately has incorrectly inflated the degrees of freedom to 117, due to the increase in sample size. As a result, the P value is most likely much lower than it should be causing a type 1 error, which mistakenly rejects a null hypothesis when no statistical difference actually exists.

2B. Suggest and execute a more appropriate analysis.

sm2<-lm(avluprate~Sex2,data=sheep)
summary(sm2)
## 
## Call:
## lm(formula = avluprate ~ Sex2, data = sheep)
## 
## Residuals:
##        1        2        3        4        5        6 
## -0.04566  0.05649 -0.01083 -0.05000 -0.02790  0.07790 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.11182    0.07837   1.427    0.227
## Sex2         0.06654    0.04956   1.342    0.251
## 
## Residual standard error: 0.0607 on 4 degrees of freedom
##   (114 observations deleted due to missingness)
## Multiple R-squared:  0.3106, Adjusted R-squared:  0.1383 
## F-statistic: 1.802 on 1 and 4 DF,  p-value: 0.2506

A more appropriate analysis will look at the interaction between the mean look up rates of each of the three sheep in the two different sexes. This way the degrees of freedom will be correctly based on the sample size of 6, reducing the df by 114). As a result, this results of this anova suggest no significant difference bin the average look up rates of male and female sheep (F1,4= 1.802, P=0.251), proving that the pseudoreplication in the first analysis caused a type 1 error.

Question 3 - Sums of squares:

weight<-read.csv("C:/Users/paul/OneDrive/Documents/ES3307/weight.csv",header=TRUE)
head(sheep)
##   Duration nlookups sex sheep sheepn obsper  luprate Sex2 sheep2 avluprate
## 1       39        3   1     1      1      1 0.076923    1      1   0.13270
## 2       40        7   1     1      1      2 0.175000    1      2   0.23485
## 3       39        3   1     1      1      3 0.076923    1      3   0.16753
## 4       34        7   1     1      1      4 0.205882    2      4   0.19490
## 5       31        6   1     1      1      5 0.193548    2      5   0.21700
## 6       37        5   1     1      1      6 0.135135    2      6   0.32280
str(sheep)
## 'data.frame':    120 obs. of  10 variables:
##  $ Duration : int  39 40 39 34 31 37 38 36 36 37 ...
##  $ nlookups : int  3 7 3 7 6 5 3 4 2 5 ...
##  $ sex      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ sheep    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ sheepn   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ obsper   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ luprate  : num  0.0769 0.175 0.0769 0.2059 0.1935 ...
##  $ Sex2     : int  1 1 1 2 2 2 NA NA NA NA ...
##  $ sheep2   : int  1 2 3 4 5 6 NA NA NA NA ...
##  $ avluprate: num  0.133 0.235 0.168 0.195 0.217 ...
summary(sheep)
##     Duration        nlookups          sex          sheep         sheepn 
##  Min.   :27.00   Min.   : 2.00   Min.   :1.0   Min.   :1.0   Min.   :1  
##  1st Qu.:34.75   1st Qu.: 6.00   1st Qu.:1.0   1st Qu.:2.0   1st Qu.:1  
##  Median :38.00   Median : 7.00   Median :1.5   Median :3.5   Median :2  
##  Mean   :38.09   Mean   : 7.85   Mean   :1.5   Mean   :3.5   Mean   :2  
##  3rd Qu.:41.00   3rd Qu.:10.00   3rd Qu.:2.0   3rd Qu.:5.0   3rd Qu.:3  
##  Max.   :53.00   Max.   :14.00   Max.   :2.0   Max.   :6.0   Max.   :3  
##                                                                         
##      obsper         luprate             Sex2         sheep2    
##  Min.   : 1.00   Min.   :0.05556   Min.   :1.0   Min.   :1.00  
##  1st Qu.: 5.75   1st Qu.:0.15790   1st Qu.:1.0   1st Qu.:2.25  
##  Median :10.50   Median :0.20521   Median :1.5   Median :3.50  
##  Mean   :10.50   Mean   :0.21164   Mean   :1.5   Mean   :3.50  
##  3rd Qu.:15.25   3rd Qu.:0.25659   3rd Qu.:2.0   3rd Qu.:4.75  
##  Max.   :20.00   Max.   :0.43333   Max.   :2.0   Max.   :6.00  
##                                    NA's   :114   NA's   :114   
##    avluprate     
##  Min.   :0.1327  
##  1st Qu.:0.1744  
##  Median :0.2059  
##  Mean   :0.2116  
##  3rd Qu.:0.2304  
##  Max.   :0.3228  
##  NA's   :114

3A. Using two separate plots, show graphically how FOREARM varies with HT and WT.

par(mfrow=c(1,2))
ggplot(weight, aes(x = HT, y = FOREARM)) +
  geom_point(alpha = 0.7, color = "#2A9571", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "#1F4E79") +
  labs(
    x = "HEIGHT)",
    y = "Forearm length (FOREARM)",
    title = "Relationship between Forearm length and height"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

ggplot(weight, aes(x = WT, y = FOREARM)) +
  geom_point(alpha = 0.7, color = "#2A9571", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "#1F4E79") +
  labs(
    x = "weight",
    y = "Forearm length (FOREARM)",
    title = "Relationship between Forearm length and weight"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

3B. Taking FOREARM as the response variable, which of the two explanatory variables, HT and WT, is the better predictor of obesity when used alone in a linear model?

wm1<-lm(FOREARM~HT,data=weight)
summary(wm1)
## 
## Call:
## lm(formula = FOREARM ~ HT, data = weight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5203 -1.5167 -0.3876  1.4310  4.9971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.95880    9.82623   0.098    0.923
## HT           0.02605    0.06219   0.419    0.678
## 
## Residual standard error: 2.32 on 37 degrees of freedom
## Multiple R-squared:  0.004719,   Adjusted R-squared:  -0.02218 
## F-statistic: 0.1754 on 1 and 37 DF,  p-value: 0.6778

There is no significant interaction between Forearm and height measurements, suggesting that height is not a good predictor of obesity (F1,37 = 0.175, P=0.678)

wm2<-lm(FOREARM~WT,data=weight)
summary(wm2)
## 
## Call:
## lm(formula = FOREARM ~ WT, data = weight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4386 -1.3166 -0.5204  0.6831  5.1331 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.59476    2.47273  -1.858 0.071115 .  
## WT           0.15298    0.03882   3.941 0.000347 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.951 on 37 degrees of freedom
## Multiple R-squared:  0.2956, Adjusted R-squared:  0.2766 
## F-statistic: 15.53 on 1 and 37 DF,  p-value: 0.000347

There is a significant interaction between Forearm and weight measurements, suggesting that weight is a much better predictor of obesity than height (F1,37 = 15.53, P<0.001)

3C. If both HT and WT are included as main effects (no interactions) in a linear model, do they increase or detract from each others’ informativeness, and why? (hint: try running with different sums of squares).

wm3<-lm(FOREARM~WT+HT,data=weight)
anova(wm3)
## Analysis of Variance Table
## 
## Response: FOREARM
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## WT         1  59.137  59.137 18.3334 0.0001314 ***
## HT         1  24.777  24.777  7.6813 0.0087755 ** 
## Residuals 36 116.124   3.226                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(wm3,type="2")
## Anova Table (Type II tests)
## 
## Response: FOREARM
##            Sum Sq Df F value    Pr(>F)    
## WT         82.970  1 25.7220 1.207e-05 ***
## HT         24.777  1  7.6813  0.008775 ** 
## Residuals 116.124 36                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Including HT and WT as main effects in the model using type 1 SS, decreases the informativeness of the model as HT is seen to have a significant effect on Forearm (F1,36= 7.68, P=0.009).

Including HT and WT as main effects in the model but using type 11 SS doesn’t change the decreased informativess as the p-value of HT is the same, however, the statistical significance of WT in the model increases as the p value is smaller (F1,36= 25.72, P<0.001).

3D What happens when you put HT vs WT first in the models

wm4<-lm(FOREARM~HT+WT,data=weight)
anova(wm4)
## Analysis of Variance Table
## 
## Response: FOREARM
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## HT         1   0.944   0.944  0.2926    0.5919    
## WT         1  82.970  82.970 25.7220 1.207e-05 ***
## Residuals 36 116.124   3.226                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Putting HT first increases the informativeness of the model as HT is shown to have no statistical significance on forearm (F1,36= 0.293, p=0.592), suggesting that the order of the two explantory variables in the model matters.

Question 4 - Orthogonality

blooms<-read.csv("C:/Users/paul/OneDrive/Documents/ES3307/blooms.csv",header=TRUE)
head(blooms)
##   SQBLOOMS BED WATER SHADE  X   SQ2 B2 W2 S2
## 1    4.359   1     1     1 NA 4.359  1  1  1
## 2    3.317   1     1     2 NA 3.317  1  1  2
## 3    3.606   1     1     3 NA 3.606  1  1  3
## 4    4.123   1     1     4 NA 4.123  1  1  4
## 5    4.472   1     2     1 NA 4.472  1  2  1
## 6    4.583   1     2     2 NA 4.583  1  2  2
str(blooms)
## 'data.frame':    36 obs. of  9 variables:
##  $ SQBLOOMS: num  4.36 3.32 3.61 4.12 4.47 ...
##  $ BED     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ WATER   : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ SHADE   : int  1 2 3 4 1 2 3 4 1 2 ...
##  $ X       : logi  NA NA NA NA NA NA ...
##  $ SQ2     : num  4.36 3.32 3.61 4.12 4.47 ...
##  $ B2      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ W2      : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ S2      : int  1 2 3 4 1 2 3 4 1 2 ...
summary(blooms)
##     SQBLOOMS          BED        WATER       SHADE         X          
##  Min.   :2.449   Min.   :1   Min.   :1   Min.   :1.00   Mode:logical  
##  1st Qu.:3.571   1st Qu.:1   1st Qu.:1   1st Qu.:1.75   NA's:36       
##  Median :4.061   Median :2   Median :2   Median :2.50                 
##  Mean   :4.029   Mean   :2   Mean   :2   Mean   :2.50                 
##  3rd Qu.:4.583   3rd Qu.:3   3rd Qu.:3   3rd Qu.:3.25                 
##  Max.   :5.385   Max.   :3   Max.   :3   Max.   :4.00                 
##                                                                       
##       SQ2              B2             W2          S2       
##  Min.   :2.828   Min.   :1.00   Min.   :1   Min.   :1.000  
##  1st Qu.:3.606   1st Qu.:1.00   1st Qu.:1   1st Qu.:1.000  
##  Median :4.123   Median :2.00   Median :2   Median :2.000  
##  Mean   :4.094   Mean   :1.97   Mean   :2   Mean   :2.424  
##  3rd Qu.:4.583   3rd Qu.:3.00   3rd Qu.:3   3rd Qu.:3.000  
##  Max.   :5.385   Max.   :3.00   Max.   :3   Max.   :4.000  
##  NA's   :3       NA's   :3      NA's   :3   NA's   :3

4A. Show graphically how the number of blooms varies with level of water and level of shade.

library(ggplot2)
library(gghalves)
library(dplyr)

blooms <- blooms %>%
  mutate(
    Water_bin = cut(WATER, breaks = 3, labels = c("1", "2", "3")),
    Shade_bin = cut(SHADE, breaks = 4, labels = c("none", "1/4", "1/2","fully"))
  )

ggplot(blooms) +
  geom_boxplot(aes(x = Water_bin, y = SQBLOOMS, fill = Shade_bin),
               width = 0.3,
               outlier.shape = NA,
               alpha = 0.7,
               position = position_dodge(0.8)) +
  geom_half_point(aes(x = Water_bin, y = SQBLOOMS, colour = Shade_bin),
                  side = "l",
                  range_scale = 0.1,
                  position = position_dodge(0.8),
                  alpha = 0.7,
                  size = 2) +
  labs(
    x = "Number of times watered per week",
    y = "Number of Blooms",
    fill = "Shade Level",
    colour = "Shade Level",
    title = "Number of Blooms across Water and Shade Levels"
  ) +
  scale_fill_manual(values = c("#2A9571", "#DFB468", "#4B8CD8", "#E27D60")) +
  scale_color_manual(values = c("#2A9571", "#DFB468", "#4B8CD8", "#E27D60")) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    text = element_text(family = "Roboto", size = 14)
  )

4B. Fit a linear model of SQBLOOMS = BED + WATER + SHADE. Is the analysis orthogonal?

bm1<-lm(SQBLOOMS~BED+WATER+SHADE,data=blooms)
Anova(bm1,type="2")
## Anova Table (Type II tests)
## 
## Response: SQBLOOMS
##            Sum Sq Df F value  Pr(>F)  
## BED        1.5266  1  4.1248 0.05064 .
## WATER      1.2092  1  3.2671 0.08009 .
## SHADE      1.0322  1  2.7890 0.10467  
## Residuals 11.8434 32                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(bm1)
## Analysis of Variance Table
## 
## Response: SQBLOOMS
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## BED        1  1.5266 1.52662  4.1248 0.05064 .
## WATER      1  1.2092 1.20916  3.2671 0.08009 .
## SHADE      1  1.0322 1.03225  2.7890 0.10467  
## Residuals 32 11.8434 0.37011                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis should in theory be orthogonal because the experimental design in balanced with the same number of replicates across the different treatments and the variables are categorical rather than continuous. This is shown by the sequential and adjusted sums of squares being the same and the p values for all treatments remaining the same after using both type 1 and type 11 SS.(Bed: F1,32=4.12, p=0.051)(Water: F 1,32= 3.27,p=0.0801)(Shade: F1,32= 2.79, P=0.105)

4C. Fit a linear model of SQBLOOMS = WATER + SHADE. How were the residual error and sums of squares affected in this analysis? Was it worthwhile blocking for BED? If so, why?

bm2<-lm(SQBLOOMS~WATER+SHADE,data=blooms)
anova(bm2)
## Analysis of Variance Table
## 
## Response: SQBLOOMS
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## WATER      1  1.2092 1.20916  2.9845 0.09342 .
## SHADE      1  1.0322 1.03225  2.5478 0.11998  
## Residuals 33 13.3700 0.40515                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By getting rid of bed from the model, the p values have remained very similar and have not changed enough to affect the results (Water: F1,33= 2.98, P=0.093) (Shade: F1,33= 2.55, p=0.12), suggesting that it wasn’t worthwhile blocking for bed. However, this increases the residual error as less of the error can be explained by blocking into bed, reducing the models power. As a result, it may be more beneficial to keep blocking for bed in the experiment as it doesn’t affect the outcome of the results (p-value) but does increase the power of the model by accounting for some of the residual error.

4D. Fit the model SQ2 = B2 + W2 + S2. Which parts of the output are different in this third model, but were the same in the first two, and why?

bm3<-lm(SQ2~B2+W2+S2,data=blooms)
anova(bm3)
## Analysis of Variance Table
## 
## Response: SQ2
##           Df  Sum Sq Mean Sq F value Pr(>F)  
## B2         1  0.8176 0.81757  2.3194 0.1386  
## W2         1  1.3946 1.39456  3.9563 0.0562 .
## S2         1  0.3360 0.33599  0.9532 0.3370  
## Residuals 29 10.2223 0.35249                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model above (model 3) has a smaller sample size due to shorter data set which has reduced the residual df from both models 1 and 2, causing a decrease f-value, suggesting that less of the variability in the model is explained by the three factors. All three models have similar p values providing the same results of no statistical significance between the three factors on bloom numbers.