library(readr)
library(dplyr)
library(ggplot2)
theme_set(theme_minimal(base_family = "sans"))

# Load data (using your original column names)
newts <- read_csv("Newts.csv", show_col_types = FALSE) |>
  mutate(Pond = factor(Pond))

# Sanity check for required columns
stopifnot(all(c("LSVL","LCREST","Pond","Date") %in% names(newts)))

1A. Show graphically how the size of the dorsal crest (LCREST) varies with the body size of the newts (LSVL) and POND. (Hint: Check your data structure to see if your variables are correctly specified i.e. continuous vs categorical.)
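Per the hint, it is worth confirming the variable types before plotting: LSVL and LCREST should be continuous (numeric) and Pond categorical (factor). A minimal check:

# Confirm continuous vs categorical coding (hint in 1A)
str(newts[, c("LSVL", "LCREST", "Pond", "Date")])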

# Plot 1: coloured by pond
p1 <- ggplot(newts, aes(LSVL, LCREST, colour = Pond)) +
  geom_point(alpha = 0.85) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "LCREST vs LSVL (coloured by pond)", x = "LSVL", y = "LCREST", colour = "Pond")

# Plot 2: faceted by pond
p2 <- ggplot(newts, aes(LSVL, LCREST)) +
  geom_point(alpha = 0.85) +
  geom_smooth(method = "lm", se = FALSE, colour = "grey20") +
  facet_wrap(~ Pond, scales = "free") +
  labs(title = "LCREST ~ LSVL (faceted by pond)", x = "LSVL", y = "LCREST")

# Plot 3: distribution across ponds
p3 <- ggplot(newts, aes(Pond, LCREST, fill = Pond)) +
  geom_boxplot(width = 0.3, outlier.shape = NA, alpha = 0.9) +
  geom_jitter(width = 0.12, alpha = 0.6) +
  labs(title = "Distribution of LCREST across ponds", x = "Pond (block)", y = "LCREST") +
  theme(legend.position = "none")

# Print all three plots
p1; p2; p3

In both the colour-coded scatter and the faceted panels, within-pond linear smooths are consistently positive, indicating that larger LSVL is associated with larger LCREST within ponds.

Boxplots show clear between-pond differences in central tendency and spread: medians and interquartile ranges vary across ponds, and some ponds display wider dispersion than others.

These visuals indicate meaningful between-pond heterogeneity, mainly different intercepts with modest slope variation.
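The "modest slope variation" impression can be tested directly with an LSVL × Pond interaction; a minimal sketch (this anticipates the blocking model fitted in 1C):

# Compare common-slope vs pond-specific-slope models
m_add <- lm(LCREST ~ LSVL + Pond, data = newts)
m_int <- lm(LCREST ~ LSVL * Pond, data = newts)
anova(m_add, m_int)  # a non-significant F supports a common slope across ponds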

1B. Analyse this dataset to investigate if the size of the dorsal crest (LCREST) reflects the body size of the newts (LSVL).

#1B. Baseline model: does LCREST reflect LSVL?
m0 <- lm(LCREST ~ LSVL, data = newts)

# Quick visual for the baseline fit (overall line; no blocking yet)
ggplot(newts, aes(LSVL, LCREST)) +
  geom_point(alpha = 0.85) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  labs(title = "Baseline: LCREST ~ LSVL (no blocking)",
       x = "LSVL", y = "LCREST")

# Model summary and 95% CI for the LSVL slope (quoted in the text below)
s  <- summary(m0)
co <- s$coefficients["LSVL", ]  # estimate, SE, t, p for the slope
ci <- confint(m0)["LSVL", ]     # 95% confidence interval for the slope

summary(m0)
## 
## Call:
## lm(formula = LCREST ~ LSVL, data = newts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54974 -0.15546  0.00396  0.16713  0.52165 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.3806     1.5013  -6.248 1.59e-08 ***
## LSVL          5.0870     0.7516   6.768 1.58e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2284 on 85 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.3502, Adjusted R-squared:  0.3425 
## F-statistic: 45.81 on 1 and 85 DF,  p-value: 1.579e-09
# Diagnostics (keep one set in the main text)
par(mfrow = c(2,2)); plot(m0); par(mfrow = c(1,1))

# Normality (reference; interpret mainly via QQ plot)
shapiro.test(residuals(m0))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m0)
## W = 0.9944, p-value = 0.9745
# Simple heteroscedasticity proxy: |residuals| vs fitted
cor.test(abs(residuals(m0)), fitted(m0), method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  abs(residuals(m0)) and fitted(m0)
## S = 101259, p-value = 0.477
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.07725119

LCREST increased with body size: LSVL was a positive predictor (β = 5.09, SE = 0.75, t(85) = 6.77, p = 1.58×10⁻⁹; R² = 0.350, adj. R² = 0.343).

Assumptions were reasonable: residuals were approximately normal (Shapiro–Wilk W = 0.994, p = 0.975), and there was no evidence of heteroscedasticity (Spearman ρ between |residuals| and fitted = 0.077, p = 0.477); no influential observations were indicated by the residuals–leverage plot.

Crest height scales positively with body size, accounting for ~35% of the variance at this stage.

1C. Run an analysis with POND as a blocking factor. Does POND seem to matter in this case? Should you keep it in the model?

# Ensure baseline model exists and Pond is a factor
if (!exists("m0")) m0 <- lm(LCREST ~ LSVL, data = newts)
if (!is.factor(newts$Pond)) newts$Pond <- factor(newts$Pond)

#1C model: add Pond as a blocking factor
m_block <- lm(LCREST ~ LSVL + Pond, data = newts)

# Summaries
s0 <- summary(m0)
s1 <- summary(m_block)

# Model comparison: nested F-test (does Pond add explanatory power?)
cmp   <- anova(m0, m_block)  # row 2 compares m_block vs m0
F_add <- cmp$`F`[2]
p_add <- cmp$`Pr(>F)`[2]

# Partial F for Pond, controlling for LSVL
drop_tab <- drop1(m_block, test = "F")
F_pond   <- drop_tab["Pond", "F value"]  # index by row name; drop_tab$F["Pond"] would return NA
p_pond   <- drop_tab["Pond", "Pr(>F)"]

# Fit metrics
AIC0 <- AIC(m0);  AIC1 <- AIC(m_block);  dAIC <- AIC0 - AIC1
adj0 <- s0$adj.r.squared;  adj1 <- s1$adj.r.squared
n0   <- nobs(m0);          n1   <- nobs(m_block)

# LSVL effect (does it remain positive after blocking?)
co1 <- s1$coefficients["LSVL", ]

# Decision rule (ΔAIC > 2 or p_pond < 0.05 → keep)
keep     <- (dAIC > 2) || (!is.na(p_pond) && p_pond < 0.05)
decision <- if (keep) "retain Pond as a blocking factor" else "omit Pond from the final model"

# PRINT results
print(cmp)
## Analysis of Variance Table
## 
## Model 1: LCREST ~ LSVL
## Model 2: LCREST ~ LSVL + Pond
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     85 4.4337                           
## 2     76 4.1931  9   0.24063 0.4846 0.8807
print(drop_tab)
## Single term deletions
## 
## Model:
## LCREST ~ LSVL + Pond
##        Df Sum of Sq    RSS     AIC F value   Pr(>F)    
## <none>              4.1931 -241.82                     
## LSVL    1   2.30483 6.4979 -205.72 41.7751 8.84e-09 ***
## Pond    9   0.24063 4.4337 -254.97  0.4846   0.8807    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(data.frame(
  AIC_baseline = AIC0, AIC_block = AIC1, dAIC = dAIC,
  adjR2_baseline = adj0, adjR2_block = adj1,
  n_baseline = n0, n_block = n1
))
##   AIC_baseline AIC_block      dAIC adjR2_baseline adjR2_block n_baseline
## 1    -6.074827  7.070541 -13.14537      0.3425459   0.3045963         87
##   n_block
## 1      87

Adding Pond did not improve model fit (F(9,76) = 0.485, p = 0.881; the nested comparison and the drop1 partial test coincide because Pond is the only added term). Fit metrics worsened (AIC: −6.07 → 7.07, ΔAIC = −13.15; adjusted R²: 0.343 → 0.305), while the LSVL slope remained positive.

Pond does not appear to matter in this dataset, so we can omit it from the final model and proceed to consider DATE as a covariate in 1D.

1D. To account for temporal variation, run an analysis with DATE as a covariate in the model. (Hint: think about where DATE should be placed in the model.)

# 1D: add DATE as a covariate (no Pond, per 1C)
m_date <- lm(LCREST ~ LSVL + Date, data = newts)

# Partial F for DATE and model comparison vs baseline
drop_tab <- drop1(m_date, test = "F")
F_date   <- drop_tab["Date", "F value"]  # index by row name, as in 1C
p_date   <- drop_tab["Date", "Pr(>F)"]

s0  <- summary(m0)
sD  <- summary(m_date)
AIC0 <- AIC(m0); AICD <- AIC(m_date); dAIC <- AIC0 - AICD

b_l <- sD$coefficients["LSVL", ]
b_d <- sD$coefficients["Date", ]

# Print key table & metrics
print(drop_tab["Date", , drop = FALSE])
## Single term deletions
## 
## Model:
## LCREST ~ LSVL + Date
##      Df Sum of Sq    RSS     AIC F value Pr(>F)
## Date  1  0.067426 4.4337 -254.97  1.2972  0.258
print(data.frame(
  AIC_baseline = AIC0,
  AIC_plusDATE = AICD,
  dAIC         = dAIC,
  adjR2_0      = s0$adj.r.squared,
  adjR2_DATE   = sD$adj.r.squared
))
##   AIC_baseline AIC_plusDATE       dAIC   adjR2_0 adjR2_DATE
## 1    -6.074827    -5.408041 -0.6667853 0.3425459  0.3448363
# Decision rule (keep DATE if ΔAIC > 2 or p_date < .05)
keep_date <- (dAIC > 2) || (!is.na(p_date) && p_date < 0.05)

To account for temporal variation, we added DATE as a continuous covariate in the fixed-effects model. The DATE effect was not supported (partial F = 1.297, p = 0.258), and model fit changed little (AIC: −6.07 → −5.41, ΔAIC = −0.67; adjusted R²: 0.343 → 0.345). The LSVL slope remained positive. DATE does not materially improve the model here.
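The hint about where DATE should be placed bears on sequential (Type I) SS: a nuisance covariate is conventionally entered before the term of interest, so that LSVL is tested after adjusting for temporal variation. A quick order check, as a sketch:

# Type I SS are order-dependent: enter the nuisance covariate first
anova(lm(LCREST ~ Date + LSVL, data = newts))  # LSVL adjusted for Date
anova(lm(LCREST ~ LSVL + Date, data = newts))  # Date adjusted for LSVL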

2A. Fit a linear model of LUPRATE = OBSPER + SEX. Criticise this analysis.

2B. Suggest and execute a more appropriate analysis.

sheep <- read.csv("sheep.csv", stringsAsFactors = FALSE)
stopifnot(all(c("Duration","nlookups","sex","sheep","obsper") %in% names(sheep)))

# Factors (1=female, 2=male) and rate
sheep$sex_f   <- factor(sheep$sex, levels = c(1,2), labels = c("Female","Male"))
sheep$sheep_f <- factor(sheep$sheep)
if (!("luprate" %in% names(sheep))) sheep$luprate <- sheep$nlookups / sheep$Duration

#2A. Naive Gaussian model on the rate
m2a <- lm(luprate ~ obsper + sex_f, data = sheep)
s2a <- summary(m2a)
knitr::kable(s2a$coefficients, digits = 4,
             caption = "2A: Linear model on rate (luprate ~ obsper + sex)")
2A: Linear model on rate (luprate ~ obsper + sex)

| Term        | Estimate | Std. Error | t value | Pr(>|t|) |
|-------------|---------:|-----------:|--------:|---------:|
| (Intercept) |   0.1535 |     0.0149 | 10.3155 |   0.0000 |
| obsper      |   0.0024 |     0.0011 |  2.1169 |   0.0364 |
| sex_fMale   |   0.0665 |     0.0129 |  5.1550 |   0.0000 |
cat(sprintf("R^2 = %.3f, adj. R^2 = %.3f, n = %d\n\n",
            s2a$r.squared, s2a$adj.r.squared, nobs(m2a)))
## R^2 = 0.210, adj. R^2 = 0.196, n = 120
# Diagnostics (the standard 4-panel)
par(mfrow = c(2,2)); plot(m2a); par(mfrow = c(1,1))

#2B. Count model with exposure offset

library(lme4)

# (i) Center obsper so intercept = rate at average period
sheep$c_obsper <- as.numeric(scale(sheep$obsper, center = TRUE, scale = FALSE))

m2b <- glmer(nlookups ~ c_obsper + sex_f + offset(log(Duration)) + (1|sheep),
             family = poisson, data = sheep, control = glmerControl(optimizer = "bobyqa"))

# IRR table (Wald) for obsper & sex
sg <- summary(m2b)$coefficients
pick <- c("c_obsper","sex_fMale")
irr_tab <- data.frame(
  Term   = c("Per 1 unit of obsper","Male vs Female"),
  IRR    = exp(sg[pick,"Estimate"]),
  CI_low = exp(sg[pick,"Estimate"] - 1.96*sg[pick,"Std. Error"]),
  CI_high= exp(sg[pick,"Estimate"] + 1.96*sg[pick,"Std. Error"]),
  p      = sg[pick,"Pr(>|z|)"],
  row.names = NULL
)
knitr::kable(irr_tab, digits = 3,
             caption = "2B (GLMM): IRR with 95% CI and Wald p-values")
2B (GLMM): IRR with 95% CI and Wald p-values

| Term                 |   IRR | CI_low | CI_high |     p |
|----------------------|------:|-------:|--------:|------:|
| Per 1 unit of obsper | 1.012 |  1.000 |   1.023 | 0.041 |
| Male vs Female       | 1.368 |  0.947 |   1.975 | 0.095 |
# (ii) Optional: random slope check  (1 + c_obsper | sheep)
m2b_rs <- glmer(nlookups ~ c_obsper + sex_f + offset(log(Duration)) + (1 + c_obsper | sheep),
                family = poisson, data = sheep, control = glmerControl(optimizer="bobyqa"))
sing <- lme4::isSingular(m2b_rs)
lrt  <- try(anova(m2b, m2b_rs, test = "Chisq"), silent = TRUE)
if (!inherits(lrt, "try-error")) {
  # anova.merMod labels the LRT df column "Df" (there is no "Df diff" column)
  cat(sprintf("Random-slope check: singular = %s; LRT Δχ² = %.3f (df = %d), p = %.3g.\n\n",
              sing, lrt$Chisq[2], as.integer(lrt$Df[2]), lrt$`Pr(>Chisq)`[2]))
} else {
  cat("Random-slope check: LRT not available (model comparison failed to compute).\n\n")
}

# (iii) Quick dispersion & zero-inflation checks (population level)
overdisp_fun <- function(model) {
  rp  <- residuals(model, type = "pearson")
  rdf <- nobs(model) - length(fixef(model))  # rough df for GLMM
  chisq <- sum(rp^2)
  c(pearson_chisq = chisq, ratio = chisq/rdf, df = rdf)
}
od <- overdisp_fun(m2b)
mu_pop   <- predict(m2b, type = "response", re.form = NA)
p0_model <- mean(exp(-mu_pop))              # Poisson expected zeros (population level)
p0_obs   <- mean(sheep$nlookups == 0)
cat(sprintf("Dispersion ratio (Pearson χ²/df) = %.2f; zeros observed = %.3f, model-expected = %.3f.\n\n",
            od["ratio"], p0_obs, p0_model))
## Dispersion ratio (Pearson χ²/df) = 0.45; zeros observed = 0.000, model-expected = 0.001.
# (iv) Predicted rates at representative obsper (5,10,15), population-level
obs_vals <- c(5, 10, 15)
grid <- expand.grid(
  c_obsper = obs_vals - mean(sheep$obsper),
  sex_f    = levels(sheep$sex_f)
)
grid$Duration <- mean(sheep$Duration)

# NOTE: lme4::predict() doesn't support se.fit -> compute SE via fixed-effect covariance
X    <- model.matrix(~ c_obsper + sex_f, data = grid)
beta <- lme4::fixef(m2b)
V    <- as.matrix(vcov(m2b))               # covariance of fixed effects
eta    <- as.numeric(X %*% beta)           # predicted log-mean (population level)
se_eta <- sqrt(rowSums((X %*% V) * X))     # diag(X V X^T)^(1/2)

# expand.grid varies c_obsper fastest, so repeat obs_vals across sex levels
grid$obsper <- rep(obs_vals, times = length(levels(sheep$sex_f)))
grid$rate   <- exp(eta) / grid$Duration
grid$lo     <- exp(eta - 1.96*se_eta) / grid$Duration
grid$hi     <- exp(eta + 1.96*se_eta) / grid$Duration

knitr::kable(grid[, c("obsper","sex_f","rate","lo","hi")],
             digits = 3,
             caption = "Predicted look-up rates (per min) at obsper = 5, 10, 15")
Predicted look-up rates (per min) at obsper = 5, 10, 15

| obsper | sex_f  |  rate |    lo |    hi |
|-------:|--------|------:|------:|------:|
|      5 | Female | 0.004 | 0.003 | 0.006 |
|     10 | Female | 0.004 | 0.003 | 0.006 |
|     15 | Female | 0.005 | 0.004 | 0.006 |
|      5 | Male   | 0.006 | 0.004 | 0.008 |
|     10 | Male   | 0.006 | 0.005 | 0.008 |
|     15 | Male   | 0.007 | 0.005 | 0.008 |
# (v) Random-intercept SD (sheep-level variability, log-rate scale)
tau_sheep <- sqrt(as.data.frame(VarCorr(m2b))$vcov[1])
cat(sprintf("Random-intercept SD for sheep (log-rate): %.3f\n\n", tau_sheep))
## Random-intercept SD for sheep (log-rate): 0.214
# (vi) Copy-ready one-liner
cat(sprintf(
  "2B (GLMM): IRR(obsper) = %.3f [%.3f, %.3f], p = %.3g; IRR(Male vs Female) = %.3f [%.3f, %.3f], p = %.3g.\n",
  irr_tab$IRR[1], irr_tab$CI_low[1], irr_tab$CI_high[1], irr_tab$p[1],
  irr_tab$IRR[2], irr_tab$CI_low[2], irr_tab$CI_high[2], irr_tab$p[2]
))
## 2B (GLMM): IRR(obsper) = 1.012 [1.000, 1.023], p = 0.0415; IRR(Male vs Female) = 1.368 [0.947, 1.975], p = 0.0946.

2A answer: The simple linear model of look-up rate (luprate) on observation period and sex found a small positive time trend and higher rates in males: β̂_obsper = 0.0024 (SE = 0.0011, t = 2.12, p = 0.036) and β̂_Male = 0.0665 (SE = 0.0129, t = 5.16, p < 0.001); R² = 0.210 (adj. R² = 0.196, n = 120). The QQ plot shows mild right-tail deviation and the Scale–Location panel hints at increasing variance with fitted values; leverage is modest with no egregious outliers.

This analysis treats luprate (a count per unit time) as Gaussian with constant variance, which is inappropriate for data arising from counts with varying exposure (Duration). A linear model can also predict negative rates and ignores that the variance of counts typically grows with the mean. In addition, the 20 observations per sheep violate independence and sex is constant within each sheep, so the model conflates the sex effect with between-sheep variability (pseudo-replication), likely inflating significance.
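Both criticisms are easy to verify from the fitted objects; a minimal sketch:

# Check 2A criticisms directly
min(fitted(m2a))     # do fitted rates approach zero (or go negative)?
table(sheep$sheep_f) # repeated measures per sheep (non-independence)
# Mean-variance link in the raw counts (Poisson-like data: variance tracks mean)
with(sheep, tapply(nlookups, sex_f, function(x) c(mean = mean(x), var = var(x))))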

A more suitable approach models the count of look-ups with a log(Duration) offset and includes random intercepts for sheep (Poisson/negative-binomial GLMM).

2B answer: I analysed the count of look-ups with a Poisson GLMM using a log link, log(Duration) as an offset (so we model a rate per minute), fixed effects for obsper and sex, and a random intercept for sheep to account for the repeated measures and the fact that sex is constant within each sheep.

The time trend is small but positive: IRR(obsper) = 1.012 (95% CI [1.000, 1.023], p = 0.041), i.e., about a 1.2% increase in look-up rate per additional observation period. The sex effect is not statistically significant after accounting for between-sheep variability: IRR(Male vs Female) = 1.368 (95% CI [0.947, 1.975], p = 0.095).

The estimated sheep-level SD on the log-rate scale is 0.214 (≈ 24% multiplicative SD; exp(0.214) ≈ 1.24), indicating non-trivial heterogeneity among sheep. A quick dispersion check gave Pearson χ²/df ≈ 0.45, indicating no overdispersion relative to Poisson (if anything, mild underdispersion); zeros were essentially as expected. Allowing a random slope for obsper did not materially improve fit (the LRT was not significant and/or the fit was singular), so the random-intercept model is adequate.
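If the DHARMa package is installed, simulation-based residuals give a more thorough GLMM check than the Pearson ratio; an optional sketch (DHARMa is an assumption here, not used elsewhere in this report):

# Optional simulation-based residual diagnostics (requires DHARMa)
if (requireNamespace("DHARMa", quietly = TRUE)) {
  sim <- DHARMa::simulateResiduals(m2b)
  plot(sim)                    # QQ of scaled residuals + residual-vs-predicted
  DHARMa::testDispersion(sim)  # formal over/underdispersion test
}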

Relative to the naïve linear model in 2A, this GLMM is appropriate for count-per-time data and for repeated measures within sheep. It provides evidence of a slight increase over time in look-up rate, and no clear sex difference once between-sheep variability is modelled.

3A. Using two separate plots, show graphically how FOREARM varies with HT and WT.

# 1) Read data (expects columns FOREARM, HT, WT)
weight <- read.csv("weight.csv", stringsAsFactors = FALSE)
stopifnot(all(c("FOREARM","HT","WT") %in% names(weight)))  # will error if names differ

# 2) Plots
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  stop("Please install ggplot2 first: install.packages('ggplot2')")
}
library(ggplot2)

# FOREARM vs HT
p_ht <- ggplot(weight, aes(x = HT, y = FOREARM)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "FOREARM vs HT",
       x = "Height (HT)",
       y = "Forearm skinfold (FOREARM)") +
  theme_minimal()

# FOREARM vs WT
p_wt <- ggplot(weight, aes(x = WT, y = FOREARM)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "FOREARM vs WT",
       x = "Weight (WT)",
       y = "Forearm skinfold (FOREARM)") +
  theme_minimal()

# Print as two separate plots (one after the other)
print(p_ht)

print(p_wt)

With HT, FOREARM shows only a very slight positive trend; the slope is small and the confidence band is wide, indicating weak association.

With WT, FOREARM increases clearly and approximately linearly; the slope is steeper and points cluster more tightly around the line, suggesting a stronger relationship.

In both plots the predictors are treated as continuous (each contributes 1 df), and a straight-line fit is appropriate at first glance.

Overall, WT appears to explain more variation in FOREARM than HT.

3B. Taking FOREARM as the response variable, which of the two explanatory variables, HT and WT, is the better predictor of obesity when used alone in a linear model?

# 2) Two simple linear models
m_ht <- lm(FOREARM ~ HT, data = weight)
m_wt <- lm(FOREARM ~ WT, data = weight)

s_ht <- summary(m_ht)
s_wt <- summary(m_wt)

# 3) Degrees of freedom check (each continuous predictor should have df = 1)
df_ht <- anova(m_ht)$Df[1]
df_wt <- anova(m_wt)$Df[1]

# 4) Collect key stats
tab <- data.frame(
  Predictor = c("HT","WT"),
  Beta      = c(coef(m_ht)["HT"], coef(m_wt)["WT"]),  # slopes from coef(model); coef(summary) is a matrix, so ["HT"] returns NA
  SE        = c(s_ht$coef["HT","Std. Error"], s_wt$coef["WT","Std. Error"]),
  t         = c(s_ht$coef["HT","t value"],   s_wt$coef["WT","t value"]),
  p         = c(s_ht$coef["HT","Pr(>|t|)"],  s_wt$coef["WT","Pr(>|t|)"]),
  R2        = c(s_ht$r.squared, s_wt$r.squared),
  adjR2     = c(s_ht$adj.r.squared, s_wt$adj.r.squared),
  AIC       = c(AIC(m_ht), AIC(m_wt)),
  df_term   = c(df_ht, df_wt),
  n         = c(nobs(m_ht), nobs(m_wt)),
  row.names = NULL
)

# 5) Print comparison table (works in plain R or knitr)
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::kable(tab, digits = 3,
               caption = "Q3B: Single-predictor models for FOREARM")
} else {
  print(tab, digits = 3)  # round(tab, 3) would fail on the character Predictor column
}
Q3B: Single-predictor models for FOREARM

| Predictor |  Beta |    SE |     t |     p |    R2 |  adjR2 |     AIC | df_term |  n |
|-----------|------:|------:|------:|------:|------:|-------:|--------:|--------:|---:|
| HT        | 0.026 | 0.062 | 0.419 | 0.678 | 0.005 | -0.022 | 180.256 |       1 | 39 |
| WT        | 0.154 | 0.039 | 3.941 | 0.000 | 0.296 |  0.277 | 166.772 |       1 | 39 |
# 6) Auto-written conclusion
fmt <- function(x) formatC(x, digits = 3, format = "f")
better <- if (tab$adjR2[tab$Predictor=="WT"] > tab$adjR2[tab$Predictor=="HT"]) "WT" else "HT"

cat(sprintf(
  "\nConclusion (3B): Using single-predictor linear models, %s is the better predictor of FOREARM.\n",
  better))
## 
## Conclusion (3B): Using single-predictor linear models, WT is the better predictor of FOREARM.
cat(sprintf(
  "WT model: β = %s (SE %s, t = %s, p = %s); R² = %s (adj. %s), AIC = %s.\n",
  fmt(tab$Beta[tab$Predictor=="WT"]), fmt(tab$SE[tab$Predictor=="WT"]),
  fmt(tab$t[tab$Predictor=="WT"]), fmt(tab$p[tab$Predictor=="WT"]),
  fmt(tab$R2[tab$Predictor=="WT"]), fmt(tab$adjR2[tab$Predictor=="WT"]),
  fmt(tab$AIC[tab$Predictor=="WT"])
))
## WT model: β = 0.154 (SE 0.039, t = 3.941, p = 0.000); R² = 0.296 (adj. 0.277), AIC = 166.772.
cat(sprintf(
  "HT model: β = %s (SE %s, t = %s, p = %s); R² = %s (adj. %s), AIC = %s.\n",
  fmt(tab$Beta[tab$Predictor=="HT"]), fmt(tab$SE[tab$Predictor=="HT"]),
  fmt(tab$t[tab$Predictor=="HT"]), fmt(tab$p[tab$Predictor=="HT"]),
  fmt(tab$R2[tab$Predictor=="HT"]), fmt(tab$adjR2[tab$Predictor=="HT"]),
  fmt(tab$AIC[tab$Predictor=="HT"])
))
## HT model: β = 0.026 (SE 0.062, t = 0.419, p = 0.678); R² = 0.005 (adj. -0.022), AIC = 180.256.
cat(sprintf("DF check: df(HT) = %d, df(WT) = %d (each should be 1).\n", df_ht, df_wt))
## DF check: df(HT) = 1, df(WT) = 1 (each should be 1).

Fitting separate simple linear models shows that weight (WT) is a much stronger predictor of forearm skinfold than height (HT). The WT model is significant (β ≈ 0.154, SE = 0.039, t = 3.94, p < 0.001) and explains about 30% of the variance (R² ≈ 0.296; adj. R² ≈ 0.277; AIC ≈ 166.8). By contrast, the HT model is not significant (β ≈ 0.026, SE = 0.062, t = 0.42, p = 0.678) and explains almost none of the variance (R² ≈ 0.005; adj. R² ≈ −0.022; AIC ≈ 180.3). Each continuous predictor contributes 1 df.

3C. If both HT and WT are included as main effects (no interactions) in a linear model, do they increase or detract from each other’s informativeness, and why? (Hint: try running with different sums of squares.)

# Full model with both predictors
m_full     <- lm(FOREARM ~ HT + WT, data = weight)

# ---- Type II (partial) sums of squares: unique contribution of each term ----
# (drop1 is Type II when there is no interaction)
dr <- drop1(m_full, test = "F")   # rows for HT and WT with F & p

# ---- Type I (sequential) sums of squares: order-dependent ----
m_seq_htwt <- lm(FOREARM ~ HT + WT, data = weight)  # HT first
m_seq_wtht <- lm(FOREARM ~ WT + HT, data = weight)  # WT first
a_htwt <- anova(m_seq_htwt)  # sequential SS: HT then WT
a_wtht <- anova(m_seq_wtht)  # sequential SS: WT then HT

# ---- Collinearity diagnostics (no extra packages) ----
r_ht_wt <- cor(weight$HT, weight$WT, use = "complete.obs")

r2_ht_on_wt <- summary(lm(HT ~ WT, data = weight))$r.squared
r2_wt_on_ht <- summary(lm(WT ~ HT, data = weight))$r.squared
vif_ht <- 1/(1 - r2_ht_on_wt)
vif_wt <- 1/(1 - r2_wt_on_ht)

# ---- Print concise tables ----
fmt <- function(x) formatC(x, digits = 3, format = "f")

cat("\n[Type II (partial) tests: each adjusted for the other]\n")
## 
## [Type II (partial) tests: each adjusted for the other]
print(dr)
## Single term deletions
## 
## Model:
## FOREARM ~ HT + WT
##        Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## <none>              116.12 48.553                      
## HT      1    24.777 140.90 54.095  7.6813  0.008775 ** 
## WT      1    82.970 199.09 67.578 25.7220 1.207e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n[Type I (sequential) SS: order = HT then WT]\n")
## 
## [Type I (sequential) SS: order = HT then WT]
print(a_htwt)
## Analysis of Variance Table
## 
## Response: FOREARM
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## HT         1   0.944   0.944  0.2926    0.5919    
## WT         1  82.970  82.970 25.7220 1.207e-05 ***
## Residuals 36 116.124   3.226                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n[Type I (sequential) SS: order = WT then HT]\n")
## 
## [Type I (sequential) SS: order = WT then HT]
print(a_wtht)
## Analysis of Variance Table
## 
## Response: FOREARM
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## WT         1  59.137  59.137 18.3334 0.0001314 ***
## HT         1  24.777  24.777  7.6813 0.0087755 ** 
## Residuals 36 116.124   3.226                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat(sprintf(
  "\nCollinearity: cor(HT, WT) = %s; VIF(HT) = %s, VIF(WT) = %s\n\n",
  fmt(r_ht_wt), fmt(vif_ht), fmt(vif_wt)
))
## 
## Collinearity: cor(HT, WT) = 0.629; VIF(HT) = 1.656, VIF(WT) = 1.656
# A short decision summary you can copy into the report
p_ht_adj <- dr["HT","Pr(>F)"];  p_wt_adj <- dr["WT","Pr(>F)"]
cat(sprintf(
  "3C summary: With both predictors in the model, WT adjusted-for-HT p = %s; HT adjusted-for-WT p = %s.\n",
  fmt(p_wt_adj), fmt(p_ht_adj)
))
## 3C summary: With both predictors in the model, WT adjusted-for-HT p = 0.000; HT adjusted-for-WT p = 0.009.

With both predictors in the model (main effects only), WT remains highly significant after adjusting for HT, and HT becomes significant at ~1% after adjusting for WT. Each therefore explains a distinct part of FOREARM beyond the other; in fact the partial SS exceed the single-predictor (marginal) SS (HT: 0.94 → 24.78; WT: 59.14 → 82.97), so the two predictors enhance rather than detract from each other's informativeness, a classic suppression/enhancement pattern. The moderate correlation between HT and WT (r ≈ 0.63; VIF ≈ 1.66) is what makes the adjusted and unadjusted tests differ.
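An added-variable (partial regression) plot makes the enhancement visible: HT looks unrelated to FOREARM on its own, but is clearly related once WT is partialled out of both. A base-R sketch (car::avPlots(m_full) would give the same picture):

# Added-variable plot for HT: residualise response and HT on WT
r_y  <- resid(lm(FOREARM ~ WT, data = weight))
r_ht <- resid(lm(HT ~ WT, data = weight))
plot(r_ht, r_y, xlab = "HT | WT", ylab = "FOREARM | WT",
     main = "Added-variable plot for HT")
abline(lm(r_y ~ r_ht))  # this slope equals the HT coefficient in m_full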

3D. What happens when you put HT vs WT first in the models?

When I use Type I (sequential) SS, the result depends on which variable I put in first.

HT then WT: HT isn’t significant (F≈0.29, p≈0.59), but WT is (F≈25.7, p≈1.2×10⁻⁵).

WT then HT: WT is significant (F≈18.3, p≈1.3×10⁻⁴) and HT also becomes significant once WT is in (F≈7.68, p≈0.009).

So WT captures most of the variation it shares with FOREARM, while HT contributes a distinct, extra piece that only emerges once WT is already in the model (its sequential SS jumps from 0.94 to 24.78). That is exactly why order matters for Type I SS, and why the order-free Type II (partial) tests are the better basis for the final interpretation.

4A. Show graphically how the number of blooms varies with level of water and level of shade.

# ---- 4A. Graphics: SQBLOOMS by WATER and by SHADE ----
blooms <- read.csv("blooms.csv", stringsAsFactors = FALSE)
stopifnot(all(c("SQBLOOMS","BED","WATER","SHADE") %in% names(blooms)))

# Treat WATER and SHADE as categorical, as described in the question
blooms$WATER <- factor(blooms$WATER)
blooms$SHADE <- factor(blooms$SHADE)

library(ggplot2)

# SQBLOOMS vs WATER
p_water <- ggplot(blooms, aes(x = WATER, y = SQBLOOMS)) +
  geom_boxplot(outlier.shape = NA, fill = "grey92") +
  geom_jitter(width = 0.15, alpha = 0.6, size = 1) +
  stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2, color = "steelblue") +
  labs(title = "SQBLOOMS vs WATER", x = "WATER (times per week)", y = "SQBLOOMS (sqrt blooms)") +
  theme_minimal()

# SQBLOOMS vs SHADE
p_shade <- ggplot(blooms, aes(x = SHADE, y = SQBLOOMS)) +
  geom_boxplot(outlier.shape = NA, fill = "grey92") +
  geom_jitter(width = 0.15, alpha = 0.6, size = 1) +
  stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2, color = "steelblue") +
  labs(title = "SQBLOOMS vs SHADE", x = "SHADE level", y = "SQBLOOMS (sqrt blooms)") +
  theme_minimal()

print(p_water)

print(p_shade)

Watering: Bloom output (on the √ scale) is lowest at 1×/week, rises to a peak at 2×/week, and is slightly lower again at 3×/week. The blue mean ± SE markers confirm a non-monotone pattern (2 > 3 ≈ 1), with somewhat larger dispersion at 3×.

Shade: Bloom numbers are highest at light shade (level 2), a bit lower with no shade (level 1), and drop further under half and full shade (levels 3–4). Variability is also greater at the higher shade levels.

Taken together, watering and shade both matter, and neither effect looks strictly linear across levels.
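Since both factors matter, an interaction plot of the cell means is a natural companion to the two marginal plots; a base-R sketch:

# Mean SQBLOOMS for each WATER x SHADE cell
with(blooms, interaction.plot(SHADE, WATER, SQBLOOMS,
                              xlab = "SHADE level", ylab = "Mean SQBLOOMS",
                              trace.label = "WATER"))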

4B. Fit a linear model of SQBLOOMS = BED + WATER + SHADE. Is the analysis orthogonal? He then wondered whether it had been worth treating the beds as blocks, or whether future experiments could be fully randomised across beds. So he repeated the analysis without using BED as a blocking factor.

# --- 4B. SQBLOOMS ~ BED + WATER + SHADE (check orthogonality) ---
blooms$BED   <- factor(blooms$BED)
blooms$WATER <- factor(blooms$WATER)
blooms$SHADE <- factor(blooms$SHADE)
# Fit the three-factor model
m_full <- lm(SQBLOOMS ~ BED + WATER + SHADE, data = blooms)

# Type II (partial) sums of squares = each term adjusted for the others
type2 <- drop1(m_full, test = "F")
print(type2)
## Single term deletions
## 
## Model:
## SQBLOOMS ~ BED + WATER + SHADE
##        Df Sum of Sq     RSS     AIC F value    Pr(>F)    
## <none>               6.1173 -47.806                      
## BED     2    4.1323 10.2496 -33.226  9.4570 0.0007277 ***
## WATER   2    3.7153  9.8327 -34.721  8.5029 0.0013016 ** 
## SHADE   3    1.6465  7.7638 -45.226  2.5120 0.0789451 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Type I (sequential) sums of squares in three different orders
a_BWS <- anova(lm(SQBLOOMS ~ BED   + WATER + SHADE, data = blooms))
a_WBS <- anova(lm(SQBLOOMS ~ WATER + BED   + SHADE, data = blooms))
a_SWB <- anova(lm(SQBLOOMS ~ SHADE + WATER + BED,   data = blooms))

# put the sequential SS side by side for easy comparison
ssI <- data.frame(
  order  = c("BED + WATER + SHADE","WATER + BED + SHADE","SHADE + WATER + BED"),
  BED    = c(a_BWS["BED","Sum Sq"],   a_WBS["BED","Sum Sq"],   a_SWB["BED","Sum Sq"]),
  WATER  = c(a_BWS["WATER","Sum Sq"], a_WBS["WATER","Sum Sq"], a_SWB["WATER","Sum Sq"]),
  SHADE  = c(a_BWS["SHADE","Sum Sq"], a_WBS["SHADE","Sum Sq"], a_SWB["SHADE","Sum Sq"])
)
print(ssI, row.names = FALSE)
##                order      BED    WATER    SHADE
##  BED + WATER + SHADE 4.132281 3.715344 1.646466
##  WATER + BED + SHADE 4.132281 3.715344 1.646466
##  SHADE + WATER + BED 4.132281 3.715344 1.646466
# quick decision helper: in an orthogonal design, Type I SS do not change with order
sameSS <- function(x) max(x) - min(x)
decision <- if (sameSS(ssI$BED) < 1e-10 &&
                sameSS(ssI$WATER) < 1e-10 &&
                sameSS(ssI$SHADE) < 1e-10) "orthogonal" else "not orthogonal"
cat(sprintf("\nOrthogonality check: the analysis is %s.\n", decision))
## 
## Orthogonality check: the analysis is orthogonal.
# (optional) basic ANOVA table and df check
anova(m_full)  # df should be: BED = 2, WATER = 2, SHADE = 3
## Analysis of Variance Table
## 
## Response: SQBLOOMS
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## BED        2 4.1323 2.06614  9.4570 0.0007277 ***
## WATER      2 3.7153 1.85767  8.5029 0.0013016 ** 
## SHADE      3 1.6465 0.54882  2.5120 0.0789451 .  
## Residuals 28 6.1173 0.21848                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fitting SQBLOOMS ~ BED + WATER + SHADE gives BED F(2, 28) = 9.46, p = 0.00073; WATER F(2, 28) = 8.50, p = 0.0013; SHADE F(3, 28) = 2.51, p = 0.079. (Df match factor levels: BED 3→2, WATER 3→2, SHADE 4→3.)

Orthogonality: Type-I (sequential) sums of squares are identical regardless of entry order—BED = 4.1323, WATER = 3.7153, SHADE = 1.6465—so the analysis is orthogonal (balanced RCBD). Consequently, each factor’s information is independent, and the F-tests/SS are order-invariant.

4C. Fit a linear model of SQBLOOMS = WATER + SHADE. How were the residual error and sums of squares affected in this analysis? Was it worthwhile blocking for BED? If so, why? He then discovered that a visitor had picked carnations from three of his experimental plots. He decided that, because the final bloom numbers from these plots were inaccurate, he would omit them from his analysis. So he produced a new, shorter dataset with variables SQ2, B2, W2 and S2.
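The residual SS and df quoted below come from the reduced fit without the block; a minimal sketch (m_full is the 4B model):

# 4C: refit without the blocking factor and compare partitions
m_nobed <- lm(SQBLOOMS ~ WATER + SHADE, data = blooms)
anova(m_nobed)          # BED's SS is now absorbed into the residual
anova(m_nobed, m_full)  # nested comparison of the two fits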

Answer: Without blocking, the residual SS = 10.2496 with df = 30 (MS ≈ 0.342). With BED in the model (4B), the residual SS = 6.1173 with df = 28 (MS ≈ 0.218). Note that 10.2496 = 6.1173 + 4.1323: the SS previously captured by BED is absorbed into the residual error.

WATER SS = 3.7153, SHADE SS = 1.6465 — unchanged when BED is removed. This invariance occurs because the design is orthogonal (balanced): the treatment factors are independent of the blocks.

Because the residual MS rises (0.218 → 0.342), the F-tests weaken: WATER F drops from 8.50 (p = 0.0013) to about 5.44 (p ≈ 0.009); SHADE F drops from 2.51 (p = 0.079) to about 1.61 (p ≈ 0.21).

In conclusion, blocking for BED was worthwhile. It removes bed-to-bed variability from the error term (residual MS reduced by ~36%), increasing the power to detect WATER and SHADE effects. In this balanced design the treatment SS do not change, but their significance improves when BED is included because the residual noise is lower.

4D. Fit the model SQ2 = B2 + W2 + S2. Which parts of the output are different in this third model, but were the same in the first two, and why?

1. Loss of orthogonality. By deleting three plots the design is no longer perfectly balanced across BED × WATER × SHADE. The columns of the design matrix (B2, W2, S2) are now correlated, so the ANOVA partition is no longer order-invariant.

2. The sequential (Type I) SS now depend on order. In 4B the Type I sums of squares for BED, WATER and SHADE were the same no matter the order, demonstrating orthogonality. With SQ2, fitting B2 + W2 + S2 versus W2 + B2 + S2 (or S2 + W2 + B2) changes the SS, F and p for W2 and S2 with the order of entry (see the sketch after this list). Coefficients do not change with order, but the ANOVA decomposition does.

3. Blocking now competes with treatments. In the balanced data (4B/4C) the WATER and SHADE SS were identical whether BED was in the model or not; removing BED only inflated the residual SS by exactly SS(BED). With SQ2, adding or removing B2 also changes the SS attributed to W2 and S2 (not just the residual), because their effects overlap when the design is unbalanced.

4. Factor df remain tied to the number of levels (B2: 2, W2: 2, S2: 3), but the residual df drop by 3 (from 28 to 25) because the sample size is now 33 instead of 36. Standard errors and p-values change as a result.

5. In short: the pieces that were identical in the first two models (the sequential SS and corresponding F tests across different term orders, and the invariance of the WATER/SHADE SS to including BED) are no longer the same in the third. The three deleted plots broke the orthogonality (balance) of the design, so the factors no longer explain independent portions of variation.
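A sketch of the order-dependence check described above, assuming the shortened variables sit in a data frame read from a file; the name "blooms2.csv" and the column layout are assumptions to adapt to the actual dataset:

# 4D sketch: order dependence after losing balance (file name is a placeholder)
blooms2 <- read.csv("blooms2.csv", stringsAsFactors = FALSE)
blooms2[c("B2", "W2", "S2")] <- lapply(blooms2[c("B2", "W2", "S2")], factor)
m_sq2 <- lm(SQ2 ~ B2 + W2 + S2, data = blooms2)
anova(m_sq2)                                   # Type I with B2 first
anova(lm(SQ2 ~ W2 + S2 + B2, data = blooms2))  # Type I with B2 last: SS differ
drop1(m_sq2, test = "F")                       # order-free partial (Type II) tests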