Question 1

(a). Rewriting Random Intercept + Random Slope to Random Slope Only

Original model: \[ Y_{ij}^{(1)}=\beta_1+\beta_2 x_{ij}+b_{1i}+b_{2i}x_{ij}+\varepsilon_{ij}, \] \[ \begin{pmatrix} b_{1i} \\ b_{2i} \end{pmatrix} \sim N\!\left( \begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix} \sigma_{b1}^2 & 0\\ 0 & \sigma_{b2}^2 \end{pmatrix} \right), \qquad \varepsilon_i \sim MVN(0,\sigma^2 I_{n_i}). \]

Marginal mean: \[ E(Y_{ij}^{(1)})=\beta_1+\beta_2 x_{ij}. \]

Marginal covariance (within subject): \[ \mathrm{Cov}(Y_{ij}^{(1)},Y_{ik}^{(1)}) = \sigma_{b1}^2 + \sigma_{b2}^2 x_{ij}x_{ik} + \sigma^2 \mathbf 1(j=k). \]

Random slope only + independent errors

Consider \[ Y_{ij}^{(2)}=\beta_1+\beta_2 x_{ij}+u_i x_{ij}+e_{ij}, \] with \[ u_i\sim N(0,\sigma_{b2}^2), \qquad e_{ij}\sim N(0,\sigma_e^2), \] independent.

Then \[ \mathrm{Cov}(Y_{ij}^{(2)},Y_{ik}^{(2)}) = \sigma_{b2}^2 x_{ij}x_{ik} + \sigma_e^2 \mathbf 1(j=k). \]

For \(j\neq k\), the term \(\sigma_{b1}^2\) is missing. Hence, no equivalence under independent errors.

Random slope only + CS residuals

Let \[ Y_{ij}^{(3)}=\beta_1+\beta_2 x_{ij}+u_i x_{ij}+e_{ij}, \] with \[ u_i\sim N(0,\sigma_{b2}^2), \qquad \mathbf e_i \sim MVN(0,\sigma^2 I_{n_i}+\sigma_{b1}^2 \mathbf 1\mathbf 1^\top). \]

Then \[ \mathrm{Cov}(Y_{ij}^{(3)},Y_{ik}^{(3)}) = \sigma_{b1}^2 + \sigma_{b2}^2 x_{ij}x_{ik} + \sigma^2 \mathbf 1(j=k), \] which matches Model (1).

If \(\mathrm{Cov}(b_{1i},b_{2i})\neq 0\)

An additional term \(\sigma_{12}(x_{ij}+x_{ik})\) appears in the marginal covariance, which cannot be reproduced by a slope-only model with independent errors.

Conclusion:

Equivalence is not possible with independent residuals, whereas it is possible if the residual covariance is compound symmetric.

(b). Random intercept model equivalence, AR(1) vs Toeplitz

Let \(i=1,\dots,n\) index subjects and \(j=1,\dots,m_i\) index repeated measurements within subject \(i\).

Consider the random intercept model

\[ Y_{ij} = \beta_1 + \beta_2 x_{ij} + b_{1i} + e_{ij}, \] where \[ b_{1i} \sim N(0,\sigma_{b1}^2), \]

and for the within-subject error vector \(e_i=(e_{i1},\dots,e_{im_i})^\top\),

\[ e_i \sim N(0, R_i), \qquad b_{1i} \perp e_i. \]

The marginal mean is

\[ E(Y_{ij}) = \beta_1 + \beta_2 x_{ij}, \]

and for \(j,k \in \{1,\dots,m_i\}\) the marginal covariance is

\[ \mathrm{Cov}(Y_{ij},Y_{ik}) = \mathrm{Cov}(b_{1i}+e_{ij},\, b_{1i}+e_{ik}) = \sigma_{b1}^2 + \mathrm{Cov}(e_{ij},e_{ik}). \]

Equivalently in matrix form (stacking \(Y_i=(Y_{i1},\dots,Y_{im_i})^\top\) and letting \(\mathbf{1}\) be the \(m_i\times 1\) vector of ones),

\[ \mathrm{Cov}(Y_i) = \sigma_{b1}^2\,\mathbf{1}\mathbf{1}^\top + R_i. \]

We ask whether there exists an equivalent model the random intercept,

\[ Y_{ij} = \beta_1 + \beta_2 x_{ij} + \tilde e_{ij}, \qquad \tilde e_i \sim N(0,\tilde R_i), \]

where \(\tilde R_i\) is constrained to be AR(1) (first part) or Toeplitz (second part).

AR(1) Structure

Assume the original errors have AR(1):

\[ \mathrm{Var}(e_{ij})=\sigma^2, \qquad \mathrm{Corr}(e_{ij},e_{ik})=\rho^{|j-k|}, \]

so

\[ \mathrm{Cov}(e_{ij},e_{ik}) = \sigma^2 \rho^{|j-k|}. \]

Then the marginal covariance between two responses is

\[ \mathrm{Cov}(Y_{ij},Y_{ik}) = \sigma_{b1}^2 + \sigma^2 \rho^{|j-k|}. \]

If we tried to remove \(b_{1i}\) and replace it with a new AR(1) error, we would need parameters \((\tilde\sigma^2,\tilde\rho)\) such that for all lags \(h=|j-k|\),

\[ \tilde\sigma^2 \tilde\rho^{h} = \sigma_{b1}^2 + \sigma^2 \rho^{h}. \]

In particular, matching lag \(h=1\) and \(h=2\) would require

\[ \tilde\sigma^2 \tilde\rho = \sigma_{b1}^2 + \sigma^2 \rho, \qquad \tilde\sigma^2 \tilde\rho^2 = \sigma_{b1}^2 + \sigma^2 \rho^2. \]

Dividing the second equation by the first gives

\[ \tilde\rho = \frac{\sigma_{b1}^2 + \sigma^2 \rho^2} {\sigma_{b1}^2 + \sigma^2 \rho}. \]

This value of \(\tilde\rho\) will generally not satisfy the corresponding equation for lag \(h=3\) (or higher), so the full covariance sequence \(\sigma_{b1}^2 + \sigma^2 \rho^{h}\) cannot be represented by an AR(1) structure with only two parameters \((\tilde\sigma^2,\tilde\rho)\).

Conclusion: Therefore, except for degenerate special cases (e.g. \(\sigma_{b1}^2=0\) or when \(m_i=2\) so only one lag exists), there is no equivalent linear model that removes the random intercept and replaces it by a new AR(1) error term.

Toeplitz structure

Assume the original errors have Toeplitz correlation:

\[ \mathrm{Var}(e_{ij})=\sigma^2, \qquad \mathrm{Corr}(e_{ij}, e_{i,j+k}) = \rho_k \quad \text{for all } j \text{ and } k, \]

so

\[ \mathrm{Cov}(e_{ij}, e_{i,j+k}) = \sigma^2 \rho_k. \]

Then for lag \(k\ge 1\),

\[ \mathrm{Cov}(Y_{ij}, Y_{i,j+k}) = \sigma_{b1}^2 + \sigma^2 \rho_k. \]

Define a new Toeplitz error correlation sequence \(\{\tilde\rho_k\}\) by keeping the variance as \(\tilde\sigma^2=\sigma^2+\sigma_{b1}^2\) and setting, for \(k\ge 1\),

\[ \tilde\rho_k = \frac{\sigma_{b1}^2 + \sigma^2 \rho_k}{\sigma_{b1}^2 + \sigma^2}. \]

Then

\[ \mathrm{Var}(Y_{ij}) = \sigma_{b1}^2 + \sigma^2 = \tilde\sigma^2, \]

and for \(k\ge 1\),

\[ \mathrm{Cov}(Y_{ij}, Y_{i,j+k}) = \sigma_{b1}^2 + \sigma^2 \rho_k = \tilde\sigma^2 \tilde\rho_k, \]

which is exactly the Toeplitz form for an error-only model:

\[ Y_{ij} = \beta_1 + \beta_2 x_{ij} + \tilde e_{ij}, \qquad \mathrm{Var}(\tilde e_{ij})=\tilde\sigma^2, \quad \mathrm{Corr}(\tilde e_{ij},\tilde e_{i,j+k})=\tilde\rho_k. \]

Conclusion:

Under Toeplitz, an equivalent model without the random intercept does exist, the random-intercept contribution can be absorbed into a new Toeplitz error variance and new Toeplitz lag correlations.

Question 2 Aijing Yang

(a). Data Loading and Processing

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)

cf_data <- read.csv("NewCFkids-SAS.csv", header = F)
colnames(cf_data) <- c("ID", "FEV1", "AGE", "FEMALE", "PSEUDOA", "F508", "PANCREAT", "AGE0", "AGEL")

# Creating dummy variables for F508 
cf_data$F508_1 <- ifelse(cf_data$F508 == 1, 1, 0)  
cf_data$F508_2 <- ifelse(cf_data$F508 == 2, 1, 0)

# First follow up age
cf_data$first_age <- case_when(cf_data$AGE0 < 8 ~ 1,
                               cf_data$AGE0 >= 8 & cf_data$AGE0 <= 12 ~ 2,
                               cf_data$AGE0 > 12 & cf_data$AGE0 < 15 ~ 3,
                               cf_data$AGE0 >= 15 ~ 4)

# Creating table of percent missing outcome by age split by female and male 
cf_data$Sex <- ifelse(cf_data$FEMALE == 1, "Female", "Male")
missingness_table <- cf_data %>%
  mutate(FEV1_missing = is.na(FEV1)) %>%
  group_by(first_age, Sex) %>%
  summarise(total = n(),
            missing = sum(FEV1_missing),
            percentage_missing = round(100 * mean(FEV1_missing), 2),
            .groups = 'drop') %>% pivot_wider(names_from = Sex,
                                              values_from = c(total, missing, percentage_missing))

missingness_table

The FEV1 outcome has no missing values (0% missing across all age categories and both sexes), suggesting a complete data collection.

(b). Individual trajectories of FEV1 with time (x-axis) as AGE

p1 <- ggplot(cf_data, aes(x = AGE, y = FEV1))
p1 + geom_line(aes(group = factor(ID), alpha = 0.8)) + 
  geom_smooth(method = "loess", se = F, linewidth = 1.2) +
  labs(title = "Individual Profiles of Fev 1 over age", 
       x = "Age", y = "FEV1") 
## `geom_smooth()` using formula = 'y ~ x'

The mean FEV1 declines with the increase of age, and the chaotic trajectories show substantial heterogeneity across individuals. This reflects the progressive lung function decline characteristic of Cystic Fibrosis.

(c). Individual trajectories of FEV1 with time (x-axis) as longitudinal component of age

p2 <- ggplot(cf_data, aes(x = AGEL, y = FEV1))
p2 + geom_line(aes(group = factor(ID), alpha = 0.8)) + 
  geom_smooth(method = "loess", se = F, linewidth = 1.2) +
  labs(title = "Individual Profiles of Fev 1 over longitudinla compnent of age", 
       x = "Longitudinal component of age, (AGEL = AGE - AGE0)", y = "FEV1") 
## `geom_smooth()` using formula = 'y ~ x'

All subjects now start at time 0 (their first follow-up). The smoothed mean shows a gradual decline in FEV1 over follow-up time, though the slope is flatter compared to the AGE plot. This separates the within-subject longitudinal change from between-subject cross-sectional differences due to different ages at study entry.

(d). Mean FEV1 for boys and girls, with the time (x-axis) as AGE.

p3 <- ggplot(cf_data, aes(x = AGE, y = FEV1, group = Sex))
p3 + geom_smooth(method = 'loess', se = F, aes(color = Sex)) +
  labs(title="Smoothed Average Trend of FEV1 by Sex Group Over Age",
       color="Sex", x="Age", y = "Mean FEV1 (% predicted)")  
## `geom_smooth()` using formula = 'y ~ x'

Both males and females have declining FEV1 with age. Females start with higher mean FEV1 at younger ages but experience a steeper decline, crossing below males around age 22-23. Males show a more gradual decline. By age 30, males have lower mean FEV1 than females. This pattern suggests sex differences in the trajectory of lung function decline in CF.

(e). Mean FEV1 for boys and girls, with the time (x-axis) as longitudinal component of age.

p4 <- ggplot(cf_data, aes(x = AGEL, y = FEV1, group = Sex))
p4 + geom_smooth(method = 'loess', se = F, aes(color = Sex)) +
  labs(title="Smoothed Average Trend of FEV1 by Sex Group Over Longitudinal Age",
       color="Sex", x="Longitudinla component of age", y = "Mean FEV1 (% predicted)")
## `geom_smooth()` using formula = 'y ~ x'

Both males and females show declining FEV1 over follow-up time. Males consistently maintain higher mean FEV1 than females throughout the entire follow-up period. Both sexes show similar rates of decline over longitudinal age. This suggests that while females have lower baseline lung function, the within-subject rate of decline is comparable between sexes.

(f). Mean FEV1 by genotype, with the time (x-axis) as AGE

cf_data$Genotype <- factor(cf_data$F508, levels = c(0, 1, 2), labels = c("None", "Homozygous", "Heterozygous"))
p5 <- ggplot(cf_data, aes(x = AGE, y = FEV1, group = Genotype))
p5 + geom_smooth(method = 'loess', se = F, aes(color = Genotype)) +
  labs(title="Mean FEV1 Trajectories by Genotype Over Age",
       color="Genotype", x="Age", y = "Mean FEV1 (% predicted)")  
## `geom_smooth()` using formula = 'y ~ x'

Heterozygous children start with the highest FEV1 at young ages but experience the steepest decline. Homozygous children show a more gradual decline. Children with no F508 mutation maintain relatively stable FEV1 until around age 20, after which they show a sharp decline. By age 30, all three groups converge to similar low FEV1 levels. The crossing of trajectories suggests genotype and age impact health status of lung function.

(g). Mean FEV1 by genotype, with the time (x-axis) as longitudinal component of age

p6 <- ggplot(cf_data, aes(x = AGEL, y = FEV1, group = Genotype))
p6 + geom_smooth(method = 'loess', se = F, aes(color = Genotype)) +
  labs(title="Mean FEV1 Trajectories by Genotype Over Longitudinal Age",
       color="Genotype", x="Longitudinal component of age", y = "Mean FEV1 (% predicted)")
## `geom_smooth()` using formula = 'y ~ x'

When aligned by follow-up time, all three genotype groups show declining FEV1. The hierarchy remains: no genotype mutation > heterozygous > homozygous, suggesting that having the F508 mutation is associated with lower baseline lung function and similar rates of decline.

(h). Mean FEV1 by the new categorical variable of age at first follow up, with the time (x-axis) as AGE

cf_data$first_age_cat <- factor(cf_data$first_age, levels = c(1, 2, 3, 4), labels = c("<8", "8-12", "12-15", ">=15"))
p7 <- ggplot(cf_data, aes(x = AGE, y = FEV1, group = first_age_cat))
p7 + geom_smooth(method = 'loess', se = F, aes(color = first_age_cat)) +
  labs(title="Mean FEV1 Trajectories by Age at First Follow-up Over Age",
       color="Age at First Follow-up", x="Age", y = "Mean FEV1 (% predicted)") 
## `geom_smooth()` using formula = 'y ~ x'

The four cohorts show different trajectory patterns. Younger cohorts (<8 & 8-12) start with higher FEV1 and are observed over a wider age range. Older cohorts enter with lower FEV1 values. This illustrates the cross-sectional component: children entering the study at older ages already have lower lung function.

(i). Mean FEV1 by the new categorical variable of age at first follow up, with the time (x-axis) as longitudinal age

p7 <- ggplot(cf_data, aes(x = AGEL, y = FEV1, group = first_age_cat))
p7 + geom_smooth(method = 'loess', se = F, aes(color = first_age_cat)) +
  labs(title="Mean FEV1 Trajectories by Age at First Follow-up Over Longitudinal Age",
       color="Age at First Follow-up", x="Longitudinal component of age", y = "Mean FEV1 (% predicted)") 
## `geom_smooth()` using formula = 'y ~ x'

When aligned by follow-up time, the four cohorts show parallel declining trajectories but at different levels. Children who entered younger (<8) maintain the highest FEV1, while those entering at >=15 have the lowest. The similar slopes suggest comparable rates of within-subject decline across cohorts, while the level differences reflect the cross-sectional age effect.

Question 3 Irene Wang

a) Fit a model for the random intercept only

colnames(cf_data)
##  [1] "ID"            "FEV1"          "AGE"           "FEMALE"       
##  [5] "PSEUDOA"       "F508"          "PANCREAT"      "AGE0"         
##  [9] "AGEL"          "F508_1"        "F508_2"        "first_age"    
## [13] "Sex"           "Genotype"      "first_age_cat"
library(nlme)
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
m3 <- lme(FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 + 
            AGEL:FEMALE + AGEL:F508_1 + AGEL:F508_2, data=cf_data,
          random= ~ 1 | ID)
summary(m3)
## Linear mixed-effects model fit by REML
##   Data: cf_data 
##        AIC      BIC    logLik
##   12520.08 12578.56 -6249.041
## 
## Random effects:
##  Formula: ~1 | ID
##         (Intercept) Residual
## StdDev:    22.59935 12.19734
## 
## Fixed effects:  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 + AGEL:FEMALE +      AGEL:F508_1 + AGEL:F508_2 
##                 Value Std.Error   DF   t-value p-value
## (Intercept) 103.80627  6.706026 1309 15.479550  0.0000
## AGE0         -1.85532  0.334312  195 -5.549663  0.0000
## AGEL         -0.58782  0.393060 1309 -1.495504  0.1350
## FEMALE       -1.16203  3.397872  195 -0.341987  0.7327
## F508_1       -4.28096  5.611017  195 -0.762956  0.4464
## F508_2       -6.74044  5.642572  195 -1.194569  0.2337
## AGEL:FEMALE  -0.82574  0.249427 1309 -3.310541  0.0010
## AGEL:F508_1  -0.48769  0.422469 1309 -1.154391  0.2486
## AGEL:F508_2  -0.65745  0.421359 1309 -1.560310  0.1189
##  Correlation: 
##             (Intr) AGE0   AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE0        -0.630                                                       
## AGEL        -0.212  0.006                                                
## FEMALE      -0.164 -0.088  0.073                                         
## F508_1      -0.657 -0.002  0.229 -0.021                                  
## F508_2      -0.725  0.123  0.226 -0.062  0.788                           
## AGEL:FEMALE  0.062 -0.005 -0.272 -0.262  0.004  0.011                    
## AGEL:F508_1  0.181 -0.004 -0.856  0.005 -0.267 -0.214 -0.022             
## AGEL:F508_2  0.179 -0.003 -0.853  0.011 -0.215 -0.266 -0.041   0.805     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -4.942954121 -0.528230691 -0.003676248  0.516817787  4.289855016 
## 
## Number of Observations: 1513
## Number of Groups: 200

i. Describe in words the interpretation of the random intercept component (b1i) and report the estimate of the variance for this random effect.

VarCorr(m3)
## ID = pdLogChol(1) 
##             Variance StdDev  
## (Intercept) 510.7307 22.59935
## Residual    148.7751 12.19734
# this extracts variance of random intercept and variance of epsilon, m3.var
m3.var <- as.numeric(VarCorr(m3))[1:2] 
m3.var
## [1] 510.7307 148.7751

The random intercept \(b_{1i}\) represents how each child’s baseline FEV1 differs from the overall average baseline FEV1, after accounting for age at first follow-up, longitudinal age, sex, genotype, and their interactions with time. It captures differences in baseline lung function across children that are not explained by the measured covariates. The estimated variance of the random intercept is 510.73, indicating substantial variability in baseline FEV1 between children.

ii. Output an estimated covariance matrix for Yi (Σi) for a child ID=100073 and describe the patterns in the variances and covariances and estimate a correlation.

id0 <- 100073

# how many repeated measures for this child
dat_i <- cf_data[cf_data$ID == id0, ]
m_i <- nrow(dat_i)
m_i
## [1] 7
# variance components from the random intercept model
# (Intercept) variance = random intercept variance
# Residual variance = within-subject error variance
sig2_b1 <- 510.7307
sig2_e  <- 148.7751

# ZiGZi' (all entries are sig2_b1)
ZGZ <- matrix(sig2_b1, nrow = m_i, ncol = m_i)

# R_i = sig2_e * I
R   <- diag(sig2_e, m_i)

# Sigma_i
Sigma_i <- ZGZ + R
Sigma_i
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
## [1,] 659.5058 510.7307 510.7307 510.7307 510.7307 510.7307 510.7307
## [2,] 510.7307 659.5058 510.7307 510.7307 510.7307 510.7307 510.7307
## [3,] 510.7307 510.7307 659.5058 510.7307 510.7307 510.7307 510.7307
## [4,] 510.7307 510.7307 510.7307 659.5058 510.7307 510.7307 510.7307
## [5,] 510.7307 510.7307 510.7307 510.7307 659.5058 510.7307 510.7307
## [6,] 510.7307 510.7307 510.7307 510.7307 510.7307 659.5058 510.7307
## [7,] 510.7307 510.7307 510.7307 510.7307 510.7307 510.7307 659.5058
# estimate a correlation (ICC)
510.7307 / 659.5058
## [1] 0.7744143

The estimated covariance matrix for child ID = 100073 shows a compound symmetry structure. All diagonal elements are equal to 659.51, representing the marginal variance of FEV1 at each time point, which is the sum of the random intercept variance (510.73) and the residual variance (148.78). All off-diagonal elements are equal to 510.73, indicating a constant covariance between any two measurements within the child due to the shared random intercept. The correlation between repeated measurements (ICC) is approximately 0.77, which means 77% of the total variability in FEV1 is due to differences between children.

For ID=100073, show your calculations of how you can obtain variance of Yij at time point 2 and show how you obtain covariance of Yij=1 and Yij=2.

sig2_b1 <- 510.7307
sig2_e  <- 148.7751

var_Yi2  <- sig2_b1 + sig2_e      # Sigma_i[2,2]
cov_Yi12 <- sig2_b1               # Sigma_i[1,2]

var_Yi2
## [1] 659.5058
cov_Yi12
## [1] 510.7307

iii. Output empirical BLUPs and obtain their variance, then compare that variance to the estimate in Q3(a)(i). If they are different, why?

# Obtain predicted values of b1i (BLUPs from Lecture 4)
m3.re <- as.data.frame(random.effects(m3))
head(m3.re)
dim(m3.re) # 200 individual rows of data corresponding to unique id s
## [1] 200   1
# Obtain the variance of the empirical BLUPs of b1i and note that it's smaller than variance of the random intercept from G matrix above
apply(m3.re, 2, var) # 481.4666 < 510.73 from getVarCov(m3)
## (Intercept) 
##    481.4666

The estimated random intercept variance represents the true between-child variability in baseline FEV1. In contrast, the BLUPs are predictions of individual random intercepts that are estimated with uncertainty. To avoid overfitting to noisy individual data, the BLUPs are shrunk toward the population mean. This shrinkage reduces the spread of the BLUPs, so their empirical variance is smaller than the model-based variance estimate.

iv. Using Cholesky-transformed residuals, plot the empirical semi-variogram and comment on whether the covariance model fits the data well or not.

source("Empirical or Sample Variogram Function.r")
cf_data$res_chol <- resid(m3, type = "normalized")
variogram(resid = cf_data$res_chol,
          timeVar = cf_data$AGEL,
          id = cf_data$ID,
          irregular = T,
          binwidth = 1,
          numElems = 5)

## $bin.mids
## [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
## 
## $bin.means
## [1] 0.7021925 0.7308466 0.9309099 1.1248428 1.2055138 1.2066503 1.3115471
## [8] 1.2395730 1.3755750
## 
## $bin.sizes
## [1]  586 1081  924  795  639  474  341  198   65
## 
## $process.var
## [1] 0.8700767

The empirical semi-variogram increases with lag rather than fluctuating around a constant level, which indicates that residual correlation remains after accounting for the random intercept. This suggests that the random-intercept-only covariance model does not adequately fit the data.

b) Fit a covariance pattern model with the Compound Symmetry for Ri

m3_cs_pat <- gls(
  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 +
    AGEL:FEMALE + AGEL:F508_1 + AGEL:F508_2,
  data = cf_data,
  correlation = corCompSymm(form = ~ 1 | ID),
  method = "REML"
)

summary(m3_cs_pat)
## Generalized least squares fit by REML
##   Model: FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 + AGEL:FEMALE +      AGEL:F508_1 + AGEL:F508_2 
##   Data: cf_data 
##        AIC      BIC    logLik
##   12520.08 12578.56 -6249.041
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##       Rho 
## 0.7744143 
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) 103.80627  6.706025 15.479552  0.0000
## AGE0         -1.85532  0.334312 -5.549664  0.0000
## AGEL         -0.58782  0.393060 -1.495504  0.1350
## FEMALE       -1.16203  3.397872 -0.341987  0.7324
## F508_1       -4.28096  5.611017 -0.762956  0.4456
## F508_2       -6.74044  5.642571 -1.194569  0.2324
## AGEL:FEMALE  -0.82574  0.249427 -3.310541  0.0010
## AGEL:F508_1  -0.48769  0.422469 -1.154391  0.2485
## AGEL:F508_2  -0.65745  0.421359 -1.560310  0.1189
## 
##  Correlation: 
##             (Intr) AGE0   AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE0        -0.630                                                       
## AGEL        -0.212  0.006                                                
## FEMALE      -0.164 -0.088  0.073                                         
## F508_1      -0.657 -0.002  0.229 -0.021                                  
## F508_2      -0.725  0.123  0.226 -0.062  0.788                           
## AGEL:FEMALE  0.062 -0.005 -0.272 -0.262  0.004  0.011                    
## AGEL:F508_1  0.181 -0.004 -0.856  0.005 -0.267 -0.214 -0.022             
## AGEL:F508_2  0.179 -0.003 -0.853  0.011 -0.215 -0.266 -0.041   0.805     
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -2.66032274 -0.73159979  0.01712711  0.73394138  3.18588347 
## 
## Residual standard error: 25.68084 
## Degrees of freedom: 1513 total; 1504 residual

(c) When looking at the estimates of the fixed effects and standard errors from Q3(a) and Q3(b), what do you notice? Explain the results.

When comparing the fixed-effect estimates and standard errors from Q3(a) (random intercept model) and Q3(b) (compound symmetry covariance model), we see that they are essentially the same. This happens because both models assume the same marginal mean and the same marginal covariance structure.

Q3(a): Random Intercept Model

Mean model:

\[ Y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + b_{1i} + \varepsilon_{ij}, \]

where

\[ b_{1i} \sim N(0, \sigma_b^2), \quad \varepsilon_{ij} \sim N(0, \sigma_e^2). \]

Marginal mean:

\[ E(Y_{ij}) = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}. \]

Marginal covariance within subject \(i\):

\[ \mathrm{Var}(Y_{ij}) = \sigma_b^2 + \sigma_e^2, \]

\[ \mathrm{Cov}(Y_{ij}, Y_{ik}) = \sigma_b^2 \quad (j \neq k). \]

This is a compound symmetry structure (constant variance and constant covariance).

Q3(b): Covariance Pattern Model with Compound Symmetry

Mean model:

\[ Y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + e_{ij}. \]

Marginal mean:

\[ E(Y_{ij}) = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}. \]

Covariance structure:

\[ \mathrm{Var}(Y_{ij}) = \sigma^2, \]

\[ \mathrm{Cov}(Y_{ij}, Y_{ik}) = \rho \sigma^2 \quad (j \neq k). \]

Both models assume the same marginal mean \[ E(Y_{ij}) = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} \] and the same compound symmetry covariance structure.

The parameters are related by

\[ \sigma_b^2 = \rho \sigma^2, \quad \sigma_e^2 = (1 - \rho)\sigma^2. \]

Since the marginal covariance matrix is the same in both models, the GLS estimates of the fixed effects and their standard errors are essentially identical.

Question 4 Anna Efstate

a) Fit a model for random intercept and random slope for the longitudinal age

library(nlme)
m4 <- lme(FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 + 
            AGEL:FEMALE + AGEL:F508_1 + AGEL:F508_2, 
          data=cf_data,
          random = ~ AGEL | ID)

#Type 3 tests of fixed effects (Wald Test)
anova(m4, type = 'marginal')
summary(m4)
## Linear mixed-effects model fit by REML
##   Data: cf_data 
##        AIC      BIC    logLik
##   12415.17 12484.28 -6194.586
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev   Corr  
## (Intercept) 22.63675 (Intr)
## AGEL         2.12442 -0.158
## Residual    10.86380       
## 
## Fixed effects:  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 + AGEL:FEMALE +      AGEL:F508_1 + AGEL:F508_2 
##                 Value Std.Error   DF   t-value p-value
## (Intercept) 104.51929  6.644351 1309 15.730548  0.0000
## AGE0         -1.91054  0.331048  195 -5.771197  0.0000
## AGEL         -0.60278  0.590297 1309 -1.021148  0.3074
## FEMALE       -1.30048  3.370111  195 -0.385887  0.7000
## F508_1       -4.23809  5.563661  195 -0.761744  0.4471
## F508_2       -6.65233  5.594474  195 -1.189090  0.2359
## AGEL:FEMALE  -0.76243  0.381214 1309 -1.999992  0.0457
## AGEL:F508_1  -0.50011  0.635753 1309 -0.786643  0.4316
## AGEL:F508_2  -0.74591  0.634492 1309 -1.175599  0.2400
##  Correlation: 
##             (Intr) AGE0   AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE0        -0.630                                                       
## AGEL        -0.211  0.002                                                
## FEMALE      -0.164 -0.088  0.075                                         
## F508_1      -0.657 -0.002  0.230 -0.021                                  
## F508_2      -0.725  0.122  0.227 -0.062  0.788                           
## AGEL:FEMALE  0.060 -0.003 -0.279 -0.265  0.005  0.012                    
## AGEL:F508_1  0.179 -0.001 -0.850  0.005 -0.269 -0.214 -0.023             
## AGEL:F508_2  0.178  0.000 -0.845  0.012 -0.215 -0.268 -0.046   0.797     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.328977867 -0.461790924  0.005172532  0.478384722  4.319672993 
## 
## Number of Observations: 1513
## Number of Groups: 200

(i): Random intercept + random slope model (nlme::lme)

# Q4(a)(i) --- Random intercept + random slope: G and Corr(G)

# 1) Residual variance (sigma^2)
sigma2 <- m4$sigma^2

# 2) Relative random-effects covariance from nlme (scaled by sigma^2)
G_rel <- as.matrix(pdMatrix(m4$modelStruct$reStruct$ID))

# 3) Actual G matrix for random effects (in the same scale as VarCorr output)
G <- sigma2 * G_rel
G
##             (Intercept)      AGEL
## (Intercept)  512.422280 -7.622039
## AGEL          -7.622039  4.513159
# 4) Correlation matrix for random effects
corG <- cov2cor(G)
corG
##             (Intercept)       AGEL
## (Intercept)   1.0000000 -0.1584955
## AGEL         -0.1584955  1.0000000
# 5) Pull out key numbers (these now match VarCorr)
var_b1   <- G[1, 1]      # Var(random intercept)
var_b3   <- G[2, 2]      # Var(random slope)
cov_b1b3 <- G[1, 2]      # Cov(intercept, slope)
cor_b1b3 <- corG[1, 2]   # Corr(intercept, slope)

c(var_b1 = var_b1,
  var_b3 = var_b3,
  cov_b1b3 = cov_b1b3,
  cor_b1b3 = cor_b1b3)
##      var_b1      var_b3    cov_b1b3    cor_b1b3 
## 512.4222799   4.5131589  -7.6220392  -0.1584955
# show VarCorr table (should agree with G and corG)
VarCorr(m4)
## ID = pdLogChol(AGEL) 
##             Variance   StdDev   Corr  
## (Intercept) 512.422280 22.63675 (Intr)
## AGEL          4.513159  2.12442 -0.158
## Residual    118.022081 10.86380

In this random intercept and random slope model, \(b_{1i}\) represents subject \(i\)’s deviation from the population mean baseline FEV1 level (random intercept), and \(b_{3i}\) represents subject \(i\)’s deviation from the population mean rate of change in FEV1 over longitudinal age AGEL (random slope). The estimated random-effects covariance matrix is \[ G = \begin{pmatrix} 512.43 & -7.62\\ -7.62 & 4.51 \end{pmatrix}, \] and the corresponding correlation between the random intercept and random slope is approximately \(-0.158\). This indicates a weak negative association: subjects with higher baseline FEV1 tend to have slightly smaller rates of change in FEV1 over time. Note that \(-7.62\) is the covariance (not the correlation) between the random intercept and slope.

  1. Estimated covariance matrix for \(Y_i\) (ID = 100073)
# Q4(a)(ii): Construct estimated covariance matrix Sigma_i for subject ID = 100073

# 1) Choose the subject
id_target <- 100073
dat_i <- subset(cf_data, ID == id_target)

# Quick check: number of repeated measures for this subject
n_i <- nrow(dat_i)
n_i
## [1] 7
# 2) Random-effects design matrix Z_i for (Intercept, AGEL)
#    This matches random = ~ AGEL | ID
Z_i <- model.matrix(~ AGEL, data = dat_i)   # columns: (Intercept), AGEL
Z_i
##   (Intercept)  AGEL
## 1           1 0.000
## 2           1 0.331
## 3           1 1.333
## 4           1 2.086
## 5           1 3.877
## 6           1 4.854
## 7           1 5.966
## attr(,"assign")
## [1] 0 1
# 3) Extract the actual random-effects covariance matrix G (NOT the scaled version)
#    pdMatrix gives the scaled/relative version, so multiply by sigma^2
sigma2 <- m4$sigma^2
G_rel  <- as.matrix(pdMatrix(m4$modelStruct$reStruct$ID))
G      <- sigma2 * G_rel
G
##             (Intercept)      AGEL
## (Intercept)  512.422280 -7.622039
## AGEL          -7.622039  4.513159
# 4) Residual covariance R_i = sigma^2 * I (independent within-subject residuals in this model)
R_i <- sigma2 * diag(n_i)
R_i
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
## [1,] 118.0221   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
## [2,]   0.0000 118.0221   0.0000   0.0000   0.0000   0.0000   0.0000
## [3,]   0.0000   0.0000 118.0221   0.0000   0.0000   0.0000   0.0000
## [4,]   0.0000   0.0000   0.0000 118.0221   0.0000   0.0000   0.0000
## [5,]   0.0000   0.0000   0.0000   0.0000 118.0221   0.0000   0.0000
## [6,]   0.0000   0.0000   0.0000   0.0000   0.0000 118.0221   0.0000
## [7,]   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000 118.0221
# 5) Estimated covariance of Y_i: Sigma_i = Z_i G Z_i' + R_i
Sigma_i <- Z_i %*% G %*% t(Z_i) + R_i
Sigma_i
##          1        2        3        4        5        6        7
## 1 630.4444 509.8994 502.2621 496.5227 482.8716 475.4249 466.9492
## 2 509.8994 625.8930 501.7305 497.1160 486.1404 480.1532 473.3386
## 3 502.2621 501.7305 618.1434 498.9120 496.0356 494.4666 492.6807
## 4 496.5227 497.1160 498.9120 618.2838 503.4719 505.2231 507.2162
## 5 482.8716 486.1404 496.0356 503.4719 639.1809 530.8072 541.7887
## 6 475.4249 480.1532 494.4666 505.2231 530.8072 662.7856 560.6482
## 7 466.9492 473.3386 492.6807 507.2162 541.7887 560.6482 700.1358

For subject ID = 100073, the estimated covariance matrix \(\Sigma_i\) reflects the structure of a random intercept and random slope model. The diagonal elements (variances) are not constant across time and generally increase at later follow-up ages, because the random slope component contributes variability that grows with AGEL. The off-diagonal elements (covariances) are also not constant; they depend on both time values through the random slope term, so measurements taken at larger AGEL values tend to have larger covariances with one another.

Compared to the random-intercept-only model in Q3(a)(ii), which induces a compound symmetry structure with constant within-subject covariance, this model produces time-dependent variances and covariances and therefore allows greater flexibility in modeling longitudinal growth.

  1. Obtain empirical BLUPs and compare their variances to G
# Q4(a)(iii): Obtain empirical BLUPs and compare their variances to G

# 1) Extract empirical BLUPs (subject-specific random effects)
#    These are the conditional estimates E(b_i | data)
blups <- ranef(m4)   # columns: (Intercept), AGEL
head(blups)
# 2) Compute empirical variance-covariance matrix of BLUPs across subjects
#    This estimates Var( b_hat_i )
var_blups <- var(as.matrix(blups))
var_blups
##             (Intercept)      AGEL
## (Intercept)  467.596525 -1.934763
## AGEL          -1.934763  2.786447
# 3) Extract the true random-effects variance-covariance matrix G
#    (same as constructed in Q4(a)(i))
sigma2 <- m4$sigma^2
G_rel  <- as.matrix(pdMatrix(m4$modelStruct$reStruct$ID))
G      <- sigma2 * G_rel
G
##             (Intercept)      AGEL
## (Intercept)  512.422280 -7.622039
## AGEL          -7.622039  4.513159
# 4) Compare BLUP variance to G
var_blups
##             (Intercept)      AGEL
## (Intercept)  467.596525 -1.934763
## AGEL          -1.934763  2.786447
G
##             (Intercept)      AGEL
## (Intercept)  512.422280 -7.622039
## AGEL          -7.622039  4.513159

The empirical variance of the BLUPs is smaller than the estimated random-effects variance matrix \(G\) obtained in Q4(a)(i). For example, the variance of the random intercept based on BLUPs is 467.60 compared to 512.43 from \(G\), and the variance of the random slope is 2.79 compared to 4.51 from \(G\). Similarly, the covariance magnitude is also reduced.

This difference occurs because BLUPs are shrinkage estimators. Each subject-specific effect is pulled toward the population mean depending on the amount of within-subject information and residual variability. As a result, the empirical variability of the predicted random effects across subjects underestimates the true between-subject variability represented by \(G\).

  1. Cholesky-transformed residuals + empirical semi-variogram
library(nlme)

# Compute variogram of normalized residuals
vg <- Variogram(m4, form = ~ AGEL | ID, resType = "normalized")

# Plot variogram
plot(vg, smooth = TRUE,
     xlab = "Time lag (|ΔAGEL|)",
     ylab = "Semi-variogram")

# DO NOT use abline()

The empirical semi-variogram of the Cholesky-transformed (normalized) residuals is not flat across time lags. Instead, it shows a systematic downward trend as lag increases. This indicates that residual dependence remains after accounting for the random intercept and random slope structure. Therefore, the fitted covariance model does not fully capture the within-subject correlation structure, suggesting possible model misspecification.

4b

# Fit random-intercept-only model (m3)
m3 <- lme(
  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 +
    AGEL:FEMALE + AGEL:F508_1 + AGEL:F508_2,
  data   = cf_data,
  random = ~ 1 | ID,
  method = "REML"
)

# Random intercept + random slope model (already fitted as m4)
# Compare models using likelihood ratio test
anova(m3, m4)

To assess whether inclusion of a random slope for AGEL is necessary, we compared the random-intercept-only model (m3) with the random intercept and random slope model (m4) using a likelihood ratio test under REML estimation. We are testing the null hypothesis that the random slope is not needed: \[ \mathrm{H_0} : \sigma_{b2}^2= 0 \ \mathrm{vs} \ \mathrm{H_0} : \sigma_{b2}^2> 0 \] The likelihood ratio statistic was 108.91 with p-value < 0.0001, providing strong evidence against the null hypothesis that the random slope variance is zero. In addition, the AIC decreased from 12520.08 (m3) to 12415.17 (m4), indicating improved model fit.

Therefore, there is substantial between-subject variability in rates of change in FEV1 over time, and the random intercept + random slope model (m4) is preferred.

Question 5

(a). Rewriting Random Intercept + Random Slope to Random Slope Only

library(nlme)

# Fit model with random slope only (no random intercept)
m5 <- lme(
  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 +
    AGEL:FEMALE + AGEL:F508_1 + AGEL:F508_2,
  data = cf_data,
  random = list(ID = pdDiag(~ 0 + AGEL)),  # random slope only
  method = "REML"
)

# Type III Wald tests for fixed effects
anova(m5, type = "marginal")
# Model summary
summary(m5)
## Linear mixed-effects model fit by REML
##   Data: cf_data 
##        AIC      BIC    logLik
##   13407.21 13465.69 -6692.605
## 
## Random effects:
##  Formula: ~0 + AGEL | ID
##             AGEL Residual
## StdDev: 4.457029 17.44713
## 
## Fixed effects:  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 + AGEL:FEMALE +      AGEL:F508_1 + AGEL:F508_2 
##                 Value Std.Error   DF   t-value p-value
## (Intercept) 105.56602 3.0920752 1309  34.14083  0.0000
## AGE0         -2.01523 0.1475508  195 -13.65788  0.0000
## AGEL         -0.43350 1.1344034 1309  -0.38214  0.7024
## FEMALE       -0.74676 1.5786127  195  -0.47305  0.6367
## F508_1       -3.98344 2.6464852  195  -1.50518  0.1339
## F508_2       -6.94036 2.6622896  195  -2.60691  0.0098
## AGEL:FEMALE  -0.88823 0.7347241 1309  -1.20893  0.2269
## AGEL:F508_1  -0.68217 1.2222087 1309  -0.55815  0.5768
## AGEL:F508_2  -0.92513 1.2196276 1309  -0.75853  0.4483
##  Correlation: 
##             (Intr) AGE0   AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE0        -0.601                                                       
## AGEL        -0.333 -0.005                                                
## FEMALE      -0.186 -0.069  0.118                                         
## F508_1      -0.677 -0.009  0.360 -0.014                                  
## F508_2      -0.735  0.105  0.355 -0.050  0.796                           
## AGEL:FEMALE  0.094 -0.003 -0.281 -0.412  0.007  0.016                    
## AGEL:F508_1  0.282  0.006 -0.849  0.007 -0.419 -0.334 -0.022             
## AGEL:F508_2  0.280  0.007 -0.843  0.016 -0.337 -0.417 -0.048   0.796     
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.64218144 -0.46871145  0.03134041  0.50424566  3.22331770 
## 
## Number of Observations: 1513
## Number of Groups: 200

The random slope only model allows each child to have a different rate of change in FEV1 over time, but assumes all children share the same baseline level. The estimated variance of the random slope indicates meaningful variability in individual growth trajectories. However, compared with the random intercept and slope model in Question 4, this model has a higher AIC and worse log-likelihood, indicating a poorer fit. Therefore, allowing both random intercepts and random slopes provides a better representation of the longitudinal data.

(b). Testing Whether Genotype Contributes to Change in FEV1 Over Time

library(nlme)

# Fit model with two age terms (AGE0 and AGEL)
# Use ML when comparing fixed effects
m_age2 <- lme(
  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 +
    AGEL:FEMALE + AGEL:F508_1 + AGEL:F508_2,
  data   = cf_data,
  random = ~ AGEL | ID,   # same random structure as Q4
  method = "ML"
)

# Fit model with single AGE variable
m_age1 <- lme(
  FEV1 ~ AGE + FEMALE + F508_1 + F508_2 +
    AGE:FEMALE + AGE:F508_1 + AGE:F508_2,
  data   = cf_data,
  random = ~ AGEL | ID,
  method = "ML"
)

# Likelihood ratio test comparing the two age specifications
anova(m_age2, m_age1)
# Refit the chosen age model using REML (assume AGE0 + AGEL retained)
m_final <- update(m_age2, method = "REML")

# Joint Wald test for genotype-by-time interaction
# This tests AGEL:F508_1 and AGEL:F508_2 simultaneously
anova(m_final, type = "marginal")
# View coefficient estimates for interpretation
summary(m_final)
## Linear mixed-effects model fit by REML
##   Data: cf_data 
##        AIC      BIC    logLik
##   12415.17 12484.28 -6194.586
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev   Corr  
## (Intercept) 22.63675 (Intr)
## AGEL         2.12442 -0.158
## Residual    10.86380       
## 
## Fixed effects:  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 + AGEL:FEMALE +      AGEL:F508_1 + AGEL:F508_2 
##                 Value Std.Error   DF   t-value p-value
## (Intercept) 104.51929  6.644351 1309 15.730548  0.0000
## AGE0         -1.91054  0.331048  195 -5.771197  0.0000
## AGEL         -0.60278  0.590297 1309 -1.021148  0.3074
## FEMALE       -1.30048  3.370111  195 -0.385887  0.7000
## F508_1       -4.23809  5.563661  195 -0.761744  0.4471
## F508_2       -6.65233  5.594474  195 -1.189090  0.2359
## AGEL:FEMALE  -0.76243  0.381214 1309 -1.999992  0.0457
## AGEL:F508_1  -0.50011  0.635753 1309 -0.786643  0.4316
## AGEL:F508_2  -0.74591  0.634492 1309 -1.175599  0.2400
##  Correlation: 
##             (Intr) AGE0   AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE0        -0.630                                                       
## AGEL        -0.211  0.002                                                
## FEMALE      -0.164 -0.088  0.075                                         
## F508_1      -0.657 -0.002  0.230 -0.021                                  
## F508_2      -0.725  0.122  0.227 -0.062  0.788                           
## AGEL:FEMALE  0.060 -0.003 -0.279 -0.265  0.005  0.012                    
## AGEL:F508_1  0.179 -0.001 -0.850  0.005 -0.269 -0.214 -0.023             
## AGEL:F508_2  0.178  0.000 -0.845  0.012 -0.215 -0.268 -0.046   0.797     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.328977867 -0.461790924  0.005172532  0.478384722  4.319672993 
## 
## Number of Observations: 1513
## Number of Groups: 200

The likelihood ratio test comparing the two age parameterizations (AGE0 + AGEL versus single AGE) was not statistically significant (p = 0.1397). Therefore, there is no strong evidence that modeling age with two separate components improves model fit, and the simpler age specification is adequate.

To assess whether genotype contributes to the change in FEV1 over time, we conducted a joint Wald test of the interaction terms AGEL:F508_1 and AGEL:F508_2. Neither interaction term was statistically significant (p = 0.4316 and p = 0.2400, respectively). Thus, there is no evidence that genotype significantly affects the rate of change in FEV1 over time.

(c). Testing Whether Gender Contributes to Change in FEV1 Over Time

library(nlme)

# Refit the selected age model using REML
# (Assume AGE0 + AGEL retained; change if you selected AGE instead)
m_final <- lme(
  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 +
    AGEL:FEMALE + AGEL:F508_1 + AGEL:F508_2,
  data   = cf_data,
  random = ~ AGEL | ID,
  method = "REML"
)

# Joint Wald test for gender-by-time interaction
# This tests whether AGEL:FEMALE = 0
anova(m_final, type = "marginal")
# View coefficient estimates for interpretation
summary(m_final)
## Linear mixed-effects model fit by REML
##   Data: cf_data 
##        AIC      BIC    logLik
##   12415.17 12484.28 -6194.586
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev   Corr  
## (Intercept) 22.63675 (Intr)
## AGEL         2.12442 -0.158
## Residual    10.86380       
## 
## Fixed effects:  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 + AGEL:FEMALE +      AGEL:F508_1 + AGEL:F508_2 
##                 Value Std.Error   DF   t-value p-value
## (Intercept) 104.51929  6.644351 1309 15.730548  0.0000
## AGE0         -1.91054  0.331048  195 -5.771197  0.0000
## AGEL         -0.60278  0.590297 1309 -1.021148  0.3074
## FEMALE       -1.30048  3.370111  195 -0.385887  0.7000
## F508_1       -4.23809  5.563661  195 -0.761744  0.4471
## F508_2       -6.65233  5.594474  195 -1.189090  0.2359
## AGEL:FEMALE  -0.76243  0.381214 1309 -1.999992  0.0457
## AGEL:F508_1  -0.50011  0.635753 1309 -0.786643  0.4316
## AGEL:F508_2  -0.74591  0.634492 1309 -1.175599  0.2400
##  Correlation: 
##             (Intr) AGE0   AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE0        -0.630                                                       
## AGEL        -0.211  0.002                                                
## FEMALE      -0.164 -0.088  0.075                                         
## F508_1      -0.657 -0.002  0.230 -0.021                                  
## F508_2      -0.725  0.122  0.227 -0.062  0.788                           
## AGEL:FEMALE  0.060 -0.003 -0.279 -0.265  0.005  0.012                    
## AGEL:F508_1  0.179 -0.001 -0.850  0.005 -0.269 -0.214 -0.023             
## AGEL:F508_2  0.178  0.000 -0.845  0.012 -0.215 -0.268 -0.046   0.797     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.328977867 -0.461790924  0.005172532  0.478384722  4.319672993 
## 
## Number of Observations: 1513
## Number of Groups: 200

We test whether gender significantly affects the rate of change in FEV1 over time.

Null hypothesis (H₀): The interaction between age and gender is zero (AGEL:FEMALE = 0). There is no difference in slope between males and females.

Alternative hypothesis (H₁): The interaction between age and gender is not zero (AGEL:FEMALE ≠ 0). The rate of change in FEV1 differs by gender.

We use a Wald test for the fixed effect AGEL:FEMALE.

From the model output:

p-value for AGEL:FEMALE = 0.0457

Since p < 0.05, we reject the null hypothesis at the 5% significance level.

This suggests that gender has a statistically significant effect on the change in FEV1 over time. The estimated coefficient for AGEL:FEMALE is negative, indicating that females experience a faster decline in FEV1 over time compared to males (reference group).

(d). Residual Diagnostics for the Final Mean Model

# Q5(d)(i) Residual diagnostics for final mean model (m_final)

res_norm <- resid(m_final, type = "normalized")  # Cholesky/normalized residuals
fit_marg <- fitted(m_final)                      # predicted marginal mean

# 1) Histogram
hist(res_norm,
     breaks = 30,
     main = "Histogram of Cholesky-Transformed Residuals",
     xlab = "Normalized residuals")

# 2) Residuals vs predicted marginal mean
plot(fit_marg, res_norm,
     xlab = "Predicted marginal mean",
     ylab = "Normalized residuals",
     main = "Residuals vs Predicted Marginal Mean")
abline(h = 0, lty = 2)

# 3) Residuals vs time (use AGE or AGEL depending on your final choice)
plot(cf_data$AGEL, res_norm,
     xlab = "Time (AGEL)",
     ylab = "Normalized residuals",
     main = "Residuals vs Time (AGEL)")
abline(h = 0, lty = 2)

The histogram of the Cholesky-transformed (normalized) residuals appears approximately symmetric and bell-shaped, centered around zero. Although there may be slight deviations in the tails, there is no strong evidence of severe non-normality. Thus, the normality assumption for the residuals appears reasonable.

The plot of normalized residuals versus predicted marginal mean response shows no obvious systematic pattern. The residuals are randomly scattered around zero with approximately constant variability across the range of fitted values. This suggests no strong evidence of heteroscedasticity or misspecification of the mean structure.

The plot of normalized residuals versus time (AGEL) does not exhibit any clear trend, curvature, or increasing/decreasing spread over time. The residuals fluctuate randomly around zero, indicating that the chosen functional form for age in the mean model is adequate.

Overall, the residual diagnostics do not reveal any serious violations of the model assumptions. The mean response model appears to provide a reasonable fit to the data.

(e). Final Mean Model and Covariance Model

Final mean model: \[ E(\text{FEV1}_{ij}) = \beta_0 + \beta_{A0}\,\text{AGE0}_i + \beta_{A}\,\text{AGEL}_{ij} + \beta_F\,\text{FEMALE}_i + \beta_{G1}\,\text{F508\_1}_i + \beta_{G2}\,\text{F508\_2}_i + \beta_{AF}\,(\text{AGEL}_{ij}\times \text{FEMALE}_i). \]

Final covariance model (random intercept + random slope): \[ \mathrm{Var}(Y_i)=\Sigma_i = Z_i\,G\,Z_i' + \sigma^2 I_{n_i},\qquad Z_i = \begin{pmatrix} 1 & \text{AGEL}_{i1}\\ \vdots & \vdots\\ 1 & \text{AGEL}_{i n_i}\end{pmatrix}, \quad (b_{1i}, b_{3i})' \sim N(0,G). \]

In conclusion, after adjusting for age and genotype, we found evidence that the mean change in FEV1 over time differs by gender, with a statistically significant age-by-gender interaction. In contrast, genotype did not show evidence of affecting the rate of change in FEV1 over time. A random intercept and random slope covariance structure was used to account for between-subject heterogeneity in baseline FEV1 and longitudinal change.