BIS628 HW3 Cystic Fibrosis

true knitr

knitr::opts_chunk$set(echo = TRUE)

my working directory

setwd("B:/OneDrive - Yale University/Documents/Yale University/2semester/BIS628 Longitudinal Multilevel Analysis/HW/3")

remove environment

rm(list = ls())

detach packages

attach my favorite packages

library(tidyverse)
library(nlme)
library(pscl)
library(broom)
library(formatR)
library(mgcv)

Question 1

Definition of linear mixed effects models

1.a Re-write a LME model in which the random intercept and random slope of one covariate (i.e. only two fixed effects and only two random effects are present) have independent correlation structure (G matrix has ‘0’ on the off-diagonal) and the error terms also have independent correlation structure (R matrix has ‘0s’ on the off-diagonal) into an equivalent LME model in which the only random effect present is the slope (there is no random intercept), and there is also an error term.

1.b For a LME model including fixed effects, random intercept with variance \(\sigma^2_{b1}\) and AR(1) error term with variance \(\sigma^2\) and a correlation coefficient \(\rho\), is there an equivalent linear model without the random intercept but with a new AR(1) error term? Why or Why not?

Question 2

import-export business

CFkids <- read.csv("NewCFkids-SAS.csv")
glimpse(CFkids)

## Rows: 1,512
## Columns: 9
## $ X100073  <int> 100073, 100073, 100073, 100073, 100073, 100073, 100111, 10011…
## $ X113.8   <dbl> 98.18, 98.73, 101.79, 98.04, 94.32, 95.48, 96.85, 101.05, 100…
## $ X8.452   <dbl> 8.783, 9.785, 10.538, 12.329, 13.306, 14.418, 12.515, 13.103,…
## $ X1       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X1.1     <int> 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X2       <int> 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X1.2     <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ X8.452.1 <dbl> 8.452, 8.452, 8.452, 8.452, 8.452, 8.452, 12.515, 12.515, 12.…
## $ X0       <dbl> 0.331, 1.333, 2.086, 3.877, 4.854, 5.966, 0.000, 0.588, 2.590…

Question 2.a Create 2 dummy variables from the variable genotype: F508_1 (1=if F508 is equal to ‘1’, otherwise=0) and F508_2 (1=if F508 is equal to ‘2’, otherwise=0)

#rename
CFkids <- CFkids %>%
  rename(ID = X100073, FEV1 = X113.8, AGE = X8.452, FEMALE = X1, 
         PSEUDOA = X1.1, F508 = X2, PANCREAT = X1.2, AGE0 = X8.452.1, 
         AGEL = X0)
#dummies
CFkids$F508_1 <- ifelse(CFkids$F508 == "1", 1, 0)
CFkids$F508_2 <- ifelse(CFkids$F508 == "2", 1, 0)
#categorical
CFkids$AGE_CAT <- CFkids$AGE
CFkids$AGE_CAT <- factor(cut(CFkids$AGE_CAT, 
                             breaks = c(0, 8, 12, 15, Inf), 
                             labels = c("1", "2", "3", "4"),
                             right = FALSE))
CFkids$FEMALE <- factor(CFkids$FEMALE)
CFkids$F508 <- factor(CFkids$F508)
#table
na_counts <- CFkids %>%
  group_by(FEMALE) %>%
  summarise(`NA` = sum(is.na(FEV1)))
print(na_counts)

## # A tibble: 2 × 2
##   FEMALE  `NA`
##   <fct>  <int>
## 1 0          0
## 2 1          0

there’s no missing data

Question 2.b On a single graph, construct a time plot that displays individual trajectories of FEV1 with time(x-axis) as AGE, with a smoothed estimate of the population mean FEV1 over time. Describe what you see.

figure2b <- ggplot(CFkids, aes(x = AGE, y = FEV1))
figure2b +  geom_line(aes(group = factor(ID)), color = "darkred") + labs(title = "Individual Profiles of FEV1 over Age", x = "Age (in years)") + geom_smooth(method = 'loess', se = FALSE, color = "darkblue")

## `geom_smooth()` using formula = 'y ~ x'

i see a diminishing FEV1 response for the population over Age

Question 2.c On a single graph, construct a time plot that displays individual trajectories of FEV1 with time (x-axis) as AGEL, with a smoothed estimate of the population mean FEV1 over time. Describe what you see.

figure2c <- ggplot(CFkids, aes(x = AGEL, y = FEV1))
figure2c +  geom_line(aes(group = factor(ID)), color = "darkred") + labs(title = "Individual Profiles of FEV1 over Time", x = "Age After Follow Up (in years)") + geom_smooth(method = 'loess', se = FALSE, color = "darkblue")

## `geom_smooth()` using formula = 'y ~ x'

I see a less diminishing FEV1 for the population over their age since their follow up periods, relative to just Age.

Question 2.d On a single graph, construct a time plot that displays mean FEV1 for boys and girls, with the time (x-axis) as AGE. Briefly describe the time trends for boys and girls.

figure2d <- ggplot(CFkids, aes(x = AGE, y = FEV1, group  = FEMALE))
figure2d + 
  geom_smooth(method = 'loess', 
              se = FALSE, 
              aes(color = FEMALE)) +
  labs(title="Smoothed Average Trend Lines of FEV1",
       color="Gender", x="Age (in years)") + 
  scale_color_manual(values=c("blue","red"))

## `geom_smooth()` using formula = 'y ~ x'

males see a more direct, linear relationship between FEV1 response than women, who see a convex relationship.

Question 2.e On a single graph, construct a time plot that displays mean FEV1 for boys and girls, with the time (x-axis) as AGEL. Briefly describe the time trends for boys and girls.

figure2e <- ggplot(CFkids, aes(x = AGEL, y = FEV1, group  = FEMALE))
figure2e + 
  geom_smooth(method = 'loess', 
              se = FALSE, 
              aes(color = FEMALE)) +
  labs(title="Smoothed Average Trend Lines of FEV1",
       color="Gender", x="Age After Follow Up (in years)") + 
  scale_color_manual(values=c("blue","red"))

## `geom_smooth()` using formula = 'y ~ x'

The slope between genders is different between 0 and 2.5 years of follow up, after which females see a convex and males see a concave relationship.

Question 2.f On a single graph, construct a time plot that displays mean FEV1 by genotype, with the time (x-axis) as AGE. Briefly describe the time trends for the three genotype groups.

figure2f <- ggplot(CFkids, aes(x = AGE, y = FEV1, group  = F508))
figure2f + 
  geom_smooth(method = 'loess', 
              se = FALSE, 
              aes(color = F508)) +
  labs(title="Smoothed Average Trend Lines of FEV1",
       color="Genotype", x="Age (in years)") + 
  scale_color_manual(values=c("blue","red", "darkgreen"))

## `geom_smooth()` using formula = 'y ~ x'

Genotypes O and 2, at 30 years old, end in similar spots but do not follow the same trends. Genotype 1 stays smooth until roughly 17 years old, then flattens out.

Question 2.g On a single graph, construct a time plot that displays mean FEV1 by genotype, with the time (x-axis) as AGEL. Briefly describe the time trends for the three genotype groups.

figure2g <- ggplot(CFkids, aes(x = AGEL, y = FEV1, group  = F508))
figure2g + 
  geom_smooth(method = 'loess', 
              se = FALSE, 
              aes(color = F508)) +
  labs(title="Smoothed Average Trend Lines of FEV1",
       color="Genotype", x="Age After Follow Up (in years)") + 
  scale_color_manual(values=c("blue","red", "darkgreen"))

## `geom_smooth()` using formula = 'y ~ x'

The FEV1 response follows similar trends for all three genotypes until roughly 3 years, when the response profile differs and does not remain linear for genotype 0 or 3.

Question 2.h On a single graph, construct a time plot that displays mean FEV1 by the new categorical variable of age at first follow up, with the time (x-axis) as AGE. Briefly describe the time trends for the four cohorts.

figure2h <- ggplot(CFkids, aes(x = AGE, y = FEV1, group  = AGE_CAT))
figure2h + 
  geom_smooth(method = 'loess', 
              se = FALSE, 
              aes(color = AGE_CAT)) +
  labs(title="Smoothed Average Trend Lines of FEV1",
       color="Age Cohort", x="Age (in years)") + 
  scale_color_manual(values=c("blue","red", "darkgreen", "darkorange"))

## `geom_smooth()` using formula = 'y ~ x'

The response profile for the first three age cohorts seems similar, until the fourth and oldest cohort, which has a flatter slope.

Question 2.i On a single graph, construct a time plot that displays mean FEV1 by the new categorical variable of age at first follow up, with the time (x-axis) as AGEL. Briefly describe the time trends for the four cohorts.

figure2i <- ggplot(CFkids, aes(x = AGEL, y = FEV1, group  = AGE_CAT))
figure2i + 
  geom_smooth(method = 'loess', 
              se = FALSE, 
              aes(color = AGE_CAT)) +
  labs(title="Smoothed Average Trend Lines of FEV1",
       color="Age Cohort", x="Age After Follow Up (in years)") + 
  scale_color_manual(values=c("blue","red", "darkgreen", "darkorange"))

## `geom_smooth()` using formula = 'y ~ x'

From youngest to oldest, the cohorts have diminishing FEV1 responses over their follow up periods.

Question 3

Consider the following maximal model: a fixed intercept \(X_1\), age at first follow up \(X_2\), longitudinal age \(X_3\), female sex \(X_4\), F508_1 \(X_5\), F508_2 \(X_6\), and 3 interactions: longitudinal age by female sex \(X_3*X_4\), longitudinal age by F508_1 \(X_3*X_5\), and longitudinal age by F508_2 \(X_3*X_6\)

Fit a model for the random intercept only:

#using the lme function, fitting the maximal model.
lme3ai <- lme(FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL*FEMALE + AGEL*F508_1 + AGEL*F508_2, data = CFkids, random = ~ 1 | ID)
#summary
summary(lme3ai)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC      BIC    logLik
##   12512.41 12570.88 -6245.207
## 
## Random effects:
##  Formula: ~1 | ID
##         (Intercept) Residual
## StdDev:    22.59066 12.19951
## 
## Fixed effects:  FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL * FEMALE +      AGEL * F508_1 + AGEL * F508_2 
##                  Value Std.Error   DF   t-value p-value
## (Intercept)  103.81003  6.703758 1307 15.485349  0.0000
## AGE           -1.85437  0.334193 1307 -5.548804  0.0000
## AGEL           1.26394  0.514581 1307  2.456243  0.0142
## FEMALE1       -1.19867  3.397055  196 -0.352856  0.7246
## F508_1        -4.27993  5.609198  196 -0.763020  0.4464
## F508_2        -6.77695  5.640915  196 -1.201392  0.2310
## AGEL:FEMALE1  -0.81968  0.249580 1307 -3.284231  0.0010
## AGEL:F508_1   -0.48790  0.422543 1307 -1.154682  0.2484
## AGEL:F508_2   -0.65120  0.421501 1307 -1.544959  0.1226
##  Correlation: 
##              (Intr) AGE    AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE          -0.630                                                       
## AGEL          0.247 -0.645                                                
## FEMALE1      -0.164 -0.088  0.113                                         
## F508_1       -0.657 -0.002  0.176 -0.021                                  
## F508_2       -0.725  0.123  0.093 -0.062  0.788                           
## AGEL:FEMALE1  0.062 -0.005 -0.205 -0.262  0.004  0.011                    
## AGEL:F508_1   0.181 -0.004 -0.651  0.005 -0.267 -0.214 -0.022             
## AGEL:F508_2   0.179 -0.003 -0.649  0.011 -0.215 -0.267 -0.040   0.805     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -4.943751608 -0.527381631 -0.003483511  0.514482060  4.289862752 
## 
## Number of Observations: 1512
## Number of Groups: 200

#interval
intervals(lme3ai)

## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                    lower        est.       upper
## (Intercept)   90.6587280 103.8100307 116.9613334
## AGE           -2.5099818  -1.8543693  -1.1987567
## AGEL           0.2544408   1.2639358   2.2734308
## FEMALE1       -7.8981452  -1.1986722   5.5008009
## F508_1       -15.3420644  -4.2799329   6.7821987
## F508_2       -17.9016325  -6.7769514   4.3477297
## AGEL:FEMALE1  -1.3093003  -0.8196788  -0.3300573
## AGEL:F508_1   -1.3168406  -0.4879031   0.3410344
## AGEL:F508_2   -1.4780939  -0.6512016   0.1756908
## 
##  Random Effects:
##   Level: ID 
##                    lower     est.   upper
## sd((Intercept)) 20.37415 22.59066 25.0483
## 
##  Within-group standard error:
##    lower     est.    upper 
## 11.74081 12.19951 12.67613

3.a.i Describe in words the interpretation of the random intercept component (b1i) and report the estimate of the variance for this random effect.

###################################################
#Get estimated parameters of Cov(Yi)= ZiGZi' + Ri # ## Ri is not empty
###################################################
#Here is how  you get the parameters from G-matrix, random effects variance/covariance matrix
getVarCov(lme3ai)

## Random effects variance covariance matrix
##             (Intercept)
## (Intercept)      510.34
##   Standard Deviations: 22.591

#Here is how you get all parameters from G-matrix and R-matrix (remaining within-person residual, epsilon from Lecture 4)
VarCorr(lme3ai)

## ID = pdLogChol(1) 
##             Variance StdDev  
## (Intercept) 510.3377 22.59066
## Residual    148.8280 12.19951

#this extracts variance of random intercept and variance of epsilon, m1.var
m1.var <- as.numeric(VarCorr(lme3ai))[1:2] 
m1.var

## [1] 510.3377 148.8280

Each subject has an underlying level of mean response which persists across all repeated measurements. The response of the i’th subject at j’th occasion differs from the population mean by subject-level effects. it’s how much the intercepts vary across ID’s. The value is 510.3377

3.a.ii Output an estimated covariance matrix for \(Y_i\) \((\sum_i)\) for a child ID=100073 and describe the patterns in the variances and covariances and estimate a correlation. For ID=100073, show your calculations of how you can obtain variance of \(Y_{ij}\) at time point 2 and show how you obtain covariance of \(Y_{ij} = 1\) and \(Y_{ij} = 2\).

#$cov(Y_i=1)$ aka covariance matrix for child ID = 100073
est.cov.lme3ai <- extract.lme.cov(lme3ai) 
est.cov.lme3ai[1:6,1:6]

##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
## [1,] 659.1658 510.3377 510.3377 510.3377 510.3377 510.3377
## [2,] 510.3377 659.1658 510.3377 510.3377 510.3377 510.3377
## [3,] 510.3377 510.3377 659.1658 510.3377 510.3377 510.3377
## [4,] 510.3377 510.3377 510.3377 659.1658 510.3377 510.3377
## [5,] 510.3377 510.3377 510.3377 510.3377 659.1658 510.3377
## [6,] 510.3377 510.3377 510.3377 510.3377 510.3377 659.1658

apparently it’s also possible to do it this way?

# Extract random effects for all subjects
random_effects <- ranef(lme3ai)
# Get random effects for ID = 100073 (1st row, assuming ID = 100073 is the first ID in your dataset)
random_effect_id_100073 <- random_effects$ID[1, ]
print(random_effect_id_100073)

## NULL

# Get the variance-covariance components
var_corr <- VarCorr(lme3ai)
print(var_corr)

## ID = pdLogChol(1) 
##             Variance StdDev  
## (Intercept) 510.3377 22.59066
## Residual    148.8280 12.19951

# Variance-covariance matrix for the individual (ID = 100073) at two time points
# Assuming the random intercept covariance for each pair of time points is equal
cov_matrix_100073 <- matrix(c(659.1658, 510.3377, 510.3377, 659.1658), nrow = 2)
print(cov_matrix_100073)

##          [,1]     [,2]
## [1,] 659.1658 510.3377
## [2,] 510.3377 659.1658

The diagonals is the variance of the individual random effects. Off-diagonals are the covariance between random effects. The correlation is estimated with an ICC, value of 0.774.

3.a.iii Output empirical BLUPs and obtain their variance, then compare that variance to the estimate in Q3(a)(i). If they are different, why?

#obtain predicted values of lme3ai (BLUPs)
lme3ai.re <- as.data.frame(random.effects(lme3ai))
head(lme3ai.re)

#200 individual rows of data corresponding to unique ID's
dim(lme3ai.re)

## [1] 200   1

#check
length(unique(CFkids$ID))

## [1] 200

#Obtain the variance of the empirical BLUPs of lme3ai and note that it's smaller than variance of the random intercept from G matrix above
apply(lme3ai.re, 2, var)

## (Intercept) 
##    481.0597

getVarCov(lme3ai)

## Random effects variance covariance matrix
##             (Intercept)
## (Intercept)      510.34
##   Standard Deviations: 22.591

the variance of the empirical BLUPs of lme3ai are 481.06 which is smaller than the variance of the random intercept from the G matrix 510.3377438 because we might have violated the random effects of BLUPs assumption to be normally distributed. it’s highly affected by their distribution assumption and differences in the population random effects b might not be preserved in shrunken estimates \(\hat{b}\). shrunken estimates happen when we pull the random individual effects towards the population mean.

3.a.iv Using Cholesky-transformed residuals, plot the empirical semi-variogram and comment on whether the covariance model fits the data well or not.

#Transform the residuals based on the Cholesky decomposition of the covariance matrix for M3 (m3 above)
est.lme3ai <- extract.lme.cov(lme3ai)
cr.3 <- solve(t(chol(est.lme3ai))) %*% residuals(lme3ai, level = 0)
#level = 0 is marginal residuals

#Use Empirical variogram with transformed residuals to check Covariance Model fit: M3
source("Empirical or Sample Variogram Function.r") # this version allows to have a circle for the estimated g(u)
source("Semivariogram Function.r") #this version does not have a circle for the estimated g(u)
variogram(resid = cr.3[,1],
          timeVar = CFkids$AGE,
          id = CFkids$ID,
          irregular = TRUE,
          binwidth = 3)

## $bin.mids
## [1] 1.5 4.5 7.5
## 
## $bin.means
## [1] 0.7557705 1.0092820 1.0320851
## 
## $bin.sizes
## [1] 2590 1903  604
## 
## $process.var
## [1] 0.9951882

With the transformed residuals, we see a variogram that indicates the model fits the data not as well for the first four lag periods as it does for the last four.

3.b Fit a covariance pattern model with the Compound Symmetry for \(R_i\)

#Compound symmetry covariance
lme3b <- lme(FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL*FEMALE + 
               AGEL*F508_1 + AGEL*F508_2, data = CFkids, 
             random = ~ 1 | ID, corr = corCompSymm())
summary(lme3b)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC     BIC    logLik
##   12514.41 12578.2 -6245.207
## 
## Random effects:
##  Formula: ~1 | ID
##         (Intercept) Residual
## StdDev:    22.59066 12.19951
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##          Rho 
## 2.467162e-17 
## Fixed effects:  FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL * FEMALE +      AGEL * F508_1 + AGEL * F508_2 
##                  Value Std.Error   DF   t-value p-value
## (Intercept)  103.81003  6.703758 1307 15.485349  0.0000
## AGE           -1.85437  0.334193 1307 -5.548804  0.0000
## AGEL           1.26394  0.514581 1307  2.456243  0.0142
## FEMALE1       -1.19867  3.397055  196 -0.352856  0.7246
## F508_1        -4.27993  5.609198  196 -0.763020  0.4464
## F508_2        -6.77695  5.640915  196 -1.201392  0.2310
## AGEL:FEMALE1  -0.81968  0.249580 1307 -3.284231  0.0010
## AGEL:F508_1   -0.48790  0.422543 1307 -1.154682  0.2484
## AGEL:F508_2   -0.65120  0.421501 1307 -1.544959  0.1226
##  Correlation: 
##              (Intr) AGE    AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE          -0.630                                                       
## AGEL          0.247 -0.645                                                
## FEMALE1      -0.164 -0.088  0.113                                         
## F508_1       -0.657 -0.002  0.176 -0.021                                  
## F508_2       -0.725  0.123  0.093 -0.062  0.788                           
## AGEL:FEMALE1  0.062 -0.005 -0.205 -0.262  0.004  0.011                    
## AGEL:F508_1   0.181 -0.004 -0.651  0.005 -0.267 -0.214 -0.022             
## AGEL:F508_2   0.179 -0.003 -0.649  0.011 -0.215 -0.267 -0.040   0.805     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -4.943751608 -0.527381631 -0.003483511  0.514482060  4.289862752 
## 
## Number of Observations: 1512
## Number of Groups: 200

getVarCov(lme3b)

## Random effects variance covariance matrix
##             (Intercept)
## (Intercept)      510.34
##   Standard Deviations: 22.591

#function for variance covariance matrix
corandcov2 <- function(glsob,cov=T,...){
  corm <- corMatrix(glsob$modelStruct$corStruct)[[1]]
  corv<-print(getVarCov(glsob))
  return(corm)
}
corandcov2(lme3b)

## Random effects variance covariance matrix
##             (Intercept)
## (Intercept)      510.34
##   Standard Deviations: 22.591

##              [,1]         [,2]         [,3]         [,4]         [,5]
## [1,] 1.000000e+00 2.467162e-17 2.467162e-17 2.467162e-17 2.467162e-17
## [2,] 2.467162e-17 1.000000e+00 2.467162e-17 2.467162e-17 2.467162e-17
## [3,] 2.467162e-17 2.467162e-17 1.000000e+00 2.467162e-17 2.467162e-17
## [4,] 2.467162e-17 2.467162e-17 2.467162e-17 1.000000e+00 2.467162e-17
## [5,] 2.467162e-17 2.467162e-17 2.467162e-17 2.467162e-17 1.000000e+00
## [6,] 2.467162e-17 2.467162e-17 2.467162e-17 2.467162e-17 2.467162e-17
##              [,6]
## [1,] 2.467162e-17
## [2,] 2.467162e-17
## [3,] 2.467162e-17
## [4,] 2.467162e-17
## [5,] 2.467162e-17
## [6,] 1.000000e+00

3.c When looking at the estimates of the fixed effects and standard errors from Q3(a) and Q3(b), what do you notice? Explain the results. Hint: write-out the models for Q3(a) and Q3(b) – for the mean response and for the variance/covariance matrices.

summary(lme3ai)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC      BIC    logLik
##   12512.41 12570.88 -6245.207
## 
## Random effects:
##  Formula: ~1 | ID
##         (Intercept) Residual
## StdDev:    22.59066 12.19951
## 
## Fixed effects:  FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL * FEMALE +      AGEL * F508_1 + AGEL * F508_2 
##                  Value Std.Error   DF   t-value p-value
## (Intercept)  103.81003  6.703758 1307 15.485349  0.0000
## AGE           -1.85437  0.334193 1307 -5.548804  0.0000
## AGEL           1.26394  0.514581 1307  2.456243  0.0142
## FEMALE1       -1.19867  3.397055  196 -0.352856  0.7246
## F508_1        -4.27993  5.609198  196 -0.763020  0.4464
## F508_2        -6.77695  5.640915  196 -1.201392  0.2310
## AGEL:FEMALE1  -0.81968  0.249580 1307 -3.284231  0.0010
## AGEL:F508_1   -0.48790  0.422543 1307 -1.154682  0.2484
## AGEL:F508_2   -0.65120  0.421501 1307 -1.544959  0.1226
##  Correlation: 
##              (Intr) AGE    AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE          -0.630                                                       
## AGEL          0.247 -0.645                                                
## FEMALE1      -0.164 -0.088  0.113                                         
## F508_1       -0.657 -0.002  0.176 -0.021                                  
## F508_2       -0.725  0.123  0.093 -0.062  0.788                           
## AGEL:FEMALE1  0.062 -0.005 -0.205 -0.262  0.004  0.011                    
## AGEL:F508_1   0.181 -0.004 -0.651  0.005 -0.267 -0.214 -0.022             
## AGEL:F508_2   0.179 -0.003 -0.649  0.011 -0.215 -0.267 -0.040   0.805     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -4.943751608 -0.527381631 -0.003483511  0.514482060  4.289862752 
## 
## Number of Observations: 1512
## Number of Groups: 200

summary(lme3b)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC     BIC    logLik
##   12514.41 12578.2 -6245.207
## 
## Random effects:
##  Formula: ~1 | ID
##         (Intercept) Residual
## StdDev:    22.59066 12.19951
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##          Rho 
## 2.467162e-17 
## Fixed effects:  FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL * FEMALE +      AGEL * F508_1 + AGEL * F508_2 
##                  Value Std.Error   DF   t-value p-value
## (Intercept)  103.81003  6.703758 1307 15.485349  0.0000
## AGE           -1.85437  0.334193 1307 -5.548804  0.0000
## AGEL           1.26394  0.514581 1307  2.456243  0.0142
## FEMALE1       -1.19867  3.397055  196 -0.352856  0.7246
## F508_1        -4.27993  5.609198  196 -0.763020  0.4464
## F508_2        -6.77695  5.640915  196 -1.201392  0.2310
## AGEL:FEMALE1  -0.81968  0.249580 1307 -3.284231  0.0010
## AGEL:F508_1   -0.48790  0.422543 1307 -1.154682  0.2484
## AGEL:F508_2   -0.65120  0.421501 1307 -1.544959  0.1226
##  Correlation: 
##              (Intr) AGE    AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE          -0.630                                                       
## AGEL          0.247 -0.645                                                
## FEMALE1      -0.164 -0.088  0.113                                         
## F508_1       -0.657 -0.002  0.176 -0.021                                  
## F508_2       -0.725  0.123  0.093 -0.062  0.788                           
## AGEL:FEMALE1  0.062 -0.005 -0.205 -0.262  0.004  0.011                    
## AGEL:F508_1   0.181 -0.004 -0.651  0.005 -0.267 -0.214 -0.022             
## AGEL:F508_2   0.179 -0.003 -0.649  0.011 -0.215 -0.267 -0.040   0.805     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -4.943751608 -0.527381631 -0.003483511  0.514482060  4.289862752 
## 
## Number of Observations: 1512
## Number of Groups: 200

I essentially get the same results for fixed effect estimates with a random intercept LME mean response profile, with and without using a Compound Symmetry Matrix for \(R_i\).

Question 4

Consider the following maximal model (as in Q3): a fixed intercept \(X_1\), age at first follow up \(X_2\), longitudinal age \(X_3\), female sex \(X_4\), F508_1 \(X_5\), F508_2 \(X_6\), and 3 interactions: longitudinal age by female sex \(X_3*X_4\), longitudinal age by F508_1 \(X_3*X_5\), and longitudinal age by F508_2 \(X_3*X_6\):

4.a. Fit a model for the random intercept and random slope for the longitudinal age:

lme4a <- lme(FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL*FEMALE + 
               AGEL*F508_1 + AGEL*F508_2, data = CFkids, 
             random = ~ AGEL | ID)
summary(lme4a)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC      BIC    logLik
##   12407.24 12476.34 -6190.619
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev   Corr  
## (Intercept) 22.61567 (Intr)
## AGEL         2.12688 -0.157
## Residual    10.86333       
## 
## Fixed effects:  FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL * FEMALE +      AGEL * F508_1 + AGEL * F508_2 
##                  Value Std.Error   DF   t-value p-value
## (Intercept)  104.52800  6.639494 1307 15.743369  0.0000
## AGE           -1.90971  0.330871 1307 -5.771758  0.0000
## AGEL           1.30297  0.676545 1307  1.925924  0.0543
## FEMALE1       -1.34560  3.367545  196 -0.399578  0.6899
## F508_1        -4.23639  5.558928  196 -0.762088  0.4469
## F508_2        -6.69784  5.589911  196 -1.198202  0.2323
## AGEL:FEMALE1  -0.75318  0.381600 1307 -1.973744  0.0486
## AGEL:F508_1   -0.50057  0.636207 1307 -0.786801  0.4315
## AGEL:F508_2   -0.73638  0.635019 1307 -1.159619  0.2464
##  Correlation: 
##              (Intr) AGE    AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE          -0.630                                                       
## AGEL          0.124 -0.487                                                
## FEMALE1      -0.164 -0.088  0.108                                         
## F508_1       -0.657 -0.002  0.201 -0.021                                  
## F508_2       -0.725  0.122  0.137 -0.062  0.788                           
## AGEL:FEMALE1  0.060 -0.002 -0.243 -0.264  0.005  0.012                    
## AGEL:F508_1   0.178 -0.001 -0.741  0.005 -0.268 -0.213 -0.023             
## AGEL:F508_2   0.177  0.000 -0.737  0.012 -0.214 -0.267 -0.046   0.797     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.328886495 -0.461215917  0.004543766  0.477953391  4.320082030 
## 
## Number of Observations: 1512
## Number of Groups: 200

4.a.i Describe in words the interpretation of the random intercept component (b1i), the random slope component (b3i); and Output the variance/covariance matrix for the random effects (G matrix) and its correlation matrix. Explain in words the meaning of the correlation estimate between random intercept and slope.

\(b_{1i}\) is subject i’s mean response on average below the population mean by this amount. \(b_{3i}\) is subject i’s underlying level of mean response which depends on the level (time) \(X_{ij2}\)

###################################################
#Get estimated parameters of Cov(Yi)= ZiGZi' + Ri # ## Ri is not empty
###################################################
#Here is how  you get the parameters from G-matrix, random effects variance/covariance matrix
getVarCov(lme4a)

## Random effects variance covariance matrix
##             (Intercept)    AGEL
## (Intercept)    511.4700 -7.5506
## AGEL            -7.5506  4.5236
##   Standard Deviations: 22.616 2.1269

Above is the G matrix.

#Here is how you get all parameters from G-matrix and R-matrix (remaining within-person residual, epsilon from Lecture 4)
VarCorr(lme4a)

## ID = pdLogChol(AGEL) 
##             Variance   StdDev   Corr  
## (Intercept) 511.468338 22.61567 (Intr)
## AGEL          4.523619  2.12688 -0.157
## Residual    118.012003 10.86333

#this extracts variance of random intercept and variance of epsilon
lme4a.var <- as.numeric(VarCorr(lme4a))[1:2]

## Warning: NAs introduced by coercion

lme4a.var

## [1] 511.468338   4.523619

4.a.ii Output an estimated covariance matrix for \(Y_i\) \(\sum_i\) for a child ID=100073 and describe the patterns in the variances and covariances. How does this covariance matrix compare to the one in Q3(a)(ii)? For ID=100073, show your calculations of how you can obtain variance of \(Y_{ij}\) at time point 2 and show how you obtain covariance of \(Y_{ij} = 1\) and \(Y_{ij} = 2\).

\(cov(Y_i=1)\) aka covariance matrix for child ID = 100073

#previous model with no random slope
#$cov(Y_i=1)$ aka covariance matrix for child ID = 100073
est.cov.lme3ai <- extract.lme.cov(lme3ai) 
est.cov.lme3ai[1:6,1:6]

##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
## [1,] 659.1658 510.3377 510.3377 510.3377 510.3377 510.3377
## [2,] 510.3377 659.1658 510.3377 510.3377 510.3377 510.3377
## [3,] 510.3377 510.3377 659.1658 510.3377 510.3377 510.3377
## [4,] 510.3377 510.3377 510.3377 659.1658 510.3377 510.3377
## [5,] 510.3377 510.3377 510.3377 510.3377 659.1658 510.3377
## [6,] 510.3377 510.3377 510.3377 510.3377 510.3377 659.1658

#model with random intercept and random slope
#$cov(Y_i=1)$ aka covariance matrix for child ID = 100073
est.cov.lme4a <- extract.lme.cov(lme4a)
est.cov.lme4a[1:6, 1:6]

##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
## [1,] 624.9774 500.9000 496.3419 485.5004 479.5863 472.8550
## [2,] 500.9000 617.3883 498.2313 495.5078 494.0221 492.3311
## [3,] 496.3419 498.2313 617.6631 503.0283 504.8706 506.9674
## [4,] 485.5004 495.5078 503.0283 638.9278 530.6735 541.7796
## [5,] 479.5863 494.0221 504.8706 530.6735 662.7612 560.7698
## [6,] 472.8550 492.3311 506.9674 541.7796 560.7698 700.3960

With the covariance matrix model with a random intercept and random slope, we see different variances for the random intercept at different time points (diagonal). With just a random intercept, it’s the same across time points. variance for random intercept and random slope is represented in this diagonal. but not in the just random intercept covariance matrix.

4.a.iii Output empirical BLUPs and obtain their variances, then compare these variances to the estimates in Q4(a)(i). If there are differences, why?

#obtain predicted values of lme3ai (BLUPs)
lme4a.re <- as.data.frame(random.effects(lme4a))
head(lme4a.re)

#200 individual rows of data corresponding to unique ID's
dim(lme4a.re)

## [1] 200   2

#check
length(unique(CFkids$ID))

## [1] 200

#Obtain the variance of the empirical BLUPs of lme3ai and note that it's smaller than variance of the random intercept from G matrix above
apply(lme4a.re, 2, var)

## (Intercept)        AGEL 
##  466.607428    2.794096

getVarCov(lme4a)

## Random effects variance covariance matrix
##             (Intercept)    AGEL
## (Intercept)    511.4700 -7.5506
## AGEL            -7.5506  4.5236
##   Standard Deviations: 22.616 2.1269

the variance of the empirical BLUPs of lme4a are 466.61 which is smaller than the variance of the random intercept from the G matrix 511.4683382 because we might have violated the random effects of BLUPs assumption to be normally distributed. it’s highly affected by their distribution assumption and differences in the population random effects b might not be preserved in shrunken estimates \(\hat{b}\). shrunken estimates happen when we pull the random individual effects towards the population mean. We actually do see the variance increase across time points for one individual ID

4.a.iv Using Cholesky-transformed residuals, plot the empirical semi-variogram and comment on whether the covariance model fits the data well or not.

#Transform the residuals based on the Cholesky decomposition of the covariance matrix
est.lme4a <- extract.lme.cov(lme4a)
cr.4 <- solve(t(chol(est.lme4a))) %*% residuals(lme4a, level = 0)
#level = 0 is marginal residuals

#Use Empirical variogram with transformed residuals to check Covariance Model fit: M3
source("Empirical or Sample Variogram Function.r") # this version allows to have a circle for the estimated g(u)
source("Semivariogram Function.r") #this version does not have a circle for the estimated g(u)
variogram(resid = cr.4[,1],
          timeVar = CFkids$AGE,
          id = CFkids$ID,
          irregular = TRUE,
          binwidth = 3) #is this the correct bin width?

## $bin.mids
## [1] 1.5 4.5 7.5
## 
## $bin.means
## [1] 0.9231832 1.1018818 0.9366320
## 
## $bin.sizes
## [1] 2590 1903  604
## 
## $process.var
## [1] 0.9947101

4.b Compare the covariance model that uses only a random intercept (from Q3) to the covariance model in Q4 (random intercept and slope). Conduct appropriate hypothesis test and show your work (e.g., which parameters are being tested and what test are you using?). Which covariance model do you choose?

lme4b <- lme(FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL*FEMALE + 
               AGEL*F508_1 + AGEL*F508_2, data = CFkids, 
             random = ~ AGEL | ID, correlation = corCompSymm())

summary(lme3b)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC     BIC    logLik
##   12514.41 12578.2 -6245.207
## 
## Random effects:
##  Formula: ~1 | ID
##         (Intercept) Residual
## StdDev:    22.59066 12.19951
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##          Rho 
## 2.467162e-17 
## Fixed effects:  FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL * FEMALE +      AGEL * F508_1 + AGEL * F508_2 
##                  Value Std.Error   DF   t-value p-value
## (Intercept)  103.81003  6.703758 1307 15.485349  0.0000
## AGE           -1.85437  0.334193 1307 -5.548804  0.0000
## AGEL           1.26394  0.514581 1307  2.456243  0.0142
## FEMALE1       -1.19867  3.397055  196 -0.352856  0.7246
## F508_1        -4.27993  5.609198  196 -0.763020  0.4464
## F508_2        -6.77695  5.640915  196 -1.201392  0.2310
## AGEL:FEMALE1  -0.81968  0.249580 1307 -3.284231  0.0010
## AGEL:F508_1   -0.48790  0.422543 1307 -1.154682  0.2484
## AGEL:F508_2   -0.65120  0.421501 1307 -1.544959  0.1226
##  Correlation: 
##              (Intr) AGE    AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE          -0.630                                                       
## AGEL          0.247 -0.645                                                
## FEMALE1      -0.164 -0.088  0.113                                         
## F508_1       -0.657 -0.002  0.176 -0.021                                  
## F508_2       -0.725  0.123  0.093 -0.062  0.788                           
## AGEL:FEMALE1  0.062 -0.005 -0.205 -0.262  0.004  0.011                    
## AGEL:F508_1   0.181 -0.004 -0.651  0.005 -0.267 -0.214 -0.022             
## AGEL:F508_2   0.179 -0.003 -0.649  0.011 -0.215 -0.267 -0.040   0.805     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -4.943751608 -0.527381631 -0.003483511  0.514482060  4.289862752 
## 
## Number of Observations: 1512
## Number of Groups: 200

summary(lme4b)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC      BIC    logLik
##   12409.24 12483.65 -6190.619
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 22.615602 (Intr)
## AGEL         2.126879 -0.157
## Residual    10.863341       
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##          Rho 
## 3.913566e-07 
## Fixed effects:  FEV1 ~ AGE + AGEL + FEMALE + F508_1 + F508_2 + AGEL * FEMALE +      AGEL * F508_1 + AGEL * F508_2 
##                  Value Std.Error   DF   t-value p-value
## (Intercept)  104.52800  6.639477 1307 15.743409  0.0000
## AGE           -1.90971  0.330870 1307 -5.771774  0.0000
## AGEL           1.30297  0.676545 1307  1.925926  0.0543
## FEMALE1       -1.34560  3.367537  196 -0.399579  0.6899
## F508_1        -4.23639  5.558915  196 -0.762090  0.4469
## F508_2        -6.69784  5.589897  196 -1.198205  0.2323
## AGEL:FEMALE1  -0.75318  0.381600 1307 -1.973745  0.0486
## AGEL:F508_1   -0.50057  0.636207 1307 -0.786801  0.4315
## AGEL:F508_2   -0.73638  0.635018 1307 -1.159619  0.2464
##  Correlation: 
##              (Intr) AGE    AGEL   FEMALE F508_1 F508_2 AGEL:FE AGEL:F508_1
## AGE          -0.630                                                       
## AGEL          0.124 -0.487                                                
## FEMALE1      -0.164 -0.088  0.108                                         
## F508_1       -0.657 -0.002  0.201 -0.021                                  
## F508_2       -0.725  0.122  0.137 -0.062  0.788                           
## AGEL:FEMALE1  0.060 -0.002 -0.243 -0.264  0.005  0.012                    
## AGEL:F508_1   0.178 -0.001 -0.741  0.005 -0.268 -0.213 -0.023             
## AGEL:F508_2   0.177  0.000 -0.737  0.012 -0.214 -0.267 -0.046   0.797     
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.328882375 -0.461215548  0.004543758  0.477952924  4.320078443 
## 
## Number of Observations: 1512
## Number of Groups: 200

#anova test, LRT. significant p-value.
anova(lme3b, lme4b)

The ANOVA test returned a significant p-value. Our \(H_0\) is that there’s no significant difference between the mean response models. We can reject the \(H_0\) at 99% stat. sig. and determine that the random intercept and random slope model is a better fit, by analysis of the semi-variogram and significant p-value and significant LRT significance.

Question 5

Starting with the model for the mean response that includes only the main effects: a fixed intercept \(X_1\), {variable AGE or AGE0 and AGEL, depending upon (a) below}, female sex \(X_4\), F508_1 \(X_5\), F508_2 \(X_6\), and the best covariance model using random effect(s) specification (from Q4(b)), refine the model for the mean response (use a two-sided alpha of 0.05 for the hypothesis tests):

5.a Test whether having two fixed effects for age (cross-sectional and longitudinal) is needed, or whether just a cohort-averaged age effect is sufficient. What are the null and alternative hypotheses? Conduct your test. Explain your finding in words.

#model for the mean response including only main effects
lme5ai <- lme(FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2, data = CFkids, random = ~ AGEL | ID, correlation = corCompSymm())
summary(lme5ai)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC   BIC    logLik
##   12409.51 12468 -6193.757
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 22.641580 (Intr)
## AGEL         2.154127 -0.163
## Residual    10.860488       
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##          Rho 
## 3.261304e-07 
## Fixed effects:  FEV1 ~ AGE0 + AGEL + FEMALE + F508_1 + F508_2 
##                 Value Std.Error   DF   t-value p-value
## (Intercept) 106.70478  6.506701 1311 16.399212  0.0000
## AGE0         -1.91056  0.330840  195 -5.774871  0.0000
## AGEL         -1.52151  0.191990 1311 -7.924935  0.0000
## FEMALE1      -3.11445  3.247832  195 -0.958932  0.3388
## F508_1       -5.43078  5.355128  195 -1.014126  0.3118
## F508_2       -8.44417  5.386739  195 -1.567586  0.1186
##  Correlation: 
##         (Intr) AGE0   AGEL   FEMALE F508_1
## AGE0    -0.642                            
## AGEL    -0.071  0.002                     
## FEMALE1 -0.157 -0.092  0.001              
## F508_1  -0.645 -0.002  0.001 -0.021       
## F508_2  -0.717  0.127  0.000 -0.064  0.787
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.335210875 -0.461128316  0.001904207  0.473962719  4.325937447 
## 
## Number of Observations: 1512
## Number of Groups: 200

#model for the mean response including only main effects
lme5ai2 <- lme(FEV1 ~ AGE + FEMALE + F508_1 + F508_2, data = CFkids, 
              random = ~ AGEL | ID, correlation = corCompSymm())
summary(lme5ai2)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC      BIC   logLik
##   12408.46 12461.64 -6194.23
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev   Corr  
## (Intercept) 22.67918 (Intr)
## AGEL         2.15525 -0.17 
## Residual    10.85981       
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##          Rho 
## 2.609043e-07 
## Fixed effects:  FEV1 ~ AGE + FEMALE + F508_1 + F508_2 
##                 Value Std.Error   DF   t-value p-value
## (Intercept) 103.29444  5.561215 1311 18.574079  0.0000
## AGE          -1.61874  0.166214 1311 -9.738923  0.0000
## FEMALE1      -3.41313  3.237801  196 -1.054152  0.2931
## F508_1       -5.46641  5.355726  196 -1.020667  0.3087
## F508_2       -7.87262  5.354832  196 -1.470191  0.1431
##  Correlation: 
##         (Intr) AGE    FEMALE F508_1
## AGE     -0.450                     
## FEMALE1 -0.233 -0.045              
## F508_1  -0.756  0.000 -0.022       
## F508_2  -0.777  0.064 -0.055  0.792
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.337235968 -0.462601932  0.002416518  0.480023824  4.325647154 
## 
## Number of Observations: 1512
## Number of Groups: 200

anova(lme5ai, lme5ai2)

the null hypothesis is that the mean response profile for both models fits the data equally as well. We cam run an ANOVA test here to compare AIC, log likelihood and the likelihood ratio test between the two and find that using the full model is not a significant difference relative to the model with just one fixed effect for age AGE

5.b After deciding whether to keep two fixed effects for age (variables AGE0 and AGEL) or only one fixed effect for age (variable AGE), test whether there is a significant contribution of genotype to the change in FEV1 over time. What is your Null/Alternative hypotheses? What test are you using? Explain your finding in words.

#model for the mean response including only one fixed effect for age (variable AGE)
lme5b <- lme(FEV1 ~ AGE + FEMALE + F508_1 + F508_2 + AGE*F508_1 + AGE*F508_2, 
             data = CFkids, random = ~ AGEL | ID, correlation = corCompSymm())
summary(lme5b)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC      BIC    logLik
##   12408.45 12472.25 -6192.227
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 22.532724 (Intr)
## AGEL         2.160207 -0.163
## Residual    10.857959       
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##          Rho 
## 3.913566e-07 
## Fixed effects:  FEV1 ~ AGE + FEMALE + F508_1 + F508_2 + AGE * F508_1 + AGE *      F508_2 
##                Value Std.Error   DF   t-value p-value
## (Intercept) 92.66794  9.238743 1309 10.030362  0.0000
## AGE         -0.94626  0.502116 1309 -1.884551  0.0597
## FEMALE1     -2.84219  3.235087  196 -0.878552  0.3807
## F508_1       2.87834 10.098630  196  0.285023  0.7759
## F508_2       6.82981  9.976826  196  0.684567  0.4944
## AGE:F508_1  -0.54244  0.557943 1309 -0.972218  0.3311
## AGE:F508_2  -0.99881  0.564787 1309 -1.768469  0.0772
##  Correlation: 
##            (Intr) AGE    FEMALE F508_1 F508_2 AGE:F508_1
## AGE        -0.845                                       
## FEMALE1    -0.201  0.058                                
## F508_1     -0.886  0.764  0.041                         
## F508_2     -0.897  0.774  0.043  0.815                  
## AGE:F508_1  0.762 -0.900 -0.061 -0.849 -0.697           
## AGE:F508_2  0.758 -0.891 -0.087 -0.681 -0.845  0.803    
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.348727665 -0.464536654  0.002369871  0.468583434  4.335858738 
## 
## Number of Observations: 1512
## Number of Groups: 200

We’re testing to see if the interaction covariate between AGE and F508_1 or F508_2 is 0 and statistically significant. The null hypothesis is that they are 0, and we reject at p-value < .01. We fail to reject the interaction AGE:F508_1 but reject the null hypothesis for interacting AGE:F508_2. So we accept that there’s a significant contribution of the genotype and age over time.

5.c After deciding whether to keep two fixed effects for age (variables AGE0 and AGEL) or only one fixed effect for age (variable AGE), test whether there is a significant contribution of gender to the change in FEV1 over time. What is your Null/Alternative hypotheses? What test are you using? Explain your finding in words.

lme5c <- lme(FEV1 ~ AGE + FEMALE + F508_1 + F508_2 + AGE*F508_2 + 
               FEMALE*F508_1 + FEMALE*F508_2, 
             data = CFkids, random = ~ AGEL | ID, correlation = corCompSymm())
summary(lme5c)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC      BIC    logLik
##   12398.78 12467.88 -6186.389
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 22.617636 (Intr)
## AGEL         2.160168 -0.169
## Residual    10.857799       
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##          Rho 
## 3.913566e-07 
## Fixed effects:  FEV1 ~ AGE + FEMALE + F508_1 + F508_2 + AGE * F508_2 + FEMALE *      F508_1 + FEMALE * F508_2 
##                    Value Std.Error   DF   t-value p-value
## (Intercept)     95.88568  7.323564 1310 13.092762  0.0000
## AGE             -1.36796  0.219213 1310 -6.240317  0.0000
## FEMALE1          4.70236  9.609632  194  0.489338  0.6252
## F508_1          -0.62322  7.169077  194 -0.086931  0.9308
## F508_2           3.22455  8.702165  194  0.370545  0.7114
## AGE:F508_2      -0.58084  0.338083 1310 -1.718042  0.0860
## FEMALE1:F508_1 -10.89901 10.791562  194 -1.009957  0.3138
## FEMALE1:F508_2  -6.68819 10.758350  194 -0.621674  0.5349
##  Correlation: 
##                (Intr) AGE    FEMALE1 F508_1 F508_2 AGE:F5 FEMALE1:F508_1
## AGE            -0.504                                                   
## FEMALE1        -0.606  0.075                                            
## F508_1         -0.788  0.052  0.584                                     
## F508_2         -0.842  0.424  0.510   0.663                             
## AGE:F508_2      0.327 -0.648 -0.048  -0.034 -0.552                      
## FEMALE1:F508_1  0.546 -0.080 -0.891  -0.667 -0.460  0.052               
## FEMALE1:F508_2  0.541 -0.067 -0.893  -0.522 -0.566  0.004  0.796        
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.347731243 -0.466380329  0.001641932  0.467487594  4.336608244 
## 
## Number of Observations: 1512
## Number of Groups: 200

We’re testing to see if the interaction covariate between FEMALE and F508_1 or F508_2 is 0 and statistically significant. The null hypothesis is that they are 0, and we reject that hypothesis at p-value < .01. We reject this hypothesis for both interactions. So we do not accept that there’s a significant contribution of the gender and genotype interaction on FEV1 over time.

5.d After going through Q5(a,b,c), you should have built a model for the mean response that includes an intercept, an effect of age (either one variable or two variables), as well as the main effects of gender and genotype (keep these two variables in the model because we want to have the mean responses adjusted for these variables), and any significant interaction effects from Q5(b,c): 5.d.i Conduct appropriate residual diagnostics for the mean response model. Histogram of the transformed residuals

#Create a data frame with variables needed for residual analysis
#Transform the residuals based on the Cholesky decomposition of the covariance matrix for M4 (m4 above)
est.cov5 <- extract.lme.cov(lme5b)
cr.5 <- solve(t(chol(est.cov5))) %*% residuals(lme5b, level = 0) #level = 0 is marginal residuals
pred4 <- data.frame(id = CFkids$ID,
                    F508_1 = CFkids$F508_1,
                    F508_2 = CFkids$F508_2,
                    age = CFkids$AGE,
                    pred = fitted(lme5b, level = 0), 
                    resid = cr.5[,1])
#histogram
plotq5 <- ggplot(pred4, aes(x = resid))
plotq5 + 
  geom_histogram(aes(y = ..density..), 
                 color = "black", 
                 fill = "lightblue") +
  geom_density() + 
  labs(title = "Distribution of Transformed Residuals: (LME) with Random 
       Intercept and Slope Covariance", 
       x = "Transformed Residual")

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We’re looking to make sure the residuals after transforming to a triangular matrix are normally distributed, which they are.

Transformed residuals by predicted marginal mean response

#Scatter plot of transformed residual against predicted mean with LOESS smoothed curve for mean response in M3
plotq5.2 <- ggplot(pred4, aes(x = pred, y = resid))
plotq5.2 + 
  geom_point() + 
  geom_hline(yintercept = 0,
             colour = "red") + 
  geom_smooth(method = "loess", 
              se = F) +
  labs(title = "Scatter Plot of Transformed Residual against Predicted Mean: 
       (LME) with Random Intercept and Random Slope Covariance",
       x = "Predicted Mean", 
       y = "Transformed Residual")

## `geom_smooth()` using formula = 'y ~ x'

The transformed residuals for each predicted mean are generally distributed around 0. There’s a slight deviance of the losses curve from a flat line, which might be concerning that the model’s not a good fit.

Transformed residuals by the effect of time (AGE or AGEL, depending upon your answer to Q5(a))

#Scatter plot of transformed residual against time (age) with LOESS smoothed curve for mean response in M3
plotq5.3 <- ggplot(pred4, aes(x = age, y = resid))
plotq5.3 + 
  geom_point() + 
  geom_hline(yintercept = 0,
             colour = "red") + 
  geom_smooth(method = "loess", 
              se = F) +
  labs(title = "Scatter Plot of Transformed Residual against Time (age): 
       (LME) with Random Intercept and Random Slope Covariance",
       x = "Time (age)", 
       y = "Transformed Residual")

## `geom_smooth()` using formula = 'y ~ x'

The transformed residuals for each age are generally distributed around 0. There’s a slight deviance of the losses curve from a flat line, which might be concerning that the model’s not a good fit.

5.e Write out the equations and describe your final model for the mean response (mean FEV1) over time, as well as the final covariance model. The answer to Q5(e) should be summarized in 2-3 sentences and is the conclusion for your analyses – imagine that you are writing a conclusion in an abstract.

lme5c <- lme(FEV1 ~ AGE + FEMALE + F508_1 + F508_2 + AGE*F508_2, 
             data = CFkids, random = ~ AGEL | ID, correlation = corCompSymm())
summary(lme5c)

## Linear mixed-effects model fit by REML
##   Data: CFkids 
##        AIC      BIC    logLik
##   12408.07 12466.56 -6193.033
## 
## Random effects:
##  Formula: ~AGEL | ID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 22.559797 (Intr)
## AGEL         2.159788 -0.168
## Residual    10.857838       
## 
## Correlation Structure: Compound symmetry
##  Formula: ~1 | ID 
##  Parameter estimate(s):
##          Rho 
## 1.956782e-07 
## Fixed effects:  FEV1 ~ AGE + FEMALE + F508_1 + F508_2 + AGE * F508_2 
##                Value Std.Error   DF   t-value p-value
## (Intercept) 99.53304  5.986524 1310 16.626182  0.0000
## AGE         -1.38544  0.218356 1310 -6.344897  0.0000
## FEMALE1     -3.06228  3.229095  196 -0.948339  0.3441
## F508_1      -5.47712  5.330834  196 -1.027441  0.3055
## F508_2       0.06047  7.162653  196  0.008442  0.9933
## AGE:F508_2  -0.55763  0.336787 1310 -1.655728  0.0980
##  Correlation: 
##            (Intr) AGE    FEMALE F508_1 F508_2
## AGE        -0.564                            
## FEMALE1    -0.238  0.006                     
## F508_1     -0.698 -0.002 -0.022              
## F508_2     -0.789  0.470  0.001  0.588       
## AGE:F508_2  0.379 -0.649 -0.063  0.002 -0.668
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -5.349062815 -0.467934923  0.001728418  0.466689756  4.335120073 
## 
## Number of Observations: 1512
## Number of Groups: 200

My final model looks like this: \[Y_{ij} = (\beta_1 + b_{1i}) + (\beta_2+b_{2i})t_{ij} + ... + \beta_pX_{ijp}+\epsilon_{ij}\] where \((\beta_1+b_{1i})\) is the random intercept of the MLE mean response profile, \((\beta_2+b_2i)t_{ij}\) for every covariate AGE, FEMALE, F508_1 F508_2 interaction AGE:F508_2, and \(\epsilon_{ij}\) measurment error. We model our mean response FEV1 with these covariates and determine the covariance between subjects’ different time points using a Compound Symmetry model, which allows for random intercept and random slope freedom, so each variance between time points can be assessed separately. Based on the residual diagnostics, the model fits the data moderatly well, though it’s suspected it could be a better fit if we separated out different cohorts with AGE_CAT.