Purpose: To provide an introduction to the joint parametric modeling of the mean response and covariance for longitudinal data.

Instructions:

  1. For each question requiring data analysis, support your conclusions by including only the relevant SAS/R output in your answer.

  2. Include your SAS/R program (but not your SAS output) as an appendix to your solutions.

Joint Modeling of Mean and Covariance in a Study on the Effect of Ozone on Pulmonary Function

In a study designed to examine the acute responses of normal subjects to ozone exposure, researchers randomized 20 subjects to exposure to either room air or 0.12 ppb ozone. The exposure lasted six hours. A baseline measure and 6 additional measures of FEV1 (a measure of pulmonary function), made at hourly intervals, were recorded for all study participants. Subjects cycled on an exercise bike before each measure of FEV1 was obtained. The investigators were interested in determining whether changes in pulmonary function during the 6 hours of exposure differed between the ozone and room air exposed groups.

In the analyses of the data from this study, the response variable of interest was FEV1 (ml)/100. From here on out when we refer to FEV1 we mean this rescaled response, i.e., FEV1 (ml)/100. The measurement times were coded 0-6, with time=0 for the baseline measurement, time=1 for the measurement at hour 1,…,time=6 for the measurement at hour 6. For this homework, we will consider the data from hours 0, 2, 4, 6. The two exposure groups were coded 0 and 1, with group=1 denoting exposure to ozone and group=0 denoting exposure to room air.

The data are in the file “ozone0246.dat” on the course web page. Each row of the data set contains the following four variables: subject ID, hour, group and the y = FEV1/100 measurements, respectively (note it is already in long format).


1. (10 pts) Descriptive Analyses:
  a) Describe key aspects of the longitudinal design and completeness of data.

There are 20 subjects randomized to either room air or ozone whose FEV1 (ml)/100 is measured at 4 equally spaced time points. All 20 subjects have 4 measurements at the same time points: baseline, 2, 4, and 6 hours. The dataset is balanced and complete.

  b) Plot the FEV1 response against hour for each subject exposed to room air (all on the first plot) and for each subject exposed to ozone (all on a second plot). Comment on any patterns in the data or other notable aspects of the data.

Subjects randomized to the room air exposure have FEV1 measurements ranging from 37 to 47 at baseline and from 38 to 49 at 6 hours. There is more variability between subjects than within subjects. Overall there is a slight, gradual increasing trend in FEV1 across the time points. Some subjects show sharp changes between measurements while others show only small, gradual changes.

Subjects randomized to the ozone exposure have baseline FEV1 measurements ranging from 36 to 46. The FEV1 measurement at 6 hours ranges from 23 to 45. There is greater variability within subjects than between subjects. Overall there is a decreasing trend in FEV1 across time. The rate of decrease differs between subjects; some show sharp changes while others change gradually.

Both groups have similar baseline FEV1 measurements. However, FEV1 changes are greater for subjects in the ozone exposure group, along with greater between-subject variability.

  c) Obtain the mean FEV1 value at each hour of measurement for ozone and room air subjects separately. Plot the means against hour. Comment on the pattern of change in mean FEV1 with hour for ozone and room air subjects.

Mean Response Profile
Time   Group   Observations    Mean         Var
  0      0          10        42.383   10.061979
  0      1          10        42.617    8.597179
  2      0          10        41.868    7.859729
  2      1          10        41.009   11.547299
  4      0          10        42.515    9.356672
  4      1          10        39.600   37.561489
  6      0          10        43.124   11.709160
  6      1          10        37.211   61.898943

The mean baseline FEV1 values for the two exposure groups are similar. At each post-baseline time point, the mean FEV1 for the room air exposure group is higher than the mean FEV1 for the ozone exposure group. The mean trends for the two groups move in different directions and at different rates; they are non-parallel. From the 2 hour time point to the 6 hour time point, the variance in the ozone group is greater than in the room air group.

Sample Covariance
         t0          t2          t4          t6
t0    8.852958    7.570342   10.384595   10.243174
t2    7.570342    9.386982   12.959549   15.453607
t4   10.384595   12.959549   24.46050    27.52209
t6   10.243174   15.453607   27.52209    44.06794

Sample Correlation
         t0          t2          t4          t6
t0   1.0000000   0.8304401   0.7056879   0.5185958
t2   0.8304401   1.0000000   0.8552522   0.7598109
t4   0.7056879   0.8552522   1.0000000   0.8382766
t6   0.5185958   0.7598109   0.8382766   1.0000000

The sample covariance matrix shows that the variance is not constant over time; it increases from about 8.9 at baseline to about 44.1 at hour 6.

Sample correlation decreases as time gap increases.

2. (20 pts) Fitting a “Maximal” Model and Evaluating Variance-Covariance Structure:

a) Define a reasonable “maximal” mean model for this study. Fit this model using an unstructured variance-covariance matrix. Comment on the variance structure and on the correlation structure. What simplified variance-covariance structure(s) might be reasonable? Justify your answer.

Maximal mean model (exposure group, hour treated as a categorical variable, and their interaction):

\(FEV1_{ij} = \beta_{0}+\beta_{1}G_{i}+\beta_{2}t_{2ij}+\beta_{3}t_{4ij}+\beta_{4}t_{6ij}+\beta_{5}G_{i}t_{2ij}+\beta_{6}G_{i}t_{4ij}+\beta_{7}G_{i}t_{6ij}+\epsilon_{ij}\)

where \(G_{i}\) is the exposure group indicator (1 = ozone, 0 = room air), \(t_{kij}\) is an indicator that the \(j\)-th measurement on subject \(i\) was taken at hour \(k\) (\(k = 2, 4, 6\); baseline is the reference), and the vector of errors for subject \(i\), \(\epsilon_{i} = (\epsilon_{i1},\ldots,\epsilon_{i4})'\), is assumed \({\sf N}(0,\Sigma)\) with \(\Sigma\) left unstructured.

UN Covariance
         t0          t2          t4          t6
t0    9.329589    8.206349   17.682163   22.238252
t2    8.206349   10.092233   20.997101   30.183379
t4   17.682163   20.997101   58.98705    76.41663
t6   22.238252   30.183379   76.41663   145.18795

UN Correlation
         t0          t2          t4          t6
t0   1.0000000   0.9851576   0.9929213   0.9632263
t2   0.9851576   1.0000000   0.9916086   0.9893279
t4   0.9929213   0.9916086   1.0000000   0.9632258
t6   0.9632263   0.9893279   0.9632258   1.0000000

From the fitted unstructured covariance matrix, the variance is not constant across time points; it increases from baseline to hour 6. The corresponding correlation matrix shows correlations that decrease as the time gap increases, although the magnitude of the decrease is modest. A heterogeneous first-order autoregressive covariance structure might therefore be a reasonable simplified variance-covariance structure.
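
Under the heterogeneous first-order autoregressive (ARH(1)) structure suggested here, each occasion has its own variance while the correlation decays with the number of occasions separating two measurements: \(\mathrm{Var}(Y_{ij}) = \sigma_{j}^{2}\) and \(\mathrm{Corr}(Y_{ij}, Y_{ik}) = \rho^{|j-k|}\), where \(j\) and \(k\) index the (equally spaced) measurement occasions.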

b) Keeping the same maximal mean model, evaluate whether your suggestion(s) for the variance-covariance structure from question 2a as well as the following models for the variance-covariance structure provide an adequate fit to the data compared with an unstructured variance-covariance:

i) compound symmetry

The compound symmetry covariance structure does not resemble the pattern shown in either the sample covariance matrix or the unstructured fit, since it forces a single common variance across time points. Its correlation structure, a single common correlation for every pair of time points, is also inconsistent with the observed pattern.
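
For reference, compound symmetry specifies \(\mathrm{Var}(Y_{ij}) = \sigma^{2}\) and \(\mathrm{Corr}(Y_{ij}, Y_{ik}) = \rho\) for all \(j \neq k\), i.e., a common variance and a common correlation regardless of time separation.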

ii) heterogeneous compound symmetry (hint use: type=csh in proc mixed)
CSH Covariance
         t0          t2          t4          t6
t0    9.853779    7.144273   17.631637   30.184439
t2    7.144273    8.486354   16.362581   28.011881
t4   17.631637   16.362581   51.68807    69.13164
t6   30.184439   28.011881   69.13164   151.48558

CSH Correlation
         t0          t2          t4          t6
t0   1.0000000   0.9864895   0.9663038   0.9908609
t2   0.9864895   1.0000000   0.9675543   0.9922139
t4   0.9663038   0.9675543   1.0000000   0.9409958
t6   0.9908609   0.9922139   0.9409958   1.0000000

The compound symmetry covariance structure with heterogeneous variance does resemble the patterns seen in the unstructured covariance and correlation matrices.
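
Heterogeneous compound symmetry relaxes the common-variance assumption while keeping a single correlation: \(\mathrm{Var}(Y_{ij}) = \sigma_{j}^{2}\) and \(\mathrm{Corr}(Y_{ij}, Y_{ik}) = \rho\) for \(j \neq k\).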

iii) 1st-order autoregressive

The first-order autoregressive covariance matrix does not follow the variance pattern seen in the sample or unstructured covariance estimates, since it assumes a constant variance across time. Although its correlation does decrease with increasing time lag, the correlations are much lower than those shown in the data and in the unstructured fit.
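
The homogeneous AR(1) structure specifies \(\mathrm{Var}(Y_{ij}) = \sigma^{2}\) and \(\mathrm{Corr}(Y_{ij}, Y_{ik}) = \rho^{|j-k|}\), so it can capture decaying correlation but not the increasing variance seen in these data.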

iv) heterogeneous 1st-order autoregressive (hint use: type=arh(1) in proc mixed)
AR1H Covariance
         t0          t2          t4          t6
t0    9.183796    8.000444   16.674472   22.571217
t2    8.000444    9.802927   20.431194   27.656463
t4   16.674472   20.431194   59.89375    81.07452
t6   22.571217   27.656463   81.07452   154.36082

AR1H Correlation
         t0          t2          t4          t6
t0   1.0000000   0.9880637   0.9895792   0.9835088
t2   0.9880637   1.0000000   0.9985443   0.9819967
t4   0.9895792   0.9985443   1.0000000   0.9737329
t6   0.9835088   0.9819967   0.9737329   1.0000000

The first-order autoregressive covariance structure with heterogeneous variance also resembles the sample and unstructured covariance and correlation matrices.

Using likelihood ratio tests and the AIC criterion as appropriate, identify a model for the variance-covariance structure that provides a good fit to the data. Provide estimates for the parameters used in defining this variance-covariance model. Also provide estimated variance-covariance and correlation matrices.
Model                        LogLikelihood   -2LogLikelihood   Number of Parameters
Unstructured                     -177.5752          355.1504                     10
Compound Symmetry                -197.0629          394.1258                      2
Compound Symmetry, Hetero        -182.6467          365.2934                      5
AR(1)                            -190.5644          381.1288                      2
AR(1), Hetero                    -179.1189          358.2378                      5

Structure Comparison   LRT = Diff(-2LL)   df = Diff(Parameters)   Pr(chi2_df > LRT)
UN vs CS                        38.9754                       8           < 0.00001
UN vs CSH                       10.1430                       5            0.054941
UN vs AR(1)                     25.9784                       8            0.001157
UN vs ARH(1)                     3.0874                       5            0.749359

The likelihood ratio tests show that neither the compound symmetry nor the first-order autoregressive covariance structure is adequate compared with the unstructured covariance model. Both the heterogeneous compound symmetry and heterogeneous first-order autoregressive structures are adequate compared with the unstructured covariance in describing the pattern, and both use fewer parameters than the unstructured model.

Structure       -2LL   Parameters        AIC
CSH         365.2934            5   375.2934
ARH(1)      358.2378            5   368.2378

Comparing the heterogeneous compound symmetry and heterogeneous first-order autoregressive structures using AIC, the heterogeneous first-order autoregressive structure has the smaller AIC and is the better structure for these data.

3. (20 pts) Analysis of Response Profiles:

Fit the usual model for the analysis of mean profiles using room air exposure as the reference level for group and baseline as the reference group for time. Use the variance-covariance structure identified in your answer to question 2b. Based on this model:

a) Test the null hypothesis that the pattern of means over hours is identical (coincides) for the two exposure groups. What do you conclude?
Model         -2LogLikelihood   Number of Parameters
Saturated            358.2378                      8
Time Only            374.1760                      4
Difference            15.9382                      4

With \(Pr(\chi^{2}_{4} > 15.9382) = 0.003103\), we reject the null hypothesis: the simpler model with only time effects is not adequate compared with the saturated model, so the pattern of means over hours does not coincide for the two exposure groups.

b) Test the null hypothesis that the mean response profiles of the two groups are parallel. What do you conclude?
Model              -2LogLikelihood   Number of Parameters
Saturated                 358.2378                      8
Time and Group            371.6128                      5
Difference                 13.3750                      3

With \(Pr(\chi^{2}_{3} > 13.3750) = 0.003892\), we reject the null hypothesis: the simpler model with only main effects of time and group does not adequately describe the group means over time, so the mean response profiles of the two groups are not parallel (i.e., there is a group-by-time interaction).

4. (20 pts) Fitting a Linear Model in Time:

Fit a model that includes hour as a continuous variable, group and their interaction. Use the model for the variance-covariance structure that you identified in question 2b.

a) What is the estimated rate of change in mean response for the room air group?

b) What is the estimated rate of change in mean response for the ozone group?

c) Test the hypothesis that the rates of change in mean response are identical in the two groups. What do you conclude?

d) What is the estimated difference in rate of mean change between the two groups? By calculating a 95% confidence interval for this difference, identify what are plausible values for the underlying true difference.

5. (20 pts) Evaluating the Fit of a Linear Model in Time: Does a model with a linear trend in hour for each exposure group adequately describe the pattern of change in the two groups? Justify your answer with appropriate statistical analysis. (Hint: compare it with a quadratic model and the saturated model.)
6. (10 pts) Summarizing the Key Results and Conclusions: Write a brief structured abstract (maximum 200 words) summarizing the objective, methods, results and conclusions that might be drawn concerning exposure differences in patterns of pulmonary function over time.

#### Libraries ####
library(tidyverse)
library(kableExtra)
library(nlme)

#### Dataset ####
FEV <- read_delim("/Users/questwhalen/Desktop/Professional/UMMC/Schedule\ and\ Classes\ and\ Material/BDS\ 724\ Longitudinal\ and\ Multilevel\ Models/HW_Projects/Longitudinal_Multilevel_Model/HW\ 4/ozone0246.dat", 
                  delim = " ", 
                  trim_ws = TRUE,
                  col_names = c("id", "time", "group", "y"))

# map(.x = 1:3, .f = ~ FEV[,.x] %>% unique)

#### Q1 ####
#### 1a. ####
# kable(list(FEV %>% group_by(id, group) %>% summarize(Count = n()) %>% ungroup %>% slice(1:5),
#            FEV %>% group_by(id, group) %>% summarize(Count = n()) %>% ungroup %>% slice(6:10),
#            FEV %>% group_by(id, group) %>% summarize(Count = n()) %>% ungroup %>% slice(11:15),
#            FEV %>% group_by(id, group) %>% summarize(Count = n()) %>% ungroup %>% slice(16:20))) %>%
#     kable_styling()  %>%
#     kable_paper("striped", full_width = F)

# FEV %>% group_by(time, group) %>% summarize(Count = n()) %>%
#     kable %>%
#     kable_styling()  %>%
#     kable_paper("striped", full_width = F)
# data are complete: every participant has the same number of observations at the same time points
# xtabs(~ id + time, FEV)
#### 1.b ####
## Individual Spaghetti Plot
# interaction.plot(x.factor = FEV$time,
#                  trace.factor = FEV$id,
#                  response = FEV$y,
#                  legend = FALSE,
#                  xlab = "Time (hours)",
#                  ylab = "FEV1 (ml/100)",
#                  main = "Plot of Mean Response Profile in Subjects",
#                  type = "b",
#                  pch = 19,
#                  col = 1:20)
# interaction.plot(x.factor = FEV$time,
#                  trace.factor = FEV$id,
#                  response = FEV$y,
#                  legend = FALSE,
#                  xlab = "Time (hours)",
#                  ylab = "FEV1 (ml/100)",
#                  main = "Plot of Mean Response Profile in Subjects",
#                  type = "b",
#                  pch = c(rep(19,10),rep(17,10)),
#                  lty = c(rep(1,10),rep(2,10)),
#                  col = c(rep(4,10), rep(8,10)))
# legend(1.1, 30, lty = 1:2, pch = c(19,17), legend=c("0: Room Air","1: Ozone"), col=c(4,8), title = "Group")

# kable(list(FEV %>% mutate(Id = factor(id), Group = factor(group)) %>% group_by(Id, Group) %>%
#              summarise(Observations = n(), Mean = mean(y), StdDev = var(y) %>% sqrt) %>% ungroup %>% slice(1:10),
#            FEV %>% mutate(Id = factor(id), Group = factor(group)) %>% group_by(Id, Group) %>%
#              summarise(Observations = n(), Mean = mean(y), StdDev = var(y) %>% sqrt) %>% ungroup %>% slice(11:20)
#             )) %>%
#     kable_styling() %>%
#     kable_paper("striped", full_width = F) %>%
#   add_header_above(header = c("Within Subject Variability", ""))
interaction.plot(x.factor = FEV$time[1:40],
                 trace.factor = FEV$id[1:40],
                 response = FEV$y[1:40],
                 legend = FALSE,
                 xlab = "Time (hours)",
                 ylab = "FEV1 (ml/100)",
                 main = "Subjects with Room Air Exposure FEV1 Overtime",
                 type = "b",
                 pch = c(rep(19,10)),
                 lty = c(rep(1,10)),
                 col = c(rep(4,10)))
interaction.plot(x.factor = FEV$time[41:80],
                 trace.factor = FEV$id[41:80],
                 response = FEV$y[41:80],
                 legend = FALSE,
                 xlab = "Time (hours)",
                 ylab = "FEV1 (ml/100)",
                 main = "Subjects with Ozone Exposure FEV1 Overtime",
                 type = "b",
                 pch = c(rep(17,10)),
                 lty = c(rep(2,10)),
                 col = c(rep(8,10)))
#### 1.c ####
## Mean Response Profile Spaghetti Plot
interaction.plot(x.factor = FEV$time,
                 trace.factor = FEV$group,
                 response = FEV$y,
                 legend = FALSE,
                 xlab = "Time (hours)",
                 ylab = "FEV1 (ml/100)",
                 main = "Plot of Mean Response Profile by group",
                 type = "b",
                 pch = c(19,17),
                 lty = 1:2,
                 col = c("dark blue", "dark red"))
legend(1, 40, lty = 1:2, pch = c(19,17), legend=c("0: Room Air","1: Ozone"), col=c("dark blue", "dark red"), title = "Group")
kable(list(FEV %>% mutate(Time = factor(time), Group = group) %>% group_by(Time, Group) %>% 
                summarise(Observations = n(), Mean = mean(y), Var = var(y)) %>% ungroup)) %>%
  kable_paper("striped", full_width = F) %>% 
  add_header_above(header = c("Mean Response Profile"))
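
## Sample covariance and correlation across the four time points
## (a sketch of one way to obtain matrices like those reported in the write-up,
##  not necessarily the code that produced them; pivot_wider is from tidyr/tidyverse)
FEV_wide <- FEV %>%
  pivot_wider(id_cols = c(id, group), names_from = time,
              values_from = y, names_prefix = "t")
cov(FEV_wide %>% select(starts_with("t")))           # sample covariance matrix
cov2cor(cov(FEV_wide %>% select(starts_with("t"))))  # sample correlation matrix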


#### Q2 ####
#### 2.a ####
UNmod <- gls(y ~ group + factor(time) + group*factor(time), 
             data = FEV,
             corr= corSymm(form = ~ 1 | id),
             weights = varIdent(form = ~ 1 |time),
             method = 'REML')
# UNmod %>% summary
# rescaling by the varIdent coefficients reported by summary(UNmod); note that
# getVarCov() for gls fits generally already incorporates the varIdent weights,
# so check whether this rescaling double-counts the heterogeneous variances
L <- diag(c(1.000000, 1.019838, 1.585709, 1.986173))
list(L %*% getVarCov(UNmod) %*% L,              #covariance matrix
     cov2cor(L %*% getVarCov(UNmod) %*% L)) %>% #correlation matrix (cov2cor converts a covariance matrix; cor() would treat it as data)
  kable()%>%
  kable_paper("striped", full_width = F) %>% 
  add_header_above(header = c("UN Covariance", "UN Correlation"))

# getVarCov(UNmod)
# getVarCov(UNmod) %>% cor()

#varIdent(form = ~ variance covariate | grouping factor); when a grouping factor is present, a different coefficient value is used for each at its levels less one reference level
#getVarCov gives marginal variance-covariance of the responses by group
#### 2b.i ####
CSmod <- gls(y ~ group + factor(time) + group*factor(time),
             data = FEV,
             correlation = corCompSymm(form = ~ 1|id),
             weights = varIdent(form = ~ 1), # for homogenous variance
             method = 'REML')
# CSmod %>% summary
list(getVarCov(CSmod), cov2cor(getVarCov(CSmod))) %>%  #covariance and correlation (homogeneous variance)
  kable%>%
  kable_paper("striped", full_width = F) %>% 
  add_header_above(header = c("CS Covariance", "CS Correlation"))
#### 2b.ii ####
CSHmod <- gls(y ~ group + factor(time) + group*factor(time),
             data = FEV,
             correlation = corCompSymm(form = ~ 1|id),
             weights = varIdent(form = ~ 1|time), #include for heterogenous variance; ratio of variances to reference time point
             method = 'REML')
# CSHmod %>% summary
LCSH <- diag(c(1.000000, 0.963340, 1.513376, 1.980123 ))
list(LCSH %*% getVarCov(CSHmod) %*% LCSH, cov2cor(LCSH %*% getVarCov(CSHmod) %*% LCSH)) %>% 
  kable%>%
  kable_paper("striped", full_width = F) %>% 
  add_header_above(header = c("CSH Covariance", "CSH Correlation"))
#### 2b.iii ####
AR1mod <- gls(y ~ group + factor(time) + group*factor(time),
             data = FEV,
             correlation = corAR1(form = ~ 1|id),
             weights = varIdent(form = ~ 1), #include for homogenous variance
             method = 'REML')
# AR1mod %>% summary
list(getVarCov(AR1mod), cov2cor(getVarCov(AR1mod))) %>%  #covariance and correlation for the AR(1) fit
  kable %>%
  kable_paper("striped", full_width = F) %>% 
  add_header_above(header = c("AR1 Covariance", "AR1 Correlation"))
#### 2b.iv ####
AR1Hmod <- gls(y ~ group + factor(time) + group*factor(time),
             data = FEV,
             correlation = corAR1(form = ~ 1|id),
             weights = varIdent(form = ~ 1|time), #include for heterogenous variance
             method = 'REML')
# AR1Hmod %>% summary
LAR1H <- diag(c(1.000000, 1.016444, 1.598048, 2.024784))
list(LAR1H %*% getVarCov(AR1Hmod) %*% LAR1H, cov2cor(LAR1H %*% getVarCov(AR1Hmod) %*% LAR1H)) %>% 
  kable%>%
  kable_paper("striped", full_width = F) %>% 
  add_header_above(header = c("AR1H Covariance", "AR1H Correlation"))
#### 2b.comparison ####
Best <- tibble("Model" = c("Unstructured", "Compound Symmetry", "Compound Symmetry, Hetero", "AR(1)", "AR(1), Hetero"),
       "LogLikelihood" = c(-177.5752, -197.0629, -182.6467, -190.5644, -179.1189),
       "Number of Paramters" = c(10, 2, 5, 2, 5)) %>%
  mutate("-2LogLikelihood" = -2*LogLikelihood) %>% 
  select(c(1,2,4,3)) 
Best %>% 
  kable %>%
  kable_paper("striped", full_width = F)

Comp <- tibble("Structure Comparison" = c("UN vs CS", "UN vs CSH", "UN vs AR(1)", "UN vs ARH(1)"),
       "LRT = Diff(-2LL)" = c(Best[[2,3]]-Best[[1,3]], Best[[3,3]]-Best[[1,3]], Best[[4,3]]-Best[[1,3]], Best[[5,3]]-Best[[1,3]]),
       "df = Diff(Parameters)" = c(8, 5, 8, 5),
       "Pr(chi2_df > LRT)" = c("< 0.00001", 0.054941, 0.001157, 0.749359))
Comp %>% 
  kable %>%
  kable_paper("striped", full_width = F)

tibble("Structure" = c("CSH", "ARH(1)"),
       "-2LL"= c(Best[[3,3]], Best[[5,3]]),
       "Parameters" = c(5,5),
       "AIC" = c(Best[[3,3]]+(2*5), Best[[5,3]]+(2*5))) %>% 
  kable %>%
  kable_paper("striped", full_width = F)


#### Q3 ####
# AR1Hmod %>% summary
#### 3a ####
# reduced model: a common (coincident) mean profile over hours for both groups,
# so hour enters as a factor with no group terms
# (note: nested mean-model comparisons are usually based on ML fits, since REML
#  log-likelihoods are not strictly comparable across different mean structures)
AR1Hmod.equalgroups <- gls(y ~ factor(time),
             data = FEV,
             correlation = corAR1(form = ~ 1|id),
             weights = varIdent(form = ~ 1|time), #include for heterogeneous variance
             method = 'REML')
# AR1Hmod.equalgroups %>% summary
# anova(AR1Hmod.equalgroups, AR1Hmod)

tibble("Model" = c("Saturated","Time Only", "Difference"),
       "-2LogLikelihood" = c(-2*-179.1189, -2*-189.5657, (-2*-189.5657)-(-2*-179.1189)),
       "Number of Parameters" = c(8, 4, 4)) %>% 
  kable %>%
  kable_paper("striped", full_width = F)
#### 3b ####
AR1Hmod.parallel <- gls(y~ factor(time) + group,
             data = FEV,
             correlation = corAR1(form = ~ 1|id),
             weights = varIdent(form = ~ 1|time), #include for heterogenous variance
             method = 'REML')
# AR1Hmod.parallel %>% summary
# anova(AR1Hmod.parallel, AR1Hmod)

tibble("Model" = c("Saturated","Time and Group", "Difference"),
       "-2LogLikelihood" = c(-2*-179.1189, -2*-185.8064, (-2*-185.8064)-(-2*-179.1189)),
       "Number of Parameters" = c(8, 5, 3)) %>% 
  kable %>%
  kable_paper("striped", full_width = F)


#### Q4 ####
model.linear<- gls(y~ time + group + group*time, 
                data = FEV,
                correlation = corAR1(form = ~ 1|id),
                weights = varIdent(form = ~ 1|time), 
                method = 'REML')
model.linear %>% summary
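
#### Q4 (sketch) ####
# A sketch of how the quantities in 4a-4d could be read off model.linear; the
# coefficient names below assume group enters as the numeric 0/1 variable from
# the data file, so "time:group" is the difference in slopes (ozone - room air)
coef(model.linear)["time"]                                     # 4a: slope for room air
coef(model.linear)["time"] + coef(model.linear)["time:group"]  # 4b: slope for ozone
intervals(model.linear, which = "coef")  # 4c/4d: the "time:group" row gives the
                                         # estimated difference in slopes and its 95% CI

#### Q5 (sketch) ####
# A sketch for assessing the linear-in-hour model against a quadratic and the
# saturated (factor(time)) mean model; nested mean models are compared with ML
# fits because REML log-likelihoods are not comparable across mean structures
model.linear.ml    <- update(model.linear, method = "ML")
model.quadratic.ml <- gls(y ~ (time + I(time^2)) * group,
                          data = FEV,
                          correlation = corAR1(form = ~ 1|id),
                          weights = varIdent(form = ~ 1|time),
                          method = "ML")
model.saturated.ml <- gls(y ~ factor(time) * group,
                          data = FEV,
                          correlation = corAR1(form = ~ 1|id),
                          weights = varIdent(form = ~ 1|time),
                          method = "ML")
anova(model.linear.ml, model.quadratic.ml, model.saturated.ml)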