knitr
knitr::opts_chunk$set(echo = TRUE)
on your mark, get set
rm(list=ls())
graphics.off()
closeAllConnections()
a few of my favorite packages
library(tidyverse)
library(xfun)
library(tidyr)
library(nlme)
director fury
setwd(
"~/Yale University/2semester/BIS628 Longitudinal Multilevel Analysis/HW/2")
the import-export business
cholesterol.data <- read.table(
"~/Yale University/2semester/BIS628 Longitudinal Multilevel Analysis/HW/2/cholesterol-data.txt",
quote="\"", comment.char="")
names(cholesterol.data) <- c("Group", "ID", "Y0", "Y6", "Y12", "Y18", "Y24")
head(cholesterol.data)
as numbers
cholesterol.data$Y0 <- as.numeric(cholesterol.data$Y0)
cholesterol.data$Y6 <- as.numeric(cholesterol.data$Y6)
cholesterol.data$Y12 <- as.numeric(cholesterol.data$Y12)
## Warning: NAs introduced by coercion
cholesterol.data$Y18 <- as.numeric(cholesterol.data$Y18)
## Warning: NAs introduced by coercion
cholesterol.data$Y24 <- as.numeric(cholesterol.data$Y24)
## Warning: NAs introduced by coercion
assigning groups
cholesterol.data$Group <- as.factor(cholesterol.data$Group)
levels(cholesterol.data$Group)[2] <- c("High-dose Treatment")
levels(cholesterol.data$Group)[1] <- c("Placebo")
levels(cholesterol.data$Group)
## [1] "Placebo" "High-dose Treatment"
1.1 Read ‘cholesterol-data.txt’ file into SAS or R (depending upon your preference) and put the data in a ‘long’ format.
long_chol.d <- cholesterol.data %>%
pivot_longer(cols = starts_with("Y"),
names_to = "Time",
values_to = "Serum Cholesterol")
cholesterol.data <- long_chol.d
rm(long_chol.d)
time as a number
time_number <- substr(cholesterol.data$Time, 2, nchar(cholesterol.data$Time))
time_number <- as.numeric(time_number)
cbind(cholesterol.data, time_number)
cholesterol.data$Time_number <- time_number
rm(time_number)
1.2 Create a summary table of missingness in the outcome over time by group. Describe what you observe.
#number of na's
na_counts <- cholesterol.data %>%
group_by(Group, Time_number) %>%
summarise(NA_Count = sum(is.na(`Serum Cholesterol`)))
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
print(na_counts)
## # A tibble: 10 × 3
## # Groups: Group [2]
## Group Time_number NA_Count
## <fct> <dbl> <int>
## 1 Placebo 0 0
## 2 Placebo 6 0
## 3 Placebo 12 3
## 4 Placebo 18 6
## 5 Placebo 24 10
## 6 High-dose Treatment 0 0
## 7 High-dose Treatment 6 0
## 8 High-dose Treatment 12 7
## 9 High-dose Treatment 18 18
## 10 High-dose Treatment 24 24
#switch me
na_c_w <- na_counts %>%
pivot_wider(id_cols = "Group",
names_from = "Time_number",
values_from = "NA_Count"
)
print(na_c_w)
## # A tibble: 2 × 6
## # Groups: Group [2]
## Group `0` `6` `12` `18` `24`
## <fct> <int> <int> <int> <int> <int>
## 1 Placebo 0 0 3 6 10
## 2 High-dose Treatment 0 0 7 18 24
#number of responses
followup <- cholesterol.data %>%
group_by(Group, Time_number) %>%
summarise(responses = sum(!is.na(`Serum Cholesterol`))) %>%
arrange(Time_number)
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
print(followup)
## # A tibble: 10 × 3
## # Groups: Group [2]
## Group Time_number responses
## <fct> <dbl> <int>
## 1 Placebo 0 41
## 2 High-dose Treatment 0 62
## 3 Placebo 6 41
## 4 High-dose Treatment 6 62
## 5 Placebo 12 38
## 6 High-dose Treatment 12 55
## 7 Placebo 18 35
## 8 High-dose Treatment 18 44
## 9 Placebo 24 31
## 10 High-dose Treatment 24 38
#switch me
followup <- followup %>%
pivot_wider(id_cols = "Group",
names_from = "Time_number",
values_from = "responses"
)
print(followup)
## # A tibble: 2 × 6
## # Groups: Group [2]
## Group `0` `6` `12` `18` `24`
## <fct> <int> <int> <int> <int> <int>
## 1 Placebo 41 41 38 35 31
## 2 High-dose Treatment 62 62 55 44 38
#combined for a ratio of na's
na_ratio <- merge(na_c_w, followup, all = TRUE, sort = TRUE)
print(na_ratio)
## Group 0 6 12 18 24
## 1 Placebo 0 0 3 6 10
## 2 Placebo 41 41 38 35 31
## 3 High-dose Treatment 0 0 7 18 24
## 4 High-dose Treatment 62 62 55 44 38
#ratio
missingness <- na_ratio
for (i in 2:6) {
missingness[1, i] <- na_ratio[1, i] / na_ratio[2, i]
missingness[3, i] <- na_ratio[3, i] / na_ratio[4, i]
}
missingness <- missingness[c(1, 3), ]
the table below shows the ratio of missing values to observed follow-ups increases over time for both groups, but starts at a higher baseline for the treatment group.
print(missingness)
## Group 0 6 12 18 24
## 1 Placebo 0 0 0.07894737 0.1714286 0.3225806
## 3 High-dose Treatment 0 0 0.12727273 0.4090909 0.6315789
1.3 In a Single Figure, create a panel (2 rows by 3 columns) of 5 plots (one for each time point) for distribution of cholesterol at each follow up timepoint by group (High-dose versus Placebo) overlaid with empirical probability curves. Hint: using the ggplot code in R Lab#1 which produced histograms by group with overlaid empirical probability curves and ADD to the code + facet_wrap(~FU_Time) where FU_Time is the follow up time “0 months”, “6 months”, etc.:
basic.hist <- ggplot(data = cholesterol.data, aes(x = `Serum Cholesterol`))
basic.hist +
geom_histogram(color = "black",
fill = "lightblue") +
labs(title = "Distribution of Serum Cholesterol",
x = "Serum Cholesterol Measurments (mg/dL)",
y = "Count of Observations")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 68 rows containing non-finite outside the scale range
## (`stat_bin()`).
with a normal distribution
basic.hist +
geom_histogram(aes(y = ..density..),
color = "black",
fill = "lightblue",
binwidth = density(cholesterol.data$`Serum Cholesterol`, na.rm = TRUE)$bw) +
labs(title = "Distribution of Serum Cholesterol",
x = "Serum Cholesterol",
y = "Density of Observations") +
stat_function(color = "red", size = .8, fun = dnorm,
args = list(mean = mean(cholesterol.data$`Serum Cholesterol`, na.rm = TRUE),
sd = sd(cholesterol.data$`Serum Cholesterol`, na.rm = TRUE)))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 68 rows containing non-finite outside the scale range
## (`stat_bin()`).
layered_hist <- ggplot(data = cholesterol.data, aes(x = `Serum Cholesterol`, fill = Group))
layered_hist +
geom_histogram(color = "black",
alpha = 0.5,
position="identity") + geom_density(linetype = "dashed") +
labs(title = "Distribution of Serum Cholesterol by Group",
x = "Serum Cholesterol (mg/dL)", fill="Group", y = "Count of Observations") + facet_wrap(~ Time_number)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 68 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 68 rows containing non-finite outside the scale range
## (`stat_density()`).
layered_hist <- ggplot(data = cholesterol.data, aes(x = `Serum Cholesterol`, fill = Group))
layered_hist +
geom_histogram(color = "black",
alpha = 0.5) + geom_density(linetype = "dashed") +
labs(title = "Distribution of Serum Cholesterol by Group",
x = "Serum Cholesterol (mg/dL)", fill="Group", y = "Count of Observations")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 68 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 68 rows containing non-finite outside the scale range
## (`stat_density()`).
p2 <- ggplot(data = cholesterol.data, aes(x = `Serum Cholesterol`, na.rm = TRUE, color = Group))
p2 +
geom_histogram(aes(y = ..density.., fill = Group), , alpha = 0.2,
binwidth = density(cholesterol.data$`Serum Cholesterol`, na.rm = TRUE)$bw) +
geom_density(linetype = "dashed") +
labs(title = "Distribution of Serum Cholesterol",
x = "Serum Cholesterol",
y = "Density") + facet_wrap(~ Time_number)
## Warning: Removed 68 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 68 rows containing non-finite outside the scale range
## (`stat_density()`).
we’re plotting the distribution of the response variables yi at each of the time points, and using the property of each of the distributions at each of the time points to make a determination on the normality of the variances. and that variance is the random error. so if the response variable distribution looks normal at each time point, then we can use that to make a determination on the normality of random errors at each time point. this is the MVN.
2.1 Variogram g is defined as \[g(u)=E[1/2{Y(t) – Y(t-u)}^2]\] with t > u, where Y(t) is a longitudinal response measured at time t, u is the lapsed time between the two observations within a same subject at time t and t-u. Using slides 7-8 of Lecture 3, where g(u) was defined:
Explain in words and show that the correlation coefficient, \(𝝆(u)\) between \(Y(t)\) and \(Y(t-u)\) only depends on lapsed time or lag ‘u’ but not the actual time t.
the covariance structure of the variogram is a stochastic random process that’s second-order stationary, meaning the covariance (and therefore correlation) only depends on distance, not time. See this is true, as when we plug in for \(Y(t)=0\) the only thing we’re left with is \(Y(u)\) , which stands for the lag \(u\)
#light data work handling na's
cholesterol.data_c <- cholesterol.data[is.na(cholesterol.data$`Serum Cholesterol`)==F,]
#light data work handling names
cholesterol.data_c <- rename(cholesterol.data_c, S_Cholesterol = `Serum Cholesterol`)
#light data work handling time
#time as a factor t_as_factor
cholesterol.data_c$t_as_factor <- as.factor(cholesterol.data_c$Time) %>% as.numeric() %>% as.factor()
#time as a number (1:5) t_enumerated
cholesterol.data_c$t_enumerated <- as.factor(cholesterol.data_c$Time_number) %>% as.numeric()
2.2 Back to our HW dataset, using the “maximal” model for the mean response, plot sample or empirical variogram for the dataset:
###Sample variogram: use lmensssp package in R; OR use "Empirical or Sample Variogram Function.r" provided on canvas for Lab 3
#install.packages("lmenssp")
#library(lmenssp)
source("Empirical or Sample Variogram Function.r")
#Using OLS, fit a piece-wise linear model with independent covariance structure
#cholesterol.data_c <- cholesterol.data[is.na(cholesterol.data$`Serum Cholesterol`)==F,] #only select rows where na is false
# Remove rows with NA in 'Serum Cholesterol', 'Group', or 'Time_number'
#cholesterol.data_c_clean <-
#cholesterol.data_c[complete.cases(cholesterol.data_c$`Serum Cholesterol`,
#cholesterol.data_c$Group,
#cholesterol.data_c$Time_number), ]
#row goals is 447, after cleaning
lm_variogram <- lm(S_Cholesterol ~ Group * Time_number, data = cholesterol.data_c)
#Plot sample variogram
#Note: In literature "semivariogram" is sometimes used interchangeably with "variogram", but they actually mean different things
variogram(resid = resid(lm_variogram),
timeVar = cholesterol.data_c$Time_number,
id = cholesterol.data_c$ID,
irregular = FALSE) #this line of code refers to whether the study design time is balanced or not
## $lags
## [1] 6 12 18 24
##
## $means
## 6 12 18 24
## 529.6383 542.3154 646.0807 935.8719
##
## $sizes
## 6 12 18 24
## 342 239 148 69
##
## $process.var
## [1] 1996.968
abline(h=529.6383, col="red", lty=2) #around the smallest value in $means in Console Output
abline(h=937.9797, col="blue", lty=2) #around the largest value in $means in Console Output
for my maximal model, I chose a linear regression where \(Y_i\) is a function of an interaction between \(Group x Time_number\), which accounts for all possible beta coefficients in the regression
2.2.ii Describe the different sources of variability in the response variable using the plot of the empirical variogram (see slide 8 in Lecture 3).
we have the variance of the random effect, which in this model we say is \(C(u)=0\), then the variance of the serial process which we described above, and the measurement error variance \(\tau^2\) is also negligible, as per stated above in the question.
Consider the following covariance structures (Ri): Independent, Unstructured, Compound Symmetry, Autoregressive(AR1), Toeplitz, and Heterogeneous Compound Symmetry:
3.1 Write out \(R_i\) for each covariance model using the study’s time points and identify all nested covariance models by showing what constraints you can apply to an Ri to get nested models and identify All nested models. (For example, what constraints would you apply to the Unstructured covariance to get to the Compound Symmetry covariance?)
Summary of Constraints and Nested Models:
Unstructured (UN): No constraints, all variances and covariances freely estimated
Compound Symmetry (CS): Variances are equal across time points, covariances are equal across time points. Constraint: σ12=σ22=⋯=σn2σ12=σ22=⋯=σn2 and ρij=θρij=θ for all pairs.
Heterogeneous Compound Symmetry (CSH): Covariances are equal, but variances are allowed to differ across time points. Constraint: σ12≠σ22≠⋯≠σn2σ12=σ22=⋯=σn2, and ρij=θρij=θ for all pairs.
Autoregressive (AR1): Variances are equal, covariances depend on the lag. Constraint: ρij=θ×ρ∣i−j∣ρij=θ×ρ∣i−j∣.
Toeplitz (TOEP): Variances are equal, but covariances depend on the lag and can vary for each lag. Constraint: ρij=θ∣i−j∣ρij=θ∣i−j∣ for each lag ∣i−j∣∣i
These models can be interpreted in a hierarchy, with Unstructured being the most flexible and other models being more constrained versions.