Suppose we have a data set as below: 100 pairs of the outcome \(Y\) and the independent variable \(X\). We want to investigate the linear association between \(Y\) and \(X\).
The data are simulated with the code below, and the first ten rows are then printed:
library(MASS)
set.seed(1)
# Simulate a data set consisting of i study participants
# with j observations per participant
simulate = function(i, j){
  # The number of subjects is i
  N = i
  # The number of repeated measurements per subject is j
  timepoints = j
  # Simulate errors from N(0, sigma_e^2) with sigma_e = 10
  e_ij = rnorm(n = N*timepoints, mean = 0, sd = 10)
  # Generate the subject ID variable
  id = c(sapply(1:N, function(x) rep(x, timepoints)))
  # Generate the within-subject measurement index j
  within_id = rep(seq(1, timepoints), N)
  # Simulate random effects from N(0, G)
  b_i = mvrnorm(n = N, mu = c(0, 0), Sigma = matrix(c(2, 1, 1, 10), ncol = 2))
  rows = rep(1:nrow(b_i), each = timepoints)
  b_i = b_i[rows, ]
  # Simulate X from N(5, 3^2)
  x = rnorm(n = N*timepoints, mean = 5, sd = 3)
  # Set the true parameter vector
  beta = c(10, 2)
  # Simulate the response from a linear mixed-effects model
  y = beta[1] + b_i[,1] + (beta[2] + b_i[,2])*x + e_ij
  # Return a data frame
  data.frame(ID_i = id, Y = y, X = x, j = within_id)
}
data = simulate(i = 20, j = 5)
data[1:10,]
## ID_i Y X j
## 1 1 4.214745 -0.743078277 1
## 2 1 11.910365 8.529749936 2
## 3 1 2.090292 0.005082691 3
## 4 1 26.241827 3.609408796 4
## 5 1 13.669652 1.652239685 5
## 6 2 6.463692 2.747542996 1
## 7 2 39.616939 11.261499637 2
## 8 2 27.485530 5.052186859 3
## 9 2 16.638502 1.141098409 4
## 10 2 5.320655 0.078183397 5
When we assume the observations are independent and identically distributed, a simple linear regression can be used to investigate the linear relationship between \(Y\) and \(X\), as below: \[y_i=\beta_0 + \beta_1 x_i+ e_i, \quad i=1,2,\ldots,100\] \[e_i \sim N(0, \sigma^2)\]
library(ggplot2)
model1 = lm(Y~X, data=data)
model1$coefficients
## (Intercept) X
## 7.664295 3.248211
ggplot(data, aes(x=X, y=Y)) +
  geom_point() +
  geom_abline(intercept = model1$coefficients[1], slope = model1$coefficients[2],
              color = "red", linetype = "dashed") +
  theme_bw()
Interpretation:
For all subjects in the study population, a 1 unit increase in \(X\) is associated with a 3.248211 unit increase in \(Y\) on average; the expected value of \(Y\) at \(X=0\) is about 7.6642945.
However, the i.i.d. assumption is not valid in many studies that involve repeated measurements per subject or a hierarchical design. For example, the data set above can be viewed in two ways: 1) 20 subjects (ID_i), each with 5 measurements (j); the 5 measurements within a subject are correlated, while the 20 subjects are independent of each other; 2) 20 sites (ID_i), each with 5 subjects (j); the 5 subjects within a site share more in common and are thus correlated, while the 20 sites are independent of each other. We will adopt the first scenario, but the model interpretations apply equally to the second. The research question is still the linear association between \(X\) and \(Y\).
Below is the scatter plot; colors represent subjects (or sites).
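The plot can be reproduced along these lines (a minimal sketch; as.factor(ID_i) gives each subject a discrete color):
ggplot(data, aes(x = X, y = Y, col = as.factor(ID_i))) +
  geom_point() +
  theme_bw() +
  theme(legend.position = "none")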
A linear mixed-effects model, or subject-specific model, assumes that the mean response \(Y\) depends on the covariates \(X\) and unobserved random effects \(b_i\), with random-effects design matrix \(Z\). It explores the longitudinal trend within each subject. The model is as below: \[E(Y_{ij}|X_{ij},Z_{ij},b_i)=X_{ij}'\beta+Z_{ij}'b_i\] \[X_{ij}=Z_{ij}=(1, x_{ij})'\]
\[y_{ij} = \beta_0+b_{0i}+(\beta_1+b_{1i})x_{ij}+e_{ij}, \quad i=1,2,\ldots,20,\; j=1,2,\ldots,5\] \[b_i=(b_{0i},b_{1i})' \sim N(0,G), \quad G=\begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{12} & \sigma_{2}^2 \end{pmatrix}\] \[e_i \sim N(0, R_i), \quad R_i=\sigma_e^2 I_j=\begin{pmatrix}\sigma_e^2 &0&0&0&0\\ 0 & \sigma_{e}^2&0&0&0 \\ 0&0&\sigma_{e}^2&0&0 \\ 0&0&0&\sigma_{e}^2&0 \\ 0&0&0&0& \sigma_{e}^2 \end{pmatrix} \in \mathbb{R}^{j \times j}\]
The conditional distribution of \(Y\) given the covariates \(X\) and random effects \(b\) is: \[Y|X,Z,b \sim N(X\beta+Zb, R)\] \[R=\begin{pmatrix}R_1 & 0 & \cdots & 0\\ 0 & R_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_n \end{pmatrix} \in \mathbb{R}^{nj \times nj}\]
The marginal distribution of \(Y\) is: \[Y|X,Z \sim N(X\beta, V)\] \[V=Z \tilde GZ'+R\] \[\tilde G= \begin{pmatrix}G & 0 & \cdots & 0\\ 0 & G & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & G \end{pmatrix} \in \mathbb{R}^{2n \times 2n}\]
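To make \(V\) concrete, here is a small sketch computing the implied marginal covariance \(V_i = Z_iGZ_i' + \sigma_e^2 I_j\) for a single subject, using the true \(G\) and \(\sigma_e\) from the simulation code above; the five x-values are arbitrary illustrative choices:
### V_i = Z_i G Z_i' + sigma_e^2 I_j for one subject with j = 5
G = matrix(c(2, 1, 1, 10), ncol = 2)  # true G used in simulate()
sigma_e = 10                          # true residual SD used in simulate()
x_i = c(2, 4, 6, 8, 10)               # illustrative x-values (assumed)
Z_i = cbind(1, x_i)                   # 5 x 2 random-effects design matrix
V_i = Z_i %*% G %*% t(Z_i) + sigma_e^2 * diag(5)
round(V_i, 1)
The off-diagonal entries of \(V_i\) are nonzero: this is exactly the within-subject correlation that the simple linear regression ignores.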
The estimators are commonly obtained by the restricted maximum likelihood (REML) approach, which is the default in lme4's lmer().
library(lme4)
model2 = lmer(Y ~ X + (1 + X | ID_i), data = data)
### subject-specific coefficients for each i:
### coef() returns the fixed effects plus each subject's predicted
### random effects, i.e. beta0+b0_i and beta1+b1_i
### (the random effects alone are given by ranef(model2)$ID_i)
model2_coef = coef(model2)$ID_i
colnames(model2_coef) = c("intercept", "slope")
model2_coef$ID_i = as.factor(seq(1, nrow(model2_coef)))
### fixed effect coefficients
model2_coef$beta0 = summary(model2)$coefficients[1,1]
model2_coef$beta1 = summary(model2)$coefficients[2,1]
model2_coef
##    intercept       slope ID_i    beta0    beta1
## 1  12.300368  0.87602287    1 11.47776 2.316023
## 2  11.397239  2.45697543    2 11.47776 2.316023
## 3  13.124346 -0.56637444    3 11.47776 2.316023
## 4  11.195103  2.81082069    4 11.47776 2.316023
## 5  13.160061 -0.62889383    5 11.47776 2.316023
## 6   8.507121  7.51621104    6 11.47776 2.316023
## 7  10.109412  4.71135453    7 11.47776 2.316023
## 8   9.868724  5.13268536    8 11.47776 2.316023
## 9  11.176810  2.84284335    9 11.47776 2.316023
## 10  8.510700  7.50994604   10 11.47776 2.316023
## 11 13.398866 -1.04692945   11 11.47776 2.316023
## 12 12.606876  0.33947253   12 11.47776 2.316023
## 13  8.554921  7.43253576   13 11.47776 2.316023
## 14 12.742651  0.10179419   14 11.47776 2.316023
## 15 12.652666  0.25931499   15 11.47776 2.316023
## 16 12.533478  0.46795640   16 11.47776 2.316023
## 17 12.555825  0.42883858   17 11.47776 2.316023
## 18 11.949515  1.49020031   18 11.47776 2.316023
## 19 10.385467  4.22811387   19 11.47776 2.316023
## 20 12.825037 -0.04242507   20 11.47776 2.316023
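As a quick sanity check (a sketch, output not shown), the subject-specific slopes returned by coef() should equal the fixed-effect slope plus the random slopes from ranef():
### the discrepancy should be numerically zero
max(abs(model2_coef$slope - (fixef(model2)["X"] + ranef(model2)$ID_i$X)))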
ggplot(data, aes(x=X, y=Y, col=ID_i)) +
  geom_point() +
  geom_abline(intercept = model2_coef$intercept, slope = model2_coef$slope,
              col = model2_coef$ID_i) +
  theme_bw() +
  theme(legend.position = "none")
Interpretation:
For subjects in the study population, a 1 unit increase in \(X\) is associated with a 2.3160232 unit increase in \(Y\) on average; the expected value of \(Y\) at \(X=0\) is 11.4777593 on average.
For subject #1, a 1 unit increase in \(X\) is associated with a 0.8760229 unit increase in \(Y\), and the subject-specific intercept is 12.300368; for subject #2, the slope is 2.4569754 and the intercept is 11.397239; for subject #3, the slope is -0.5663744 (a decrease in \(Y\)) and the intercept is 13.124346; and so on. Subject #6 experiences the largest increase in \(Y\) per 1 unit increase in \(X\) (slope 7.5162110), and subject #11 has the largest intercept (13.398866).
Marginal or population-averaged models assume that the mean response \(Y\) depends on the covariates \(X\) only, and not on random effects. Compared with the linear mixed-effects model, this model absorbs the random effects \(b_i\) into the residuals \(e_i\); the variance-covariance matrix of \(e_i\) is therefore no longer diagonal. The model is as below:
\[y_{ij} = \beta_0+\beta_1x_{ij}+e_{ij}, \quad i=1,2,\ldots,20,\; j=1,2,\ldots,5\] \[e_i \sim N(0, V_i), \quad V_i=A_i^{1/2}R_i(\alpha)A_i^{1/2}\]
\[A_i=\begin{pmatrix}Var(Y_{i1}) & 0 & \cdots & 0\\ 0 & Var(Y_{i2}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Var(Y_{ij}) \end{pmatrix} \in \mathbb{R}^{j \times j}\]
\[R_i(\alpha)= \begin{pmatrix}1 & Corr(Y_{i1},Y_{i2}) & \cdots & Corr(Y_{i1},Y_{ij})\\ Corr(Y_{i2},Y_{i1}) & 1 & \cdots & Corr(Y_{i2},Y_{ij}) \\ \vdots & \vdots & \ddots & \vdots \\ Corr(Y_{ij},Y_{i1}) & Corr(Y_{ij},Y_{i2}) & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{j \times j}\]
\(R_i(\alpha)\) is called the working correlation matrix. Common choices include exchangeable, first-order autoregressive (AR-1), unstructured, and independence.
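As a concrete illustration (alpha = 0.3 is an arbitrary value chosen for display), the exchangeable and AR-1 working correlation matrices for \(j=5\) look like this:
### Exchangeable: constant correlation alpha between any two time points
alpha = 0.3
R_exch = matrix(alpha, nrow = 5, ncol = 5)
diag(R_exch) = 1
### AR-1: correlation decays as alpha^|j - j'|
R_ar1 = alpha^abs(outer(1:5, 1:5, "-"))
R_exch
R_ar1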
\[Var(e)=V=\begin{pmatrix}V_1 & 0 & \cdots & 0\\ 0 & V_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & V_n \end{pmatrix} \in \mathbb{R}^{nj \times nj}\]
The marginal linear model implies that \(Y|X \sim N(X\beta, V)\); the most common estimation approach is generalized estimating equations (GEE). If the correlation within subjects is ignored, \(V_i\) is modeled as a diagonal matrix (which is wrong), but \(\hat \beta\) is still consistent because the estimating equations remain unbiased as long as the mean model \(E(Y_{ij}|X_{ij})\) is correctly specified. However, \(\hat \beta\) is then not efficient; a sketch after the model fit below illustrates refitting under different working correlation structures.
library(geepack)
model3 = geeglm(Y ~ X, id = ID_i, data = data, family = gaussian, corstr = "unstructured")
model3$coefficients
## (Intercept) X
## 16.080302 2.427784
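The choice of working correlation structure can be checked by refitting; the GEE point estimates should stay close even when the structure is misspecified (a minimal sketch, output not shown):
### refit under alternative working correlation structures
model3_ind = geeglm(Y ~ X, id = ID_i, data = data, family = gaussian,
                    corstr = "independence")
model3_exch = geeglm(Y ~ X, id = ID_i, data = data, family = gaussian,
                     corstr = "exchangeable")
rbind(unstructured = model3$coefficients,
      independence = model3_ind$coefficients,
      exchangeable = model3_exch$coefficients)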
ggplot(data, aes(x=X, y=Y, col=ID_i)) +
  geom_point() +
  geom_abline(intercept = model3$coefficients[1], slope = model3$coefficients[2]) +
  theme_bw() +
  theme(legend.position = "none")
Interpretation:
For subjects in the study population, a 1 unit increase in \(X\) is associated with a 2.4277836 unit increase in \(Y\) on average; the expected value of \(Y\) at \(X=0\) is 16.0803015 on average.
compare = matrix(NA, nrow=3, ncol=4)
rownames(compare) = c("simple linear regression","linear mixed-effects","marginal linear")
colnames(compare) = c("intercept beta0","slope beta1","beta0 SE","beta1 SE")
compare[1,1:2] = summary(model1)$coefficients[,1]
compare[1,3:4] = summary(model1)$coefficients[,2]
compare[2,1:2] = summary(model2)$coefficients[,1]
compare[2,3:4] = summary(model2)$coefficients[,2]
compare[3,1:2] = summary(model3)$coefficients[,1]
compare[3,3:4] = summary(model3)$coefficients[,2]
compare
##                          intercept beta0 slope beta1 beta0 SE  beta1 SE
## simple linear regression 7.664295 3.248211 3.352107 0.5700957
## linear mixed-effects 11.477759 2.316023 1.794907 0.7361729
## marginal linear 16.080302 2.427784 2.900012 0.5684033
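The simple linear regression SEs in the table ignore the within-subject correlation. One way to gauge the impact, assuming the sandwich and lmtest packages are installed, is to recompute cluster-robust standard errors for model1 (a sketch, output not shown):
library(sandwich)
library(lmtest)
### sandwich (cluster-robust) standard errors, clustered on subject
coeftest(model1, vcov = vcovCL(model1, cluster = data$ID_i))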
Linear mixed-effects model: 1) the estimation relies on parametric (normality) assumptions; 2) it gives the subject-specific association between \(Y\) and \(X\), answering questions about subject-specific effects.
Marginal model: 1) the estimation does not rely on parametric distributional assumptions, only on a correctly specified mean model \(E(Y_{ij}|X_{ij})\); 2) it requires specifying a working correlation structure \(R_i(\alpha)\); 3) it answers questions about population-averaged effects.