1. Data set

Suppose we have the data set simulated below: 100 pairs of the outcome \(Y\) and the independent variable \(X\). We want to investigate the linear association between \(Y\) and \(X\).

The first ten rows of the data set are shown below:

library(MASS)
set.seed(1)
# consisting of i study participants
# with j observations per participant
simulate = function(i,j){
  # The number of subjects is i
  N = i
  # The number of repeated measurements is j
  timepoints = j
  # Simulate errors from N(0, sigma_e^2) with sigma_e = 10
  e_ij = rnorm(n=N*timepoints,mean=0,sd=10)
  # Generate ID variable
  id = c(sapply(1:N, function(x) rep(x, timepoints)))
  # Generate within-subject observation index j
  within_id = rep(seq(1,timepoints),N)
  # Simulate random effects from N(0,G)
  b_i = mvrnorm(n=N, mu=c(0,0), Sigma=matrix(c(2,1,1,10),ncol=2))
  rows = rep(1:nrow(b_i), each=timepoints)
  b_i = b_i[rows,]
  # Simulate X from N(5,3^2)
  x = rnorm(n=N*timepoints,mean=5,sd=3)
  # set true parameter vector
  beta = c(10,2)
  # Simulate response from a linear mixed-effects model
  y = beta[1]+b_i[,1] + (beta[2]+b_i[,2])*x+e_ij
  # Return data frame
  data.frame(ID_i=id,Y=y,X=x,j=within_id)
}
data = simulate(i=20, j=5)
data[1:10,]
##    ID_i         Y            X j
## 1     1  4.214745 -0.743078277 1
## 2     1 11.910365  8.529749936 2
## 3     1  2.090292  0.005082691 3
## 4     1 26.241827  3.609408796 4
## 5     1 13.669652  1.652239685 5
## 6     2  6.463692  2.747542996 1
## 7     2 39.616939 11.261499637 2
## 8     2 27.485530  5.052186859 3
## 9     2 16.638502  1.141098409 4
## 10    2  5.320655  0.078183397 5

2. When the i.i.d. assumption is satisfied

When we assume the observations are independent and identically distributed, a simple linear regression can be used to investigate the linear relationship between \(Y\) and \(X\) as below: \[y_i=\beta_0 + \beta_1 x_i+ e_i, \quad i=1,2,\ldots,100\] \[e_i \sim N(0, \sigma^2)\]

library(ggplot2)
model1 = lm(Y~X, data=data)
model1$coefficients
## (Intercept)           X 
##    7.664295    3.248211
ggplot(data, aes(x=X, y=Y)) + 
  geom_point() +
  geom_abline(intercept=model1$coefficients[1], slope = model1$coefficients[2], color="red",
              linetype="dashed") +
  theme_bw()

Interpretation:

For all subjects in the study population, a 1-unit increase in \(X\) is associated with a 3.248211-unit increase in \(Y\) on average; the expected value of \(Y\) at \(X=0\) is about 7.664295.

3. When the i.i.d. assumption is not satisfied

However, the i.i.d. assumption does not hold in many studies that involve repeated measurements on each subject or a hierarchical design. For example, the data set above can be viewed as: 1) 20 subjects (ID_i), each with 5 measurements (j); the 5 measurements within a subject are correlated, while the 20 subjects are independent of each other; or 2) 20 sites (ID_i), each with 5 subjects (j); the 5 subjects within a site share more in common and are therefore correlated, while the 20 sites are independent of each other. We adopt the first framing below, but the model interpretation applies equally to the second. The research question is still the linear association between \(X\) and \(Y\).

Below is the scatter plot; colors represent subjects (or sites).
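A minimal plotting sketch in the same ggplot2 style as the later figures (using as.factor(ID_i) to give each subject a discrete color is an assumption here):

# scatter plot of Y vs X, colored by subject
ggplot(data, aes(x=X, y=Y, col=as.factor(ID_i))) + 
  geom_point() +
  theme_bw() + 
  theme(legend.position = "none")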

3.1 Linear mixed-effects model (random intercept random slope)

The linear mixed-effects model, or subject-specific model, assumes that the mean response \(Y\) depends on the covariates \(X\) and on unobserved random effects \(Z\). It models the longitudinal trend within each subject. The model is as below: \[E(Y_{ij}|X_{ij},Z_{ij},b_i)=X_{ij}'\beta+Z_{ij}'b_i\] \[X_{ij}=Z_{ij}=(1, x_{ij})'\]

\[y_{ij} = \beta_0+b_{0i}+(\beta_1+b_{1i})x_{ij}+e_{ij}, \quad i=1,2,\ldots,20,\; j=1,2,\ldots,5\] \[b_i=(b_{0i},b_{1i})' \sim N(0,G), \quad G=\begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{12} & \sigma_{2}^2 \end{pmatrix}\] \[e_i \sim N(0, R_i), \quad R_i=\sigma_e^2 I_5=\begin{pmatrix}\sigma_e^2 &0&0&0&0\\ 0 & \sigma_{e}^2&0&0&0 \\ 0&0&\sigma_{e}^2&0&0 \\ 0&0&0&\sigma_{e}^2&0 \\ 0&0&0&0& \sigma_{e}^2 \end{pmatrix} \in \mathbb{R}^{j \times j}\]

The conditional distribution of \(Y\) given the covariates \(X\) and random effects \(b\) is: \[Y|X,Z,b \sim N(X\beta+Zb, R)\] \[R=\begin{pmatrix}R_1 & 0 & \cdots & 0\\ 0 & R_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_n \end{pmatrix} \in \mathbb{R}^{(nj) \times (nj)}\]

The marginal distribution of \(Y\) is: \[Y|X,Z \sim N(X\beta, V)\] \[V=Z \tilde G Z'+R\] \[\tilde G= \begin{pmatrix}G & 0 & \cdots & 0\\ 0 & G & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & G \end{pmatrix} \in \mathbb{R}^{2n \times 2n}\]
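To make the marginal covariance concrete, here is a minimal sketch that assembles \(V_i = Z_i G Z_i' + R_i\) for subject 1 using the true parameter values from the simulation; the object names (G, Z_1, R_1, V_1) are illustrative only and not part of any fitting routine:

### true var-cov of the random effects (b_0i, b_1i), as in the simulation
G = matrix(c(2,1,1,10), ncol=2)
### random-effects design matrix for subject 1
x_1 = data$X[data$ID_i == 1]
Z_1 = cbind(1, x_1)
### residual covariance sigma_e^2 * I with sigma_e = 10
R_1 = diag(10^2, length(x_1))
### 5 x 5 marginal covariance of subject 1's responses
V_1 = Z_1 %*% G %*% t(Z_1) + R_1
round(V_1, 2)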

The parameters are commonly estimated by the restricted maximum likelihood (REML) approach, which is the default in lmer() below.

library(lme4)
## Loading required package: Matrix
model2 = lmer(data=data, Y~X+(1+X|ID_i))
### subject-specific coefficients for each i:
### coef() returns the sum of the fixed effects and each subject's
### random effects, i.e. intercept_i = beta0 + b0_i and slope_i = beta1 + b1_i
model2_coef = coef(model2)$ID_i
colnames(model2_coef) = c("intercept","slope")
model2_coef$ID_i = as.factor(seq(1, nrow(model2_coef)))
### fixed effect coefficients
model2_coef$beta0 = summary(model2)$coefficients[1,1]
model2_coef$beta1 = summary(model2)$coefficients[2,1]
model2_coef
##    intercept        slope ID_i    beta0    beta1
## 1  12.300368  0.87602287    1 11.47776 2.316023
## 2  11.397239  2.45697543    2 11.47776 2.316023
## 3  13.124346 -0.56637444    3 11.47776 2.316023
## 4  11.195103  2.81082069    4 11.47776 2.316023
## 5  13.160061 -0.62889383    5 11.47776 2.316023
## 6   8.507121  7.51621104    6 11.47776 2.316023
## 7  10.109412  4.71135453    7 11.47776 2.316023
## 8   9.868724  5.13268536    8 11.47776 2.316023
## 9  11.176810  2.84284335    9 11.47776 2.316023
## 10  8.510700  7.50994604   10 11.47776 2.316023
## 11 13.398866 -1.04692945   11 11.47776 2.316023
## 12 12.606876  0.33947253   12 11.47776 2.316023
## 13  8.554921  7.43253576   13 11.47776 2.316023
## 14 12.742651  0.10179419   14 11.47776 2.316023
## 15 12.652666  0.25931499   15 11.47776 2.316023
## 16 12.533478  0.46795640   16 11.47776 2.316023
## 17 12.555825  0.42883858   17 11.47776 2.316023
## 18 11.949515  1.49020031   18 11.47776 2.316023
## 19 10.385467  4.22811387   19 11.47776 2.316023
## 20 12.825037 -0.04242507   20 11.47776 2.316023
ggplot(data, aes(x=X, y=Y, col=ID_i)) + 
  geom_point() +
  geom_abline(intercept=model2_coef$intercept, slope=model2_coef$slope, col=model2_coef$ID_i) +
  theme_bw() + 
  theme(legend.position = "none")

Interpretation:

Averaged over subjects in the study population, a 1-unit increase in \(X\) is associated with a 2.3160232-unit increase in \(Y\); the average expected value of \(Y\) at \(X=0\) is 11.4777593.

For subject #1, a 1-unit increase in \(X\) corresponds to a 0.8760229-unit increase in \(Y\), and the expected \(Y\) at \(X=0\) is 12.300368; for subject #2, the subject-specific slope is 2.4569754 with intercept 11.397239; for subject #3, \(Y\) decreases by 0.5663744 units per 1-unit increase in \(X\), with intercept 13.124346; and so on. Subject #6 has the largest subject-specific slope (7.5162110), and subject #11 has the largest subject-specific intercept (13.398866).
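The estimated variance components (\(\hat G\) and \(\hat \sigma_e\) in the notation above) can also be checked against the true simulation values (\(\sigma_1^2=2\), \(\sigma_{12}=1\), \(\sigma_2^2=10\), \(\sigma_e=10\)); a minimal sketch using lme4's VarCorr():

### estimated random-effect variances/covariance (G-hat) and residual SD (sigma_e-hat)
as.data.frame(VarCorr(model2))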

3.2 Marginal linear model

Marginal, or population-averaged, models assume that the mean response \(Y\) depends on the covariates \(X\) only and not on random effects. Compared with the linear mixed-effects model, this model absorbs the random effects \(b_i\) into the residuals \(e_i\), so the variance-covariance matrix of \(e_i\) is no longer diagonal. The model is as below:

\[y_{ij} = \beta_0+\beta_1x_{ij}+e_{ij}, \quad i=1,2,\ldots,20,\; j=1,2,\ldots,5\] \[e_i \sim N(0, V_i), \quad V_i=A_i^{1/2}R_i(\alpha)A_i^{1/2}\]

\[A_i=\begin{pmatrix}Var(Y_{i1}) & 0 & \cdots & 0\\ 0 & Var(Y_{i2}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Var(Y_{ij}) \end{pmatrix} \in \mathbb{R}^{j \times j}\]

\[R_i(\alpha)= \begin{pmatrix}1 & Corr(Y_{i1},Y_{i2}) & \cdots & Corr(Y_{i1},Y_{ij})\\ Corr(Y_{i2},Y_{i1}) & 1 & \cdots & Corr(Y_{i2},Y_{ij}) \\ \vdots & \vdots & \ddots & \vdots \\ Corr(Y_{ij},Y_{i1}) & Corr(Y_{ij},Y_{i2}) & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{j \times j}\]

\(R_i(\alpha)\) is called the working correlation matrix. Common choices are exchangeable, first-order autoregressive (AR-1), unstructured, and independence; a quick comparison across these choices is sketched at the end of this subsection.

\[Var(e)=V=\begin{pmatrix}V_1 & 0 & \cdots & 0\\ 0 & V_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & V_n \end{pmatrix} \in \mathbb{R}^{(nj) \times (nj)}\]

The marginal linear model implies that \(Y|X \sim N(X\beta, V)\); the most common estimation approach is generalized estimating equations (GEE). If the correlation within subjects is ignored, \(V_i\) is modeled as a diagonal matrix (a misspecification), but \(\hat \beta\) remains consistent because the estimating equations are unbiased even when \(V_i\) is wrong. However, \(\hat \beta\) is then not efficient.

library(geepack)
model3 = geeglm(Y~X,id=ID_i,data=data,family=gaussian,corstr = "unstructured")
model3$coefficients
## (Intercept)           X 
##   16.080302    2.427784
ggplot(data, aes(x=X, y=Y, col=ID_i)) + 
  geom_point() +
  geom_abline(intercept=model3$coefficients[1], slope=model3$coefficients[2]) +
  theme_bw() + 
  theme(legend.position = "none")

Interpretation:

For subjects in the study population, a 1-unit increase in \(X\) is associated with a 2.4277836-unit increase in \(Y\) on average; the expected value of \(Y\) at \(X=0\) is 16.0803015 on average.
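As a quick check of the robustness claim above, here is a minimal sketch refitting the same mean model under the other working correlation structures (all are valid corstr values in geepack); the point estimates of \(\beta\) should be broadly similar across choices, while their efficiency differs:

model3_exch = geeglm(Y~X, id=ID_i, data=data, family=gaussian, corstr="exchangeable")
model3_ar1 = geeglm(Y~X, id=ID_i, data=data, family=gaussian, corstr="ar1")
model3_ind = geeglm(Y~X, id=ID_i, data=data, family=gaussian, corstr="independence")
### point estimates under each working correlation structure
rbind(unstructured=model3$coefficients, exchangeable=model3_exch$coefficients,
      ar1=model3_ar1$coefficients, independence=model3_ind$coefficients)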

4. Comparison

compare = matrix(NA, nrow=3, ncol=4)
rownames(compare) = c("simple linear regression","linear mixed-effects","marginal linear")
colnames(compare) = c("intercept beta0","slope beta1","beta0 se","beta1 se")
compare[1,1:2] = summary(model1)$coefficients[,1]
compare[1,3:4] = summary(model1)$coefficients[,2]
compare[2,1:2] = summary(model2)$coefficients[,1]
compare[2,3:4] = summary(model2)$coefficients[,2]
compare[3,1:2] = summary(model3)$coefficients[,1]
compare[3,3:4] = summary(model3)$coefficients[,2]
compare
##                          intercept beta0 slope beta1 beta0 se  beta1 se
## simple linear regression        7.664295    3.248211 3.352107 0.5700957
## linear mixed-effects           11.477759    2.316023 1.794907 0.7361729
## marginal linear                16.080302    2.427784 2.900012 0.5684033

5. Mixed-effects vs Marginal model

Linear mixed-effects: 1) estimation relies on parametric (normality) assumptions; 2) it gives the subject-specific association between \(Y\) and \(X\), answering questions about subject-specific effects.

Marginal: 1) estimation does not rely on full parametric assumptions; it only requires a correctly specified mean model \(E(Y_{ij}|X_{ij})\); 2) it requires specifying a working correlation structure \(R_i(\alpha)\); 3) it answers questions about population-averaged effects.
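To see the two estimands side by side in code, here is a minimal sketch comparing population-averaged and subject-specific fitted values from the mixed model fitted in 3.1 (re.form=NA tells lme4's predict method to set the random effects to zero):

### population-averaged fitted values: X*beta only
pred_pop = predict(model2, re.form=NA)
### subject-specific fitted values: X*beta + Z*b_i
pred_subj = predict(model2)
head(cbind(population=pred_pop, subject_specific=pred_subj))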