pacman::p_load(knitr, simstudy, tidyverse, nlme, rms)

1 The study

An experiment was conducted to study two new reading programs in the fourth grade. Thirty subjects were randomly assigned to three conditions: a control (traditional program), and two experimental programs. Before the study took place, a test was administered to obtain grade-equivalent reading levels for the subjects. At the end of the study, six months later, similar reading evaluation scores were obtained.

Each subject has equal chance to be assigned to either the Control group or the Experimental groups. The assignment has no influence on the pre measurement. The post measurement depends on group and on the baseline measurement (Pre).

2 Data

dta <- read.csv('reading_g3.csv')

2.1 Wide to long format

dtaL <- gather(data=dta, key=Time, value=Score, Pre:Post, factor_key=TRUE)  
kable(head(arrange(dtaL, ID), 10))
Group ID Time Score
C S11 Pre 32
C S11 Post 41
C S12 Pre 42
C S12 Post 46
C S13 Pre 45
C S13 Post 48
C S14 Pre 46
C S14 Post 54
C S15 Pre 49
C S15 Post 52

3 Visualization

We inspect the data with a boxplot and a scatter plot.

boxp <- ggplot(dtaL , 
               aes(x=Time, y=Score, col=Group))+ 
  geom_boxplot(outlier.size=0) + 
  geom_point(aes(fill=Group, col=NULL), 
             shape=21, alpha=0.5, size=2, 
             position=position_jitterdodge(jitter.width=0.2)) + 
  theme_bw() + 
  xlab("") 
boxp

ggplot(dta, 
       aes (x=Pre, y=Post, col=Group)) + 
  geom_point() + 
  geom_smooth(method="lm", se=FALSE) + 
  labs(x="Pre-test reading score",
       y="Post-test reading score")+
  theme_minimal()

4 The constrained baseline model

The terms of the most basic constrained model are ‘Time’ and ‘Time * Group’. By omitting ‘Group’, there is no term allowing for a difference in means at baseline between the groups. Consequently, the baseline means are assumed to be equal. However, a general rule in regression modeling is that when there is an interaction in the model, the lower ranked effects should also be included in the model. For this reason, R automatically includes Time and Group as main effects in the model when including the interaction term Time*Group. To avoid to have Group as the main effect in our model, we will create an alternative model matrix: Xalt.

X <- model.matrix(~ Time * Group, data = dtaL)
colnames(X)
[1] "(Intercept)"      "TimePost"         "GroupT1"          "GroupT2"         
[5] "TimePost:GroupT1" "TimePost:GroupT2"
Xalt <- X[, c("TimePost", "TimePost:GroupT1", "TimePost:GroupT2")] 
colnames(Xalt)
[1] "TimePost"         "TimePost:GroupT1" "TimePost:GroupT2"

5 Analysis

The gls() function of the ‘nlme’ package can be used for modeling data.

The standard deviations and correlations that should be estimated are defined by the constant variance function varIdent(). By setting weights = varIdent(form = ~ 1 | Time) a separate standard deviation will be estimated for each time point and a separate correlation will be estimated for each pair of time points (= unstructured variance-covariance matrix). In this example, we expect an estimated standard deviation for Time=Pre, an estimated standard deviation for Time=Post and one estimated correlation between the pre- and post reading measurements of the same subject.

By setting weights = varIdent(form = ~ 1 | Time:Group), a separate variance is estimated for each combination of Group and Time (Pre-C Post-C Pre-T1 Post-T1 Pre-T2 Post-T2).

The argument correlation=corSymm (form = ~ 1 | ID) defines the subject levels. The correlation structure is assumed to apply only to observations within the same subject (here: ID); observations from different subjects (a different value for ‘ID’) are assumed to be uncorrelated.

m0_gls <- gls(Score ~ Xalt, 
                weights = varIdent(form = ~ 1 | Time), 
                correlation=corSymm (form = ~ 1 | ID), 
                data = dtaL)
summary(m0_gls)
Generalized least squares fit by REML
  Model: Score ~ Xalt 
  Data: dtaL 
       AIC     BIC    logLik
  350.2136 364.391 -168.1068

Correlation Structure: General
 Formula: ~1 | ID 
 Parameter estimate(s):
 Correlation: 
  1   
2 0.91
Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | Time 
 Parameter estimates:
     Pre     Post 
1.000000 1.016413 

Coefficients:
                        Value Std.Error  t-value p-value
(Intercept)          50.50000 1.2296621 41.06819  0.0000
XaltTimePost          6.38144 0.9027047  7.06924  0.0000
XaltTimePost:GroupT1 -2.80337 1.2699097 -2.20753  0.0314
XaltTimePost:GroupT2  7.15906 1.2699097  5.63745  0.0000

 Correlation: 
                     (Intr) XltTmP XTP:GT1
XaltTimePost         -0.102               
XaltTimePost:GroupT1  0.000 -0.703        
XaltTimePost:GroupT2  0.000 -0.703  0.500 

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-2.7467890 -0.3836191  0.1374178  0.7475076  1.4493709 

Residual standard error: 6.735137 
Degrees of freedom: 60 total; 56 residual

The estimated correlation between Pre and Post is 0.91. The estimated standard deviation for the Pre measurement is 6.735 and the estimated standard deviation for the Post measurement is 1.016 * 6.735.

5.1 Prediction

We can obtain predicted means using the gls() function. As expected, the predicted value is the same for three groups at baseline (Time == Pre).

To obtain the confidence intervals, the Gls() function of the rms package can be used exactly as the gls() function, leading to exactly the same predictions but with confidence intervals.

m0_Gls <- Gls(Score ~ Xalt, 
                weights = varIdent(form = ~ 1 | Time), 
                correlation=corSymm (form = ~ 1 | ID), 
                data = dtaL)
predictions <- cbind(dtaL, predict(m0_Gls, dtaL, conf.int=0.95))
predictions$Group2 <- factor(predictions$Group, levels=c("All", "C", "T1", "T2"))
predictions$Group2[predictions$Time=="Pre"] <- "All"
pd <- position_dodge(.08)
limits <- aes(ymax=upper , ymin=lower, shape=Group2)
pCI <- ggplot(predictions, 
              aes(y=linear.predictors, 
                  x=Time)) + 
  geom_errorbar(limits, 
                width=0.09, 
                position=pd) +
  geom_line(aes(group=Group, 
                y=linear.predictors), 
            position=pd)+
  geom_point(aes(fill=Group2),
             position=pd, 
             shape=21, 
             size=rel(4)) + 
  scale_fill_manual(values=c("white", "gray80", "gray20", "black")) + 
  theme_minimal() + 
  theme(panel.grid.major=element_blank(), 
        legend.title=element_blank(), 
        text=element_text(size=14), 
        legend.position="bottom") + 
  labs(x="Time", 
       y="Estimated mean with corresponding 95% CI")
pCI

6 Non-randomized studies

In the case of non-randomized experiments or studies an equal mean at baseline cannot be assumed. Consequently, applying constrained model is not appropriate.

7 Reference

Timm, N.H. (1975). Multivariate Analysis with Applications in Education and Psychology, p. 490.

Liu, G.F., Lu, K., Mogg, R., et al. (2009). Should baseline be a covariate or dependent variable in analyses of change from baseline in clinical trials? Statistics in Medicine, 28, 2509-2530.