Prepost analysis with continuous data using R - Part 1

Mark Bounthavong

25 December 2024

Introduction

Prepost study design is common in policy research, particularly when we can measure an outcome before and after an intervention is implemented. There are a variety of statistical methods that can be applied to a prepost study design. The type of statistical analysis will depend on the data type (e.g., continuous or categorical), the frequency of data collection across time, and the number of groups.

For instance, a single-group prepost study design with two time periods (before and after an intervention) can be analyzed using a paired t-test if the data are continuous. The hypotheses for a paired t-test for a single group are:

\[\begin{aligned} H_0: \mu_{pre} = \mu_{post} \\ H_a: \mu_{pre} \neq \mu_{post} \end{aligned}\]
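As a minimal sketch of the single-group case, the snippet below runs a paired t-test on made-up values. The vectors `pre` and `post` are hypothetical and are not part of the tutorial dataset. It also shows that a paired t-test gives the same result as a one-sample t-test on the within-person differences.

```r
## Hypothetical pre/post measurements for six people (illustrative values only)
pre  <- c(4, 2, 3, 5, 1, 3)
post <- c(8, 6, 9, 7, 4, 5)

## Paired t-test of H0: mu_pre = mu_post
paired <- t.test(post, pre, paired = TRUE)

## Equivalent one-sample t-test on the within-person differences
one_sample <- t.test(post - pre, mu = 0)

paired$p.value
one_sample$p.value
```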

But what about when you have two groups? A paired t-test might not be enough partly because we will have to evaluate two differences:

\[\begin{aligned} \Delta_{Group1} = Group1_{post} - Group1_{pre} \\ \Delta_{Group2} = Group2_{post} - Group2_{pre} \end{aligned}\]

In this situation where we have two groups that have a continuous outcome that is changing before and after implementation of an intervention, we will need more than a simple paired t-test to determine if there is a statistically significant difference in the rate of change between the two groups. Essentially, we are comparing two differences, which is sometimes referred to as the “difference-in-differences” estimation.

Figure 1 provides an illustration of the prepost study design with two groups. The two groups have different starting points for pre period average number of chocolates consumed. But you can see that one group had an increase in the average number of chocolates consumed in the post period.

Figure 1. Two-group prepost study design with continuous outcome.


There are several statistical approaches to analyze the difference in the change in the average number of chocolates consumed between the two groups. We will cover several of these approaches:

  • Independent t-test

  • Linear regression model

  • One-way Analysis of Covariance (ANCOVA)

  • Linear mixed effects model

In the first part of this article, we will cover the first two.

Motivating example

Let’s assume that there was a problem with chocolate consumption. People were not eating enough chocolate!

Policymakers implemented a campaign (intervention) to highlight the benefits of chocolate to get people to consume more. We will use a simple dataset to illustrate how we can use some statistical analyses to evaluate the difference between two groups before and after an intervention.

In this hypothetical prepost study, researchers were interested in the impact of this intervention on the average number of chocolates consumed. One group received the intervention (exposed), but the other group did not (un-exposed). The dataset has 20 patients who consumed chocolate before and after the implementation of an intervention. The data has several variables:

  • patientid: This is a unique identifier of the patient

  • group: This is the grouping variable. There are two groups in this dataset. (Note: R will initially read this as a numeric (continuous) variable. We will need to use the as.factor() function to change it into a categorical variable.)

  • endpoint1: This is the number of chocolates consumed by the patient

  • time: This is the indicator for the pre and post periods (1 = pre, 2 = post). (Note: R will initially read this as a numeric (continuous) variable. We will need to use the as.factor() function to change it into a categorical variable.)
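The conversion from numeric codes to categories can be sketched on a toy data frame. The data frame `toy_codes` and the text labels are hypothetical additions for illustration; the tutorial itself keeps the original 0/1 and 1/2 codes.

```r
## Toy data frame mimicking how the CSV is read: codes arrive as numbers
toy_codes <- data.frame(group = c(0, 0, 1, 1),
                        time  = c(1, 2, 1, 2))
class(toy_codes$group)

## Convert the codes to factors and attach readable labels (labels are illustrative)
toy_codes$group <- factor(toy_codes$group, levels = c(0, 1),
                          labels = c("unexposed", "exposed"))
toy_codes$time  <- factor(toy_codes$time, levels = c(1, 2),
                          labels = c("pre", "post"))

str(toy_codes)
levels(toy_codes$time)
```

Using factor() with explicit levels makes the reference category (the first level) visible, which matters later when reading regression coefficients.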

## Load packages
#### pacman makes loading packages easier
if (!require("pacman")) install.packages("pacman"); library("pacman") 

#### p_load() will check if the package is installed; if "yes" it will load the package, if "no" it will install and then load the package
p_load("ggplot2", 
       "psych", 
       "tidyverse",
       "lessR",
       "Hmisc",
       "readr",
       "lme4",
       "ggeffects",
       "margins",
       "gtsummary"
       )

## Load data from GitHub
urlfile <- "https://raw.githubusercontent.com/mbounthavong/R-tutorials/refs/heads/main/Data/long1.csv"

long1 <- read.csv(urlfile, header = TRUE)

long1$group <- as.factor(long1$group) ## Make the grouping variable a factor instead of continuous
long1$time <- as.factor(long1$time) ## Make the time variable a factor instead of continuous

## View data in long format - head() lists the first six rows of the dataframe
head(long1)
##   patientid group endpoint1 time age
## 1         1     0         4    1  66
## 2         1     0         8    2  66
## 3         2     0         2    1  78
## 4         2     0         6    2  78
## 5         3     0         3    1  56
## 6         3     0         9    2  56

We can summarize the data so that we can estimate the mean outcome before and after the intervention was implemented.

#### Group & Summarize ####
long1 %>%
  group_by(group, time) %>%
  summarise(n_distinct(patientid),
            endpoint = mean(endpoint1),
            sd(endpoint1))
## # A tibble: 4 × 5
## # Groups:   group [2]
##   group time  `n_distinct(patientid)` endpoint `sd(endpoint1)`
##   <fct> <fct>                   <int>    <dbl>           <dbl>
## 1 0     1                          11     2.64            1.21
## 2 0     2                          11     6.18            3.16
## 3 1     1                           9     7.33            4.36
## 4 1     2                           9    13               5.15

Here is a summary of the average number of chocolates consumed before and after implementation of the intervention.

Table 1a. Average number of chocolates consumed before and after implementation of the intervention.


We can visualize the change in average chocolate consumption using the ggplot2 package in R.

## Connected plot
plot1 <- long1 %>%
            group_by(group, time) %>%
            summarise(n_distinct(patientid),
                      endpoint = mean(endpoint1),
                      sd(endpoint1))

ggplot(plot1, aes(x = time, y = endpoint, col = group, group = group)) + 
    geom_point(aes(colour = factor(group))) + 
    geom_line(show.legend = FALSE) + 
    ylab("Average number of chocolates consumed") + 
    xlab("Before and After") 

We can see that the average consumption of chocolate increased for both groups. Moreover, Group 1 appears to have a larger increase in consumption compared to Group 0. However, is this change in chocolate consumption significantly greater in Group 1 than in Group 0?

To find out, we need to perform a statistical test.
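Before running a test, the size of the gap can be worked out by hand from the four group means in the summary output above. This is only a back-of-the-envelope sketch: the inputs are the rounded means, and the variable names are made up for illustration.

```r
## Group means copied (rounded) from the summary output above
pre_unexposed  <- 2.64
post_unexposed <- 6.18
pre_exposed    <- 7.33
post_exposed   <- 13

## Change within each group (post minus pre)
change_unexposed <- post_unexposed - pre_unexposed
change_exposed   <- post_exposed - pre_exposed

## Difference-in-differences: exposed change minus unexposed change
did_by_hand <- change_exposed - change_unexposed
did_by_hand
```

The result is a little over 2 chocolates. Because the inputs are rounded, it can differ in the second decimal place from an estimate computed on the raw data.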

Independent t test

We can estimate the difference in the number of chocolates consumed before and after for each individual and then compare that difference between the groups.

\[\begin{aligned} \Delta_{Group1} = Group1_{post} - Group1_{pre} \\ \Delta_{Group2} = Group2_{post} - Group2_{pre} \end{aligned}\]

We need to reshape the data to wide format.

### Reshape the dataframe from long to wide
wide1 <- reshape(long1, idvar = "patientid", timevar = "time", direction = "wide")
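To see what reshape() does to the column names, here is a small sketch on a toy long-format data frame. The object names `toy_long` and `toy_wide` and the values are hypothetical. Because no `v.names` argument is given, every column other than the id and time variables is treated as time-varying and receives a `.1` or `.2` suffix, which is why the grouping variable appears twice.

```r
## Toy long-format data: two patients, two time points each (illustrative values)
toy_long <- data.frame(patientid = c(1, 1, 2, 2),
                       group     = c(0, 0, 1, 1),
                       endpoint1 = c(4, 8, 2, 6),
                       time      = c(1, 2, 1, 2))

## One row per patient; every non-id column gets a ".<time>" suffix
toy_wide <- reshape(toy_long, idvar = "patientid", timevar = "time",
                    direction = "wide")

names(toy_wide)
nrow(toy_wide)
```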

Once the dataframe is in the wide format, we will rename some columns. After renaming the columns, we will select only those variables that will be useful for our analysis.

### Rename the dataframe columns
wide1 <- wide1 %>%
            dplyr::rename(
                   group1 = group.1,
                   group2 = group.2, 
                   endpoint_pre = endpoint1.1, 
                   endpoint_post = endpoint1.2,
                   age = age.1
                   )

wide2 <- wide1 %>%
            dplyr::select(patientid,
                          group1,
                          endpoint_pre,
                          endpoint_post,
                          age)

Next, we need to calculate the difference in chocolate consumption for each patient before and after the implementation of the intervention.

Then, we can estimate the average difference and standard deviation between the groups.

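## Calculate each patient's change in chocolate consumption (post minus pre)
wide2$diff <- wide2$endpoint_post - wide2$endpoint_pre

## We will create a new grouping variable for simplicity (group1 is already a factor)
wide2$group <- wide2$group1
table(wide2$group)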
## 
##  0  1 
## 11  9
## Average difference between the groups - Descriptive using group_by() and summarise()
wide2 %>%
  group_by(group) %>%
  summarise(n_distinct(patientid),
            mean(diff),
            sd(diff))
## # A tibble: 2 × 4
##   group `n_distinct(patientid)` `mean(diff)` `sd(diff)`
##   <fct>                   <int>        <dbl>      <dbl>
## 1 0                          11         3.55       2.73
## 2 1                           9         5.67       3.67
## To estimate the difference of the change in chocolate consumption between the two groups
## we need to create vectors of the difference in chocolate consumption pre-post intervention
## for each of the groups
## This is a "difference-in-differences" estimation
## We will also calculate the standard error in order to estimate the 95% CI.

exposed <- (wide2$diff)[wide2$group == 1]
unexposed <- (wide2$diff)[wide2$group == 0]

mean_diff.1 <- mean(exposed)
sd.1 <- sd(exposed)
var.1 <- var(exposed)
n.1 <- length(exposed)

mean_diff.0 <- mean(unexposed)
sd.0 <- sd(unexposed)
var.0 <- var(unexposed)
n.0 <- length(unexposed)

n <- n.1 + n.0
mean.diff <- mean_diff.1 - mean_diff.0
sd.diff <- sqrt(((n.0 - 1)*sd.0^2 + (n.1 - 1)*sd.1^2) / (n.0 + n.1 - 2))
se <- sqrt((var.0 / n.0) + (var.1 / n.1))

mean.diff # Average "difference-in-differences"
## [1] 2.121212
mean.diff + 1.96 * se  ## UL of 95% CI
## [1] 5.014679
mean.diff - 1.96 * se  ## LL of 95% CI
## [1] -0.7722543

The average changes in chocolates consumed for Group 0 (un-exposed) and Group 1 (exposed) were 3.55 and 5.67, respectively. Group 1 appears to have a greater average change in chocolate consumption compared to Group 0. The difference between these differences was 2.12 chocolates, with an approximate 95% confidence interval (CI) of -0.77 to 5.01 based on the normal critical value of 1.96. To determine if this is statistically significant, we can perform an independent t test.

t.test(wide2$diff ~ wide2$group, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  wide2$diff by wide2$group
## t = -1.4369, df = 14.507, p-value = 0.172
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5.277126  1.034702
## sample estimates:
## mean in group 0 mean in group 1 
##        3.545455        5.666667

The p-value is 0.172, which means that we don’t have enough evidence to reject the null hypothesis that the average change in chocolate consumption is the same in the two groups.
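The interval computed by hand earlier (-0.77 to 5.01) and the one reported by t.test() (-5.28 to 1.03) describe the same comparison. They differ because t.test() reports Group 0 minus Group 1, which flips the sign, and because it uses a t quantile with Welch-Satterthwaite degrees of freedom rather than 1.96, which widens the interval slightly. The sketch below reproduces the Welch interval by hand on hypothetical change scores; the vectors `chg_unexposed` and `chg_exposed` are made up for illustration and are not the tutorial data.

```r
## Hypothetical change scores for two groups (illustrative values only)
chg_unexposed <- c(4, 4, 6, 1, 0, 5, 3, 7, 2, 3, 4)
chg_exposed   <- c(9, 2, 6, 1, 8, 10, 5, 3, 7)

## Welch standard error and Welch-Satterthwaite degrees of freedom
v0 <- var(chg_unexposed) / length(chg_unexposed)
v1 <- var(chg_exposed) / length(chg_exposed)
se_welch <- sqrt(v0 + v1)
df_welch <- (v0 + v1)^2 /
  (v0^2 / (length(chg_unexposed) - 1) + v1^2 / (length(chg_exposed) - 1))

## 95% CI for mean(unexposed) - mean(exposed), using the t quantile instead of 1.96
est <- mean(chg_unexposed) - mean(chg_exposed)
ci_manual <- est + c(-1, 1) * qt(0.975, df_welch) * se_welch

ci_manual
as.numeric(t.test(chg_unexposed, chg_exposed, var.equal = FALSE)$conf.int)
```

With small samples the t quantile is noticeably larger than 1.96, so the t-based interval is the safer one to report.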

Table 1b. Difference in chocolate consumption between the groups.


The independent t test compared the mean difference in chocolate consumption between the groups. But it doesn’t take into account the correlation between the number of chocolates consumed before and after the implementation.

Another method that we can use is the linear regression approach.

Linear regression model

We can compare the change in chocolate consumption between patients in the exposed and unexposed groups using a linear regression model. However, we need to include an interaction term between the grouping variable and the period when the intervention was implemented.

Here is the structural form of the linear regression model:

\[\begin{aligned} Y_{i} = \beta_{0} + \beta_{1}Group_{i} + \beta_{2}Period_{i} + \beta_{3}(Group_{i} * Period_{i}) + \epsilon_{i} \end{aligned}\]
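One way to see how the terms in this model map onto the data is to look at the design matrix R builds for the four group-by-period combinations. This is a small sketch on a toy grid; the object names `toy_design` and `X` are hypothetical.

```r
## Toy design: one row for each group-by-time combination
toy_design <- expand.grid(group = factor(c(0, 1)), time = factor(c(1, 2)))

## Columns of the design matrix line up with beta0, beta1, beta2, and beta3
X <- model.matrix(~ group + time + group:time, data = toy_design)
cbind(toy_design, X)
```

The interaction column equals 1 only for the row where group is 1 and time is 2, so \(\beta_{3}\) is the extra change that applies only to the exposed group in the post period.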

To run this linear regression in R, we will need to use the long format of the dataframe (long1).

lm1 <- glm(endpoint1 ~ group + time + group:time, family = "gaussian", data = long1)
summary(lm1)
## 
## Call:
## glm(formula = endpoint1 ~ group + time + group:time, family = "gaussian", 
##     data = long1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.0000  -1.7727   0.1818   2.4773   7.0000  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     2.636      1.099   2.399  0.02174 * 
## group1          4.697      1.638   2.867  0.00688 **
## time2           3.545      1.554   2.281  0.02854 * 
## group1:time2    2.121      2.317   0.916  0.36595   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 13.28283)
## 
##     Null deviance: 1020.00  on 39  degrees of freedom
## Residual deviance:  478.18  on 36  degrees of freedom
## AIC: 222.76
## 
## Number of Fisher Scoring iterations: 2
lm1 %>%
  tbl_regression(intercept = TRUE,
                 estimate_fun = ~ style_number(.x, digits = 2))
Characteristic    Beta    95% CI [1]     p-value
(Intercept)       2.64    0.48, 4.79     0.022
group
    0             (reference)
    1             4.70    1.49, 7.91     0.007
time
    1             (reference)
    2             3.55    0.50, 6.59     0.029
group * time
    1 * 2         2.12    -2.42, 6.66    0.4
[1] CI = Confidence Interval

There are several key similarities between the output from the linear regression approach and the independent t-test.

The \(\beta_{0}\) coefficient, or intercept, is 2.64, which is the average number of chocolates consumed by the unexposed group before the intervention. You can see that this is the same as the value in Table 1a. Adding \(\beta_{0}\) and \(\beta_{1}\) gives the average number of chocolates consumed by the exposed group before the intervention, which is 7.33 (2.64 + 4.70, allowing for rounding error). Since the p-value of \(\beta_{1}\) is 0.007, the average number of chocolates consumed before the intervention was implemented was significantly different between the groups. This difference is, on average, 4.70 chocolates with a 95% confidence interval (CI) of 1.49 to 7.91.

The \(\beta_{2}\) coefficient denotes the average change in chocolate consumption from before to after implementation of the intervention for the unexposed group (the reference group), because the model includes an interaction term. This change is 3.55 chocolates with a 95% CI of 0.50 to 6.59, and the p-value is 0.029. This means that, in the unexposed group, average chocolate consumption was significantly greater by 3.55 chocolates in the period after implementation than in the period before.

The \(\beta_{3}\) coefficient denotes the average difference in the change in chocolate consumption between the groups before and after implementation of the intervention. The average change for the exposed group was 5.67, and the average change for the unexposed group was 3.55.

To visualize how we estimated 5.67, here is a diagram that explains how the regression formula is used to estimate the difference in chocolates consumed in the exposed group before and after the implementation of the intervention:

Figure 1b. Eliminating coefficients using regression model.


Using this framework, we can estimate the difference in chocolates consumed for the exposed and unexposed groups. Thus, we calculated these estimates using the \(\beta\) coefficients from the regression output.

  • 5.67 = (\(\beta_{0}\) + \(\beta_{1}\) + \(\beta_{2}\) + \(\beta_{3}\)) - (\(\beta_{0}\) + \(\beta_{1}\))

  • 3.55 = (\(\beta_{0}\) + \(\beta_{2}\)) - (\(\beta_{0}\))

The difference between these two is 2.12, which is the same as \(\beta_{3}\). The p-value for \(\beta_{3}\) is 0.37 (rounded to 0.4 in the table above), which means that there is no statistically significant difference in the change in chocolate consumption between the two groups before and after implementation of the intervention.

We can conclude that the average difference in the change in chocolate consumption between the groups before and after implementation of the intervention was 2.12 chocolates (95% CI: -2.42, 6.66), but this was not statistically significant.
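The link between the interaction coefficient and the four cell means can be checked numerically. The sketch below uses simulated data rather than the tutorial dataset, and all object names are hypothetical. It fits the model with lm(), which gives the same estimates as glm() with a gaussian family.

```r
## Simulated long-format data (illustrative only, not the tutorial dataset)
set.seed(123)
sim <- data.frame(group = factor(rep(c(0, 1), each = 20)),
                  time  = factor(rep(c(1, 2), times = 20)))
sim$y <- 3 + 4 * (sim$group == 1) + 3 * (sim$time == 2) +
         2 * (sim$group == 1 & sim$time == 2) + rnorm(40, sd = 2)

## Interaction coefficient from the regression
fit <- lm(y ~ group + time + group:time, data = sim)
beta3 <- unname(coef(fit)["group1:time2"])

## Same quantity from the four cell means (rows = group, columns = time)
m <- tapply(sim$y, list(sim$group, sim$time), mean)
did_means <- (m["1", "2"] - m["1", "1"]) - (m["0", "2"] - m["0", "1"])

c(beta3 = beta3, did_means = did_means)
```

The two numbers agree because a model with both main effects and their interaction reproduces each of the four cell means exactly.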

Advantages of the linear regression approach

Given that the results of the linear regression model and independent t test approaches are similar, which one should we choose?

The linear regression approach allows us to adjust for covariates, whereas the independent t test does not. In our dataframe, we have age as a covariate, which we can compare between the groups. The average ages of the exposed and unexposed groups are 59.7 and 58.6 years, respectively.

## Average age between the groups
describeBy(age ~ group, data = wide2)
## 
##  Descriptive statistics by group 
## group: 0
##     vars  n  mean   sd median trimmed  mad min max range skew kurtosis   se
## age    1 11 58.64 8.27     57   57.56 7.41  49  78    29 0.94     0.11 2.49
## ------------------------------------------------------------ 
## group: 1
##     vars n  mean    sd median trimmed   mad min max range  skew kurtosis   se
## age    1 9 59.67 13.71     62   59.67 22.24  41  78    37 -0.08    -1.64 4.57

We can adjust our estimates in the linear regression model by including age as a covariate.

\[\begin{aligned} Y_{i} = \beta_{0} + \beta_{1}Group_{i} + \beta_{2}Period_{i} + \beta_{3}(Group_{i} * Period_{i}) + \beta_{4}Age_{i} + \epsilon_{i} \end{aligned}\]
lm2 <- glm(endpoint1 ~ group + time + group:time + age, family = "gaussian", data = long1)
summary(lm2)
## 
## Call:
## glm(formula = endpoint1 ~ group + time + group:time + age, family = "gaussian", 
##     data = long1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.7470  -2.5002   0.4079   2.2774   6.3915  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -4.27973    3.23250  -1.324  0.19410   
## group1        4.57545    1.55289   2.946  0.00569 **
## time2         3.54545    1.47232   2.408  0.02144 * 
## age           0.11795    0.05219   2.260  0.03015 * 
## group1:time2  2.12121    2.19481   0.966  0.34044   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 11.92252)
## 
##     Null deviance: 1020.00  on 39  degrees of freedom
## Residual deviance:  417.29  on 35  degrees of freedom
## AIC: 219.31
## 
## Number of Fisher Scoring iterations: 2
lm2 %>%
  tbl_regression(intercept = TRUE,
                 estimate_fun = ~ style_number(.x, digits = 2))
Characteristic    Beta     95% CI [1]      p-value
(Intercept)       -4.28    -10.62, 2.06    0.2
group
    0             (reference)
    1             4.58     1.53, 7.62      0.006
time
    1             (reference)
    2             3.55     0.66, 6.43      0.021
age               0.12     0.02, 0.22      0.030
group * time
    1 * 2         2.12     -2.18, 6.42     0.3
[1] CI = Confidence Interval
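In the two regression outputs, the time and interaction estimates are identical with and without age (3.545 and 2.121), while the group estimate and the standard errors change. That is expected here: age does not change within a patient and every patient contributes both a pre and a post measurement, so age cannot explain any of the within-patient change. A minimal sketch on simulated data shows the same behavior. All values and object names below are made up for illustration, and lm() is used in place of glm() with a gaussian family because the estimates are the same.

```r
## Simulated data: 20 patients, each measured pre and post (illustrative only)
set.seed(42)
n_pat <- 20
sim_age <- data.frame(
  patientid = rep(1:n_pat, each = 2),
  group     = factor(rep(rep(c(0, 1), times = c(11, 9)), each = 2)),
  time      = factor(rep(c(1, 2), times = n_pat)),
  age       = rep(round(runif(n_pat, 40, 80)), each = 2)  # constant within patient
)
sim_age$y <- 1 + 0.1 * sim_age$age + 4 * (sim_age$group == 1) +
             3 * (sim_age$time == 2) +
             2 * (sim_age$group == 1 & sim_age$time == 2) +
             rnorm(2 * n_pat, sd = 3)

fit_unadj <- lm(y ~ group + time + group:time, data = sim_age)
fit_adj   <- lm(y ~ group + time + group:time + age, data = sim_age)

## Time and interaction estimates match; the group estimate can differ
keep <- c("group1", "time2", "group1:time2")
rbind(unadjusted = coef(fit_unadj)[keep],
      adjusted   = coef(fit_adj)[keep])
```

Adjusting for a covariate like this can still be worthwhile, because it can change the baseline group comparison and reduce the residual variance.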

Isolating the interaction term, we can compare the independent t test, linear regression model without age, and linear regression model with age. These estimates in Table 1c denote the difference in the change in chocolate consumption between groups before and after implementation of the intervention. In other words, this is the “difference-in-differences” estimation.

Table 1c. Comparison of approaches.


Conclusions

The results of the independent t test and regression model approaches are similar if not exactly the same. Unlike the independent t test, the regression model allows us to adjust for other covariates such as age.

The independent t test and linear regression model approaches provide us with a way to compare the differences in mean outcomes between two groups before and after an intervention (“difference-in-differences”). However, they do not allow us to capture the correlation between the two measurements before and after the intervention. This has important implications for the estimates and their standard errors. For instance, the value for a patient in the period after the intervention is correlated with that patient's value in the period before the intervention. If we don’t take that into account, we will make errors in our estimates and standard errors. In the next part, we will introduce methods to capture these correlations within a subject before and after an intervention.

Disclaimer

This is a work in progress.

This is for educational purposes only.