This simulation of regression to the mean will use data based on resting ACTH values in horses. It assumes mean resting ACTH is 40pg/ml with a standard deviation of 10. The data will be normally distributed, however we know that for many biomarkers like this distribution is right skewed (ie values cannot be less than zero)

Next simulate data with the same mean and standard deviation correlated with the previous baseline measurement (r=0.8) and taken after a placebo treatment was administered.

simcor = function(x, r,m,s){ r2 = r**2 
ve = 1-r2 
SD = sqrt(ve) 
e = rnorm(length(x), mean=0, sd=SD) 
y = r*x + e  
y1 = m+y*s

return(y1) } 


x=scale(baseline)

outcome=simcor(x=x,r=0.8,m=40,s=10)

mean, sd and corr can be checked to see that they are close to the original(baseline) data

mean(outcome)
## [1] 40.1118
sd(outcome)
## [1] 10.21315
cor(baseline,outcome)
##           [,1]
## [1,] 0.7995639

a paired t-test between baseline and outcome should show no statistically significant difference

t.test(baseline,outcome,paired=TRUE)
## 
##  Paired t-test
## 
## data:  baseline and outcome
## t = -0.40471, df = 999, p-value = 0.6858
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4755951  0.3129630
## sample estimates:
## mean of the differences 
##             -0.08131606

Let’s generate a dataframe of these baseline and outcome variable and generate two more variables 1. delta : difference in baseline and outcome 2. group : 1000 patients divided in three groups a) PPID : Both baseline and outcome with ACTH>45(arbitrary cut off) b) inconsistent : Either baseline our outcome less than 45 on one occasion and higher than 45 on other c) normal : Both baseline and outcome less than 45 on both occasions.

df<-data.frame(baseline,outcome)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

create a histogram of values of the pre-post difference in ACTH

library(ggplot2)

df = df %>% mutate(group = case_when(
  baseline<45 & outcome <45 ~ "normal",
   baseline>45 & outcome <45 ~ "inconsistent",
  baseline<45 & outcome >45 ~ "inconsistent",baseline>45 & outcome >45 ~ "PPID"
  
), delta=outcome-baseline)
ggplot(df,aes(x=delta))+geom_histogram(fill="blue")+geom_vline(xintercept=0)+geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kernel density estimate

ggplot(df,aes(x=delta))+geom_density()

frequencies based on grooup

ggplot(df,aes(x=delta))+geom_freqpoly(aes(color=group))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summarize by groups

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.5     v purrr   0.3.4
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df %>% group_by(group) %>% summarize(n=n(),meanbaseline=mean(baseline),sdbaseline=sd(baseline),meanoutcome=mean(outcome),sdoutcome=sd(outcome),meandelta=mean(delta),sddelta=sd(delta)) %>%rownames_to_column %>% 
   gather(var, value, -rowname) %>% 
   spread(rowname, value) 
## # A tibble: 8 x 4
##   var          `1`                `2`                  `3`              
##   <chr>        <chr>              <chr>                <chr>            
## 1 group        inconsistent       normal               PPID             
## 2 meanbaseline 44.7902218077453   34.0047929868025     52.4785953599269 
## 3 meandelta    0.0338973357675693 -0.00615070885504738 0.362410502229162
## 4 meanoutcome  44.8241191435129   33.9986422779474     52.841005862156  
## 5 n            187                596                  217              
## 6 sdbaseline   4.82172892106319   6.61488706365104     5.50149040209251 
## 7 sddelta      8.70758634818275   5.68635638937126     5.67867765325367 
## 8 sdoutcome    4.9701913201309    7.16943821661915     5.86868259617818

Visualize plot of baseline vs outcome

df %>% ggplot(aes(baseline,outcome,color=group))+geom_point()+geom_hline(yintercept=45)+geom_vline(xintercept=45)

Boxplots of pre and post

library(tidyr)
df %>% select(baseline,outcome) %>% gather(key="group",value="ACTH") %>% ggplot(aes(y=ACTH,x=group))+geom_boxplot()

Now select only horses deemed to have PPID ACTH>45 at baseline. This would be similar to selecting only PPID horses for a single arm of a pre-post trial.

df %>% filter(baseline>45) %>% ggplot(aes(baseline,outcome,color=group))+geom_point() +geom_hline(yintercept=45)+geom_vline(xintercept=45)

Repeating t-test on only those cases designated as PPID at baseline produces a highly significant p-value indicating the null hypothesis of no difference between the groups at baseline and outcome can be rejected

df1 = df %>% filter(baseline>45)
t.test(df1$outcome,df1$baseline,paired = TRUE)
## 
##  Paired t-test
## 
## data:  df1$outcome and df1$baseline
## t = -5.7887, df = 308, p-value = 1.747e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.793684 -1.376236
## sample estimates:
## mean of the differences 
##                -2.08496

1. Extreme care should be taken when evaluating the impact of a drug/intervention on a change from baselin in single arm studies (pre-post)

2. Without a control group regression to the mean can have a large impact on a change from baseline

3. Regression to the mean is thought to contribute to the placebo effect and “alternative” therapies which rely on descriptions of dramatic improvement