This demonstration is based on regression to mean example by stephen senn in a must read paper by him.

Let us randomly sample baseline BP of 1000 patients with mean diastolic BP 90 mm Hg and standard deviation 11 mm Hg.

set.seed(7)
baseline = rnorm(1000,90,11)

Now let’s generate an outcome BP at second time point which is correlated (r=0.8) with first baseline measurement and has same mean diastolic BP and standard deviation . you can also assume it is the second measurement after giving placebo.

simcor = function(x, r,m,s){ r2 = r**2 
ve = 1-r2 
SD = sqrt(ve) 
e = rnorm(length(x), mean=0, sd=SD) 
y = r*x + e  
y1 = m+y*s

return(y1) } 


x=scale(baseline)

outcome=simcor(x=x,r=0.8,m=90,s=11)

Just to cross check, let’s see mean , sd and correlation od outcome(post) with baseline(pre).

sd(outcome)
## [1] 11.23446
mean(outcome)
## [1] 90.12298
cor(baseline,outcome)
##           [,1]
## [1,] 0.7995639

Let’s run a paired t-test between pre and post which will be non-significant as expected since mean DBPand sd are same and measurements are correlated

t.test(baseline,outcome,paired=TRUE)
## 
##  Paired t-test
## 
## data:  baseline and outcome
## t = -0.40471, df = 999, p-value = 0.6858
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5231546  0.3442593
## sample estimates:
## mean of the differences 
##             -0.08944766

Let’s generate a dataframe of these baseline and outcome variable and generate two more variables 1. delta : difference in baseline and outcome 2. group : 1000 patients divided in three groups a) hypertensives : Both baseline and outcome with DBP>95(arbitrary cut off) b) inconsistent : Either baseline our outcome less than 95 on one occasion and higher than 95 on other c) normotensives : Both baseline and outcome less than 95 on both occasions.

df = data.frame(baseline,outcome)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

df = df %>% mutate(group = case_when(
  baseline<95 & outcome <95 ~ "normotensives",
   baseline>95 & outcome <95 ~ "inconsistent",
  baseline<95 & outcome >95 ~ "inconsistent",
  baseline>95 & outcome >95 ~ "hypertensives"
  
), delta=outcome-baseline)

Let’s see a histogram of difference pre-post(baeline and outcome)

ggplot(df,aes(x=delta))+geom_histogram(fill="blue")+geom_vline(xintercept=0)+geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s see a kernel density estimate

ggplot(df,aes(x=delta))+geom_density()

Let’s see frequency polygon based on groups

ggplot(df,aes(x=delta))+geom_freqpoly(aes(color=group))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s see summary stats of difference

df %>% select(delta) %>% summary()
##      delta          
##  Min.   :-18.40760  
##  1st Qu.: -5.10058  
##  Median :  0.35982  
##  Mean   :  0.08945  
##  3rd Qu.:  4.71319  
##  Max.   : 21.15423

Let’s see a groupwise summary based on cut-off of 95

library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
df %>% group_by(group) %>% summarize(n=n(),meanbaseline=mean(baseline),sdbaseline=sd(baseline),meanoutcome=mean(outcome),sdoutcome=sd(outcome),meandelta=mean(delta),sddelta=sd(delta)) %>%rownames_to_column %>% 
   gather(var, value, -rowname) %>% 
   spread(rowname, value) 
## # A tibble: 8 x 4
##            var               `1`               `2`                 `3`
## *        <chr>             <chr>             <chr>               <chr>
## 1        group     hypertensives      inconsistent       normotensives
## 2 meanbaseline  103.204094916982  94.7540166189494    83.1569467107115
## 3    meandelta 0.382854912682211 0.142865978069477 -0.0472129554606767
## 4  meanoutcome  103.586949829665  94.8968825970188    83.1097337552508
## 5            n               236               185                 579
## 6   sdbaseline  6.13135306177663  5.39674556481005    7.20561331745338
## 7      sddelta  6.16944262664557  9.67317331228261    6.25380521436097
## 8    sdoutcome  6.50449029037484  5.47596685991821    7.79835262396695

Lets visualise a plot of baseline vs outcome

df %>% ggplot(aes(baseline,outcome,color=group))+geom_point()+geom_hline(yintercept=95)+geom_vline(xintercept=95)

Let’s look at boxplot of pre(baseline) and post(outcome)

library(tidyr)
df %>% select(baseline,outcome) %>% gather(key="group",value="DBP") %>% ggplot(aes(y=DBP,x=group))+geom_boxplot()

Now let’s select only patients with DBP>95 this would be similar to hypertensive only patients entering an arm of control trial

df %>% filter(baseline>95) %>% ggplot(aes(baseline,outcome,color=group))+geom_point() +geom_hline(yintercept=95)+geom_vline(xintercept=95)

df %>% filter(baseline>95) %>% select(delta) %>% summary()
##      delta        
##  Min.   :-18.408  
##  1st Qu.: -6.973  
##  Median : -2.679  
##  Mean   : -2.146  
##  3rd Qu.:  1.965  
##  Max.   : 21.154

Thus we see in this selected group, outcome is lower than baseline as compared to previous group due to selection

Let’s see histogram and density plots

df %>% filter(baseline>95) %>% ggplot(aes(x=delta))+geom_histogram(fill="blue")+geom_vline(xintercept=0)+geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

df %>% filter(baseline>95) %>% ggplot(aes(x=delta))+geom_density()

We run the paired t test and see that it is significant now.

df1 = df %>% filter(baseline>95)
t.test(df1$outcome,df1$baseline,paired = TRUE)
## 
##  Paired t-test
## 
## data:  df1$outcome and df1$baseline
## t = -5.6211, df = 326, p-value = 4.078e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.897628 -1.395226
## sample estimates:
## mean of the differences 
##               -2.146427
df %>% select(baseline,outcome) %>% filter(baseline>95) %>% gather(key="group",value="DBP") %>% ggplot(aes(y=DBP,x=group))+geom_boxplot()

KEY POINTS

  1. Scientists should be very careful while interpreting effect of drug/change in parameter based on a single arm
  2. Control group is important to address regression to mean.
  3. Regression to mean accounts contributes a lot to so-called “placebo effect” or “alternative medicine”.

Reference

  1. Three things that every medical writer should know about statistics