This demonstration is based on regression to mean example by stephen senn in a must read paper by him.
Let us randomly sample baseline BP of 1000 patients with mean diastolic BP 90 mm Hg and standard deviation 11 mm Hg.
set.seed(7)
baseline = rnorm(1000,90,11)
Now let’s generate an outcome BP at second time point which is correlated (r=0.8) with first baseline measurement and has same mean diastolic BP and standard deviation . you can also assume it is the second measurement after giving placebo.
simcor = function(x, r,m,s){ r2 = r**2
ve = 1-r2
SD = sqrt(ve)
e = rnorm(length(x), mean=0, sd=SD)
y = r*x + e
y1 = m+y*s
return(y1) }
x=scale(baseline)
outcome=simcor(x=x,r=0.8,m=90,s=11)
Just to cross check, let’s see mean , sd and correlation od outcome(post) with baseline(pre).
sd(outcome)
## [1] 11.23446
mean(outcome)
## [1] 90.12298
cor(baseline,outcome)
## [,1]
## [1,] 0.7995639
Let’s run a paired t-test between pre and post which will be non-significant as expected since mean DBPand sd are same and measurements are correlated
t.test(baseline,outcome,paired=TRUE)
##
## Paired t-test
##
## data: baseline and outcome
## t = -0.40471, df = 999, p-value = 0.6858
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5231546 0.3442593
## sample estimates:
## mean of the differences
## -0.08944766
Let’s generate a dataframe of these baseline and outcome variable and generate two more variables 1. delta : difference in baseline and outcome 2. group : 1000 patients divided in three groups a) hypertensives : Both baseline and outcome with DBP>95(arbitrary cut off) b) inconsistent : Either baseline our outcome less than 95 on one occasion and higher than 95 on other c) normotensives : Both baseline and outcome less than 95 on both occasions.
df = data.frame(baseline,outcome)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
df = df %>% mutate(group = case_when(
baseline<95 & outcome <95 ~ "normotensives",
baseline>95 & outcome <95 ~ "inconsistent",
baseline<95 & outcome >95 ~ "inconsistent",
baseline>95 & outcome >95 ~ "hypertensives"
), delta=outcome-baseline)
Let’s see a histogram of difference pre-post(baeline and outcome)
ggplot(df,aes(x=delta))+geom_histogram(fill="blue")+geom_vline(xintercept=0)+geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s see a kernel density estimate
ggplot(df,aes(x=delta))+geom_density()
Let’s see frequency polygon based on groups
ggplot(df,aes(x=delta))+geom_freqpoly(aes(color=group))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s see summary stats of difference
df %>% select(delta) %>% summary()
## delta
## Min. :-18.40760
## 1st Qu.: -5.10058
## Median : 0.35982
## Mean : 0.08945
## 3rd Qu.: 4.71319
## Max. : 21.15423
Let’s see a groupwise summary based on cut-off of 95
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
df %>% group_by(group) %>% summarize(n=n(),meanbaseline=mean(baseline),sdbaseline=sd(baseline),meanoutcome=mean(outcome),sdoutcome=sd(outcome),meandelta=mean(delta),sddelta=sd(delta)) %>%rownames_to_column %>%
gather(var, value, -rowname) %>%
spread(rowname, value)
## # A tibble: 8 x 4
## var `1` `2` `3`
## * <chr> <chr> <chr> <chr>
## 1 group hypertensives inconsistent normotensives
## 2 meanbaseline 103.204094916982 94.7540166189494 83.1569467107115
## 3 meandelta 0.382854912682211 0.142865978069477 -0.0472129554606767
## 4 meanoutcome 103.586949829665 94.8968825970188 83.1097337552508
## 5 n 236 185 579
## 6 sdbaseline 6.13135306177663 5.39674556481005 7.20561331745338
## 7 sddelta 6.16944262664557 9.67317331228261 6.25380521436097
## 8 sdoutcome 6.50449029037484 5.47596685991821 7.79835262396695
Lets visualise a plot of baseline vs outcome
df %>% ggplot(aes(baseline,outcome,color=group))+geom_point()+geom_hline(yintercept=95)+geom_vline(xintercept=95)
Let’s look at boxplot of pre(baseline) and post(outcome)
library(tidyr)
df %>% select(baseline,outcome) %>% gather(key="group",value="DBP") %>% ggplot(aes(y=DBP,x=group))+geom_boxplot()
Now let’s select only patients with DBP>95 this would be similar to hypertensive only patients entering an arm of control trial
df %>% filter(baseline>95) %>% ggplot(aes(baseline,outcome,color=group))+geom_point() +geom_hline(yintercept=95)+geom_vline(xintercept=95)
df %>% filter(baseline>95) %>% select(delta) %>% summary()
## delta
## Min. :-18.408
## 1st Qu.: -6.973
## Median : -2.679
## Mean : -2.146
## 3rd Qu.: 1.965
## Max. : 21.154
Thus we see in this selected group, outcome is lower than baseline as compared to previous group due to selection
Let’s see histogram and density plots
df %>% filter(baseline>95) %>% ggplot(aes(x=delta))+geom_histogram(fill="blue")+geom_vline(xintercept=0)+geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% filter(baseline>95) %>% ggplot(aes(x=delta))+geom_density()
We run the paired t test and see that it is significant now.
df1 = df %>% filter(baseline>95)
t.test(df1$outcome,df1$baseline,paired = TRUE)
##
## Paired t-test
##
## data: df1$outcome and df1$baseline
## t = -5.6211, df = 326, p-value = 4.078e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.897628 -1.395226
## sample estimates:
## mean of the differences
## -2.146427
df %>% select(baseline,outcome) %>% filter(baseline>95) %>% gather(key="group",value="DBP") %>% ggplot(aes(y=DBP,x=group))+geom_boxplot()