Before starting

This session will be focus on the analysis of continuous outcome using data from an RCT and the statistical software RStudio. You can download the software for free and you can also use an online version

https://www.rstudio.com/products/rstudio/download/

https://rstudio.cloud/

https://www.mycompiler.io/new/r

Brief introduction to RStudio: https://www.youtube.com/watch?v=FIrsOBy5k58&ab_channel=HRanalytics101.com

Loading library and data

# everything after '#' is read as a comment by the program and not a code
#install.packages("ggplot")
#install.packages("tidyr")
#install.packages("dplyr")
#install.packages("stats")
#install.packages("medicaldata")

library(ggplot2) #visualizations
library(tidyr) #data manipulation
library(dplyr) #data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stats) #statistical tests
library(medicaldata) #datasets

Data description

The objective of this randomized controlled trial was to determine whether treatment of maternal periodontal disease can reduce risk of preterm birth and low birth weight (more details available below the variable definitions).

Group: Control and Treatment Age Education: Ordinal variable BMI BL.GE: Whole-mouth average gingival index at baseline, numeric, range: 0.429-3.0, Silness-Lowe Gingival Index: Higher value indicates more severe inflammation

BL.PL.I: Whole-mouth average plaque index at baseline, Silness-Lowe Gingival Index:Higher value indicates more severe inflammation

V3.GE: Gingival index at third visit V3.PL.I: Plaque index at third visit

#This is a comment. The program won't read it as code
df1 <- medicaldata::opt
df2 <- df1[, c(1,2,3,4,10,15,38,47,48,57)]
df3 <- subset(df2, complete.cases(df2))
head(df3)

names(df3)

##  [1] "PID"       "Clinic"    "Group"     "Age"       "Education" "BMI"      
##  [7] "BL.GE"     "BL.Pl.I"   "V3.GE"     "V3.Pl.I"

table(df3$Group)

## 
##   C   T 
## 325 299

prop.table(table(df3$Group))

## 
##         C         T 
## 0.5208333 0.4791667

 table(df3$Education)

## 
## 8-12 yrs  LT 8 yrs  MT 12 yrs 
##       353       123       148

prop.table(table(df3$Education))

## 
## 8-12 yrs  LT 8 yrs  MT 12 yrs 
## 0.5657051 0.1971154 0.2371795

hist(df3$Age)

hist(df3$BMI)

hist(df3$BL.GE)

hist(df3$BL.Pl.I)

hist(df3$V3.GE)

hist(df3$V3.Pl.I)

sapply(df3, summary)

## $PID
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  100083  201129  300261  265236  302208  402477 
## 
## $Clinic
##  KY  MN  MS  NY 
## 191 212 143  78 
## 
## $Group
##   C   T 
## 325 299 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   22.00   25.00   26.08   30.00   44.00 
## 
## $Education
## 8-12 yrs  LT 8 yrs  MT 12 yrs 
##       353       123       148 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   23.00   26.00   27.33   30.00   68.00 
## 
## $BL.GE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.429   1.045   1.357   1.386   1.691   2.935 
## 
## $BL.Pl.I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.056   0.917   1.250   1.235   1.583   2.778 
## 
## $V3.GE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.030   0.994   1.244   1.285   1.563   3.000 
## 
## $V3.Pl.I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.6670  0.9720  0.9953  1.3330  2.5830

options(scipen=999) #this is to avoid scientific notation of p-values

Question 1

We would like to compare whether the mean gingival index at baseline is different from the population mean of 1.34

t.test(df3$BL.GE, mu = 1.34)

## 
##  One Sample t-test
## 
## data:  df3$BL.GE
## t = 2.8293, df = 623, p-value = 0.004815
## alternative hypothesis: true mean is not equal to 1.34
## 95 percent confidence interval:
##  1.354054 1.417828
## sample estimates:
## mean of x 
##  1.385941

EXTRA TASKS:

Is the average age different from the population (pregnant women) mean of 26 years old?

t.test(df3$Age, mu = 26)

## 
##  One Sample t-test
## 
## data:  df3$Age
## t = 0.38099, df = 623, p-value = 0.7033
## alternative hypothesis: true mean is not equal to 26
## 95 percent confidence interval:
##  25.64715 26.52273
## sample estimates:
## mean of x 
##  26.08494

Null hypothesis: the population mean is equal to 26

How does the average BMI level compare to the recommended (made up) BMI of 25 value

t.test(df3$BMI, mu = 25)

## 
##  One Sample t-test
## 
## data:  df3$BMI
## t = 8.4134, df = 623, p-value = 0.0000000000000002726
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
##  26.78871 27.87796
## sample estimates:
## mean of x 
##  27.33333

Question 2

Let’s now calculate whether there are differences in gingival index at baseline between the treatment and control group. For this, you must use the variable on treatment groups (Group). What is the difference between these groups? The 95% CI of that difference? what is the p-value? Interpret the results

t.test(df3$BL.GE ~ df3$Group)

## 
##  Welch Two Sample t-test
## 
## data:  df3$BL.GE by df3$Group
## t = -0.21859, df = 607, p-value = 0.827
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07121156  0.05694668
## sample estimates:
## mean in group C mean in group T 
##        1.382523        1.389656

EXTRA TASK:

Calculate the same but for plaque index

t.test(df3$BL.Pl.I ~ df3$Group)

## 
##  Welch Two Sample t-test
## 
## data:  df3$BL.Pl.I by df3$Group
## t = -0.57484, df = 611.16, p-value = 0.5656
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09670351  0.05291006
## sample estimates:
## mean in group C mean in group T 
##        1.224582        1.246478

Question 3

Now, please find out if there are differences before and after treatment for the whole sample in terms of the following outcomes: (1) Gingival index, (2) plaque index

Gingival index

t.test(df3$BL.GE, df3$V3.GE)

## 
##  Welch Two Sample t-test
## 
## data:  df3$BL.GE and df3$V3.GE
## t = 4.4454, df = 1245.4, p-value = 0.000009554
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05643771 0.14560396
## sample estimates:
## mean of x mean of y 
##  1.385941  1.284920

Plaque index

t.test(df3$BL.Pl.I, df3$V3.Pl.I)

## 
##  Welch Two Sample t-test
## 
## data:  df3$BL.Pl.I and df3$V3.Pl.I
## t = 8.9975, df = 1245.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1874832 0.2920424
## sample estimates:
## mean of x mean of y 
## 1.2350737 0.9953109

Question 4

But is the change in gingival and plaque indices statistically distinguishable differ between the intervention and control groups?

To answer this, we need to create a new variable of the difference between baseline and follow-up levels

change_GE = V3.GE - BL.GE

#Creating 
df3$change_GE <- df3$V3.GE - df3$BL.GE
summary(df3$change_GE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6880 -0.2347 -0.0535 -0.1010  0.0600  0.6670

df3$change_PL <- df3$V3.Pl.I - df3$BL.Pl.I
summary(df3$change_PL)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8330 -0.4720 -0.1950 -0.2398  0.0000  1.2770

t.test(df3$change_GE~ df3$Group) #gingival

## 
##  Welch Two Sample t-test
## 
## data:  df3$change_GE by df3$Group
## t = 11.783, df = 579.68, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1973083 0.2762417
## sample estimates:
## mean in group C mean in group T 
##      0.01243385     -0.22434114

t.test(df3$change_PL~ df3$Group) #plaque

## 
##  Welch Two Sample t-test
## 
## data:  df3$change_PL by df3$Group
## t = 12.812, df = 583.14, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3193941 0.4350428
## sample estimates:
## mean in group C mean in group T 
##     -0.05901231     -0.43623077

ANOVA

Calculate the mean BMI by group of education. Is there a difference in the baseline BMI by education?

#Mean BMI by educational group
aggregate(df3$BMI, list(df3$Education), FUN=mean)

#Different approach, same result
df3 %>%
  group_by(Education) %>%
  summarise_at(vars(BMI), list(name = mean))

aov(df3$BMI ~ df3$Education)

## Call:
##    aov(formula = df3$BMI ~ df3$Education)
## 
## Terms:
##                 df3$Education Residuals
## Sum of Squares        685.607 29215.059
## Deg. of Freedom             2       621
## 
## Residual standard error: 6.858949
## Estimated effects may be unbalanced

summary(aov(df3$BMI ~ df3$Education))

##                Df Sum Sq Mean Sq F value   Pr(>F)    
## df3$Education   2    686   342.8   7.287 0.000745 ***
## Residuals     621  29215    47.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Mean square of education / mean square of residuals = F value
342.8/47

## [1] 7.293617

Correlation

We want to assess the association between gingival and plaque index at baseline. These are two normally distributed variables and we want to examine the association both visually (scatterplot) and obtain an estimate (correlation). What can you say about this relationship?

scatter.smooth(df3$BL.GE, df3$BL.Pl.I)

cor(df3$BL.GE, df3$BL.Pl.I)

## [1] 0.5829316

Linear regression

To introduce linear regression, we are going to assess the association between plaque index at follow-up and group of treatment. Try to interpret these results.

lm(df3$V3.Pl.I ~ df3$Group)

## 
## Call:
## lm(formula = df3$V3.Pl.I ~ df3$Group)
## 
## Coefficients:
## (Intercept)   df3$GroupT  
##      1.1656      -0.3553

summary(lm(df3$V3.Pl.I ~ df3$Group))

## 
## Call:
## lm(formula = df3$V3.Pl.I ~ df3$Group)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10957 -0.28225  0.00593  0.27417  1.77275 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  1.16557    0.02399   48.58 <0.0000000000000002 ***
## df3$GroupT  -0.35532    0.03466  -10.25 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4325 on 622 degrees of freedom
## Multiple R-squared:  0.1446, Adjusted R-squared:  0.1432 
## F-statistic: 105.1 on 1 and 622 DF,  p-value: < 0.00000000000000022

Calculate the 95% CI for the regression estimate (-0.35). Hint: the formula is be -0.35 - 1.96* Std. Error for the lower limit and -0.35 + 1.96*Std. Error of the upper limit

Optional

Repeat all the exercises with the subsample (df_subsample) and compare results. Hint: copy the different codes and replace df3 by df_subsample.

df_subsample <- df3[sample(nrow(df3), 100), ]
head(df_subsample)

Session 2 practical:Analysis of continuous outcomes and introduction to linear regression WITH ANSWERS