Before starting

This session will be focus on the analysis of continuous outcome using data from an RCT and the statistical software RStudio. You can download the software for free and you can also use an online version

https://www.rstudio.com/products/rstudio/download/

https://rstudio.cloud/

https://www.mycompiler.io/new/r

Brief introduction to RStudio: https://www.youtube.com/watch?v=FIrsOBy5k58&ab_channel=HRanalytics101.com

Loading library and data

# everything after '#' is read as a comment by the program and not a code
#install.packages("ggplot")
#install.packages("tidyr")
#install.packages("dplyr")
#install.packages("stats")
#install.packages("medicaldata")

library(ggplot2) #visualizations
library(tidyr) #data manipulation
library(dplyr) #data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stats) #statistical tests
library(medicaldata) #datasets

Data description

The objective of this randomized controlled trial was to determine whether treatment of maternal periodontal disease can reduce risk of preterm birth and low birth weight (more details available below the variable definitions).

Group: Control and Treatment Age Education: Ordinal variable BMI BL.GE: Whole-mouth average gingival index at baseline, numeric, range: 0.429-3.0, Silness-Lowe Gingival Index: Higher value indicates more severe inflammation

BL.PL.I: Whole-mouth average plaque index at baseline, Silness-Lowe Gingival Index:Higher value indicates more severe inflammation

V3.GE: Gingival index at third visit V3.PL.I: Plaque index at third visit

#This is a comment. The program won't read it as code
df1 <- medicaldata::opt
df2 <- df1[, c(1,2,3,4,10,15,38,47,48,57)]
df3 <- subset(df2, complete.cases(df2))
head(df3)

names(df3)

##  [1] "PID"       "Clinic"    "Group"     "Age"       "Education" "BMI"      
##  [7] "BL.GE"     "BL.Pl.I"   "V3.GE"     "V3.Pl.I"

table(df3$Group)

## 
##   C   T 
## 325 299

prop.table(table(df3$Group))

## 
##         C         T 
## 0.5208333 0.4791667

 table(df3$Education)

## 
## 8-12 yrs  LT 8 yrs  MT 12 yrs 
##       353       123       148

prop.table(table(df3$Education))

## 
## 8-12 yrs  LT 8 yrs  MT 12 yrs 
## 0.5657051 0.1971154 0.2371795

hist(df3$Age)

hist(df3$BMI)

hist(df3$BL.GE)

hist(df3$BL.Pl.I)

hist(df3$V3.GE)

hist(df3$V3.Pl.I)

sapply(df3, summary)

## $PID
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  100083  201129  300261  265236  302208  402477 
## 
## $Clinic
##  KY  MN  MS  NY 
## 191 212 143  78 
## 
## $Group
##   C   T 
## 325 299 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   22.00   25.00   26.08   30.00   44.00 
## 
## $Education
## 8-12 yrs  LT 8 yrs  MT 12 yrs 
##       353       123       148 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   23.00   26.00   27.33   30.00   68.00 
## 
## $BL.GE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.429   1.045   1.357   1.386   1.691   2.935 
## 
## $BL.Pl.I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.056   0.917   1.250   1.235   1.583   2.778 
## 
## $V3.GE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.030   0.994   1.244   1.285   1.563   3.000 
## 
## $V3.Pl.I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.6670  0.9720  0.9953  1.3330  2.5830

options(scipen=999) #this is to avoid scientific notation of p-values

A total of 624 women participated in this RCT. Df3 only includes those women with no missing observations. All continuous variables seem normally distributed. There are 52% in the control group and 48% in the treatment group.

Question 1

We would like to compare whether the mean gingival index at baseline is different from the population mean of 1.34

t.test(df3$BL.GE, mu = 1.34)

## 
##  One Sample t-test
## 
## data:  df3$BL.GE
## t = 2.8293, df = 623, p-value = 0.004815
## alternative hypothesis: true mean is not equal to 1.34
## 95 percent confidence interval:
##  1.354054 1.417828
## sample estimates:
## mean of x 
##  1.385941

The t-test suggests that our data is not compatible with a gingival index population mean of 1.34. The 95% CI suggests that true mean of the population of this data lies between 1.35 and 1.42. The low p-value provides evidence to reject the null hypothesis that the population mean is equal to 1.34

EXTRA TASKS:

Is the average age different from the population (pregnant women) mean of 26 years old?

t.test(df3$Age, mu = 26)

## 
##  One Sample t-test
## 
## data:  df3$Age
## t = 0.38099, df = 623, p-value = 0.7033
## alternative hypothesis: true mean is not equal to 26
## 95 percent confidence interval:
##  25.64715 26.52273
## sample estimates:
## mean of x 
##  26.08494

Null hypothesis: the population mean is equal to 26

The t-test suggests that the data is compatible with a population mean age of 26. The 95% CI suggests that our data is compatible with a population mean between 25.65 and 26.52. The large p-value, 0.7, provides no evidence to reject the null hypothesis

How does the average BMI level compare to the recommended (made up) BMI of 25 value

t.test(df3$BMI, mu = 25)

## 
##  One Sample t-test
## 
## data:  df3$BMI
## t = 8.4134, df = 623, p-value = 0.0000000000000002726
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
##  26.78871 27.87796
## sample estimates:
## mean of x 
##  27.33333

The t-test suggests that the data is not with a population mean BMI of 25 The 95% CI suggests that our data is compatible with a population mean between 26.79 and 27.88. The p-value provides strong evidence to reject the null hypothesis.

Question 2

Let’s now calculate whether there are differences in gingival index at baseline between the treatment and control group. For this, you must use the variable on treatment groups (Group). What is the difference between these groups? The 95% CI of that difference? what is the p-value? Interpret the results

t.test(df3$BL.GE ~ df3$Group)

## 
##  Welch Two Sample t-test
## 
## data:  df3$BL.GE by df3$Group
## t = -0.21859, df = 607, p-value = 0.827
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07121156  0.05694668
## sample estimates:
## mean in group C mean in group T 
##        1.382523        1.389656

The t-test suggests no statistically distinguishable difference between the baseline levels of gingival index between control and treatment group.

EXTRA TASK:

Calculate the same but for plaque index

t.test(df3$BL.Pl.I ~ df3$Group)

## 
##  Welch Two Sample t-test
## 
## data:  df3$BL.Pl.I by df3$Group
## t = -0.57484, df = 611.16, p-value = 0.5656
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09670351  0.05291006
## sample estimates:
## mean in group C mean in group T 
##        1.224582        1.246478

The t-test suggests no statistically distinguishable difference between the baseline levels of plaque index between control and treatment group.

Question 3

Now, please find out if there are differences before and after treatment for the whole sample in terms of the following outcomes: (1) Gingival index, (2) plaque index

Gingival index

t.test(df3$BL.GE, df3$V3.GE)

## 
##  Welch Two Sample t-test
## 
## data:  df3$BL.GE and df3$V3.GE
## t = 4.4454, df = 1245.4, p-value = 0.000009554
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05643771 0.14560396
## sample estimates:
## mean of x mean of y 
##  1.385941  1.284920

The t-test provides strong evidence for a difference in the levels of gingival index between baseline and follow-up levels. We are 95% confident that the difference between baseline levels and follow-up levels of gingival index lie between 0.056 and 0.145.

Plaque index

t.test(df3$BL.Pl.I, df3$V3.Pl.I)

## 
##  Welch Two Sample t-test
## 
## data:  df3$BL.Pl.I and df3$V3.Pl.I
## t = 8.9975, df = 1245.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1874832 0.2920424
## sample estimates:
## mean of x mean of y 
## 1.2350737 0.9953109

The t-test provides strong evidence for a difference in the levels of plaque index between baseline and follow-up levels.We are 95% confident that the difference between baseline levels and follow-up levels of plaque index lie between 0.187 and 0.292.

Question 4

But is the change in gingival and plaque indices statistically distinguishable differ between the intervention and control groups?

To answer this, we need to create a new variable of the difference between baseline and follow-up levels

change_GE = V3.GE - BL.GE

#Creating 
df3$change_GE <- df3$V3.GE - df3$BL.GE
summary(df3$change_GE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6880 -0.2347 -0.0535 -0.1010  0.0600  0.6670

df3$change_PL <- df3$V3.Pl.I - df3$BL.Pl.I
summary(df3$change_PL)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8330 -0.4720 -0.1950 -0.2398  0.0000  1.2770

t.test(df3$change_GE~ df3$Group) #gingival

## 
##  Welch Two Sample t-test
## 
## data:  df3$change_GE by df3$Group
## t = 11.783, df = 579.68, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1973083 0.2762417
## sample estimates:
## mean in group C mean in group T 
##      0.01243385     -0.22434114

t.test(df3$change_PL~ df3$Group) #plaque

## 
##  Welch Two Sample t-test
## 
## data:  df3$change_PL by df3$Group
## t = 12.812, df = 583.14, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3193941 0.4350428
## sample estimates:
## mean in group C mean in group T 
##     -0.05901231     -0.43623077

Both tests provide strong evidence for a difference between gingival and plaque index change by group based on the p-values and 95% CIs.

ANOVA

Calculate the mean BMI by group of education. Is there a difference in the baseline BMI by education?

#Mean BMI by educational group
aggregate(df3$BMI, list(df3$Education), FUN=mean)

#Different approach, same result
df3 %>%
  group_by(Education) %>%
  summarise_at(vars(BMI), list(name = mean))

aov(df3$BMI ~ df3$Education)

## Call:
##    aov(formula = df3$BMI ~ df3$Education)
## 
## Terms:
##                 df3$Education Residuals
## Sum of Squares        685.607 29215.059
## Deg. of Freedom             2       621
## 
## Residual standard error: 6.858949
## Estimated effects may be unbalanced

summary(aov(df3$BMI ~ df3$Education))

##                Df Sum Sq Mean Sq F value   Pr(>F)    
## df3$Education   2    686   342.8   7.287 0.000745 ***
## Residuals     621  29215    47.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Mean square of education / mean square of residuals = F value
342.8/47

## [1] 7.293617

The ANOVA tests provides strong evidence for a difference in BMI by education level.

Correlation

We want to assess the association between gingival and plaque index at baseline. These are two normally distributed variables and we want to examine the association both visually (scatterplot) and obtain an estimate (correlation). What can you say about this relationship?

scatter.smooth(df3$BL.GE, df3$BL.Pl.I)

cor(df3$BL.GE, df3$BL.Pl.I)

## [1] 0.5829316

There seems to be a positive association between baseline gingival and baseline plaque levels. As one increases, the other tend to increase. However, is important to know that this is NOT causation.

Linear regression

To introduce linear regression, we are going to assess the association between plaque index at follow-up and group of treatment. Try to interpret these results.

lm(df3$V3.Pl.I ~ df3$Group)

## 
## Call:
## lm(formula = df3$V3.Pl.I ~ df3$Group)
## 
## Coefficients:
## (Intercept)   df3$GroupT  
##      1.1656      -0.3553

summary(lm(df3$V3.Pl.I ~ df3$Group))

## 
## Call:
## lm(formula = df3$V3.Pl.I ~ df3$Group)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10957 -0.28225  0.00593  0.27417  1.77275 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  1.16557    0.02399   48.58 <0.0000000000000002 ***
## df3$GroupT  -0.35532    0.03466  -10.25 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4325 on 622 degrees of freedom
## Multiple R-squared:  0.1446, Adjusted R-squared:  0.1432 
## F-statistic: 105.1 on 1 and 622 DF,  p-value: < 0.00000000000000022

The linear regression suggests that those from the treatment group have a lower mean plaque index (-0.35) compared to the control group. This means that the group

Calculate the 95% CI for the regression estimate (-0.35). Hint: the formula is be -0.35 - 1.96* Std. Error for the lower limit and -0.35 + 1.96*Std. Error of the upper limit

Optional

Repeat all the exercises with the subsample (df_subsample) and compare results. Hint: copy the different codes and replace df3 by df_subsample.

df_subsample <- df3[sample(nrow(df3), 100), ]
head(df_subsample)

Session 2 practical:Analysis of continuous outcomes and introduction to linear regression WITH ANSWERS