This session will be focus on the analysis of continuous outcome using data from an RCT and the statistical software RStudio. You can download the software for free and you can also use an online version
https://www.rstudio.com/products/rstudio/download/
https://www.mycompiler.io/new/r
Brief introduction to RStudio: https://www.youtube.com/watch?v=FIrsOBy5k58&ab_channel=HRanalytics101.com
Loading library and data
# everything after '#' is read as a comment by the program and not a code
#install.packages("ggplot")
#install.packages("tidyr")
#install.packages("dplyr")
#install.packages("stats")
#install.packages("medicaldata")
library(ggplot2) #visualizations
library(tidyr) #data manipulation
library(dplyr) #data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stats) #statistical tests
library(medicaldata) #datasets
The objective of this randomized controlled trial was to determine whether treatment of maternal periodontal disease can reduce risk of preterm birth and low birth weight (more details available below the variable definitions).
Group: Control and Treatment Age Education: Ordinal variable BMI BL.GE: Whole-mouth average gingival index at baseline, numeric, range: 0.429-3.0, Silness-Lowe Gingival Index: Higher value indicates more severe inflammation
BL.PL.I: Whole-mouth average plaque index at baseline, Silness-Lowe Gingival Index:Higher value indicates more severe inflammation
V3.GE: Gingival index at third visit V3.PL.I: Plaque index at third visit
#This is a comment. The program won't read it as code
df1 <- medicaldata::opt
df2 <- df1[, c(1,2,3,4,10,15,38,47,48,57)]
df3 <- subset(df2, complete.cases(df2))
head(df3)
names(df3)
## [1] "PID" "Clinic" "Group" "Age" "Education" "BMI"
## [7] "BL.GE" "BL.Pl.I" "V3.GE" "V3.Pl.I"
table(df3$Group)
##
## C T
## 325 299
prop.table(table(df3$Group))
##
## C T
## 0.5208333 0.4791667
table(df3$Education)
##
## 8-12 yrs LT 8 yrs MT 12 yrs
## 353 123 148
prop.table(table(df3$Education))
##
## 8-12 yrs LT 8 yrs MT 12 yrs
## 0.5657051 0.1971154 0.2371795
hist(df3$Age)
hist(df3$BMI)
hist(df3$BL.GE)
hist(df3$BL.Pl.I)
hist(df3$V3.GE)
hist(df3$V3.Pl.I)
sapply(df3, summary)
## $PID
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100083 201129 300261 265236 302208 402477
##
## $Clinic
## KY MN MS NY
## 191 212 143 78
##
## $Group
## C T
## 325 299
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 22.00 25.00 26.08 30.00 44.00
##
## $Education
## 8-12 yrs LT 8 yrs MT 12 yrs
## 353 123 148
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 23.00 26.00 27.33 30.00 68.00
##
## $BL.GE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.429 1.045 1.357 1.386 1.691 2.935
##
## $BL.Pl.I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.056 0.917 1.250 1.235 1.583 2.778
##
## $V3.GE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.030 0.994 1.244 1.285 1.563 3.000
##
## $V3.Pl.I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.6670 0.9720 0.9953 1.3330 2.5830
options(scipen=999) #this is to avoid scientific notation of p-values
A total of 624 women participated in this RCT. Df3 only includes those women with no missing observations. All continuous variables seem normally distributed. There are 52% in the control group and 48% in the treatment group.
t.test(df3$BL.GE, mu = 1.34)
##
## One Sample t-test
##
## data: df3$BL.GE
## t = 2.8293, df = 623, p-value = 0.004815
## alternative hypothesis: true mean is not equal to 1.34
## 95 percent confidence interval:
## 1.354054 1.417828
## sample estimates:
## mean of x
## 1.385941
The t-test suggests that our data is not compatible with a gingival index population mean of 1.34. The 95% CI suggests that true mean of the population of this data lies between 1.35 and 1.42. The low p-value provides evidence to reject the null hypothesis that the population mean is equal to 1.34
EXTRA TASKS:
Is the average age different from the population (pregnant women) mean of 26 years old?
t.test(df3$Age, mu = 26)
##
## One Sample t-test
##
## data: df3$Age
## t = 0.38099, df = 623, p-value = 0.7033
## alternative hypothesis: true mean is not equal to 26
## 95 percent confidence interval:
## 25.64715 26.52273
## sample estimates:
## mean of x
## 26.08494
Null hypothesis: the population mean is equal to 26
The t-test suggests that the data is compatible with a population mean age of 26. The 95% CI suggests that our data is compatible with a population mean between 25.65 and 26.52. The large p-value, 0.7, provides no evidence to reject the null hypothesis
How does the average BMI level compare to the recommended (made up) BMI of 25 value
t.test(df3$BMI, mu = 25)
##
## One Sample t-test
##
## data: df3$BMI
## t = 8.4134, df = 623, p-value = 0.0000000000000002726
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
## 26.78871 27.87796
## sample estimates:
## mean of x
## 27.33333
The t-test suggests that the data is not with a population mean BMI of 25 The 95% CI suggests that our data is compatible with a population mean between 26.79 and 27.88. The p-value provides strong evidence to reject the null hypothesis.
t.test(df3$BL.GE ~ df3$Group)
##
## Welch Two Sample t-test
##
## data: df3$BL.GE by df3$Group
## t = -0.21859, df = 607, p-value = 0.827
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07121156 0.05694668
## sample estimates:
## mean in group C mean in group T
## 1.382523 1.389656
The t-test suggests no statistically distinguishable difference between the baseline levels of gingival index between control and treatment group.
EXTRA TASK:
Calculate the same but for plaque index
t.test(df3$BL.Pl.I ~ df3$Group)
##
## Welch Two Sample t-test
##
## data: df3$BL.Pl.I by df3$Group
## t = -0.57484, df = 611.16, p-value = 0.5656
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09670351 0.05291006
## sample estimates:
## mean in group C mean in group T
## 1.224582 1.246478
The t-test suggests no statistically distinguishable difference between the baseline levels of plaque index between control and treatment group.
Gingival index
t.test(df3$BL.GE, df3$V3.GE)
##
## Welch Two Sample t-test
##
## data: df3$BL.GE and df3$V3.GE
## t = 4.4454, df = 1245.4, p-value = 0.000009554
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.05643771 0.14560396
## sample estimates:
## mean of x mean of y
## 1.385941 1.284920
The t-test provides strong evidence for a difference in the levels of gingival index between baseline and follow-up levels. We are 95% confident that the difference between baseline levels and follow-up levels of gingival index lie between 0.056 and 0.145.
Plaque index
t.test(df3$BL.Pl.I, df3$V3.Pl.I)
##
## Welch Two Sample t-test
##
## data: df3$BL.Pl.I and df3$V3.Pl.I
## t = 8.9975, df = 1245.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1874832 0.2920424
## sample estimates:
## mean of x mean of y
## 1.2350737 0.9953109
The t-test provides strong evidence for a difference in the levels of plaque index between baseline and follow-up levels.We are 95% confident that the difference between baseline levels and follow-up levels of plaque index lie between 0.187 and 0.292.
To answer this, we need to create a new variable of the difference between baseline and follow-up levels
change_GE = V3.GE - BL.GE
#Creating
df3$change_GE <- df3$V3.GE - df3$BL.GE
summary(df3$change_GE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6880 -0.2347 -0.0535 -0.1010 0.0600 0.6670
df3$change_PL <- df3$V3.Pl.I - df3$BL.Pl.I
summary(df3$change_PL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8330 -0.4720 -0.1950 -0.2398 0.0000 1.2770
t.test(df3$change_GE~ df3$Group) #gingival
##
## Welch Two Sample t-test
##
## data: df3$change_GE by df3$Group
## t = 11.783, df = 579.68, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1973083 0.2762417
## sample estimates:
## mean in group C mean in group T
## 0.01243385 -0.22434114
t.test(df3$change_PL~ df3$Group) #plaque
##
## Welch Two Sample t-test
##
## data: df3$change_PL by df3$Group
## t = 12.812, df = 583.14, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3193941 0.4350428
## sample estimates:
## mean in group C mean in group T
## -0.05901231 -0.43623077
Both tests provide strong evidence for a difference between gingival and plaque index change by group based on the p-values and 95% CIs.
Calculate the mean BMI by group of education. Is there a difference in the baseline BMI by education?
#Mean BMI by educational group
aggregate(df3$BMI, list(df3$Education), FUN=mean)
#Different approach, same result
df3 %>%
group_by(Education) %>%
summarise_at(vars(BMI), list(name = mean))
aov(df3$BMI ~ df3$Education)
## Call:
## aov(formula = df3$BMI ~ df3$Education)
##
## Terms:
## df3$Education Residuals
## Sum of Squares 685.607 29215.059
## Deg. of Freedom 2 621
##
## Residual standard error: 6.858949
## Estimated effects may be unbalanced
summary(aov(df3$BMI ~ df3$Education))
## Df Sum Sq Mean Sq F value Pr(>F)
## df3$Education 2 686 342.8 7.287 0.000745 ***
## Residuals 621 29215 47.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Mean square of education / mean square of residuals = F value
342.8/47
## [1] 7.293617
The ANOVA tests provides strong evidence for a difference in BMI by education level.
We want to assess the association between gingival and plaque index at baseline. These are two normally distributed variables and we want to examine the association both visually (scatterplot) and obtain an estimate (correlation). What can you say about this relationship?
scatter.smooth(df3$BL.GE, df3$BL.Pl.I)
cor(df3$BL.GE, df3$BL.Pl.I)
## [1] 0.5829316
There seems to be a positive association between baseline gingival and baseline plaque levels. As one increases, the other tend to increase. However, is important to know that this is NOT causation.
To introduce linear regression, we are going to assess the association between plaque index at follow-up and group of treatment. Try to interpret these results.
lm(df3$V3.Pl.I ~ df3$Group)
##
## Call:
## lm(formula = df3$V3.Pl.I ~ df3$Group)
##
## Coefficients:
## (Intercept) df3$GroupT
## 1.1656 -0.3553
summary(lm(df3$V3.Pl.I ~ df3$Group))
##
## Call:
## lm(formula = df3$V3.Pl.I ~ df3$Group)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10957 -0.28225 0.00593 0.27417 1.77275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.16557 0.02399 48.58 <0.0000000000000002 ***
## df3$GroupT -0.35532 0.03466 -10.25 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4325 on 622 degrees of freedom
## Multiple R-squared: 0.1446, Adjusted R-squared: 0.1432
## F-statistic: 105.1 on 1 and 622 DF, p-value: < 0.00000000000000022
The linear regression suggests that those from the treatment group have a lower mean plaque index (-0.35) compared to the control group. This means that the group
Calculate the 95% CI for the regression estimate (-0.35). Hint: the formula is be -0.35 - 1.96* Std. Error for the lower limit and -0.35 + 1.96*Std. Error of the upper limit
Repeat all the exercises with the subsample (df_subsample) and compare results. Hint: copy the different codes and replace df3 by df_subsample.
df_subsample <- df3[sample(nrow(df3), 100), ]
head(df_subsample)