In this document, I would like to share with you my experience in doing descriptive statistical analysis using R packages table1 and compareGroups. For detailed instruction, you should refer to the authors’ vignettes which are availble here:
https://cran.r-project.org/web/packages/compareGroups/vignettes/compareGroups_vignette.html
https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html
First, we need to load the packages into the R environment:
library(table1)
## Warning: package 'table1' was built under R version 4.0.2
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
library(compareGroups)
## Loading required package: SNPassoc
## Loading required package: haplo.stats
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "numericVector" of class "Mnumeric"; definition not updated
## Loading required package: survival
## Loading required package: mvtnorm
## Loading required package: parallel
## Registered S3 method overwritten by 'SNPassoc':
## method from
## summary.haplo.glm haplo.stats
In the compareGroups package, there is a dataset called predimed which consists of 85% random sample from the real study PREDIMED (Estruch et al. 2013). It is a multicenter randomized clinical trial comparing 3 diets: a Mediterranean diet supplemented with extra-virgin olive oil (MedDiet+VOO), a Mediterranean diet supplemented with mixed nuts (MedDiet+Nuts), or a control diet (advice to reduce dietary fat).The primary outcome was the rate of major cardiovascular events (myocardial infarction, stroke, or death from cardiovascular causes. So, let us access to the data and create a working version called ‘df’:
data(predimed)
df = predimed
head(df)
## group sex age smoke bmi waist wth htn diab hyperchol
## 1 Control Male 58 Former 33.53 122 0.7530864 No No Yes
## 2 Control Male 77 Current 31.05 119 0.7300614 Yes Yes No
## 4 MedDiet + VOO Female 72 Former 30.86 106 0.6543210 No Yes No
## 5 MedDiet + Nuts Male 71 Former 27.68 118 0.6941177 Yes No Yes
## 6 MedDiet + VOO Female 79 Never 35.94 129 0.8062500 Yes No Yes
## 8 Control Male 63 Former 41.66 143 0.8033708 Yes Yes Yes
## famhist hormo p14 toevent event
## 1 No No 10 5.374401 Yes
## 2 No No 10 6.097194 No
## 4 Yes No 8 5.946612 No
## 5 No No 8 2.907598 Yes
## 6 No No 9 4.761123 No
## 8 No <NA> 9 3.148528 Yes
Using table1 function we can get summary stats for each variable as follows:
table1(~sex + age + bmi + waist + group, data=df)
| Overall (N=6324) |
|
|---|---|
| Sex | |
| Male | 2679 (42.4%) |
| Female | 3645 (57.6%) |
| Age | |
| Mean (SD) | 67.0 (6.17) |
| Median [Min, Max] | 67.0 [49.0, 87.0] |
| Body mass index | |
| Mean (SD) | 30.0 (3.82) |
| Median [Min, Max] | 29.8 [19.6, 51.9] |
| Waist circumference | |
| Mean (SD) | 100 (10.6) |
| Median [Min, Max] | 100 [50.0, 177] |
| Intervention group | |
| Control | 2042 (32.3%) |
| MedDiet + Nuts | 2100 (33.2%) |
| MedDiet + VOO | 2182 (34.5%) |
We can stratify the analysis by treatment group and add a highlight for each row by using the optiontopclass=“Rtable1-zebra”:
table1(~sex + age + bmi + waist | group, overall=F, topclass="Rtable1-zebra", data=df)
| Control (N=2042) |
MedDiet + Nuts (N=2100) |
MedDiet + VOO (N=2182) |
|
|---|---|---|---|
| Sex | |||
| Male | 812 (39.8%) | 968 (46.1%) | 899 (41.2%) |
| Female | 1230 (60.2%) | 1132 (53.9%) | 1283 (58.8%) |
| Age | |||
| Mean (SD) | 67.3 (6.28) | 66.7 (6.02) | 67.0 (6.21) |
| Median [Min, Max] | 67.0 [49.0, 82.0] | 66.0 [50.0, 84.0] | 67.0 [49.0, 87.0] |
| Body mass index | |||
| Mean (SD) | 30.3 (3.96) | 29.7 (3.77) | 29.9 (3.71) |
| Median [Min, Max] | 30.0 [19.7, 51.9] | 29.5 [19.6, 51.8] | 29.7 [20.0, 49.1] |
| Waist circumference | |||
| Mean (SD) | 101 (10.8) | 100 (10.6) | 100 (10.4) |
| Median [Min, Max] | 101 [58.0, 177] | 100 [61.0, 150] | 100 [50.0, 176] |
# further stratified by sex by using * sign
table1(~age + bmi + waist + event | group*sex, overall=F, topclass="Rtable1-zebra", data=df)
Control |
MedDiet + Nuts |
MedDiet + VOO |
||||
|---|---|---|---|---|---|---|
| Male (N=812) |
Female (N=1230) |
Male (N=968) |
Female (N=1132) |
Male (N=899) |
Female (N=1283) |
|
| Age | ||||||
| Mean (SD) | 66.4 (6.62) | 68.0 (5.96) | 65.8 (6.40) | 67.4 (5.57) | 66.1 (6.61) | 67.7 (5.84) |
| Median [Min, Max] | 66.0 [54.0, 82.0] | 68.0 [49.0, 82.0] | 65.0 [50.0, 83.0] | 67.0 [54.0, 84.0] | 66.0 [49.0, 83.0] | 67.0 [51.0, 87.0] |
| Body mass index | ||||||
| Mean (SD) | 29.6 (3.45) | 30.8 (4.20) | 29.1 (3.28) | 30.2 (4.08) | 29.2 (3.28) | 30.4 (3.91) |
| Median [Min, Max] | 29.3 [20.0, 48.4] | 30.6 [19.7, 51.9] | 29.0 [20.7, 40.7] | 30.0 [19.6, 51.8] | 29.2 [20.1, 40.1] | 30.4 [20.0, 49.1] |
| Waist circumference | ||||||
| Mean (SD) | 104 (9.82) | 99.0 (11.0) | 103 (9.36) | 97.8 (11.0) | 103 (9.65) | 98.0 (10.5) |
| Median [Min, Max] | 103 [58.0, 177] | 99.0 [58.0, 150] | 102 [61.0, 142] | 98.0 [67.0, 150] | 103 [58.0, 176] | 98.0 [50.0, 153] |
| AMI, stroke, or CV Death | ||||||
| No | 754 (92.9%) | 1191 (96.8%) | 927 (95.8%) | 1103 (97.4%) | 847 (94.2%) | 1250 (97.4%) |
| Yes | 58 (7.1%) | 39 (3.2%) | 41 (4.2%) | 29 (2.6%) | 52 (5.8%) | 33 (2.6%) |
Another useful feature of table P-value. This can be done by cut-and-paste the authors’ pvalue function below and then add to the table1 function. OK, let us run the pvalue function
pvalue <- function(x, ...) {
# Construct vectors of data y, and groups (strata) g
y <- unlist(x)
g <- factor(rep(1:length(x), times=sapply(x, length)))
if (is.numeric(y)) {
# For numeric variables, perform a standard 2-sample t-test
p <- t.test(y ~ g)$p.value
} else {
# For categorical variables, perform a chi-squared test of independence
p <- chisq.test(table(y, g))$p.value
}
# Format the p-value, using an HTML entity for the less-than sign.
# The initial empty string places the output on the line below the variable label.
c("", sub("<", "<", format.pval(p, digits=3, eps=0.001)))
}
Please note that this function is designed to compute P-value for 2 groups only. In the function below, I compare the variables between 2 treatment groups, namely, Control and MedDiet + VOO:
df1 = subset(df, group=="Control" | group=="MedDiet + VOO")
table1(~sex + age + bmi + waist + event | group, overall=F, topclass="Rtable1-zebra", extra.col=list("P-value" = pvalue), data=df1)
| Control (N=2042) |
MedDiet + VOO (N=2182) |
P-value | |
|---|---|---|---|
| sex | |||
| Male | 812 (39.8%) | 899 (41.2%) | 0.358 |
| Female | 1230 (60.2%) | 1283 (58.8%) | |
| age | |||
| Mean (SD) | 67.3 (6.28) | 67.0 (6.21) | 0.0936 |
| Median [Min, Max] | 67.0 [49.0, 82.0] | 67.0 [49.0, 87.0] | |
| bmi | |||
| Mean (SD) | 30.3 (3.96) | 29.9 (3.71) | 0.00408 |
| Median [Min, Max] | 30.0 [19.7, 51.9] | 29.7 [20.0, 49.1] | |
| waist | |||
| Mean (SD) | 101 (10.8) | 100 (10.4) | 0.0206 |
| Median [Min, Max] | 101 [58.0, 177] | 100 [50.0, 176] | |
| event | |||
| No | 1945 (95.2%) | 2097 (96.1%) | 0.197 |
| Yes | 97 (4.8%) | 85 (3.9%) |
This package uses a slightly different syntax from that of table1. The main functions are createTable and compareGroups. The idea is to specify the variables for comparison and then create the table after that.
In the example below, I want to get a summary stat for all variables in the dataset:
compareGroups(group ~ ., data=df)
##
##
## -------- Summary of results by groups of 'Intervention group'---------
##
##
## var N p.value method selection
## 1 Sex 6324 <0.001** categorical ALL
## 2 Age 6324 0.003** continuous normal ALL
## 3 Smoking 6324 0.444 categorical ALL
## 4 Body mass index 6324 <0.001** continuous normal ALL
## 5 Waist circumference 6324 0.045** continuous normal ALL
## 6 Waist-to-height ratio 6324 <0.001** continuous normal ALL
## 7 Hypertension 6324 0.249 categorical ALL
## 8 Type-2 diabetes 6324 0.017** categorical ALL
## 9 Dyslipidemia 6324 0.423 categorical ALL
## 10 Family history of premature CHD 6324 0.581 categorical ALL
## 11 Hormone-replacement therapy 5661 0.850 categorical ALL
## 12 MeDiet Adherence score 6324 <0.001** continuous normal ALL
## 13 follow-up to main event (years) 6324 <0.001** continuous normal ALL
## 14 AMI, stroke, or CV Death 6324 0.064* categorical ALL
## -----
## Signif. codes: 0 '**' 0.05 '*' 0.1 ' ' 1
OK, let us focus on some interesting variables
t = compareGroups(~ sex + age + bmi + waist, data=df)
createTable(t)
##
## --------Summary descriptives table ---------
##
## _____________________________________
## [ALL] N
## N=6324
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Sex: 6324
## Male 2679 (42.4%)
## Female 3645 (57.6%)
## Age 67.0 (6.17) 6324
## Body mass index 30.0 (3.82) 6324
## Waist circumference 100 (10.6) 6324
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
# then stratify by group
createTable(compareGroups(group ~ sex + age + bmi + waist, data=df))
##
## --------Summary descriptives table by 'Intervention group'---------
##
## _______________________________________________________________________
## Control MedDiet + Nuts MedDiet + VOO p.overall
## N=2042 N=2100 N=2182
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Sex: <0.001
## Male 812 (39.8%) 968 (46.1%) 899 (41.2%)
## Female 1230 (60.2%) 1132 (53.9%) 1283 (58.8%)
## Age 67.3 (6.28) 66.7 (6.02) 67.0 (6.21) 0.003
## Body mass index 30.3 (3.96) 29.7 (3.77) 29.9 (3.71) <0.001
## Waist circumference 101 (10.8) 100 (10.6) 100 (10.4) 0.045
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
We can indicate the variables that need a non-parametric analysis:
t = compareGroups(group ~ age + smoke + waist + hormo, data=df, method = c(waist = 2), Q1=0.025, Q3=0.975)
## Warning in cor.test.default(x, as.integer(y), method = "spearman"): Cannot
## compute exact p-value with ties
createTable(t)
##
## --------Summary descriptives table by 'Intervention group'---------
##
## ___________________________________________________________________________________
## Control MedDiet + Nuts MedDiet + VOO p.overall
## N=2042 N=2100 N=2182
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Age 67.3 (6.28) 66.7 (6.02) 67.0 (6.21) 0.003
## Smoking: 0.444
## Never 1282 (62.8%) 1259 (60.0%) 1351 (61.9%)
## Current 270 (13.2%) 296 (14.1%) 292 (13.4%)
## Former 490 (24.0%) 545 (26.0%) 539 (24.7%)
## Waist circumference 101 [80.0;123] 100 [80.0;121] 100 [80.0;121] 0.085
## Hormone-replacement therapy: 0.850
## No 1811 (98.3%) 1835 (98.4%) 1918 (98.2%)
## Yes 31 (1.68%) 30 (1.61%) 36 (1.84%)
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
A good feature of compareGroups is that it allows us to obtain odds ratio:
t = compareGroups(htn ~ age + sex + bmi + smoke, data = df, ref=1)
createTable(t, show.ratio = T)
##
## --------Summary descriptives table by 'Hypertension'---------
##
## ___________________________________________________________________________
## No Yes OR p.ratio p.overall
## N=1089 N=5235
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Age 65.9 (6.19) 67.2 (6.15) 1.04 [1.03;1.05] <0.001 <0.001
## Sex: <0.001
## Male 595 (54.6%) 2084 (39.8%) Ref. Ref.
## Female 494 (45.4%) 3151 (60.2%) 1.82 [1.60;2.08] 0.000
## Body mass index 28.9 (3.69) 30.2 (3.80) 1.10 [1.08;1.12] <0.001 <0.001
## Smoking: <0.001
## Never 536 (49.2%) 3356 (64.1%) Ref. Ref.
## Current 233 (21.4%) 625 (11.9%) 0.43 [0.36;0.51] 0.000
## Former 320 (29.4%) 1254 (24.0%) 0.63 [0.54;0.73] <0.001
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯