In this document, I would like to share with you my experience in doing descriptive statistical analysis using R packages table1 and compareGroups. For detailed instruction, you should refer to the authors’ vignettes which are availble here:

https://cran.r-project.org/web/packages/compareGroups/vignettes/compareGroups_vignette.html

https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html

First, we need to load the packages into the R environment:

library(table1)

## Warning: package 'table1' was built under R version 4.0.2

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

library(compareGroups)

## Loading required package: SNPassoc

## Loading required package: haplo.stats

## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "numericVector" of class "Mnumeric"; definition not updated

## Loading required package: survival

## Loading required package: mvtnorm

## Loading required package: parallel

## Registered S3 method overwritten by 'SNPassoc':
##   method            from       
##   summary.haplo.glm haplo.stats

In the compareGroups package, there is a dataset called predimed which consists of 85% random sample from the real study PREDIMED (Estruch et al. 2013). It is a multicenter randomized clinical trial comparing 3 diets: a Mediterranean diet supplemented with extra-virgin olive oil (MedDiet+VOO), a Mediterranean diet supplemented with mixed nuts (MedDiet+Nuts), or a control diet (advice to reduce dietary fat).The primary outcome was the rate of major cardiovascular events (myocardial infarction, stroke, or death from cardiovascular causes. So, let us access to the data and create a working version called ‘df’:

data(predimed)
df = predimed 
head(df)

##            group    sex age   smoke   bmi waist       wth htn diab hyperchol
## 1        Control   Male  58  Former 33.53   122 0.7530864  No   No       Yes
## 2        Control   Male  77 Current 31.05   119 0.7300614 Yes  Yes        No
## 4  MedDiet + VOO Female  72  Former 30.86   106 0.6543210  No  Yes        No
## 5 MedDiet + Nuts   Male  71  Former 27.68   118 0.6941177 Yes   No       Yes
## 6  MedDiet + VOO Female  79   Never 35.94   129 0.8062500 Yes   No       Yes
## 8        Control   Male  63  Former 41.66   143 0.8033708 Yes  Yes       Yes
##   famhist hormo p14  toevent event
## 1      No    No  10 5.374401   Yes
## 2      No    No  10 6.097194    No
## 4     Yes    No   8 5.946612    No
## 5      No    No   8 2.907598   Yes
## 6      No    No   9 4.761123    No
## 8      No  <NA>   9 3.148528   Yes

Using table1 package

Using table1 function we can get summary stats for each variable as follows:

table1(~sex + age + bmi + waist + group, data=df)

	Overall (N=6324)
Sex
Male	2679 (42.4%)
Female	3645 (57.6%)
Age
Mean (SD)	67.0 (6.17)
Median [Min, Max]	67.0 [49.0, 87.0]
Body mass index
Mean (SD)	30.0 (3.82)
Median [Min, Max]	29.8 [19.6, 51.9]
Waist circumference
Mean (SD)	100 (10.6)
Median [Min, Max]	100 [50.0, 177]
Intervention group
Control	2042 (32.3%)
MedDiet + Nuts	2100 (33.2%)
MedDiet + VOO	2182 (34.5%)

We can stratify the analysis by treatment group and add a highlight for each row by using the optiontopclass=“Rtable1-zebra”:

table1(~sex + age + bmi + waist | group, overall=F, topclass="Rtable1-zebra", data=df)

	Control (N=2042)	MedDiet + Nuts (N=2100)	MedDiet + VOO (N=2182)
Sex
Male	812 (39.8%)	968 (46.1%)	899 (41.2%)
Female	1230 (60.2%)	1132 (53.9%)	1283 (58.8%)
Age
Mean (SD)	67.3 (6.28)	66.7 (6.02)	67.0 (6.21)
Median [Min, Max]	67.0 [49.0, 82.0]	66.0 [50.0, 84.0]	67.0 [49.0, 87.0]
Body mass index
Mean (SD)	30.3 (3.96)	29.7 (3.77)	29.9 (3.71)
Median [Min, Max]	30.0 [19.7, 51.9]	29.5 [19.6, 51.8]	29.7 [20.0, 49.1]
Waist circumference
Mean (SD)	101 (10.8)	100 (10.6)	100 (10.4)
Median [Min, Max]	101 [58.0, 177]	100 [61.0, 150]	100 [50.0, 176]

# further stratified by sex by using * sign 
table1(~age + bmi + waist + event | group*sex, overall=F, topclass="Rtable1-zebra", data=df)

	Control		MedDiet + Nuts		MedDiet + VOO
	Male (N=812)	Female (N=1230)	Male (N=968)	Female (N=1132)	Male (N=899)	Female (N=1283)
Age
Mean (SD)	66.4 (6.62)	68.0 (5.96)	65.8 (6.40)	67.4 (5.57)	66.1 (6.61)	67.7 (5.84)
Median [Min, Max]	66.0 [54.0, 82.0]	68.0 [49.0, 82.0]	65.0 [50.0, 83.0]	67.0 [54.0, 84.0]	66.0 [49.0, 83.0]	67.0 [51.0, 87.0]
Body mass index
Mean (SD)	29.6 (3.45)	30.8 (4.20)	29.1 (3.28)	30.2 (4.08)	29.2 (3.28)	30.4 (3.91)
Median [Min, Max]	29.3 [20.0, 48.4]	30.6 [19.7, 51.9]	29.0 [20.7, 40.7]	30.0 [19.6, 51.8]	29.2 [20.1, 40.1]	30.4 [20.0, 49.1]
Waist circumference
Mean (SD)	104 (9.82)	99.0 (11.0)	103 (9.36)	97.8 (11.0)	103 (9.65)	98.0 (10.5)
Median [Min, Max]	103 [58.0, 177]	99.0 [58.0, 150]	102 [61.0, 142]	98.0 [67.0, 150]	103 [58.0, 176]	98.0 [50.0, 153]
AMI, stroke, or CV Death
No	754 (92.9%)	1191 (96.8%)	927 (95.8%)	1103 (97.4%)	847 (94.2%)	1250 (97.4%)
Yes	58 (7.1%)	39 (3.2%)	41 (4.2%)	29 (2.6%)	52 (5.8%)	33 (2.6%)

Another useful feature of table P-value. This can be done by cut-and-paste the authors’ pvalue function below and then add to the table1 function. OK, let us run the pvalue function

P-value function for using table1 (refer to the above link)

pvalue <- function(x, ...) {
    # Construct vectors of data y, and groups (strata) g
    y <- unlist(x)
    g <- factor(rep(1:length(x), times=sapply(x, length)))
    if (is.numeric(y)) {
        # For numeric variables, perform a standard 2-sample t-test
        p <- t.test(y ~ g)$p.value
    } else {
        # For categorical variables, perform a chi-squared test of independence
        p <- chisq.test(table(y, g))$p.value
    }
    # Format the p-value, using an HTML entity for the less-than sign.
    # The initial empty string places the output on the line below the variable label.
    c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
}

Please note that this function is designed to compute P-value for 2 groups only. In the function below, I compare the variables between 2 treatment groups, namely, Control and MedDiet + VOO:

df1 = subset(df, group=="Control" | group=="MedDiet + VOO")

table1(~sex + age + bmi + waist + event | group, overall=F, topclass="Rtable1-zebra", extra.col=list("P-value" = pvalue), data=df1)

	Control (N=2042)	MedDiet + VOO (N=2182)	P-value
sex
Male	812 (39.8%)	899 (41.2%)	0.358
Female	1230 (60.2%)	1283 (58.8%)
age
Mean (SD)	67.3 (6.28)	67.0 (6.21)	0.0936
Median [Min, Max]	67.0 [49.0, 82.0]	67.0 [49.0, 87.0]
bmi
Mean (SD)	30.3 (3.96)	29.9 (3.71)	0.00408
Median [Min, Max]	30.0 [19.7, 51.9]	29.7 [20.0, 49.1]
waist
Mean (SD)	101 (10.8)	100 (10.4)	0.0206
Median [Min, Max]	101 [58.0, 177]	100 [50.0, 176]
event
No	1945 (95.2%)	2097 (96.1%)	0.197
Yes	97 (4.8%)	85 (3.9%)

Using compareGroups package

This package uses a slightly different syntax from that of table1. The main functions are createTable and compareGroups. The idea is to specify the variables for comparison and then create the table after that.

In the example below, I want to get a summary stat for all variables in the dataset:

compareGroups(group ~ ., data=df)

## 
## 
## -------- Summary of results by groups of 'Intervention group'---------
## 
## 
##    var                             N    p.value  method            selection
## 1  Sex                             6324 <0.001** categorical       ALL      
## 2  Age                             6324 0.003**  continuous normal ALL      
## 3  Smoking                         6324 0.444    categorical       ALL      
## 4  Body mass index                 6324 <0.001** continuous normal ALL      
## 5  Waist circumference             6324 0.045**  continuous normal ALL      
## 6  Waist-to-height ratio           6324 <0.001** continuous normal ALL      
## 7  Hypertension                    6324 0.249    categorical       ALL      
## 8  Type-2 diabetes                 6324 0.017**  categorical       ALL      
## 9  Dyslipidemia                    6324 0.423    categorical       ALL      
## 10 Family history of premature CHD 6324 0.581    categorical       ALL      
## 11 Hormone-replacement therapy     5661 0.850    categorical       ALL      
## 12 MeDiet Adherence score          6324 <0.001** continuous normal ALL      
## 13 follow-up to main event (years) 6324 <0.001** continuous normal ALL      
## 14 AMI, stroke, or CV Death        6324 0.064*   categorical       ALL      
## -----
## Signif. codes:  0 '**' 0.05 '*' 0.1 ' ' 1

OK, let us focus on some interesting variables

t = compareGroups(~ sex + age + bmi + waist, data=df)
createTable(t)

## 
## --------Summary descriptives table ---------
## 
## _____________________________________ 
##                        [ALL]      N   
##                        N=6324         
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Sex:                             6324 
##     Male            2679 (42.4%)      
##     Female          3645 (57.6%)      
## Age                 67.0 (6.17)  6324 
## Body mass index     30.0 (3.82)  6324 
## Waist circumference  100 (10.6)  6324 
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

# then stratify by group
createTable(compareGroups(group ~ sex + age + bmi + waist, data=df))

## 
## --------Summary descriptives table by 'Intervention group'---------
## 
## _______________________________________________________________________ 
##                       Control    MedDiet + Nuts MedDiet + VOO p.overall 
##                        N=2042        N=2100        N=2182               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Sex:                                                           <0.001   
##     Male            812 (39.8%)   968 (46.1%)    899 (41.2%)            
##     Female          1230 (60.2%)  1132 (53.9%)  1283 (58.8%)            
## Age                 67.3 (6.28)   66.7 (6.02)    67.0 (6.21)    0.003   
## Body mass index     30.3 (3.96)   29.7 (3.77)    29.9 (3.71)   <0.001   
## Waist circumference  101 (10.8)    100 (10.6)    100 (10.4)     0.045   
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

We can indicate the variables that need a non-parametric analysis:

t = compareGroups(group ~ age + smoke + waist + hormo, data=df, method = c(waist = 2), Q1=0.025, Q3=0.975)

## Warning in cor.test.default(x, as.integer(y), method = "spearman"): Cannot
## compute exact p-value with ties

createTable(t)

## 
## --------Summary descriptives table by 'Intervention group'---------
## 
## ___________________________________________________________________________________ 
##                                 Control     MedDiet + Nuts MedDiet + VOO  p.overall 
##                                  N=2042         N=2100         N=2182               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Age                           67.3 (6.28)    66.7 (6.02)    67.0 (6.21)     0.003   
## Smoking:                                                                    0.444   
##     Never                     1282 (62.8%)   1259 (60.0%)   1351 (61.9%)            
##     Current                   270 (13.2%)    296 (14.1%)    292 (13.4%)             
##     Former                    490 (24.0%)    545 (26.0%)    539 (24.7%)             
## Waist circumference          101 [80.0;123] 100 [80.0;121] 100 [80.0;121]   0.085   
## Hormone-replacement therapy:                                                0.850   
##     No                        1811 (98.3%)   1835 (98.4%)   1918 (98.2%)            
##     Yes                        31 (1.68%)     30 (1.61%)     36 (1.84%)             
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

A good feature of compareGroups is that it allows us to obtain odds ratio:

t = compareGroups(htn ~ age + sex + bmi + smoke, data = df, ref=1)
createTable(t, show.ratio = T)

## 
## --------Summary descriptives table by 'Hypertension'---------
## 
## ___________________________________________________________________________ 
##                     No          Yes             OR        p.ratio p.overall 
##                   N=1089       N=5235                                       
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Age             65.9 (6.19) 67.2 (6.15)  1.04 [1.03;1.05] <0.001   <0.001   
## Sex:                                                               <0.001   
##     Male        595 (54.6%) 2084 (39.8%)       Ref.        Ref.             
##     Female      494 (45.4%) 3151 (60.2%) 1.82 [1.60;2.08]  0.000            
## Body mass index 28.9 (3.69) 30.2 (3.80)  1.10 [1.08;1.12] <0.001   <0.001   
## Smoking:                                                           <0.001   
##     Never       536 (49.2%) 3356 (64.1%)       Ref.        Ref.             
##     Current     233 (21.4%) 625 (11.9%)  0.43 [0.36;0.51]  0.000            
##     Former      320 (29.4%) 1254 (24.0%) 0.63 [0.54;0.73] <0.001            
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Doing descriptive analysis with R

T Nguyen

03/10/2021

Using table1 package

P-value function for using table1 (refer to the above link)

Using compareGroups package