Descriptive Tables

janitor {}, dyplr{} and gtsummary{}

Author

Raphael Anea

Published

June 5, 2026

Code
# Load Packages
# This chunk load packages
pacman::p_load(
  rio,           # for importing data
  here,          # for relative file paths
  skimr,         # for reviewing the data
  janitor,       # for cleaning data and creating summary
                 # tables (add totals & percents to tables) 
  gt,            # for 
  gtExtras,      # 
  gtsummary,     # for nice tables
  flextable,     # for converting tables to pretty images
  kableExtra,    # for making HTML tables
  scales,        # for percents and date labels
  officer,       # border lines
  RColorBrewer,  # for color palettes
  viridis,       # for color-blind friendly color scales
  tidyverse      # for data management and visualization
)
Code
# Import Clean LineList
linelist <- 
  import(here("data", "clean", "linelist_cleaned.rds"))

1 BACKGROUND

Objective of this output is to demonstrate the use of janitor{}, dyplr{} and gtsummary{} packages to create tables. And flextable{} and kable_extra{} package is used to format presentation ready tables and added aesthetics.

Mainly this project is to achieve how Linear Regression, Logistic Regression and Univariate/Multivariable Regression can be expressed using a descriptive table.

In this model, I NOT providing interpretation of the findings.

I am doing this out of curiosity to explore the R packages for creating Tables in a simple descriptive and advance predictive models, reproducible coding.

It is a revisit to “Introduction to R, from the EPI R HANDBOOK”, provided by APPLIED EPI.

With added enthusiasm of exploring and integration with Artificial Intelligence.

2 Exploratory Analysis

2.1 Viewing the First 10 Datasets

The dataset is pre-cleaned prior importing for this purposes. Thus, this is a exploratory of the “clean_linelist” dataset.

Code
linelist %>% 
  head(10)
   case_id generation date_infection date_onset date_hospitalisation
1   5fe599          4     2014-05-08 2014-05-13           2014-05-15
2   8689b7          4           <NA> 2014-05-13           2014-05-14
3   11f8ea          2           <NA> 2014-05-16           2014-05-18
4   b8812a          3     2014-05-04 2014-05-18           2014-05-20
5   893f25          3     2014-05-18 2014-05-21           2014-05-22
6   be99c8          3     2014-05-03 2014-05-22           2014-05-23
7   07e3e8          4     2014-05-22 2014-05-27           2014-05-29
8   369449          4     2014-05-28 2014-06-02           2014-06-03
9   f393b4          4           <NA> 2014-06-05           2014-06-06
10  1389ca          4           <NA> 2014-06-05           2014-06-07
   date_outcome outcome gender age age_unit age_years age_cat age_cat5
1          <NA>    <NA>      m   2    years         2     0-4      0-4
2    2014-05-18 Recover      f   3    years         3     0-4      0-4
3    2014-05-30 Recover      m  56    years        56   50-69    55-59
4          <NA>    <NA>      f  18    years        18   15-19    15-19
5    2014-05-29 Recover      m   3    years         3     0-4      0-4
6    2014-05-24 Recover      f  16    years        16   15-19    15-19
7    2014-06-01 Recover      f  16    years        16   15-19    15-19
8    2014-06-07   Death      f   0    years         0     0-4      0-4
9    2014-06-18 Recover      m  61    years        61   50-69    60-64
10   2014-06-09   Death      f  27    years        27   20-29    25-29
                               hospital       lon      lat infector source
1                                 Other -13.21574 8.468973   f547d6  other
2                               Missing -13.21523 8.451719     <NA>   <NA>
3  St. Mark's Maternity Hospital (SMMH) -13.21291 8.464817     <NA>   <NA>
4                         Port Hospital -13.23637 8.475476   f90f5f  other
5                     Military Hospital -13.22286 8.460824   11f8ea  other
6                         Port Hospital -13.22263 8.461831   aec8ec  other
7                               Missing -13.23315 8.462729   893f25  other
8                               Missing -13.23210 8.461444   133ee7  other
9                               Missing -13.22255 8.461913     <NA>   <NA>
10                              Missing -13.25722 8.472923     <NA>   <NA>
   wt_kg ht_cm ct_blood fever chills cough aches vomit temp time_admission
1     27    48       22    no     no   yes    no   yes 36.8           <NA>
2     25    59       22  <NA>   <NA>  <NA>  <NA>  <NA> 36.9          09:36
3     91   238       21  <NA>   <NA>  <NA>  <NA>  <NA> 36.9          16:48
4     41   135       23    no     no    no    no    no 36.8          11:22
5     36    71       23    no     no   yes    no   yes 36.9          12:60
6     56   116       21    no     no   yes    no   yes 37.6          14:13
7     47    87       21  <NA>   <NA>  <NA>  <NA>  <NA> 37.3          14:33
8      0    11       22    no     no   yes    no   yes 37.0          09:25
9     86   226       22    no     no   yes    no   yes 36.4          11:16
10    69   174       22    no     no   yes    no    no 35.9          10:55
         bmi days_onset_hosp
1  117.18750               2
2   71.81844               1
3   16.06525               2
4   22.49657               2
5   71.41440               1
6   41.61712               1
7   62.09539               2
8    0.00000               1
9   16.83765               1
10  22.79033               2

2.2 Using skimr() Package

Code
# Gets information about each variable in the datset
skim(linelist)
Data summary
Name linelist
Number of rows 5888
Number of columns 30
_______________________
Column type frequency:
character 13
Date 4
factor 2
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
case_id 0 1.00 6 6 0 5888 0
outcome 1323 0.78 5 7 0 2 0
gender 278 0.95 1 1 0 2 0
age_unit 0 1.00 5 6 0 2 0
hospital 0 1.00 5 36 0 6 0
infector 2088 0.65 6 6 0 2697 0
source 2088 0.65 5 7 0 2 0
fever 249 0.96 2 3 0 2 0
chills 249 0.96 2 3 0 2 0
cough 249 0.96 2 3 0 2 0
aches 249 0.96 2 3 0 2 0
vomit 249 0.96 2 3 0 2 0
time_admission 765 0.87 5 5 0 1072 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_infection 2087 0.65 2014-03-19 2015-04-27 2014-10-11 359
date_onset 256 0.96 2014-04-07 2015-04-30 2014-10-23 367
date_hospitalisation 0 1.00 2014-04-17 2015-04-30 2014-10-23 363
date_outcome 936 0.84 2014-04-19 2015-06-04 2014-11-01 371

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
age_cat 86 0.99 FALSE 8 0-4: 1095, 5-9: 1095, 20-: 1073, 10-: 941
age_cat5 86 0.99 FALSE 17 0-4: 1095, 5-9: 1095, 10-: 941, 15-: 743

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
generation 0 1.00 16.56 5.79 0.00 13.00 16.00 20.00 37.00 ▁▆▇▂▁
age 86 0.99 16.07 12.62 0.00 6.00 13.00 23.00 84.00 ▇▅▁▁▁
age_years 86 0.99 16.02 12.64 0.00 6.00 13.00 23.00 84.00 ▇▅▁▁▁
lon 0 1.00 -13.23 0.02 -13.27 -13.25 -13.23 -13.22 -13.21 ▅▃▃▆▇
lat 0 1.00 8.47 0.01 8.45 8.46 8.47 8.48 8.49 ▅▇▇▇▆
wt_kg 0 1.00 52.64 18.58 -11.00 41.00 54.00 66.00 111.00 ▁▃▇▅▁
ht_cm 0 1.00 124.96 49.52 4.00 91.00 129.00 159.00 295.00 ▂▅▇▂▁
ct_blood 0 1.00 21.21 1.69 16.00 20.00 22.00 22.00 26.00 ▁▃▇▃▁
temp 149 0.97 38.56 0.98 35.20 38.20 38.80 39.20 40.80 ▁▂▂▇▁
bmi 0 1.00 46.89 55.39 -1200.00 24.56 32.12 50.01 1250.00 ▁▁▇▁▁
days_onset_hosp 256 0.96 2.06 2.26 0.00 1.00 1.00 3.00 22.00 ▇▁▁▁▁

2.3 Summary Statistics

Code
# age years
summary(linelist$age_years)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    6.00   13.00   16.02   23.00   84.00      86 
Code
# return only 2nd element
summary(linelist$age_years)[[2]]
[1] 6

3 CREATING TABLES

In this section I demonstrated the use of creating tables using janitor{}, dplyr{} and gtsummary{} packages, as per the sub-headings.

In the gtsummary{}, subheading, i explore the Data Completeness using gt{} package prior exploring prio exploring the Regression Models using gtsummmary{}.

Its table is provided, with reproducible codes.

3.1 USING janitor{} PACKAGE

Simple use of tabyl()

Code
# age_cat <- 
linelist %>% 
  tabyl(age_cat) %>% 
  # Express percentage values as percentages instead of decimals
  mutate(percent = scales::percent(percent)) %>%
  mutate(percent = ifelse(is.na(percent), "0%", percent)) %>%
  mutate(valid_percent = scales::percent(valid_percent)) %>%
  flextable() %>% 
  autofit() %>%
  # Add header title
  add_header_row(values = "Distribution of Cases by Age Category", 
                 colwidths = 4) %>%
  # Change column names
  set_header_labels(age_cat = "Age Category", 
                    n = "Count", 
                    percent = "Percentage",
                    valid_percent = "Valid Percentage") %>%
  
# Add aesthetics to flextable
theme_box() %>% 
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  align(align = "center", part = "header") %>%
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "all") %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output

Distribution of Cases by Age Category

Age Category

Count

Percentage

Valid Percentage

0-4

1,095

18.60%

18.87%

5-9

1,095

18.60%

18.87%

10-14

941

15.98%

16.22%

15-19

743

12.62%

12.81%

20-29

1,073

18.22%

18.49%

30-49

754

12.81%

13.00%

50-69

95

1.61%

1.64%

70+

6

0.10%

0.10%

86

1.46%

Cross-tabulation

Expressing Age categories by Gender.

Code
linelist %>% 
  tabyl(age_cat,gender) %>% 
  # Add totals for rows and columns
  adorn_totals(where = c("row", "col")) %>%
  flextable() %>% 
  autofit() %>%
  # Theme and aesthetics
  theme_box() %>%
  # Add Header title
  add_header_row(values = "Distribution of Cases by Age Category", 
                 colwidths = 5) %>%
  # Change column names
  set_header_labels(age_cat = "Age Category",
                    f       = "Females",
                    m       = "Males",
                    NA_     = "Unknown") %>% 
# Add aesthetics to flextable
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  align(align = "center", part = "header") %>%
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "all") 

Distribution of Cases by Age Category

Age Category

Females

Males

Unknown

Total

0-4

640

416

39

1,095

5-9

641

412

42

1,095

10-14

518

383

40

941

15-19

359

364

20

743

20-29

468

575

30

1,073

30-49

179

557

18

754

50-69

2

91

2

95

70+

0

5

1

6

0

0

86

86

Total

2,807

2,803

278

5,888

Adorning the tables

Code
linelist %>% 
  tabyl(age_cat) %>% 
  # Add totals for rows and columns
  adorn_totals() %>%
  adorn_pct_formatting() %>% 
  flextable() %>% 
  autofit() %>%
  theme_box() %>% 
  add_header_row(values = "Distribution of Cases by Age Category", 
                 colwidths = 4) %>%
  set_header_labels(age_cat = "Age Category",
                    n = "Count",
                    percent = "Percentage",
                    valid_percent = "Valid Percentage") %>%
# Add aesthetics to flextable   
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black",
                                                         width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  align(align = "center", part = "header") %>%
  # Align Column 2 to 4(percent, valid_percent, n)
  align(align = "right", part = "body", j = 2:4) %>%
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "header")

Distribution of Cases by Age Category

Age Category

Count

Percentage

Valid Percentage

0-4

1,095

18.6%

18.9%

5-9

1,095

18.6%

18.9%

10-14

941

16.0%

16.2%

15-19

743

12.6%

12.8%

20-29

1,073

18.2%

18.5%

30-49

754

12.8%

13.0%

50-69

95

1.6%

1.6%

70+

6

0.1%

0.1%

86

1.5%

-

Total

5,888

100.0%

100.0%

A cross-tabulation adjusted to capture both count and percentage

Capture Age-categories by gender and express in count and percentages.

Code
linelist %>% 
  tabyl(age_cat, gender) %>% 
  adorn_totals(where = "col") %>%  #
  adorn_percentages(denominator = "col") %>%
  adorn_pct_formatting() %>% 
  adorn_ns(position = "front") %>% 
  # adorn_title(
  #   row_name = "Age Category",
  #   placement = "combined",
  #   col_name = "Gender") %>% 
 flextable() %>% 
  autofit() %>% 
  theme_box() %>%
  add_header_row(values = "Distribution of Cases by Age Category and Gender", 
                 colwidths = 5) %>%  
  set_header_labels(age_cat = "Age Category",
                    f       = "Females",
                    m       = "Males",
                    NA_     = "Unknown") %>%
  # Add aesthetics to flextable
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  align(align = "center", part = "header") %>%
  align(align = "right", part = "body", j = 2:5) %>%
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "header") %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output)

Distribution of Cases by Age Category and Gender

Age Category

Females

Males

Unknown

Total

0-4

640 (22.8%)

416 (14.8%)

39 (14.0%)

1,095 (18.6%)

5-9

641 (22.8%)

412 (14.7%)

42 (15.1%)

1,095 (18.6%)

10-14

518 (18.5%)

383 (13.7%)

40 (14.4%)

941 (16.0%)

15-19

359 (12.8%)

364 (13.0%)

20 (7.2%)

743 (12.6%)

20-29

468 (16.7%)

575 (20.5%)

30 (10.8%)

1,073 (18.2%)

30-49

179 (6.4%)

557 (19.9%)

18 (6.5%)

754 (12.8%)

50-69

2 (0.1%)

91 (3.2%)

2 (0.7%)

95 (1.6%)

70+

0 (0.0%)

5 (0.2%)

1 (0.4%)

6 (0.1%)

0 (0.0%)

0 (0.0%)

86 (30.9%)

86 (1.5%)

Use of other variables

Code
linelist %>% 
  count(hospital) %>%    # dyplr function
  adorn_totals() %>%     # janitor function
  flextable() %>% 
  autofit()

hospital

n

Central Hospital

454

Military Hospital

896

Missing

1,469

Other

885

Port Hospital

1,762

St. Mark's Maternity Hospital (SMMH)

422

Total

5,888

Code
linelist %>% 
  # Remove NA values from the table
  tabyl(hospital, age_cat) %>%
    adorn_totals() %>%
  adorn_percentages(denominator = "col") %>%
  adorn_pct_formatting() %>%
  adorn_ns(position = "front") %>%
  flextable() %>%
  theme_box() %>%
  add_header_row(values = "Distribution of Cases by Hospital and Age Category( Age in Years) ", 
                 colwidths = 10) %>% 
  # Change column names
  set_header_labels(hospital = "Hospital",
                    `0-4` = "0-4",
                    `5-9` = "5-9",
                    `10-14` = "10-14",
                    `15-19` = "15-19",
                    `20-29` = "20-29",
                    `30-49` = "30-49",
                    `50-69` = "50-69",
                    `70+`   = "70+") %>% 
# Add aesthetics to flextable
  border_outer(part = "all", border = officer::fp_border(color = "black",
                                                         width = 2)) %>%
  border(part = "all", border = officer::fp_border(color = "black",
                                                         width = 1)) %>%
  bg(bg = "lightblue", part = "header") %>%
  bg(bg = "lightgray", part = "body") %>%
  bold(part = "header") %>%
  # italic(part = "body") %>%
  align(align = "center", part = "header") %>%
  align(align = "center", part = "body", j = 2:10) %>% 
  fontsize(size = 12, part = "all") %>%
  color(color = "darkblue", part = "header") %>% 
  autofit() %>%
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output)

Distribution of Cases by Hospital and Age Category( Age in Years)

Hospital

0-4

5-9

10-14

15-19

20-29

30-49

50-69

70+

NA_

Central Hospital

84 (7.7%)

84 (7.7%)

82 (8.7%)

58 (7.8%)

73 (6.8%)

57 (7.6%)

7 (7.4%)

0 (0.0%)

9 (10.5%)

Military Hospital

171 (15.6%)

176 (16.1%)

121 (12.9%)

106 (14.3%)

180 (16.8%)

121 (16.0%)

7 (7.4%)

2 (33.3%)

12 (14.0%)

Missing

281 (25.7%)

268 (24.5%)

219 (23.3%)

196 (26.4%)

275 (25.6%)

177 (23.5%)

22 (23.2%)

3 (50.0%)

28 (32.6%)

Other

163 (14.9%)

154 (14.1%)

162 (17.2%)

104 (14.0%)

167 (15.6%)

104 (13.8%)

19 (20.0%)

0 (0.0%)

12 (14.0%)

Port Hospital

331 (30.2%)

318 (29.0%)

267 (28.4%)

234 (31.5%)

309 (28.8%)

248 (32.9%)

32 (33.7%)

0 (0.0%)

23 (26.7%)

St. Mark's Maternity Hospital (SMMH)

65 (5.9%)

95 (8.7%)

90 (9.6%)

45 (6.1%)

69 (6.4%)

47 (6.2%)

8 (8.4%)

1 (16.7%)

2 (2.3%)

Total

1,095 (100.0%)

1,095 (100.0%)

941 (100.0%)

743 (100.0%)

1,073 (100.0%)

754 (100.0%)

95 (100.0%)

6 (100.0%)

86 (100.0%)

Saving the tabyl

Code
linelist %>% 
  count(hospital) %>% # dyplr function
  adorn_totals() %>%     # janitor function
  flextable() %>% 
  autofit() %>% 
  save_as_docx(path = "tabyl.docx")

Statistics

MORE DETAILS WILL BE NEXT

Code
age_by_outcome <- linelist %>% 
  tabyl(age_cat, outcome, show_na = FALSE)

chisq.test(age_by_outcome)

    Pearson's Chi-squared test

data:  age_by_outcome
X-squared = 6.4931, df = 7, p-value = 0.4835

3.2 USING dyplr{} PACKAGE

Simple Counts in table

If you want to get counts of a variable, you can use the count() function from dplyr. This is similar to tabyl() but does not include percentages or totals.

Code
# Above code can be shortened by count()
linelist %>% 
  count(
    # count number of rows in each age category, package dplyr function
    age_cat) %>%  
  adorn_totals() %>%  # janitor function to add totals row
  flextable() %>%
  autofit()

age_cat

n

0-4

1,095

5-9

1,095

10-14

941

15-19

743

20-29

1,073

30-49

754

50-69

95

70+

6

86

Total

5,888

Proportion

Calculate the Counts to percentage using scales Package

Code
# Calculate percentage of cases in each age category by dividing the count 
# in each category by the total count and multiplying by 100

linelist %>% 
  count(age_cat) %>% 
  mutate(percent = n / sum(n) * 100)  %>% 
  # round the percent to 2 decimal places
  mutate(percent = round(percent, 2)) %>%
  mutate(percent = scales::percent(percent / 100)) %>%
  adorn_totals() %>% # Janitor function to add totals row
  flextable() %>%
  autofit()

age_cat

n

percent

0-4

1,095

18.60%

5-9

1,095

18.60%

10-14

941

15.98%

15-19

743

12.62%

20-29

1,073

18.22%

30-49

754

12.81%

50-69

95

1.61%

70+

6

0.10%

86

1.46%

Total

5,888

-

Code
# The above code can be shortened using the scales package to calculate 
# percent directly in the mutate() function

linelist %>% 
  count(age_cat) %>% 
  # use of scales package to calculate percent directly in the mutate() function
  mutate(percent = scales::percent(n / sum(n))) %>% 
  adorn_totals() %>% # Janitor function to add totals row
  flextable() %>%
  autofit()

age_cat

n

percent

0-4

1,095

18.60%

5-9

1,095

18.60%

10-14

941

15.98%

15-19

743

12.62%

20-29

1,073

18.22%

30-49

754

12.81%

50-69

95

1.61%

70+

6

0.10%

86

1.46%

Total

5,888

-

Calculate proportion within groups

Code
age_by_outcome <-
  linelist %>% 
  group_by(outcome) %>% 
  count(age_cat) %>% 
  mutate(
    # Calculate percentage of cases in each age category within each outcome 
    # group by dividing the count in each category by the total count for that 
    # outcome group and multiplying by 100 
    percent = scales::percent(n / sum(n)))

age_by_outcome 
# A tibble: 26 × 4
# Groups:   outcome [3]
   outcome age_cat     n percent
   <chr>   <fct>   <int> <chr>  
 1 Death   0-4       471 18.242%
 2 Death   5-9       476 18.435%
 3 Death   10-14     438 16.964%
 4 Death   15-19     323 12.510%
 5 Death   20-29     477 18.474%
 6 Death   30-49     329 12.742%
 7 Death   50-69      33 1.278% 
 8 Death   70+         3 0.116% 
 9 Death   <NA>       32 1.239% 
10 Recover 0-4       364 18.36% 
# ℹ 16 more rows
Code
# NOTE: This age_by_outcome table is in long format, 
# we can pivot it to wide format for better presentation

Summary Statisitics

Code
summary_table <-
  linelist %>%
  group_by(hospital) %>% 
  summarise(cases = n(),
            delay_max = max(days_onset_hosp, na.rm = TRUE),
            delay_mean = round(mean(days_onset_hosp, na.rm = TRUE), 2),
            delay_sd = round(sd(days_onset_hosp, na.rm = TRUE), 2),
            delay_3  = sum(days_onset_hosp >= 3, na.rm = TRUE),
            pct_delay_3 = scales::percent(delay_3 / cases)) 
# %>% 
#   flextable() %>% 
#   autofit()
            # mean_age = mean(age_years, na.rm = TRUE),
            # median_age = median(age_years, na.rm = TRUE),
            # sd_age = sd(age_years, na.rm = TRUE)) %>%)
summary_table
# A tibble: 6 × 7
  hospital               cases delay_max delay_mean delay_sd delay_3 pct_delay_3
  <chr>                  <int>     <dbl>      <dbl>    <dbl>   <int> <chr>      
1 Central Hospital         454        12       1.89     1.95     108 24%        
2 Military Hospital        896        15       2.12     2.36     253 28%        
3 Missing                 1469        22       2.08     2.29     399 27%        
4 Other                    885        18       2.05     2.23     234 26%        
5 Port Hospital           1762        16       2.05     2.24     470 27%        
6 St. Mark's Maternity …   422        18       2.08     2.33     116 27%        

Conditional Statistics

Code
linelist %>% 
  group_by(hospital) %>%
  summarise(
    # max temp among those with fever
    max_tem_fever = max(temp[fever == "yes"], na.rm = TRUE),
    # max temp among those without fever
    max_tem_no_fever = max(temp[fever == "no"], na.rm = TRUE) 
    ) %>% 
  flextable() %>%
  autofit()

hospital

max_tem_fever

max_tem_no_fever

Central Hospital

40.4

38.0

Military Hospital

40.5

38.0

Missing

40.6

38.0

Other

40.8

37.9

Port Hospital

40.6

38.0

St. Mark's Maternity Hospital (SMMH)

40.6

37.9

GLUEING TOGETHER str_glue() function from stringr package

Code
summary_table %>% 
  mutate(
    # Combined and format other values in single column
    delay = str_glue("{delay_mean}({delay_sd})")) %>%
  select(-c(delay_mean, delay_sd)) %>%
  select(                      
    # Order and rename cols, package dplyr function
    "Hospital Names" = hospital,
    "Cases"          = cases,
    "Max delay"      = delay_max,
    "Mean (SD)"      = delay,
    "Delay 3+ Days"   = delay_3,
    "% Delay 3+ Days" = pct_delay_3
    
  ) %>% 
  flextable() %>% 
  autofit() %>% 
  theme_box() %>% 
  add_header_row(values = "Summary of Cases and Delays by Hospital", 
                 colwidths = 6) %>%
  # Add aesthetics to flextable
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black",
                                                         width = 1)) %>%
  bg(bg = "lightblue", part = "header") %>%
  bg(bg = "lightgray", part = "body") %>%
  bold(part = "header") %>%
  italic(part = "body") %>%
  align(align = "center", part = "header") %>%
  align(align = "center", part = "body", j = 3:6) %>%
  fontsize(size = 12, part = "all") %>%
  color(color = "darkblue", part = "header") %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output)

Summary of Cases and Delays by Hospital

Hospital Names

Cases

Max delay

Mean (SD)

Delay 3+ Days

% Delay 3+ Days

Central Hospital

454

12

1.89(1.95)

108

24%

Military Hospital

896

15

2.12(2.36)

253

28%

Missing

1,469

22

2.08(2.29)

399

27%

Other

885

18

2.05(2.23)

234

26%

Port Hospital

1,762

16

2.05(2.24)

470

27%

St. Mark's Maternity Hospital (SMMH)

422

18

2.08(2.33)

116

27%

Percentiles

Code
# Get default percentile values of age (0%, 25%, 50%, 75%, 100%)
linelist %>% 
  summarise(age_percentile = quantile(age_years, na.rm = TRUE)) 
  age_percentile
1              0
2              6
3             13
4             23
5             84
Code
# Get manually-specified percentil values of age (0%, 25%, 50%, 75%, 100%)
linelist %>% 
  summarise(age_percentile = quantile(age_years, 
                                      probs = c(.05, 0.5, 0.75, 0.98),
                                      na.rm = TRUE)) %>% 
  flextable() %>% 
  autofit()

age_percentile

1

13

23

48

Percentile Group by hospital

Code
linelist %>% 
  group_by(hospital) %>% 
  summarise(
    p05 = quantile(age_years, probs = 0.05, na.rm = TRUE),
    p50 = quantile(age_years, probs = 0.5, na.rm = TRUE),
    p75 = quantile(age_years, probs = 0.75, na.rm = TRUE),
    p98 = quantile(age_years, probs = 0.98, na.rm = TRUE),
    
  ) %>% 
  flextable() %>% 
  autofit()

hospital

p05

p50

p75

p98

Central Hospital

1

12

21

48.00

Military Hospital

1

13

24

45.00

Missing

1

13

23

48.20

Other

1

13

23

50.00

Port Hospital

1

14

24

49.00

St. Mark's Maternity Hospital (SMMH)

2

12

22

50.24

The dplyr summarise() funtion can be produce with get_summary_stat() function from the statix package.

Code
linelist %>% 
  # Group by hospital, package dplyr function
  group_by(hospital) %>% 
  # Package statix function to get quantiles of age variable
  rstatix::get_summary_stats(age, type = "quantile") %>% 
  # remove variable column and reorder columns
  select(-variable) %>%
  
  # Passing the table to flextable pacakge 
  # to add header rows and aesthetics to the summary table
  flextable() %>% 
  # Step 1: Rename columns first
  set_header_labels(
    hospital = "Hospital",
    variable = "Variable",
    n        = "N",
    p0       = "0%",
    p25      = "25%",
    p50      = "50%",
    p75      = "75%",
    p100     = "100%"
  ) %>%
  # Step 2: Add sub-header grouping row (Percentiles span)
  add_header_row(
    values    = c(" Hospital", "N", "Percentiles"),
    colwidths = c(1, 1, 5),    # hospital | variable | n | 5 percentile cols
    top       = TRUE            # place BELOW the renamed labels row
  ) %>%
  # Step 3: Add main title row at the very top
  add_header_row(
    values    = "Summary of Age Percentiles by Hospital",
    colwidths = 7,    # span all 7 columns
    top       = TRUE                   # place ABOVE everything
  ) %>%
  # Step 4: Add Aesthetics
  theme_box() %>%
  autofit() %>% 
  # Merge Header rows for Hospital and N columns (no sub-columns needed)
  merge_at(part = "header",
           j    = 1, # hospital and n columns
           i    = 2:3) %>% # merge first two header rows
  merge_at(part = "header",
           j    = 2, # variable column
           i    = 2:3) %>% # merge first two header rows
  
  align(align = "center", part = "header") %>%
  align(j = 3:7, align = "center", part = "body") %>%
  bold(part = "header") %>%
  bg(part = "header", bg = "#2C3E50") %>%
  color(part = "header", color = "white") %>%
  fontsize(i = 1, part = "header", size = 13) %>%   # main title font size
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>%
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output_

Summary of Age Percentiles by Hospital

Hospital

N

Percentiles

0%

25%

50%

75%

100%

Central Hospital

445

0

6

12

21

58

Military Hospital

884

0

6

14

24

72

Missing

1,441

0

6

13

23

76

Other

873

0

6

13

23

69

Port Hospital

1,739

0

6

14

24

68

St. Mark's Maternity Hospital (SMMH)

420

0

7

12

22

84

SUMMARISE AGGREGATE DATA

Code
linelist_arg <- 
linelist %>% 
  # package dplyr function to drop NA values in gender & outcome variables
  drop_na(gender, outcome) %>% 
  count(outcome, gender) #%>% 
  
linelist_arg
  outcome gender    n
1   Death      f 1227
2   Death      m 1228
3 Recover      f  953
4 Recover      m  950
Code
linelist_arg %>% 
  # Part 1: Use dplyr to summarise counts by outcome and gender
  group_by(outcome) %>% 
  summarise( # summarise counts
    total_cases  = sum(n, na.rm = TRUE),
    male_cases   = sum(n[gender == "m"], na.rm = TRUE),
    female_cases = sum(n[gender == "f"], na.rm = TRUE)
  ) %>% 
  adorn_totals() %>%
  
  # Part 2: Passing the table to flextable column Rename and add aesthetics 
  # to the summary table
  flextable() %>%
  theme_box() %>%
  # Step 1: Rename columns
  set_header_labels(
    outcome      = "Outcome",
    total_cases  = "Total Cases",
    male_cases   = "Males",
    female_cases = "Females"
  ) %>%
  # Step 2: Add sub-header row grouping Gender columns
  add_header_row(
    values    = c("", "Total Cases", "Gender"),
    colwidths = c(1, 1, 2),
    top       = TRUE
  ) %>%
  # Step 3: Merge first two rows of Outcome column (no sub-column needed)
  merge_at(part = "header",
           j    = 1,
           i    = 1:2) %>%
  # Step 4: Restore "Outcome" label lost after merge
  flextable::compose(
    part = "header",
    i    = 1,          # merged cell sits at row 1 after merge
    j    = 1,
    value = as_paragraph("Outcome")
  ) %>%
  # Step 5: Also merge "Total Cases" rows so it spans both header rows
  merge_at(part = "header",
           j    = 2,
           i    = 1:2) %>%
  # Step 6: Add main title at the very top
  add_header_row(
    values    = "Summary of Cases by Outcome and Gender",
    colwidths = 4,
    top       = TRUE
  ) %>%
  # Step 7: Add aesthetics to the table
  align(align = "center", part = "header") %>%
  align(align = "center", part = "body", j = 2:4) %>%
  bold(part  = "header") %>%
  bold(part  = "body", i = 3) %>%
  bg(part = "header", bg = "#2C3E50") %>%
  color(part = "header", color = "white") %>%
  # font size first header & Second header only
   fontsize(i = 1, part = "header", size = 13) %>%
   fontsize(i = 2, part = "header", size = 11) %>%
  border_outer(part = "all", border = officer::fp_border(color = "black",
                                                         width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black",
                                                         width = 1)) %>%
  autofit()

Summary of Cases by Outcome and Gender

Outcome

Total Cases

Gender

Males

Females

Death

2,455

1,228

1,227

Recover

1,903

950

953

Total

4,358

2,178

2,180

Multiple Columns using across()

across() function from dplyr can be used to apply a function to multiple columns at once. This is useful when you want to calculate summary statistics for multiple groups or categories in your data.

Code
linelist %>% 
  group_by(outcome,
           # show NA values in outcome variable as "Unknown" instead of
           outcome = ifelse(is.na(outcome), "Unknown", outcome)
           ) %>%
  # Create column using across. Summarise across by mean() and sd()
  summarise(across(.cols = c(age_years, temp, wt_kg,  ht_cm),
                   .fns = list("mean" = mean, "sd" = sd),
                   na.rm = TRUE)) %>%
  # round values to 2 decimal places
  mutate(across(.cols = c(age_years_mean,
                          age_years_sd,
                          temp_mean, 
                          temp_sd,
                          wt_kg_mean,
                          wt_kg_sd,
                          ht_cm_mean,
                          ht_cm_sd),
                .fns = ~ round(.x, 2))) %>%
  
  mutate(
    # Combined and format other values in single column
    temp_mean_sd    = str_glue("{temp_mean}({temp_sd})"),
    age_yrs_mean_sd = str_glue("{age_years_mean}({age_years_sd})"),
    wt_mean_sd      = str_glue("{wt_kg_mean}({wt_kg_sd})"),
    ht_cm_mean_sd   = str_glue("{ht_cm_mean}({ht_cm_sd}")) %>%
  
  select(-c(temp_mean,
            temp_sd,
            age_years_mean,
            age_years_sd,
            wt_kg_mean,
            wt_kg_sd,
            ht_cm_mean,
            ht_cm_sd)) %>%
flextable() %>%
  theme_box() %>%
  set_header_labels(outcome = "Outcome",
                    temp_mean_sd = "Temeprature",
                    age_yrs_mean_sd = "Age(Years)",
                    wt_mean_sd = "Weigth (kg)",
                    ht_cm_mean_sd = "Height (cm)") %>%
  add_header_row(
    values    = c("", "Mean (SD)"),
    colwidths = c(1, 4),
    top       = FALSE ) %>%
  merge_at(part = "header",
           j = 1,
           i = 1:2) %>%
  add_header_row(
    values    = "Summary of Clinical Characteristics by Outcome",
    colwidths = 5,
    top       = TRUE ) %>%
   border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                          width = 2)) %>%
   border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                          width = 1)) %>%
   bg(part = "header", bg = "#2C3E50") %>%
   color(part = "header", color = "white") %>%
   fontsize(i = 1, part = "header", size = 13) %>%
   fontsize(i = 2, part = "header", size = 11) %>%
   fontsize(i = 3, part = "header", size = 10) %>%
  align(align = "center", part = "header") %>%
  align(j = 2:5, align = "right", part = "body") %>%
  bold(part = "header") %>%
  italic(part = "header",
         i = 3) %>%
  autofit()

Summary of Clinical Characteristics by Outcome

Outcome

Temeprature

Age(Years)

Weigth (kg)

Height (cm)

Mean (SD)

Death

38.56(0.96)

15.88(12.31)

52.59(18.4)

124.86(48.67

Recover

38.56(1)

16.06(12.96)

52.48(18.64)

124.98(50.06

Unknown

38.55(0.98)

16.24(12.81)

53.01(18.87)

125.13(50.38

Pivot wider

Piwot wider() function from tidyr package can be used to reshape data from long format to wide format. This is useful when you want to create a summary table that shows the distribution of a variable across different groups or categories.

Code
age_by_outcome %>% 
  select(-percent) %>%
  pivot_wider(names_from = age_cat, values_from = n) %>% 
  # Rename NA values in outcome variable to "Unknown"
  mutate(outcome = ifelse(is.na(outcome), "Unknown", outcome)) %>%
  flextable() %>%
  theme_box() %>%
  set_header_labels(outcome = "Outcome",
                    `NA` = "Unknown",
                    `0-4` = "0 to 4 years",
                    `5-9` = "5 to 9 years",
                    `10-14` = "10 to 14 years",
                    `15-19` = "15 to 19 years",
                    `20-29` = "20 to 29 years",
                    `30-49` = "30 to 49 years",
                    `50-69` = "50 to 69 years",
                    `70+`   = "70+ years") %>%
    add_header_row(
    values    = c("Outcomes", "Age Categories"),
    colwidths = c(1, 9),
    top       = T ) %>%
  add_header_row(
    values    = "Distribution of Cases by Age Category and Outcome",
    colwidths = 10,
    top       = TRUE ) %>%
  merge_at(part = "header",
           j = 1,
           i = 2:3) %>%
  border_outer(part = "all", border = officer::fp_border(color = "black", width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black", width = 1)) %>%
  bg(part = "header", bg = "#2C3E50") %>%
  color(part = "header", color = "white") %>%
  fontsize(i = 1, part = "header", size = 13) %>% 
  fontsize(i = 2, part = "header", size = 11) %>%
  fontsize(i = 3, part = "header", size = 10) %>%
  align(align = "center", part = "header") %>%
  align(j = 2:10, align = "center", part = "body") %>% 
  bold(part = "header") %>%
  italic(part = "header",
         i = 2) %>%
  autofit() %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output

Distribution of Cases by Age Category and Outcome

Outcomes

Age Categories

0 to 4 years

5 to 9 years

10 to 14 years

15 to 19 years

20 to 29 years

30 to 49 years

50 to 69 years

70+ years

Unknown

Death

471

476

438

323

477

329

33

3

32

Recover

364

391

303

251

367

238

38

3

28

Unknown

260

228

200

169

229

187

24

26

TOTAL ROWS

janitor package adorn_totals() function can be used to add a total row to a summary table. This is useful for providing a summary of the overall data in addition to the breakdown by groups.

Code
linelist %>% 
  group_by(gender) %>%
  summarise(
    # Number of rows in group where outcome is not missing
    known_outcome = sum(!is.na(outcome)), 
    # Number of rows in group where outcome is Death
    n_death  = sum(outcome == "Death", na.rm=T),
    # Number of rows in group where outcome is Recovered
    n_recover = sum(outcome == "Recover", na.rm=T), 
  ) %>% 
  # Recode gender values before adorning
  mutate(
    gender = case_match(
      gender,
      "f" ~ "Female",
      "m" ~ "Male",
      .default = gender,       # keeps NA or any other value as-is
      # Raname NA values to "Unknown"
      NA ~ "Unknown"
    )
  ) %>%
  adorn_totals() %>%            # Adorn total row (sums of each numeric column)
  adorn_percentages("col") %>%  # Get column proportions
  adorn_pct_formatting() %>%    # Convert proportions to percents
  adorn_ns(position = "front") %>%# display % and counts (with counts in front)
  flextable() %>%
  autofit() %>% 

# Add aesthetics to flextable
theme_box() %>% 
  border_outer(part = "all", border = officer::fp_border(color = "black", width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black", width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "all") %>% 
  # Add header rows
  add_header_row(values = "Summary of Outcomes by Gender", colwidths = 4) %>% 
  set_header_labels(gender = "Gender",
                    known_outcome = "Known Outcome",
                    n_death = "Deaths",
                    n_recover = "Recovered") %>%
  align(align = "center", part = "header") %>%
align(align = "right", part = "body", j = 2:4)

Summary of Outcomes by Gender

Gender

Known Outcome

Deaths

Recovered

Female

2,180 (47.8%)

1,227 (47.5%)

953 (48.1%)

Male

2,178 (47.7%)

1,228 (47.6%)

950 (47.9%)

Unknown

207 (4.5%)

127 (4.9%)

80 (4.0%)

Total

4,565 (100.0%)

2,582 (100.0%)

1,983 (100.0%)

summarise() on “Total” data and then bind_rows()

Group by hospital and outcome, then get mean ct_blood value for each

Code
by_hospital <- 
  linelist %>% 
  # Remove cases with NA or missing outcome and hospital values
  filter(!is.na(outcome) & hospital != "Missing") %>%
  group_by(hospital, outcome) %>%
  summarise(
    N  = n(),
    ct_value = mean(ct_blood, na.rm = TRUE)
  )

by_hospital
# A tibble: 10 × 4
# Groups:   hospital [5]
   hospital                             outcome     N ct_value
   <chr>                                <chr>   <int>    <dbl>
 1 Central Hospital                     Death     193     21.3
 2 Central Hospital                     Recover   165     21.4
 3 Military Hospital                    Death     399     21.2
 4 Military Hospital                    Recover   309     21.1
 5 Other                                Death     395     21.5
 6 Other                                Recover   290     21.2
 7 Port Hospital                        Death     785     21.2
 8 Port Hospital                        Recover   579     21.2
 9 St. Mark's Maternity Hospital (SMMH) Death     199     21.3
10 St. Mark's Maternity Hospital (SMMH) Recover   126     21.1

Groups totals by outcome only, ignoring hospital groups, to get overall mean ct_blood value for each outcome category

Code
totals <- 
  linelist %>% 
  # Remove cases with NA or missing outcome and hospital values
  filter(!is.na(outcome) & hospital != "Missing") %>%
  group_by(outcome) %>%
  summarise(
    N  = n(),
    ct_value = mean(ct_blood, na.rm = TRUE)
  )

totals
# A tibble: 2 × 3
  outcome     N ct_value
  <chr>   <int>    <dbl>
1 Death    1971     21.3
2 Recover  1469     21.2

Bind the by_hospital and totals tables together using bind_rows()

Code
# Bind the by_hospital and totals tables together using bind_rows() function 
# from dplyr package
table_long <- bind_rows(by_hospital, totals) %>% 
  mutate(
    # Replace NA values in hospital column with "Total" to indicate these rows
    # are the overall totals for each outcome category
    hospital = replace_na(hospital, "Total")) 

table_long
# A tibble: 12 × 4
# Groups:   hospital [6]
   hospital                             outcome     N ct_value
   <chr>                                <chr>   <int>    <dbl>
 1 Central Hospital                     Death     193     21.3
 2 Central Hospital                     Recover   165     21.4
 3 Military Hospital                    Death     399     21.2
 4 Military Hospital                    Recover   309     21.1
 5 Other                                Death     395     21.5
 6 Other                                Recover   290     21.2
 7 Port Hospital                        Death     785     21.2
 8 Port Hospital                        Recover   579     21.2
 9 St. Mark's Maternity Hospital (SMMH) Death     199     21.3
10 St. Mark's Maternity Hospital (SMMH) Recover   126     21.1
11 Total                                Death    1971     21.3
12 Total                                Recover  1469     21.2

Pivot Table_long to wider for presentation

Code
table_long %>% 
  # Pivot wider and format
  ##################
mutate(hopital = replace_na(hospital, "Total")) %>%
  # Pivot wider to create separate columns for each outcome category, 
  # with values from N and ct_value columns
  pivot_wider(names_from = outcome,values_from = c(N, ct_value)) %>%  
  mutate(
    # Calculate total N for known outcomes by summing Death and Recover counts
    N_Known = N_Death + N_Recover,  
    # Calculate percentage of Deaths among known outcomes
    Pct_Death = scales::percent(N_Death / N_Known, 0.1),
    # Calculate percentage of Recoveries among known outcomes
    Pct_Recover = scales::percent(N_Recover / N_Known, 0.1)) %>% 
  select(hospital, N_Known, 
         N_Recover, Pct_Recover, ct_value_Recover,
         N_Death, Pct_Death, ct_value_Death) %>% 
    arrange(N_Known) %>% # arrange rows by total N for known outcomes
  flextable() %>%
  theme_box() %>%
    add_header_row(
    values    = c("Hospital", "Known Outcomes", "Recoveries", "Deaths"),
    colwidths = c(1, 1, 3, 3),
    top       = TRUE ) %>%
  # Add aesthetics to the table
  set_header_labels(hospital = "Hospital",
                    N_Known = "Known Outcomes",
                    N_Recover = "Total",
                    Pct_Recover = "% of Cases",
                    ct_value_Recover = "Mean Ct Value",
                    N_Death = "Total",
                    Pct_Death = "% of Cases",
                    st_value_Death = "Mean Ct Value") %>%
  add_header_row(
    values    = "Summary of Mean Ct Values by Hospital and Outcome",
    colwidths = 8,
    top       = TRUE ) %>%
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>%
  bg(part = "header", bg = "#2C3E50") %>%
  color(part = "header", color = "white") %>%
  fontsize(i = 1, part = "header", size = 13) %>% 
  fontsize(i = 2, part = "header", size = 11) %>%
  align(align = "center", part = "header") %>%
  align(j = 2:8, align = "center", part = "body") %>%
  bold(part = "header") %>% 
  merge_at(part = "header", j = 2, i = 2:3) %>% # Merge Known Outcomes header
  merge_at(part = "header", j = 1, i = 2:3) %>% # Merge Recoveries header)
  # italic(part = "body",
  #        i = nrow(table_long)) %>% # italicize total row
  autofit() %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output)

Summary of Mean Ct Values by Hospital and Outcome

Hospital

Known Outcomes

Recoveries

Deaths

Total

% of Cases

Mean Ct Value

Total

% of Cases

ct_value_Death

St. Mark's Maternity Hospital (SMMH)

325

126

38.8%

21.14286

199

61.2%

21.31156

Central Hospital

358

165

46.1%

21.41818

193

53.9%

21.25389

Other

685

290

42.3%

21.15172

395

57.7%

21.45316

Military Hospital

708

309

43.6%

21.09385

399

56.4%

21.23308

Port Hospital

1,364

579

42.4%

21.17444

785

57.6%

21.20127

Total

3,440

1,469

42.7%

21.17767

1,971

57.3%

21.27448

3.3 USING gtsummary{} PACKAGE

The default summary table from gtsummary package, with function tbl_summary() to get summary statistics for all variables in the dataset.

The function prints statistics appropriate to the column class: median and interquartile range (ICR) for numeric columns and count (%) for the categorical columns. Missing values are converted into “Unknown”. Footnotes are added to the bottom to explain the statistics, while the total N is shown at the top

Code
linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>% 
  tbl_summary() %>% 
  # modify_caption("**Summary of Clinical Characteristics**") %>% 
  as_kable_extra(
    booktabs  = TRUE, # Use booktabs style for better formatting in PDF output
    linesep   = "", # Remove extra lines between rows for a cleaner look
    longtable = FALSE # Disable longtable for better control over table formatting in PDF output
  )
Table 1: Summary of Clinical Characterisitcs
Characteristic N = 5,888
age_years 13 (6, 23)
Unknown 86
gender
f 2,807 (50%)
m 2,803 (50%)
Unknown 278
outcome
Death 2,582 (57%)
Recover 1,983 (43%)
Unknown 1,323
fever 4,549 (81%)
Unknown 249
temp 38.80 (38.20, 39.20)
Unknown 149
hospital
Central Hospital 454 (7.7%)
Military Hospital 896 (15%)
Missing 1,469 (25%)
Other 885 (15%)
Port Hospital 1,762 (30%)
St. Mark's Maternity Hospital (SMMH) 422 (7.2%)
1 Median (Q1, Q3); n (%)

Adjustments to capture specific statistics for numeric and categorical variables

For Categorical variables, we can specify the statistic to be count and percentage using the argument statistic = list(all_categorical() ~ “{n} ({p}%)”). For numeric variables, we can specify the statistic to be mean and standard deviation using the argument statistic = list(all_numeric() ~ “{mean} ({sd})”).

Code
linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>% 
  tbl_summary(by = outcome # stratify summary statistics by outcome variable
              ) %>%  # For numeric variables, show mean and standard deviation)
  # modify_caption("**Summary of Clinical Characteristics by Outcome**") %>% 
  as_kable_extra(
    booktabs  = TRUE, # Use booktabs style for better formatting in PDF output
    linesep   = "", # Remove extra lines between rows for a cleaner look
    longtable = FALSE # Disable longtable for better control over table formatting in PDF output
  )
Table 2: Summary of Clinical Characterisits by Outcome
Characteristic Death
N = 2,582
Recover
N = 1,983
age_years 13 (6, 23) 13 (6, 23)
Unknown 32 28
gender
f 1,227 (50%) 953 (50%)
m 1,228 (50%) 950 (50%)
Unknown 127 80
fever 2,002 (81%) 1,543 (81%)
Unknown 122 79
temp 38.80 (38.20, 39.20) 38.80 (38.20, 39.25)
Unknown 60 55
hospital
Central Hospital 193 (7.5%) 165 (8.3%)
Military Hospital 399 (15%) 309 (16%)
Missing 611 (24%) 514 (26%)
Other 395 (15%) 290 (15%)
Port Hospital 785 (30%) 579 (29%)
St. Mark's Maternity Hospital (SMMH) 199 (7.7%) 126 (6.4%)
1 Median (Q1, Q3); n (%)
Code
linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>% 
  tbl_summary(
    # stratify summary statistics by outcome variable
    by = outcome, 
    # For numeric variables, show mean and standard deviation
    statistic = list(all_continuous() ~ "{mean} ({sd})", 
                all_categorical() ~ "{n} / {N} ({p}%)"),
    label = list(
      outcome = "Patient Outcome",
      age_years = "Age (Years)",
      gender = "Gender",
      fever = "Fever",
      temp = "Temperature (°C)",
      hospital = "Hospital", 
      # Replace NA values with "Missing" in the summary table
      missing_text = "Missing" 
      
    )) %>% 
   # modify_caption("**Summary of Clinical Characteristics by Outcome**") %>% 
  as_kable_extra(
    booktabs  = TRUE, # Use booktabs style for better formatting in PDF output
    linesep   = "", # Remove extra lines between rows for a cleaner look
    longtable = FALSE # Disable longtable for better control over table formatting in PDF output
  )
Table 3: Summary of Clinical Characteristics by Outcome
Characteristic Death
N = 2,582
Recover
N = 1,983
Age (Years) 16 (12) 16 (13)
Unknown 32 28
Gender
f 1,227 / 2,455 (50%) 953 / 1,903 (50%)
m 1,228 / 2,455 (50%) 950 / 1,903 (50%)
Unknown 127 80
Fever 2,002 / 2,460 (81%) 1,543 / 1,904 (81%)
Unknown 122 79
Temperature (°C) 38.56 (0.96) 38.56 (1.00)
Unknown 60 55
Hospital
Central Hospital 193 / 2,582 (7.5%) 165 / 1,983 (8.3%)
Military Hospital 399 / 2,582 (15%) 309 / 1,983 (16%)
Missing 611 / 2,582 (24%) 514 / 1,983 (26%)
Other 395 / 2,582 (15%) 290 / 1,983 (15%)
Port Hospital 785 / 2,582 (30%) 579 / 1,983 (29%)
St. Mark's Maternity Hospital (SMMH) 199 / 2,582 (7.7%) 126 / 1,983 (6.4%)
1 Mean (SD); n / N (%)

Multi-line stats for Continuos variables

Code
linelist %>% 
  select(age_years, temp, ct_blood, wt_kg, outcome) %>% 
  tbl_summary(
    by = outcome, # stratify summary statistics by outcome variable
    type = all_continuous() ~ "continuous2", # Use continuous2 type for multi-line stats
    statistic = list(all_continuous() ~ c("{mean} ({sd})", 
                                           "{median} ({p25}, {p75})",
                                          "{min}, {max}")),
    label = list(
      age_years = "Age (Years)",
      temp = "Temperature (°C)",
      ct_blood = "Ct Value",
      wt_kg = "Weight (kg)"
    )) %>% 
  add_p() %>% # Add p-values for comparisons across outcome groups
  modify_header(label = "**Variable**") %>% # Bold variable names in header
  # modify_caption(
    # "**Summary of Clinical Characteristics by Outcome with Multi-line Stats**") %>% 
  as_kable_extra(
    booktabs  = TRUE, # Use booktabs style for better formatting in PDF output
    linesep   = "", # Remove extra lines between rows for a cleaner look
    longtable = FALSE # Disable longtable for better control over table formatting in PDF output
  )
Table 4: Summary of Clinical Characteristics by Outcome with Multi-line Stats
Variable Death
N = 2,582
Recover
N = 1,983
p-value
Age (Years) 0.8
Mean (SD) 16 (12) 16 (13)
Median (Q1, Q3) 13 (6, 23) 13 (6, 23)
Min, Max 0, 76 0, 84
Unknown 32 28
Temperature (°C) 0.6
Mean (SD) 38.56 (0.96) 38.56 (1.00)
Median (Q1, Q3) 38.80 (38.20, 39.20) 38.80 (38.20, 39.25)
Min, Max 35.30, 40.60 35.20, 40.70
Unknown 60 55
Ct Value 0.044
Mean (SD) 21.26 (1.65) 21.14 (1.74)
Median (Q1, Q3) 22.00 (20.00, 22.00) 21.00 (20.00, 22.00)
Min, Max 16.00, 26.00 16.00, 25.00
Weight (kg) 0.7
Mean (SD) 53 (18) 52 (19)
Median (Q1, Q3) 54 (42, 66) 54 (40, 65)
Min, Max -11, 106 -10, 111
1 Wilcoxon rank sum test

3.4 REGRESSION MODELS Using gtsummary package

This section demonstrates how to use the gtsummary package to create summary tables for regression models, including linear regression, logistic regression, and Uni-variate and Multivariable regression models. The package provides functions to easily extract and format model results, such as coefficients, confidence intervals, p-values, and model fit statistics.

This section also explores how to customize the appearance of regression summary tables using gtsummary’s formatting options, as well as how to use kableExtra for enhanced table styling in PDF outputs.

3.4.1 Exploratory Data Completeness USE of gt{} Package

Code
# ── Build missing data summary ────────────────────────────────────────────────
missing_summary <- linelist %>%
  select(-case_id, -generation, - age_unit, -age ) %>% 
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Missing_n") %>%
  # Group
  mutate(variable_cat = case_when(
    Variable %in% c(
      "outcome",
      "infector",
      "source")
      ~ "OUTCOME & IDENTIFIERS",
    Variable %in% c(
      "date_infection",
      "date_outcome",
      "date_onset",
      "date_hospitalisation",
      "time_admission",
      "days_onset_hosp")
      ~ "DATES",
    Variable %in% c(
      "gender",
      "age_years",
      "age_cat",
      "age_cat5",
      "hospital")
      ~ "DEMOGRAPHICS",
    Variable %in% c(
      "temp",
      "fever",
      "chills",
      "cough",
      "aches",
      "vomit")
      ~ "CLINICAL",
    Variable %in% c(
      "ct_blood",
      "wt_kg",
      "ht_cm",
      "bmi",
      "lon",
      "lat")
      ~ "MEASUREMENTS (COMPLETE)"
    )
    ) %>% 
  mutate(
    Total         = nrow(linelist),
    Complete_n    = Total - Missing_n,
    Complete_rate = Complete_n / Total,
    Missing_pct   = Missing_n / Total, # * 100,
    Concern = case_when(
      Complete_rate == 1.00          ~ "✅ Complete",
      Complete_rate >= 0.90          ~ "🟢 Low",
      Complete_rate >= 0.70          ~ "🟡 Moderate",
      TRUE                           ~ "🔴 High"
    )
  ) %>%
  
  arrange(Complete_rate)

# ── Render with gt ────────────────────────────────────────────────────────────
missing_summary %>%
  select(variable_cat, Variable, Missing_n, Complete_n, Complete_rate, Missing_pct, Concern) %>%
  group_by(variable_cat) %>% 
  gt() %>%
  tab_header(
    title    = md("**Missing Data Summary**"),
    subtitle = md(glue::glue("*linelist* — {nrow(linelist)} total observations"))
  ) %>%
  cols_label(
    Variable      = "Variable",
    Missing_n     = "Missing (n)",
    Complete_n    = "Complete (n)",
    Complete_rate = "Complete Rate",
    Missing_pct   = "Missing (%)",
    Concern       = "Concern Level"
  ) %>%
  fmt_number(columns = c(Missing_n, Complete_n), decimals = 0) %>%
  fmt_percent(columns = c(Complete_rate,Missing_pct), decimals = 2) %>% 
  # fmt_percent(columns = Complete_rate, decimals = 1) %>%
  # fmt_number(columns = Missing_pct, decimals = 1, suffix = "%") %>%
  # Colour-code the complete rate column
  data_color(
    columns = Complete_rate,
    method  = "numeric",
    palette = c("#E24B4A", "#EF9F27", "#639922"),
    domain  = c(0, 1)
  ) %>%
  # # Add an inline bar using gtExtra
  # gt_plt_bar_pct(
  #   column     = Complete_rate,
  #   scaled     = T,
  #   fill       = "steelblue",
  #   # background = "#f2f2f2"
  # ) %>%
  cols_label(Complete_rate = "Completeness") %>%
  # opt_row_striping()
  # Bold rows with high missingness
  tab_style(
    style     = cell_text(weight = "bold"),
    locations = cells_body(rows = Complete_rate < 0.70)
  ) %>%
  # Stripe rows
  opt_row_striping() %>%
  tab_options(
    table.font.size        = 13,
    heading.align          = "left",
    column_labels.font.weight = "bold"
  ) %>%
  
  tab_footnote(
    footnote  = "High missingness (> 70%) may affect analysis validity.",
    locations = cells_column_labels(columns = Concern)
  )
Missing Data Summary
linelist — 5888 total observations
Variable Missing (n) Complete (n) Completeness Missing (%) Concern Level1
OUTCOME & IDENTIFIERS
infector 2,088 3,800 64.54% 35.46% 🔴 High
source 2,088 3,800 64.54% 35.46% 🔴 High
outcome 1,323 4,565 77.53% 22.47% 🟡 Moderate
DATES
date_infection 2,087 3,801 64.56% 35.44% 🔴 High
date_outcome 936 4,952 84.10% 15.90% 🟡 Moderate
time_admission 765 5,123 87.01% 12.99% 🟡 Moderate
date_onset 256 5,632 95.65% 4.35% 🟢 Low
days_onset_hosp 256 5,632 95.65% 4.35% 🟢 Low
date_hospitalisation 0 5,888 100.00% 0.00% ✅ Complete
DEMOGRAPHICS
gender 278 5,610 95.28% 4.72% 🟢 Low
age_years 86 5,802 98.54% 1.46% 🟢 Low
age_cat 86 5,802 98.54% 1.46% 🟢 Low
age_cat5 86 5,802 98.54% 1.46% 🟢 Low
hospital 0 5,888 100.00% 0.00% ✅ Complete
CLINICAL
fever 249 5,639 95.77% 4.23% 🟢 Low
chills 249 5,639 95.77% 4.23% 🟢 Low
cough 249 5,639 95.77% 4.23% 🟢 Low
aches 249 5,639 95.77% 4.23% 🟢 Low
vomit 249 5,639 95.77% 4.23% 🟢 Low
temp 149 5,739 97.47% 2.53% 🟢 Low
MEASUREMENTS (COMPLETE)
lon 0 5,888 100.00% 0.00% ✅ Complete
lat 0 5,888 100.00% 0.00% ✅ Complete
wt_kg 0 5,888 100.00% 0.00% ✅ Complete
ht_cm 0 5,888 100.00% 0.00% ✅ Complete
ct_blood 0 5,888 100.00% 0.00% ✅ Complete
bmi 0 5,888 100.00% 0.00% ✅ Complete
1 High missingness (> 70%) may affect analysis validity.

3.4.2 Linear Regression Model

Code
# ── Linear Regression Example ──────────────────────────────────────────────

# Step 1: Fit the model
lm_model <- lm(age ~ gender + hospital + outcome, data = linelist)

# Step 2: Create the summary table
tbl_regression(lm_model,
               label        = list(
      gender   ~ "Gender",
      hospital ~ "Hospital",
      outcome  ~ "Outcome"

      )) %>% 
  
  bold_p(t = 0.05) %>%                        # Bold p-values < 0.05
  bold_labels() %>%                            # Bold variable labels
  italicize_levels() %>%                       # Italicize variable levels
  add_significance_stars() %>%                 # Add * ** *** stars
  add_glance_source_note() %>%                 # Add model fit stats (R², AIC)
  modify_header(label = "**Variable**") %>%    # Rename label column
  # modify_caption("**Linear Regression Model: Predictors of Age**") %>% 
  
  as_kable_extra(                             # For PDF output
    booktabs  = TRUE,
    linesep   = "",
    longtable = FALSE
  ) %>%
  kable_styling(
    latex_options = c(
      "hold_position",
      "scale_down",
      "striped"),
    stripe_color  = "#f2f2f2",
    position      = "center",
    full_width = FALSE,
    font_size = 14
    ) 
Table 5: Linear Regression Model: Predictors of Age
Variable Beta SE
Gender
f
m 6.9*** 0.367
Hospital
Central Hospital
Military Hospital 0.78 0.802
Missing 0.24 0.751
Other 0.94 0.807
Port Hospital 0.86 0.734
St. Mark's Maternity Hospital (SMMH) 0.48 0.948
Outcome
Death
Recover 0.27 0.370
R² = 0.076; Adjusted R² = 0.074; Sigma = 12.1; Statistic = 51.0; p-value = <0.001; df = 7; Log-likelihood = -17,045; AIC = 34,107; BIC = 34,164; Deviance = 636,703; Residual df = 4,350; No. Obs. = 4,358
Abbreviations: CI = Confidence Interval, SE = Standard Error
1 *p<0.05; **p<0.01; ***p<0.001

3.4.3 Logistic Regression Model

Logistic Regression model is used to model the relationship between a binary outcome variable and one or more predictor variables. The gtsummary package can be used to create a summary table of the logistic regression model results, including odds ratios (OR), confidence intervals, p-values, and model fit statistics.

Code
# ── Logistic Regression Example ────────────────────────────────────────────

# Step 1: Fit the model
logit_model <- glm(
  outcome ~ age + gender + hospital + ct_blood + temp + wt_kg + ht_cm, data   = linelist %>% 
    mutate(outcome = ifelse(outcome == "Death", 1, 0)),
  family = binomial
)

# Step 2: Create the summary table
tbl_regression(
  logit_model,
  exponentiate = TRUE,               # Converts log-odds to Odds Ratios (OR)
  label        = list(
      age      ~ "Age",
      gender   ~ "Gender",
      hospital ~ "Hospital",
      ct_blood ~ "CT Values",
      temp     ~ "Temperature (°C)",
      wt_kg    ~ "Weight (kg)",
      ht_cm    ~ "Height (cm)")) %>%
  bold_p(t    = 0.05) %>%           # Bold significant p-values
  bold_labels() %>%                 # Bold variable names
  italicize_levels() %>%            # Italicize category levels
  add_significance_stars() %>%      # Add * ** *** to significant rows
  add_n() %>%                       # Add number of observations
  add_nevent() %>%                  # Add number of events (Deaths)
  add_glance_source_note() %>%      # Add AIC, BIC, log-likelihood
  modify_header(
    label    = "**Variable**",
    estimate = "**OR**",            # Rename estimate column to OR
    ci       = "**95% CI**"
  ) %>%
  # modify_caption("**Logistic Regression: Predictors of Death Outcome**") %>% 
  
# Step 3: For PDF output, use kableExtra for enhanced formatting
  as_kable_extra(
    booktabs  = TRUE,
    linesep   = "",
    longtable = FALSE
  )  %>%
  kable_styling(
    latex_options = c(
      "hold_position",
      "scale_down",
      "striped"),
    stripe_color  = "#f2f2f2",
    position      = "center",
    full_width = TRUE,
    font_size = 14) #%>%
#   column_spec(1, width = "6cm") %>%   # Variable column - wider
#   column_spec(2, width = "3cm") %>%   # N
#   column_spec(3, width = "3cm") %>%   # Event (N)
#   column_spec(4, width = "2.5cm") %>% # OR
#   column_spec(5, width = "2.5cm") %>% # SE
#   column_spec(6, width = "2.5cm")     # 95%CI
Table 6: Logistic Regression: Predictors of Death Outcome
Variable N Event N OR SE 95% CI
Age 4,243 2,395 0.99 0.005 0.98, 1.00
Gender 4,243 2,395
f
m 1.00 0.067 0.87, 1.14
Hospital 4,243 2,395
Central Hospital
Military Hospital 1.11 0.135 0.85, 1.45
Missing 0.98 0.126 0.76, 1.25
Other 1.14 0.136 0.88, 1.49
Port Hospital 1.10 0.123 0.86, 1.40
St. Mark's Maternity Hospital (SMMH) 1.37 0.161 1.00, 1.88
CT Values 4,243 2,395 1.03 0.019 1.00, 1.07
Temperature (°C) 4,243 2,395 1.00 0.032 0.94, 1.06
Weight (kg) 4,243 2,395 1.00 0.004 1.00, 1.01
Height (cm) 4,243 2,395 1.00 0.002 1.00, 1.00
Null deviance = 5,811; Null df = 4,242; Log-likelihood = -2,899; AIC = 5,822; BIC = 5,898; Deviance = 5,798; Residual df = 4,231; No. Obs. = 4,243
Abbreviations: CI = Confidence Interval, OR = Odds Ratio, SE = Standard Error
1 *p<0.05; **p<0.01; ***p<0.001

3.4.4 Univariate and Multivariable Logistic Regression Models with gt summary tables

Univariate and Multivariable Logistic Regression models are used to model the relationship between a binary outcome variable and one or more predictor variables, while controlling for potential confounders. The gtsummary package can be used to create summary tables for both univariate and multivariable logistic regression models, allowing for easy comparison of the results.

Code
# Univariate and Multivariable Logistic Regression Models with gt summary tables

# ── Step 1: Prepare the dataset ──────────────────────────────────────────────
linelist_model <- linelist %>%
  mutate(
    # Recode outcome to binary (1 for Death, 0 for Recover)
    outcome = ifelse(outcome == "Death", 1, 0)) %>%
  select(outcome, age, gender, hospital, ct_blood, temp, wt_kg, ht_cm) %>%
  drop_na()

# ── Step 2: Define variable labels ───────────────────────────────────────────
var_labels <- list(
  age      ~ "Age",
  gender   ~ "Gender",
  hospital ~ "Hospital",
  ct_blood ~ "CT Values",
  temp     ~ "Temperature (°C)",
  wt_kg    ~ "Weight (kg)",
  ht_cm    ~ "Hieght (cm)"
)

# ── Step 3: Fit multivariable model ──────────────────────────────────────────
logit_model <- glm(
  # Model formula with outcome as dependent variable and predictors
  outcome ~ age + gender + hospital + ct_blood + temp + wt_kg + ht_cm,
  data   = linelist_model,
  family = binomial
)

# ── Step 4: Univariate table ─────────────────────────────────────────────────
tbl_uv <- linelist_model %>%
  tbl_uvregression(
    method       = glm, # Use glm for univariate logistic regression
    y            = outcome, # Specify the outcome variable for univariate models
    method.args  = list(family = binomial), # Specify family for logistic regression
    exponentiate = TRUE, # Exponentiate coefficients to get ORs
    label        = var_labels 
  ) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  italicize_levels()

# ── Step 5: Multivariable table ───────────────────────────────────────────────
tbl_mv <- tbl_regression(
  logit_model,
  exponentiate = TRUE,
  label        = var_labels
) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  italicize_levels()

# ── Step 6: Merge and render for PDF ─────────────────────────────────────────
tbl_merge(
  tbls        = list(tbl_uv, tbl_mv),
  tab_spanner = c("**Univariate**", "**Multivariable**")
) %>%
  # modify_caption("**Univariate and Multivariable Logistic Regression**") %>%
  as_kable_extra(
    booktabs  = TRUE,
    linesep   = "",
    longtable = FALSE            # set TRUE if table spans multiple pages
  ) %>%
  kable_styling(
    latex_options = c(
      "hold_position",           # prevents table from floating away
      "scale_down",              # shrinks table to fit page width
      "striped"                  # alternating row shading
    ),
    stripe_color  = "#f2f2f2",   # light gray stripe
    position      = "center",
    full_width    = TRUE,        # stretch to full page width
    font_size     = 14           # smaller font for PDF
  ) %>%
  column_spec(1, width = "4cm") %>%   # Variable column - wider
  column_spec(2, width = "1.5cm") %>% # N
  column_spec(3, width = "2cm") %>%   # OR (univariate)
  column_spec(4, width = "2.5cm") %>% # 95% CI (univariate)
  column_spec(5, width = "1.5cm") %>% # p-value (univariate)
  column_spec(6, width = "2cm") %>%   # OR (multivariable)
  column_spec(7, width = "2.5cm") %>% # 95% CI (multivariable)
  column_spec(8, width = "1.5cm")     # p-value (multivariable)
Univariate and Multivariable Logistic Regression
Univariate
Multivariable
Characteristic N OR 95% CI p-value OR 95% CI p-value
Age 4,243 1.00 0.99, 1.00 0.6 0.99 0.98, 1.00 0.2
Gender 4,243
f
m 1.01 0.90, 1.14 0.9 1.00 0.87, 1.14 >0.9
Hospital 4,243
Central Hospital
Military Hospital 1.10 0.85, 1.44 0.5 1.11 0.85, 1.45 0.4
Missing 0.97 0.76, 1.24 0.8 0.98 0.76, 1.25 0.8
Other 1.14 0.88, 1.49 0.3 1.14 0.88, 1.49 0.3
Port Hospital 1.09 0.86, 1.39 0.5 1.10 0.86, 1.40 0.4
St. Mark's Maternity Hospital (SMMH) 1.37 1.00, 1.88 0.049 1.37 1.00, 1.88 0.051
CT Values 4,243 1.03 1.00, 1.07 0.066 1.03 1.00, 1.07 0.071
Temperature (°C) 4,243 0.99 0.93, 1.06 0.8 1.00 0.94, 1.06 >0.9
Weight (kg) 4,243 1.00 1.00, 1.00 0.7 1.00 1.00, 1.01 0.3
Hieght (cm) 4,243 1.00 1.00, 1.00 >0.9 1.00 1.00, 1.00 0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio