Descriptive Tables

janitor {}, dyplr{} and gtsummary{}

Author

Raphael Anea

Published

June 5, 2026

Code

# Load Packages
# This chunk load packages
pacman::p_load(
  rio,           # for importing data
  here,          # for relative file paths
  skimr,         # for reviewing the data
  janitor,       # for cleaning data and creating summary
                 # tables (add totals & percents to tables) 
  gt,            # for 
  gtExtras,      # 
  gtsummary,     # for nice tables
  flextable,     # for converting tables to pretty images
  kableExtra,    # for making HTML tables
  scales,        # for percents and date labels
  officer,       # border lines
  RColorBrewer,  # for color palettes
  viridis,       # for color-blind friendly color scales
  tidyverse      # for data management and visualization
)

Code

# Import Clean LineList
linelist <- 
  import(here("data", "clean", "linelist_cleaned.rds"))

1 BACKGROUND

Objective of this output is to demonstrate the use of janitor{}, dyplr{} and gtsummary{} packages to create tables. And flextable{} and kable_extra{} package is used to format presentation ready tables and added aesthetics.

Mainly this project is to achieve how Linear Regression, Logistic Regression and Univariate/Multivariable Regression can be expressed using a descriptive table.

In this model, I NOT providing interpretation of the findings.

I am doing this out of curiosity to explore the R packages for creating Tables in a simple descriptive and advance predictive models, reproducible coding.

It is a revisit to “Introduction to R, from the EPI R HANDBOOK”, provided by APPLIED EPI.

With added enthusiasm of exploring and integration with Artificial Intelligence.

2 Exploratory Analysis

2.1 Viewing the First 10 Datasets

The dataset is pre-cleaned prior importing for this purposes. Thus, this is a exploratory of the “clean_linelist” dataset.

Code

linelist %>% 
  head(10)

   case_id generation date_infection date_onset date_hospitalisation
1   5fe599          4     2014-05-08 2014-05-13           2014-05-15
2   8689b7          4           <NA> 2014-05-13           2014-05-14
3   11f8ea          2           <NA> 2014-05-16           2014-05-18
4   b8812a          3     2014-05-04 2014-05-18           2014-05-20
5   893f25          3     2014-05-18 2014-05-21           2014-05-22
6   be99c8          3     2014-05-03 2014-05-22           2014-05-23
7   07e3e8          4     2014-05-22 2014-05-27           2014-05-29
8   369449          4     2014-05-28 2014-06-02           2014-06-03
9   f393b4          4           <NA> 2014-06-05           2014-06-06
10  1389ca          4           <NA> 2014-06-05           2014-06-07
   date_outcome outcome gender age age_unit age_years age_cat age_cat5
1          <NA>    <NA>      m   2    years         2     0-4      0-4
2    2014-05-18 Recover      f   3    years         3     0-4      0-4
3    2014-05-30 Recover      m  56    years        56   50-69    55-59
4          <NA>    <NA>      f  18    years        18   15-19    15-19
5    2014-05-29 Recover      m   3    years         3     0-4      0-4
6    2014-05-24 Recover      f  16    years        16   15-19    15-19
7    2014-06-01 Recover      f  16    years        16   15-19    15-19
8    2014-06-07   Death      f   0    years         0     0-4      0-4
9    2014-06-18 Recover      m  61    years        61   50-69    60-64
10   2014-06-09   Death      f  27    years        27   20-29    25-29
                               hospital       lon      lat infector source
1                                 Other -13.21574 8.468973   f547d6  other
2                               Missing -13.21523 8.451719     <NA>   <NA>
3  St. Mark's Maternity Hospital (SMMH) -13.21291 8.464817     <NA>   <NA>
4                         Port Hospital -13.23637 8.475476   f90f5f  other
5                     Military Hospital -13.22286 8.460824   11f8ea  other
6                         Port Hospital -13.22263 8.461831   aec8ec  other
7                               Missing -13.23315 8.462729   893f25  other
8                               Missing -13.23210 8.461444   133ee7  other
9                               Missing -13.22255 8.461913     <NA>   <NA>
10                              Missing -13.25722 8.472923     <NA>   <NA>
   wt_kg ht_cm ct_blood fever chills cough aches vomit temp time_admission
1     27    48       22    no     no   yes    no   yes 36.8           <NA>
2     25    59       22  <NA>   <NA>  <NA>  <NA>  <NA> 36.9          09:36
3     91   238       21  <NA>   <NA>  <NA>  <NA>  <NA> 36.9          16:48
4     41   135       23    no     no    no    no    no 36.8          11:22
5     36    71       23    no     no   yes    no   yes 36.9          12:60
6     56   116       21    no     no   yes    no   yes 37.6          14:13
7     47    87       21  <NA>   <NA>  <NA>  <NA>  <NA> 37.3          14:33
8      0    11       22    no     no   yes    no   yes 37.0          09:25
9     86   226       22    no     no   yes    no   yes 36.4          11:16
10    69   174       22    no     no   yes    no    no 35.9          10:55
         bmi days_onset_hosp
1  117.18750               2
2   71.81844               1
3   16.06525               2
4   22.49657               2
5   71.41440               1
6   41.61712               1
7   62.09539               2
8    0.00000               1
9   16.83765               1
10  22.79033               2

2.2 Using skimr() Package

Code

# Gets information about each variable in the datset
skim(linelist)

Data summary
Name	linelist
Number of rows	5888
Number of columns	30
_______________________
Column type frequency:
character	13
Date	4
factor	2
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
case_id	0	1.00	6	6	5888
outcome	1323	0.78	5	7	2
gender	278	0.95	1	1	2
age_unit	0	1.00	5	6	2
hospital	0	1.00	5	36	6
infector	2088	0.65	6	6	2697
source	2088	0.65	5	7	2
fever	249	0.96	2	3	2
chills	249	0.96	2	3	2
cough	249	0.96	2	3	2
aches	249	0.96	2	3	2
vomit	249	0.96	2	3	2
time_admission	765	0.87	5	5	1072

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date_infection	2087	0.65	2014-03-19	2015-04-27	2014-10-11	359
date_onset	256	0.96	2014-04-07	2015-04-30	2014-10-23	367
date_hospitalisation	0	1.00	2014-04-17	2015-04-30	2014-10-23	363
date_outcome	936	0.84	2014-04-19	2015-06-04	2014-11-01	371

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
age_cat	86	0.99	FALSE	8	0-4: 1095, 5-9: 1095, 20-: 1073, 10-: 941
age_cat5	86	0.99	FALSE	17	0-4: 1095, 5-9: 1095, 10-: 941, 15-: 743

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
generation	0	1.00	16.56	5.79	0.00	13.00	16.00	20.00	37.00	▁▆▇▂▁
age	86	0.99	16.07	12.62	0.00	6.00	13.00	23.00	84.00	▇▅▁▁▁
age_years	86	0.99	16.02	12.64	0.00	6.00	13.00	23.00	84.00	▇▅▁▁▁
lon	0	1.00	-13.23	0.02	-13.27	-13.25	-13.23	-13.22	-13.21	▅▃▃▆▇
lat	0	1.00	8.47	0.01	8.45	8.46	8.47	8.48	8.49	▅▇▇▇▆
wt_kg	0	1.00	52.64	18.58	-11.00	41.00	54.00	66.00	111.00	▁▃▇▅▁
ht_cm	0	1.00	124.96	49.52	4.00	91.00	129.00	159.00	295.00	▂▅▇▂▁
ct_blood	0	1.00	21.21	1.69	16.00	20.00	22.00	22.00	26.00	▁▃▇▃▁
temp	149	0.97	38.56	0.98	35.20	38.20	38.80	39.20	40.80	▁▂▂▇▁
bmi	0	1.00	46.89	55.39	-1200.00	24.56	32.12	50.01	1250.00	▁▁▇▁▁
days_onset_hosp	256	0.96	2.06	2.26	0.00	1.00	1.00	3.00	22.00	▇▁▁▁▁

2.3 Summary Statistics

Code

# age years
summary(linelist$age_years)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    6.00   13.00   16.02   23.00   84.00      86

Code

# return only 2nd element
summary(linelist$age_years)[[2]]

[1] 6

3 CREATING TABLES

In this section I demonstrated the use of creating tables using janitor{}, dplyr{} and gtsummary{} packages, as per the sub-headings.

In the gtsummary{}, subheading, i explore the Data Completeness using gt{} package prior exploring prio exploring the Regression Models using gtsummmary{}.

Its table is provided, with reproducible codes.

3.1 USING janitor{} PACKAGE

Simple use of tabyl()

Code

# age_cat <- 
linelist %>% 
  tabyl(age_cat) %>% 
  # Express percentage values as percentages instead of decimals
  mutate(percent = scales::percent(percent)) %>%
  mutate(percent = ifelse(is.na(percent), "0%", percent)) %>%
  mutate(valid_percent = scales::percent(valid_percent)) %>%
  flextable() %>% 
  autofit() %>%
  # Add header title
  add_header_row(values = "Distribution of Cases by Age Category", 
                 colwidths = 4) %>%
  # Change column names
  set_header_labels(age_cat = "Age Category", 
                    n = "Count", 
                    percent = "Percentage",
                    valid_percent = "Valid Percentage") %>%
  
# Add aesthetics to flextable
theme_box() %>% 
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  align(align = "center", part = "header") %>%
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "all") %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output

Distribution of Cases by Age Category
Age Category	Count	Percentage	Valid Percentage
0-4	1,095	18.60%	18.87%
5-9	1,095	18.60%	18.87%
10-14	941	15.98%	16.22%
15-19	743	12.62%	12.81%
20-29	1,073	18.22%	18.49%
30-49	754	12.81%	13.00%
50-69	95	1.61%	1.64%
70+	6	0.10%	0.10%
	86	1.46%

Cross-tabulation

Expressing Age categories by Gender.

Code

linelist %>% 
  tabyl(age_cat,gender) %>% 
  # Add totals for rows and columns
  adorn_totals(where = c("row", "col")) %>%
  flextable() %>% 
  autofit() %>%
  # Theme and aesthetics
  theme_box() %>%
  # Add Header title
  add_header_row(values = "Distribution of Cases by Age Category", 
                 colwidths = 5) %>%
  # Change column names
  set_header_labels(age_cat = "Age Category",
                    f       = "Females",
                    m       = "Males",
                    NA_     = "Unknown") %>% 
# Add aesthetics to flextable
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  align(align = "center", part = "header") %>%
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "all")

Distribution of Cases by Age Category
Age Category	Females	Males	Unknown	Total
0-4	640	416	39	1,095
5-9	641	412	42	1,095
10-14	518	383	40	941
15-19	359	364	20	743
20-29	468	575	30	1,073
30-49	179	557	18	754
50-69	2	91	2	95
70+	0	5	1	6
	0	0	86	86
Total	2,807	2,803	278	5,888

Adorning the tables

Code

linelist %>% 
  tabyl(age_cat) %>% 
  # Add totals for rows and columns
  adorn_totals() %>%
  adorn_pct_formatting() %>% 
  flextable() %>% 
  autofit() %>%
  theme_box() %>% 
  add_header_row(values = "Distribution of Cases by Age Category", 
                 colwidths = 4) %>%
  set_header_labels(age_cat = "Age Category",
                    n = "Count",
                    percent = "Percentage",
                    valid_percent = "Valid Percentage") %>%
# Add aesthetics to flextable   
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black",
                                                         width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  align(align = "center", part = "header") %>%
  # Align Column 2 to 4(percent, valid_percent, n)
  align(align = "right", part = "body", j = 2:4) %>%
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "header")

Distribution of Cases by Age Category
Age Category	Count	Percentage	Valid Percentage
0-4	1,095	18.6%	18.9%
5-9	1,095	18.6%	18.9%
10-14	941	16.0%	16.2%
15-19	743	12.6%	12.8%
20-29	1,073	18.2%	18.5%
30-49	754	12.8%	13.0%
50-69	95	1.6%	1.6%
70+	6	0.1%	0.1%
	86	1.5%	-
Total	5,888	100.0%	100.0%

A cross-tabulation adjusted to capture both count and percentage

Capture Age-categories by gender and express in count and percentages.

Code

linelist %>% 
  tabyl(age_cat, gender) %>% 
  adorn_totals(where = "col") %>%  #
  adorn_percentages(denominator = "col") %>%
  adorn_pct_formatting() %>% 
  adorn_ns(position = "front") %>% 
  # adorn_title(
  #   row_name = "Age Category",
  #   placement = "combined",
  #   col_name = "Gender") %>% 
 flextable() %>% 
  autofit() %>% 
  theme_box() %>%
  add_header_row(values = "Distribution of Cases by Age Category and Gender", 
                 colwidths = 5) %>%  
  set_header_labels(age_cat = "Age Category",
                    f       = "Females",
                    m       = "Males",
                    NA_     = "Unknown") %>%
  # Add aesthetics to flextable
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  align(align = "center", part = "header") %>%
  align(align = "right", part = "body", j = 2:5) %>%
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "header") %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output)

Distribution of Cases by Age Category and Gender
Age Category	Females	Males	Unknown	Total
0-4	640 (22.8%)	416 (14.8%)	39 (14.0%)	1,095 (18.6%)
5-9	641 (22.8%)	412 (14.7%)	42 (15.1%)	1,095 (18.6%)
10-14	518 (18.5%)	383 (13.7%)	40 (14.4%)	941 (16.0%)
15-19	359 (12.8%)	364 (13.0%)	20 (7.2%)	743 (12.6%)
20-29	468 (16.7%)	575 (20.5%)	30 (10.8%)	1,073 (18.2%)
30-49	179 (6.4%)	557 (19.9%)	18 (6.5%)	754 (12.8%)
50-69	2 (0.1%)	91 (3.2%)	2 (0.7%)	95 (1.6%)
70+	0 (0.0%)	5 (0.2%)	1 (0.4%)	6 (0.1%)
	0 (0.0%)	0 (0.0%)	86 (30.9%)	86 (1.5%)

Use of other variables

Code

linelist %>% 
  count(hospital) %>%    # dyplr function
  adorn_totals() %>%     # janitor function
  flextable() %>% 
  autofit()

hospital	n
Central Hospital	454
Military Hospital	896
Missing	1,469
Other	885
Port Hospital	1,762
St. Mark's Maternity Hospital (SMMH)	422
Total	5,888

Code

linelist %>% 
  # Remove NA values from the table
  tabyl(hospital, age_cat) %>%
    adorn_totals() %>%
  adorn_percentages(denominator = "col") %>%
  adorn_pct_formatting() %>%
  adorn_ns(position = "front") %>%
  flextable() %>%
  theme_box() %>%
  add_header_row(values = "Distribution of Cases by Hospital and Age Category( Age in Years) ", 
                 colwidths = 10) %>% 
  # Change column names
  set_header_labels(hospital = "Hospital",
                    `0-4` = "0-4",
                    `5-9` = "5-9",
                    `10-14` = "10-14",
                    `15-19` = "15-19",
                    `20-29` = "20-29",
                    `30-49` = "30-49",
                    `50-69` = "50-69",
                    `70+`   = "70+") %>% 
# Add aesthetics to flextable
  border_outer(part = "all", border = officer::fp_border(color = "black",
                                                         width = 2)) %>%
  border(part = "all", border = officer::fp_border(color = "black",
                                                         width = 1)) %>%
  bg(bg = "lightblue", part = "header") %>%
  bg(bg = "lightgray", part = "body") %>%
  bold(part = "header") %>%
  # italic(part = "body") %>%
  align(align = "center", part = "header") %>%
  align(align = "center", part = "body", j = 2:10) %>% 
  fontsize(size = 12, part = "all") %>%
  color(color = "darkblue", part = "header") %>% 
  autofit() %>%
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output)

Distribution of Cases by Hospital and Age Category( Age in Years)
Hospital	0-4	5-9	10-14	15-19	20-29	30-49	50-69	70+	NA_
Central Hospital	84 (7.7%)	84 (7.7%)	82 (8.7%)	58 (7.8%)	73 (6.8%)	57 (7.6%)	7 (7.4%)	0 (0.0%)	9 (10.5%)
Military Hospital	171 (15.6%)	176 (16.1%)	121 (12.9%)	106 (14.3%)	180 (16.8%)	121 (16.0%)	7 (7.4%)	2 (33.3%)	12 (14.0%)
Missing	281 (25.7%)	268 (24.5%)	219 (23.3%)	196 (26.4%)	275 (25.6%)	177 (23.5%)	22 (23.2%)	3 (50.0%)	28 (32.6%)
Other	163 (14.9%)	154 (14.1%)	162 (17.2%)	104 (14.0%)	167 (15.6%)	104 (13.8%)	19 (20.0%)	0 (0.0%)	12 (14.0%)
Port Hospital	331 (30.2%)	318 (29.0%)	267 (28.4%)	234 (31.5%)	309 (28.8%)	248 (32.9%)	32 (33.7%)	0 (0.0%)	23 (26.7%)
St. Mark's Maternity Hospital (SMMH)	65 (5.9%)	95 (8.7%)	90 (9.6%)	45 (6.1%)	69 (6.4%)	47 (6.2%)	8 (8.4%)	1 (16.7%)	2 (2.3%)
Total	1,095 (100.0%)	1,095 (100.0%)	941 (100.0%)	743 (100.0%)	1,073 (100.0%)	754 (100.0%)	95 (100.0%)	6 (100.0%)	86 (100.0%)

Saving the tabyl

Code

linelist %>% 
  count(hospital) %>% # dyplr function
  adorn_totals() %>%     # janitor function
  flextable() %>% 
  autofit() %>% 
  save_as_docx(path = "tabyl.docx")

Statistics

MORE DETAILS WILL BE NEXT

Code

age_by_outcome <- linelist %>% 
  tabyl(age_cat, outcome, show_na = FALSE)

chisq.test(age_by_outcome)


    Pearson's Chi-squared test

data:  age_by_outcome
X-squared = 6.4931, df = 7, p-value = 0.4835

3.2 USING dyplr{} PACKAGE

Simple Counts in table

If you want to get counts of a variable, you can use the count() function from dplyr. This is similar to tabyl() but does not include percentages or totals.

Code

# Above code can be shortened by count()
linelist %>% 
  count(
    # count number of rows in each age category, package dplyr function
    age_cat) %>%  
  adorn_totals() %>%  # janitor function to add totals row
  flextable() %>%
  autofit()

age_cat	n
0-4	1,095
5-9	1,095
10-14	941
15-19	743
20-29	1,073
30-49	754
50-69	95
70+	6
	86
Total	5,888

Proportion

Calculate the Counts to percentage using scales Package

Code

# Calculate percentage of cases in each age category by dividing the count 
# in each category by the total count and multiplying by 100

linelist %>% 
  count(age_cat) %>% 
  mutate(percent = n / sum(n) * 100)  %>% 
  # round the percent to 2 decimal places
  mutate(percent = round(percent, 2)) %>%
  mutate(percent = scales::percent(percent / 100)) %>%
  adorn_totals() %>% # Janitor function to add totals row
  flextable() %>%
  autofit()

age_cat	n	percent
0-4	1,095	18.60%
5-9	1,095	18.60%
10-14	941	15.98%
15-19	743	12.62%
20-29	1,073	18.22%
30-49	754	12.81%
50-69	95	1.61%
70+	6	0.10%
	86	1.46%
Total	5,888	-

Code

# The above code can be shortened using the scales package to calculate 
# percent directly in the mutate() function

linelist %>% 
  count(age_cat) %>% 
  # use of scales package to calculate percent directly in the mutate() function
  mutate(percent = scales::percent(n / sum(n))) %>% 
  adorn_totals() %>% # Janitor function to add totals row
  flextable() %>%
  autofit()

age_cat	n	percent
0-4	1,095	18.60%
5-9	1,095	18.60%
10-14	941	15.98%
15-19	743	12.62%
20-29	1,073	18.22%
30-49	754	12.81%
50-69	95	1.61%
70+	6	0.10%
	86	1.46%
Total	5,888	-

Calculate proportion within groups

Code

age_by_outcome <-
  linelist %>% 
  group_by(outcome) %>% 
  count(age_cat) %>% 
  mutate(
    # Calculate percentage of cases in each age category within each outcome 
    # group by dividing the count in each category by the total count for that 
    # outcome group and multiplying by 100 
    percent = scales::percent(n / sum(n)))

age_by_outcome

# A tibble: 26 × 4
# Groups:   outcome [3]
   outcome age_cat     n percent
   <chr>   <fct>   <int> <chr>  
 1 Death   0-4       471 18.242%
 2 Death   5-9       476 18.435%
 3 Death   10-14     438 16.964%
 4 Death   15-19     323 12.510%
 5 Death   20-29     477 18.474%
 6 Death   30-49     329 12.742%
 7 Death   50-69      33 1.278% 
 8 Death   70+         3 0.116% 
 9 Death   <NA>       32 1.239% 
10 Recover 0-4       364 18.36% 
# ℹ 16 more rows

Code

# NOTE: This age_by_outcome table is in long format, 
# we can pivot it to wide format for better presentation

Summary Statisitics

Code

summary_table <-
  linelist %>%
  group_by(hospital) %>% 
  summarise(cases = n(),
            delay_max = max(days_onset_hosp, na.rm = TRUE),
            delay_mean = round(mean(days_onset_hosp, na.rm = TRUE), 2),
            delay_sd = round(sd(days_onset_hosp, na.rm = TRUE), 2),
            delay_3  = sum(days_onset_hosp >= 3, na.rm = TRUE),
            pct_delay_3 = scales::percent(delay_3 / cases)) 
# %>% 
#   flextable() %>% 
#   autofit()
            # mean_age = mean(age_years, na.rm = TRUE),
            # median_age = median(age_years, na.rm = TRUE),
            # sd_age = sd(age_years, na.rm = TRUE)) %>%)
summary_table

# A tibble: 6 × 7
  hospital               cases delay_max delay_mean delay_sd delay_3 pct_delay_3
  <chr>                  <int>     <dbl>      <dbl>    <dbl>   <int> <chr>      
1 Central Hospital         454        12       1.89     1.95     108 24%        
2 Military Hospital        896        15       2.12     2.36     253 28%        
3 Missing                 1469        22       2.08     2.29     399 27%        
4 Other                    885        18       2.05     2.23     234 26%        
5 Port Hospital           1762        16       2.05     2.24     470 27%        
6 St. Mark's Maternity …   422        18       2.08     2.33     116 27%

Conditional Statistics

Code

linelist %>% 
  group_by(hospital) %>%
  summarise(
    # max temp among those with fever
    max_tem_fever = max(temp[fever == "yes"], na.rm = TRUE),
    # max temp among those without fever
    max_tem_no_fever = max(temp[fever == "no"], na.rm = TRUE) 
    ) %>% 
  flextable() %>%
  autofit()

hospital	max_tem_fever	max_tem_no_fever
Central Hospital	40.4	38.0
Military Hospital	40.5	38.0
Missing	40.6	38.0
Other	40.8	37.9
Port Hospital	40.6	38.0
St. Mark's Maternity Hospital (SMMH)	40.6	37.9

GLUEING TOGETHER str_glue() function from stringr package

Code

summary_table %>% 
  mutate(
    # Combined and format other values in single column
    delay = str_glue("{delay_mean}({delay_sd})")) %>%
  select(-c(delay_mean, delay_sd)) %>%
  select(                      
    # Order and rename cols, package dplyr function
    "Hospital Names" = hospital,
    "Cases"          = cases,
    "Max delay"      = delay_max,
    "Mean (SD)"      = delay,
    "Delay 3+ Days"   = delay_3,
    "% Delay 3+ Days" = pct_delay_3
    
  ) %>% 
  flextable() %>% 
  autofit() %>% 
  theme_box() %>% 
  add_header_row(values = "Summary of Cases and Delays by Hospital", 
                 colwidths = 6) %>%
  # Add aesthetics to flextable
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black",
                                                         width = 1)) %>%
  bg(bg = "lightblue", part = "header") %>%
  bg(bg = "lightgray", part = "body") %>%
  bold(part = "header") %>%
  italic(part = "body") %>%
  align(align = "center", part = "header") %>%
  align(align = "center", part = "body", j = 3:6) %>%
  fontsize(size = 12, part = "all") %>%
  color(color = "darkblue", part = "header") %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output)

Summary of Cases and Delays by Hospital
Hospital Names	Cases	Max delay	Mean (SD)	Delay 3+ Days	% Delay 3+ Days
Central Hospital	454	12	1.89(1.95)	108	24%
Military Hospital	896	15	2.12(2.36)	253	28%
Missing	1,469	22	2.08(2.29)	399	27%
Other	885	18	2.05(2.23)	234	26%
Port Hospital	1,762	16	2.05(2.24)	470	27%
St. Mark's Maternity Hospital (SMMH)	422	18	2.08(2.33)	116	27%

Percentiles

Code

# Get default percentile values of age (0%, 25%, 50%, 75%, 100%)
linelist %>% 
  summarise(age_percentile = quantile(age_years, na.rm = TRUE))

  age_percentile
1              0
2              6
3             13
4             23
5             84

Code

# Get manually-specified percentil values of age (0%, 25%, 50%, 75%, 100%)
linelist %>% 
  summarise(age_percentile = quantile(age_years, 
                                      probs = c(.05, 0.5, 0.75, 0.98),
                                      na.rm = TRUE)) %>% 
  flextable() %>% 
  autofit()

age_percentile
1
13
23
48

Percentile Group by hospital

Code

linelist %>% 
  group_by(hospital) %>% 
  summarise(
    p05 = quantile(age_years, probs = 0.05, na.rm = TRUE),
    p50 = quantile(age_years, probs = 0.5, na.rm = TRUE),
    p75 = quantile(age_years, probs = 0.75, na.rm = TRUE),
    p98 = quantile(age_years, probs = 0.98, na.rm = TRUE),
    
  ) %>% 
  flextable() %>% 
  autofit()

hospital	p05	p50	p75	p98
Central Hospital	1	12	21	48.00
Military Hospital	1	13	24	45.00
Missing	1	13	23	48.20
Other	1	13	23	50.00
Port Hospital	1	14	24	49.00
St. Mark's Maternity Hospital (SMMH)	2	12	22	50.24

The dplyr summarise() funtion can be produce with get_summary_stat() function from the statix package.

Code

linelist %>% 
  # Group by hospital, package dplyr function
  group_by(hospital) %>% 
  # Package statix function to get quantiles of age variable
  rstatix::get_summary_stats(age, type = "quantile") %>% 
  # remove variable column and reorder columns
  select(-variable) %>%
  
  # Passing the table to flextable pacakge 
  # to add header rows and aesthetics to the summary table
  flextable() %>% 
  # Step 1: Rename columns first
  set_header_labels(
    hospital = "Hospital",
    variable = "Variable",
    n        = "N",
    p0       = "0%",
    p25      = "25%",
    p50      = "50%",
    p75      = "75%",
    p100     = "100%"
  ) %>%
  # Step 2: Add sub-header grouping row (Percentiles span)
  add_header_row(
    values    = c(" Hospital", "N", "Percentiles"),
    colwidths = c(1, 1, 5),    # hospital | variable | n | 5 percentile cols
    top       = TRUE            # place BELOW the renamed labels row
  ) %>%
  # Step 3: Add main title row at the very top
  add_header_row(
    values    = "Summary of Age Percentiles by Hospital",
    colwidths = 7,    # span all 7 columns
    top       = TRUE                   # place ABOVE everything
  ) %>%
  # Step 4: Add Aesthetics
  theme_box() %>%
  autofit() %>% 
  # Merge Header rows for Hospital and N columns (no sub-columns needed)
  merge_at(part = "header",
           j    = 1, # hospital and n columns
           i    = 2:3) %>% # merge first two header rows
  merge_at(part = "header",
           j    = 2, # variable column
           i    = 2:3) %>% # merge first two header rows
  
  align(align = "center", part = "header") %>%
  align(j = 3:7, align = "center", part = "body") %>%
  bold(part = "header") %>%
  bg(part = "header", bg = "#2C3E50") %>%
  color(part = "header", color = "white") %>%
  fontsize(i = 1, part = "header", size = 13) %>%   # main title font size
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>%
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output_

Summary of Age Percentiles by Hospital
Hospital	N	Percentiles
Hospital	N	0%	25%	50%	75%	100%
Central Hospital	445	0	6	12	21	58
Military Hospital	884	0	6	14	24	72
Missing	1,441	0	6	13	23	76
Other	873	0	6	13	23	69
Port Hospital	1,739	0	6	14	24	68
St. Mark's Maternity Hospital (SMMH)	420	0	7	12	22	84

SUMMARISE AGGREGATE DATA

Code

linelist_arg <- 
linelist %>% 
  # package dplyr function to drop NA values in gender & outcome variables
  drop_na(gender, outcome) %>% 
  count(outcome, gender) #%>% 
  
linelist_arg

  outcome gender    n
1   Death      f 1227
2   Death      m 1228
3 Recover      f  953
4 Recover      m  950

Code

linelist_arg %>% 
  # Part 1: Use dplyr to summarise counts by outcome and gender
  group_by(outcome) %>% 
  summarise( # summarise counts
    total_cases  = sum(n, na.rm = TRUE),
    male_cases   = sum(n[gender == "m"], na.rm = TRUE),
    female_cases = sum(n[gender == "f"], na.rm = TRUE)
  ) %>% 
  adorn_totals() %>%
  
  # Part 2: Passing the table to flextable column Rename and add aesthetics 
  # to the summary table
  flextable() %>%
  theme_box() %>%
  # Step 1: Rename columns
  set_header_labels(
    outcome      = "Outcome",
    total_cases  = "Total Cases",
    male_cases   = "Males",
    female_cases = "Females"
  ) %>%
  # Step 2: Add sub-header row grouping Gender columns
  add_header_row(
    values    = c("", "Total Cases", "Gender"),
    colwidths = c(1, 1, 2),
    top       = TRUE
  ) %>%
  # Step 3: Merge first two rows of Outcome column (no sub-column needed)
  merge_at(part = "header",
           j    = 1,
           i    = 1:2) %>%
  # Step 4: Restore "Outcome" label lost after merge
  flextable::compose(
    part = "header",
    i    = 1,          # merged cell sits at row 1 after merge
    j    = 1,
    value = as_paragraph("Outcome")
  ) %>%
  # Step 5: Also merge "Total Cases" rows so it spans both header rows
  merge_at(part = "header",
           j    = 2,
           i    = 1:2) %>%
  # Step 6: Add main title at the very top
  add_header_row(
    values    = "Summary of Cases by Outcome and Gender",
    colwidths = 4,
    top       = TRUE
  ) %>%
  # Step 7: Add aesthetics to the table
  align(align = "center", part = "header") %>%
  align(align = "center", part = "body", j = 2:4) %>%
  bold(part  = "header") %>%
  bold(part  = "body", i = 3) %>%
  bg(part = "header", bg = "#2C3E50") %>%
  color(part = "header", color = "white") %>%
  # font size first header & Second header only
   fontsize(i = 1, part = "header", size = 13) %>%
   fontsize(i = 2, part = "header", size = 11) %>%
  border_outer(part = "all", border = officer::fp_border(color = "black",
                                                         width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black",
                                                         width = 1)) %>%
  autofit()

Summary of Cases by Outcome and Gender
Outcome	Total Cases	Gender
Outcome	Total Cases	Males	Females
Death	2,455	1,228	1,227
Recover	1,903	950	953
Total	4,358	2,178	2,180

Multiple Columns using across()

across() function from dplyr can be used to apply a function to multiple columns at once. This is useful when you want to calculate summary statistics for multiple groups or categories in your data.

Code

linelist %>% 
  group_by(outcome,
           # show NA values in outcome variable as "Unknown" instead of
           outcome = ifelse(is.na(outcome), "Unknown", outcome)
           ) %>%
  # Create column using across. Summarise across by mean() and sd()
  summarise(across(.cols = c(age_years, temp, wt_kg,  ht_cm),
                   .fns = list("mean" = mean, "sd" = sd),
                   na.rm = TRUE)) %>%
  # round values to 2 decimal places
  mutate(across(.cols = c(age_years_mean,
                          age_years_sd,
                          temp_mean, 
                          temp_sd,
                          wt_kg_mean,
                          wt_kg_sd,
                          ht_cm_mean,
                          ht_cm_sd),
                .fns = ~ round(.x, 2))) %>%
  
  mutate(
    # Combined and format other values in single column
    temp_mean_sd    = str_glue("{temp_mean}({temp_sd})"),
    age_yrs_mean_sd = str_glue("{age_years_mean}({age_years_sd})"),
    wt_mean_sd      = str_glue("{wt_kg_mean}({wt_kg_sd})"),
    ht_cm_mean_sd   = str_glue("{ht_cm_mean}({ht_cm_sd}")) %>%
  
  select(-c(temp_mean,
            temp_sd,
            age_years_mean,
            age_years_sd,
            wt_kg_mean,
            wt_kg_sd,
            ht_cm_mean,
            ht_cm_sd)) %>%
flextable() %>%
  theme_box() %>%
  set_header_labels(outcome = "Outcome",
                    temp_mean_sd = "Temeprature",
                    age_yrs_mean_sd = "Age(Years)",
                    wt_mean_sd = "Weigth (kg)",
                    ht_cm_mean_sd = "Height (cm)") %>%
  add_header_row(
    values    = c("", "Mean (SD)"),
    colwidths = c(1, 4),
    top       = FALSE ) %>%
  merge_at(part = "header",
           j = 1,
           i = 1:2) %>%
  add_header_row(
    values    = "Summary of Clinical Characteristics by Outcome",
    colwidths = 5,
    top       = TRUE ) %>%
   border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                          width = 2)) %>%
   border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                          width = 1)) %>%
   bg(part = "header", bg = "#2C3E50") %>%
   color(part = "header", color = "white") %>%
   fontsize(i = 1, part = "header", size = 13) %>%
   fontsize(i = 2, part = "header", size = 11) %>%
   fontsize(i = 3, part = "header", size = 10) %>%
  align(align = "center", part = "header") %>%
  align(j = 2:5, align = "right", part = "body") %>%
  bold(part = "header") %>%
  italic(part = "header",
         i = 3) %>%
  autofit()

Summary of Clinical Characteristics by Outcome
Outcome	Temeprature	Age(Years)	Weigth (kg)	Height (cm)
Outcome	Mean (SD)
Death	38.56(0.96)	15.88(12.31)	52.59(18.4)	124.86(48.67
Recover	38.56(1)	16.06(12.96)	52.48(18.64)	124.98(50.06
Unknown	38.55(0.98)	16.24(12.81)	53.01(18.87)	125.13(50.38

Pivot wider

Piwot wider() function from tidyr package can be used to reshape data from long format to wide format. This is useful when you want to create a summary table that shows the distribution of a variable across different groups or categories.

Code

age_by_outcome %>% 
  select(-percent) %>%
  pivot_wider(names_from = age_cat, values_from = n) %>% 
  # Rename NA values in outcome variable to "Unknown"
  mutate(outcome = ifelse(is.na(outcome), "Unknown", outcome)) %>%
  flextable() %>%
  theme_box() %>%
  set_header_labels(outcome = "Outcome",
                    `NA` = "Unknown",
                    `0-4` = "0 to 4 years",
                    `5-9` = "5 to 9 years",
                    `10-14` = "10 to 14 years",
                    `15-19` = "15 to 19 years",
                    `20-29` = "20 to 29 years",
                    `30-49` = "30 to 49 years",
                    `50-69` = "50 to 69 years",
                    `70+`   = "70+ years") %>%
    add_header_row(
    values    = c("Outcomes", "Age Categories"),
    colwidths = c(1, 9),
    top       = T ) %>%
  add_header_row(
    values    = "Distribution of Cases by Age Category and Outcome",
    colwidths = 10,
    top       = TRUE ) %>%
  merge_at(part = "header",
           j = 1,
           i = 2:3) %>%
  border_outer(part = "all", border = officer::fp_border(color = "black", width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black", width = 1)) %>%
  bg(part = "header", bg = "#2C3E50") %>%
  color(part = "header", color = "white") %>%
  fontsize(i = 1, part = "header", size = 13) %>% 
  fontsize(i = 2, part = "header", size = 11) %>%
  fontsize(i = 3, part = "header", size = 10) %>%
  align(align = "center", part = "header") %>%
  align(j = 2:10, align = "center", part = "body") %>% 
  bold(part = "header") %>%
  italic(part = "header",
         i = 2) %>%
  autofit() %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output

Distribution of Cases by Age Category and Outcome
Outcomes	Age Categories
Outcomes	0 to 4 years	5 to 9 years	10 to 14 years	15 to 19 years	20 to 29 years	30 to 49 years	50 to 69 years	70+ years	Unknown
Death	471	476	438	323	477	329	33	3	32
Recover	364	391	303	251	367	238	38	3	28
Unknown	260	228	200	169	229	187	24		26

TOTAL ROWS

janitor package adorn_totals() function can be used to add a total row to a summary table. This is useful for providing a summary of the overall data in addition to the breakdown by groups.

Code

linelist %>% 
  group_by(gender) %>%
  summarise(
    # Number of rows in group where outcome is not missing
    known_outcome = sum(!is.na(outcome)), 
    # Number of rows in group where outcome is Death
    n_death  = sum(outcome == "Death", na.rm=T),
    # Number of rows in group where outcome is Recovered
    n_recover = sum(outcome == "Recover", na.rm=T), 
  ) %>% 
  # Recode gender values before adorning
  mutate(
    gender = case_match(
      gender,
      "f" ~ "Female",
      "m" ~ "Male",
      .default = gender,       # keeps NA or any other value as-is
      # Raname NA values to "Unknown"
      NA ~ "Unknown"
    )
  ) %>%
  adorn_totals() %>%            # Adorn total row (sums of each numeric column)
  adorn_percentages("col") %>%  # Get column proportions
  adorn_pct_formatting() %>%    # Convert proportions to percents
  adorn_ns(position = "front") %>%# display % and counts (with counts in front)
  flextable() %>%
  autofit() %>% 

# Add aesthetics to flextable
theme_box() %>% 
  border_outer(part = "all", border = officer::fp_border(color = "black", width = 2)) %>% 
  border_inner(part = "all", border = officer::fp_border(color = "black", width = 1)) %>% 
  bg(bg = "lightblue", part = "header") %>% 
  bg(bg = "lightgray", part = "body") %>% 
  bold(part = "header") %>% 
  italic(part = "body") %>% 
  fontsize(size = 12, part = "all") %>% 
  color(color = "darkblue", part = "all") %>% 
  # Add header rows
  add_header_row(values = "Summary of Outcomes by Gender", colwidths = 4) %>% 
  set_header_labels(gender = "Gender",
                    known_outcome = "Known Outcome",
                    n_death = "Deaths",
                    n_recover = "Recovered") %>%
  align(align = "center", part = "header") %>%
align(align = "right", part = "body", j = 2:4)

Summary of Outcomes by Gender
Gender	Known Outcome	Deaths	Recovered
Female	2,180 (47.8%)	1,227 (47.5%)	953 (48.1%)
Male	2,178 (47.7%)	1,228 (47.6%)	950 (47.9%)
Unknown	207 (4.5%)	127 (4.9%)	80 (4.0%)
Total	4,565 (100.0%)	2,582 (100.0%)	1,983 (100.0%)

summarise() on “Total” data and then bind_rows()

Group by hospital and outcome, then get mean ct_blood value for each

Code

by_hospital <- 
  linelist %>% 
  # Remove cases with NA or missing outcome and hospital values
  filter(!is.na(outcome) & hospital != "Missing") %>%
  group_by(hospital, outcome) %>%
  summarise(
    N  = n(),
    ct_value = mean(ct_blood, na.rm = TRUE)
  )

by_hospital

# A tibble: 10 × 4
# Groups:   hospital [5]
   hospital                             outcome     N ct_value
   <chr>                                <chr>   <int>    <dbl>
 1 Central Hospital                     Death     193     21.3
 2 Central Hospital                     Recover   165     21.4
 3 Military Hospital                    Death     399     21.2
 4 Military Hospital                    Recover   309     21.1
 5 Other                                Death     395     21.5
 6 Other                                Recover   290     21.2
 7 Port Hospital                        Death     785     21.2
 8 Port Hospital                        Recover   579     21.2
 9 St. Mark's Maternity Hospital (SMMH) Death     199     21.3
10 St. Mark's Maternity Hospital (SMMH) Recover   126     21.1

Groups totals by outcome only, ignoring hospital groups, to get overall mean ct_blood value for each outcome category

Code

totals <- 
  linelist %>% 
  # Remove cases with NA or missing outcome and hospital values
  filter(!is.na(outcome) & hospital != "Missing") %>%
  group_by(outcome) %>%
  summarise(
    N  = n(),
    ct_value = mean(ct_blood, na.rm = TRUE)
  )

totals

# A tibble: 2 × 3
  outcome     N ct_value
  <chr>   <int>    <dbl>
1 Death    1971     21.3
2 Recover  1469     21.2

Bind the by_hospital and totals tables together using bind_rows()

Code

# Bind the by_hospital and totals tables together using bind_rows() function 
# from dplyr package
table_long <- bind_rows(by_hospital, totals) %>% 
  mutate(
    # Replace NA values in hospital column with "Total" to indicate these rows
    # are the overall totals for each outcome category
    hospital = replace_na(hospital, "Total")) 

table_long

# A tibble: 12 × 4
# Groups:   hospital [6]
   hospital                             outcome     N ct_value
   <chr>                                <chr>   <int>    <dbl>
 1 Central Hospital                     Death     193     21.3
 2 Central Hospital                     Recover   165     21.4
 3 Military Hospital                    Death     399     21.2
 4 Military Hospital                    Recover   309     21.1
 5 Other                                Death     395     21.5
 6 Other                                Recover   290     21.2
 7 Port Hospital                        Death     785     21.2
 8 Port Hospital                        Recover   579     21.2
 9 St. Mark's Maternity Hospital (SMMH) Death     199     21.3
10 St. Mark's Maternity Hospital (SMMH) Recover   126     21.1
11 Total                                Death    1971     21.3
12 Total                                Recover  1469     21.2

Pivot Table_long to wider for presentation

Code

table_long %>% 
  # Pivot wider and format
  ##################
mutate(hopital = replace_na(hospital, "Total")) %>%
  # Pivot wider to create separate columns for each outcome category, 
  # with values from N and ct_value columns
  pivot_wider(names_from = outcome,values_from = c(N, ct_value)) %>%  
  mutate(
    # Calculate total N for known outcomes by summing Death and Recover counts
    N_Known = N_Death + N_Recover,  
    # Calculate percentage of Deaths among known outcomes
    Pct_Death = scales::percent(N_Death / N_Known, 0.1),
    # Calculate percentage of Recoveries among known outcomes
    Pct_Recover = scales::percent(N_Recover / N_Known, 0.1)) %>% 
  select(hospital, N_Known, 
         N_Recover, Pct_Recover, ct_value_Recover,
         N_Death, Pct_Death, ct_value_Death) %>% 
    arrange(N_Known) %>% # arrange rows by total N for known outcomes
  flextable() %>%
  theme_box() %>%
    add_header_row(
    values    = c("Hospital", "Known Outcomes", "Recoveries", "Deaths"),
    colwidths = c(1, 1, 3, 3),
    top       = TRUE ) %>%
  # Add aesthetics to the table
  set_header_labels(hospital = "Hospital",
                    N_Known = "Known Outcomes",
                    N_Recover = "Total",
                    Pct_Recover = "% of Cases",
                    ct_value_Recover = "Mean Ct Value",
                    N_Death = "Total",
                    Pct_Death = "% of Cases",
                    st_value_Death = "Mean Ct Value") %>%
  add_header_row(
    values    = "Summary of Mean Ct Values by Hospital and Outcome",
    colwidths = 8,
    top       = TRUE ) %>%
  border_outer(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 2)) %>%
  border_inner(part = "all", border = officer::fp_border(color = "black", 
                                                         width = 1)) %>%
  bg(part = "header", bg = "#2C3E50") %>%
  color(part = "header", color = "white") %>%
  fontsize(i = 1, part = "header", size = 13) %>% 
  fontsize(i = 2, part = "header", size = 11) %>%
  align(align = "center", part = "header") %>%
  align(j = 2:8, align = "center", part = "body") %>%
  bold(part = "header") %>% 
  merge_at(part = "header", j = 2, i = 2:3) %>% # Merge Known Outcomes header
  merge_at(part = "header", j = 1, i = 2:3) %>% # Merge Recoveries header)
  # italic(part = "body",
  #        i = nrow(table_long)) %>% # italicize total row
  autofit() %>% 
  fit_to_width(max_width = 6.5) # fit table to page width in PDF output)

Summary of Mean Ct Values by Hospital and Outcome
Hospital	Known Outcomes	Recoveries			Deaths
Hospital	Known Outcomes	Total	% of Cases	Mean Ct Value	Total	% of Cases	ct_value_Death
St. Mark's Maternity Hospital (SMMH)	325	126	38.8%	21.14286	199	61.2%	21.31156
Central Hospital	358	165	46.1%	21.41818	193	53.9%	21.25389
Other	685	290	42.3%	21.15172	395	57.7%	21.45316
Military Hospital	708	309	43.6%	21.09385	399	56.4%	21.23308
Port Hospital	1,364	579	42.4%	21.17444	785	57.6%	21.20127
Total	3,440	1,469	42.7%	21.17767	1,971	57.3%	21.27448

3.3 USING gtsummary{} PACKAGE

The default summary table from gtsummary package, with function tbl_summary() to get summary statistics for all variables in the dataset.

The function prints statistics appropriate to the column class: median and interquartile range (ICR) for numeric columns and count (%) for the categorical columns. Missing values are converted into “Unknown”. Footnotes are added to the bottom to explain the statistics, while the total N is shown at the top

Code

linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>% 
  tbl_summary() %>% 
  # modify_caption("**Summary of Clinical Characteristics**") %>% 
  as_kable_extra(
    booktabs  = TRUE, # Use booktabs style for better formatting in PDF output
    linesep   = "", # Remove extra lines between rows for a cleaner look
    longtable = FALSE # Disable longtable for better control over table formatting in PDF output
  )

Table 1: Summary of Clinical Characterisitcs

Characteristic	N = 5,888
age_years	13 (6, 23)
Unknown	86
gender
f	2,807 (50%)
m	2,803 (50%)
Unknown	278
outcome
Death	2,582 (57%)
Recover	1,983 (43%)
Unknown	1,323
fever	4,549 (81%)
Unknown	249
temp	38.80 (38.20, 39.20)
Unknown	149
hospital
Central Hospital	454 (7.7%)
Military Hospital	896 (15%)
Missing	1,469 (25%)
Other	885 (15%)
Port Hospital	1,762 (30%)
St. Mark's Maternity Hospital (SMMH)	422 (7.2%)
¹ Median (Q1, Q3); n (%)

Adjustments to capture specific statistics for numeric and categorical variables

For Categorical variables, we can specify the statistic to be count and percentage using the argument statistic = list(all_categorical() ~ “{n} ({p}%)”). For numeric variables, we can specify the statistic to be mean and standard deviation using the argument statistic = list(all_numeric() ~ “{mean} ({sd})”).

Code

linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>% 
  tbl_summary(by = outcome # stratify summary statistics by outcome variable
              ) %>%  # For numeric variables, show mean and standard deviation)
  # modify_caption("**Summary of Clinical Characteristics by Outcome**") %>% 
  as_kable_extra(
    booktabs  = TRUE, # Use booktabs style for better formatting in PDF output
    linesep   = "", # Remove extra lines between rows for a cleaner look
    longtable = FALSE # Disable longtable for better control over table formatting in PDF output
  )

Table 2: Summary of Clinical Characterisits by Outcome

Characteristic	Death N = 2,582	Recover N = 1,983
age_years	13 (6, 23)	13 (6, 23)
Unknown	32	28
gender
f	1,227 (50%)	953 (50%)
m	1,228 (50%)	950 (50%)
Unknown	127	80
fever	2,002 (81%)	1,543 (81%)
Unknown	122	79
temp	38.80 (38.20, 39.20)	38.80 (38.20, 39.25)
Unknown	60	55
hospital
Central Hospital	193 (7.5%)	165 (8.3%)
Military Hospital	399 (15%)	309 (16%)
Missing	611 (24%)	514 (26%)
Other	395 (15%)	290 (15%)
Port Hospital	785 (30%)	579 (29%)
St. Mark's Maternity Hospital (SMMH)	199 (7.7%)	126 (6.4%)
¹ Median (Q1, Q3); n (%)

Code

linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>% 
  tbl_summary(
    # stratify summary statistics by outcome variable
    by = outcome, 
    # For numeric variables, show mean and standard deviation
    statistic = list(all_continuous() ~ "{mean} ({sd})", 
                all_categorical() ~ "{n} / {N} ({p}%)"),
    label = list(
      outcome = "Patient Outcome",
      age_years = "Age (Years)",
      gender = "Gender",
      fever = "Fever",
      temp = "Temperature (°C)",
      hospital = "Hospital", 
      # Replace NA values with "Missing" in the summary table
      missing_text = "Missing" 
      
    )) %>% 
   # modify_caption("**Summary of Clinical Characteristics by Outcome**") %>% 
  as_kable_extra(
    booktabs  = TRUE, # Use booktabs style for better formatting in PDF output
    linesep   = "", # Remove extra lines between rows for a cleaner look
    longtable = FALSE # Disable longtable for better control over table formatting in PDF output
  )

Table 3: Summary of Clinical Characteristics by Outcome

Characteristic	Death N = 2,582	Recover N = 1,983
Age (Years)	16 (12)	16 (13)
Unknown	32	28
Gender
f	1,227 / 2,455 (50%)	953 / 1,903 (50%)
m	1,228 / 2,455 (50%)	950 / 1,903 (50%)
Unknown	127	80
Fever	2,002 / 2,460 (81%)	1,543 / 1,904 (81%)
Unknown	122	79
Temperature (°C)	38.56 (0.96)	38.56 (1.00)
Unknown	60	55
Hospital
Central Hospital	193 / 2,582 (7.5%)	165 / 1,983 (8.3%)
Military Hospital	399 / 2,582 (15%)	309 / 1,983 (16%)
Missing	611 / 2,582 (24%)	514 / 1,983 (26%)
Other	395 / 2,582 (15%)	290 / 1,983 (15%)
Port Hospital	785 / 2,582 (30%)	579 / 1,983 (29%)
St. Mark's Maternity Hospital (SMMH)	199 / 2,582 (7.7%)	126 / 1,983 (6.4%)
¹ Mean (SD); n / N (%)

Multi-line stats for Continuos variables

Code

linelist %>% 
  select(age_years, temp, ct_blood, wt_kg, outcome) %>% 
  tbl_summary(
    by = outcome, # stratify summary statistics by outcome variable
    type = all_continuous() ~ "continuous2", # Use continuous2 type for multi-line stats
    statistic = list(all_continuous() ~ c("{mean} ({sd})", 
                                           "{median} ({p25}, {p75})",
                                          "{min}, {max}")),
    label = list(
      age_years = "Age (Years)",
      temp = "Temperature (°C)",
      ct_blood = "Ct Value",
      wt_kg = "Weight (kg)"
    )) %>% 
  add_p() %>% # Add p-values for comparisons across outcome groups
  modify_header(label = "**Variable**") %>% # Bold variable names in header
  # modify_caption(
    # "**Summary of Clinical Characteristics by Outcome with Multi-line Stats**") %>% 
  as_kable_extra(
    booktabs  = TRUE, # Use booktabs style for better formatting in PDF output
    linesep   = "", # Remove extra lines between rows for a cleaner look
    longtable = FALSE # Disable longtable for better control over table formatting in PDF output
  )

Table 4: Summary of Clinical Characteristics by Outcome with Multi-line Stats

Variable	Death N = 2,582	Recover N = 1,983	p-value
Age (Years)			0.8
Mean (SD)	16 (12)	16 (13)
Median (Q1, Q3)	13 (6, 23)	13 (6, 23)
Min, Max	0, 76	0, 84
Unknown	32	28
Temperature (°C)			0.6
Mean (SD)	38.56 (0.96)	38.56 (1.00)
Median (Q1, Q3)	38.80 (38.20, 39.20)	38.80 (38.20, 39.25)
Min, Max	35.30, 40.60	35.20, 40.70
Unknown	60	55
Ct Value			0.044
Mean (SD)	21.26 (1.65)	21.14 (1.74)
Median (Q1, Q3)	22.00 (20.00, 22.00)	21.00 (20.00, 22.00)
Min, Max	16.00, 26.00	16.00, 25.00
Weight (kg)			0.7
Mean (SD)	53 (18)	52 (19)
Median (Q1, Q3)	54 (42, 66)	54 (40, 65)
Min, Max	-11, 106	-10, 111
¹ Wilcoxon rank sum test

3.4 REGRESSION MODELS Using gtsummary package

This section demonstrates how to use the gtsummary package to create summary tables for regression models, including linear regression, logistic regression, and Uni-variate and Multivariable regression models. The package provides functions to easily extract and format model results, such as coefficients, confidence intervals, p-values, and model fit statistics.

This section also explores how to customize the appearance of regression summary tables using gtsummary’s formatting options, as well as how to use kableExtra for enhanced table styling in PDF outputs.

3.4.1 Exploratory Data Completeness USE of gt{} Package

Code

# ── Build missing data summary ────────────────────────────────────────────────
missing_summary <- linelist %>%
  select(-case_id, -generation, - age_unit, -age ) %>% 
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Missing_n") %>%
  # Group
  mutate(variable_cat = case_when(
    Variable %in% c(
      "outcome",
      "infector",
      "source")
      ~ "OUTCOME & IDENTIFIERS",
    Variable %in% c(
      "date_infection",
      "date_outcome",
      "date_onset",
      "date_hospitalisation",
      "time_admission",
      "days_onset_hosp")
      ~ "DATES",
    Variable %in% c(
      "gender",
      "age_years",
      "age_cat",
      "age_cat5",
      "hospital")
      ~ "DEMOGRAPHICS",
    Variable %in% c(
      "temp",
      "fever",
      "chills",
      "cough",
      "aches",
      "vomit")
      ~ "CLINICAL",
    Variable %in% c(
      "ct_blood",
      "wt_kg",
      "ht_cm",
      "bmi",
      "lon",
      "lat")
      ~ "MEASUREMENTS (COMPLETE)"
    )
    ) %>% 
  mutate(
    Total         = nrow(linelist),
    Complete_n    = Total - Missing_n,
    Complete_rate = Complete_n / Total,
    Missing_pct   = Missing_n / Total, # * 100,
    Concern = case_when(
      Complete_rate == 1.00          ~ "✅ Complete",
      Complete_rate >= 0.90          ~ "🟢 Low",
      Complete_rate >= 0.70          ~ "🟡 Moderate",
      TRUE                           ~ "🔴 High"
    )
  ) %>%
  
  arrange(Complete_rate)

# ── Render with gt ────────────────────────────────────────────────────────────
missing_summary %>%
  select(variable_cat, Variable, Missing_n, Complete_n, Complete_rate, Missing_pct, Concern) %>%
  group_by(variable_cat) %>% 
  gt() %>%
  tab_header(
    title    = md("**Missing Data Summary**"),
    subtitle = md(glue::glue("*linelist* — {nrow(linelist)} total observations"))
  ) %>%
  cols_label(
    Variable      = "Variable",
    Missing_n     = "Missing (n)",
    Complete_n    = "Complete (n)",
    Complete_rate = "Complete Rate",
    Missing_pct   = "Missing (%)",
    Concern       = "Concern Level"
  ) %>%
  fmt_number(columns = c(Missing_n, Complete_n), decimals = 0) %>%
  fmt_percent(columns = c(Complete_rate,Missing_pct), decimals = 2) %>% 
  # fmt_percent(columns = Complete_rate, decimals = 1) %>%
  # fmt_number(columns = Missing_pct, decimals = 1, suffix = "%") %>%
  # Colour-code the complete rate column
  data_color(
    columns = Complete_rate,
    method  = "numeric",
    palette = c("#E24B4A", "#EF9F27", "#639922"),
    domain  = c(0, 1)
  ) %>%
  # # Add an inline bar using gtExtra
  # gt_plt_bar_pct(
  #   column     = Complete_rate,
  #   scaled     = T,
  #   fill       = "steelblue",
  #   # background = "#f2f2f2"
  # ) %>%
  cols_label(Complete_rate = "Completeness") %>%
  # opt_row_striping()
  # Bold rows with high missingness
  tab_style(
    style     = cell_text(weight = "bold"),
    locations = cells_body(rows = Complete_rate < 0.70)
  ) %>%
  # Stripe rows
  opt_row_striping() %>%
  tab_options(
    table.font.size        = 13,
    heading.align          = "left",
    column_labels.font.weight = "bold"
  ) %>%
  
  tab_footnote(
    footnote  = "High missingness (> 70%) may affect analysis validity.",
    locations = cells_column_labels(columns = Concern)
  )

Variable	Missing (n)	Complete (n)	Completeness	Missing (%)	Concern Level¹
Missing Data Summary
linelist — 5888 total observations
OUTCOME & IDENTIFIERS
infector	2,088	3,800	64.54%	35.46%	🔴 High
source	2,088	3,800	64.54%	35.46%	🔴 High
outcome	1,323	4,565	77.53%	22.47%	🟡 Moderate
DATES
date_infection	2,087	3,801	64.56%	35.44%	🔴 High
date_outcome	936	4,952	84.10%	15.90%	🟡 Moderate
time_admission	765	5,123	87.01%	12.99%	🟡 Moderate
date_onset	256	5,632	95.65%	4.35%	🟢 Low
days_onset_hosp	256	5,632	95.65%	4.35%	🟢 Low
date_hospitalisation	0	5,888	100.00%	0.00%	✅ Complete
DEMOGRAPHICS
gender	278	5,610	95.28%	4.72%	🟢 Low
age_years	86	5,802	98.54%	1.46%	🟢 Low
age_cat	86	5,802	98.54%	1.46%	🟢 Low
age_cat5	86	5,802	98.54%	1.46%	🟢 Low
hospital	0	5,888	100.00%	0.00%	✅ Complete
CLINICAL
fever	249	5,639	95.77%	4.23%	🟢 Low
chills	249	5,639	95.77%	4.23%	🟢 Low
cough	249	5,639	95.77%	4.23%	🟢 Low
aches	249	5,639	95.77%	4.23%	🟢 Low
vomit	249	5,639	95.77%	4.23%	🟢 Low
temp	149	5,739	97.47%	2.53%	🟢 Low
MEASUREMENTS (COMPLETE)
lon	0	5,888	100.00%	0.00%	✅ Complete
lat	0	5,888	100.00%	0.00%	✅ Complete
wt_kg	0	5,888	100.00%	0.00%	✅ Complete
ht_cm	0	5,888	100.00%	0.00%	✅ Complete
ct_blood	0	5,888	100.00%	0.00%	✅ Complete
bmi	0	5,888	100.00%	0.00%	✅ Complete
¹ High missingness (> 70%) may affect analysis validity.

3.4.2 Linear Regression Model

Code

# ── Linear Regression Example ──────────────────────────────────────────────

# Step 1: Fit the model
lm_model <- lm(age ~ gender + hospital + outcome, data = linelist)

# Step 2: Create the summary table
tbl_regression(lm_model,
               label        = list(
      gender   ~ "Gender",
      hospital ~ "Hospital",
      outcome  ~ "Outcome"

      )) %>% 
  
  bold_p(t = 0.05) %>%                        # Bold p-values < 0.05
  bold_labels() %>%                            # Bold variable labels
  italicize_levels() %>%                       # Italicize variable levels
  add_significance_stars() %>%                 # Add * ** *** stars
  add_glance_source_note() %>%                 # Add model fit stats (R², AIC)
  modify_header(label = "**Variable**") %>%    # Rename label column
  # modify_caption("**Linear Regression Model: Predictors of Age**") %>% 
  
  as_kable_extra(                             # For PDF output
    booktabs  = TRUE,
    linesep   = "",
    longtable = FALSE
  ) %>%
  kable_styling(
    latex_options = c(
      "hold_position",
      "scale_down",
      "striped"),
    stripe_color  = "#f2f2f2",
    position      = "center",
    full_width = FALSE,
    font_size = 14
    )

Table 5: Linear Regression Model: Predictors of Age

Variable	Beta	SE
Gender
f	—	—
m	6.9***	0.367
Hospital
Central Hospital	—	—
Military Hospital	0.78	0.802
Missing	0.24	0.751
Other	0.94	0.807
Port Hospital	0.86	0.734
St. Mark's Maternity Hospital (SMMH)	0.48	0.948
Outcome
Death	—	—
Recover	0.27	0.370
R² = 0.076; Adjusted R² = 0.074; Sigma = 12.1; Statistic = 51.0; p-value = <0.001; df = 7; Log-likelihood = -17,045; AIC = 34,107; BIC = 34,164; Deviance = 636,703; Residual df = 4,350; No. Obs. = 4,358
Abbreviations: CI = Confidence Interval, SE = Standard Error
¹ p<0.05; p<0.01; **p<0.001

3.4.3 Logistic Regression Model

Logistic Regression model is used to model the relationship between a binary outcome variable and one or more predictor variables. The gtsummary package can be used to create a summary table of the logistic regression model results, including odds ratios (OR), confidence intervals, p-values, and model fit statistics.

Code

# ── Logistic Regression Example ────────────────────────────────────────────

# Step 1: Fit the model
logit_model <- glm(
  outcome ~ age + gender + hospital + ct_blood + temp + wt_kg + ht_cm, data   = linelist %>% 
    mutate(outcome = ifelse(outcome == "Death", 1, 0)),
  family = binomial
)

# Step 2: Create the summary table
tbl_regression(
  logit_model,
  exponentiate = TRUE,               # Converts log-odds to Odds Ratios (OR)
  label        = list(
      age      ~ "Age",
      gender   ~ "Gender",
      hospital ~ "Hospital",
      ct_blood ~ "CT Values",
      temp     ~ "Temperature (°C)",
      wt_kg    ~ "Weight (kg)",
      ht_cm    ~ "Height (cm)")) %>%
  bold_p(t    = 0.05) %>%           # Bold significant p-values
  bold_labels() %>%                 # Bold variable names
  italicize_levels() %>%            # Italicize category levels
  add_significance_stars() %>%      # Add * ** *** to significant rows
  add_n() %>%                       # Add number of observations
  add_nevent() %>%                  # Add number of events (Deaths)
  add_glance_source_note() %>%      # Add AIC, BIC, log-likelihood
  modify_header(
    label    = "**Variable**",
    estimate = "**OR**",            # Rename estimate column to OR
    ci       = "**95% CI**"
  ) %>%
  # modify_caption("**Logistic Regression: Predictors of Death Outcome**") %>% 
  
# Step 3: For PDF output, use kableExtra for enhanced formatting
  as_kable_extra(
    booktabs  = TRUE,
    linesep   = "",
    longtable = FALSE
  )  %>%
  kable_styling(
    latex_options = c(
      "hold_position",
      "scale_down",
      "striped"),
    stripe_color  = "#f2f2f2",
    position      = "center",
    full_width = TRUE,
    font_size = 14) #%>%
#   column_spec(1, width = "6cm") %>%   # Variable column - wider
#   column_spec(2, width = "3cm") %>%   # N
#   column_spec(3, width = "3cm") %>%   # Event (N)
#   column_spec(4, width = "2.5cm") %>% # OR
#   column_spec(5, width = "2.5cm") %>% # SE
#   column_spec(6, width = "2.5cm")     # 95%CI

Table 6: Logistic Regression: Predictors of Death Outcome

Variable	N	Event N	OR	SE	95% CI
Age	4,243	2,395	0.99	0.005	0.98, 1.00
Gender	4,243	2,395
f			—	—	—
m			1.00	0.067	0.87, 1.14
Hospital	4,243	2,395
Central Hospital			—	—	—
Military Hospital			1.11	0.135	0.85, 1.45
Missing			0.98	0.126	0.76, 1.25
Other			1.14	0.136	0.88, 1.49
Port Hospital			1.10	0.123	0.86, 1.40
St. Mark's Maternity Hospital (SMMH)			1.37	0.161	1.00, 1.88
CT Values	4,243	2,395	1.03	0.019	1.00, 1.07
Temperature (°C)	4,243	2,395	1.00	0.032	0.94, 1.06
Weight (kg)	4,243	2,395	1.00	0.004	1.00, 1.01
Height (cm)	4,243	2,395	1.00	0.002	1.00, 1.00
Null deviance = 5,811; Null df = 4,242; Log-likelihood = -2,899; AIC = 5,822; BIC = 5,898; Deviance = 5,798; Residual df = 4,231; No. Obs. = 4,243
Abbreviations: CI = Confidence Interval, OR = Odds Ratio, SE = Standard Error
¹ p<0.05; p<0.01; **p<0.001

3.4.4 Univariate and Multivariable Logistic Regression Models with gt summary tables

Univariate and Multivariable Logistic Regression models are used to model the relationship between a binary outcome variable and one or more predictor variables, while controlling for potential confounders. The gtsummary package can be used to create summary tables for both univariate and multivariable logistic regression models, allowing for easy comparison of the results.

Code

# Univariate and Multivariable Logistic Regression Models with gt summary tables

# ── Step 1: Prepare the dataset ──────────────────────────────────────────────
linelist_model <- linelist %>%
  mutate(
    # Recode outcome to binary (1 for Death, 0 for Recover)
    outcome = ifelse(outcome == "Death", 1, 0)) %>%
  select(outcome, age, gender, hospital, ct_blood, temp, wt_kg, ht_cm) %>%
  drop_na()

# ── Step 2: Define variable labels ───────────────────────────────────────────
var_labels <- list(
  age      ~ "Age",
  gender   ~ "Gender",
  hospital ~ "Hospital",
  ct_blood ~ "CT Values",
  temp     ~ "Temperature (°C)",
  wt_kg    ~ "Weight (kg)",
  ht_cm    ~ "Hieght (cm)"
)

# ── Step 3: Fit multivariable model ──────────────────────────────────────────
logit_model <- glm(
  # Model formula with outcome as dependent variable and predictors
  outcome ~ age + gender + hospital + ct_blood + temp + wt_kg + ht_cm,
  data   = linelist_model,
  family = binomial
)

# ── Step 4: Univariate table ─────────────────────────────────────────────────
tbl_uv <- linelist_model %>%
  tbl_uvregression(
    method       = glm, # Use glm for univariate logistic regression
    y            = outcome, # Specify the outcome variable for univariate models
    method.args  = list(family = binomial), # Specify family for logistic regression
    exponentiate = TRUE, # Exponentiate coefficients to get ORs
    label        = var_labels 
  ) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  italicize_levels()

# ── Step 5: Multivariable table ───────────────────────────────────────────────
tbl_mv <- tbl_regression(
  logit_model,
  exponentiate = TRUE,
  label        = var_labels
) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  italicize_levels()

# ── Step 6: Merge and render for PDF ─────────────────────────────────────────
tbl_merge(
  tbls        = list(tbl_uv, tbl_mv),
  tab_spanner = c("**Univariate**", "**Multivariable**")
) %>%
  # modify_caption("**Univariate and Multivariable Logistic Regression**") %>%
  as_kable_extra(
    booktabs  = TRUE,
    linesep   = "",
    longtable = FALSE            # set TRUE if table spans multiple pages
  ) %>%
  kable_styling(
    latex_options = c(
      "hold_position",           # prevents table from floating away
      "scale_down",              # shrinks table to fit page width
      "striped"                  # alternating row shading
    ),
    stripe_color  = "#f2f2f2",   # light gray stripe
    position      = "center",
    full_width    = TRUE,        # stretch to full page width
    font_size     = 14           # smaller font for PDF
  ) %>%
  column_spec(1, width = "4cm") %>%   # Variable column - wider
  column_spec(2, width = "1.5cm") %>% # N
  column_spec(3, width = "2cm") %>%   # OR (univariate)
  column_spec(4, width = "2.5cm") %>% # 95% CI (univariate)
  column_spec(5, width = "1.5cm") %>% # p-value (univariate)
  column_spec(6, width = "2cm") %>%   # OR (multivariable)
  column_spec(7, width = "2.5cm") %>% # 95% CI (multivariable)
  column_spec(8, width = "1.5cm")     # p-value (multivariable)

**Univariate and Multivariable Logistic Regression**
	Univariate				Multivariable
Characteristic	N	OR	95% CI	p-value	OR	95% CI	p-value
Age	4,243	1.00	0.99, 1.00	0.6	0.99	0.98, 1.00	0.2
Gender	4,243
f		—	—		—	—
m		1.01	0.90, 1.14	0.9	1.00	0.87, 1.14	>0.9
Hospital	4,243
Central Hospital		—	—		—	—
Military Hospital		1.10	0.85, 1.44	0.5	1.11	0.85, 1.45	0.4
Missing		0.97	0.76, 1.24	0.8	0.98	0.76, 1.25	0.8
Other		1.14	0.88, 1.49	0.3	1.14	0.88, 1.49	0.3
Port Hospital		1.09	0.86, 1.39	0.5	1.10	0.86, 1.40	0.4
St. Mark's Maternity Hospital (SMMH)		1.37	1.00, 1.88	0.049	1.37	1.00, 1.88	0.051
CT Values	4,243	1.03	1.00, 1.07	0.066	1.03	1.00, 1.07	0.071
Temperature (°C)	4,243	0.99	0.93, 1.06	0.8	1.00	0.94, 1.06	>0.9
Weight (kg)	4,243	1.00	1.00, 1.00	0.7	1.00	1.00, 1.01	0.3
Hieght (cm)	4,243	1.00	1.00, 1.00	>0.9	1.00	1.00, 1.00	0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio