Create publication-ready analytical and summary tables using
the {gtsummary} package.
The {gtsummary} package offers an elegant, flexible way to produce
publication-ready analytical and summary tables in R.
It uses sensible defaults with fully customizable features to
summarize datasets, regression models, and more.
Here are the steps for using the {gtsummary} package:
- Load the libraries.
# install.packages("gtsummary")
library(gtsummary)
library(gt)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mlbench)
- Load the dataset.
# load the dataset
data(gss_cat)
str(gss_cat)
## tibble [21,483 × 9] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:21483] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ marital: Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...
## $ age : int [1:21483] 26 48 67 39 25 25 36 44 44 47 ...
## $ race : Factor w/ 4 levels "Other","Black",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ rincome: Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
## $ partyid: Factor w/ 10 levels "No answer","Don't know",..: 6 5 7 6 9 10 5 8 9 4 ...
## $ relig : Factor w/ 16 levels "No answer","Don't know",..: 15 15 15 6 12 15 5 15 15 15 ...
## $ denom : Factor w/ 30 levels "No answer","Don't know",..: 25 23 3 30 30 25 30 15 4 25 ...
## $ tvhours: int [1:21483] 12 NA 2 4 1 NA 3 NA 0 3 ...
ls(gss_cat)
## [1] "age" "denom" "marital" "partyid" "race" "relig" "rincome"
## [8] "tvhours" "year"
- Filter the dataset to the year 2014 and to the Black and White race categories only.
gss_cat_2014 <- gss_cat %>%
  filter(year == 2014, race %in% c("Black", "White")) %>%
  # Use droplevels() to remove factor levels that have no observations left after filtering
  droplevels()
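Why the droplevels step matters can be seen in a small base R sketch. The `race` vector below is a made-up toy factor standing in for `gss_cat$race`, not the real data: subsetting a factor keeps its unused levels, and those would otherwise show up as zero-count rows in later tables.

```r
# Toy factor standing in for gss_cat$race (values are made up for illustration)
race <- factor(c("Black", "White", "Other", "White"),
               levels = c("Other", "Black", "White"))

# Subsetting keeps every original level, even those with no observations left
kept <- race[race %in% c("Black", "White")]
levels(kept)
table(kept)

# droplevels() removes the levels that no longer occur
kept_clean <- droplevels(kept)
levels(kept_clean)
```

The same idea applies to a whole data frame: `droplevels()` drops unused levels from every factor column at once.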
- Perform EDA on gss_cat_2014.
library(summarytools)
##
## Attaching package: 'summarytools'
## The following object is masked from 'package:tibble':
##
## view
dfSummary(gss_cat_2014,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE,
tmp.img.dir = "/tmp",
max.distinct.values = 30)
## ### Data Frame Summary
## #### gss_cat_2014
## **Dimensions:** 2276 x 9
## **Duplicates:** 22
##
## +----+-----------+------------------------------+----------------------+----------------------+---------+
## | No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
## +====+===========+==============================+======================+======================+=========+
## | 1 | year\ | 1 distinct value | 2014 : 2276 (100.0%) |  | 0\ |
## | | [integer] | | | | (0.0%) |
## +----+-----------+------------------------------+----------------------+----------------------+---------+
## | 2 | marital\ | 1\. No answer\ | 4 ( 0.2%)\ |  | 0\ |
## | | [factor] | 2\. Never married\ | 584 (25.7%)\ | | (0.0%) |
## | | | 3\. Separated\ | 69 ( 3.0%)\ | | |
## | | | 4\. Divorced\ | 374 (16.4%)\ | | |
## | | | 5\. Widowed\ | 199 ( 8.7%)\ | | |
## | | | 6\. Married | 1046 (46.0%) | | |
## +----+-----------+------------------------------+----------------------+----------------------+---------+
## | 3 | age\ | Mean (sd) : 49.8 (17.5)\ | 72 distinct values |  | 8\ |
## | | [integer] | min < med < max:\ | | | (0.4%) |
## | | | 18 < 50 < 89\ | | | |
## | | | IQR (CV) : 28 (0.4) | | | |
## +----+-----------+------------------------------+----------------------+----------------------+---------+
## | 4 | race\ | 1\. Black\ | 386 (17.0%)\ |  | 0\ |
## | | [factor] | 2\. White | 1890 (83.0%) | | (0.0%) |
## +----+-----------+------------------------------+----------------------+----------------------+---------+
## | 5 | rincome\ | 1\. Don't know\ | 17 ( 0.7%)\ |  | 0\ |
## | | [factor] | 2\. Refused\ | 64 ( 2.8%)\ | | (0.0%) |
## | | | 3\. $25000 or more\ | 869 (38.2%)\ | | |
## | | | 4\. $20000 - 24999\ | 126 ( 5.5%)\ | | |
## | | | 5\. $15000 - 19999\ | 75 ( 3.3%)\ | | |
## | | | 6\. $10000 - 14999\ | 95 ( 4.2%)\ | | |
## | | | 7\. $8000 to 9999\ | 24 ( 1.1%)\ | | |
## | | | 8\. $7000 to 7999\ | 11 ( 0.5%)\ | | |
## | | | 9\. $6000 to 6999\ | 21 ( 0.9%)\ | | |
## | | | 10\. $5000 to 5999\ | 27 ( 1.2%)\ | | |
## | | | 11\. $4000 to 4999\ | 22 ( 1.0%)\ | | |
## | | | 12\. $3000 to 3999\ | 33 ( 1.4%)\ | | |
## | | | 13\. $1000 to 2999\ | 32 ( 1.4%)\ | | |
## | | | 14\. Lt $1000\ | 27 ( 1.2%)\ | | |
## | | | 15\. Not applicable | 833 (36.6%) | | |
## +----+-----------+------------------------------+----------------------+----------------------+---------+
## | 6 | partyid\ | 1\. No answer\ | 22 ( 1.0%)\ |  | 0\ |
## | | [factor] | 2\. Don't know\ | 1 ( 0.0%)\ | | (0.0%) |
## | | | 3\. Other party\ | 57 ( 2.5%)\ | | |
## | | | 4\. Strong republican\ | 238 (10.5%)\ | | |
## | | | 5\. Not str republican\ | 277 (12.2%)\ | | |
## | | | 6\. Ind,near rep\ | 228 (10.0%)\ | | |
## | | | 7\. Independent\ | 416 (18.3%)\ | | |
## | | | 8\. Ind,near dem\ | 292 (12.8%)\ | | |
## | | | 9\. Not str democrat\ | 354 (15.6%)\ | | |
## | | | 10\. Strong democrat | 391 (17.2%) | | |
## +----+-----------+------------------------------+----------------------+----------------------+---------+
## | 7 | relig\ | 1\. No answer\ | 10 ( 0.4%)\ |  | 0\ |
## | | [factor] | 2\. Don't know\ | 3 ( 0.1%)\ | | (0.0%) |
## | | | 3\. Inter-nondenominational\ | 4 ( 0.2%)\ | | |
## | | | 4\. Christian\ | 124 ( 5.4%)\ | | |
## | | | 5\. Orthodox-christian\ | 9 ( 0.4%)\ | | |
## | | | 6\. Moslem/islam\ | 7 ( 0.3%)\ | | |
## | | | 7\. Other eastern\ | 1 ( 0.0%)\ | | |
## | | | 8\. Hinduism\ | 1 ( 0.0%)\ | | |
## | | | 9\. Buddhism\ | 18 ( 0.8%)\ | | |
## | | | 10\. Other\ | 20 ( 0.9%)\ | | |
## | | | 11\. None\ | 467 (20.5%)\ | | |
## | | | 12\. Jewish\ | 39 ( 1.7%)\ | | |
## | | | 13\. Catholic\ | 494 (21.7%)\ | | |
## | | | 14\. Protestant | 1079 (47.4%) | | |
## +----+-----------+------------------------------+----------------------+----------------------+---------+
## | 8 | denom\ | 1\. No answer\ | 10 ( 0.4%)\ |  | 0\ |
## | | [factor] | 2\. Don't know\ | 3 ( 0.1%)\ | | (0.0%) |
## | | | 3\. No denomination\ | 276 (12.1%)\ | | |
## | | | 4\. Other\ | 235 (10.3%)\ | | |
## | | | 5\. Episcopal\ | 38 ( 1.7%)\ | | |
## | | | 6\. Presbyterian-dk wh\ | 26 ( 1.1%)\ | | |
## | | | 7\. Presbyterian, merged\ | 11 ( 0.5%)\ | | |
## | | | 8\. Other presbyterian\ | 2 ( 0.1%)\ | | |
## | | | 9\. United pres ch in us\ | 9 ( 0.4%)\ | | |
## | | | 10\. Presbyterian c in us\ | 7 ( 0.3%)\ | | |
## | | | 11\. Lutheran-dk which\ | 29 ( 1.3%)\ | | |
## | | | 12\. Evangelical luth\ | 16 ( 0.7%)\ | | |
## | | | 13\. Other lutheran\ | 3 ( 0.1%)\ | | |
## | | | 14\. Wi evan luth synod\ | 4 ( 0.2%)\ | | |
## | | | 15\. Lutheran-mo synod\ | 25 ( 1.1%)\ | | |
## | | | 16\. Luth ch in america\ | 4 ( 0.2%)\ | | |
## | | | 17\. Am lutheran\ | 9 ( 0.4%)\ | | |
## | | | 18\. Methodist-dk which\ | 17 ( 0.7%)\ | | |
## | | | 19\. Other methodist\ | 4 ( 0.2%)\ | | |
## | | | 20\. United methodist\ | 109 ( 4.8%)\ | | |
## | | | 21\. Afr meth ep zion\ | 4 ( 0.2%)\ | | |
## | | | 22\. Afr meth episcopal\ | 8 ( 0.4%)\ | | |
## | | | 23\. Baptist-dk which\ | 151 ( 6.6%)\ | | |
## | | | 24\. Other baptists\ | 20 ( 0.9%)\ | | |
## | | | 25\. Southern baptist\ | 143 ( 6.3%)\ | | |
## | | | 26\. Nat bapt conv usa\ | 3 ( 0.1%)\ | | |
## | | | 27\. Nat bapt conv of am\ | 9 ( 0.4%)\ | | |
## | | | 28\. Am bapt ch in usa\ | 16 ( 0.7%)\ | | |
## | | | 29\. Am baptist asso\ | 22 ( 1.0%)\ | | |
## | | | 30\. Not applicable | 1063 (46.7%) | | |
## +----+-----------+------------------------------+----------------------+----------------------+---------+
## | 9 | tvhours\ | Mean (sd) : 3 (2.6)\ | 0 : 106 ( 7.0%)\ |  | 765\ |
## | | [integer] | min < med < max:\ | 1 : 278 (18.4%)\ | | (33.6%) |
## | | | 0 < 2 < 24\ | 2 : 403 (26.7%)\ | | |
## | | | IQR (CV) : 3 (0.9) | 3 : 266 (17.6%)\ | | |
## | | | | 4 : 201 (13.3%)\ | | |
## | | | | 5 : 101 ( 6.7%)\ | | |
## | | | | 6 : 71 ( 4.7%)\ | | |
## | | | | 7 : 11 ( 0.7%)\ | | |
## | | | | 8 : 33 ( 2.2%)\ | | |
## | | | | 9 : 1 ( 0.1%)\ | | |
## | | | | 10 : 15 ( 1.0%)\ | | |
## | | | | 12 : 13 ( 0.9%)\ | | |
## | | | | 14 : 2 ( 0.1%)\ | | |
## | | | | 16 : 2 ( 0.1%)\ | | |
## | | | | 17 : 1 ( 0.1%)\ | | |
## | | | | 18 : 1 ( 0.1%)\ | | |
## | | | | 20 : 2 ( 0.1%)\ | | |
## | | | | 24 : 4 ( 0.3%) | | |
## +----+-----------+------------------------------+----------------------+----------------------+---------+
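The Missing column in the summary above can be cross-checked with base R alone. The sketch below uses a toy data frame built from the first ten `age` and `tvhours` values printed by `str()` earlier, so its counts are illustrative and do not match the filtered 2014 data.

```r
# Toy data frame: first ten age and tvhours values from the str() output above
toy <- data.frame(
  age     = c(26, 48, 67, 39, 25, 25, 36, 44, 44, 47),
  tvhours = c(12, NA, 2, 4, 1, NA, 3, NA, 0, 3)
)

# Number and percentage of missing values per column
n_missing   <- colSums(is.na(toy))
pct_missing <- round(100 * colMeans(is.na(toy)), 1)
data.frame(n_missing, pct_missing)
```

Running the same two lines on `gss_cat_2014` should reproduce the per-variable missing counts reported by `dfSummary()`.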
- Use tbl_summary() to summarize specific variables from the
dataset.
library(gtsummary)
# summarize the data with {gtsummary}
table1 <-
gss_cat_2014 %>%
tbl_summary(include = c(age, tvhours, race, marital, rincome)) %>%
# add table captions
as_gt() %>%
gt::tab_header(title = "Table 1. Clients Characteristics",
subtitle = " January to December, 2014")
table1
Table 1. Clients Characteristics
January to December, 2014

| Characteristic | N = 2,276 |
|---|---|
| age | 50 (35, 63) |
| Unknown | 8 |
| tvhours | 2 (1, 4) |
| Unknown | 765 |
| race | |
| Black | 386 (17%) |
| White | 1,890 (83%) |
| marital | |
| No answer | 4 (0.2%) |
| Never married | 584 (26%) |
| Separated | 69 (3.0%) |
| Divorced | 374 (16%) |
| Widowed | 199 (8.7%) |
| Married | 1,046 (46%) |
| rincome | |
| Don't know | 17 (0.7%) |
| Refused | 64 (2.8%) |
| $25000 or more | 869 (38%) |
| $20000 - 24999 | 126 (5.5%) |
| $15000 - 19999 | 75 (3.3%) |
| $10000 - 14999 | 95 (4.2%) |
| $8000 to 9999 | 24 (1.1%) |
| $7000 to 7999 | 11 (0.5%) |
| $6000 to 6999 | 21 (0.9%) |
| $5000 to 5999 | 27 (1.2%) |
| $4000 to 4999 | 22 (1.0%) |
| $3000 to 3999 | 33 (1.4%) |
| $1000 to 2999 | 32 (1.4%) |
| Lt $1000 | 27 (1.2%) |
| Not applicable | 833 (37%) |
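In Table 1, continuous variables such as age appear as a median with the first and third quartiles in parentheses, followed by an Unknown row counting missing values. A rough base R sketch of those quantities follows. The `age` vector is a toy sample (the first ten ages from the `str()` output plus one `NA` added for illustration), and `quantile()` is used with its default method, which may not match the quantile method of every {gtsummary} version.

```r
# Toy sample: first ten ages from the str() output, plus one NA added for illustration
age <- c(26, 48, 67, 39, 25, 25, 36, 44, 44, 47, NA)

# Count of missing values (the "Unknown" row)
n_unknown <- sum(is.na(age))

# Median and quartiles computed on the non-missing values
med <- median(age, na.rm = TRUE)
q   <- quantile(age, probs = c(0.25, 0.75), na.rm = TRUE)

c(median = med, q1 = q[[1]], q3 = q[[2]], unknown = n_unknown)
```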
- Use tbl_summary() with customization options to create
a cross-tabulation.
table2 <-
tbl_summary(
gss_cat_2014,
include = c(age, tvhours, relig, partyid, marital, rincome),
by = race, # split table by group
missing = "no" # don't list missing data separately
) %>%
# add_n() %>% # add column with total number of non-missing observations
# add_p() %>% # test for a difference between groups
modify_header(label = "**Variable**") %>% # update the column header
bold_labels() %>%
# add table captions
as_gt() %>%
gt::tab_header(title = "Table 2. Clients Characteristics by Race",
subtitle = " January to December, 2014")
table2
Table 2. Clients Characteristics by Race
January to December, 2014

| Variable | Black, N = 386 | White, N = 1,890 |
|---|---|---|
| age | 43 (30, 57) | 51 (36, 64) |
| tvhours | 3 (2, 5) | 2 (1, 4) |
| relig | | |
| No answer | 3 (0.8%) | 7 (0.4%) |
| Don't know | 0 (0%) | 3 (0.2%) |
| Inter-nondenominational | 2 (0.5%) | 2 (0.1%) |
| Christian | 36 (9.3%) | 88 (4.7%) |
| Orthodox-christian | 0 (0%) | 9 (0.5%) |
| Moslem/islam | 3 (0.8%) | 4 (0.2%) |
| Other eastern | 0 (0%) | 1 (<0.1%) |
| Hinduism | 0 (0%) | 1 (<0.1%) |
| Buddhism | 1 (0.3%) | 17 (0.9%) |
| Other | 1 (0.3%) | 19 (1.0%) |
| None | 62 (16%) | 405 (21%) |
| Jewish | 2 (0.5%) | 37 (2.0%) |
| Catholic | 28 (7.3%) | 466 (25%) |
| Protestant | 248 (64%) | 831 (44%) |
| partyid | | |
| No answer | 7 (1.8%) | 15 (0.8%) |
| Don't know | 0 (0%) | 1 (<0.1%) |
| Other party | 5 (1.3%) | 52 (2.8%) |
| Strong republican | 7 (1.8%) | 231 (12%) |
| Not str republican | 10 (2.6%) | 267 (14%) |
| Ind,near rep | 8 (2.1%) | 220 (12%) |
| Independent | 52 (13%) | 364 (19%) |
| Ind,near dem | 48 (12%) | 244 (13%) |
| Not str democrat | 82 (21%) | 272 (14%) |
| Strong democrat | 167 (43%) | 224 (12%) |
| marital | | |
| No answer | 2 (0.5%) | 2 (0.1%) |
| Never married | 167 (43%) | 417 (22%) |
| Separated | 19 (4.9%) | 50 (2.6%) |
| Divorced | 68 (18%) | 306 (16%) |
| Widowed | 33 (8.5%) | 166 (8.8%) |
| Married | 97 (25%) | 949 (50%) |
| rincome | | |
| Don't know | 3 (0.8%) | 14 (0.7%) |
| Refused | 7 (1.8%) | 57 (3.0%) |
| $25000 or more | 128 (33%) | 741 (39%) |
| $20000 - 24999 | 22 (5.7%) | 104 (5.5%) |
| $15000 - 19999 | 19 (4.9%) | 56 (3.0%) |
| $10000 - 14999 | 22 (5.7%) | 73 (3.9%) |
| $8000 to 9999 | 4 (1.0%) | 20 (1.1%) |
| $7000 to 7999 | 3 (0.8%) | 8 (0.4%) |
| $6000 to 6999 | 6 (1.6%) | 15 (0.8%) |
| $5000 to 5999 | 6 (1.6%) | 21 (1.1%) |
| $4000 to 4999 | 6 (1.6%) | 16 (0.8%) |
| $3000 to 3999 | 11 (2.8%) | 22 (1.2%) |
| $1000 to 2999 | 8 (2.1%) | 24 (1.3%) |
| Lt $1000 | 8 (2.1%) | 19 (1.0%) |
| Not applicable | 133 (34%) | 700 (37%) |
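With `by = race`, the percentages in Table 2 are column percentages: each count is divided by the size of its race group, not by the overall total. The sketch below checks this in base R using the marital-status counts copied from Table 2. The object names `counts` and `col_pct` are chosen here for illustration.

```r
# Marital-status counts by race, copied from Table 2
counts <- matrix(
  c(2, 167, 19, 68, 33, 97,       # Black
    2, 417, 50, 306, 166, 949),   # White
  ncol = 2,
  dimnames = list(
    marital = c("No answer", "Never married", "Separated",
                "Divorced", "Widowed", "Married"),
    race    = c("Black", "White")
  )
)

# Column totals should equal the group sizes in the table header
colSums(counts)

# Column percentages: each cell divided by its column total
col_pct <- 100 * prop.table(counts, margin = 2)
round(col_pct, 1)
```

Reading across a row therefore compares the share of each race group in that category, which is usually what a cross-tabulation by group is meant to show.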
Regression Models
- Use tbl_regression() to display linear regression
model results in a table.
# Get the marketing dataset from the {datarium} package
library(datarium)
data(marketing)
head(marketing, 5)
## youtube facebook newspaper sales
## 1 276.12 45.36 83.04 26.52
## 2 53.40 47.16 54.12 12.48
## 3 20.64 55.08 83.16 11.16
## 4 181.80 49.56 70.20 22.20
## 5 216.96 12.96 70.08 15.48
# Create a scatter plot with smoothed line displaying the sales units versus YouTube advertising budget.
ggplot(marketing, aes(x = youtube, y = sales)) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Create a scatter plot with smoothed line displaying the sales units versus Facebook advertising budget.
ggplot(marketing, aes(x = facebook, y = sales)) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Create a scatter plot with smoothed line displaying the sales units versus newspaper advertising budget.
ggplot(marketing, aes(x = newspaper, y = sales)) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Fit a multiple linear regression that predicts sales from the YouTube, Facebook, and newspaper advertising budgets.
model0 <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(model0)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5932 -1.0690 0.2902 1.4272 3.3951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.526667 0.374290 9.422 <2e-16 ***
## youtube 0.045765 0.001395 32.809 <2e-16 ***
## facebook 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
# Display table for regression model0
model0 %>%
tbl_regression() %>%
# add table captions
as_gt() %>%
gt::tab_header(title = "Table 3. Linear Regression Analysis for Sales",
subtitle = " Dataset: Marketing {datarium}")
Table 3. Linear Regression Analysis for Sales
Dataset: Marketing {datarium}

| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| youtube | 0.05 | 0.04, 0.05 | <0.001 |
| facebook | 0.19 | 0.17, 0.21 | <0.001 |
| newspaper | 0.00 | -0.01, 0.01 | 0.9 |
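The Beta, 95% CI, and p-value columns of Table 3 correspond to quantities that can be pulled from a fitted `lm` object with base R. The sketch below uses the built-in `mtcars` data as a stand-in, so that it runs without {datarium}. The model formula and the object names are illustrative assumptions, not part of the original analysis.

```r
# mtcars ships with R; it stands in for the marketing data here
fit <- lm(mpg ~ wt + hp, data = mtcars)

beta <- coef(fit)                                 # point estimates ("Beta")
ci   <- confint(fit, level = 0.95)                # t-based 95% confidence intervals
pval <- summary(fit)$coefficients[, "Pr(>|t|)"]   # p-values from the t-tests

# Combine into one table and drop the intercept row,
# which tbl_regression() leaves out by default
res <- cbind(beta = beta, ci, p.value = pval)[-1, ]
round(res, 3)
```

Comparing such a table with the `tbl_regression()` output is a quick way to confirm that the formatted table reports what the model summary says.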
- Finally, use tbl_regression() to display logistic regression
model results in a table.
# Get the trial dataset from the {gtsummary} package
data(trial)
head(trial, 5)
## # A tibble: 5 × 8
## trt age marker stage grade response death ttdeath
## <chr> <dbl> <dbl> <fct> <fct> <int> <int> <dbl>
## 1 Drug A 23 0.16 T1 II 0 0 24
## 2 Drug B 9 1.11 T2 I 1 0 24
## 3 Drug A 31 0.277 T1 II 0 0 24
## 4 Drug A NA 2.07 T3 III 1 1 17.6
## 5 Drug A 51 2.77 T4 III 1 1 16.4
# Logistic regression models a binary outcome. Here it models tumor response as a function of treatment, age, and grade.
model1 <- glm(response ~ trt + age + grade, data = trial, family = binomial)
# Display table for regression model1
model1_tbl <- model1 %>%
tbl_regression(exponentiate = TRUE) %>%
# add table captions
as_gt() %>%
gt::tab_header(title = "Table 4. Logistic Regression Analysis for Tumor Response to Treatment",
subtitle = " Dataset: Trial {gtsummary}")
model1_tbl
Table 4. Logistic Regression Analysis for Tumor Response to Treatment
Dataset: Trial {gtsummary}

| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Chemotherapy Treatment | | | |
| Drug A | — | — | |
| Drug B | 1.13 | 0.60, 2.13 | 0.7 |
| Age | 1.02 | 1.00, 1.04 | 0.10 |
| Grade | | | |
| I | — | — | |
| II | 0.85 | 0.39, 1.85 | 0.7 |
| III | 1.01 | 0.47, 2.15 | >0.9 |
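Setting `exponentiate = TRUE` reports odds ratios instead of log-odds coefficients. The base R sketch below shows the transformation on a small logistic model fitted to the built-in `mtcars` data. The model is an illustrative stand-in for `model1`, and the intervals are Wald intervals from `confint.default()`, which may differ slightly from the intervals {gtsummary} reports.

```r
# Illustrative logistic model on built-in data (stand-in for model1)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# glm() estimates coefficients on the log-odds scale
log_odds <- coef(fit)

# Exponentiating gives odds ratios
odds_ratio <- exp(log_odds)

# Wald 95% intervals, exponentiated onto the odds-ratio scale
wald_ci <- exp(confint.default(fit, level = 0.95))

# Drop the intercept row to mirror the layout of a regression table
cbind(OR = odds_ratio, wald_ci)[-1, , drop = FALSE]
```

An odds ratio above 1 indicates higher odds of the outcome per unit increase in the predictor, and a value below 1 indicates lower odds. The dashes in Table 4 mark the reference level of each categorical predictor.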
A.M.D.G.