Summary Tables with gtsummary

Revealjs Presentation

Michael Arteaga

IBM 6400, Cal Poly Pomona

2025-04-11

Summarize gtsummary

Q.1 In Step 1, you have seen how to create a summary table and modify it for various statistical analyses. Summarize the package’s capabilities.

The gtsummary package in R is a practical tool for creating publication-ready summary tables. It is especially useful in clinical research but works well for any kind of data summarization.

Some Features Include:

tbl_summary(): generates summary statistics for each variable in the dataset
Stratified Tables: add a grouping variable with the by = argument to compare groups side by side
Regression Model Summaries: tbl_regression() creates clean summaries of models such as linear regression, logistic regression, and Cox models
Table Customization: modify labels, formatting, and reference levels with helper functions such as modify_header(), modify_caption(), and bold_labels()
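The features listed above can be sketched in a few lines. This is a minimal example, assuming the `trial` dataset that ships with gtsummary (not the survey data used later); the header and caption text are illustrative only.

```r
library(dplyr)
library(gtsummary)

# `trial` is an example dataset bundled with gtsummary
summary_tbl <- trial %>%
  tbl_summary(
    by = trt,                          # stratify by treatment arm
    include = c(age, grade, response), # variables to summarize
    missing = "no"                     # hide the missing-value rows
  ) %>%
  add_p() %>%                          # compare groups with default tests
  modify_header(label ~ "**Variable**") %>%
  modify_caption("**Illustrative summary by treatment**") %>%
  bold_labels()

summary_tbl
```

Each helper returns a gtsummary object, so the calls can be chained in any sensible order after tbl_summary().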

Question 1 Continued…

How may {gtsummary} benefit you for your school work or career?

With its many use cases, gtsummary could help in several ways. It can produce client-ready tables for presentations and campaign reports without manual formatting in Excel. It also supports audience segmentation: tbl_summary() can compare customer segments (age group, geographic region) across variables (revenue, CTR). For campaign A/B testing, you can fit logistic or linear regressions and display clear model output with tbl_regression(), which is ideal for showing differences in campaign performance with confidence intervals and p-values.
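The A/B testing idea might look like the sketch below. The data are simulated, and every column name (`variant`, `region`, `converted`) and conversion rate is a hypothetical chosen for illustration, not taken from any real campaign.

```r
library(gtsummary)

# Simulated campaign data (hypothetical columns and rates)
set.seed(42)
n <- 400
campaign <- data.frame(
  variant = factor(sample(c("A", "B"), n, replace = TRUE)),
  region  = factor(sample(c("East", "West"), n, replace = TRUE))
)
# By construction, variant B converts somewhat more often than A
campaign$converted <- rbinom(n, 1, ifelse(campaign$variant == "B", 0.30, 0.20))

# Logistic regression of conversion on variant, adjusting for region
ab_model <- glm(converted ~ variant + region, data = campaign, family = binomial)

# exponentiate = TRUE reports odds ratios instead of log-odds
ab_tbl <- tbl_regression(ab_model, exponentiate = TRUE)
ab_tbl
```

The resulting table lists an odds ratio, 95% confidence interval, and p-value for variant B relative to variant A.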

Q.2 gtsummary with Dr. Sjoberg

2.1. How does it differ from gt and gtExtras?

gtsummary is designed to create statistical summary tables (descriptive statistics, model outputs, p-values) automatically. Its features include one-line summaries of data, regression results, and group comparisons.

gt is a general table-building engine for highly customizable tables from data frames and tibbles. Highlights include full control over fonts, colors, borders, and alignment; building formatted tables from scratch; and export to HTML, PDF, and Word.

gtExtras serves as a companion to gt that adds visual polish and shortcuts. Some features include the ability to add inline bar plots, bullet charts, images, and emojis; the use of themed table templates; and implementing quick summaries with visuals.
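The packages can also be combined. As a minimal sketch, assuming the bundled `trial` dataset, a gtsummary table can be converted with as_gt() and then styled with ordinary gt functions; the title and font size here are arbitrary choices for illustration.

```r
library(gtsummary)
library(gt)

# Build the statistical summary with gtsummary ...
gt_tbl <- tbl_summary(trial, by = trt, include = c(age, grade)) %>%
  as_gt() %>%                                  # ... hand it over to gt ...
  tab_header(title = "Patient characteristics by treatment") %>%
  tab_options(table.font.size = px(14))        # ... and style it with gt

gt_tbl
```

After as_gt(), the object is a regular gt table, so gtsummary helpers such as bold_labels() need to be applied before the conversion.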

Question 2 Continued…

2.2. Give three things you learned newly that were not explained in the lecture in Step 1.

Beyond what was covered in the first lecture video, three new points stood out.

Adding p-values proved to be easy using the add_p() function as it lets you incorporate p-values into tables to facilitate statistical comparisons between groups.
Table customizations were also mentioned, which improve table readability and presentation through functions like modify_header(), bold_labels(), and italicize_levels().
He also discussed the ease of merging and stacking tables through the two functions tbl_merge() and tbl_stack(), which are useful for more comprehensive reporting.
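Merging and stacking could be sketched as follows, again assuming the bundled `trial` dataset. The two models and the "Unadjusted"/"Adjusted" labels are illustrative choices, not examples from the lecture.

```r
library(gtsummary)

# Two logistic models of tumor response on the bundled `trial` data
m_unadj <- glm(response ~ trt, data = trial, family = binomial)
m_adj   <- glm(response ~ trt + age + grade, data = trial, family = binomial)

t_unadj <- tbl_regression(m_unadj, exponentiate = TRUE)
t_adj   <- tbl_regression(m_adj, exponentiate = TRUE)

# tbl_merge() places the tables side by side under spanning headers
merged <- tbl_merge(
  tbls = list(t_unadj, t_adj),
  tab_spanner = c("**Unadjusted**", "**Adjusted**")
)

# tbl_stack() places one table on top of the other
stacked <- tbl_stack(
  tbls = list(t_unadj, t_adj),
  group_header = c("Unadjusted", "Adjusted")
)

merged
stacked
```

Merging suits comparisons of the same predictors across models, while stacking suits results that belong in one long table.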

Q.3 Choosing Variables

3. Apply what you learned to your MSDM CEP data. Choose two appropriate variables for cross-tabulation and show if the two variables are associated or not. Use appropriate statistics to test if the two variables are associated. Code, produce the table, and interpret the result.

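I chose age and gender as independent variables to see whether they are associated with responses to Q20 (read from column Q21 in the code below), which asks how likely the participant would be to use a time-saving service. The table compares age and the Q21 responses across gender groups. Neither comparison is statistically significant (age: p > 0.9, Kruskal-Wallis rank sum test; Q21: p = 0.2, Fisher's exact test), so there is no evidence that gender is associated with age or with how the question is answered. Note that the third column of the table (N = 1) is the question-label row from the survey export rather than a real respondent.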

library(tidyverse)
library(gtsummary)

# Load updated data
df <- read_csv("AI Impatience Video Survey Data.csv")

# Clean and prepare the data
df_clean <- df %>%
  filter(!is.na(Age), !is.na(Gender), !is.na(Q21)) %>%
  mutate(
    Age = as.numeric(Age),
    Gender = as.factor(Gender),
    Q20 = as.numeric(Q21)  # numeric copy of the Q21 response; not used in the table below
  )

# Create gtsummary table
df_clean %>%
  tbl_summary(
    by = Gender,
    include = c(Age, Q21),
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_p() %>%
  modify_caption("**Table: Influence of Age and Gender on Responses to Q20**") %>%
  bold_labels()

**Table: Influence of Age and Gender on Responses to Q20**
Characteristic	Female N = 46¹	Male N = 47¹	What is your gender? N = 1¹	p-value²
Age	21.6 (5.5)	21.0 (2.0)	NA (NA)	>0.9
Q21				0.2
1	12 (26%)	10 (21%)	0 (0%)
2	5 (11%)	9 (19%)	0 (0%)
3	2 (4.3%)	7 (15%)	0 (0%)
4	7 (15%)	5 (11%)	0 (0%)
5	9 (20%)	6 (13%)	0 (0%)
6	3 (6.5%)	3 (6.4%)	0 (0%)
7	8 (17%)	7 (15%)	0 (0%)
You're enjoying a quiet night in at home, about to watch a movie but you've just gotten a craving for your favorite restaurant. To what extent do you find it desirable to use an app-based delivery service (i.e. Uber Eats, Door Dash) and pay for the food delivery fee plus some tips (i.e., around 5-10 dollars in total) to bring your food to you in the comfort of your own home?	0 (0%)	0 (0%)	1 (100%)
¹ Mean (SD); n (%)
² Kruskal-Wallis rank sum test; Fisher’s exact test
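The "What is your gender?" column with N = 1 in the table above suggests that the question-label row of the export was treated as a respondent. A possible way to drop that row and produce a direct two-variable cross-tabulation is sketched below. The data frame is a simulated stand-in for the survey file, and collapsing the 7-point scale into two bands is an assumption made only to keep cell counts from becoming sparse.

```r
library(dplyr)
library(gtsummary)

# Simulated stand-in for the survey export (hypothetical values);
# the first row repeats the question text, as in the real file
set.seed(1)
survey <- tibble(
  Gender = c("What is your gender?",
             sample(c("Female", "Male"), 90, replace = TRUE)),
  Q21    = c("Question text",
             as.character(sample(1:7, 90, replace = TRUE)))
)

survey_clean <- survey %>%
  filter(Gender != "What is your gender?") %>%  # drop the label row
  mutate(
    Gender = factor(Gender),
    # Assumed cut point: 5 or above counts as a high rating
    Q21_band = if_else(as.numeric(Q21) >= 5, "High (5-7)", "Low (1-4)")
  )

# Cross-tabulate the two categorical variables and test for association
cross_tbl <- survey_clean %>%
  tbl_cross(row = Q21_band, col = Gender, percent = "column") %>%
  add_p()

cross_tbl
```

With the real data, the same filter would go before the as.numeric() and as.factor() conversions so that the label text never becomes a factor level.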

Q.4 Multiple Regressions

4. Using the MSDM CEP data, run multiple regressions. The dependent variable can be continuous or dichotomous. Regress a dependent variable on a set of independent variables. Code, produce the table, and interpret the result.

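The results below can be interpreted as follows:

Age coefficient: The estimates are close to zero for all four statements (Beta from -0.01 to 0.02), and every 95% CI includes zero, so age shows no clear relationship with agreement.
Gender: Shows how responses from females differ from those of males, the reference group. The largest difference is for Q23 (Beta = 0.67, 95% CI 0.00 to 1.3, p = 0.050), which is borderline; the differences for Q24 through Q26 are not statistically significant.
Significance (p-values): Indicates whether a predictor has a statistically significant relationship with the dependent variable. Apart from the borderline gender effect on Q23, no p-value falls below 0.05.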

In these results, we are using “Male” as the baseline, with “Female” showing in the regression output.

library(tidyverse)
library(gtsummary)

# Load and clean data
df <- read_csv("AI Impatience Video Survey Data.csv") %>%
  filter(
    Age != "What is your age?",
    Gender != "What is your gender?",
    Q23 != "How will you choose? Please indicate your disagreement or agreement with each of the following statements.",
    Q24 != "I would rather eat the cheese right away.",
    Q25 != "I would rather eat the cheese in the distant future.",
    Q26 != "I would rather wait and eat it later."
  ) %>%
  mutate(
    Age = as.numeric(Age),
    Gender = as.factor(Gender),
    Gender = fct_relevel(Gender, "Male"),
    Q23 = as.numeric(Q23),
    Q24 = as.numeric(Q24),
    Q25 = as.numeric(Q25),
    Q26 = as.numeric(Q26)
  ) %>%
  drop_na(Age, Gender, Q23, Q24, Q25, Q26)

# Run regressions
model_q23 <- lm(Q23 ~ Age + Gender, data = df)
model_q24 <- lm(Q24 ~ Age + Gender, data = df)
model_q25 <- lm(Q25 ~ Age + Gender, data = df)
model_q26 <- lm(Q26 ~ Age + Gender, data = df)

# Combine tables
tbl_merge(
  tbls = list(
    tbl_regression(model_q23, exponentiate = FALSE, label = list(Age = "Age", Gender = "Gender")),
    tbl_regression(model_q24, exponentiate = FALSE, label = list(Age = "Age", Gender = "Gender")),
    tbl_regression(model_q25, exponentiate = FALSE, label = list(Age = "Age", Gender = "Gender")),
    tbl_regression(model_q26, exponentiate = FALSE, label = list(Age = "Age", Gender = "Gender"))
  ),
  tab_spanner = c("**Q23: Choose Immediately**", "**Q24: Eat Cheese Right Away**",
                  "**Q25: Eat Cheese Later**", "**Q26: Wait and Eat Later**")
) %>%
  modify_caption("**Table: Regressions of Delay Discounting Statements on Age and Gender**") %>%
  bold_labels()

**Table: Regressions of Delay Discounting Statements on Age and Gender**
Characteristic	Q23: Choose Immediately			Q24: Eat Cheese Right Away			Q25: Eat Cheese Later			Q26: Wait and Eat Later
Characteristic	Beta	95% CI¹	p-value	Beta	95% CI¹	p-value	Beta	95% CI¹	p-value	Beta	95% CI¹	p-value
Age	0.02	-0.06, 0.10	0.7	-0.01	-0.10, 0.08	0.8	0.01	-0.06, 0.09	0.7	0.02	-0.06, 0.10	0.6
Gender
Male	—	—		—	—		—	—		—	—
Female	0.67	0.00, 1.3	0.050	0.49	-0.23, 1.2	0.2	-0.24	-0.85, 0.38	0.4	-0.35	-0.97, 0.28	0.3
¹ CI = Confidence Interval
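The four near-identical lm() and tbl_regression() calls above could also be generated in a loop. The sketch below shows the pattern on the bundled `trial` dataset rather than the survey data; the outcome and predictor names are taken from `trial` and are stand-ins for Q23 to Q26, Age, and Gender.

```r
library(gtsummary)

# Outcomes to model, each regressed on the same predictors
outcomes   <- c("age", "marker", "ttdeath")
predictors <- c("trt", "grade")

# Fit one linear model per outcome and summarize each with tbl_regression()
tbls <- lapply(outcomes, function(y) {
  fit <- lm(reformulate(predictors, response = y), data = trial)
  tbl_regression(fit)
})

# Merge the per-outcome tables side by side, one spanner per outcome
merged_models <- tbl_merge(
  tbls = tbls,
  tab_spanner = paste0("**", outcomes, "**")
)

merged_models
```

Adding a fifth outcome then only requires extending the `outcomes` vector.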

Published Presentation

https://rpubs.com/MDArteaga/1296709