I have been playing with the new across() function in dplyr which is a magical solution when you are looking to get several summary statistic across a range of variables. BUT…. working out how to get those summary stats nicely formatted is a bit of a challenge, so I’ve also been looking at options from the skimr and tableone packages.

my goal

To work out the best way to make tables containing summary stats.

load packages

library(tidyverse)
library(palmerpenguins)
library(skimr)
library(gt)
library(tableone)

get data

Lets play with the penguin data.

penguins <- penguins

I want a table that looks like this

penguin_summary <- tibble(
  penguin_bits = c("bill", "flipper"),
  mean = c(43.92, 200.92),
  sd = c(5.46, 14.06),
  min = c(32.1, 172),
  max = c(59.6, 231))


penguin_summary %>%
  gt::gt()
penguin_bits mean sd min max
bill 43.92 5.46 32.1 59.6
flipper 200.92 14.06 172.0 231.0

I want to get summary stats for bill and flipper length (mean, sd, min, and max) and have them display in a nice table like this. This manual df construction is a bit silly.

Is there an easy way to get summary stats that are formatted like this and compatible with gt()?

create summary stats the old way

penguins %>%
  summarise(bill_mean = mean(bill_length_mm, na.rm = TRUE),
            bill_sd = sd(bill_length_mm, na.rm = TRUE),
             bill_min = min(bill_length_mm, na.rm = TRUE),
             bill_max = max(bill_length_mm, na.rm = TRUE), 
             flipper_mean = mean(flipper_length_mm, na.rm = TRUE),
            flipper_sd = sd(flipper_length_mm, na.rm = TRUE),
             flipper_min = min(flipper_length_mm, na.rm = TRUE),
             flipper_max = max(flipper_length_mm, na.rm = TRUE)) %>%
  gt()
bill_mean bill_sd bill_min bill_max flipper_mean flipper_sd flipper_min flipper_max
43.92193 5.459584 32.1 59.6 200.9152 14.06171 172 231

The summarise function spits out summary stats in SUPER wide format.

create summary stats using across()

The new across() function gets you the same thing in many fewer lines of code, but still, it is VERY wide.

penguins %>%
  summarise(across(contains("length"), 
                   list(mean = mean, sd = sd, min = min, max = max),                          na.rm = TRUE))  %>%
  gt()
bill_length_mm_mean bill_length_mm_sd bill_length_mm_min bill_length_mm_max flipper_length_mm_mean flipper_length_mm_sd flipper_length_mm_min flipper_length_mm_max
43.92193 5.459584 32.1 59.6 200.9152 14.06171 172 231

skimr

skim is nice because it is tidyverse compatible (so you can use contains to select the variables you want to use) but you can also assign the output to a dataframe so you can mess with it.

skimtable <- penguins %>%
  select(contains("length")) %>%
  skim()

skimtable
Data summary
Name Piped data
Number of rows 344
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂

skim gives you more than you need so there is some selecting and renaming to do after the fact. You do end up with a gt compatible dataframe though

names(skimtable)
##  [1] "skim_type"     "skim_variable" "n_missing"     "complete_rate"
##  [5] "numeric.mean"  "numeric.sd"    "numeric.p0"    "numeric.p25"  
##  [9] "numeric.p50"   "numeric.p75"   "numeric.p100"  "numeric.hist"
skimtable %>%
  select(skim_variable, numeric.mean, numeric.sd, numeric.p0, numeric.p100) %>%
  rename(penguin_bits = skim_variable, mean = numeric.mean, sd = numeric.sd, min = numeric.p0, max = numeric.p100) %>%
  gt()
penguin_bits mean sd min max
bill_length_mm 43.92193 5.459584 32.1 59.6
flipper_length_mm 200.91520 14.061714 172.0 231.0

tableone package

The TableOne package gives you mean and SD, I wonder whether there is a way to add other summary stats (like min & max to the TableOne?)When I assign the tableone to an object, it ends up being a list, which doesn’t work with gt(), unfortunately.

to_summarise <- c("bill_length_mm", "flipper_length_mm")

CreateTableOne(vars = to_summarise, data = penguins)
##                                
##                                 Overall       
##   n                                344        
##   bill_length_mm (mean (SD))     43.92 (5.46) 
##   flipper_length_mm (mean (SD)) 200.92 (14.06)

challenges/successes

There are lots of summary stat options which are all pretty easy to use. 😀.

I like the across() function a lot, but I don’t know how to format the output to look like this.

penguin_summary %>%
  gt::gt()
penguin_bits mean sd min max
bill 43.92 5.46 32.1 59.6
flipper 200.92 14.06 172.0 231.0

Ideally, I want to be able to use summarise() and across() and somehow make the wide output long. The problem is that pivot_longer() will take more than 1 “names_to” argument, but not more than a single “values_to” argument.

It would be great if pivot worked like this…

penguins %>%
  summarise(across(contains("length"),
        list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
      pivot_longer(names_to = "penguin_bits", values_to = c("mean", "sd", "max", "min"), 1:8, values_sep = "_")
      

next steps…