I have been playing with the new across() function in dplyr which is a magical solution when you are looking to get several summary statistic across a range of variables. BUT…. working out how to get those summary stats nicely formatted is a bit of a challenge, so I’ve also been looking at options from the skimr and tableone packages.

my goal

To work out the best way to make tables containing summary stats.

load packages

library(tidyverse)
library(palmerpenguins)
library(skimr)
library(gt)
library(tableone)

get data

Lets play with the penguin data.

penguins <- penguins

I want a table that looks like this

penguin_summary <- tibble(
  penguin_bits = c("bill", "flipper"),
  mean = c(43.92, 200.92),
  sd = c(5.46, 14.06),
  min = c(32.1, 172),
  max = c(59.6, 231))


penguin_summary %>%
  gt::gt()

penguin_bits	mean	sd	min	max
bill	43.92	5.46	32.1	59.6
flipper	200.92	14.06	172.0	231.0

I want to get summary stats for bill and flipper length (mean, sd, min, and max) and have them display in a nice table like this. This manual df construction is a bit silly.

Is there an easy way to get summary stats that are formatted like this and compatible with gt()?

create summary stats the old way

penguins %>%
  summarise(bill_mean = mean(bill_length_mm, na.rm = TRUE),
            bill_sd = sd(bill_length_mm, na.rm = TRUE),
             bill_min = min(bill_length_mm, na.rm = TRUE),
             bill_max = max(bill_length_mm, na.rm = TRUE), 
             flipper_mean = mean(flipper_length_mm, na.rm = TRUE),
            flipper_sd = sd(flipper_length_mm, na.rm = TRUE),
             flipper_min = min(flipper_length_mm, na.rm = TRUE),
             flipper_max = max(flipper_length_mm, na.rm = TRUE)) %>%
  gt()

bill_mean	bill_sd	bill_min	bill_max	flipper_mean	flipper_sd	flipper_min	flipper_max
43.92193	5.459584	32.1	59.6	200.9152	14.06171	172	231

The summarise function spits out summary stats in SUPER wide format.

create summary stats using across()

The new across() function gets you the same thing in many fewer lines of code, but still, it is VERY wide.

penguins %>%
  summarise(across(contains("length"), 
                   list(mean = mean, sd = sd, min = min, max = max),                          na.rm = TRUE))  %>%
  gt()

bill_length_mm_mean	bill_length_mm_sd	bill_length_mm_min	bill_length_mm_max	flipper_length_mm_mean	flipper_length_mm_sd	flipper_length_mm_min	flipper_length_mm_max
43.92193	5.459584	32.1	59.6	200.9152	14.06171	172	231

skimr

skim is nice because it is tidyverse compatible (so you can use contains to select the variables you want to use) but you can also assign the output to a dataframe so you can mess with it.

skimtable <- penguins %>%
  select(contains("length")) %>%
  skim()

skimtable

Data summary
Name	Piped data
Number of rows	344
Number of columns	2
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂

skim gives you more than you need so there is some selecting and renaming to do after the fact. You do end up with a gt compatible dataframe though

names(skimtable)

##  [1] "skim_type"     "skim_variable" "n_missing"     "complete_rate"
##  [5] "numeric.mean"  "numeric.sd"    "numeric.p0"    "numeric.p25"  
##  [9] "numeric.p50"   "numeric.p75"   "numeric.p100"  "numeric.hist"

skimtable %>%
  select(skim_variable, numeric.mean, numeric.sd, numeric.p0, numeric.p100) %>%
  rename(penguin_bits = skim_variable, mean = numeric.mean, sd = numeric.sd, min = numeric.p0, max = numeric.p100) %>%
  gt()

penguin_bits	mean	sd	min	max
bill_length_mm	43.92193	5.459584	32.1	59.6
flipper_length_mm	200.91520	14.061714	172.0	231.0

tableone package

The TableOne package gives you mean and SD, I wonder whether there is a way to add other summary stats (like min & max to the TableOne?)When I assign the tableone to an object, it ends up being a list, which doesn’t work with gt(), unfortunately.

to_summarise <- c("bill_length_mm", "flipper_length_mm")

CreateTableOne(vars = to_summarise, data = penguins)

##                                
##                                 Overall       
##   n                                344        
##   bill_length_mm (mean (SD))     43.92 (5.46) 
##   flipper_length_mm (mean (SD)) 200.92 (14.06)

challenges/successes

There are lots of summary stat options which are all pretty easy to use. 😀.

I like the across() function a lot, but I don’t know how to format the output to look like this.

penguin_summary %>%
  gt::gt()

penguin_bits	mean	sd	min	max
bill	43.92	5.46	32.1	59.6
flipper	200.92	14.06	172.0	231.0

Ideally, I want to be able to use summarise() and across() and somehow make the wide output long. The problem is that pivot_longer() will take more than 1 “names_to” argument, but not more than a single “values_to” argument.

It would be great if pivot worked like this…

penguins %>%
  summarise(across(contains("length"),
        list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
      pivot_longer(names_to = "penguin_bits", values_to = c("mean", "sd", "max", "min"), 1:8, values_sep = "_")

next steps…

maybe I need to make a reprex!

tables and summary stats

Jen Richmond

24/02/2021