I have been playing with the new across() function in dplyr which is a magical solution when you are looking to get several summary statistic across a range of variables. BUT…. working out how to get those summary stats nicely formatted is a bit of a challenge, so I’ve also been looking at options from the skimr and tableone packages.
To work out the best way to make tables containing summary stats.
library(tidyverse)
library(palmerpenguins)
library(skimr)
library(gt)
library(tableone)
Lets play with the penguin data.
penguins <- penguins
penguin_summary <- tibble(
penguin_bits = c("bill", "flipper"),
mean = c(43.92, 200.92),
sd = c(5.46, 14.06),
min = c(32.1, 172),
max = c(59.6, 231))
penguin_summary %>%
gt::gt()
| penguin_bits | mean | sd | min | max |
|---|---|---|---|---|
| bill | 43.92 | 5.46 | 32.1 | 59.6 |
| flipper | 200.92 | 14.06 | 172.0 | 231.0 |
I want to get summary stats for bill and flipper length (mean, sd, min, and max) and have them display in a nice table like this. This manual df construction is a bit silly.
Is there an easy way to get summary stats that are formatted like this and compatible with gt()?
penguins %>%
summarise(bill_mean = mean(bill_length_mm, na.rm = TRUE),
bill_sd = sd(bill_length_mm, na.rm = TRUE),
bill_min = min(bill_length_mm, na.rm = TRUE),
bill_max = max(bill_length_mm, na.rm = TRUE),
flipper_mean = mean(flipper_length_mm, na.rm = TRUE),
flipper_sd = sd(flipper_length_mm, na.rm = TRUE),
flipper_min = min(flipper_length_mm, na.rm = TRUE),
flipper_max = max(flipper_length_mm, na.rm = TRUE)) %>%
gt()
| bill_mean | bill_sd | bill_min | bill_max | flipper_mean | flipper_sd | flipper_min | flipper_max |
|---|---|---|---|---|---|---|---|
| 43.92193 | 5.459584 | 32.1 | 59.6 | 200.9152 | 14.06171 | 172 | 231 |
The summarise function spits out summary stats in SUPER wide format.
The new across() function gets you the same thing in many fewer lines of code, but still, it is VERY wide.
penguins %>%
summarise(across(contains("length"),
list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
gt()
| bill_length_mm_mean | bill_length_mm_sd | bill_length_mm_min | bill_length_mm_max | flipper_length_mm_mean | flipper_length_mm_sd | flipper_length_mm_min | flipper_length_mm_max |
|---|---|---|---|---|---|---|---|
| 43.92193 | 5.459584 | 32.1 | 59.6 | 200.9152 | 14.06171 | 172 | 231 |
skim is nice because it is tidyverse compatible (so you can use contains to select the variables you want to use) but you can also assign the output to a dataframe so you can mess with it.
skimtable <- penguins %>%
select(contains("length")) %>%
skim()
skimtable
| Name | Piped data |
| Number of rows | 344 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
skim gives you more than you need so there is some selecting and renaming to do after the fact. You do end up with a gt compatible dataframe though
names(skimtable)
## [1] "skim_type" "skim_variable" "n_missing" "complete_rate"
## [5] "numeric.mean" "numeric.sd" "numeric.p0" "numeric.p25"
## [9] "numeric.p50" "numeric.p75" "numeric.p100" "numeric.hist"
skimtable %>%
select(skim_variable, numeric.mean, numeric.sd, numeric.p0, numeric.p100) %>%
rename(penguin_bits = skim_variable, mean = numeric.mean, sd = numeric.sd, min = numeric.p0, max = numeric.p100) %>%
gt()
| penguin_bits | mean | sd | min | max |
|---|---|---|---|---|
| bill_length_mm | 43.92193 | 5.459584 | 32.1 | 59.6 |
| flipper_length_mm | 200.91520 | 14.061714 | 172.0 | 231.0 |
The TableOne package gives you mean and SD, I wonder whether there is a way to add other summary stats (like min & max to the TableOne?)When I assign the tableone to an object, it ends up being a list, which doesn’t work with gt(), unfortunately.
to_summarise <- c("bill_length_mm", "flipper_length_mm")
CreateTableOne(vars = to_summarise, data = penguins)
##
## Overall
## n 344
## bill_length_mm (mean (SD)) 43.92 (5.46)
## flipper_length_mm (mean (SD)) 200.92 (14.06)
There are lots of summary stat options which are all pretty easy to use. 😀.
I like the across() function a lot, but I don’t know how to format the output to look like this.
penguin_summary %>%
gt::gt()
| penguin_bits | mean | sd | min | max |
|---|---|---|---|---|
| bill | 43.92 | 5.46 | 32.1 | 59.6 |
| flipper | 200.92 | 14.06 | 172.0 | 231.0 |
Ideally, I want to be able to use summarise() and across() and somehow make the wide output long. The problem is that pivot_longer() will take more than 1 “names_to” argument, but not more than a single “values_to” argument.
It would be great if pivot worked like this…
penguins %>%
summarise(across(contains("length"),
list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE)) %>%
pivot_longer(names_to = "penguin_bits", values_to = c("mean", "sd", "max", "min"), 1:8, values_sep = "_")