强大的stat_summary神器

概述

对于tidyverse家族里的画图大杀器ggplot2，大家更熟悉于geom_*family*函数。但在其中还隐藏着34个stat_族函数。先把其中一个stat_summary拉出来跟大家认识一下。

iris数据集想必都不陌生

library(rmarkdown)
paged_table(iris)

应用场景

在对数据进行探索性分析时，一般要对某变量的分布特征进行可视化操作，以均数条数、误差线图等形式展示该变量的分布。下面我用两种不同方式进行操作。

比如我们想探索一下，iris数据集中不同Species的Sepal.Length的均数分布情况。

geom_family

此种方式我们一般先分组汇总计算不同组Species的Sepal.Length的均数，然后计算其标准误，再用geom_col和geom_errorbar函数作图。

library(tidyverse)

iris %>% 
  
  #分组汇总
  group_by(Species) %>% 
  summarise(
    mean_sepal_length = mean(Sepal.Length, na.rm = T),
    se_sepal_length = sd(Sepal.Length) / sqrt(length(Sepal.Length)),
    
    # 分组汇总后输出数据流时，最好要解散分组，不然可能有意外错误
    .groups = "drop"   
  ) %>%
  
  #计算上下限
  mutate(
    lower = mean_sepal_length - se_sepal_length,
    upper = mean_sepal_length + se_sepal_length,
  ) %>% 
  
  #绘图
  ggplot(aes(Species, mean_sepal_length))+
  geom_col()+
  geom_errorbar(aes(ymin = lower, ymax = upper))

此时，数据管道里的数据流是这样的

## # A tibble: 3 × 5
##   Species    mean_sepal_length se_sepal_length lower upper
##   <fct>                  <dbl>           <dbl> <dbl> <dbl>
## 1 setosa                  5.01          0.0498  4.96  5.06
## 2 versicolor              5.94          0.0730  5.86  6.01
## 3 virginica               6.59          0.0899  6.50  6.68

stat_summary()

如果用stat_summary(),可见过程就简单的多了，作同样的图，仅需四行代码。

library(tidyverse)

iris %>% 
  ggplot(aes(Species, Sepal.Length)) +
  stat_summary(geom = "bar") +
  stat_summary(geom = "errorbar")

卖高！鹅迷扔~~~~ 此时数据管道里我们输入进去的仅是原数据集，并没有做任何转换。

解析

既然我们没有对数据进行转换操作，那肯定是stat_summary()代劳了这部分工作。

数据流

看看源数据在传递到绘图指令后，数据流变成了什么样子？

p <- iris %>% 
  ggplot(aes(Species, Sepal.Length)) +
        stat_summary(geom = "bar") +
        stat_summary(geom = "errorbar")

layer_data(p, 1)

##   x group     y ymin  ymax PANEL flipped_aes xmin xmax
## 1 1     1 5.006    0 5.006     1       FALSE 0.55 1.45
## 2 2     2 5.936    0 5.936     1       FALSE 1.55 2.45
## 3 3     3 6.588    0 6.588     1       FALSE 2.55 3.45
##   colour   fill linewidth linetype alpha
## 1     NA grey35       0.5        1    NA
## 2     NA grey35       0.5        1    NA
## 3     NA grey35       0.5        1    NA

没错，这个函数stat_summary绘图层确实运算返回了一个包含均值y、下限值ymin和上限值ymax等其他数据的数据框。那是怎么来得呢？

如果细心的话，查看一下代码运行后的警告信息的话，会发现：

library(tidyverse)

iris %>% 
  ggplot(aes(Species, Sepal.Length)) +
  stat_summary(geom = "bar") +
  stat_summary(geom = "errorbar")

No summary function supplied, defaulting to mean_se()

mean_se函数

这句警示信息里提到了mean_se这个函数，让我们看看mean_se这个函数

mean_se

## function (x, mult = 1) 
## {
##     x <- stats::na.omit(x)
##     se <- mult * sqrt(stats::var(x)/length(x))
##     mean <- mean(x)
##     data_frame0(y = mean, ymin = mean - se, ymax = mean + se, 
##         .size = 1)
## }
## <bytecode: 0x000001ddfe8e7e38>
## <environment: namespace:ggplot2>

iris %>% 
  group_by(Species) %>%
  summarise(y = mean_se(Sepal.Length)) %>% 
  unnest()

## # A tibble: 3 × 4
##   Species        y  ymin  ymax
##   <fct>      <dbl> <dbl> <dbl>
## 1 setosa      5.01  4.96  5.06
## 2 versicolor  5.94  5.86  6.01
## 3 virginica   6.59  6.50  6.68

Bingo！正是这个函数输入一个向量后，汇总计算了mean和se，并返回了一个包含均值y、下限值ymin和上限值ymax的数据框。且数据与我们前面手动汇总的值一致。

我们再看一下stat_summary的参数说明，是怎么调用和配置数据。

formatR::usage(stat_summary)

## stat_summary(mapping = NULL, data = NULL,
##     geom = "pointrange", position = "identity", ...,
##     fun.data = NULL, fun = NULL, fun.max = NULL,
##     fun.min = NULL, fun.args = list(), na.rm = FALSE,
##     orientation = NA, show.legend = NA, inherit.aes = TRUE,
##     fun.y = deprecated(), fun.ymin = deprecated(),
##     fun.ymax = deprecated())

geom
The geometric object to use to display the data, either as a ggproto Geom subclass or as a string naming the geom stripped of the geom_ prefix (e.g. “point” rather than “geom_point”) 指定需要展示数据的几何对象，可以是ggproto Geom类对象，也可以是去除了geom前缀的geom_family函数的几何对象名。

fun.data
A function that is given the complete data and should return a data frame with variables ymin, y, and ymax. 指定一个可以返回ymin,y和ymax变量的数据框的函数。

fun.min, fun, fun.max Alternatively, supply three individual functions that are each passed a vector of values and should return a single number. 或者指定三个可以返回单一数值的独立函数。

fun.args
Optional additional arguments passed on to the functions. 指定需要传递的可选参数

那么我们逐个参数验证一下。

geom参数

pointrange

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(geom = "pointrange")

crossbar

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(geom = "crossbar")

当然几何对象要与fun.data给予的数据流相适应

fun.data参数

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(
    geom = "crossbar",
    fun.data = mean_sd
  )

实际上，fun.data可以是能返回符合指定geom需求数据的任何函数，包括其他包内自带的函数，也包括自定义函数。

ggpubr自带函数mean_range¹

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(geom = "crossbar",
               #均数和最大、最小值
               fun.data = ggpubr::mean_range
               )

也可以用自定义函数(改编自ggpubr包的median_hilow_函数)

median_hilo <- function (x, ci = 0.95, error.limit = "both") 
        {
          quant <- stats::quantile(x, 
                                   probs = c(0.5, (1 - ci)/2, (1 + ci)/2), 
                                   na.rm = TRUE)
          names(quant) <- c("median", "lower", "upper")
          data.frame(y = quant["median"], 
                     ymin = quant["lower"], 
                     ymax = quant["upper"]) 
}

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(fun.data = median_hilo)

fun.args

指定需要传递的可选参数

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(geom = "pointrange",
               fun.data = mean_se,
               fun.args = list(mult = 1.96))

也可以匿名函数的形式传递参数

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(geom = "pointrange",
               fun.data = ~mean_se(., mult = 1.96)
               )

隐藏技能

如果仔细观察的话，在上文提到的stat_summary绘图层时产生的数据流的结构会发现，除了必要的y,ymin,ymax之外，那个数据集里还有诸如color，fill, alpha等aes参数同时产生。

##   x group     y ymin  ymax PANEL flipped_aes xmin xmax
## 1 1     1 5.006    0 5.006     1       FALSE 0.55 1.45
## 2 2     2 5.936    0 5.936     1       FALSE 1.55 2.45
## 3 3     3 6.588    0 6.588     1       FALSE 2.55 3.45
##   colour   fill linewidth linetype alpha
## 1     NA grey35       0.5        1    NA
## 2     NA grey35       0.5        1    NA
## 3     NA grey35       0.5        1    NA

这样的话，如果我们在引用相应函数时，同时计算生成相应参数的话，我们甚至可以同步改变图像的aes参数

计算更改条图颜色

素图是这样的

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(
    geom = "bar",
    fun.data = mean_se
  )

计算自定义颜色

calc_mean_and_aes <- function(x, threshold = 5.5) {
   tibble(y = mean(x, na.rm = T)) %>% 
    mutate(fill = ifelse(y < threshold,  "#FC4E07", "#00AFBB"))
}

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(
    geom = "bar",
    fun.data = calc_mean_and_aes
  )

计算更改条图宽度

calc_mean_and_aes <- function(x, threshold = 5.5) {
  tibble(y = mean(x)) %>% 
    mutate(fill = ifelse(y < threshold,  "#FC4E07", "#00AFBB"),
           width = sd(x) / nrow(iris) * 100,
          )
}

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(
    geom = "bar",
    fun.data = calc_mean_and_aes
  )

计算更改条图透明度

calc_mean_and_aes <- function(x, threshold = 5.5) {
  tibble(y = mean(x)) %>% 
    mutate(fill = ifelse(y < threshold,  "#FC4E07", "#00AFBB"),
           width = sd(x) / nrow(iris) * 100,
           alpha = ifelse(y < threshold, 0.8, 0.3)
          )
}

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(
    geom = "bar",
    fun.data = calc_mean_and_aes
  )

计算更改条图外框线型

calc_mean_and_aes <- function(x, threshold = 5.5) {
  tibble(y = mean(x)) %>% 
    mutate(fill = ifelse(y < threshold,  "#FC4E07", "#00AFBB"),
           width = sd(x) / nrow(iris) * 100,
           alpha = ifelse(y < threshold, 0.8, 0.3),
           color = "black",
           linetype = ifelse(y < threshold, 1, 23)
          )
}

iris %>% 
  ggplot(aes(Species, Sepal.Length))+
  stat_summary(
    geom = "bar",
    fun.data = calc_mean_and_aes
  )

此文受June Choe先生的文章^{[https://yjunechoe.github.io/posts/2020-09-26-demystifying-stat-layers-ggplot2/]和王敏杰老师的在线书籍}[https://bookdown.org/wangminjie/R4DS/tidyverse-ggplot2-stat-layer.html]启发

ggpubr还有一系列类似的汇总统计函数mean_sd, median_hilow_等↩︎