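Bivariate graphs display the relationship between two variables. The type of graph depends on the measurement level of each variable (discrete or continuous).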

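4.1 Two discrete variables

When plotting the relationship between two discrete variables, stacked, grouped, or segmented bar charts are typically used. A less common approach is the mosaic plot.

4.1.1 Stacked bar chart

Let's plot the relationship between automobile class and drive type (front-wheel, rear-wheel, or 4-wheel drive) for the cars in the Fuel economy (mpg) data set.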

pacman::p_load(tidyverse,DT,DataExplorer)
# simple bar chart of class counts
ggplot(mpg, 
       aes(x = class)) + 
  geom_bar(position = "stack")    # y defaults to the count in each class

# stacked bar chart
ggplot(mpg, 
       aes(x = class, 
           fill = drv)) +        # fill adds a second discrete variable
  geom_bar(position = "stack")

# stacked bar chart with classes ordered by frequency
ggplot(mpg, 
       aes(x = class %>% as_factor() %>% fct_infreq(), 
           fill = drv)) +        # fill adds a second discrete variable
  geom_bar(position = "stack") +
  labs(x = "class")

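From the chart, we can see that the most common vehicle is the SUV.

4.1.2 Grouped bar chart

Grouped bar charts place the bars for the second categorical variable side by side. To create a grouped bar chart, use the position = "dodge" option.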

# grouped bar plot
ggplot(mpg, 
       aes(x = class, 
           fill = drv)) + 
  geom_bar(position = "dodge")

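Notice that all minivans are front-wheel drive. By default, zero-count bars are dropped and the remaining bars are made wider. This may not be the behavior you want. You can modify this using the position = position_dodge(preserve = "single") option.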

# grouped bar plot preserving zero count bars
ggplot(mpg, 
       aes(x = class, 
           fill = drv)) + 
  geom_bar(position = position_dodge(preserve = "single"))   # side by side, constant bar width
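To see which bars would be widened under the default dodge, it can help to tabulate the class-by-drive combinations directly. The sketch below uses base R's table() on the mpg data; the object names counts and present are illustrative and not part of the original example.

```r
library(ggplot2)   # provides the mpg data set

# cross-tabulate class by drive type; a zero marks a combination with no cars
counts <- table(class = ggplot2::mpg$class, drv = ggplot2::mpg$drv)
counts

# number of drive types actually observed within each class
present <- rowSums(counts > 0)
present
```

Classes with fewer than three observed drive types (such as minivan, which has only front-wheel drive) are the ones whose bars widen unless preserve = "single" is used.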

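4.1.3 Proportional (segmented) bar chart

A segmented bar chart is a stacked bar chart in which each bar represents 100%. You can create a segmented bar chart using the position = "fill" option.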

# bar plot, with each bar representing 100%
ggplot(mpg, 
       aes(x = class, 
           fill = drv)) + 
  geom_bar(position = "fill") +
  labs(y = "Proportion")

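This type of plot is particularly useful if the goal is to compare the percentage of a category in one variable across each level of another variable. For example, the proportion of front-wheel drive cars goes up as you move from compact, to midsize, to minivan.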
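The same proportions can be computed numerically as a cross-check on the graph. This is a minimal sketch using base R's prop.table(); the object names props and front are illustrative choices.

```r
library(ggplot2)   # provides the mpg data set

# row proportions: share of each drive type within each class
props <- prop.table(table(ggplot2::mpg$class, ggplot2::mpg$drv), margin = 1)

# proportion of front-wheel drive cars for three classes
front <- props[c("compact", "midsize", "minivan"), "f"]
round(front, 3)
```

The values increase from compact to midsize to minivan, matching the pattern in the segmented bars.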

4.1.4 Improving the color and labeling

You can use additional options to improve color and labeling. In the graph below,

  • factor: modifies the order of the categories for the class variable and both the order and the labels for the drive variable
  • scale_y_continuous: modifies the y-axis tick mark labels
  • labs: provides a title and changes the labels for the x and y axes and the legend
  • scale_fill_brewer: changes the fill color scheme
  • theme_minimal: removes the grey background and changes the grid color

# bar plot, with each bar representing 100%, reordered bars, and better labels and colors
mpg$class %>% fct_count()
## # A tibble: 7 x 2
##   f              n
##   <fct>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62
library(scales)
ggplot(mpg, 
       aes(x = factor(class,
                      levels = c("2seater", "subcompact", 
                                "compact", "midsize", 
                                "minivan", "suv", "pickup")),
           fill = factor(drv, 
                         levels = c("f", "r", "4"),             # order of the levels
                         labels = c("front-wheel",              # display labels
                                    "rear-wheel", 
                                    "4-wheel")))) + 
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, .2), 
                     label = scales::percent) +                  # percent tick labels
  scale_fill_brewer(palette = "Set2") +                          # fill color palette
  labs(y = "Percent", 
       fill = "Drive Train",
       x = "Class",
       title = "Automobile Drive by Class") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

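In the graph above, the factor function was used to reorder and/or rename the levels of a discrete variable. You can also apply this to the original data set, making these changes permanent. They would then apply to all future graphs that use that data set. For example: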

# change the order the levels for the categorical variable "class"
mpg$class = factor(mpg$class,
                   levels = c("2seater", "subcompact", 
                              "compact", "midsize", 
                              "minivan", "suv", "pickup"))

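4.1.5 A complete bar chart

I placed the factor function within the ggplot function to demonstrate that, if desired, you can change the order of the categories and the labels for a single graph only. The other functions are discussed more fully in the chapter on customizing graphs. Next, let's add percent labels to each segment. First, we'll create a summary data set that has the necessary labels.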

plotdata <- mpg %>%
  group_by(class, drv) %>%
  summarize(n = n()) %>% 
  mutate(pct = n/sum(n),
         lbl = scales::percent(pct))      # format as a percentage label
plotdata
## # A tibble: 12 x 5
## # Groups:   class [7]
##    class      drv       n    pct lbl  
##    <fct>      <chr> <int>  <dbl> <chr>
##  1 2seater    r         5 1      100% 
##  2 subcompact 4         4 0.114  11%  
##  3 subcompact f        22 0.629  63%  
##  4 subcompact r         9 0.257  26%  
##  5 compact    4        12 0.255  26%  
##  6 compact    f        35 0.745  74%  
##  7 midsize    4         3 0.0732 7%   
##  8 midsize    f        38 0.927  93%  
##  9 minivan    f        11 1      100% 
## 10 suv        4        51 0.823  82%  
## 11 suv        r        11 0.177  18%  
## 12 pickup     4        33 1      100%

Next, we'll use this data set and the geom_text function to add a label to each bar segment.

ggplot(plotdata, 
       aes(x = factor(class,
                      levels = c("2seater", "subcompact", 
                                 "compact", "midsize", 
                                 "minivan", "suv", "pickup")),     # class as an ordered factor
           y = pct,                                                # proportion within class
           fill = factor(drv,                                      # drive type sets the fill
                         levels = c("f", "r", "4"),
                         labels = c("front-wheel", 
                                    "rear-wheel", 
                                    "4-wheel")))) + 
  geom_bar(stat = "identity",                                     # use the supplied y values
           position = "fill") +                                   # each bar sums to 100%
  scale_y_continuous(breaks = seq(0, 1, .2), 
                     label = scales::percent) +
  geom_text(aes(label = lbl),                                     # add the text labels from lbl
            size = 3, 
            position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette = "Set2") +
  labs(y = "Percent", 
       fill = "Drive Train",
       x = "Class",
       title = "Automobile Drive by Class") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

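4.1.6 Other plots

Mosaic plots provide an alternative to stacked bar charts for displaying the relationship between categorical variables. They can also provide more sophisticated statistical information.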
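As a rough illustration, base R can draw a mosaic plot from a contingency table with mosaicplot(). This sketch is not from the original text; the use of base graphics (rather than a dedicated package) and the object name tab are assumptions made for brevity.

```r
library(ggplot2)   # provides the mpg data set

# contingency table of class by drive type
tab <- table(class = ggplot2::mpg$class, drv = ggplot2::mpg$drv)

# tile area is proportional to the cell count;
# shade = TRUE colors tiles by residuals from an independence model
mosaicplot(tab,
           main = "Automobile Drive by Class",
           shade = TRUE)
```

The shading is one example of the extra statistical information a mosaic plot can carry: strongly colored tiles mark combinations that occur more or less often than independence would predict.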

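4.2 Two continuous variables

The relationship between two continuous variables is typically displayed using scatterplots and line graphs.

4.2.1 Scatterplot

The simplest display of two continuous variables is a scatterplot, with each variable represented on an axis. For example, using the Salaries data set, we can plot experience (yrs.since.phd) against academic salary (salary) for college professors.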

library(ggplot2)
pacman::p_load(carData)
data(Salaries, package="carData")

# simple scatterplot
ggplot(Salaries, 
       aes(x = yrs.since.phd, 
           y = salary)) +
  geom_point()

The geom_point function has the following important options:

  • color - point color
  • size - point size
  • shape - point shape
  • alpha - point transparency. Transparency ranges from 0 (transparent) to 1 (opaque), and is a useful parameter when points overlap.

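The functions scale_x_continuous and scale_y_continuous control the scaling of the x and y axes, respectively.

We can use these options and functions to create a more attractive scatterplot.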

# enhanced scatter plot
ggplot(Salaries, 
       aes(x = yrs.since.phd, 
           y = salary)) +
  geom_point(color="cornflowerblue", 
             size = 2, 
             alpha=.8) +
  scale_y_continuous(label = scales::dollar,                # label formats the tick labels; limits is handy too
                     limits = c(50000, 250000)) +
  scale_x_continuous(breaks = seq(0, 60, 10), 
                     limits=c(0, 60)) + 
  labs(x = "Years Since PhD",
       y = "",
       title = "Experience vs. Salary",
       subtitle = "9-month salary for 2008-2009")

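4.2.1.1 Adding a line of best fit

It is often useful to summarize the relationship displayed in a scatterplot with a line of best fit. Many types of lines are supported, including linear, polynomial, and nonparametric (loess). By default, 95% confidence bands for these lines are displayed.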

# scatterplot with linear fit line
ggplot(Salaries,
       aes(x = yrs.since.phd, 
           y = salary)) +
  geom_point(color= "steelblue") +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Salaries %>% 
  ggplot(aes(x = yrs.since.phd, y = salary)) +
  geom_point(col = "steelblue") +
  geom_smooth(method = "lm") +
  scale_y_continuous(labels = scales::dollar,
                     limits = c(50000,250000)) +           # breaks, labels, and limits are the three key scale arguments
  scale_x_continuous(breaks = seq(0,60,10),limits = c(0,60),expand = c(0,0)) +
  labs(x = "Years Since PhD",
       y = "",
       title = "Experience vs. Salary",
       subtitle = "9-month salary for 2008-2009") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'

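Clearly, salary increases with experience. However, there seems to be a dip at the right end: professors with a great deal of experience earning lower salaries. A straight line does not capture this nonlinear effect. A curved line fits better here.

A quadratic (one bend) or cubic (two bends) line is typically used. It is rarely necessary to use a higher-order polynomial. Applying a quadratic fit to the salary data set produces the following result.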

# scatterplot with quadratic line of best fit
ggplot(Salaries, 
       aes(x = yrs.since.phd,                     # key pieces: x axis, y axis, color, method, formula
           y = salary)) +
  geom_point(color= "steelblue") +
  geom_smooth(method = "lm", 
              formula = y ~ poly(x, 2), 
              color = "indianred3")
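One way to check whether the curve is worth the extra term is to fit both models with lm() and compare them. The following is a sketch, assuming the carData package is installed; fit_linear and fit_quad are illustrative names, and this comparison is not part of the original text.

```r
data(Salaries, package = "carData")

# straight-line and quadratic fits of salary on experience
fit_linear <- lm(salary ~ yrs.since.phd, data = Salaries)
fit_quad   <- lm(salary ~ poly(yrs.since.phd, 2), data = Salaries)

# adjusted R-squared for each model
c(linear    = summary(fit_linear)$adj.r.squared,
  quadratic = summary(fit_quad)$adj.r.squared)

# F test of the added quadratic term
anova(fit_linear, fit_quad)
```

The quadratic formula passed to lm() here is the same y ~ poly(x, 2) form that geom_smooth uses in the plot.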

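Finally, a smoothed nonparametric fit line can often provide a good picture of the relationship. The default in ggplot2 (for data sets with fewer than 1,000 observations) is a loess line, which stands for locally weighted scatterplot smoothing.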

# scatterplot with loess smoothed line
ggplot(Salaries, 
       aes(x = yrs.since.phd, 
           y = salary)) +
  geom_point(color= "steelblue") +                   # note: a constant color goes outside aes(); a mapped one goes inside
  geom_smooth(color = "tomato")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can suppress the confidence bands by including the option se = FALSE.
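For instance, the loess plot could be redrawn without the band as follows. This is a small sketch assuming carData is installed; method and formula are spelled out only to avoid the informational message, and the object name p is illustrative.

```r
library(ggplot2)
data(Salaries, package = "carData")

# loess line without the shaded confidence band
p <- ggplot(Salaries,
            aes(x = yrs.since.phd,
                y = salary)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "loess",
              formula = y ~ x,
              se = FALSE,
              color = "tomato")
p
```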

4.2.1.2 A complete scatterplot

Here is a complete (and more attractive) plot.

# scatterplot with loess smoothed line 
# and better labeling and color
ggplot(Salaries, 
       aes(x = yrs.since.phd, 
           y = salary)) +
  geom_point(color="cornflowerblue", 
             size = 2, 
             alpha = .6) +
  geom_smooth(size = 1.5,
              color = "darkgrey") +
  scale_y_continuous(label = scales::dollar, 
                     limits = c(50000, 250000)) +        # breaks are optional here
  scale_x_continuous(breaks = seq(0, 60, 10), 
                     limits = c(0, 60)) +                # explicit breaks are recommended here
  labs(x = "Years Since PhD",
       y = "",
       title = "Experience vs. Salary",
       subtitle = "9-month salary for 2008-2009") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

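4.2.2 Line plot

When one of the two variables represents time, a line plot can be an effective way of displaying the relationship. For example, the code below displays the relationship between time (year) and life expectancy (lifeExp) in the United States between 1952 and 2007. The data come from the gapminder data set.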

pacman::p_load(gapminder)
data(gapminder, package="gapminder")
gapminder %>% DT::datatable()
# Select US cases
library(dplyr)
plotdata <- filter(gapminder, 
                   country == "United States")

plotdata
## # A tibble: 12 x 6
##    country       continent  year lifeExp       pop gdpPercap
##    <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
##  1 United States Americas   1952    68.4 157553000    13990.
##  2 United States Americas   1957    69.5 171984000    14847.
##  3 United States Americas   1962    70.2 186538000    16173.
##  4 United States Americas   1967    70.8 198712000    19530.
##  5 United States Americas   1972    71.3 209896000    21806.
##  6 United States Americas   1977    73.4 220239000    24073.
##  7 United States Americas   1982    74.6 232187835    25010.
##  8 United States Americas   1987    75.0 242803533    29884.
##  9 United States Americas   1992    76.1 256894189    32004.
## 10 United States Americas   1997    76.8 272911760    35767.
## 11 United States Americas   2002    77.3 287675526    39097.
## 12 United States Americas   2007    78.2 301139947    42952.
# simple line plot
ggplot(plotdata, 
       aes(x = year, 
           y = lifeExp)) +
  geom_line() 

It is hard to read individual values in the graph above. In the next plot, we'll add points as well.

# line plot with points and improved labeling
ggplot(plotdata, 
       aes(x = year, 
           y = lifeExp)) +
  geom_line(size = 1.5, 
            color = "lightgrey") +
  geom_point(size = 3, 
             color = "steelblue") +
  labs(y = "Life Expectancy (years)", 
       x = "Year",
       title = "Life expectancy changes over time",
       subtitle = "United States (1952-2007)",
       caption = "Source: http://www.gapminder.org/data/") +
  scale_x_continuous(breaks = seq(1950,2010,5)) +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90,vjust = 0.5))

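Time-dependent data is covered in more detail in the section on time series plots.

4.3 Discrete-continuous plots

When plotting the relationship between a discrete variable and a continuous variable, a large number of graph types are available. These include bar charts using summary statistics, grouped kernel density plots, side-by-side box plots, side-by-side violin plots, mean/SEM plots, ridgeline plots, and Cleveland plots.

4.3.1 Bar chart (on summary statistics)

In previous sections, bar charts were used to display the number of cases by category for a single variable or for two variables. You can also use bar charts to display other summary statistics (for example, means or medians) on a quantitative variable for each level of a categorical variable.

For example, the following graph displays the mean salary for a sample of university professors by their academic rank.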

data(Salaries, package="carData")

# calculate mean salary for each rank
library(dplyr)
plotdata <- Salaries %>%
  group_by(rank) %>%
  summarize(mean_salary = mean(salary))
plotdata
## # A tibble: 3 x 2
##   rank      mean_salary
##   <fct>           <dbl>
## 1 AsstProf       80776.
## 2 AssocProf      93876.
## 3 Prof          126772.
# plot mean salaries
ggplot(plotdata, 
       aes(x = rank, 
           y = mean_salary)) +
  geom_bar(stat = "identity")           # use the supplied y values instead of counts

We can make it more attractive with some options.

# plot mean salaries in a more attractive fashion
library(scales)
ggplot(plotdata, 
       aes(x = factor(rank,
                      labels = c("Assistant\nProfessor",
                                 "Associate\nProfessor",
                                 "Full\nProfessor")), 
                      y = mean_salary)) +
  geom_bar(stat = "identity", 
           fill = "cornflowerblue")  +
   labs(title = "Mean Salary by Rank", 
       subtitle = "9-month academic salary for 2008-2009",
       x = "",
       y = "")

# plot mean salaries with dollar labels on the bars and the y axis
library(scales)
ggplot(plotdata, 
       aes(x = factor(rank,
                      labels = c("Assistant\nProfessor",
                                 "Associate\nProfessor",
                                 "Full\nProfessor")), 
           y = mean_salary)) +                              # still a bar chart underneath
  geom_bar(stat = "identity", 
           fill = "cornflowerblue") +
  geom_text(aes(label = dollar(mean_salary)), 
            vjust = -0.25) +
  scale_y_continuous(breaks = seq(0, 130000, 20000), 
                     label = dollar) +
  labs(title = "Mean Salary by Rank", 
       subtitle = "9-month academic salary for 2008-2009",
       x = "",
       y = "")

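One limitation of such plots is that they do not display the distribution of the data, only the summary statistic for each group. The plots below correct this limitation to some extent.

4.3.2 Grouped kernel density plots

You can compare groups on a numeric variable by superimposing kernel density plots in a single graph.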

# plot the overall distribution of salaries using a kernel density plot
Salaries %>% 
  ggplot(aes(x = salary)) +
  geom_density(alpha = 0.4)

# plot the distribution of salaries by rank using kernel density plots
ggplot(Salaries, 
       aes(x = salary, 
           fill = rank)) +
  geom_density(alpha = 0.4) +
  labs(title = "Salary distribution by rank")

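The alpha option makes the density plots partially transparent, so that we can see what is happening in the overlapping regions. Alpha values range from 0 (transparent) to 1 (opaque). The graph makes clear that, in general, salary goes up with rank. However, the salary range for full professors is very wide.

4.3.3 Box plots

A boxplot displays the 25th percentile, median, and 75th percentile of a distribution. The whiskers (vertical lines) capture roughly 99% of a normal distribution, and observations outside this range are plotted as points representing outliers. Side-by-side boxplots are very useful for comparing groups.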

# plot the distribution of salaries by rank using boxplots
ggplot(Salaries, 
       aes(x = rank, 
           y = salary)) +
  geom_boxplot() +
  labs(title = "Salary distribution by rank")
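The "roughly 99%" figure can be verified with a little arithmetic. geom_boxplot() draws each whisker to the most extreme observation within 1.5 times the interquartile range (IQR) of the box; the sketch below applies that rule to a standard normal distribution. The variable names are illustrative.

```r
# quartiles and IQR of a standard normal distribution
q1  <- qnorm(0.25)
q3  <- qnorm(0.75)
iqr <- q3 - q1

# whisker limits under the 1.5 * IQR rule
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# probability mass between the limits
coverage <- pnorm(upper) - pnorm(lower)
round(coverage, 3)   # 0.993
```

For normally distributed data, then, fewer than 1 in 100 observations would be expected to be flagged as outliers.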

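Notched boxplots provide an approximate method for visualizing whether groups differ. Although not a formal test, if the notches of two boxplots do not overlap, there is strong evidence (95% confidence) that the medians of the two groups differ.

# plot the distribution of salaries by rank using notched boxplots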
ggplot(Salaries, aes(x = rank, 
                     y = salary)) +
  geom_boxplot(notch = TRUE, 
               fill = "cornflowerblue", 
               alpha = .7) +
  labs(title = "Salary distribution by rank")

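In the example above, all three groups appear to differ. One of the advantages of boxplots is that their widths are usually not meaningful. This allows you to compare the distributions of many groups in a single graph.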
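The notch comparison described above can also be checked by hand. According to the ggplot2 documentation, notches extend 1.58 * IQR / sqrt(n) on either side of the median; the sketch below applies that formula to each rank, assuming carData is installed. The helper notch_limits is a hypothetical name introduced for illustration.

```r
data(Salaries, package = "carData")

# lower and upper notch limits for one group of salaries
notch_limits <- function(s) {
  half <- 1.58 * IQR(s) / sqrt(length(s))
  c(lower  = median(s) - half,
    median = median(s),
    upper  = median(s) + half)
}

# one row per rank
notches <- t(sapply(split(Salaries$salary, Salaries$rank), notch_limits))
notches
```

If the upper limit of one rank lies below the lower limit of the next, the two notches do not overlap.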

4.3.4 Violin plots

Violin plots are similar to kernel density plots, but are mirrored and rotated 90 degrees.

# plot the distribution of salaries 
# by rank using violin plots
ggplot(Salaries, 
       aes(x = rank,
           y = salary)) +
  geom_violin() +
  labs(title = "Salary distribution by rank")

A useful variation is to superimpose boxplots on violin plots.

Salaries %>% 
  ggplot(aes(x = rank, y = salary)) +
  geom_violin(aes(fill = rank)) +
  geom_boxplot(aes(fill = rank),
               alpha = 0.5,
               width = 0.2,
               outlier.color = "orange",
               outlier.size = 2) +
  labs(title = "Salary distribution by rank") +
  theme(legend.position = "none")

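4.3.5 Ridgeline plots

A ridgeline plot (also called a joyplot) displays the distribution of a quantitative variable for several groups. They are similar to kernel density plots with vertical faceting, but take up less room. Ridgeline plots are created with the ggridges package.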

# create ridgeline graph
library(ggplot2)
library(ggridges)

ggplot(mpg, 
       aes(x = cty, 
           y = class, 
           fill = class)) +
  geom_density_ridges(alpha = 0.5) + 
  labs(title = "City mileage by auto class") +
  theme(legend.position = "none")
## Picking joint bandwidth of 0.929

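I've suppressed the legend here because it is redundant (the distributions are already labeled on the y-axis). Unsurprisingly, pickup trucks have the poorest mileage, while subcompacts and compact cars tend to achieve the highest. However, these smaller cars span a very wide range of gas mileage scores.

Note that the distributions are allowed to overlap in order to produce a more compact graph. If the overlap is severe, you can add transparency using geom_density_ridges(alpha = n), where n ranges from 0 (transparent) to 1 (opaque).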
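Besides transparency, the amount of overlap itself can be adjusted. The sketch below is an addition to the original text: it uses the scale argument of geom_density_ridges(), where values below 1 keep each ridgeline within its own band. The value 0.9 and the object name p are arbitrary choices for illustration, and the ggridges package is assumed to be installed.

```r
library(ggplot2)
library(ggridges)

# scale < 1 keeps each ridgeline below the baseline of the group above it
p <- ggplot(ggplot2::mpg,
            aes(x = cty,
                y = class,
                fill = class)) +
  geom_density_ridges(alpha = 0.5,
                      scale = 0.9) +
  labs(title = "City mileage by auto class") +
  theme(legend.position = "none")
p
```

Larger values of scale produce more overlap and a more compact graph, at the cost of some readability.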

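4.3.6 Mean/SEM plots

A popular method for comparing groups on a numeric variable is the mean plot with error bars. Error bars can represent standard deviations, standard errors of the mean, or confidence intervals. In this section, we'll plot means and standard errors.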

# calculate means,standard deviations,standard errors, and 95% confidence intervals by rank
library(dplyr)
plotdata <- Salaries %>%
  group_by(rank) %>%
  summarize(n = n(),
         mean = mean(salary),
         sd = sd(salary),
         se = sd / sqrt(n),
         ci = qt(0.975, df = n - 1) * sd / sqrt(n))

plotdata
## # A tibble: 3 x 6
##   rank          n    mean     sd    se    ci
##   <fct>     <int>   <dbl>  <dbl> <dbl> <dbl>
## 1 AsstProf     67  80776.  8174.  999. 1994.
## 2 AssocProf    64  93876. 13832. 1729. 3455.
## 3 Prof        266 126772. 27719. 1700. 3346.
# plot the means and standard errors
ggplot(plotdata, 
       aes(x = rank, 
           y = mean, 
           group = 1)) +
  geom_point(size = 3) +
  geom_line() 

ggplot(plotdata, 
       aes(x = rank, 
           y = mean, 
           group = 1)) +
  geom_point(size = 3) +
  geom_line() +
  geom_errorbar(aes(ymin = mean - se, 
                    ymax = mean + se), 
                width = .1)

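Although we plotted error bars representing the standard error, we could have plotted standard deviations or 95% confidence intervals. Simply replace se with sd or ci in the aes option. We can use the same technique to compare salaries across rank and sex. (Technically, this is not bivariate since we are plotting rank, sex, and salary, but it seems to fit here.)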
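Before turning to the rank-by-sex comparison, here is a sketch of the 95% confidence interval variant mentioned above. It recomputes the summary by rank under a separate, illustrative name (ci_data) so that later code is unaffected, and assumes the carData package is installed.

```r
library(ggplot2)
library(dplyr)
data(Salaries, package = "carData")

# mean salary and half-width of the 95% confidence interval by rank
ci_data <- Salaries %>%
  group_by(rank) %>%
  summarize(n = n(),
            mean = mean(salary),
            sd = sd(salary),
            ci = qt(0.975, df = n - 1) * sd / sqrt(n))

# plot the means with 95% confidence intervals
p <- ggplot(ci_data,
            aes(x = rank,
                y = mean,
                group = 1)) +
  geom_point(size = 3) +
  geom_line() +
  geom_errorbar(aes(ymin = mean - ci,
                    ymax = mean + ci),
                width = .1)
p
```

The only change from the standard error version is the column used for ymin and ymax.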

# calculate means and standard errors by rank and sex
plotdata <- Salaries %>%
  group_by(rank, sex) %>%
  summarize(n = n(),
            mean = mean(salary),
            sd = sd(salary),
            se = sd/sqrt(n))
plotdata
## # A tibble: 6 x 6
## # Groups:   rank [3]
##   rank      sex        n    mean     sd    se
##   <fct>     <fct>  <int>   <dbl>  <dbl> <dbl>
## 1 AsstProf  Female    11  78050.  9372. 2826.
## 2 AsstProf  Male      56  81311.  7901. 1056.
## 3 AssocProf Female    10  88513. 17965. 5681.
## 4 AssocProf Male      54  94870. 12891. 1754.
## 5 Prof      Female    18 121968. 19620. 4624.
## 6 Prof      Male     248 127121. 28214. 1792.
# plot the means and standard errors by sex
ggplot(plotdata, aes(x = rank,
                     y = mean, 
                     group=sex, 
                     color=sex)) +
  geom_point(size = 3) +
  geom_line(size = 1) +
  geom_errorbar(aes(ymin = mean - se, 
                    ymax = mean + se), 
                width = .1)

# plot the means and standard errors by sex (dodged)
pd <- position_dodge(0.2)
ggplot(plotdata, 
       aes(x = rank, 
           y = mean, 
           group=sex, 
           color=sex)) +
  geom_point(position = pd, 
             size = 3) +
  geom_line(position = pd,
            size = 1) +
  geom_errorbar(aes(ymin = mean - se, 
                    ymax = mean + se), 
                width = .1, 
                position= pd)

Finally, let's add some options to make the graph more attractive.

# improved means/standard error plot
pd <- position_dodge(0.2)
ggplot(plotdata, 
       aes(x = factor(rank, 
                      labels = c("Assistant\nProfessor",
                                 "Associate\nProfessor",
                                 "Full\nProfessor")), 
           y = mean, 
           group=sex, 
           color=sex)) +
  geom_point(position=pd, 
             size = 3) +
  geom_line(position = pd, 
            size = 1) +
  geom_errorbar(aes(ymin = mean - se, 
                    ymax = mean + se), 
                width = .1, 
                position = pd, 
                size = 1) +
  scale_y_continuous(label = scales::dollar) +
  scale_color_brewer(palette="Set1") +
  labs(title = "Mean salary by rank and sex",
       subtitle = "(mean +/- standard error)",
       x = "", 
       y = "",
       color = "Gender")

4.3.7 Strip plots

The relationship between a grouping variable and a numeric variable can be displayed with a scatterplot.

# plot the distribution of salaries by rank using strip plots
ggplot(Salaries, 
       aes(y = rank, 
           x = salary)) +
  geom_point() + 
  labs(title = "Salary distribution by rank")

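These one-dimensional scatterplots are called strip plots. Unfortunately, overplotting of the points makes interpretation difficult. The relationship is easier to see if the points are jittered. Basically, a small random number is added to each y-coordinate.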

# plot the distribution of salaries by rank using jittering
ggplot(Salaries, 
       aes(y = rank, 
           x = salary)) +
  geom_jitter() + 
  labs(title = "Salary distribution by rank")
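The amount and direction of jitter can also be set explicitly. The sketch below uses position_jitter() to jitter only vertically and fixes the random seed so the plot is reproducible; the specific values (height = 0.2, seed = 123, alpha = 0.6) are arbitrary choices for illustration, and carData is assumed to be installed.

```r
library(ggplot2)
data(Salaries, package = "carData")

# jitter only along the y axis (rank); salaries on the x axis are left unchanged
p <- ggplot(Salaries,
            aes(y = rank,
                x = salary)) +
  geom_point(position = position_jitter(width = 0,
                                        height = 0.2,
                                        seed = 123),
             alpha = 0.6) +
  labs(title = "Salary distribution by rank")
p
```

Setting width = 0 guarantees that the plotted salaries are the observed values, since the random offset is applied only to the categorical axis.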

It is easier to compare groups if we use color.

# plot the distribution of salaries 
# by rank using jittering
library(scales)
ggplot(Salaries, 
       aes(y = factor(rank,
                      labels = c("Assistant\nProfessor",
                                 "Associate\nProfessor",
                                 "Full\nProfessor")), 
           x = salary, 
           color = rank)) +
  geom_jitter(alpha = 0.7,
              size = 1.5) + 
  scale_x_continuous(label = dollar) +
  labs(title = "Academic Salary by Rank", 
       subtitle = "9-month salary for 2008-2009",
       x = "",
       y = "") +
  theme(legend.position = "none")

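The option legend.position = "none" is used to suppress the legend (which is not needed here). Jittered plots work well when the number of points is not overly large.

4.3.7.1 Combining jitter and boxplots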

# plot the distribution of salaries by rank using boxplots with jittered points
library(scales)
ggplot(Salaries, 
       aes(x = factor(rank,
                      labels = c("Assistant\nProfessor",
                                 "Associate\nProfessor",
                                 "Full\nProfessor")), 
           y = salary, 
           color = rank)) +
  geom_boxplot(size=1,
               outlier.shape = 1,
               outlier.color = "black",
               outlier.size  = 3) +
  geom_jitter(alpha = 0.5, 
              width=.2) + 
  scale_y_continuous(label = dollar) +
  labs(title = "Academic Salary by Rank", 
       subtitle = "9-month salary for 2008-2009",
       x = "",
       y = "") +
  theme_minimal() +
  theme(legend.position = "none") +
  coord_flip()

Several options were added to create this plot.

For the boxplot:

  • size = 1 makes the lines thicker
  • outlier.color = "black" makes outliers black
  • outlier.shape = 1 specifies circles for outliers
  • outlier.size = 3 increases the size of the outlier symbol

For the jitter:

  • alpha = 0.5 makes the points more transparent
  • width = .2 decreases the amount of jitter (.4 is the default)

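4.3.8 Beeswarm plots

Beeswarm plots (also called violin scatter plots) are similar to jittered scatterplots, in that they display the distribution of a quantitative variable by plotting points in a way that reduces overlap. In addition, they help display the density of the data at each point (in a manner similar to a violin plot). Continuing the previous example:

# plot the distribution of salaries by rank using beeswarm-style plots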
pacman::p_load(ggbeeswarm)
library(scales)
ggplot(Salaries, 
       aes(x = factor(rank,
                      labels = c("Assistant\nProfessor",
                                 "Associate\nProfessor",
                                 "Full\nProfessor")),           # x is a factor; color maps the same variable
           y = salary, 
           color = rank)) +
  geom_quasirandom(alpha = 0.7,
                   size = 1.5) + 
  scale_y_continuous(label = dollar) +
  labs(title = "Academic Salary by Rank", 
       subtitle = "9-month salary for 2008-2009",
       x = "",
       y = "") +
  theme(legend.position = "none")

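These plots are created using the geom_quasirandom function. They can be easier to read than simple strip plots.

4.3.9 Cleveland dot plots

Cleveland plots are useful when you want to compare a numeric statistic for a large number of groups. For example, say that you want to compare the 2007 life expectancy for Asian countries using the gapminder data set.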

data(gapminder, package="gapminder")

# subset Asian countries in 2007
library(dplyr)
plotdata <- gapminder %>%
  filter(continent == "Asia" & 
         year == 2007)

# basic Cleveland plot of life expectancy by country
ggplot(plotdata, 
       aes(x= lifeExp, y = country)) +
  geom_point()

# Sorted Cleveland plot
ggplot(plotdata, 
       aes(x=lifeExp, 
           y=reorder(country, lifeExp))) +   # reorder() sorts countries by life expectancy
  geom_point()
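The sorted plot relies on base R's reorder(), which rearranges the levels of a factor according to a second, numeric variable. The toy data below are hypothetical and serve only to show the effect on the level order.

```r
# three hypothetical groups with one value each
group <- factor(c("A", "B", "C"))
value <- c(70, 60, 80)

# levels before and after reordering by value (ascending)
levels(group)                   # "A" "B" "C"
levels(reorder(group, value))   # "B" "A" "C"
```

Because ggplot2 draws discrete axes in level order, reordering the levels is what places the countries from lowest to highest life expectancy.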

# Fancy Cleveland plot
ggplot(plotdata, 
       aes(x=lifeExp, 
           y=reorder(country, lifeExp))) +
  geom_point(color="blue", 
             size = 2) +
  geom_segment(aes(x = 40, 
                   xend = lifeExp, 
                   y = reorder(country, lifeExp), 
                   yend = reorder(country, lifeExp)),
               color = "lightgrey") +
  labs (x = "Life Expectancy (years)",
        y = "",
        title = "Life Expectancy by Country",
        subtitle = "GapMinder data for Asia - 2007") +
  theme_minimal() + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

Japan clearly has the highest life expectancy, while Afghanistan has the lowest. This last plot is also called a lollipop graph (you can see why).