knitr::opts_chunk$set(echo = TRUE,fig.align = "center")

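5.1 Introduction

In this chapter, you will dive into the details of a layer and learn how to control all five of its components: data, aesthetic mappings, geom, stat, and position adjustment.

The goal here is to give you the tools to build sophisticated plots tailored to the problem at hand.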
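To make the five components concrete, the sketch below spells out a single layer with `layer()`, the lower-level function that the `geom_*()` shortcuts wrap. The object name `p_explicit` is a hypothetical name chosen for illustration, and the result should look the same as adding `geom_point()` to a plot of `displ` against `hwy`.

```r
library(ggplot2)

# One layer with all five components given explicitly.
p_explicit <- ggplot() +
  layer(
    data = mpg,                  # 1. data
    mapping = aes(displ, hwy),   # 2. aesthetic mappings
    geom = "point",              # 3. geom
    stat = "identity",           # 4. stat
    position = "identity"        # 5. position adjustment
  )

p_explicit
```

In everyday code the shortcut form is usually preferable; the explicit call is mainly useful for seeing which defaults a `geom_*()` function fills in.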

5.2 Building a plot

library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.2.1     √ purrr   0.3.3
## √ tibble  2.1.3     √ dplyr   0.8.3
## √ tidyr   1.0.0     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# install.packages("animate")
# library(animate)
library(DT)
# Add a geom_point() layer
mpg %>% 
  ggplot2::ggplot(aes(displ,hwy)) +
  geom_point()

mpg %>% 
  ggplot2::ggplot(aes(displ,hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

5.3 Data

mod <- loess(hwy~displ,data = mpg)

grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 50))

grid$hwy <- predict(mod, newdata = grid)

grid %>% datatable()
std_resid <- resid(mod) / mod$s

outlier <- mpg %>% 
  filter(abs(std_resid)>2)

outlier
## # A tibble: 6 x 11
##   manufacturer model  displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr>  <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet    corve~   5.7  1999     8 manu~ r        16    26 p     2sea~
## 2 pontiac      grand~   3.8  2008     6 auto~ f        18    28 r     mids~
## 3 pontiac      grand~   5.3  2008     8 auto~ f        16    25 p     mids~
## 4 volkswagen   jetta    1.9  1999     4 manu~ f        33    44 d     comp~
## 5 volkswagen   new b~   1.9  1999     4 manu~ f        35    44 d     subc~
## 6 volkswagen   new b~   1.9  1999     4 auto~ f        29    41 d     subc~
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_line(data = grid, colour = "blue", size = 1.5) +
  geom_text(data = outlier, aes(label = model))

mpg %>% 
  ggplot(aes(displ,hwy)) +
  geom_point() +
  geom_line(data = grid) +
  geom_text(data = outlier,aes(label = model))

5.3.1 Exercises

  2. The following code uses dplyr to generate some summary statistics about each class of car.

library(dplyr)

class <- mpg %>%
  group_by(class) %>%
  summarise(n = n(), hwy = mean(hwy))

class
## # A tibble: 7 x 3
##   class          n   hwy
##   <chr>      <int> <dbl>
## 1 2seater        5  24.8
## 2 compact       47  28.3
## 3 midsize       41  27.3
## 4 minivan       11  22.4
## 5 pickup        33  16.9
## 6 subcompact    35  28.1
## 7 suv           62  18.1
class %>% 
  ggplot(aes(class,hwy)) +
  geom_point(col = "red",size = 10) +
  geom_jitter(data = mpg,aes(class,hwy),size = 2,width = 0.1,height = 1)

5.4 Aesthetic mappings

5.4.1 Specifying the aesthetics in the plot vs. in the layers

mpg %>% 
  ggplot(aes(displ,hwy,col = drv,shape = drv)) +
  geom_point() +
  geom_smooth(aes(linetype = drv),se = FALSE) +
  theme(legend.position = "none")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Or, equivalently:

ggplot(mpg) +
  geom_point(aes(displ,hwy,col = drv,shape = drv)) +
  geom_smooth(aes(displ,hwy,col= drv,linetype = drv),se = FALSE) +
  theme(legend.position = "none")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Figure: specifying aesthetics in the plot or in the layers

# The approach below is simpler
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme(legend.position = "none")

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +   # colour is mapped for the points only
  geom_smooth(method = "lm", se = FALSE) +
  theme(legend.position = "none")
Figure: specifying aesthetics in the plot or in the layers

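Generally, you want to set up the mappings so that they illuminate the structure underlying the graphic and minimise typing. It may take some time before the best approach becomes obvious, so if you have iterated your way to a complex graphic, it may be worthwhile to rewrite it to make the structure clearer.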

5.4.2 Setting vs. Mapping

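If you want the appearance to be governed by a variable, put the specification inside aes(); if you want to override the default size or colour, put the value outside aes().

The following plots are created with similar code but have rather different outputs. The second plot maps (rather than sets) the colour to the value "darkblue". This effectively creates a new variable containing only the value "darkblue" and then scales it with a colour scale. Because this value is discrete, the default colour scale uses evenly spaced colours on the colour wheel, and since there is only one value, this colour is pinkish.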

mpg %>% 
  ggplot(aes(cty,hwy)) +
  geom_point(color = "darkblue",size = 3)

mpg %>% 
  ggplot(aes(cty,hwy)) +
  geom_point(aes(col = "darkblue"),size = 3)
Figure: where the colour is specified
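One way to see the difference between setting and mapping is to look at the data each layer actually draws. The sketch below uses `layer_data()` from ggplot2 for that; the object names `p_set` and `p_map` are hypothetical and only serve this illustration.

```r
library(ggplot2)

p_set <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(colour = "darkblue")        # set: a fixed value outside aes()

p_map <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(aes(colour = "darkblue"))   # map: a constant "variable" inside aes()

# layer_data() returns the data frame a layer draws, after scaling.
unique(layer_data(p_set)$colour)   # the literal colour that was set
unique(layer_data(p_map)$colour)   # whatever the default discrete scale picked
```

In the mapped version the string "darkblue" is treated as a data value, so the colour that reaches the geom comes from the scale rather than from the string itself.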

ggplot(mpg, aes(cty, hwy)) +
  geom_point(aes(colour = "darkblue"),size = 3) +
  scale_colour_identity()

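A third approach, shown above, is to map the value but override the default scale with scale_colour_identity(). This is most useful if you always have a column that already contains colours.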
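As a minimal sketch of that use case, assume a small data frame whose `colour` column already holds valid R colour names. The data frame `swatches` and its values are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data: the `colour` column already contains colour names.
swatches <- data.frame(
  x = 1:3,
  y = c(2, 1, 3),
  colour = c("red", "forestgreen", "darkblue"),
  stringsAsFactors = FALSE
)

p_identity <- ggplot(swatches, aes(x, y, colour = colour)) +
  geom_point(size = 5) +
  scale_colour_identity()   # use the values as they are instead of a palette

# The colours reaching the geom are the ones stored in the column.
layer_data(p_identity)$colour
```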

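It is sometimes useful to map aesthetics to constants. For example, if you want to display multiple layers with varying parameters, you can "name" each layer: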

ggplot(mpg, aes(displ, hwy)) +
  geom_point()+
  geom_smooth(aes(colour = "loess"), method = "loess", se = FALSE) +
  geom_smooth(aes(colour = "lm"), method = "lm", se = FALSE) +
  labs(colour = "Method")

5.4.3 Exercises

  1. Simplify the following plot specifications:
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
mpg %>% 
  ggplot(aes(displ,hwy)) +
  geom_point() +
  geom_smooth(aes(col = "loess"),method = "loess",se = FALSE) +
  geom_smooth(aes(col = "lm"),method = "lm",se = FALSE) +
  geom_smooth(aes(col = "rlm"),method = "rlm",se = FALSE) +
  labs(colour = "Method")

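  2. What does the following code do? Does it work? Does it make sense? Why/why not?
# This plot is not very useful: the two layers put different variables (class vs. trans, cty vs. hwy) on the same axes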
ggplot(mpg) +
  geom_jitter(aes(class, cty),width = 0.1) +
  geom_boxplot(aes(trans, hwy)) +
  coord_flip()

ggplot(mpg) +
  geom_boxplot(aes(trans,hwy)) +
  coord_flip()

5.5 Geoms

  • Graphical primitives:
    • geom_blank(): display nothing. Most useful for adjusting axes limits using data.
    • geom_point(): points.
    • geom_path(): paths.
    • geom_ribbon(): ribbons, a path with vertical thickness.
    • geom_segment(): a line segment, specified by start and end position.
    • geom_rect(): rectangles.
    • geom_polygon(): filled polygons.
    • geom_text(): text.
  • One variable:
    • Discrete:
      • geom_bar(): display distribution of discrete variable.
    • Continuous:
      • geom_histogram(): bin and count continuous variable, display with bars.
      • geom_density(): smoothed density estimate.
      • geom_dotplot(): stack individual points into a dot plot.
      • geom_freqpoly(): bin and count continuous variable, display with lines.
  • Two variables:
    • Both continuous:
      • geom_point(): scatterplot.
      • geom_quantile(): smoothed quantile regression.
      • geom_rug(): marginal rug plots.
      • geom_smooth(): smoothed line of best fit.
      • geom_text(): text labels.
    • Show distribution:
      • geom_bin2d(): bin into rectangles and count.
      • geom_density2d(): smoothed 2d density estimate.
      • geom_hex(): bin into hexagons and count.
    • At least one discrete:
      • geom_count(): count number of points at distinct locations.
      • geom_jitter(): randomly jitter overlapping points.
    • One continuous, one discrete:
      • geom_bar(stat = "identity"): a bar chart of precomputed summaries.
      • geom_boxplot(): boxplots.
      • geom_violin(): show density of values in each group.
    • One time, one continuous:
      • geom_area(): area plot.
      • geom_line(): line plot.
      • geom_step(): step plot.
    • Display uncertainty:
      • geom_crossbar(): vertical bar with center.
      • geom_errorbar(): error bars.
      • geom_linerange(): vertical line.
      • geom_pointrange(): vertical line with center.
    • Spatial:
      • geom_map(): fast version of geom_polygon() for map data.
  • Three variables:
    • geom_contour(): contours.
    • geom_tile(): tile the plane with rectangles.
    • geom_raster(): fast version of geom_tile() for equal sized tiles.

5.5.1 Exercises

  1. Download and print out the ggplot2 cheatsheet from http://www.rstudio.com/resources/cheatsheets/ so you have a handy visual reference for all the geoms.

  2. Look at the documentation for the graphical primitive geoms. Which aesthetics do they use? How can you summarise them in a compact form?

  3. What’s the best way to master an unfamiliar geom? List three resources to help you get started.

The official website, the built-in help, the ggplot2 cheat…

  4. For each of the plots below, identify the geom used to draw it.
# Violin plot
mpg %>% ggplot() +
  geom_violin(aes(drv,displ,col = drv,fill = drv))

# Bubble chart

mpg %>% ggplot() +
  geom_jitter(aes(hwy,cty,size =cty),
              alpha = 0.1,height = 0.5)

# One continuous, one discrete
mpg %>% 
  ggplot() +
  geom_jitter(aes(cyl,drv),height = 0.1,width = 0.1,size = 8)

  5. For each of the following problems, suggest a useful geom:
    • Display how a variable has changed over time.
      • One time, one continuous: geom_line(), geom_area(), or geom_step().

    • Show the detailed distribution of a single variable.
      • Show distribution: for a single continuous variable, geom_histogram(), geom_freqpoly(), or geom_density().

    • Focus attention on the overall trend in a large dataset.

    • Draw a map.
      • Spatial: geom_map().

    • Label outlying points.

5.6 Stats

  • stat_bin(): geom_bar(), geom_freqpoly(), geom_histogram()
  • stat_bin2d(): geom_bin2d()
  • stat_bindot(): geom_dotplot()
  • stat_binhex(): geom_hex()
  • stat_boxplot(): geom_boxplot()
  • stat_contour(): geom_contour()
  • stat_quantile(): geom_quantile()
  • stat_smooth(): geom_smooth()
  • stat_sum(): geom_count()
mpg %>% 
  ggplot(aes(trans,cty)) +
  geom_point() +
  stat_summary(geom = "point",fun.y = "median",col = "red",size = 10)  # fun.y is named fun from ggplot2 3.3.0 onwards

# An alternative way

mpg %>% 
  ggplot(aes(trans,cty)) +
  geom_point() +
  geom_point(stat = "summary",fun.y = "median",col = "red",size = 10)
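Figure: statistical transformations

I think it is best to use the first form, because it makes it clearer that you are displaying a summary and not the raw data.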

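5.6.1 Generated variables

Internally, a stat takes a data frame as input and returns a data frame as output, so a stat can add new variables to the original dataset. Aesthetics can be mapped to these new variables. For example, stat_bin, the stat used to make histograms, produces the following variables:

  • count, the number of observations in each bin

  • density, the density of observations in each bin (percentage of total / bar width)

  • x, the centre of the bin

These generated variables can be used instead of the variables present in the original dataset. For example, the default histogram geom assigns the height of the bars to the number of observations (count), but if you prefer a more traditional histogram, you can use the density (density). To refer to a generated variable such as density, surround its name with "..". This prevents confusion if the original dataset contains a variable with the same name as a generated variable, and it makes it clear to any later reader of the code that the variable was generated by a stat. Each stat lists the variables that it creates in its documentation.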
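To look at the generated variables directly, one option is `layer_data()`, which returns the data frame computed for a layer. The sketch below is an illustration only; the names `p_hist` and `bins` are hypothetical. Note also that ggplot2 3.3.0 and later offer `after_stat(density)` as a replacement for `..density..`, while the code in these notes keeps the `..` form to match the ggplot2 3.2.1 session shown earlier.

```r
library(ggplot2)

p_hist <- ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 500)

# The computed data frame for layer 1, one row per bin.
bins <- layer_data(p_hist)
head(bins[, c("x", "count", "density")])

# density is count rescaled so that the bar areas (density * bin width) sum to 1.
sum(bins$density * 500)
```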

diamonds$price %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823
ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 500)

ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 5000)
Figure: statistical transformations

ggplot(diamonds, aes(price)) +
  geom_histogram(aes(y = ..density..), binwidth = 500)

ggplot(diamonds, aes(price)) +
  geom_histogram(aes(y = ..count..), binwidth = 500)
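Figure: statistical transformations

Mapping y to the density is particularly useful when you want to compare the distributions of multiple groups that have very different sizes.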

diamonds %>% 
  ggplot(aes(price,col = cut)) +
  geom_freqpoly(size = 1.2,binwidth = 500) +
  theme(legend.position = "bottom")

diamonds %>% 
  ggplot(aes(price,col = cut)) +
  geom_freqpoly(aes(y = ..density..),size = 1.2,binwidth = 500) +
  theme(legend.position = "bottom")
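Figure: comparing the distributions of different groups

The result is rather surprising: low-quality diamonds seem to be more expensive on average. Is that really the case?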

sum(is.na(diamonds))
## [1] 0
data <- diamonds %>% 
  group_by(cut) %>% 
  summarise(n = n(),price_mean = mean(price)) %>% 
  arrange(price_mean)

data %>% datatable()

5.6.2 Exercises

  1. The code below creates a similar dataset to stat_smooth(). Use the appropriate geoms to mimic the default geom_smooth() display.
mod <- loess(hwy~displ, data = mpg)

smoothed <- data.frame(displ = seq(1.6, 7, length.out = 50))  # prediction grid

pred <- predict(mod, newdata = smoothed, se = TRUE)     # predictions with standard errors

smoothed$hwy <- pred$fit

smoothed$hwy_lwr <- pred$fit - 1.96 * pred$se.fit

smoothed$hwy_upr <- pred$fit + 1.96 * pred$se.fit

names(smoothed)
## [1] "displ"   "hwy"     "hwy_lwr" "hwy_upr"
mpg %>% 
  ggplot(aes(displ,hwy)) +
  geom_point(col = "red",size = 2) +
  geom_point(data = smoothed,aes(displ,hwy),size = 3) +
  geom_smooth(data = mpg,aes(displ,hwy),col = "blue") +
  geom_errorbar(data = smoothed,mapping = aes(ymin = hwy_lwr,ymax = hwy_upr),col = 6,size = 1.5)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
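A closer imitation of the default geom_smooth() display is probably a ribbon for the interval plus a line for the fit. The sketch below is one possible answer rather than the definitive one: it rebuilds `smoothed` so that it runs on its own, the name `p_mimic` and the fill and alpha values are choices made for illustration, and the 1.96 multiplier only approximates the interval that geom_smooth() draws.

```r
library(ggplot2)

# Rebuild the `smoothed` data frame from the exercise.
mod <- loess(hwy ~ displ, data = mpg)
smoothed <- data.frame(displ = seq(1.6, 7, length.out = 50))
pred <- predict(mod, newdata = smoothed, se = TRUE)
smoothed$hwy <- pred$fit
smoothed$hwy_lwr <- pred$fit - 1.96 * pred$se.fit
smoothed$hwy_upr <- pred$fit + 1.96 * pred$se.fit

p_mimic <- ggplot(smoothed, aes(displ, hwy)) +
  # Grey band for the approximate 95% interval.
  geom_ribbon(aes(ymin = hwy_lwr, ymax = hwy_upr), fill = "grey60", alpha = 0.4) +
  # Blue line for the fitted values.
  geom_line(colour = "blue")

p_mimic
```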

  2. What stats were used to create the following plots?

  3. Read the help for stat_sum() then use geom_count() to create a plot that shows the proportion of cars that have each combination of drv and trans.
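One possible approach to the last exercise, sketched under the assumption that the proportion should be taken over all cars: stat_sum() generates a `prop` variable, and setting `group = 1` makes the proportions sum to 1 across the whole plot instead of within each default group. The name `p_prop` is hypothetical, and the `..prop..` notation matches the ggplot2 version used in these notes.

```r
library(ggplot2)

# group = 1 puts every drv/trans combination in a single group, so prop is
# each combination's share of all cars.
p_prop <- ggplot(mpg, aes(drv, trans)) +
  geom_count(aes(size = ..prop.., group = 1))

p_prop
```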

5.7 Position adjustments

Three adjustments apply primarily to bars:

  • position_stack(): stack overlapping bars (or areas) on top of each other

  • position_fill(): stack overlapping bars, scaling so the top is always at 1

  • position_dodge(): place overlapping bars (or boxplots) side-by-side.

diamonds$color %>% table()
## .
##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808
ggplot(diamonds, aes(color)) +
  xlab(NULL) + ylab(NULL) + theme(legend.position = "none") +
  geom_bar()

dplot <- ggplot(diamonds, aes(color, fill = cut)) +
  xlab(NULL) + ylab(NULL) + theme(legend.position = "none")

dplot + geom_bar()

dplot + geom_bar(position = "dodge")

dplot + geom_bar(position = "fill")
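The effect of the three bar adjustments can also be checked numerically. The sketch below compares the computed bar extents with `layer_data()`; the names `base`, `stacked`, `filled`, and `dodged` are hypothetical, and `position_dodge(width = 0.9)` is just one way to spell the dodge adjustment as an object.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(color, fill = cut))

stacked <- layer_data(base + geom_bar(position = "stack"))
filled  <- layer_data(base + geom_bar(position = "fill"))
dodged  <- layer_data(base + geom_bar(position = position_dodge(width = 0.9)))

max(stacked$ymax)    # height of the tallest stack
max(filled$ymax)     # stacks are rescaled so that every bar tops out at 1
range(dodged$ymin)   # dodged bars all start from 0 and sit side by side
```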

There’s also a position adjustment that does nothing: position_identity(). The identity position adjustment isn’t useful for bars, because each bar obscures the bars behind, but there are many geoms that don’t need adjusting, like lines:

dplot + geom_bar(position = "identity", alpha = 1/3, colour = "grey50")

ggplot(diamonds, aes(color, colour = cut)) +
  geom_line(aes(group = cut), stat = "count") +
  xlab(NULL) + ylab(NULL) +
  theme(legend.position = "none")

There are three position adjustments that are primarily useful for points:

  • position_nudge(): move points by a fixed offset

  • position_jitter(): add a little random noise to every position

  • position_jitterdodge(): dodge points within groups, then add a little random noise
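As a minimal sketch of the first adjustment in this list, position_nudge() can move text labels slightly away from the points they describe. The data frame `pts` and the offset of 0.2 are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data for illustration.
pts <- data.frame(x = 1:3, y = c(2, 3, 1), label = c("a", "b", "c"))

p_nudge <- ggplot(pts, aes(x, y)) +
  geom_point() +
  # Shift each label 0.2 units upwards so it does not sit on top of its point.
  geom_text(aes(label = label), position = position_nudge(y = 0.2))

# Label positions in the second layer: original y plus the offset.
layer_data(p_nudge, 2)$y
```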

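Note that the way you pass parameters to position adjustments differs from stats and geoms. Instead of including additional arguments in ..., you construct a position adjustment object and supply the additional arguments in that call: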

mpg %>% 
  ggplot(aes(displ,hwy)) +
  geom_point(position = "jitter")

mpg %>% 
  ggplot(aes(displ,hwy)) +
  geom_point(position = position_jitter(width =  0.5, height = 0.5))
Figure: position adjustments for points

This is rather verbose, so geom_jitter() provides a convenient shortcut:

ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(width = 0.05, height = 0.5)

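Continuous data typically does not overlap exactly, and when it does (because of high data density), minor adjustments such as jittering are often insufficient to fix the problem. For this reason, position adjustments are generally most useful for discrete data.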
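A small sketch of the discrete case, assuming `drv` as the discrete variable: jittering only along the discrete axis separates overlapping points without distorting the continuous values. The name `p_discrete`, the width of 0.2, and the seed are choices made for this illustration.

```r
library(ggplot2)

p_discrete <- ggplot(mpg, aes(drv, hwy)) +
  # Spread points horizontally within each drv category; leave hwy untouched.
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

p_discrete
```

Setting `height = 0` keeps the highway mileage exact, so only the position within each category is randomised.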

5.7.1 Exercises

  1. When might you use position_nudge()? Read the documentation.

  2. Many position adjustments can only be used with a few geoms. For example, you can’t stack boxplots or error bars. Why not? What properties must a geom possess in order to be stackable? What properties must it possess to be dodgeable?

  3. Why might you use geom_jitter() instead of geom_count()? What are the advantages and disadvantages of each technique?

  4. When might you use a stacked area plot? What are the advantages and disadvantages compared to a line plot?