knitr::opts_chunk$set(echo = TRUE,fig.align = "center")
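In this chapter, you will dive into the details of a layer and learn how to control all five of its components: data, aesthetic mappings, geom, stat, and position adjustment.
The goal is to give you the tools to build sophisticated plots tailored to the problem at hand.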
library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.2.1 √ purrr 0.3.3
## √ tibble 2.1.3 √ dplyr 0.8.3
## √ tidyr 1.0.0 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# install.packages("animate")
# library(animate)
library(DT)
# Add a geom_point() layer
mpg %>%
ggplot2::ggplot(aes(displ,hwy)) +
geom_point()
mpg %>%
ggplot2::ggplot(aes(displ,hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Custom figure format
mod <- loess(hwy~displ,data = mpg)
grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length = 50))
grid$hwy <- predict(mod, newdata = grid)
grid %>% datatable()
std_resid <- resid(mod) / mod$s
outlier <- mpg %>%
filter(abs(std_resid)>2)
outlier
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet corve~ 5.7 1999 8 manu~ r 16 26 p 2sea~
## 2 pontiac grand~ 3.8 2008 6 auto~ f 18 28 r mids~
## 3 pontiac grand~ 5.3 2008 8 auto~ f 16 25 p mids~
## 4 volkswagen jetta 1.9 1999 4 manu~ f 33 44 d comp~
## 5 volkswagen new b~ 1.9 1999 4 manu~ f 35 44 d subc~
## 6 volkswagen new b~ 1.9 1999 4 auto~ f 29 41 d subc~
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_line(data = grid, colour = "blue", size = 1.5) +
geom_text(data = outlier, aes(label = model))
mpg %>%
ggplot(aes(displ,hwy)) +
geom_point() +
geom_line(data = grid) +
geom_text(data = outlier,aes(label = model))
2. The following code uses dplyr to generate some summary statistics about each class of car.
library(dplyr)
class <- mpg %>%
group_by(class) %>%
summarise(n = n(), hwy = mean(hwy))
class
## # A tibble: 7 x 3
## class n hwy
## <chr> <int> <dbl>
## 1 2seater 5 24.8
## 2 compact 47 28.3
## 3 midsize 41 27.3
## 4 minivan 11 22.4
## 5 pickup 33 16.9
## 6 subcompact 35 28.1
## 7 suv 62 18.1
class %>%
ggplot(aes(class,hwy)) +
geom_point(col = "red",size = 10) +
geom_jitter(data = mpg,aes(class,hwy),size = 2,width = 0.1,height = 1)
mpg %>%
ggplot(aes(displ,hwy,col = drv,shape = drv)) +
geom_point() +
geom_smooth(aes(linetype = drv),se = FALSE) +
theme(legend.position = "none")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Or, equivalently, with the mappings given in each layer
ggplot(mpg) +
geom_point(aes(displ,hwy,col = drv,shape = drv)) +
geom_smooth(aes(displ,hwy,col= drv,linetype = drv),se = FALSE) +
theme(legend.position = "none")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Specifying aesthetics in the plot or in the layers
# The approach below is simpler
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme(legend.position = "none")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) + # colour is mapped for the points only
geom_smooth(method = "lm", se = FALSE) +
theme(legend.position = "none")
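Specifying aesthetics in the plot or in the layers
Generally, you want to set up the mappings so that they illuminate the structure underlying the graphic and minimise typing. The best approach is rarely obvious at first, so if you have iterated your way to a complex graphic, it may be worth rewriting it to make the structure clearer.
If you want the appearance to be governed by a variable, put the specification inside aes(); if you want to override the default size or colour, put the value outside aes().
The following plots are created with similar code but produce different output. The second plot maps (rather than sets) the colour to "darkblue". This effectively creates a new variable that contains only the value "darkblue" and then scales it with a colour scale. Because this value is discrete, the default colour scale uses evenly spaced colours on the colour wheel, and since there is only one value, that colour is pinkish.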
mpg %>%
ggplot(aes(cty,hwy)) +
geom_point(color = "darkblue",size = 3)
mpg %>%
ggplot(aes(cty,hwy)) +
geom_point(aes(col = "darkblue"),size = 3)
Where the colour is specified
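The difference between setting and mapping can also be checked without looking at the figures. The sketch below is only an illustration: it rebuilds the two plots and uses layer_data() to read back the colour that each layer ends up drawing. The object names p_set and p_map are made up for this example.

```r
library(ggplot2)

# Setting: the colour is a fixed parameter of the layer.
p_set <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(colour = "darkblue", size = 3)

# Mapping: "darkblue" becomes a one-level discrete variable that gets scaled.
p_map <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(aes(colour = "darkblue"), size = 3)

# layer_data() returns the layer's data after stats and scales were applied.
set_colour <- unique(layer_data(p_set)$colour)  # the literal colour that was set
map_colour <- unique(layer_data(p_map)$colour)  # chosen by the default discrete scale

print(set_colour)
print(map_colour)
```

In the mapped version the string "darkblue" never reaches the drawing step; it only serves as a label for the single level of the scale.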
ggplot(mpg, aes(cty, hwy)) +
geom_point(aes(colour = "darkblue"),size = 3) +
scale_colour_identity()
scale_colour_identity() is most useful when you already have a column that contains colours.
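As a minimal sketch of that situation, assume a small data frame whose col column already stores valid R colour names. The data frame df and its values are invented for illustration and are not part of the original notes.

```r
library(ggplot2)

# Hypothetical data: the `col` column already holds valid R colour names.
df <- data.frame(
  x = 1:3,
  y = c(2, 4, 3),
  col = c("red", "darkblue", "forestgreen"),
  stringsAsFactors = FALSE
)

p <- ggplot(df, aes(x, y)) +
  geom_point(aes(colour = col), size = 5) +
  scale_colour_identity()  # use the values in `col` as they are

# The colours that will be drawn are exactly the ones stored in the column.
drawn <- as.character(layer_data(p)$colour)
print(drawn)
```

Without scale_colour_identity(), the three strings would be treated as levels of a discrete variable and replaced by default palette colours.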
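It is sometimes useful to map an aesthetic to a constant. For example, if you want to display several layers with different parameters, you can "name" each layer: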
ggplot(mpg, aes(displ, hwy)) +
geom_point()+
geom_smooth(aes(colour = "loess"), method = "loess", se = FALSE) +
geom_smooth(aes(colour = "lm"), method = "lm", se = FALSE) +
labs(colour = "Method")
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
mpg %>%
ggplot(aes(displ,hwy)) +
geom_point() +
geom_smooth(aes(col = "loess"),method = "loess",se = FALSE) +
geom_smooth(aes(col = "lm"),method = "lm",se = FALSE) +
geom_smooth(aes(col = "rlm"),method = "rlm",se = FALSE) +
labs(colour = "Method")
# This plot is not very informative
ggplot(mpg) +
geom_jitter(aes(class, cty),width = 0.1) +
geom_boxplot(aes(trans, hwy)) +
coord_flip()
ggplot(mpg) +
geom_boxplot(aes(trans,hwy)) +
coord_flip()
geom_text(): text.
geom_raster(): fast version of geom_tile() for equal-sized tiles.
Download and print out the ggplot2 cheatsheet from http://www.rstudio.com/resources/cheatsheets/ so you have a handy visual reference for all the geoms.
Look at the documentation for the graphical primitive geoms. Which aesthetics do they use? How can you summarise them in a compact form?
What’s the best way to master an unfamiliar geom? List three resources to help you get started.
The official website, the built-in help, the ggplot2 cheat…
# Violin plot
mpg %>% ggplot() +
geom_violin(aes(drv,displ,col = drv,fill = drv))
# Bubble plot
mpg %>% ggplot() +
geom_jitter(aes(hwy,cty,size =cty),
alpha = 0.1,height = 0.5)
# One continuous, one discrete
mpg %>%
ggplot() +
geom_jitter(aes(cyl,drv),height = 0.1,width = 0.1,size = 8)
One time, one continuous
Show distribution:
Focus attention on the overall trend in a large dataset.
Spatial
Label outlying points.
mpg %>%
ggplot(aes(trans,cty)) +
geom_point() +
stat_summary(geom = "point",fun.y = "median",col = "red",size = 10)
# An alternative way to write the same layer
mpg %>%
ggplot(aes(trans,cty)) +
geom_point() +
geom_point(stat = "summary",fun.y = "median",col = "red",size = 10)
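Statistical transformation
I think it is best to use the first form, because it makes it clearer that you are displaying a summary rather than the raw data.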
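For comparison, the same red median points can be produced without a stat by computing the summary first and handing it to a layer through its data argument. This is only a sketch of that alternative, not the approach used above; the name trans_median is made up for the example.

```r
library(ggplot2)
library(dplyr)

# Median city mileage per transmission type, computed up front.
trans_median <- mpg %>%
  group_by(trans) %>%
  summarise(cty = median(cty))

p_manual <- ggplot(mpg, aes(trans, cty)) +
  geom_point() +
  # This layer gets its own dataset; the x and y mappings are inherited.
  geom_point(data = trans_median, colour = "red", size = 10)

p_manual
```

Doing the summary by hand keeps the computation visible, at the cost of an extra object that has to be kept in sync with the plot.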
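Internally, a stat takes a data frame as input and returns a data frame as output, so a stat can add new variables to the original dataset. Aesthetics can be mapped to these new variables. For example, stat_bin, the stat used to make histograms, produces the following variables: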
count, the number of observations in each bin
density, the density of observations in each bin (percentage of total/barwidth)
x, the centre of the bin
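These generated variables can be used in place of the variables present in the original dataset. For example, the default histogram geom assigns the height of the bars to the number of observations (count), but if you prefer a more traditional histogram you can use the density (density). To refer to a generated variable such as density, you must surround its name with "..". This avoids confusion when the original dataset contains a variable with the same name as a generated one, and it makes clear to any later reader of the code that the variable was generated by a stat. Each stat lists the variables it creates in its documentation.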
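One way to look at the variables that stat_bin generates is to pull the computed data out of a built plot with layer_data(). The sketch below assumes the same binwidth of 500 used in the histograms of this section; the object names p and bins are illustrative.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 500)

# layer_data() exposes the data frame produced by stat_bin for this layer.
bins <- layer_data(p)
head(bins[, c("x", "count", "density")])

# The counts add up to the number of rows in diamonds ...
total_count <- sum(bins$count)
# ... and the densities integrate to 1 over the bin widths.
total_area <- sum(bins$density * (bins$xmax - bins$xmin))

print(total_count)
print(total_area)
```

The two totals show why count and density give differently scaled but identically shaped histograms when all bins have the same width.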
diamonds$price %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
ggplot(diamonds, aes(price)) +
geom_histogram(binwidth = 500)
ggplot(diamonds, aes(price)) +
geom_histogram(binwidth = 5000)
Statistical transformation
ggplot(diamonds, aes(price)) +
geom_histogram(aes(y = ..density..), binwidth = 500)
ggplot(diamonds, aes(price)) +
geom_histogram(aes(y = ..count..), binwidth = 500)
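Statistical transformation
This technique is particularly useful when you want to compare the distributions of several groups that have very different sizes.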
diamonds %>%
ggplot(aes(price,col = cut)) +
geom_freqpoly(size = 1.2,binwidth = 500) +
theme(legend.position = "bottom")
diamonds %>%
ggplot(aes(price,col = cut)) +
geom_freqpoly(aes(y = ..density..),size = 1.2,binwidth = 500) +
theme(legend.position = "bottom")
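Comparing the distributions of different groups
The result is rather surprising: low-quality diamonds seem to be more expensive on average. Is that really the case?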
sum(is.na(diamonds))
## [1] 0
data <- diamonds %>%
group_by(cut) %>%
summarise(n = n(),price_mean = mean(price)) %>%
arrange(price_mean)
data %>% datatable()
stat_smooth(). Use the appropriate geoms to mimic the default geom_smooth() display.
mod <- loess(hwy~displ, data = mpg)
smoothed <- data.frame(displ = seq(1.6, 7, length = 50)) # prediction grid
pred <- predict(mod, newdata = smoothed, se = TRUE) # predictions with standard errors
smoothed$hwy <- pred$fit
smoothed$hwy_lwr <- pred$fit - 1.96 * pred$se.fit
smoothed$hwy_upr <- pred$fit + 1.96 * pred$se.fit
names(smoothed)
## [1] "displ" "hwy" "hwy_lwr" "hwy_upr"
mpg %>%
ggplot(aes(displ,hwy)) +
geom_point(col = "red",size = 2) +
geom_point(data = smoothed,aes(displ,hwy),size = 3) +
geom_smooth(data = mpg,aes(displ,hwy),col = "blue") +
geom_errorbar(data = smoothed,mapping = aes(ymin = hwy_lwr,ymax = hwy_upr),col = 6,size = 1.5)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What stats were used to create the following plots?
Read the help for stat_sum(), then use geom_count() to create a plot that shows the proportion of cars that have each combination of drv and trans.
Three adjustments apply primarily to bars:
position_stack(): stack overlapping bars (or areas) on top of each other
position_fill(): stack overlapping bars, scaling so the top is always at 1
position_dodge(): place overlapping bars (or boxplots) side-by-side.
diamonds$color %>% table()
## .
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
ggplot(diamonds, aes(color)) +
xlab(NULL) + ylab(NULL) + theme(legend.position = "none") +
geom_bar()
dplot <- ggplot(diamonds, aes(color, fill = cut)) +
xlab(NULL) + ylab(NULL) + theme(legend.position = "none")
dplot + geom_bar()
dplot + geom_bar(position = "dodge")
dplot + geom_bar(position = "fill")
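To make the difference between stacking and filling concrete, the adjusted bar positions can be read back with layer_data(). This is a small sketch under the assumption that the top of each stack is the largest ymax within a colour group; the names p_fill, p_stack, top_fill and top_stack are invented for the example.

```r
library(ggplot2)

p_fill <- ggplot(diamonds, aes(color, fill = cut)) +
  geom_bar(position = "fill")
p_stack <- ggplot(diamonds, aes(color, fill = cut)) +
  geom_bar(position = "stack")

d_fill <- layer_data(p_fill)
d_stack <- layer_data(p_stack)

# Highest bar edge within each colour group, after the position adjustment.
top_fill <- tapply(d_fill$ymax, as.numeric(d_fill$x), max)
top_stack <- tapply(d_stack$ymax, as.numeric(d_stack$x), max)

print(top_fill)   # every stack is rescaled so that its top sits at 1
print(top_stack)  # stack tops equal the number of diamonds of each colour
```

The stack tops match the table of diamonds$color shown earlier, while the filled version discards those totals and keeps only the proportions.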
There’s also a position adjustment that does nothing: position_identity(). The identity position adjustment isn’t useful for bars, because each bar obscures the bars behind it, but there are many geoms that don’t need adjusting, like lines:
dplot + geom_bar(position = "identity", alpha = 1/3, colour = "grey50")
ggplot(diamonds, aes(color, colour = cut)) +
geom_line(aes(group = cut), stat = "count") +
xlab(NULL) + ylab(NULL) +
theme(legend.position = "none")
There are three position adjustments that are primarily useful for points:
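position_nudge(): move points by a fixed offset
position_jitter(): add a little random noise to every position
position_jitterdodge(): dodge points within groups, then add a little random noise
Note that the way you pass parameters to position adjustments differs from stats and geoms. Instead of including additional arguments in ..., you construct a position adjustment object and supply the additional arguments in that call: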
mpg %>%
ggplot(aes(displ,hwy)) +
geom_point(position = "jitter")
mpg %>%
ggplot(aes(displ,hwy)) +
geom_point(position = position_jitter(width = 0.5, height = 0.5))
Position adjustment for points
This is rather verbose, so geom_jitter() provides a convenient shortcut:
ggplot(mpg, aes(displ, hwy)) +
geom_jitter(width = 0.05, height = 0.5)
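Continuous data typically does not overlap exactly, and when it does (because of high data density), minor adjustments such as jittering are often not enough to fix the problem. For this reason, position adjustments are generally most useful for discrete data.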
When might you use position_nudge()? Read the documentation.
Many position adjustments can only be used with a few geoms. For example, you can’t stack boxplots or error bars. Why not? What properties must a geom possess in order to be stackable? What properties must it possess to be dodgeable?
Why might you use geom_jitter() instead of geom_count()? What are the advantages and disadvantages of each technique?
When might you use a stacked area plot? What are the advantages and disadvantages compared to a line plot?
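As one possible starting point for the position_nudge() question, a common use is moving text labels slightly away from the points they annotate. The sketch below uses a tiny invented data frame (df, with made-up labels) and an arbitrary offset of 0.1; it is an illustration, not the book's answer.

```r
library(ggplot2)

# Hypothetical data with one label per point.
df <- data.frame(
  x = 1:3,
  y = c(2, 1, 3),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

p <- ggplot(df, aes(x, y)) +
  geom_point() +
  # Shift only the text layer upwards so the labels do not sit on the points.
  geom_text(aes(label = label), position = position_nudge(y = 0.1))

# The second layer's y positions are the original ones plus the offset.
nudged_y <- as.numeric(layer_data(p, 2)$y)
print(nudged_y)
```

Because the offset is fixed rather than random, the labels stay in a predictable place relative to their points, which is the main contrast with position_jitter().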