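A bivariate graph displays the relationship between two variables. The type of graph depends on the measurement level of the variables (discrete or continuous).
When plotting the relationship between two discrete variables, stacked, grouped, or segmented bar charts are typically used. A less common approach is the mosaic plot.
Let's plot the relationship between automobile class and drive type (front-wheel, rear-wheel, or 4-wheel drive) for the cars in the Fuel economy (mpg) data set.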
pacman::p_load(tidyverse,DT,DataExplorer)
# simple bar chart of class counts
ggplot(mpg,
aes(x = class)) +
geom_bar(position = "stack") # y is the count
# stacked bar chart
ggplot(mpg,
aes(x = class,
fill = drv)) + # fill maps a second discrete variable
geom_bar(position = "stack")
# stacked bar chart with classes ordered by frequency
ggplot(mpg,
aes(x = class %>% as_factor() %>% fct_infreq(),
fill = drv)) + # fill maps a second discrete variable
geom_bar(position = "stack") +
labs(x = "class")
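From the chart, we can see that the most common vehicle class is the SUV.
Grouped bar charts place the bars for the second categorical variable side by side. To create a grouped bar plot, use the position = "dodge" option.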
# grouped bar plot
ggplot(mpg,
aes(x = class,
fill = drv)) +
geom_bar(position = "dodge")
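Notice that all minivans are front-wheel drive. By default, zero-count bars are dropped and the remaining bars are made wider. This may not be the behavior you want. You can change it with the position = position_dodge(preserve = "single") option.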
# grouped bar plot preserving zero count bars
ggplot(mpg,
aes(x = class,
fill = drv)) +
geom_bar(position = position_dodge(preserve = "single")) # side by side, equal bar widths
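To see which class and drive combinations produce the missing bars, it can help to tabulate the counts directly. The following is a minimal sketch using base R's table() on the mpg data that ships with ggplot2; the object names counts and zero_cells are chosen here for illustration.

```r
library(ggplot2)  # provides the mpg data set

# Cross-tabulate class by drive type; zero cells are the "missing" bars
counts <- table(class = mpg$class, drv = mpg$drv)
counts

# List the class/drv combinations that contain no cars at all
zero_cells <- subset(as.data.frame(counts), Freq == 0)
zero_cells
```

Every row of zero_cells corresponds to a bar that position = "dodge" silently drops.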
A segmented bar plot is a stacked bar plot where each bar represents 100%. You can create a segmented bar plot using the position = "fill" option.
# bar plot, with each bar representing 100%
ggplot(mpg,
aes(x = class,
fill = drv)) +
geom_bar(position = "fill") +
labs(y = "Proportion")
This type of plot is particularly useful if the goal is to compare the percentage of one category in one variable across each level of another variable. For example, the proportion of front-wheel drive cars goes up as you move from compact, to midsize, to minivan.
You can use additional options to improve color and labeling. In the graph below,
factor: modifies the order of the categories for the class variable and both the order and the labels for the drive variable
scale_y_continuous: modifies the y-axis tick mark labels
labs: provides a title and changes the labels for the x and y axes and the legend
scale_fill_brewer: changes the fill color scheme
theme_minimal: removes the grey background and changes the grid color
# bar plot, with each bar representing 100%, reordered bars, and better labels and colors
mpg$class %>% fct_count()
## # A tibble: 7 x 2
## f n
## <fct> <int>
## 1 2seater 5
## 2 compact 47
## 3 midsize 41
## 4 minivan 11
## 5 pickup 33
## 6 subcompact 35
## 7 suv 62
library(scales)
ggplot(mpg,
aes(x = factor(class,
levels = c("2seater", "subcompact",
"compact", "midsize",
"minivan", "suv", "pickup")),
fill = factor(drv,
levels = c("f", "r", "4"), # level order
labels = c("front-wheel", # display labels
"rear-wheel",
"4-wheel")))) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .2),
label = scales::percent) + # show ticks as percentages
scale_fill_brewer(palette = "Set2") + # fill color palette
labs(y = "Percent",
fill = "Drive Train",
x = "Class",
title = "Automobile Drive by Class") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
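In the graph above, the factor function was used to reorder and/or rename the levels of a discrete variable. You could also apply this to the original data set, making these changes permanent. It would then apply to all future graphs using that data set. For example: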
# change the order the levels for the categorical variable "class"
mpg$class = factor(mpg$class,
levels = c("2seater", "subcompact",
"compact", "midsize",
"minivan", "suv", "pickup"))
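I placed the factor function within the ggplot function to demonstrate that, if desired, you can change the order of the categories and the labels for a single graph. The other functions are discussed in more detail in the chapter on customizing graphs. Next, let's add percent labels to each segment. First, we'll create a summary data set that has the necessary labels.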
plotdata <- mpg %>%
group_by(class, drv) %>%
summarize(n = n()) %>%
mutate(pct = n/sum(n),
lbl = scales::percent(pct)) # format as a percent string
plotdata
## # A tibble: 12 x 5
## # Groups: class [7]
## class drv n pct lbl
## <fct> <chr> <int> <dbl> <chr>
## 1 2seater r 5 1 100%
## 2 subcompact 4 4 0.114 11%
## 3 subcompact f 22 0.629 63%
## 4 subcompact r 9 0.257 26%
## 5 compact 4 12 0.255 26%
## 6 compact f 35 0.745 74%
## 7 midsize 4 3 0.0732 7%
## 8 midsize f 38 0.927 93%
## 9 minivan f 11 1 100%
## 10 suv 4 51 0.823 82%
## 11 suv r 11 0.177 18%
## 12 pickup 4 33 1 100%
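The percentages above sum to 100% within each class because summarize() drops only the last grouping level, so the result is still grouped by class when mutate() computes n/sum(n). As a cross-check, the same row-wise proportions can be obtained in base R; this is a sketch using prop.table(), with row_props as an illustrative name.

```r
library(ggplot2)  # provides the mpg data set

# Row-wise proportions: each class (row) sums to 1 across drive types
row_props <- prop.table(table(mpg$class, mpg$drv), margin = 1)
round(row_props, 3)

# Sanity check: every row total should be 1
rowSums(row_props)
```

Using margin = 2 instead would give the share of each class within a drive type, which answers a different question.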
Next, we'll use this data set and the geom_text function to add labels to each bar segment.
ggplot(plotdata,
aes(x = factor(class,
levels = c("2seater", "subcompact",
"compact", "midsize",
"minivan", "suv", "pickup")), # class as an ordered factor
y = pct, # proportion within class
fill = factor(drv, # fill by drive type
levels = c("f", "r", "4"),
labels = c("front-wheel",
"rear-wheel",
"4-wheel")))) +
geom_bar(stat = "identity", # use the y values as given
position = "fill") + # scale each bar to 100%
scale_y_continuous(breaks = seq(0, 1, .2),
label = scales::percent) +
geom_text(aes(label = lbl), # add the percent labels
size = 3,
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "Drive Train",
x = "Class",
title = "Automobile Drive by Class") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
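Mosaic plots provide an alternative to stacked bar charts for displaying the relationship between categorical variables. They can also provide more sophisticated statistical information.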
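As a rough illustration of a mosaic plot, the sketch below uses mosaicplot() from base R graphics on the same class and drive variables. The choice of the base function is an assumption made here for simplicity; dedicated packages offer richer versions.

```r
library(ggplot2)  # provides the mpg data set

# Build the class-by-drive contingency table
tab <- table(class = mpg$class, drv = mpg$drv)

# Tile area is proportional to the cell count;
# shade = TRUE colors tiles by their residuals from independence
mosaicplot(tab,
           main = "Automobile drive by class",
           shade = TRUE,
           las = 2)
```

Tiles shaded in strong colors mark combinations that occur much more or less often than independence between class and drive type would predict.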
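The relationship between two continuous variables is typically displayed using scatterplots and line graphs.
The simplest display of two continuous variables is a scatterplot, with each variable represented on an axis. For example, using the Salaries data set, we can plot experience (yrs.since.phd) against academic salary (salary) for college professors.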
library(ggplot2)
pacman::p_load(carData)
data(Salaries, package="carData")
# simple scatterplot
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point()
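The geom_point function has the following important options:
color - point color
size - point size
shape - point shape
alpha - point transparency. Transparency ranges from 0 (transparent) to 1 (opaque), and is a useful parameter when points overlap.
The functions scale_x_continuous and scale_y_continuous control the scaling on the x and y axes, respectively.
We can use these options and functions to create a more attractive scatterplot.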
# enhanced scatter plot
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point(color="cornflowerblue",
size = 2,
alpha=.8) +
scale_y_continuous(label = scales::dollar, # label formats the tick labels; limits sets the axis range
limits = c(50000, 250000)) +
scale_x_continuous(breaks = seq(0, 60, 10),
limits=c(0, 60)) +
labs(x = "Years Since PhD",
y = "",
title = "Experience vs. Salary",
subtitle = "9-month salary for 2008-2009")
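It is often useful to summarize the relationship displayed in the scatterplot with a line of best fit. Many types of lines are supported, including linear, polynomial, and nonparametric (loess). By default, 95% confidence limits for these lines are displayed.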
# scatterplot with linear fit line
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point(color= "steelblue") +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Salaries %>%
ggplot(aes(x = yrs.since.phd, y = salary)) +
geom_point(col = "steelblue") +
geom_smooth(method = "lm") +
scale_y_continuous(labels = scales::dollar,
limits = c(50000,250000)) + # breaks, labels, and limits are the three key scale arguments
scale_x_continuous(breaks = seq(0,60,10),limits = c(0,60),expand = c(0,0)) +
labs(x = "Years Since PhD",
y = "",
title = "Experience vs. Salary",
subtitle = "9-month salary for 2008-2009") +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'
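Clearly, salary increases with experience. However, there seems to be a dip at the right end, where professors with a great deal of experience earn lower salaries. A straight line does not capture this nonlinear effect. A curved line will fit better here.
Typically, either a quadratic (one bend) or cubic (two bends) line is used. It is rarely necessary to use a higher-order polynomial. Applying a quadratic fit to the salary data produces the following result.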
# scatterplot with quadratic line of best fit
ggplot(Salaries,
aes(x = yrs.since.phd, # key settings: x, y, color, method, formula
y = salary)) +
geom_point(color= "steelblue") +
geom_smooth(method = "lm",
formula = y ~ poly(x, 2),
color = "indianred3")
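The curve drawn by geom_smooth(method = "lm", formula = y ~ poly(x, 2)) is an ordinary least-squares fit, so the same model can be inspected with lm(). The sketch below uses simulated data (an assumption for illustration, not the Salaries data) to show that the raw and the default orthogonal polynomial bases give the same fitted curve.

```r
# Simulated data with a known quadratic shape (illustrative only)
x <- 0:20
y <- 3 + 2 * x - 0.1 * x^2

# raw = TRUE keeps the coefficients on the original x scale
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE))
# The default orthogonal polynomials describe the same curve
fit_orth <- lm(y ~ poly(x, 2))

coef(fit_raw)                                  # intercept, linear, quadratic
all.equal(fitted(fit_raw), fitted(fit_orth))   # same fitted values
```

Replacing the 2 with a 3 would fit a cubic in the same way.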
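Finally, a smoothed nonparametric fit line can often provide a good picture of the relationship. The default in ggplot2 (for data sets with fewer than 1,000 observations) is a loess line, which stands for locally weighted scatterplot smoothing.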
# scatterplot with loess smoothed line
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point(color= "steelblue") + # a fixed color goes outside aes(); a mapped variable goes inside
geom_smooth(color = "tomato")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You can suppress the confidence bands by including the option se = FALSE.
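A minimal sketch of that option is shown here. The data are simulated (an assumption for illustration), and the method and formula are spelled out only to avoid the informational message.

```r
library(ggplot2)

# Simulated data (illustrative only)
set.seed(1)
d <- data.frame(x = 1:50)
d$y <- sqrt(d$x) + rnorm(50, sd = 0.3)

# se = FALSE draws the smoothed line without the shaded confidence band
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
p
```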
Here is a complete (and more attractive) plot.
# scatterplot with loess smoothed line
# and better labeling and color
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point(color="cornflowerblue",
size = 2,
alpha = .6) +
geom_smooth(size = 1.5,
color = "darkgrey") +
scale_y_continuous(label = scales::dollar,
limits = c(50000, 250000)) + # breaks are optional here
scale_x_continuous(breaks = seq(0, 60, 10),
limits = c(0, 60)) + # explicit breaks help here
labs(x = "Years Since PhD",
y = "",
title = "Experience vs. Salary",
subtitle = "9-month salary for 2008-2009") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
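When one of the two variables represents time, a line plot can be an effective method of displaying the relationship. For example, the code below displays the relationship between time (year) and life expectancy (lifeExp) in the United States between 1952 and 2007. The data comes from the gapminder data set.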
pacman::p_load(gapminder)
data(gapminder, package="gapminder")
gapminder %>% DT::datatable()
# Select US cases
library(dplyr)
plotdata <- filter(gapminder,
country == "United States")
plotdata
## # A tibble: 12 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 United States Americas 1952 68.4 157553000 13990.
## 2 United States Americas 1957 69.5 171984000 14847.
## 3 United States Americas 1962 70.2 186538000 16173.
## 4 United States Americas 1967 70.8 198712000 19530.
## 5 United States Americas 1972 71.3 209896000 21806.
## 6 United States Americas 1977 73.4 220239000 24073.
## 7 United States Americas 1982 74.6 232187835 25010.
## 8 United States Americas 1987 75.0 242803533 29884.
## 9 United States Americas 1992 76.1 256894189 32004.
## 10 United States Americas 1997 76.8 272911760 35767.
## 11 United States Americas 2002 77.3 287675526 39097.
## 12 United States Americas 2007 78.2 301139947 42952.
# simple line plot
ggplot(plotdata,
aes(x = year,
y = lifeExp)) +
geom_line()
It is hard to read individual values in the graph above. In the next plot, we'll add points as well.
# line plot with points and improved labeling
ggplot(plotdata,
aes(x = year,
y = lifeExp)) +
geom_line(size = 1.5,
color = "lightgrey") +
geom_point(size = 3,
color = "steelblue") +
labs(y = "Life Expectancy (years)",
x = "Year",
title = "Life expectancy changes over time",
subtitle = "United States (1952-2007)",
caption = "Source: http://www.gapminder.org/data/") +
scale_x_continuous(breaks = seq(1950,2010,5)) +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90,vjust = 0.5))
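Time-dependent data is covered in more detail in the section on time series plots.
When plotting the relationship between a discrete variable and a continuous variable, a large number of graph types are available. These include bar charts using summary statistics, grouped kernel density plots, side-by-side box plots, side-by-side violin plots, mean/SEM plots, ridgeline plots, and Cleveland plots.
In previous sections, bar charts were used to display the number of cases by category for a single variable or for two variables. You can also use bar charts to display other summary statistics (e.g., means or medians) on a quantitative variable for each level of a categorical variable.
For example, the following graph displays the mean salary for a sample of university professors by their academic rank.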
data(Salaries, package="carData")
# calculate mean salary for each rank
library(dplyr)
plotdata <- Salaries %>%
group_by(rank) %>%
summarize(mean_salary = mean(salary))
plotdata
## # A tibble: 3 x 2
## rank mean_salary
## <fct> <dbl>
## 1 AsstProf 80776.
## 2 AssocProf 93876.
## 3 Prof 126772.
# plot mean salaries
ggplot(plotdata,
aes(x = rank,
y = mean_salary)) +
geom_bar(stat = "identity") # plot the values as given instead of counting rows
We can make it more attractive with a few options.
# plot mean salaries in a more attractive fashion
library(scales)
ggplot(plotdata,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
y = mean_salary)) +
geom_bar(stat = "identity",
fill = "cornflowerblue") +
labs(title = "Mean Salary by Rank",
subtitle = "9-month academic salary for 2008-2009",
x = "",
y = "")
# same plot, adding dollar labels on the bars and the y-axis
library(scales)
ggplot(plotdata,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
y = mean_salary)) + # still a bar chart underneath
geom_bar(stat = "identity",
fill = "cornflowerblue") +
geom_text(aes(label = dollar(mean_salary)),
vjust = -0.25) +
scale_y_continuous(breaks = seq(0, 130000, 20000),
label = dollar) +
labs(title = "Mean Salary by Rank",
subtitle = "9-month academic salary for 2008-2009",
x = "",
y = "")
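One limitation of such plots is that they do not display the distribution of the data, only the summary statistic for each group. The plots below correct this limitation to some extent.
One can compare groups on a numeric variable by superimposing kernel density plots in a single graph.
# overall distribution of salaries (no grouping yet)
Salaries %>%
ggplot(aes(x = salary)) +
geom_density(alpha = 0.4)
# plot the distribution of salaries by rank using kernel density plots
ggplot(Salaries,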
aes(x = salary,
fill = rank)) +
geom_density(alpha = 0.4) +
labs(title = "Salary distribution by rank")
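The alpha option makes the density plots partially transparent, so that we can see what is happening under the overlaps. Alpha values range from 0 (transparent) to 1 (opaque). The graph makes clear that, in general, salary goes up with rank. However, the salary range for full professors is very wide.
A boxplot displays the 25th percentile, median, and 75th percentile of a distribution. The whiskers (vertical lines) capture roughly 99% of a normal distribution, and observations outside this range are plotted as points representing outliers. Side-by-side boxplots are very useful for comparing groups.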
# plot the distribution of salaries by rank using boxplots
ggplot(Salaries,
aes(x = rank,
y = salary)) +
geom_boxplot() +
labs(title = "Salary distribution by rank")
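The "roughly 99%" figure for the whiskers follows from the rule that they reach at most 1.5 times the interquartile range beyond the box. The short calculation below works this out for a standard normal distribution; the variable names are chosen here for illustration.

```r
# Quartiles of the standard normal distribution
q1  <- qnorm(0.25)
q3  <- qnorm(0.75)
iqr <- q3 - q1

# Whiskers reach at most 1.5 * IQR beyond the box edges
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Share of a normal distribution that falls inside the whisker limits
coverage <- pnorm(upper) - pnorm(lower)
coverage
```

The result is about 0.993, so under normality fewer than 1 in 100 observations would be drawn as outlier points.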
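Notched boxplots provide an approximate method for visualizing whether groups differ. Although not a formal test, if the notches of two boxplots do not overlap, there is strong evidence (95% confidence) that the medians of the two groups differ.
# plot the distribution of salaries by rank using notched boxplots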
ggplot(Salaries, aes(x = rank,
y = salary)) +
geom_boxplot(notch = TRUE,
fill = "cornflowerblue",
alpha = .7) +
labs(title = "Salary distribution by rank")
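In the example above, all three groups appear to differ. One of the advantages of boxplots is that their widths are not usually meaningful. This allows you to compare the distribution of many groups in a single graph.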
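The notch limits can also be computed by hand. The geom_boxplot() documentation describes the notches as extending 1.58 * IQR / sqrt(n) on either side of the median; the helper below, notch_limits(), is a hypothetical name introduced here to sketch that formula on toy values.

```r
# Notch limits as described for geom_boxplot():
# median +/- 1.58 * IQR / sqrt(n)
notch_limits <- function(x) {
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower = median(x) - half_width,
    upper = median(x) + half_width)
}

# Toy example (values are illustrative only)
notch_limits(1:9)
```

With the Salaries data loaded, tapply(Salaries$salary, Salaries$rank, notch_limits) would return the limits for each rank, so overlapping intervals can be checked numerically.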
Violin plots are similar to kernel density plots, but are mirrored and rotated 90 degrees.
# plot the distribution of salaries
# by rank using violin plots
ggplot(Salaries,
aes(x = rank,
y = salary)) +
geom_violin() +
labs(title = "Salary distribution by rank")
A useful variation is to superimpose boxplots on violin plots.
Salaries %>%
ggplot(aes(x = rank, y = salary)) +
geom_violin(aes(fill = rank)) +
geom_boxplot(aes(fill = rank),alpha = 0.5,width = 0.2,outlier.color = "orange",outlier.size = 2) +
labs(title = "Salary distribution by rank") +
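theme(legend.position = "none")
Ridgeline plots (also called joyplots) display the distribution of a quantitative variable for several groups. They are similar to kernel density plots with vertical faceting, but take up less room. Ridgeline plots are created with the ggridges package.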
# create ridgeline graph
library(ggplot2)
library(ggridges)
ggplot(mpg,
aes(x = cty,
y = class,
fill = class)) +
geom_density_ridges(alpha = 0.5) +
labs(title = "City mileage by auto class") +
theme(legend.position = "none")
## Picking joint bandwidth of 0.929
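I've suppressed the legend here because it is redundant (the distributions are already labeled on the y-axis). Unsurprisingly, pickup trucks have the poorest mileage, while subcompacts and compact cars tend to achieve the highest. However, there is a very wide range of gas mileage scores for these smaller cars.
Note that the distributions are allowed to overlap in order to produce a more compact graph. If the overlap is severe, you can add transparency using geom_density_ridges(alpha = n), where n ranges from 0 (transparent) to 1 (opaque).
A popular method for comparing groups on a numeric variable is the mean plot with error bars. Error bars can represent standard deviations, standard errors of the mean, or confidence intervals. In this section, we'll plot means and standard errors.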
# calculate means,standard deviations,standard errors, and 95% confidence intervals by rank
library(dplyr)
plotdata <- Salaries %>%
group_by(rank) %>%
summarize(n = n(),
mean = mean(salary),
sd = sd(salary),
se = sd / sqrt(n),
ci = qt(0.975, df = n - 1) * sd / sqrt(n))
plotdata
## # A tibble: 3 x 6
## rank n mean sd se ci
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AsstProf 67 80776. 8174. 999. 1994.
## 2 AssocProf 64 93876. 13832. 1729. 3455.
## 3 Prof 266 126772. 27719. 1700. 3346.
# plot the means and standard errors
ggplot(plotdata,
aes(x = rank,
y = mean,
group = 1)) +
geom_point(size = 3) +
geom_line()
ggplot(plotdata,
aes(x = rank,
y = mean,
group = 1)) +
geom_point(size = 3) +
geom_line() +
geom_errorbar(aes(ymin = mean - se,
ymax = mean + se),
width = .1)
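Although we plotted error bars representing the standard error, we could have plotted standard deviations or 95% confidence intervals. Simply replace se with sd or ci in the aes option. We can use the same technique to compare salaries across rank and sex. (Technically, this is not bivariate since we are plotting rank, sex, and salary, but it seems to fit here.)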
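Before moving on to rank and sex, here is a sketch of the confidence-interval variant. It uses a hypothetical helper, mean_ci(), written in base R to mirror the summarize() call above, and applies it to highway mileage by drive type from ggplot2's mpg data as stand-in data (an assumption made so the example runs without extra packages).

```r
library(ggplot2)

# Hypothetical helper (not part of any package): per-group n, mean,
# and confidence-interval half-width based on the t distribution
mean_ci <- function(values, groups, level = 0.95) {
  grp_n    <- tapply(values, groups, length)
  grp_mean <- tapply(values, groups, mean)
  grp_sd   <- tapply(values, groups, sd)
  half     <- qt(1 - (1 - level) / 2, df = grp_n - 1) * grp_sd / sqrt(grp_n)
  data.frame(group = names(grp_mean),
             n     = as.vector(grp_n),
             mean  = as.vector(grp_mean),
             ci    = as.vector(half))
}

# Stand-in data: highway mileage by drive type
ci_data <- mean_ci(mpg$hwy, mpg$drv)

# Error bars now span mean - ci to mean + ci
ggplot(ci_data, aes(x = group, y = mean, group = 1)) +
  geom_point(size = 3) +
  geom_line() +
  geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), width = 0.1)
```

Calling mean_ci(Salaries$salary, Salaries$rank) would produce the equivalent summary for the salary example.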
# calculate means and standard errors by rank and sex
plotdata <- Salaries %>%
group_by(rank, sex) %>%
summarize(n = n(),
mean = mean(salary),
sd = sd(salary),
se = sd/sqrt(n))
plotdata
## # A tibble: 6 x 6
## # Groups: rank [3]
## rank sex n mean sd se
## <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 AsstProf Female 11 78050. 9372. 2826.
## 2 AsstProf Male 56 81311. 7901. 1056.
## 3 AssocProf Female 10 88513. 17965. 5681.
## 4 AssocProf Male 54 94870. 12891. 1754.
## 5 Prof Female 18 121968. 19620. 4624.
## 6 Prof Male 248 127121. 28214. 1792.
# plot the means and standard errors by sex
ggplot(plotdata, aes(x = rank,
y = mean,
group=sex,
color=sex)) +
geom_point(size = 3) +
geom_line(size = 1) +
geom_errorbar(aes(ymin =mean - se,
ymax = mean+se),
width = .1)
# plot the means and standard errors by sex (dodged)
pd <- position_dodge(0.2)
ggplot(plotdata,
aes(x = rank,
y = mean,
group=sex,
color=sex)) +
geom_point(position = pd,
size = 3) +
geom_line(position = pd,
size = 1) +
geom_errorbar(aes(ymin = mean - se,
ymax = mean + se),
width = .1,
position= pd)
Finally, let's add some options to make the graph more attractive.
# improved means/standard error plot
pd <- position_dodge(0.2)
ggplot(plotdata,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
y = mean,
group=sex,
color=sex)) +
geom_point(position=pd,
size = 3) +
geom_line(position = pd,
size = 1) +
geom_errorbar(aes(ymin = mean - se,
ymax = mean + se),
width = .1,
position = pd,
size = 1) +
scale_y_continuous(label = scales::dollar) +
scale_color_brewer(palette="Set1") +
labs(title = "Mean salary by rank and sex",
subtitle = "(mean +/- standard error)",
x = "",
y = "",
color = "Gender")
The relationship between a grouping variable and a numeric variable can be displayed with a scatterplot.
# plot the distribution of salaries by rank using strip plots
ggplot(Salaries,
aes(y = rank,
x = salary)) +
geom_point() +
labs(title = "Salary distribution by rank")
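These one-dimensional scatterplots are called strip plots. Unfortunately, overlapping points make interpretation difficult. The relationship is easier to see if the points are jittered. Basically, a small random number is added to each y-coordinate.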
# plot the distribution of salaries by rank using jittering
ggplot(Salaries,
aes(y = rank,
x = salary)) +
geom_jitter() +
labs(title = "Salary distribution by rank")
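To make the idea of jittering concrete, the sketch below adds the random offsets by hand. It assumes uniform noise of at most 0.4 in either direction, which matches geom_jitter's documented default of 40% of the data resolution when categories are one unit apart; the variable names are illustrative.

```r
set.seed(42)  # make the random offsets reproducible

# Three groups coded as positions 1, 2, 3 on the categorical axis
group_pos <- rep(1:3, each = 5)

# Add a uniform random offset of at most +/- 0.4 to each position
jittered <- group_pos + runif(length(group_pos), min = -0.4, max = 0.4)

round(jittered, 2)
```

Because the offsets never exceed 0.4, points from neighboring groups cannot cross into each other's band.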
It is easier to compare groups if we use color.
# plot the distribution of salaries
# by rank using jittering
library(scales)
ggplot(Salaries,
aes(y = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
x = salary,
color = rank)) +
geom_jitter(alpha = 0.7,
size = 1.5) +
scale_x_continuous(label = dollar) +
labs(title = "Academic Salary by Rank",
subtitle = "9-month salary for 2008-2009",
x = "",
y = "") +
theme(legend.position = "none")
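The option legend.position = "none" is used to suppress the legend (which is not needed here). Jittered plots work well when the number of points is not overly large. The next plot superimposes the jittered points on boxplots.
# plot the distribution of salaries by rank using jittered points over boxplots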
library(scales)
ggplot(Salaries,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")),
y = salary,
color = rank)) +
geom_boxplot(size=1,
outlier.shape = 1,
outlier.color = "black",
outlier.size = 3) +
geom_jitter(alpha = 0.5,
width=.2) +
scale_y_continuous(label = dollar) +
labs(title = "Academic Salary by Rank",
subtitle = "9-month salary for 2008-2009",
x = "",
y = "") +
theme_minimal() +
theme(legend.position = "none") +
coord_flip()
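Several options were added to create this plot.
For the boxplot:
size = 1 makes the lines thicker
outlier.shape = 1 draws outliers as open circles
outlier.color = "black" makes outliers black
outlier.size = 3 increases the size of the outlier symbol
For the scatterplot:
alpha = 0.5 makes the points more transparent
width = .2 decreases the amount of horizontal jitter
Finally, the x and y axes are swapped using the coord_flip function.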
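Beeswarm plots (also called violin scatter plots) are similar to jittered scatterplots, in that they display the distribution of a quantitative variable by plotting points in a way that reduces overlap. In addition, they help display the density of the data at each point (in a manner similar to a violin plot). Continuing the previous example:
# plot the distribution of salaries by rank using beeswarm-style plots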
pacman::p_load(ggbeeswarm)
library(scales)
ggplot(Salaries,
aes(x = factor(rank,
labels = c("Assistant\nProfessor",
"Associate\nProfessor",
"Full\nProfessor")), # categorical variable as a factor; color uses the same variable as x
y = salary,
color = rank)) +
geom_quasirandom(alpha = 0.7,
size = 1.5) +
scale_y_continuous(label = dollar) +
labs(title = "Academic Salary by Rank",
subtitle = "9-month salary for 2008-2009",
x = "",
y = "") +
theme(legend.position = "none")
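These plots are created using the geom_quasirandom function. They can be easier to read than simple strip plots.
Cleveland plots are useful when you want to compare a numeric statistic for a large number of groups. For example, say that you want to compare the 2007 life expectancy for Asian countries using the gapminder data set.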
data(gapminder, package="gapminder")
# subset Asian countries in 2007
library(dplyr)
plotdata <- gapminder %>%
filter(continent == "Asia" &
year == 2007)
# basic Cleveland plot of life expectancy by country
ggplot(plotdata,
aes(x= lifeExp, y = country)) +
geom_point()
# Sorted Cleveland plot
ggplot(plotdata,
aes(x=lifeExp,
y=reorder(country, lifeExp))) + # reorder() sorts countries by lifeExp
geom_point()
# Fancy Cleveland plot
ggplot(plotdata,
aes(x=lifeExp,
y=reorder(country, lifeExp))) +
geom_point(color="blue",
size = 2) +
geom_segment(aes(x = 40, xend = lifeExp, y = reorder(country, lifeExp), yend = reorder(country, lifeExp)),color = "lightgrey") +
labs (x = "Life Expectancy (years)",
y = "",
title = "Life Expectancy by Country",
subtitle = "GapMinder data for Asia - 2007") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
Japan clearly has the highest life expectancy, while Afghanistan has the lowest. This last plot is also called a lollipop graph (you can see why).
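The sorted plots above rely on reorder() to change the order of the factor levels, which in turn controls the order along the axis. A small sketch on toy data (the values are an assumption for illustration) shows the effect:

```r
# Toy data (illustrative only)
country  <- factor(c("A", "B", "C"))
life_exp <- c(80, 60, 70)

# Default factor levels are alphabetical
levels(country)
# → "A" "B" "C"

# reorder() sorts the levels by the second argument, ascending
sorted <- reorder(country, life_exp)
levels(sorted)
# → "B" "C" "A"
```

Passing -life_exp as the second argument would reverse the order, placing the group with the highest value first.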