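Data Visualization with R: Chapter 5, Multivariate Graphs

Multivariate graphs display the relationships among three or more variables. There are two common ways to accommodate the additional variables: grouping and faceting.

5.1 Grouping

In grouping, the values of the first two variables are mapped to the x and y axes. Additional variables are then mapped to other visual characteristics such as color, shape, size, line type, and transparency. Grouping lets you plot the data for several groups in a single graph.

Using the Salaries dataset, let's display the relationship between yrs.since.phd and salary.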

library(tidyverse)

## -- Attaching packages -------------------------- tidyverse 1.3.0 --

## √ ggplot2 3.3.0     √ purrr   0.3.3
## √ tibble  2.1.3     √ dplyr   0.8.5
## √ tidyr   1.0.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0

## -- Conflicts ----------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

data(Salaries, package="carData")

# plot experience vs. salary
ggplot(Salaries, 
       aes(x = yrs.since.phd, 
           y = salary)) +
  geom_point() + 
  labs(title = "Academic salary by years since degree") +
  theme(plot.title = element_text(hjust = 0.5))

Next, let's include the rank of the professor, using color.

Salaries %>% 
  ggplot(aes(yrs.since.phd,salary)) +
  geom_point(aes(col = rank)) +
  labs(title = "Academic salary by rank and years since degree") +
  theme(plot.title = element_text(hjust = 0.5))
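
The plot above maps rank to color inside aes(), so the color of each point depends on the data. The following is a minimal sketch of the difference between mapping an aesthetic and setting it to a constant. The object names p_mapped and p_fixed and the color "steelblue" are illustrative choices, not part of the original example.

```r
library(ggplot2)
data(Salaries, package = "carData")

# Mapped: color is driven by a variable, so it goes inside aes()
# and ggplot2 adds a legend for it
p_mapped <- ggplot(Salaries, aes(x = yrs.since.phd, y = salary)) +
  geom_point(aes(color = rank))

# Set: every point gets the same fixed color, so it goes outside aes()
# and no legend is drawn
p_fixed <- ggplot(Salaries, aes(x = yrs.since.phd, y = salary)) +
  geom_point(color = "steelblue")

p_mapped
p_fixed
```

Only aesthetics that should vary with the data belong inside aes(); constants such as a single point size or transparency level are passed directly to the geom.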

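Finally, let's add the sex of the professor, using the shape of the points to indicate sex. We'll increase the point size and add transparency so that overlapping points are easier to see.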

Salaries %>% 
  ggplot(aes(yrs.since.phd,salary,col = rank,shape = sex)) +
  geom_point(size = 3,alpha = 0.5) +
  labs(title = "Academic salary by rank, sex, and years since degree") +
  theme(plot.title = element_text(hjust = 0.5))

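I can't say that this is a great graphic. It is hard to tell male and female professors apart. Faceting (described in the next section) would probably be a better approach. The version below previews it by drawing one panel per sex and adding a smoothed trend line.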

Salaries %>% 
  ggplot(aes(yrs.since.phd,salary)) +
  geom_point(aes(col = rank,shape = sex),size = 3,alpha = 0.7) +
  labs(title = "Academic salary by rank, sex, and years since degree") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(se = FALSE) +
  facet_wrap(~sex)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

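Here is a cleaner example. We'll graph the relationship between years since Ph.D. and salary, using the size of the points to indicate years of service. This is called a bubble plot.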

# plot experience vs. salary (color represents rank and size represents service)
ggplot(Salaries, 
       aes(x = yrs.since.phd, 
           y = salary, 
           color = rank, 
           size = yrs.service)) +
  geom_point(alpha = .6) +
  labs(title = "Academic salary by rank, years of service, and years since degree")

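There is obviously a strong positive relationship between years since Ph.D. and years of service. Assistant Professors fall in the range of 0-11 years since Ph.D. and 0-10 years of service. Clearly, highly experienced professionals don't stay at the Assistant Professor level (they are probably promoted or leave the university). We don't find the same time demarcation between Associate and Full Professors.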
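
These observations can also be checked numerically rather than read off the bubble plot. The sketch below uses only base R; the object names experience_cor, phd_range, and service_range are illustrative.

```r
data(Salaries, package = "carData")

# Linear correlation between the two experience measures
experience_cor <- cor(Salaries$yrs.since.phd, Salaries$yrs.service)
experience_cor

# Minimum and maximum of each measure within each rank
# (rows are min and max, columns are the three ranks)
phd_range     <- sapply(split(Salaries$yrs.since.phd, Salaries$rank), range)
service_range <- sapply(split(Salaries$yrs.service,   Salaries$rank), range)

phd_range
service_range
```

Comparing the AsstProf column with the other two shows how much narrower the experience range is for Assistant Professors.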

# plot experience vs. salary with fit lines (color represents sex)
ggplot(Salaries, 
       aes(x = yrs.since.phd, 
           y = salary, 
           color = sex)) +
  geom_point(alpha = .4, 
             size = 3) +
  geom_smooth(se=FALSE, 
              method = "lm", 
              formula = y~poly(x,2), 
              size = 1.5) +
  labs(x = "Years Since Ph.D.",
       title = "Academic Salary by Sex and Years Experience",
       subtitle = "9-month salary for 2008-2009",
       y = "",
       color = "Sex") +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_brewer(palette = "Set2")
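
Because sex is mapped to color, geom_smooth fits one curve per sex, and with method = "lm" and formula = y ~ poly(x, 2) each curve is a quadratic regression. The following sketch reproduces those fits outside of ggplot2 so that the fitted values can be inspected. The experience values 5, 20, and 35 are arbitrary illustration points.

```r
data(Salaries, package = "carData")

# One quadratic regression of salary on experience per sex,
# the same model form that geom_smooth() is asked to draw
fits <- lapply(split(Salaries, Salaries$sex), function(d) {
  lm(salary ~ poly(yrs.since.phd, 2), data = d)
})

# Predicted salary at a few (arbitrary) values of years since Ph.D.
new_x <- data.frame(yrs.since.phd = c(5, 20, 35))
predicted <- sapply(fits, predict, newdata = new_x)
rownames(predicted) <- new_x$yrs.since.phd
predicted
```

Each column of the result corresponds to one of the two fit lines in the plot.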

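5.2 Faceting

Grouping lets you plot multiple variables in a single graph by using visual characteristics such as color, shape, and size. In faceting, a graph consists of several separate plots, or small multiples, one for each level of a third variable or combination of variables.

This is easiest to understand with an example.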

# plot salary histograms by rank
ggplot(Salaries, aes(x = salary)) +
  geom_histogram(fill = "cornflowerblue",
                 color = "white") +
  facet_wrap(~rank, ncol = 1) +
  labs(title = "Salary histograms by rank")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

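The facet_wrap function creates a separate plot for each level of rank. The ncol option controls the number of columns. In the next example, two variables are used to define the facets.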

# plot salary histograms by rank and sex
ggplot(Salaries, aes(x = salary / 1000)) +
  geom_histogram(color = "white",
                 fill = "cornflowerblue") +
  facet_grid(sex ~ rank) +
  labs(title = "Salary histograms by sex and rank",
       x = "Salary ($1000)")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The format of the facet_grid function is:

facet_grid(row variable(s) ~ column variable(s))

Here, the function assigns sex to the rows and rank to the columns, creating a matrix of 6 plots in a single graph.
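
As a sketch of how the formula controls the layout, the variant below swaps the two sides so that rank defines the rows and sex the columns. It also sets binwidth explicitly, which avoids the stat_bin() message shown earlier. The bin width of 10 (that is, $10,000 on this scale) and the names p_grid and panel_layout are assumptions made for illustration.

```r
library(ggplot2)
data(Salaries, package = "carData")

# rank now defines the rows and sex the columns
p_grid <- ggplot(Salaries, aes(x = salary / 1000)) +
  geom_histogram(binwidth = 10,            # assumed bin width: $10,000
                 color = "white",
                 fill = "cornflowerblue") +
  facet_grid(rank ~ sex) +
  labs(title = "Salary histograms by rank and sex",
       x = "Salary ($1000)")
p_grid

# The built plot stores one row of layout information per panel
panel_layout <- ggplot_build(p_grid)$layout$layout
nrow(panel_layout)
```

Which variable goes on the rows is a design choice: values are easiest to compare between panels that share an axis, so put the comparison of interest along that direction.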

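We can also combine grouping and faceting. Let's use mean/SE plots and faceting to compare the salaries of male and female professors within rank and discipline. We'll use color to distinguish sex, and faceting to create one plot for each combination of rank and discipline.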

plotdata <- Salaries %>%
  group_by(sex, rank, discipline) %>%
  summarize(n = n(),
            mean = mean(salary),
            sd = sd(salary),
            se = sd / sqrt(n))
plotdata

## # A tibble: 12 x 7
## # Groups:   sex, rank [6]
##    sex    rank      discipline     n    mean     sd    se
##    <fct>  <fct>     <fct>      <int>   <dbl>  <dbl> <dbl>
##  1 Female AsstProf  A              6  72933.  5463. 2230.
##  2 Female AsstProf  B              5  84190.  9792. 4379.
##  3 Female AssocProf A              4  72128.  6403. 3201.
##  4 Female AssocProf B              6  99436. 14086. 5751.
##  5 Female Prof      A              8 109632. 15095. 5337.
##  6 Female Prof      B             10 131836. 17504. 5535.
##  7 Male   AsstProf  A             18  74270.  4580. 1080.
##  8 Male   AsstProf  B             38  84647.  6900. 1119.
##  9 Male   AssocProf A             22  85049. 10612. 2262.
## 10 Male   AssocProf B             32 101622.  9608. 1698.
## 11 Male   Prof      A            123 120619. 28505. 2570.
## 12 Male   Prof      B            125 133518. 26514. 2372.
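
The se column is the standard error of the mean, computed as sd / sqrt(n). As a quick check, the sketch below recomputes it from the rounded values printed in the first row of the table (Female, AsstProf, discipline A). The variable names are illustrative.

```r
# Values read from the first row of the printed table
n_obs     <- 6
sd_salary <- 5463

# Standard error of the mean: standard deviation divided by sqrt(n)
se_salary <- sd_salary / sqrt(n_obs)
round(se_salary)
```

The result agrees with the printed se for that row up to rounding of the displayed standard deviation.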

# create better labels for discipline
plotdata$discipline <- factor(plotdata$discipline,
                              labels = c("Theoretical",
                                         "Applied"))

# basic plot: group means by sex, before adding error bars and facets
ggplot(plotdata, 
       aes(x = sex,                         # discrete
           y = mean,                        # continuous
           color = sex)) +
  geom_point(size = 3)

# create plot
ggplot(plotdata, 
       aes(x = sex,                         # discrete
           y = mean,                        # continuous
           color = sex)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean - se, 
                    ymax = mean + se),
                width = .1) +
  scale_y_continuous(breaks = seq(70000, 140000, 10000),
                     limits = c(70000,140000),
                     labels = scales::dollar) +
  facet_grid(. ~ rank + discipline) +
  theme_bw() +
  theme(legend.position = "none",
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank()) +
  labs(x="", 
       y="", 
       title="Nine month academic salaries by gender, discipline, and rank",
       subtitle = "(Means and standard errors)") +
  scale_color_brewer(palette="Set1")

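The statement facet_grid(. ~ rank + discipline) specifies no row variable (.) and columns defined by the combinations of rank and discipline. The theme_bw and theme functions create a black and white theme, remove the legend, and eliminate the vertical grid lines and the minor horizontal grid lines. The scale_color_brewer function changes the color scheme for the points and error bars.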

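At first glance, it appears that there might be gender differences in salaries for associate and full professors in theoretical fields. I say "might" because we haven't done any formal hypothesis testing yet (ANCOVA in this case).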
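
For readers who want to see what such a test could look like, here is one possible sketch, not the author's analysis. The choice of covariates (yrs.since.phd, rank, and discipline) and the absence of interaction terms are assumptions made purely for illustration.

```r
data(Salaries, package = "carData")

# Hypothetical model: salary explained by experience, rank, discipline,
# and sex (an assumed specification, without interaction terms)
fit <- lm(salary ~ yrs.since.phd + rank + discipline + sex,
          data = Salaries)

# Sequential (Type I) tests: sex is entered last, so its row tests
# the sex effect after adjusting for the other three terms
ancova_table <- anova(fit)
ancova_table
```

Because the tests are sequential, the order of the terms in the formula matters. A full analysis would also check the model assumptions and consider interactions.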

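As a final example, we'll shift to a new dataset and plot the change in life expectancy over time for countries in the "Americas". The data comes from the gapminder dataset in the gapminder package. Each country appears in its own facet. The theme functions are used to simplify the background color, rotate the x-axis text, and make the font size smaller.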

# plot life expectancy by year separately for each country in the Americas
data(gapminder, package = "gapminder")

# Select the Americas data
plotdata <- dplyr::filter(gapminder, 
                          continent == "Americas")
plotdata %>% DT::datatable()

# plot life expectancy by year, for each country
ggplot(plotdata, aes(x=year, y = lifeExp)) +
  geom_line(color="grey") +
  geom_point(aes(color=country)) +
  facet_wrap(~country) +                  # one panel per country
  theme_minimal(base_size = 9) +
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1)) +
  labs(title = "Changes in Life Expectancy",
       x = "Year",
       y = "Life Expectancy")  +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none")
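
One detail worth knowing here: filtering the rows does not remove unused factor levels, yet facet_wrap by default draws panels only for the levels that actually occur in the data. The sketch below, which uses base R's subset and the illustrative name americas, compares the two counts.

```r
data(gapminder, package = "gapminder")

# Keep only the rows for the Americas
americas <- subset(gapminder, continent == "Americas")

# The country factor still carries the levels for every country ...
n_levels <- nlevels(americas$country)

# ... but only these countries are present, one facet each
n_present <- length(unique(americas$country))

c(levels = n_levels, present = n_present)
```

If the unused levels get in the way elsewhere, for example in a table or a legend, droplevels(americas) removes them.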

LJJ

2020/3/15
