This post records how to work with factors efficiently using the tidyverse.

library(tidyverse)
## ─ Attaching packages ─────────────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ─ Conflicts ──────────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(forcats)

Creating factors

Start with a character vector

x1 <- c("Dec", "Apr", "Jan", "Mar")

Specify the valid factor levels, in the order you want them

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Now you can create the factor

y1 <- factor(x1,levels = month_levels)

Sorting the factor follows the specified level order, whereas sorting the character vector is alphabetical

sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
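One consequence of fixing the levels up front is worth seeing once: values that are not in the level set are silently turned into NA. The sketch below uses a hypothetical vector x2 with a deliberate typo ("Jam"); it is not part of the original data.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

# x2 is a made-up example; "Jam" is a deliberate typo
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels = month_levels)

y2          # the third element is NA, and no warning is raised
is.na(y2)   # TRUE only at the position of the typo
```

Checking for unexpected NA values right after creating a factor is a cheap way to catch typos in the raw data.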

If you create a factor from a character vector without specifying the levels, the levels default to alphabetical order

factor(x1)
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar

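Sometimes you want the levels to follow the order in which the values first appear in the data

# The two lines below produce the same level order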
f1 <- factor(x1,levels = unique(x1))
f2 <- x1 %>% factor() %>% fct_inorder()

Inspect the factor levels

levels(f2)
## [1] "Dec" "Apr" "Jan" "Mar"
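As a quick sanity check, the two approaches can be compared directly. This is a minimal sketch that only compares the levels and the underlying values.

```r
library(forcats)

x1 <- c("Dec", "Apr", "Jan", "Mar")

f1 <- factor(x1, levels = unique(x1))   # base R: levels in order of first appearance
f2 <- fct_inorder(factor(x1))           # forcats: same idea, applied to an existing factor

identical(levels(f1), levels(f2))               # same level order
identical(as.character(f1), as.character(f2))   # same values
```

fct_inorder() is the more convenient of the two when the factor already exists, for example as a column in a data frame.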

That covers the basics of factors. The rest of this post uses example data to walk through the factor tools in the tidyverse (the forcats package).

Practice data

gss_cat
## # A tibble: 21,483 x 9
##     year marital     age race  rincome   partyid    relig   denom   tvhours
##    <int> <fct>     <int> <fct> <fct>     <fct>      <fct>   <fct>     <int>
##  1  2000 Never ma…    26 White $8000 to… Ind,near … Protes… Southe…      12
##  2  2000 Divorced     48 White $8000 to… Not str r… Protes… Baptis…      NA
##  3  2000 Widowed      67 White Not appl… Independe… Protes… No den…       2
##  4  2000 Never ma…    39 White Not appl… Ind,near … Orthod… Not ap…       4
##  5  2000 Divorced     25 White Not appl… Not str d… None    Not ap…       1
##  6  2000 Married      25 White $20000 -… Strong de… Protes… Southe…      NA
##  7  2000 Never ma…    36 White $25000 o… Not str r… Christ… Not ap…       3
##  8  2000 Divorced     44 White $7000 to… Ind,near … Protes… Luther…      NA
##  9  2000 Married      44 White $25000 o… Not str d… Protes… Other         0
## 10  2000 Married      47 White $25000 o… Strong re… Protes… Southe…       3
## # … with 21,473 more rows

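race is a categorical variable; count the observations in each of its levels

# count() groups by the variable itself, so no separate group_by() is needed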
gss_cat %>% count(race)
## # A tibble: 3 x 2
##   race      n
##   <fct> <int>
## 1 Other  1959
## 2 Black  3129
## 3 White 16395
# The same counts can be shown as a bar chart
ggplot(gss_cat,aes(race)) +
  geom_bar()

# By default ggplot2 drops levels that have no observations; drop = FALSE forces them to be shown
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
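To see why the two bar charts can differ, it helps to compare the declared levels of race with the values that actually occur. The sketch below assumes the gss_cat data shipped with forcats, where race is expected to carry a level with no observations.

```r
library(forcats)  # provides the gss_cat dataset

levels(gss_cat$race)   # all declared levels, including any that never occur
table(gss_cat$race)    # base R table() keeps empty levels and reports them as 0

# Levels that are declared but have no observations
empty_levels <- setdiff(levels(gss_cat$race), unique(as.character(gss_cat$race)))
empty_levels
```

Any level listed in empty_levels is one that ggplot2 omits by default and that drop = FALSE brings back onto the axis.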

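Modifying factor order

Changing the order of factor levels is often useful in a visualisation.

In this dataset, suppose we want to look at the average number of hours of TV watched per day across religions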

relig_summary <- gss_cat %>% 
  group_by(relig) %>% 
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  ) 
ggplot(relig_summary,aes(tvhours,relig)) +
  geom_point()

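It is hard to read anything from this plot because the points are in no particular order. fct_reorder() fixes this by reordering the levels of relig by the value of tvhours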

ggplot(relig_summary,aes(tvhours,fct_reorder(relig,tvhours))) +
  geom_point()

After reordering, the plot is much easier to read. In practice I recommend changing the factor levels before plotting, in a separate mutate() step

relig_summary %>% 
  mutate(relig = fct_reorder(relig,tvhours)) %>% 
  ggplot(aes(tvhours,relig)) +
  geom_point()
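What fct_reorder() does to the levels is easier to see on a tiny example. The data frame toy below is hypothetical and only mimics the shape of relig_summary (one row per group).

```r
library(forcats)

# Hypothetical toy summary: one row per group
toy <- data.frame(
  group = factor(c("a", "b", "c")),
  value = c(3, 1, 2)
)

levels(toy$group)                                        # alphabetical: "a" "b" "c"
levels(fct_reorder(toy$group, toy$value))                # ascending by value: "b" "c" "a"
levels(fct_reorder(toy$group, toy$value, .desc = TRUE))  # descending by value: "a" "c" "b"
```

Only the level order changes; the values stored in the factor stay the same, which is why the reordering affects the axis but not the data.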

Apply the same approach to plot the average age within each reported income level

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
ggplot(rincome_summary,aes(age,fct_reorder(rincome,age))) +
  geom_point()

levels(gss_cat$rincome)
##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"

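The points look tidy, but the y-axis labels show that fct_reorder() is a poor choice here: rincome already has a meaningful order, and reordering by age scrambles it. It makes more sense to keep the existing order and use fct_relevel() to move only the special level "Not applicable" to the front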

ggplot(rincome_summary,aes(age,fct_relevel(rincome,"Not applicable"))) +
  geom_point()
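A small sketch of how fct_relevel() moves levels, using a hypothetical factor income rather than the real rincome column. By default the named levels go to the front; the after argument places them elsewhere.

```r
library(forcats)

# Hypothetical ordered categories plus one special level
income <- factor(
  c("low", "mid", "high", "unknown"),
  levels = c("low", "mid", "high", "unknown")
)

levels(fct_relevel(income, "unknown"))           # "unknown" "low" "mid" "high"
levels(fct_relevel(income, "low", after = Inf))  # "mid" "high" "unknown" "low"
```

All levels that are not named keep their existing relative order, which is what makes fct_relevel() suitable for factors that already have a meaningful order.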

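When you colour the lines of a plot by a factor, fct_reorder2() is useful: it reorders the factor by the y values associated with the largest x values, so the legend order matches the order of the lines at the right edge of the plot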

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))
ggplot(by_age,aes(age,prop,colour = marital)) +
  geom_line(na.rm = TRUE)

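In the plot above it is hard to match the colours to the legend entries. If the legend were ordered the same way as the lines at the largest x value, the plot would be easier to read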

ggplot(by_age,aes(age,prop,colour = fct_reorder2(marital,age,prop))) +
  geom_line() +
  labs(colour = "marital")

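Now the legend lines up with the lines, and the trend of each one is easy to follow.

Finally, for bar charts, fct_infreq() orders the levels by decreasing frequency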
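The ordering rule of fct_reorder2() can be checked on a toy example: for each level, take the y value at the largest x, then sort the levels by that value in descending order. The vectors below are made up for illustration.

```r
library(forcats)

# Three hypothetical series observed at x = 1 and x = 2
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(5, 1, 2, 9, 3, 4)

# y at the largest x: a = 1, b = 9, c = 4
levels(fct_reorder2(f, x, y))   # "b" "c" "a"
```

Because the legend follows the level order, the first legend entry corresponds to the line that ends highest on the right side of the plot.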

gss_cat %>% mutate(marital = marital %>% fct_infreq()) %>% 
  ggplot(aes(marital)) +
  geom_bar()

Use fct_rev() to reverse the order, so the bars run from least to most frequent

gss_cat %>% mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>% 
  ggplot(aes(marital)) +
  geom_bar()
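The effect of fct_infreq() and fct_rev() on the level order, shown on a small hypothetical factor:

```r
library(forcats)

# Hypothetical data: dog appears 3 times, bird 2 times, cat once
pets <- factor(c("cat", "dog", "dog", "dog", "bird", "bird"))

levels(pets)                        # alphabetical: "bird" "cat" "dog"
levels(fct_infreq(pets))            # most frequent first: "dog" "bird" "cat"
levels(fct_rev(fct_infreq(pets)))   # least frequent first: "cat" "bird" "dog"
```

Neither function needs a second variable, since the ordering comes from the counts of the factor itself.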

Modifying factor levels

More powerful than changing the order of the levels is changing their values

gss_cat %>% count(partyid)
## # A tibble: 10 x 2
##    partyid                n
##    <fct>              <int>
##  1 No answer            154
##  2 Don't know             1
##  3 Other party          393
##  4 Strong republican   2314
##  5 Not str republican  3032
##  6 Ind,near rep        1791
##  7 Independent         4119
##  8 Ind,near dem        2499
##  9 Not str democrat    3690
## 10 Strong democrat     3490

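The level names above are terse and inconsistent. fct_recode() lets us rename them: each argument has the form "new name" = "old name", and levels that are not mentioned are left unchanged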

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
## # A tibble: 10 x 2
##    partyid                   n
##    <fct>                 <int>
##  1 No answer               154
##  2 Don't know                1
##  3 Other party             393
##  4 Republican, strong     2314
##  5 Republican, weak       3032
##  6 Independent, near rep  1791
##  7 Independent            4119
##  8 Independent, near dem  2499
##  9 Democrat, weak         3690
## 10 Democrat, strong       3490

You can also merge several levels into one by assigning them the same new name

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
## # A tibble: 8 x 2
##   partyid                   n
##   <fct>                 <int>
## 1 Other                   548
## 2 Republican, strong     2314
## 3 Republican, weak       3032
## 4 Independent, near rep  1791
## 5 Independent            4119
## 6 Independent, near dem  2499
## 7 Democrat, weak         3690
## 8 Democrat, strong       3490
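The same renaming and merging behaviour on a small hypothetical factor, which makes the "new name" = "old name" direction easy to verify:

```r
library(forcats)

# Hypothetical data with four levels: "apple" "banana" "bear" "deer"
things <- factor(c("apple", "bear", "banana", "deer", "apple"))

# Rename one level; unmentioned levels are left as they are
renamed <- fct_recode(things, "Apple" = "apple")
levels(renamed)   # "Apple" "banana" "bear" "deer"

# Give two old levels the same new name to merge them
merged <- fct_recode(things, "fruit" = "apple", "fruit" = "banana")
levels(merged)    # "fruit" "bear" "deer"
table(merged)     # fruit now counts both apple and banana observations
```

Merging should be done with care: combining categories that are genuinely different can make the resulting counts misleading.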

If you want to merge many levels, fct_collapse() is a handy variant of fct_recode(): for each new level you supply a vector of old levels

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
## # A tibble: 4 x 2
##   partyid     n
##   <fct>   <int>
## 1 other     548
## 2 rep      5346
## 3 ind      8409
## 4 dem      7180
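A minimal sketch of fct_collapse() on a hypothetical size factor. A useful check after collapsing is that the total number of observations is unchanged and that each new level holds the sum of the old ones.

```r
library(forcats)

# Hypothetical data: clothing sizes
sizes <- factor(
  c("xs", "s", "s", "m", "l", "xl", "xl"),
  levels = c("xs", "s", "m", "l", "xl")
)

collapsed <- fct_collapse(sizes,
  small = c("xs", "s"),
  large = c("l", "xl")
)

table(collapsed)                        # small = 3, m = 1, large = 3
sum(table(collapsed)) == length(sizes)  # no observations are lost
```

Levels that are not listed, such as "m" here, are carried over unchanged.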