This post documents how to work with factors efficiently using the tidyverse.
library(tidyverse)
## ─ Attaching packages ─────────────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ─ Conflicts ──────────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(forcats)
Start with a character vector:
x1 <- c("Dec", "Apr", "Jan", "Mar")
Specify the factor levels:
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
Now you can create the factor:
y1 <- factor(x1, levels = month_levels)
Sorting the factor follows the order of the specified levels, whereas sorting the character vector is alphabetical:
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
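One consequence of fixing the levels up front is worth a quick check: in base R, any value that is not among `levels` is silently converted to `NA`. A minimal sketch, where the vector `x2` and its misspelled entry "Jam" are made up for illustration:

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

# x2 is a hypothetical vector with one typo ("Jam" instead of "Jan")
x2 <- c("Dec", "Apr", "Jam", "Mar")

y2 <- factor(x2, levels = month_levels)

# The misspelled value is not a valid level, so it becomes NA without a warning
is.na(y2)
```

Counting the missing values with `sum(is.na(y2))` right after the conversion is a cheap way to catch typos like this.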
If you create a factor from a character vector without specifying the levels, the levels default to alphabetical order:
factor(x1)
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar
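Sometimes you want the levels to follow the order in which the values first appear in the data:
# The following two lines have the same effect
f1 <- factor(x1, levels = unique(x1))
f2 <- x1 %>% factor() %>% fct_inorder()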
Inspect the levels:
levels(f2)
## [1] "Dec" "Apr" "Jan" "Mar"
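The example above has no repeated values, so it may help to see what "order of first appearance" means when values repeat. A minimal sketch, assuming forcats is installed; the vector `x3` is made up for illustration:

```r
library(forcats)

# x3 is a hypothetical vector in which values repeat
x3 <- c("b", "a", "b", "c", "a")

f3 <- factor(x3, levels = unique(x3))
f4 <- fct_inorder(factor(x3))

# Each level is placed where its value first appears: b, then a, then c
levels(f4)

# Both approaches give the same level order
identical(levels(f3), levels(f4))
```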
That covers the basics of factors. The rest of this post uses example data to show how the tidyverse handles factors.
gss_cat
## # A tibble: 21,483 x 9
## year marital age race rincome partyid relig denom tvhours
## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 2000 Never ma… 26 White $8000 to… Ind,near … Protes… Southe… 12
## 2 2000 Divorced 48 White $8000 to… Not str r… Protes… Baptis… NA
## 3 2000 Widowed 67 White Not appl… Independe… Protes… No den… 2
## 4 2000 Never ma… 39 White Not appl… Ind,near … Orthod… Not ap… 4
## 5 2000 Divorced 25 White Not appl… Not str d… None Not ap… 1
## 6 2000 Married 25 White $20000 -… Strong de… Protes… Southe… NA
## 7 2000 Never ma… 36 White $25000 o… Not str r… Christ… Not ap… 3
## 8 2000 Divorced 44 White $7000 to… Ind,near … Protes… Luther… NA
## 9 2000 Married 44 White $25000 o… Not str d… Protes… Other 0
## 10 2000 Married 47 White $25000 o… Strong re… Protes… Southe… 3
## # … with 21,473 more rows
The race variable is categorical; look at its levels and their counts:
# count() groups by race itself, so no separate group_by() is needed
gss_cat %>% count(race)
## # A tibble: 3 x 2
## race n
## <fct> <int>
## 1 Other 1959
## 2 Black 3129
## 3 White 16395
# The same counts can be shown as a bar chart
ggplot(gss_cat,aes(race)) +
geom_bar()
# By default, ggplot2 drops levels that have no observations; you can force them to be shown
ggplot(gss_cat,aes(race)) +
geom_bar() +
scale_x_discrete(drop = FALSE)
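An "empty" level here is simply a level declared on the factor that never occurs in the data. The sketch below uses only base R and a made-up factor `f` to show how such a level behaves:

```r
# A hypothetical factor with a level ("c") that never occurs in the data
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

# table() keeps the empty level and reports a count of 0 for it
counts <- table(f)
counts

# droplevels() removes levels that have no observations
kept <- levels(droplevels(f))
kept
```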
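Changing the order of factor levels is often very useful in visualization.
In this dataset, suppose we want to compare the average number of hours of TV watched per day across religions: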
relig_summary <- gss_cat %>%
group_by(relig) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig_summary,aes(tvhours,relig)) +
geom_point()
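It is hard to read anything useful from this plot because the points are arranged in no particular order. fct_reorder fixes that by reordering the levels of relig by tvhours: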
ggplot(relig_summary,aes(tvhours,fct_reorder(relig,tvhours))) +
geom_point()
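After reordering, the scatter plot is clear at a glance. In practice, I recommend changing the factor levels before plotting, for example inside mutate: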
relig_summary %>%
mutate(relig = fct_reorder(relig,tvhours)) %>%
ggplot(aes(tvhours,relig)) +
geom_point()
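To see exactly what fct_reorder changes, it can help to try it on a tiny input. This is a minimal sketch, assuming forcats is installed; `group` and `value` are made-up names, not part of gss_cat:

```r
library(forcats)

# Hypothetical data: three groups with one numeric summary each
group <- factor(c("a", "b", "c"))
value <- c(3, 1, 2)

# The levels are reordered by ascending value
reordered <- fct_reorder(group, value)
levels(reordered)

# The values stored in the factor stay in their original positions
as.character(reordered)
```

Only the level order changes, which is why the rows of relig_summary are left alone while the axis order in the plot changes.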
Apply the same approach to plot the average age for each reported income level:
rincome_summary <- gss_cat %>%
group_by(rincome) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(rincome_summary,aes(age,fct_reorder(rincome,age))) +
geom_point()
levels(gss_cat$rincome)
## [1] "No answer" "Don't know" "Refused" "$25000 or more"
## [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
## [9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
## [13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
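The points look neatly arranged, but the y-axis labels show that fct_reorder() is a poor choice here: rincome already has a meaningful order, and reordering scrambles it. Instead, use fct_relevel to move only "Not applicable" to the front: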
ggplot(rincome_summary,aes(age,fct_relevel(rincome,"Not applicable"))) +
geom_point()
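fct_relevel moves only the levels you name and leaves the rest in their existing order. A minimal sketch on a made-up factor `f`, assuming forcats is installed:

```r
library(forcats)

# A hypothetical factor with levels a, b, c
f <- factor(c("a", "b", "c"))

# Naming a level moves it to the front; the others keep their relative order
front <- levels(fct_relevel(f, "c"))
front

# after = Inf moves the named level to the end instead
back <- levels(fct_relevel(f, "a", after = Inf))
back
```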
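When you color the lines of a line chart by a factor, fct_reorder2 is useful: it reorders the levels by the y values associated with the largest x values, so the order of the legend matches the order of the lines at the right edge of the plot.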
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
ggplot(by_age,aes(age,prop,colour = marital)) +
geom_line(na.rm = TRUE)
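In the plot above it is hard to match colors to legend labels. Ordering the legend by the y value of each line at the largest x value makes the plot much easier to read: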
ggplot(by_age,aes(age,prop,colour = fct_reorder2(marital,age,prop))) +
geom_line() +
labs(colour = "marital")
Now the trend of each line is easy to follow.
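The ordering rule of fct_reorder2 is easier to verify on a small input. The sketch below assumes forcats is installed and relies on the function's defaults (the y value at the largest x, sorted in descending order); `series`, `x`, and `y` are made up for illustration:

```r
library(forcats)

# Hypothetical long-format data: three series observed at x = 1 and x = 2
series <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(1, 5, 4, 2, 0, 9)

# y at the largest x is a = 5, b = 2, c = 9,
# and the levels are sorted by that value in descending order
ordered_levels <- levels(fct_reorder2(series, x, y))
ordered_levels
```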
Finally, for bar charts, fct_infreq() orders the levels by decreasing frequency:
gss_cat %>% mutate(marital = marital %>% fct_infreq()) %>%
ggplot(aes(marital)) +
geom_bar()
Use fct_rev to reverse the order:
gss_cat %>% mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
ggplot(aes(marital)) +
geom_bar()
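The same two functions can be checked without a plot. A minimal sketch on a made-up factor `f`, assuming forcats is installed:

```r
library(forcats)

# A hypothetical factor in which "z" occurs three times, "y" twice, "x" once
f <- factor(c("x", "y", "y", "z", "z", "z"))

# Most frequent level first
by_freq <- fct_infreq(f)
levels(by_freq)

# Reversed: least frequent level first
reversed <- fct_rev(by_freq)
levels(reversed)
```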
Even more useful than changing the order of the levels is changing their values:
gss_cat %>% count(partyid)
## # A tibble: 10 x 2
## partyid n
## <fct> <int>
## 1 No answer 154
## 2 Don't know 1
## 3 Other party 393
## 4 Strong republican 2314
## 5 Not str republican 3032
## 6 Ind,near rep 1791
## 7 Independent 4119
## 8 Ind,near dem 2499
## 9 Not str democrat 3690
## 10 Strong democrat 3490
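The level labels above are terse and inconsistent. fct_recode lets you rewrite them, with the new name on the left and the old name on the right; levels you do not mention are left unchanged: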
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)) %>%
count(partyid)
## # A tibble: 10 x 2
## partyid n
## <fct> <int>
## 1 No answer 154
## 2 Don't know 1
## 3 Other party 393
## 4 Republican, strong 2314
## 5 Republican, weak 3032
## 6 Independent, near rep 1791
## 7 Independent 4119
## 8 Independent, near dem 2499
## 9 Democrat, weak 3690
## 10 Democrat, strong 3490
You can also merge several levels into one by assigning them the same new name:
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) %>%
count(partyid)
## # A tibble: 8 x 2
## partyid n
## <fct> <int>
## 1 Other 548
## 2 Republican, strong 2314
## 3 Republican, weak 3032
## 4 Independent, near rep 1791
## 5 Independent 4119
## 6 Independent, near dem 2499
## 7 Democrat, weak 3690
## 8 Democrat, strong 3490
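The merging behaviour can be confirmed on a small factor. This is a minimal sketch, assuming forcats is installed; the factor `f` and the label `fruit` are made up for illustration:

```r
library(forcats)

# A hypothetical factor; its levels are sorted alphabetically: apple, banana, bear
f <- factor(c("apple", "bear", "banana"))

# Two old levels are assigned the same new name, so they merge into one level
recoded <- fct_recode(f, fruit = "apple", fruit = "banana")
levels(recoded)

# The unmentioned level "bear" is left as it was
as.character(recoded)
```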
If you want to merge many levels, fct_collapse is a handy variant of fct_recode: each new level takes a vector of old levels.
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) %>%
count(partyid)
## # A tibble: 4 x 2
## partyid n
## <fct> <int>
## 1 other 548
## 2 rep 5346
## 3 ind 8409
## 4 dem 7180
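As a quick check of how fct_collapse treats levels that are not listed, here is a minimal sketch on a made-up factor `f`, assuming forcats is installed:

```r
library(forcats)

# A hypothetical factor with four levels
f <- factor(c("a", "b", "c", "d"))

# Only "a" and "b" are collapsed; "c" and "d" are kept as they are
collapsed <- fct_collapse(f, ab = c("a", "b"))
levels(collapsed)
as.character(collapsed)
```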