Lesson 6: Tidying data

December 4, 2017

Tidying data

This lesson covers tidyr, a package that provides tools for tidying messy data sets.

library(tidyr)

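What is a tidy data set?

The same data can be organized in different ways. The examples below show tuberculosis case counts recorded by the World Health Organization for Afghanistan, Brazil, and China in 1999 and 2000. Each data set contains the same four variables (country, year, population, cases), but they are organized in four different ways.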

table1

# A tibble: 6 x 4
      country  year  cases population
        <chr> <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583

table2

# A tibble: 12 x 4
       country  year       type      count
         <chr> <int>      <chr>      <int>
 1 Afghanistan  1999      cases        745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000      cases       2666
 4 Afghanistan  2000 population   20595360
 5      Brazil  1999      cases      37737
 6      Brazil  1999 population  172006362
 7      Brazil  2000      cases      80488
 8      Brazil  2000 population  174504898
 9       China  1999      cases     212258
10       China  1999 population 1272915272
11       China  2000      cases     213766
12       China  2000 population 1280428583

table3

# A tibble: 6 x 3
      country  year              rate
*       <chr> <int>             <chr>
1 Afghanistan  1999      745/19987071
2 Afghanistan  2000     2666/20595360
3      Brazil  1999   37737/172006362
4      Brazil  2000   80488/174504898
5       China  1999 212258/1272915272
6       China  2000 213766/1280428583

The same data can also be split across two data frames.

table4a # cases

# A tibble: 3 x 3
      country `1999` `2000`
*       <chr>  <int>  <int>
1 Afghanistan    745   2666
2      Brazil  37737  80488
3       China 212258 213766

table4b # population

# A tibble: 3 x 3
      country     `1999`     `2000`
*       <chr>      <int>      <int>
1 Afghanistan   19987071   20595360
2      Brazil  172006362  174504898
3       China 1272915272 1280428583

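Although all of these data sets store the same data, only table1 is tidy and easy to work with; table2, table3, and table4a/table4b are not.

A tidy data set satisfies three interrelated rules:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Why use tidy data sets?

When every data set shares one consistent structure, analysis becomes easier.
Placing variables in columns takes full advantage of R's vectorised operations. mutate(), the summary functions, and most built-in R functions work on vectors. dplyr, ggplot2, and many other R packages are likewise designed to work with tidy data.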
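
As a minimal sketch of the vectorisation point, the snippet below derives a new variable and a per-year total from table1 in one expression each. It assumes dplyr is installed alongside tidyr; the "per 10,000 people" scaling and the names rates and cases_by_year are illustrative choices, not part of the lesson.

```r
library(tidyr)   # provides the example data set table1
library(dplyr)   # provides mutate() and count()

# Each variable is a column, so a new variable is a single vectorised
# expression over whole columns (scaling by 10,000 is an arbitrary choice).
rates <- table1 %>%
  mutate(rate = cases / population * 10000)

# The year column can be used directly as a grouping variable;
# wt = cases sums the case counts within each year.
cases_by_year <- table1 %>%
  count(year, wt = cases)
```

The same calculations on table2 or table3 would first require reshaping or splitting columns, which is what the rest of the lesson covers.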

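Although the principles of tidy data are simple, untidy data sets are common. The main reasons are:

Most people are not familiar with the principles of tidy data.
Data is often organized to make entry, storage, or output easier, not analysis.

Untidy data sets usually suffer from one of two problems:

One variable is spread across multiple columns.
One observation is scattered across multiple rows.

Fixing these problems calls for the two most important functions in tidyr: gather() and spread().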

gather

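The first common problem is a data set in which some column names are values of a variable rather than variable names. In table4a, the column names 1999 and 2000 are really values of the year variable, and each row holds two observations instead of one.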

table4a

# A tibble: 3 x 3
      country `1999` `2000`
*       <chr>  <int>  <int>
1 Afghanistan    745   2666
2      Brazil  37737  80488
3       China 212258 213766

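To tidy a data set like table4a, use gather() to collect the 1999 and 2000 columns into a new pair of variables. gather() needs three pieces of information:

The name of the new variable that will hold the original column names, passed as key. For table4a this is year.
The name of the new variable that will hold the values from those columns, passed as value. For table4a this is cases.
The set of columns whose names are values rather than variables. For table4a these are the 1999 and 2000 columns.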

table4a %>% gather(key = "year", value = "cases", `1999`, `2000`)

# A tibble: 6 x 3
      country  year  cases
        <chr> <chr>  <int>
1 Afghanistan  1999    745
2      Brazil  1999  37737
3       China  1999 212258
4 Afghanistan  2000   2666
5      Brazil  2000  80488
6       China  2000 213766

In the result, the gathered columns are dropped and replaced by the new key and value columns. table4b can be tidied with gather() in the same way.

table4b %>% gather("year", "population", `1999`, `2000`)

# A tibble: 6 x 3
      country  year population
        <chr> <chr>      <int>
1 Afghanistan  1999   19987071
2      Brazil  1999  172006362
3       China  1999 1272915272
4 Afghanistan  2000   20595360
5      Brazil  2000  174504898
6       China  2000 1280428583

Note that the column names 1999 and 2000 are not syntactic R names (they do not start with a letter), so they must be wrapped in backticks.
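
The columns to gather can also be chosen by exclusion, which avoids typing the backticked names. The sketch below gathers every column except country and uses gather()'s convert argument so that the new year column is parsed as a number instead of staying character; the name long_cases is only illustrative.

```r
library(tidyr)

# Gather every column except the identifier column `country`.
# convert = TRUE asks gather() to guess a better type for the key column,
# so "1999" and "2000" become numbers rather than strings.
long_cases <- table4a %>%
  gather(key = "year", value = "cases", -country, convert = TRUE)
```

Selecting by exclusion is convenient when a table has many year columns and only a few identifier columns.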

Use left_join() to combine the tidied versions of table4a and table4b into a single data frame.

tidy4a <- table4a %>% gather("year", "cases", `1999`, `2000`)
tidy4b <- table4b %>% gather("year", "population", `1999`, `2000`)
dplyr::left_join(tidy4a, tidy4b)

Joining, by = c("country", "year")

# A tibble: 6 x 4
      country  year  cases population
        <chr> <chr>  <int>      <int>
1 Afghanistan  1999    745   19987071
2      Brazil  1999  37737  172006362
3       China  1999 212258 1272915272
4 Afghanistan  2000   2666   20595360
5      Brazil  2000  80488  174504898
6       China  2000 213766 1280428583
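
The "Joining, by = ..." message appears because left_join() had to guess the key columns. A small variation, sketched here under the assumption that dplyr is attached, names the keys explicitly and sorts the result so the rows follow the same order as table1; the name joined is illustrative.

```r
library(tidyr)
library(dplyr)

tidy4a <- table4a %>% gather("year", "cases", `1999`, `2000`)
tidy4b <- table4b %>% gather("year", "population", `1999`, `2000`)

# Stating the key columns documents the intent of the join and
# avoids relying on the automatic choice of shared column names.
joined <- left_join(tidy4a, tidy4b, by = c("country", "year")) %>%
  arrange(country, year)
```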

spread

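spread() is the opposite of gather(). Use it when an observation is scattered across multiple rows. An observation here is one country in one year and should occupy a single row, but in table2 each observation takes two rows.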

table2

# A tibble: 12 x 4
       country  year       type      count
         <chr> <int>      <chr>      <int>
 1 Afghanistan  1999      cases        745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000      cases       2666
 4 Afghanistan  2000 population   20595360
 5      Brazil  1999      cases      37737
 6      Brazil  1999 population  172006362
 7      Brazil  2000      cases      80488
 8      Brazil  2000 population  174504898
 9       China  1999      cases     212258
10       China  1999 population 1272915272
11       China  2000      cases     213766
12       China  2000 population 1280428583

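spread() needs only two parameters:

The column that holds the variable names, passed as key. In table2 this is type.
The column that holds the values, passed as value. In table2 this is count.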

spread(table2, key = type, value = count)

# A tibble: 6 x 4
      country  year  cases population
*       <chr> <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583

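spread() and gather() both take key and value arguments, and the two functions complement each other. gather() makes wide data frames narrower and longer; spread() makes long data frames shorter and wider.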
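
To see the two functions working against each other, the sketch below spreads table2 and then gathers the result back. The object names wide and long are illustrative; the row counts in the comments follow from the 12-row table2 shown earlier.

```r
library(tidyr)

# Long to wide: one row per country and year (12 rows become 6).
wide <- table2 %>%
  spread(key = type, value = count)

# Wide to long again: the cases and population columns are folded
# back into a type/count pair (6 rows become 12).
long <- wide %>%
  gather(key = "type", value = "count", cases, population)
```

The rows of long come back grouped by type rather than in the original order of table2, so a sort is needed before comparing the two row by row.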

Exercises

Why are gather() and spread() not perfectly symmetrical? Consider the following example.

stocks <- tibble(
    year = c(2015, 2015, 2016, 2016), 
    half = c(1, 2, 1, 2), 
    returns = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>% spread(year, returns) %>% 
    gather("year", "returns", `2015`:`2016`)

Why does the code below fail, and what should the correct code be?

table4a %>% gather(1999, 2000, key = "year", value = "cases")

Error in combine_vars(vars, ind_list): Position must be between 0 and n

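separate and unite

The sections above tidied table2 and table4a/table4b; this one deals with table3. The rate column in table3 actually holds two variables: the number of cases and the population.
separate() pulls one column apart into several by splitting wherever a separator character appears.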

table3

# A tibble: 6 x 3
      country  year              rate
*       <chr> <int>             <chr>
1 Afghanistan  1999      745/19987071
2 Afghanistan  2000     2666/20595360
3      Brazil  1999   37737/172006362
4      Brazil  2000   80488/174504898
5       China  1999 212258/1272915272
6       China  2000 213766/1280428583

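The rate column holds both the cases and population variables and needs to be split in two. separate() takes the name of the column to split, followed by into, the names of the columns to split it into.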

table3 %>% separate(rate, into = c("cases", "population"))

# A tibble: 6 x 4
      country  year  cases population
*       <chr> <int>  <chr>      <chr>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583

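By default, separate() splits values wherever it finds a non-alphanumeric character, that is, a character that is neither a letter nor a digit. In table3, for example, it splits rate at the /. To split at a specific character instead, pass it to the sep argument (a character sep is interpreted as a regular expression).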

table3 %>% separate(rate, into = c("cases", "population"), sep = "/")

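By default, the columns produced by separate() keep the type of the original column. Because rate is a character column in table3, cases and population are character columns as well. To have separate() try to convert the new columns to a more suitable type, here integer, use convert = TRUE.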

table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE)

# A tibble: 6 x 4
      country  year  cases population
*       <chr> <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583
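
The sep and convert arguments can be combined. The sketch below splits rate explicitly at "/" and converts the pieces to numbers, after which the rate can be recomputed with ordinary arithmetic. The column name rate_per_10k and the scaling factor are illustrative assumptions.

```r
library(tidyr)

# Split explicitly on "/" and let separate() convert the pieces to numbers.
split_rate <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# With numeric columns, the rate can be recomputed directly
# (cases per 10,000 people is an arbitrary choice of scale).
split_rate$rate_per_10k <- split_rate$cases / split_rate$population * 10000
```

Without convert = TRUE the division would fail, because cases and population would still be character columns.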

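sep can also be a vector of integers, which separate() interprets as positions to split at. Positive values count from 1 at the far left of the string; negative values count from -1 at the far right. When splitting by position, the length of sep should be one less than the number of names in into.
For example, this separates the century from the last two digits of each year.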

table3 %>% separate(year, into = c("century", "year"), sep = 2)

# A tibble: 6 x 4
      country century  year              rate
*       <chr>   <chr> <chr>             <chr>
1 Afghanistan      19    99      745/19987071
2 Afghanistan      20    00     2666/20595360
3      Brazil      19    99   37737/172006362
4      Brazil      20    00   80488/174504898
5       China      19    99 212258/1272915272
6       China      20    00 213766/1280428583
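
To illustrate the rule that sep has one fewer element than into, here is a small sketch that splits at two positions. The dates data frame is hypothetical and not part of the lesson's data; it assumes dates stored as eight-character YYYYMMDD strings.

```r
library(tidyr)

# Hypothetical data: dates stored as eight-character strings (YYYYMMDD).
dates <- tibble::tibble(date = c("20171204", "19991231"))

# Two split positions (after characters 4 and 6) give three new columns.
parts <- dates %>%
  separate(date, into = c("year", "month", "day"), sep = c(4, 6))
```

The three new columns are still character columns; convert = TRUE could be added if numbers are wanted.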

unite

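unite() is the inverse of separate(): it combines multiple columns into a single column. It is needed less often than separate().

unite() can be used to rejoin the century and year columns.
Take table5 as an example.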

table5

# A tibble: 6 x 4
      country century  year              rate
*       <chr>   <chr> <chr>             <chr>
1 Afghanistan      19    99      745/19987071
2 Afghanistan      20    00     2666/20595360
3      Brazil      19    99   37737/172006362
4      Brazil      20    00   80488/174504898
5       China      19    99 212258/1272915272
6       China      20    00 213766/1280428583

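unite() takes a data frame, the name of the new column, and the set of columns to combine. By default it places an underscore (_) between the values from the different columns.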

table5 %>% unite(new, century, year)

# A tibble: 6 x 3
      country   new              rate
*       <chr> <chr>             <chr>
1 Afghanistan 19_99      745/19987071
2 Afghanistan 20_00     2666/20595360
3      Brazil 19_99   37737/172006362
4      Brazil 20_00   80488/174504898
5       China 19_99 212258/1272915272
6       China 20_00 213766/1280428583

If no separator is wanted, set sep to "".

table5 %>% unite(new, century, year, sep = "")

# A tibble: 6 x 3
      country   new              rate
*       <chr> <chr>             <chr>
1 Afghanistan  1999      745/19987071
2 Afghanistan  2000     2666/20595360
3      Brazil  1999   37737/172006362
4      Brazil  2000   80488/174504898
5       China  1999 212258/1272915272
6       China  2000 213766/1280428583
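
As the <chr> header in the output above shows, the united column is a character column even when its contents look like numbers. A minimal sketch of converting it afterwards with base R (the name united is illustrative):

```r
library(tidyr)

united <- table5 %>%
  unite(new, century, year, sep = "")

# unite() pastes strings together, so convert explicitly
# when a numeric year is needed.
united$new <- as.integer(united$new)
```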

Exercises

What do the extra and fill arguments of separate() do? Try them out on the following data frames.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
    separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
    separate(x, c("one", "two", "three"))

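Both unite() and separate() have a remove argument. What does it do?

Missing values

Changing the representation of a data frame brings up the question of missing values. A value can be missing in one of two ways:

Explicitly, flagged with NA.
Implicitly, simply not present in the data frame.

A very simple data frame illustrates the difference: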

stocks <- tibble(
  year    = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr     = c(   1,    2,    3,   4,    2,    3,   4),
  returns = c(1.88,  0.59, 0.35,  NA,  0.92, 0.17, 2.66)
)

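The return for the fourth quarter of 2015 is explicitly missing: its value is NA.
The return for the first quarter of 2016 is implicitly missing: the whole row is absent.

An explicit missing value signals the gap directly with NA; an implicit missing value gives no direct signal.

complete() makes implicit missing values explicit.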

stocks %>% complete(year, qtr)

# A tibble: 8 x 3
   year   qtr returns
  <dbl> <dbl>   <dbl>
1  2015     1    1.88
2  2015     2    0.59
3  2015     3    0.35
4  2015     4      NA
5  2016     1      NA
6  2016     2    0.92
7  2016     3    0.17
8  2016     4    2.66

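complete() finds every combination of the values in a set of columns. Whenever a combination is absent, it adds a row for that combination and fills the remaining columns with NA.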
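
Reshaping itself can also change how missing values are represented, which is the point made at the start of this section. The sketch below rebuilds the stocks data frame so that it is self-contained, then uses spread() and gather(); wide and long are illustrative names.

```r
library(tidyr)

stocks <- tibble::tibble(
  year    = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr     = c(   1,    2,    3,    4,    2,    3,    4),
  returns = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Spreading years into columns creates a cell for every year/qtr pair,
# so the implicitly missing 2016 Q1 value shows up as an explicit NA.
wide <- stocks %>% spread(year, returns)

# Gathering back with na.rm = TRUE drops the rows whose value is NA,
# turning the explicit missing values into implicit ones.
long <- wide %>%
  gather("year", "returns", `2015`, `2016`, na.rm = TRUE)
```

Whether missing values should be explicit or implicit depends on the analysis; the sketch only shows that both directions are possible.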

Sometimes, for convenience of data entry, a missing value means that the previous non-missing value should be carried forward.

treatment <- tibble(
  person    = c("Derrick Whitmore", NA, NA, "Katherine Burke"),
  treatment = c(1, 2, 3, 1),
  response  = c(7, 10, 9, 4)
)

fill() replaces missing values with the most recent non-missing value.

treatment %>% fill(person)

# A tibble: 4 x 3
            person treatment response
             <chr>     <dbl>    <dbl>
1 Derrick Whitmore         1        7
2 Derrick Whitmore         2       10
3 Derrick Whitmore         3        9
4  Katherine Burke         1        4

Exercises

What does the direction argument of fill() do?