5  工作流程:代码风格

良好的编码风格就像正确的标点符号:没有它你也可以应付,但它确实会让事情更容易阅读。即使作为一个非常新的程序员,调整自己的代码风格也是一个好主意。使用一致的风格可以让其他人(包括未来的你!)更轻松地阅读您的作品,如果您需要从其他人那里获得帮助,这一点尤其重要。我们将使用 tidyverse 和 nycflights13 包作为本章的代码示例。

library(pacman)
p_load(tidyverse,styler,lintr,nycflights13)#加载两个代码风格管理r包

命令选项板允许您使用任何内置 RStudio 命令和软件包提供的许多插件。按 Cmd/Ctrl + Shift + P 打开调色板,然后输入”styler”以查看 styler 提供的所有快捷方式。

flights |>
  group_by(dest) |>
  summarize(
    distance = mean(distance),
    speed = mean(distance / air_time, na.rm = TRUE)
  ) |>
  ggplot(aes(x = distance, y = speed)) +
  geom_smooth(
    method = "loess",
    span = 0.5,
    se = FALSE,
    color = "white",
    linewidth = 4
  ) +
  geom_point()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).

5.5分段注释

随着脚本变得越来越长,您可以使用分段注释将文件分成可管理的部分:

# Load data --------------------------------------

# Plot data --------------------------------------

RStudio 提供了一个键盘快捷键来创建这些标题(Cmd/Ctrl + Shift + R),并将它们显示在编辑器左下角的代码导航下拉列表中

6  Data tidying

6.1 Introduction

"Happy families are all alike; every unhappy family is unhappy in its own way."
— Leo Tolstoy

"Tidy datasets are all alike, but every messy dataset is messy in its own way."
— Hadley Wickham

在本章中,您将学习使用在 R 中被称为tidy data 的系统进行组织数据的一致方法。将数据转换为这种格式需要预先进行一些工作,但从长远来看,这些工作会得到回报。一旦您拥有了整洁的数据以及 tidyverse 中的软件包提供的整洁工具,您将花费更少的时间将数据从一种表示形式转换为另一种表示形式,从而使您可以将更多时间花在您关心的数据问题上。

在本章中,您将首先学习整洁数据的定义,并将其应用于简单的玩具数据集。然后我们将深入探讨用于整理数据的主要工具:数据透视(pivoting)。透视允许您更改数据的形式而不更改任何值。

6.1.1先决条件

在本章中,我们将重点关注 tidyr,这个包提供了一系列工具来帮助整理杂乱的数据集。tidyr 是 tidyverse 核心的成员。

library(pacman)
p_load(tidyverse)

6.2整理数据

您可以用多种方式表示相同的基础数据。下面的示例显示了以三种不同方式组织的相同数据。每个数据集显示四个变量的相同值:国家年份人口population和记录的结核病病例数case,但每个数据集以不同的方式组织这些值。

table1
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <dbl>  <dbl>      <dbl>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

table2
#> # A tibble: 12 × 4
#>   country      year type           count
#>   <chr>       <dbl> <chr>          <dbl>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # ℹ 6 more rows

table3
#> # A tibble: 6 × 3
#>   country      year rate             
#>   <chr>       <dbl> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

使数据集整洁的三个相互关联的规则:

  1. 每个变量都是一列;每列都是一个变量。

  2. 每个观察值都是一行;每一行都是一个观察结果。

  3. 每个值都是一个单元格;每个单元格都是一个值。

dplyr、ggplot2 和 tidyverse 中的所有其他软件包都设计用于处理类似table1中的 tidy 数据。以下是一些小示例,展示了如何使用table1.

# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000

6.2.1练习

2.把table2整理为table3形式

# Pivot table2 to table1 format
table1 <- table2 %>%
  pivot_wider(
    names_from = type,
    values_from = count
  )

# Rename the columns to match the desired format
colnames(table1) <- c("country", "year", "cases", "population")

# Print the resulting table1
print(table1)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
#convert to table3
table3 <- table1 |> 
  mutate(rate = (cases / population) * 10000) |> 
  select(country, year,rate)
table3
## # A tibble: 6 × 3
##   country      year  rate
##   <chr>       <dbl> <dbl>
## 1 Afghanistan  1999 0.373
## 2 Afghanistan  2000 1.29 
## 3 Brazil       1999 2.19 
## 4 Brazil       2000 4.61 
## 5 China        1999 1.67 
## 6 China        2000 1.67

6.3 Lengthening data(长数据)

tidyr provides two functions for pivoting data: pivot_longer() and pivot_wider(). We'll first start with pivot_longer() because it's the most common case. Let's dive into some examples. (类似于Stata中的reshape命令)

6.3.1列名中的数据

billboard数据集记录了2000年歌曲的排行榜排名:

billboard
## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week. Here, the column names are one variable (the week) and the cell values are another (the rank).

为了整理这些数据,我们将使用pivot_longer()

billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank"
  )
## # A tibble: 24,092 × 5
##    artist track                   date.entered week   rank
##    <chr>  <chr>                   <date>       <chr> <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
## # ℹ 24,082 more rows

After the data, there are three key arguments:

  • cols specifies which columns need to be pivoted, i.e. which columns aren't variables. This argument uses the same syntax as select() so here we could use !c(artist, track, date.entered) or starts_with("wk").

  • names_to names the variable stored in the column names, we named that variable week.

  • values_to names the variable stored in the cell values, we named that variable rank.

请注意,在代码中"week""rank"被引用,因为这些是我们正在创建的新变量,当我们运行调用pivot_longer()时它们还不存在于数据中。

其中的NA并不真正代表未知的观测结果;数据集的结构迫使它们存在,因此我们可以通过设置values_drop_na=TRUE来请求pivot_longer()将其清除

billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  )
## # A tibble: 5,307 × 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
## # ℹ 5,297 more rows

我们可以使用mutate()readr::parse_number()将周的值从字符串转换为数字,从而使未来的计算更加容易。parse_number()是一个方便的函数,它将从字符串中提取第一个数字,忽略所有其他文本。

billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE,
  ) |> 
  mutate(week = parse_number(week))
## # A tibble: 5,307 × 5
##    artist  track                   date.entered  week  rank
##    <chr>   <chr>                   <date>       <dbl> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26       1    87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26       2    82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26       3    72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26       4    77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26       5    87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26       6    94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26       7    99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02       1    91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02       2    87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02       3    92
## # ℹ 5,297 more rows

6.3.2 How does pivoting work?

现在您已经了解了如何使用数据透视来重塑数据,接下来让我们花一点时间来直观地了解数据透视对数据的作用。让我们从一个非常简单的数据集开始,以便更容易地了解正在发生的情况。假设我们有 3 名患有ids A、B 和 C 的患者,并且我们对每位患者进行两次血压测量。我们将使用 来创建数据tribble(),这是一个手动构建小 tibbles 的便捷函数:

df <- tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)
df
## # A tibble: 3 × 3
##   id      bp1   bp2
##   <chr> <dbl> <dbl>
## 1 A       100   120
## 2 B       140   115
## 3 C       120   125

We want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values). To achieve this, we need to pivot df longer:

df |> 
  pivot_longer(
  cols = bp1:bp2,
  names_to = "measurement",
  values_to = "value"
)
## # A tibble: 6 × 3
##   id    measurement value
##   <chr> <chr>       <dbl>
## 1 A     bp1           100
## 2 A     bp2           120
## 3 B     bp1           140
## 4 B     bp2           115
## 5 C     bp1           120
## 6 C     bp2           125

The column names become values in a new variable, whose name is defined by names_to. They need to be repeated once for each row in the original dataset.

The cell values also become values in a new variable, with a name defined by values_to. They are unwound row by row.

6.3.3 Many variables in column names

当您将多条信息塞进列名中,并且希望将这些信息存储在单独的新变量中时,就会出现更具挑战性的情况。例如,以who2数据集为例,它是表1和您在上面看到的朋友的来源:

who2
## # A tibble: 7,240 × 58
##    country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
##    <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 Afghanistan  1980       NA        NA        NA        NA        NA        NA
##  2 Afghanistan  1981       NA        NA        NA        NA        NA        NA
##  3 Afghanistan  1982       NA        NA        NA        NA        NA        NA
##  4 Afghanistan  1983       NA        NA        NA        NA        NA        NA
##  5 Afghanistan  1984       NA        NA        NA        NA        NA        NA
##  6 Afghanistan  1985       NA        NA        NA        NA        NA        NA
##  7 Afghanistan  1986       NA        NA        NA        NA        NA        NA
##  8 Afghanistan  1987       NA        NA        NA        NA        NA        NA
##  9 Afghanistan  1988       NA        NA        NA        NA        NA        NA
## 10 Afghanistan  1989       NA        NA        NA        NA        NA        NA
## # ℹ 7,230 more rows
## # ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
## #   sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
## #   sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
## #   sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
## #   sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
## #   sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …
# write.csv(who2, file="myCSVFile.csv")
# write.xlsx(who2, file = "My_openxlsx_File.xlsx")

该数据集由世界卫生组织收集,记录了有关结核病诊断的信息。有两列已经是变量,很容易解释:国家和年份。它们后面跟有56列,如sp_m_014、ep_m_4554和rel_m_3544。如果你盯着这些列看足够长的时间,你会注意到有一种模式。每个列名由三个部分组成,用_分隔。第一部分sp/rel/ep描述了用于诊断的方法,第二部分m/f是性别(在该数据集中编码为二进制变量),第三部分014/1524/2534/3544/4554/5564/65是年龄范围(例如,014表示0-14)。

who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"), 
    names_sep = "_",
    values_to = "count"
  )
## # A tibble: 405,440 × 6
##    country      year diagnosis gender age   count
##    <chr>       <dbl> <chr>     <chr>  <chr> <dbl>
##  1 Afghanistan  1980 sp        m      014      NA
##  2 Afghanistan  1980 sp        m      1524     NA
##  3 Afghanistan  1980 sp        m      2534     NA
##  4 Afghanistan  1980 sp        m      3544     NA
##  5 Afghanistan  1980 sp        m      4554     NA
##  6 Afghanistan  1980 sp        m      5564     NA
##  7 Afghanistan  1980 sp        m      65       NA
##  8 Afghanistan  1980 sp        f      014      NA
##  9 Afghanistan  1980 sp        f      1524     NA
## 10 Afghanistan  1980 sp        f      2534     NA
## # ℹ 405,430 more rows

6.3.4列标题中的数据和变量名称

复杂性的下一步是列名包含变量值和变量名的混合。以数据集为例household

household
## # A tibble: 5 × 5
##   family dob_child1 dob_child2 name_child1 name_child2
##    <int> <date>     <date>     <chr>       <chr>      
## 1      1 1998-11-26 2000-01-29 Susan       Jose       
## 2      2 1996-06-22 NA         Mark        <NA>       
## 3      3 2002-07-11 2004-04-05 Sam         Seth       
## 4      4 2004-10-10 2009-08-27 Craig       Khai       
## 5      5 2000-12-05 2005-02-28 Parker      Gracie

该数据集包含五个家庭的数据,最多有两个孩子的名字和出生日期。该数据集中的新挑战是列名包含两个变量的名称(dob,name)和另一个变量的值(child,值为1或2)。为了解决这个问题,我们再次需要向names_to提供一个向量,但这次我们使用了特殊的”.value“标记;这不是一个变量的名称,而是一个唯一的值,告诉pivot_longer()做一些不同的事情。这将重写常用的values_to参数,以将数据透视列名的第一个组件用作输出中的变量名。

household |> 
  pivot_longer(
    cols = !family, 
    names_to = c(".value", "child"), 
    names_sep = "_", 
    values_drop_na = TRUE
  )
## # A tibble: 9 × 4
##   family child  dob        name  
##    <int> <chr>  <date>     <chr> 
## 1      1 child1 1998-11-26 Susan 
## 2      1 child2 2000-01-29 Jose  
## 3      2 child1 1996-06-22 Mark  
## 4      3 child1 2002-07-11 Sam   
## 5      3 child2 2004-04-05 Seth  
## 6      4 child1 2004-10-10 Craig 
## 7      4 child2 2009-08-27 Khai  
## 8      5 child1 2000-12-05 Parker
## 9      5 child2 2005-02-28 Gracie

6.4 Widening data(宽数据)

我们首先查看cms_patient_experience来自医疗保险和医疗补助服务中心的数据集,该数据集收集有关患者体验的数据:

view(cms_patient_experience)

pivot_wider() has the opposite interface to pivot_longer(): instead of choosing new column names, we need to provide the existing columns that define the values (values_from) and the column name (names_from)

cms_patient_experience |> 
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
## # A tibble: 95 × 8
##    org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5 CAHPS_GRP_8
##    <chr>      <chr>        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
##  1 0446157747 USC C…          63          87          86          57          85
##  2 0446162697 ASSOC…          59          85          83          63          88
##  3 0547164295 BEAVE…          49          NA          75          44          73
##  4 0749333730 CAPE …          67          84          85          65          82
##  5 0840104360 ALLIA…          66          87          87          64          87
##  6 0840109864 REX H…          73          87          84          67          91
##  7 0840513552 SCL H…          58          83          76          58          78
##  8 0941545784 GRITM…          46          86          81          54          NA
##  9 1052612785 COMMU…          65          84          80          58          87
## 10 1254237779 OUR L…          61          NA          NA          65          NA
## # ℹ 85 more rows
## # ℹ 1 more variable: CAHPS_GRP_12 <dbl>

上面这段代码将原来的一列变量measure_cd 按照不同的检测内容拆分为了单独的六列变量:

CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5 CAHPS_GRP_8 CAHPS_GRP_12

6.4.1 How does pivot_wider() work?

让我们再次从一个非常简单的数据集开始。这次我们有两名id为 A 和 B 的患者,我们对患者 A 进行了三次血压测量,对患者 B 进行了两次血压测量:

df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)
df
## # A tibble: 5 × 3
##   id    measurement value
##   <chr> <chr>       <dbl>
## 1 A     bp1           100
## 2 B     bp1           140
## 3 B     bp2           115
## 4 A     bp2           120
## 5 A     bp3           105

We'll take the values from the value column and the names from the measurement column:

df |> 
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
## # A tibble: 2 × 4
##   id      bp1   bp2   bp3
##   <chr> <dbl> <dbl> <dbl>
## 1 A       100   120   105
## 2 B       140   115    NA