良好的编码风格就像正确的标点符号:没有它你也可以应付,但它确实会让事情更容易阅读。即使作为一个非常新的程序员,调整自己的代码风格也是一个好主意。使用一致的风格可以让其他人(包括未来的你!)更轻松地阅读您的作品,如果您需要从其他人那里获得帮助,这一点尤其重要。我们将使用 tidyverse 和 nycflights13 包作为本章的代码示例。
命令选项板允许您使用任何内置 RStudio 命令和软件包提供的许多插件。按 Cmd/Ctrl + Shift + P 打开调色板,然后输入”styler”以查看 styler 提供的所有快捷方式。
flights |>
group_by(dest) |>
summarize(
distance = mean(distance),
speed = mean(distance / air_time, na.rm = TRUE)
) |>
ggplot(aes(x = distance, y = speed)) +
geom_smooth(
method = "loess",
span = 0.5,
se = FALSE,
color = "white",
linewidth = 4
) +
geom_point()## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
随着脚本变得越来越长,您可以使用分段注释将文件分成可管理的部分:
# Load data --------------------------------------
# Plot data --------------------------------------RStudio 提供了一个键盘快捷键来创建这些标题(Cmd/Ctrl + Shift + R),并将它们显示在编辑器左下角的代码导航下拉列表中
"Happy families are all alike; every unhappy family is unhappy in its own way."
— Leo Tolstoy
"Tidy datasets are all alike, but every messy dataset is messy in its own way."
— Hadley Wickham
在本章中,您将学习使用在 R 中被称为tidy data 的系统进行组织数据的一致方法。将数据转换为这种格式需要预先进行一些工作,但从长远来看,这些工作会得到回报。一旦您拥有了整洁的数据以及 tidyverse 中的软件包提供的整洁工具,您将花费更少的时间将数据从一种表示形式转换为另一种表示形式,从而使您可以将更多时间花在您关心的数据问题上。
在本章中,您将首先学习整洁数据的定义,并将其应用于简单的玩具数据集。然后我们将深入探讨用于整理数据的主要工具:数据透视(pivoting)。透视允许您更改数据的形式而不更改任何值。
您可以用多种方式表示相同的基础数据。下面的示例显示了以三种不同方式组织的相同数据。每个数据集显示四个变量的相同值:国家、年份、人口population和记录的结核病病例数case,但每个数据集以不同的方式组织这些值。
table1
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
table2
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <dbl> <chr> <dbl>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> # ℹ 6 more rows
table3
#> # A tibble: 6 × 3
#> country year rate
#> <chr> <dbl> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583使数据集整洁的三个相互关联的规则:
每个变量都是一列;每列都是一个变量。
每个观察值都是一行;每一行都是一个观察结果。
每个值都是一个单元格;每个单元格都是一个值。
dplyr、ggplot2 和 tidyverse
中的所有其他软件包都设计用于处理类似table1中的 tidy
数据。以下是一些小示例,展示了如何使用table1.
# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
geom_line(aes(group = country), color = "grey50") +
geom_point(aes(color = country, shape = country)) +
scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 20002.把table2整理为table3形式
# Pivot table2 to table1 format
table1 <- table2 %>%
pivot_wider(
names_from = type,
values_from = count
)
# Rename the columns to match the desired format
colnames(table1) <- c("country", "year", "cases", "population")
# Print the resulting table1
print(table1)## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
#convert to table3
table3 <- table1 |>
mutate(rate = (cases / population) * 10000) |>
select(country, year,rate)
table3## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <dbl>
## 1 Afghanistan 1999 0.373
## 2 Afghanistan 2000 1.29
## 3 Brazil 1999 2.19
## 4 Brazil 2000 4.61
## 5 China 1999 1.67
## 6 China 2000 1.67
tidyr provides two functions for pivoting data: pivot_longer()
and pivot_wider().
We'll first start with pivot_longer()
because it's the most common case. Let's dive into some examples.
(类似于Stata中的reshape命令)
该billboard数据集记录了2000年歌曲的排行榜排名:
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week. Here, the column names are one variable (the week) and the cell values are another (the rank).
为了整理这些数据,我们将使用pivot_longer():
## # A tibble: 24,092 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
## 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
## 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
## # ℹ 24,082 more rows
After the data, there are three key arguments:
cols specifies which columns need to be pivoted,
i.e. which columns aren't variables. This argument uses the same syntax
as select()
so here we could use !c(artist, track, date.entered) or
starts_with("wk").
names_to names the variable stored in the column
names, we named that variable week.
values_to names the variable stored in the cell
values, we named that variable rank.
请注意,在代码中"week"和"rank"被引用,因为这些是我们正在创建的新变量,当我们运行调用pivot_longer()时它们还不存在于数据中。
其中的NA并不真正代表未知的观测结果;数据集的结构迫使它们存在,因此我们可以通过设置values_drop_na=TRUE来请求pivot_longer()将其清除
billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
我们可以使用mutate()和readr::parse_number()将周的值从字符串转换为数字,从而使未来的计算更加容易。parse_number()是一个方便的函数,它将从字符串中提取第一个数字,忽略所有其他文本。
billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE,
) |>
mutate(week = parse_number(week))## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <dbl> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # ℹ 5,297 more rows
现在您已经了解了如何使用数据透视来重塑数据,接下来让我们花一点时间来直观地了解数据透视对数据的作用。让我们从一个非常简单的数据集开始,以便更容易地了解正在发生的情况。假设我们有
3 名患有ids A、B 和 C
的患者,并且我们对每位患者进行两次血压测量。我们将使用 来创建数据tribble(),这是一个手动构建小
tibbles 的便捷函数:
## # A tibble: 3 × 3
## id bp1 bp2
## <chr> <dbl> <dbl>
## 1 A 100 120
## 2 B 140 115
## 3 C 120 125
We want our new dataset to have three variables: id
(already exists), measurement (the column names), and
value (the cell values). To achieve this, we need to pivot
df longer:
## # A tibble: 6 × 3
## id measurement value
## <chr> <chr> <dbl>
## 1 A bp1 100
## 2 A bp2 120
## 3 B bp1 140
## 4 B bp2 115
## 5 C bp1 120
## 6 C bp2 125
The column names become values in a new variable, whose name is
defined by names_to. They need to be repeated once for each
row in the original dataset.
The cell values also become values in a new variable, with a name
defined by values_to. They are unwound row by row.
当您将多条信息塞进列名中,并且希望将这些信息存储在单独的新变量中时,就会出现更具挑战性的情况。例如,以who2数据集为例,它是表1和您在上面看到的朋友的来源:
## # A tibble: 7,240 × 58
## country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1980 NA NA NA NA NA NA
## 2 Afghanistan 1981 NA NA NA NA NA NA
## 3 Afghanistan 1982 NA NA NA NA NA NA
## 4 Afghanistan 1983 NA NA NA NA NA NA
## 5 Afghanistan 1984 NA NA NA NA NA NA
## 6 Afghanistan 1985 NA NA NA NA NA NA
## 7 Afghanistan 1986 NA NA NA NA NA NA
## 8 Afghanistan 1987 NA NA NA NA NA NA
## 9 Afghanistan 1988 NA NA NA NA NA NA
## 10 Afghanistan 1989 NA NA NA NA NA NA
## # ℹ 7,230 more rows
## # ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
## # sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
## # sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
## # sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
## # sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
## # sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …
该数据集由世界卫生组织收集,记录了有关结核病诊断的信息。有两列已经是变量,很容易解释:国家和年份。它们后面跟有56列,如sp_m_014、ep_m_4554和rel_m_3544。如果你盯着这些列看足够长的时间,你会注意到有一种模式。每个列名由三个部分组成,用_分隔。第一部分sp/rel/ep描述了用于诊断的方法,第二部分m/f是性别(在该数据集中编码为二进制变量),第三部分014/1524/2534/3544/4554/5564/65是年龄范围(例如,014表示0-14)。
who2 |>
pivot_longer(
cols = !(country:year),
names_to = c("diagnosis", "gender", "age"),
names_sep = "_",
values_to = "count"
)## # A tibble: 405,440 × 6
## country year diagnosis gender age count
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1980 sp m 014 NA
## 2 Afghanistan 1980 sp m 1524 NA
## 3 Afghanistan 1980 sp m 2534 NA
## 4 Afghanistan 1980 sp m 3544 NA
## 5 Afghanistan 1980 sp m 4554 NA
## 6 Afghanistan 1980 sp m 5564 NA
## 7 Afghanistan 1980 sp m 65 NA
## 8 Afghanistan 1980 sp f 014 NA
## 9 Afghanistan 1980 sp f 1524 NA
## 10 Afghanistan 1980 sp f 2534 NA
## # ℹ 405,430 more rows
复杂性的下一步是列名包含变量值和变量名的混合。以数据集为例household:
## # A tibble: 5 × 5
## family dob_child1 dob_child2 name_child1 name_child2
## <int> <date> <date> <chr> <chr>
## 1 1 1998-11-26 2000-01-29 Susan Jose
## 2 2 1996-06-22 NA Mark <NA>
## 3 3 2002-07-11 2004-04-05 Sam Seth
## 4 4 2004-10-10 2009-08-27 Craig Khai
## 5 5 2000-12-05 2005-02-28 Parker Gracie
该数据集包含五个家庭的数据,最多有两个孩子的名字和出生日期。该数据集中的新挑战是列名包含两个变量的名称(dob,name)和另一个变量的值(child,值为1或2)。为了解决这个问题,我们再次需要向names_to提供一个向量,但这次我们使用了特殊的”.value“标记;这不是一个变量的名称,而是一个唯一的值,告诉pivot_longer()做一些不同的事情。这将重写常用的values_to参数,以将数据透视列名的第一个组件用作输出中的变量名。
household |>
pivot_longer(
cols = !family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
)## # A tibble: 9 × 4
## family child dob name
## <int> <chr> <date> <chr>
## 1 1 child1 1998-11-26 Susan
## 2 1 child2 2000-01-29 Jose
## 3 2 child1 1996-06-22 Mark
## 4 3 child1 2002-07-11 Sam
## 5 3 child2 2004-04-05 Seth
## 6 4 child1 2004-10-10 Craig
## 7 4 child2 2009-08-27 Khai
## 8 5 child1 2000-12-05 Parker
## 9 5 child2 2005-02-28 Gracie
我们首先查看cms_patient_experience来自医疗保险和医疗补助服务中心的数据集,该数据集收集有关患者体验的数据:
pivot_wider()
has the opposite interface to pivot_longer():
instead of choosing new column names, we need to provide the existing
columns that define the values (values_from) and the column
name (names_from)
cms_patient_experience |>
pivot_wider(
id_cols = starts_with("org"),
names_from = measure_cd,
values_from = prf_rate
)## # A tibble: 95 × 8
## org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5 CAHPS_GRP_8
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0446157747 USC C… 63 87 86 57 85
## 2 0446162697 ASSOC… 59 85 83 63 88
## 3 0547164295 BEAVE… 49 NA 75 44 73
## 4 0749333730 CAPE … 67 84 85 65 82
## 5 0840104360 ALLIA… 66 87 87 64 87
## 6 0840109864 REX H… 73 87 84 67 91
## 7 0840513552 SCL H… 58 83 76 58 78
## 8 0941545784 GRITM… 46 86 81 54 NA
## 9 1052612785 COMMU… 65 84 80 58 87
## 10 1254237779 OUR L… 61 NA NA 65 NA
## # ℹ 85 more rows
## # ℹ 1 more variable: CAHPS_GRP_12 <dbl>
上面这段代码将原来的一列变量measure_cd
按照不同的检测内容拆分为了单独的六列变量:
| CAHPS_GRP_1 | CAHPS_GRP_2 | CAHPS_GRP_3 | CAHPS_GRP_5 | CAHPS_GRP_8 | CAHPS_GRP_12 |
|---|---|---|---|---|---|
pivot_wider()
work?让我们再次从一个非常简单的数据集开始。这次我们有两名id为 A 和 B 的患者,我们对患者 A 进行了三次血压测量,对患者 B 进行了两次血压测量:
df <- tribble(
~id, ~measurement, ~value,
"A", "bp1", 100,
"B", "bp1", 140,
"B", "bp2", 115,
"A", "bp2", 120,
"A", "bp3", 105
)
df## # A tibble: 5 × 3
## id measurement value
## <chr> <chr> <dbl>
## 1 A bp1 100
## 2 B bp1 140
## 3 B bp2 115
## 4 A bp2 120
## 5 A bp3 105
We'll take the values from the value column and the
names from the measurement column:
## # A tibble: 2 × 4
## id bp1 bp2 bp3
## <chr> <dbl> <dbl> <dbl>
## 1 A 100 120 105
## 2 B 140 115 NA