引言
数据集有长数据
、宽数据
之分。原生数据一般为宽数据,以其适合录入、查看、比较,这类数据的基本特点是:每一个观测(observation)占一行,每一个测量值(measurement)占一列。这时并不是每一列都是变量(variable),也就是说一个变量可能分解占位在不同的列。而长数据则是各类分析软件需要的格式,这类数据的特点是:每一个独特的变量集中在某列,观测位于一行。以下为各种类型数据示例:
tidyr::table4a
Afghanistan |
745 |
2666 |
Brazil |
37737 |
80488 |
China |
212258 |
213766 |
比较隐蔽的例子
data.table::data.table(
Date = c("2009-01-01", "2009-01-02"),
Boeing.stock.price = c("$173.55", "$172.61"),
Amazon.stock.price = c("$174.90", "$171.42"),
Google.stock.price = c("$174.34", "$170.04")
)
2009-01-01 |
$173.55 |
$174.90 |
$174.34 |
2009-01-02 |
$172.61 |
$171.42 |
$170.04 |
tidyr::table4a %>% gather(2:3, key = "year", value = cases)
Afghanistan |
1999 |
745 |
Brazil |
1999 |
37737 |
China |
1999 |
212258 |
Afghanistan |
2000 |
2666 |
Brazil |
2000 |
80488 |
China |
2000 |
213766 |
tidyr::table5 %>% mutate(year = str_c(century, year)) %>% select(-2)
Afghanistan |
1999 |
745/19987071 |
Afghanistan |
2000 |
2666/20595360 |
Brazil |
1999 |
37737/172006362 |
Brazil |
2000 |
80488/174504898 |
China |
1999 |
212258/1272915272 |
China |
2000 |
213766/1280428583 |
需要说明的是,判断数据是长还是宽数据,这取决于您所分析数据集中的变量。在矩形数据集(Rectangle Data)不是所有的列名称都是变量。是否是变量取决于您分析和计算的目的。也就是说在你本次分析和计算过程中,同一变量在同一列,观测(observation)在各自的行,才是真正的tidy data。
The ambiguity comes from the definition of tidy data. Tidiness depends on the variables in your data set. But what is a variable depends on what you are trying to do. To identify the variables that you need to work with, describe what you want to do with an equation. Each variable in the equation should correspond to a variable in your data. Primer
基本概念
变量
变量是可以测量的一种数量、属性或特征。 A variable is a quantity, quality, or property that you can measure.
A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. -Tidy data
变量值
变量值是测量变量时的状态。 A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
观测
观测是为了同一个目的同时测量的一组测量结果。 An observation or case is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a case or data point.
An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.-Tidy data
宽数据(Wide Data)
病例一览表(Cases list)是公共卫生领域最常见的宽数据。
suppressPackageStartupMessages(library(tidyverse))
data(Oswego, package = "epiDisplay")
wide_data <- Oswego %>%
as_tibble() %>%
slice_sample(n = 5)
wide_data
24 |
F |
NA |
FALSE |
NA |
NA |
TRUE |
TRUE |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
36 |
M |
NA |
TRUE |
04/18 |
2215 |
TRUE |
TRUE |
FALSE |
TRUE |
FALSE |
TRUE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
48 |
F |
NA |
TRUE |
04/18 |
2400 |
TRUE |
TRUE |
TRUE |
TRUE |
TRUE |
TRUE |
TRUE |
TRUE |
TRUE |
FALSE |
TRUE |
TRUE |
TRUE |
FALSE |
70 |
M |
1930 |
TRUE |
04/18 |
2230 |
TRUE |
TRUE |
TRUE |
FALSE |
TRUE |
TRUE |
TRUE |
FALSE |
TRUE |
TRUE |
FALSE |
TRUE |
FALSE |
FALSE |
68 |
M |
NA |
TRUE |
04/18 |
2130 |
TRUE |
FALSE |
TRUE |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
# %>%
# flextable::flextable()
杂乱数据(Untidy Data)
“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham
The five most common problems with messy datasets, along with their remedies:
* Column headers are values, not variable names.
* Multiple variables are stored in one column.
* Variables are stored in both rows and columns.
* Multiple types of observational units are stored in the same table.
* A single observational unit is stored in multiple tables.
- 窄数据是指用每列存储多个变量或变量值的数据
- Narrow data(Molten dataTidy data) uses a literal key column and a literal value column to store multiple variables.
who %>%
slice_sample(n = 10)
Palau |
PW |
PLW |
1989 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Seychelles |
SC |
SYC |
1984 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Oman |
OM |
OMN |
1992 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Syrian Arab Republic |
SY |
SYR |
1998 |
5 |
335 |
293 |
111 |
93 |
48 |
50 |
20 |
197 |
99 |
43 |
49 |
18 |
21 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Namibia |
NA |
NAM |
2011 |
48 |
337 |
844 |
660 |
361 |
152 |
138 |
78 |
427 |
653 |
410 |
185 |
100 |
110 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Nepal |
NP |
NPL |
2008 |
81 |
150 |
1409 |
1558 |
1706 |
1515 |
792 |
107 |
832 |
820 |
704 |
630 |
523 |
226 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Montserrat |
MS |
MSR |
2005 |
NA |
NA |
NA |
NA |
1 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Cameroon |
CM |
CMR |
1993 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Zimbabwe |
ZW |
ZWE |
1988 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Namibia |
NA |
NAM |
2006 |
86 |
347 |
1052 |
799 |
386 |
174 |
146 |
74 |
485 |
875 |
521 |
239 |
92 |
80 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
整洁数据(Tidy Data)
整洁数据,需要满足以下条件:
- 每个变量分布在唯一列;
- 每个观测分布在行;
- 每一个值分布在单独的单元格。
- Each variable is in its own column
- Each observation is in its own row
- Each value is in its own cell (this follows from #1 and #2)
Data manipulation
includes variable-by-variable transformation (e.g., log or sqrt), as well as aggregation, fltering and reordering.
Filter
: subsetting or removing observations based on some condition.
Transform
: adding or modifying variables. These modications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
Aggregate
: collapsing multiple values into a single value (e.g., by summing or taking means).
Sort
: changing the order of observations.
宽数据变长数据
列名含单变量(One variable in column names)
原始数据
table4a
Afghanistan |
745 |
2666 |
Brazil |
37737 |
80488 |
China |
212258 |
213766 |
stats::reshape
table4a %>%
as.data.frame() %>%
stats::reshape(
idvar = "country",
timevar = "year",
varying = list(2:3),
v.names = "cases",
direction = "long")
Afghanistan.1 |
Afghanistan |
1 |
745 |
Brazil.1 |
Brazil |
1 |
37737 |
China.1 |
China |
1 |
212258 |
Afghanistan.2 |
Afghanistan |
2 |
2666 |
Brazil.2 |
Brazil |
2 |
80488 |
China.2 |
China |
2 |
213766 |
reshape2::melt
table4a %>%
reshape2::melt(
# id.vars = "country",
id = 1,
variable.name = "year",
value.name = "cases"
)
Afghanistan |
1999 |
745 |
Brazil |
1999 |
37737 |
China |
1999 |
212258 |
Afghanistan |
2000 |
2666 |
Brazil |
2000 |
80488 |
China |
2000 |
213766 |
data.table::melt
table4a %>%
data.table::melt(
id = 1,
variable.name = "year",
value.name = "cases"
)
Afghanistan |
1999 |
745 |
Brazil |
1999 |
37737 |
China |
1999 |
212258 |
Afghanistan |
2000 |
2666 |
Brazil |
2000 |
80488 |
China |
2000 |
213766 |
tidyr::gather
table4a %>%
gather(-country, key = "year", value = "cases")
Afghanistan |
1999 |
745 |
Brazil |
1999 |
37737 |
China |
1999 |
212258 |
Afghanistan |
2000 |
2666 |
Brazil |
2000 |
80488 |
China |
2000 |
213766 |
tidyr::pivot_longer
table4a %>%
pivot_longer(-country,
names_to = "year",
values_to = "cases")
Afghanistan |
1999 |
745 |
Afghanistan |
2000 |
2666 |
Brazil |
1999 |
37737 |
Brazil |
2000 |
80488 |
China |
1999 |
212258 |
China |
2000 |
213766 |
列名含多变量(Many variables in column namese)
原始数据
who_subset <- who %>%
slice_sample(n = 10)
who_subset
Lithuania |
LT |
LTU |
1981 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Iraq |
IQ |
IRQ |
1985 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Haiti |
HT |
HTI |
1998 |
188 |
804 |
971 |
656 |
331 |
177 |
142 |
208 |
827 |
958 |
620 |
300 |
141 |
119 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Greenland |
GL |
GRL |
2000 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Nicaragua |
NI |
NIC |
2006 |
15 |
162 |
151 |
129 |
98 |
90 |
72 |
25 |
168 |
144 |
90 |
65 |
38 |
38 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Norway |
NO |
NOR |
1988 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Czech Republic |
CZ |
CZE |
2005 |
0 |
8 |
24 |
57 |
55 |
45 |
46 |
0 |
3 |
14 |
16 |
7 |
5 |
28 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
British Virgin Islands |
VG |
VGB |
1997 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Belarus |
BY |
BLR |
1980 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
China |
CN |
CHN |
2000 |
1131 |
19111 |
29399 |
25206 |
25593 |
21429 |
21771 |
1420 |
14536 |
18496 |
12377 |
9899 |
7102 |
6296 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
tidyr::gather
who_subset %>%
gather(
new_sp_m014:newrel_f65,
key = "variable",
value = "count"
)%>%
dplyr::slice_head(n = 10)
Lithuania |
LT |
LTU |
1981 |
new_sp_m014 |
NA |
Iraq |
IQ |
IRQ |
1985 |
new_sp_m014 |
NA |
Haiti |
HT |
HTI |
1998 |
new_sp_m014 |
188 |
Greenland |
GL |
GRL |
2000 |
new_sp_m014 |
NA |
Nicaragua |
NI |
NIC |
2006 |
new_sp_m014 |
15 |
Norway |
NO |
NOR |
1988 |
new_sp_m014 |
NA |
Czech Republic |
CZ |
CZE |
2005 |
new_sp_m014 |
0 |
British Virgin Islands |
VG |
VGB |
1997 |
new_sp_m014 |
NA |
Belarus |
BY |
BLR |
1980 |
new_sp_m014 |
NA |
China |
CN |
CHN |
2000 |
new_sp_m014 |
1131 |
tidyr::pivot_longer
who_subset %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count"
) %>%
dplyr::slice_head(n = 10)
Lithuania |
LT |
LTU |
1981 |
sp |
m |
014 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
1524 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
2534 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
3544 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
4554 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
5564 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
65 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
f |
014 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
f |
1524 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
f |
2534 |
NA |
who_subset %>% pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
names_transform = list(
gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
age = ~ readr::parse_factor(
.x,
levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
ordered = TRUE
)
),
values_to = "count",
)%>%
dplyr::slice_head(n = 10)
Lithuania |
LT |
LTU |
1981 |
sp |
m |
014 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
1524 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
2534 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
3544 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
4554 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
5564 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
m |
65 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
f |
014 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
f |
1524 |
NA |
Lithuania |
LT |
LTU |
1981 |
sp |
f |
2534 |
NA |
每行含多个观测(Multiple observations per row)
原始数据
s2 <- "family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA"
DT <- data.table::fread(s2)
DT %>% knitr::kable()
1 |
30 |
1998-11-26 |
2000-01-29 |
NA |
1 |
2 |
NA |
2 |
27 |
1996-06-22 |
NA |
NA |
2 |
NA |
NA |
3 |
26 |
2002-07-11 |
2004-04-05 |
2007-09-02 |
2 |
2 |
1 |
4 |
32 |
2004-10-10 |
2009-08-27 |
2012-07-21 |
1 |
1 |
1 |
5 |
29 |
2000-12-05 |
2005-02-28 |
NA |
2 |
1 |
NA |
data.table::melt
library(data.table)
DT %>%
data.table::melt(
measure = patterns("^dob", "^gender"),
variable.name = "child",
value.name = c("dob", "gender")
)
1 |
30 |
1 |
1998-11-26 |
1 |
2 |
27 |
1 |
1996-06-22 |
2 |
3 |
26 |
1 |
2002-07-11 |
2 |
4 |
32 |
1 |
2004-10-10 |
1 |
5 |
29 |
1 |
2000-12-05 |
2 |
1 |
30 |
2 |
2000-01-29 |
2 |
2 |
27 |
2 |
NA |
NA |
3 |
26 |
2 |
2004-04-05 |
2 |
4 |
32 |
2 |
2009-08-27 |
1 |
5 |
29 |
2 |
2005-02-28 |
1 |
1 |
30 |
3 |
NA |
NA |
2 |
27 |
3 |
NA |
NA |
3 |
26 |
3 |
2007-09-02 |
1 |
4 |
32 |
3 |
2012-07-21 |
1 |
5 |
29 |
3 |
NA |
NA |
tidyr::pivot_longer
# method1
DT %>%
pivot_longer(
cols = -1:-2,
names_to = c(".value", "child"),
names_sep = "_"
)
1 |
30 |
child1 |
1998-11-26 |
1 |
1 |
30 |
child2 |
2000-01-29 |
2 |
1 |
30 |
child3 |
NA |
NA |
2 |
27 |
child1 |
1996-06-22 |
2 |
2 |
27 |
child2 |
NA |
NA |
2 |
27 |
child3 |
NA |
NA |
3 |
26 |
child1 |
2002-07-11 |
2 |
3 |
26 |
child2 |
2004-04-05 |
2 |
3 |
26 |
child3 |
2007-09-02 |
1 |
4 |
32 |
child1 |
2004-10-10 |
1 |
4 |
32 |
child2 |
2009-08-27 |
1 |
4 |
32 |
child3 |
2012-07-21 |
1 |
5 |
29 |
child1 |
2000-12-05 |
2 |
5 |
29 |
child2 |
2005-02-28 |
1 |
5 |
29 |
child3 |
NA |
NA |
# method2
DT %>%
pivot_longer(
cols = -1:-2,
names_to = c(".value", "child"),
names_pattern = '(.*)_.*(.)'
)
1 |
30 |
1 |
1998-11-26 |
1 |
1 |
30 |
2 |
2000-01-29 |
2 |
1 |
30 |
3 |
NA |
NA |
2 |
27 |
1 |
1996-06-22 |
2 |
2 |
27 |
2 |
NA |
NA |
2 |
27 |
3 |
NA |
NA |
3 |
26 |
1 |
2002-07-11 |
2 |
3 |
26 |
2 |
2004-04-05 |
2 |
3 |
26 |
3 |
2007-09-02 |
1 |
4 |
32 |
1 |
2004-10-10 |
1 |
4 |
32 |
2 |
2009-08-27 |
1 |
4 |
32 |
3 |
2012-07-21 |
1 |
5 |
29 |
1 |
2000-12-05 |
2 |
5 |
29 |
2 |
2005-02-28 |
1 |
5 |
29 |
3 |
NA |
NA |
一个简单数据
df <- tibble(id = 1:3, x1 = 4:6, x2 = 5:7, y1 = 7:9, y2 = 10:12)
df
1 |
4 |
5 |
7 |
10 |
2 |
5 |
6 |
8 |
11 |
3 |
6 |
7 |
9 |
12 |
df %>% pivot_longer(!id, names_to = c(".value", "time"), names_pattern = "(.)(.)")
1 |
1 |
4 |
7 |
1 |
2 |
5 |
10 |
2 |
1 |
5 |
8 |
2 |
2 |
6 |
11 |
3 |
1 |
6 |
9 |
3 |
2 |
7 |
12 |
# 注意与以下的区别
df %>% pivot_longer(!id, names_to = c("value", "time"), names_pattern = "(.)(.)", values_to = "results")
1 |
x |
1 |
4 |
1 |
x |
2 |
5 |
1 |
y |
1 |
7 |
1 |
y |
2 |
10 |
2 |
x |
1 |
5 |
2 |
x |
2 |
6 |
2 |
y |
1 |
8 |
2 |
y |
2 |
11 |
3 |
x |
1 |
6 |
3 |
x |
2 |
7 |
3 |
y |
1 |
9 |
3 |
y |
2 |
12 |
长数据变宽数据
原始数据
fish_encounters
4842 |
Release |
1 |
4842 |
I80_1 |
1 |
4842 |
Lisbon |
1 |
4842 |
Rstr |
1 |
4842 |
Base_TD |
1 |
4842 |
BCE |
1 |
4842 |
BCW |
1 |
4842 |
BCE2 |
1 |
4842 |
BCW2 |
1 |
4842 |
MAE |
1 |
4842 |
MAW |
1 |
4843 |
Release |
1 |
4843 |
I80_1 |
1 |
4843 |
Lisbon |
1 |
4843 |
Rstr |
1 |
4843 |
Base_TD |
1 |
4843 |
BCE |
1 |
4843 |
BCW |
1 |
4843 |
BCE2 |
1 |
4843 |
BCW2 |
1 |
4843 |
MAE |
1 |
4843 |
MAW |
1 |
4844 |
Release |
1 |
4844 |
I80_1 |
1 |
4844 |
Lisbon |
1 |
4844 |
Rstr |
1 |
4844 |
Base_TD |
1 |
4844 |
BCE |
1 |
4844 |
BCW |
1 |
4844 |
BCE2 |
1 |
4844 |
BCW2 |
1 |
4844 |
MAE |
1 |
4844 |
MAW |
1 |
4845 |
Release |
1 |
4845 |
I80_1 |
1 |
4845 |
Lisbon |
1 |
4845 |
Rstr |
1 |
4845 |
Base_TD |
1 |
4847 |
Release |
1 |
4847 |
I80_1 |
1 |
4847 |
Lisbon |
1 |
4848 |
Release |
1 |
4848 |
I80_1 |
1 |
4848 |
Lisbon |
1 |
4848 |
Rstr |
1 |
4849 |
Release |
1 |
4849 |
I80_1 |
1 |
4850 |
Release |
1 |
4850 |
I80_1 |
1 |
4850 |
Rstr |
1 |
4850 |
Base_TD |
1 |
4850 |
BCE |
1 |
4850 |
BCW |
1 |
4851 |
Release |
1 |
4851 |
I80_1 |
1 |
4854 |
Release |
1 |
4854 |
I80_1 |
1 |
4855 |
Release |
1 |
4855 |
I80_1 |
1 |
4855 |
Lisbon |
1 |
4855 |
Rstr |
1 |
4855 |
Base_TD |
1 |
4857 |
Release |
1 |
4857 |
I80_1 |
1 |
4857 |
Lisbon |
1 |
4857 |
Rstr |
1 |
4857 |
Base_TD |
1 |
4857 |
BCE |
1 |
4857 |
BCW |
1 |
4857 |
BCE2 |
1 |
4857 |
BCW2 |
1 |
4858 |
Release |
1 |
4858 |
I80_1 |
1 |
4858 |
Lisbon |
1 |
4858 |
Rstr |
1 |
4858 |
Base_TD |
1 |
4858 |
BCE |
1 |
4858 |
BCW |
1 |
4858 |
BCE2 |
1 |
4858 |
BCW2 |
1 |
4858 |
MAE |
1 |
4858 |
MAW |
1 |
4859 |
Release |
1 |
4859 |
I80_1 |
1 |
4859 |
Lisbon |
1 |
4859 |
Rstr |
1 |
4859 |
Base_TD |
1 |
4861 |
Release |
1 |
4861 |
I80_1 |
1 |
4861 |
Lisbon |
1 |
4861 |
Rstr |
1 |
4861 |
Base_TD |
1 |
4861 |
BCE |
1 |
4861 |
BCW |
1 |
4861 |
BCE2 |
1 |
4861 |
BCW2 |
1 |
4861 |
MAE |
1 |
4861 |
MAW |
1 |
4862 |
Release |
1 |
4862 |
I80_1 |
1 |
4862 |
Lisbon |
1 |
4862 |
Rstr |
1 |
4862 |
Base_TD |
1 |
4862 |
BCE |
1 |
4862 |
BCW |
1 |
4862 |
BCE2 |
1 |
4862 |
BCW2 |
1 |
4863 |
Release |
1 |
4863 |
I80_1 |
1 |
4864 |
Release |
1 |
4864 |
I80_1 |
1 |
4865 |
Release |
1 |
4865 |
I80_1 |
1 |
4865 |
Lisbon |
1 |
tidyr::spread
fish_encounters %>%
spread(station, seen, fill = 0)
4842 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4843 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4844 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4845 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
4847 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4848 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4849 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4850 |
1 |
1 |
0 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
4851 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4854 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4855 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
4857 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
4858 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4859 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
4861 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4862 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
4863 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4864 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4865 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
tidyr::pivot_wider
fish_encounters %>%
pivot_wider(names_from = station,
values_from = seen,
values_fill = 0)
4842 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4843 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4844 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4845 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
4847 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4848 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4849 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4850 |
1 |
1 |
0 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
4851 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4854 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4855 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
4857 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
4858 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4859 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
4861 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
4862 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
4863 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4864 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4865 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |