By Hadley Wickham (see https://github.com/hadley/tidyr/)
tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.
Just as reshape2 did less than reshape, tidy2 does less than reshape2. It's designed specifically for tidying data, not the general reshaping that reshape2 does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, and tidyr provides no margins or aggregation.
There are two fundamental verbs of data tidying:
gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.
spread(). takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider.
These verbs have a number of synonyms:
# I had to give the table a name...
t <- c('tidyr', 'reshape2', 'spreadsheets', 'databases')
a <- c('gather', 'melt', 'pivot', 'fold')
o <- c('spread', 'cast', 'unpivot', 'unfold')
tao <- as.data.frame(list(tool=t, action=a, opposite=o))
print(tao) # http://huffman.sourceforge.net/tao/tao-of-programming.html
## tool action opposite
## 1 tidyr gather spread
## 2 reshape2 melt cast
## 3 spreadsheets pivot unpivot
## 4 databases fold unfold
tidyr also provides separate() function which makes it easier to turn a single character column into multiple columns either according to regular expression, or by fixed positions. The component to separate() is unite().
# install.packages("devtools")
# devtools::install_github("hadley/tidyr")
require(tidyr)
## Loading required package: tidyr
require(dplyr) # tidyr designed for use with dplyr
dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
dadmom
## famid named incd namem incm
## 1 1 Bill 30000 Bess 15000
## 2 2 Art 22000 Amy 18000
## 3 3 Paul 25000 Pat 50000
dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
dadmom %>%
gather(key, value, named:incm) %>% # define a key & value, specify cols to gather (aka 'melt')
separate(key, c("variable", "type"), -2) %>% # seperate key to var & type, -2 is the last char
spread(variable, value, convert = TRUE) # name & inc reversed, partners swapping??
## famid type name inc
## 1 1 d 30000 Bill
## 2 2 d 15000 Bess
## 3 3 d 22000 Art
## 7 1 m 18000 Amy
## 8 2 m 25000 Paul
## 9 3 m 50000 Pat
It works if values are of one type:
dadmom %>%
gather(key, value, starts_with('inc')) %>% # gather numeric 'inc%'
separate(key, c("variable", "type"), -2) %>%
spread(type, value) # spread by type (d/m) & value
## famid named namem variable d m
## 1 1 Bill Bess inc 30000 15000
## 2 2 Art Amy inc 22000 18000
## 3 3 Paul Pat inc 25000 50000
In steps, gather:
dadmom %>%
gather(key, value, starts_with('inc')) # key & value variables defined
## famid named namem key value
## 1 1 Bill Bess incd 30000
## 2 2 Art Amy incd 22000
## 3 3 Paul Pat incd 25000
## 4 1 Bill Bess incm 15000
## 5 2 Art Amy incm 18000
## 6 3 Paul Pat incm 50000
separate:
dadmom %>%
gather(key, value, starts_with('inc')) %>%
separate(key, c("variable", "type"), -2) # separate key to variable & type
## famid named namem variable type value
## 1 1 Bill Bess inc d 30000
## 2 2 Art Amy inc d 22000
## 3 3 Paul Pat inc d 25000
## 4 1 Bill Bess inc m 15000
## 5 2 Art Amy inc m 18000
## 6 3 Paul Pat inc m 50000
spead:
dadmom %>%
gather(key, value, starts_with('inc')) %>%
separate(key, c("variable", "type"), -2) %>%
spread(type, value) # spread by type & value
## famid named namem variable d m
## 1 1 Bill Bess inc 30000 15000
## 2 2 Art Amy inc 22000 18000
## 3 3 Paul Pat inc 25000 50000
gather a character variable:
dadmom %>%
gather(key, value, starts_with('name')) %>% # gather a char variable
separate(key, c("variable", "type"), -2) %>%
spread(type, value)
## famid incd incm variable d m
## 1 1 30000 15000 name Bill Bess
## 2 2 22000 18000 name Art Amy
## 3 3 25000 50000 name Paul Pat
spread a numeric variable by different variable:
dadmom %>%
gather(key, value, starts_with('inc')) %>%
separate(key, c("variable", "type"), -2) %>%
spread(famid, value) # fill = NA is default
## named namem variable type 1 2 3
## 1 Bill Bess inc d NA 22000 NA
## 2 Art Amy inc d NA 18000 NA
## 3 Paul Pat inc d 30000 NA NA
## 4 Bill Bess inc m 15000 NA NA
## 5 Art Amy inc m NA NA 25000
## 6 Paul Pat inc m NA NA 50000
Spelling it out, the key, value pair is required to gather:
key is a new variable with values from the variable names gathered, plus…value, another new variable holding the values of those variables gathereddadmom %>%
gather(key = my_key, value = my_value
, starts_with('inc'))
## famid named namem my_key my_value
## 1 1 Bill Bess incd 30000
## 2 2 Art Amy incd 22000
## 3 3 Paul Pat incd 25000
## 4 1 Bill Bess incm 15000
## 5 2 Art Amy incm 18000
## 6 3 Paul Pat incm 50000
The pairing allows the gathered data to be spread. The value becomes the body of a table, and the key matches that value against the remaining variables:
dadmom %>%
gather(key = my_key, value = my_value
, starts_with('inc')) %>%
separate(my_key
, c("my_var", "my_type"), -2) %>% # spread into columns of variable values
spread(my_type, my_value) # table body is values keyed against other variables
## famid named namem my_var d m
## 1 1 Bill Bess inc 30000 15000
## 2 2 Art Amy inc 22000 18000
## 3 3 Paul Pat inc 25000 50000
# http://stackoverflow.com/questions/15668870/
grades <- tbl_df(read.table(header = TRUE, text = "
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31
"))
grades %>%
gather(Semester, Score, Fall:Winter) %>%
mutate(Test = paste0("Test", Test)) %>%
spread(Test, Score) %>%
arrange(ID, Year, Semester)
## ID Year Semester Test1 Test2
## 1 1 2008 Fall 15 22
## 2 1 2008 Spring 12 13
## 3 1 2008 Winter 11 17
## 4 1 2009 Fall 16 22
## 5 1 2009 Spring 13 11
## 6 1 2009 Winter 12 12
## 7 2 2008 Fall 19 24
## 8 2 2008 Spring 25 29
## 9 2 2008 Winter 22 23
## 10 2 2009 Fall 12 10
## 11 2 2009 Spring 16 23
## 12 2 2009 Winter 13 14
## 13 3 2008 Fall 13 14
## 14 3 2008 Spring 14 20
## 15 3 2008 Winter 11 9
## 16 3 2009 Fall 27 20
## 17 3 2009 Spring 21 26
## 18 3 2009 Winter 27 31
# http://stackoverflow.com/questions/16032858
results <- data.frame(
Ind = paste0("Ind", 1:10),
Treatment = rep(c("Treat", "Cont"), each = 10),
value = 1:20
)
head(results)
## Ind Treatment value
## 1 Ind1 Treat 1
## 2 Ind2 Treat 2
## 3 Ind3 Treat 3
## 4 Ind4 Treat 4
## 5 Ind5 Treat 5
## 6 Ind6 Treat 6
results %>% spread(Treatment, value)
## Ind Treat Cont
## 1 Ind1 11 1
## 2 Ind2 20 10
## 3 Ind3 12 2
## 4 Ind4 13 3
## 5 Ind5 14 4
## 6 Ind6 15 5
## 7 Ind7 16 6
## 8 Ind8 17 7
## 9 Ind9 18 8
## 10 Ind10 19 9
# http://stackoverflow.com/questions/17481212
race <- read.table(header = TRUE, check.names = FALSE, text = "
Name 50 100 150 200 250 300 350
Carla 1.2 1.8 2.2 2.3 3.0 2.5 1.8
Mace 1.5 1.1 1.9 2.0 3.6 3.0 2.5
Lea 1.7 1.6 2.3 2.7 2.6 2.2 2.6
Karen 1.3 1.7 1.9 2.2 3.2 1.5 1.9
")
race %>%
gather(Time, Score, -Name, convert = TRUE) %>%
arrange(Name, Time)
## Name Time Score
## 1 Carla 50 1.2
## 2 Carla 100 1.8
## 3 Carla 150 2.2
## 4 Carla 200 2.3
## 5 Carla 250 3.0
## 6 Carla 300 2.5
## 7 Carla 350 1.8
## 8 Karen 50 1.3
## 9 Karen 100 1.7
## 10 Karen 150 1.9
## 11 Karen 200 2.2
## 12 Karen 250 3.2
## 13 Karen 300 1.5
## 14 Karen 350 1.9
## 15 Lea 50 1.7
## 16 Lea 100 1.6
## 17 Lea 150 2.3
## 18 Lea 200 2.7
## 19 Lea 250 2.6
## 20 Lea 300 2.2
## 21 Lea 350 2.6
## 22 Mace 50 1.5
## 23 Mace 100 1.1
## 24 Mace 150 1.9
## 25 Mace 200 2.0
## 26 Mace 250 3.6
## 27 Mace 300 3.0
## 28 Mace 350 2.5
# http://stackoverflow.com/questions/9684671
set.seed(10)
activities <- data.frame(
id = sprintf("x1.%02d", 1:10),
trt = sample(c('cnt', 'tr'), 10, T),
work.T1 = runif(10),
play.T1 = runif(10),
talk.T1 = runif(10),
work.T2 = runif(10),
play.T2 = runif(10),
talk.T2 = runif(10)
)
activities %>%
gather(key, value, -id, -trt) %>%
separate(key, into = c("location", "time")) %>%
arrange(id, trt, time) %>%
spread(location, value)
## id trt time work play talk
## 1 x1.01 tr T1 0.8647 0.53560 0.65166
## 4 x1.01 tr T2 0.3543 0.03189 0.27548
## 7 x1.02 cnt T1 0.6154 0.09309 0.56774
## 10 x1.02 cnt T2 0.9364 0.11447 0.22890
## 13 x1.03 cnt T1 0.7751 0.16980 0.11351
## 16 x1.03 cnt T2 0.2459 0.46894 0.01443
## 19 x1.04 tr T1 0.3556 0.89983 0.59593
## 22 x1.04 tr T2 0.4731 0.39699 0.72896
## 25 x1.05 cnt T1 0.4058 0.42264 0.35805
## 28 x1.05 cnt T2 0.1916 0.83362 0.24988
## 31 x1.06 cnt T1 0.7066 0.74775 0.42881
## 34 x1.06 cnt T2 0.5832 0.76112 0.16118
## 37 x1.07 cnt T1 0.8383 0.82265 0.05190
## 40 x1.07 cnt T2 0.4595 0.57336 0.01704
## 43 x1.08 cnt T1 0.2396 0.95465 0.26418
## 46 x1.08 cnt T2 0.4674 0.44751 0.48610
## 49 x1.09 tr T1 0.7708 0.68544 0.39879
## 52 x1.09 tr T2 0.3998 0.08380 0.10290
## 55 x1.10 cnt T1 0.3559 0.50050 0.83613
## 58 x1.10 cnt T2 0.5053 0.21914 0.80155
If you'd like to learn more about these data reshaping operators, I'd recommend the following papers: