tidyr

By Hadley Wickham (see https://github.com/hadley/tidyr/)

tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.

Just as reshape2 did less than reshape, tidy2 does less than reshape2. It's designed specifically for tidying data, not the general reshaping that reshape2 does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, and tidyr provides no margins or aggregation.

There are two fundamental verbs of data tidying:

These verbs have a number of synonyms:

# I had to give the table a name...
t <- c('tidyr', 'reshape2', 'spreadsheets', 'databases')
a <- c('gather', 'melt', 'pivot', 'fold')
o <- c('spread', 'cast', 'unpivot', 'unfold')
tao <- as.data.frame(list(tool=t, action=a, opposite=o))
print(tao) # http://huffman.sourceforge.net/tao/tao-of-programming.html
##           tool action opposite
## 1        tidyr gather   spread
## 2     reshape2   melt     cast
## 3 spreadsheets  pivot  unpivot
## 4    databases   fold   unfold

tidyr also provides separate() function which makes it easier to turn a single character column into multiple columns either according to regular expression, or by fixed positions. The component to separate() is unite().

# install.packages("devtools")
# devtools::install_github("hadley/tidyr")
require(tidyr)
## Loading required package: tidyr
require(dplyr) # tidyr designed for use with dplyr
dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
dadmom
##   famid named  incd namem  incm
## 1     1  Bill 30000  Bess 15000
## 2     2   Art 22000   Amy 18000
## 3     3  Paul 25000   Pat 50000
dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
dadmom %>%
  gather(key, value, named:incm) %>%            # define a key & value, specify cols to gather (aka 'melt')
  separate(key, c("variable", "type"), -2) %>%  # seperate key to var & type, -2 is the last char
  spread(variable, value, convert = TRUE)       # name & inc reversed, partners swapping??
##   famid type  name  inc
## 1     1    d 30000 Bill
## 2     2    d 15000 Bess
## 3     3    d 22000  Art
## 7     1    m 18000  Amy
## 8     2    m 25000 Paul
## 9     3    m 50000  Pat

It works if values are of one type:

dadmom %>%
  gather(key, value, starts_with('inc')) %>%    # gather numeric 'inc%'
  separate(key, c("variable", "type"), -2) %>%  
  spread(type, value)                           # spread by type (d/m) & value
##   famid named namem variable     d     m
## 1     1  Bill  Bess      inc 30000 15000
## 2     2   Art   Amy      inc 22000 18000
## 3     3  Paul   Pat      inc 25000 50000

In steps, gather:

dadmom %>%
  gather(key, value, starts_with('inc'))        # key & value variables defined
##   famid named namem  key value
## 1     1  Bill  Bess incd 30000
## 2     2   Art   Amy incd 22000
## 3     3  Paul   Pat incd 25000
## 4     1  Bill  Bess incm 15000
## 5     2   Art   Amy incm 18000
## 6     3  Paul   Pat incm 50000

separate:

dadmom %>%
  gather(key, value, starts_with('inc')) %>%    
  separate(key, c("variable", "type"), -2)      # separate key to variable & type
##   famid named namem variable type value
## 1     1  Bill  Bess      inc    d 30000
## 2     2   Art   Amy      inc    d 22000
## 3     3  Paul   Pat      inc    d 25000
## 4     1  Bill  Bess      inc    m 15000
## 5     2   Art   Amy      inc    m 18000
## 6     3  Paul   Pat      inc    m 50000

spead:

dadmom %>%
  gather(key, value, starts_with('inc')) %>%    
  separate(key, c("variable", "type"), -2) %>%    
  spread(type, value)                           # spread by type & value
##   famid named namem variable     d     m
## 1     1  Bill  Bess      inc 30000 15000
## 2     2   Art   Amy      inc 22000 18000
## 3     3  Paul   Pat      inc 25000 50000

gather a character variable:

dadmom %>%
  gather(key, value, starts_with('name')) %>%    # gather a char variable
  separate(key, c("variable", "type"), -2) %>%  
  spread(type, value)       
##   famid  incd  incm variable    d    m
## 1     1 30000 15000     name Bill Bess
## 2     2 22000 18000     name  Art  Amy
## 3     3 25000 50000     name Paul  Pat

spread a numeric variable by different variable:

dadmom %>%
  gather(key, value, starts_with('inc')) %>%
  separate(key, c("variable", "type"), -2) %>%  
  spread(famid, value)                            # fill = NA is default
##   named namem variable type     1     2     3
## 1  Bill  Bess      inc    d    NA 22000    NA
## 2   Art   Amy      inc    d    NA 18000    NA
## 3  Paul   Pat      inc    d 30000    NA    NA
## 4  Bill  Bess      inc    m 15000    NA    NA
## 5   Art   Amy      inc    m    NA    NA 25000
## 6  Paul   Pat      inc    m    NA    NA 50000

Spelling it out, the key, value pair is required to gather:

dadmom %>%
  gather(key = my_key, value = my_value
         , starts_with('inc'))
##   famid named namem my_key my_value
## 1     1  Bill  Bess   incd    30000
## 2     2   Art   Amy   incd    22000
## 3     3  Paul   Pat   incd    25000
## 4     1  Bill  Bess   incm    15000
## 5     2   Art   Amy   incm    18000
## 6     3  Paul   Pat   incm    50000

The pairing allows the gathered data to be spread. The value becomes the body of a table, and the key matches that value against the remaining variables:

dadmom %>%
  gather(key = my_key, value = my_value
         , starts_with('inc')) %>%    
  separate(my_key
           , c("my_var", "my_type"), -2) %>%      # spread into columns of variable values 
  spread(my_type, my_value)                       # table body is values keyed against other variables
##   famid named namem my_var     d     m
## 1     1  Bill  Bess    inc 30000 15000
## 2     2   Art   Amy    inc 22000 18000
## 3     3  Paul   Pat    inc 25000 50000

Reshape wide format, to multi-column long format

# http://stackoverflow.com/questions/15668870/

grades <- tbl_df(read.table(header = TRUE, text = "
   ID   Test Year   Fall Spring Winter
    1   1   2008    15      16      19
    1   1   2009    12      13      27
    1   2   2008    22      22      24
    1   2   2009    10      14      20
    2   1   2008    12      13      25
    2   1   2009    16      14      21
    2   2   2008    13      11      29
    2   2   2009    23      20      26
    3   1   2008    11      12      22
    3   1   2009    13      11      27
    3   2   2008    17      12      23
    3   2   2009    14      9       31
"))

grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("Test", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)
##    ID Year Semester Test1 Test2
## 1   1 2008     Fall    15    22
## 2   1 2008   Spring    12    13
## 3   1 2008   Winter    11    17
## 4   1 2009     Fall    16    22
## 5   1 2009   Spring    13    11
## 6   1 2009   Winter    12    12
## 7   2 2008     Fall    19    24
## 8   2 2008   Spring    25    29
## 9   2 2008   Winter    22    23
## 10  2 2009     Fall    12    10
## 11  2 2009   Spring    16    23
## 12  2 2009   Winter    13    14
## 13  3 2008     Fall    13    14
## 14  3 2008   Spring    14    20
## 15  3 2008   Winter    11     9
## 16  3 2009     Fall    27    20
## 17  3 2009   Spring    21    26
## 18  3 2009   Winter    27    31

Reshape data from long to a short format by a variable, and rename columns

# http://stackoverflow.com/questions/16032858

results <- data.frame(
  Ind = paste0("Ind", 1:10),
  Treatment = rep(c("Treat", "Cont"), each = 10),
  value = 1:20
)
head(results)
##    Ind Treatment value
## 1 Ind1     Treat     1
## 2 Ind2     Treat     2
## 3 Ind3     Treat     3
## 4 Ind4     Treat     4
## 5 Ind5     Treat     5
## 6 Ind6     Treat     6
results %>% spread(Treatment, value)
##      Ind Treat Cont
## 1   Ind1    11    1
## 2   Ind2    20   10
## 3   Ind3    12    2
## 4   Ind4    13    3
## 5   Ind5    14    4
## 6   Ind6    15    5
## 7   Ind7    16    6
## 8   Ind8    17    7
## 9   Ind9    18    8
## 10 Ind10    19    9

Rearranging Data frame in R

# http://stackoverflow.com/questions/17481212

race <- read.table(header = TRUE, check.names = FALSE, text = "
  Name    50  100  150  200  250  300  350
  Carla  1.2  1.8  2.2  2.3  3.0  2.5  1.8
  Mace   1.5  1.1  1.9  2.0  3.6  3.0  2.5
  Lea    1.7  1.6  2.3  2.7  2.6  2.2  2.6
  Karen  1.3  1.7  1.9  2.2  3.2  1.5  1.9
")

race %>%
  gather(Time, Score, -Name, convert = TRUE) %>%
  arrange(Name, Time)
##     Name Time Score
## 1  Carla   50   1.2
## 2  Carla  100   1.8
## 3  Carla  150   2.2
## 4  Carla  200   2.3
## 5  Carla  250   3.0
## 6  Carla  300   2.5
## 7  Carla  350   1.8
## 8  Karen   50   1.3
## 9  Karen  100   1.7
## 10 Karen  150   1.9
## 11 Karen  200   2.2
## 12 Karen  250   3.2
## 13 Karen  300   1.5
## 14 Karen  350   1.9
## 15   Lea   50   1.7
## 16   Lea  100   1.6
## 17   Lea  150   2.3
## 18   Lea  200   2.7
## 19   Lea  250   2.6
## 20   Lea  300   2.2
## 21   Lea  350   2.6
## 22  Mace   50   1.5
## 23  Mace  100   1.1
## 24  Mace  150   1.9
## 25  Mace  200   2.0
## 26  Mace  250   3.6
## 27  Mace  300   3.0
## 28  Mace  350   2.5

wide to long multiple measures each time

# http://stackoverflow.com/questions/9684671

set.seed(10)
activities <- data.frame(
  id = sprintf("x1.%02d", 1:10),
  trt = sample(c('cnt', 'tr'), 10, T),
  work.T1 = runif(10),
  play.T1 = runif(10),
  talk.T1 = runif(10),
  work.T2 = runif(10),
  play.T2 = runif(10),
  talk.T2 = runif(10)
)

activities %>%
  gather(key, value, -id, -trt) %>%
  separate(key, into = c("location", "time")) %>%
  arrange(id, trt, time) %>%
  spread(location, value)
##       id trt time   work    play    talk
## 1  x1.01  tr   T1 0.8647 0.53560 0.65166
## 4  x1.01  tr   T2 0.3543 0.03189 0.27548
## 7  x1.02 cnt   T1 0.6154 0.09309 0.56774
## 10 x1.02 cnt   T2 0.9364 0.11447 0.22890
## 13 x1.03 cnt   T1 0.7751 0.16980 0.11351
## 16 x1.03 cnt   T2 0.2459 0.46894 0.01443
## 19 x1.04  tr   T1 0.3556 0.89983 0.59593
## 22 x1.04  tr   T2 0.4731 0.39699 0.72896
## 25 x1.05 cnt   T1 0.4058 0.42264 0.35805
## 28 x1.05 cnt   T2 0.1916 0.83362 0.24988
## 31 x1.06 cnt   T1 0.7066 0.74775 0.42881
## 34 x1.06 cnt   T2 0.5832 0.76112 0.16118
## 37 x1.07 cnt   T1 0.8383 0.82265 0.05190
## 40 x1.07 cnt   T2 0.4595 0.57336 0.01704
## 43 x1.08 cnt   T1 0.2396 0.95465 0.26418
## 46 x1.08 cnt   T2 0.4674 0.44751 0.48610
## 49 x1.09  tr   T1 0.7708 0.68544 0.39879
## 52 x1.09  tr   T2 0.3998 0.08380 0.10290
## 55 x1.10 cnt   T1 0.3559 0.50050 0.83613
## 58 x1.10 cnt   T2 0.5053 0.21914 0.80155

References

If you'd like to learn more about these data reshaping operators, I'd recommend the following papers: