tidyr

By Hadley Wickham (see https://github.com/hadley/tidyr/)

tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.

Just as reshape2 did less than reshape, tidy2 does less than reshape2. It's designed specifically for tidying data, not the general reshaping that reshape2 does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, and tidyr provides no margins or aggregation.

There are two fundamental verbs of data tidying:

These verbs have a number of synonyms:

# I had to give the table a name...
t <- c('tidyr', 'reshape2', 'spreadsheets', 'databases')
a <- c('gather', 'melt', 'pivot', 'fold')
o <- c('spread', 'cast', 'unpivot', 'unfold')
tao <- as.data.frame(list(tool=t, action=a, opposite=o))
print(tao) # http://huffman.sourceforge.net/tao/tao-of-programming.html
##           tool action opposite
## 1        tidyr gather   spread
## 2     reshape2   melt     cast
## 3 spreadsheets  pivot  unpivot
## 4    databases   fold   unfold

tidyr also provides separate() function which makes it easier to turn a single character column into multiple columns either according to regular expression, or by fixed positions. The component to separate() is unite().

# install.packages("devtools")
# devtools::install_github("hadley/tidyr")
require(tidyr)
## Loading required package: tidyr
require(dplyr) # tidyr designed for use with dplyr
dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
dadmom
##   famid named  incd namem  incm
## 1     1  Bill 30000  Bess 15000
## 2     2   Art 22000   Amy 18000
## 3     3  Paul 25000   Pat 50000
dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
dadmom %>%
  gather(key, value, named:incm) %>%            # set key, value, cols to gather
  separate(key, c("variable", "type"), -2) %>%  # split key into var & type
  spread(variable, value, convert = TRUE)       # name & inc reversed??
##   famid type  name  inc
## 1     1    d 30000 Bill
## 2     2    d 15000 Bess
## 3     3    d 22000  Art
## 7     1    m 18000  Amy
## 8     2    m 25000 Paul
## 9     3    m 50000  Pat

It works if variable is filtered:

dadmom %>%
  gather(key, value, starts_with('inc')) %>%    # perhaps in cols gathered?
  separate(key, c("variable", "type"), -2) %>%  # works if they're one type
  spread(type, value)       
##   famid named namem variable     d     m
## 1     1  Bill  Bess      inc 30000 15000
## 2     2   Art   Amy      inc 22000 18000
## 3     3  Paul   Pat      inc 25000 50000
dadmom %>%
  gather(key, value, starts_with('name')) %>%    # char col
  separate(key, c("variable", "type"), -2) %>%  
  spread(type, value)       
##   famid  incd  incm variable    d    m
## 1     1 30000 15000     name Bill Bess
## 2     2 22000 18000     name  Art  Amy
## 3     3 25000 50000     name Paul  Pat
dadmom %>%
  gather(key, value, starts_with('inc')) %>%    # char col
  separate(key, c("variable", "type"), -2) %>%  
  spread(famid, value)       
##   named namem variable type     1     2     3
## 1  Bill  Bess      inc    d    NA 22000    NA
## 2   Art   Amy      inc    d    NA 18000    NA
## 3  Paul   Pat      inc    d 30000    NA    NA
## 4  Bill  Bess      inc    m 15000    NA    NA
## 5   Art   Amy      inc    m    NA    NA 25000
## 6  Paul   Pat      inc    m    NA    NA 50000
dadmom %>%
  gather(key, value, starts_with('inc')) %>%    # char col
  separate(key, c("variable", "type"), -2) %>%  
  spread(variable, value)       
##   famid named namem type   inc
## 1     1  Bill  Bess    d 30000
## 2     2   Art   Amy    d 15000
## 3     3  Paul   Pat    d 22000
## 4     1  Bill  Bess    m 18000
## 5     2   Art   Amy    m 25000
## 6     3  Paul   Pat    m 50000

Reshape wide format, to multi-column long format

# http://stackoverflow.com/questions/15668870/

grades <- tbl_df(read.table(header = TRUE, text = "
   ID   Test Year   Fall Spring Winter
    1   1   2008    15      16      19
    1   1   2009    12      13      27
    1   2   2008    22      22      24
    1   2   2009    10      14      20
    2   1   2008    12      13      25
    2   1   2009    16      14      21
    2   2   2008    13      11      29
    2   2   2009    23      20      26
    3   1   2008    11      12      22
    3   1   2009    13      11      27
    3   2   2008    17      12      23
    3   2   2009    14      9       31
"))

grades %>%
  gather(Semester, Score, Fall:Winter) %>%
  mutate(Test = paste0("Test", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Semester)
##    ID Year Semester Test1 Test2
## 1   1 2008     Fall    15    22
## 2   1 2008   Spring    12    13
## 3   1 2008   Winter    11    17
## 4   1 2009     Fall    16    22
## 5   1 2009   Spring    13    11
## 6   1 2009   Winter    12    12
## 7   2 2008     Fall    19    24
## 8   2 2008   Spring    25    29
## 9   2 2008   Winter    22    23
## 10  2 2009     Fall    12    10
## 11  2 2009   Spring    16    23
## 12  2 2009   Winter    13    14
## 13  3 2008     Fall    13    14
## 14  3 2008   Spring    14    20
## 15  3 2008   Winter    11     9
## 16  3 2009     Fall    27    20
## 17  3 2009   Spring    21    26
## 18  3 2009   Winter    27    31

Reshape data from long to a short format by a variable, and rename columns

# http://stackoverflow.com/questions/16032858

results <- data.frame(
  Ind = paste0("Ind", 1:10),
  Treatment = rep(c("Treat", "Cont"), each = 10),
  value = 1:20
)
head(results)
##    Ind Treatment value
## 1 Ind1     Treat     1
## 2 Ind2     Treat     2
## 3 Ind3     Treat     3
## 4 Ind4     Treat     4
## 5 Ind5     Treat     5
## 6 Ind6     Treat     6
results %>% spread(Treatment, value)
##      Ind Treat Cont
## 1   Ind1    11    1
## 2   Ind2    20   10
## 3   Ind3    12    2
## 4   Ind4    13    3
## 5   Ind5    14    4
## 6   Ind6    15    5
## 7   Ind7    16    6
## 8   Ind8    17    7
## 9   Ind9    18    8
## 10 Ind10    19    9

Rearranging Data frame in R

# http://stackoverflow.com/questions/17481212

race <- read.table(header = TRUE, check.names = FALSE, text = "
  Name    50  100  150  200  250  300  350
  Carla  1.2  1.8  2.2  2.3  3.0  2.5  1.8
  Mace   1.5  1.1  1.9  2.0  3.6  3.0  2.5
  Lea    1.7  1.6  2.3  2.7  2.6  2.2  2.6
  Karen  1.3  1.7  1.9  2.2  3.2  1.5  1.9
")

race %>%
  gather(Time, Score, -Name, convert = TRUE) %>%
  arrange(Name, Time)
##     Name Time Score
## 1  Carla   50   1.2
## 2  Carla  100   1.8
## 3  Carla  150   2.2
## 4  Carla  200   2.3
## 5  Carla  250   3.0
## 6  Carla  300   2.5
## 7  Carla  350   1.8
## 8  Karen   50   1.3
## 9  Karen  100   1.7
## 10 Karen  150   1.9
## 11 Karen  200   2.2
## 12 Karen  250   3.2
## 13 Karen  300   1.5
## 14 Karen  350   1.9
## 15   Lea   50   1.7
## 16   Lea  100   1.6
## 17   Lea  150   2.3
## 18   Lea  200   2.7
## 19   Lea  250   2.6
## 20   Lea  300   2.2
## 21   Lea  350   2.6
## 22  Mace   50   1.5
## 23  Mace  100   1.1
## 24  Mace  150   1.9
## 25  Mace  200   2.0
## 26  Mace  250   3.6
## 27  Mace  300   3.0
## 28  Mace  350   2.5

wide to long multiple measures each time

# http://stackoverflow.com/questions/9684671

set.seed(10)
activities <- data.frame(
  id = sprintf("x1.%02d", 1:10),
  trt = sample(c('cnt', 'tr'), 10, T),
  work.T1 = runif(10),
  play.T1 = runif(10),
  talk.T1 = runif(10),
  work.T2 = runif(10),
  play.T2 = runif(10),
  talk.T2 = runif(10)
)

activities %>%
  gather(key, value, -id, -trt) %>%
  separate(key, into = c("location", "time")) %>%
  arrange(id, trt, time) %>%
  spread(location, value)
##       id trt time   work    play    talk
## 1  x1.01  tr   T1 0.8647 0.53560 0.65166
## 4  x1.01  tr   T2 0.3543 0.03189 0.27548
## 7  x1.02 cnt   T1 0.6154 0.09309 0.56774
## 10 x1.02 cnt   T2 0.9364 0.11447 0.22890
## 13 x1.03 cnt   T1 0.7751 0.16980 0.11351
## 16 x1.03 cnt   T2 0.2459 0.46894 0.01443
## 19 x1.04  tr   T1 0.3556 0.89983 0.59593
## 22 x1.04  tr   T2 0.4731 0.39699 0.72896
## 25 x1.05 cnt   T1 0.4058 0.42264 0.35805
## 28 x1.05 cnt   T2 0.1916 0.83362 0.24988
## 31 x1.06 cnt   T1 0.7066 0.74775 0.42881
## 34 x1.06 cnt   T2 0.5832 0.76112 0.16118
## 37 x1.07 cnt   T1 0.8383 0.82265 0.05190
## 40 x1.07 cnt   T2 0.4595 0.57336 0.01704
## 43 x1.08 cnt   T1 0.2396 0.95465 0.26418
## 46 x1.08 cnt   T2 0.4674 0.44751 0.48610
## 49 x1.09  tr   T1 0.7708 0.68544 0.39879
## 52 x1.09  tr   T2 0.3998 0.08380 0.10290
## 55 x1.10 cnt   T1 0.3559 0.50050 0.83613
## 58 x1.10 cnt   T2 0.5053 0.21914 0.80155

References

If you'd like to learn more about these data reshaping operators, I'd recommend the following papers: