Load packages

tidyr now is in the tidyverse

#knitr::opts_chunk$set(echo = TRUE)
library(readr) # function: read_table
library(kableExtra)
#library(tidyr)
library(tidyverse) # tidyr in tidyverse
library(reshape2) # use melt()
library(data.table) # setDT

History

reshape2 is an R package written by Hadley Wickham that makes it easy to transform data between wide and long formats.
tidyr replaces reshape2 (2010-2014) and reshape (2005-2010).
tidyr is designed specifically for tidying data, not general reshaping (reshape2), or the general aggregation (reshape).

Functional

Method Wide \(\rightarrow\) Long Long \(\rightarrow\) Wide
reshape2 melt cast
tidyr gather spread
spreadsheet unpivot pivot
database fold unfold

Usage of Long or Wide form of Data

Long form Wide form
– has a column for possible variable types and a column for the values of those variables.
– not necessarily only two columns.
– widely used in many tools, such as ggplot2, plyr, lm(), glm(), etc.
– is easier and intuitive for recording data.

Structure of Long or Wide form Data

Long form

I used an example from Riesby and al. (Reisby et al. 1977) examined the relationship between Imipramine (IMI) and Desipramine (DMI) plasma levels and clinical response in 66 depressed inpatients (37 endogenous and 29 non-endogenous), patients received 225 mg/day doses of imipramine for 4 weeks (Donald Hedeker 2006). Dataset is including:

  • ID: study identification
  • HamD: Hamilton Depression Rating Scale (HDRS scores). These represent the dependent variable that is measured across time. Higher scores on the HDRS represent higher levels of depression and lower scores indicate less depression
  • Endog: 0=‘NonEndog’ 1=‘Endog’
  • Week: 0=‘week 0’ 1=‘week 1’ 2=‘week 2’ 3=‘week 3’ 4=‘week 4’ 5=‘week 5’
one<-read_table(url("https://hedeker.people.uic.edu/RIESBY.DAT.txt"), col_names=FALSE, col_types=cols(.default = col_double()))
colnames(one)<-c("NULL","id", "hamd", "intcpt", "week", "endog", "endweek")
one<-one[-dim(one)[1],-c(1,4,7)]

kable(one[1:10,], caption="Long form of Riesby data") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Long form of Riesby data
id hamd week endog
101 26 0 0
101 22 1 0
101 18 2 0
101 7 3 0
101 4 4 0
101 3 5 0
103 33 0 0
103 24 1 0
103 15 2 0
103 24 3 0

Wide form

Another example is the Treatment of Lead Exposed Children Trial (N=100, TLC trial) showed the blood lead level from week 0 (baseline) to week 1, week 4, and week 6 for the succimer and placebo groups from Applied Longitudinal Analysis, 2nd Edition by Garrett Fitzmaurice, Nan Laird & James Ware (Garrett M. Fitzmaurice 2011)

tlc<-read_table(url("https://content.sph.harvard.edu/fitzmaur/ala2e/tlc-data.txt"), col_names=FALSE)
colnames(tlc)<-c('id', 'group', 'lead0', 'lead1', 'lead4', 'lead6')

kable(tlc[1:10,], caption="Wide form of TLC data") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Wide form of TLC data
id group lead0 lead1 lead4 lead6
1 P 30.8 26.9 25.8 23.8
2 A 26.5 14.8 19.5 21.0
3 A 25.8 23.0 19.1 23.2
4 P 24.7 24.5 22.0 22.5
5 A 20.4 2.8 3.2 9.4
6 A 20.4 5.4 4.5 11.9
7 P 28.6 20.8 19.2 18.4
8 P 33.7 31.6 28.5 25.1
9 P 19.7 14.9 15.3 14.7
10 P 31.1 31.2 29.2 30.1

Show case

Long to Wide

#one$Week<-as.character(one$Week)
one.w<-stats::reshape(one,
                      idvar = c("id","endog"),
                      timevar = "week",
                      v.names = "hamd", 
                      direction = "wide")

kable(one.w[1:10,], caption="Dataset looks like when Transposing Long to Wide") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Dataset looks like when Transposing Long to Wide
id endog hamd.c(0, 1, 2, 3, 4, 5)
101 0 NA
103 0 NA
104 1 NA
105 0 NA
106 1 NA
107 1 NA
108 1 NA
113 0 NA
114 0 NA
115 1 NA

Obviously, reshape function did not recognize the dimensions of data via the timevar when turning long to wide. Thus, we got the wrong wide form!?!

Rethinking of data: A dataset is a collection of values. Every value belongs to a variable and an observation (Wickham 2014).

one.w <- tidyr::spread(one, key = week, value = hamd, sep = "")
kable(one.w[1:10,], caption="Dataset looks like when Transposing Long to Wide using `spread` function") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Dataset looks like when Transposing Long to Wide using spread function
id endog week0 week1 week2 week3 week4 week5
101 0 26 22 18 7 4 3
103 0 33 24 15 24 15 13
104 1 29 22 18 13 19 0
105 0 22 12 16 16 13 9
106 1 21 25 23 18 20 NA
107 1 21 21 16 19 NA 6
108 1 21 22 11 9 9 7
113 0 21 23 19 23 23 NA
114 0 NA 17 11 13 7 7
115 1 NA 16 16 16 16 11

Currently, spread has been retired so we conveniently use the pivot_wider

one.w <- tidyr::pivot_wider(one, names_from = week, values_from = hamd, names_prefix = "w")
kable(one.w[1:10,], caption="Dataset looks like when Transposing Long to Wide using `pivot_wider` function") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Dataset looks like when Transposing Long to Wide using pivot_wider function
id endog w0 w1 w2 w3 w4 w5
101 0 26 22 18 7 4 3
103 0 33 24 15 24 15 13
104 1 29 22 18 13 19 0
105 0 22 12 16 16 13 9
106 1 21 25 23 18 20 NA
107 1 21 21 16 19 NA 6
108 1 21 22 11 9 9 7
113 0 21 23 19 23 23 NA
114 0 NA 17 11 13 7 7
115 1 NA 16 16 16 16 11

Wide to Long

As the conventional way

tlc.l <- stats::reshape(tlc,
                        varying = c("lead0", "lead1", "lead4", "lead6"),
                        v.names = "y",
                        timevar = "time",
                        times = c("0", "1", "4", "6"),
                        new.row.names = 1:1000,
                        direction = "long")

#tlc.l.sort <- tlc.l[order(tlc.l$id),]
kable(tlc.l[1:10,], caption="Dataset looks like when Transposing Wide to Long") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Dataset looks like when Transposing Wide to Long
id group time y
1 P 0 30.8
2 A 0 26.5
3 A 0 25.8
4 P 0 24.7
5 A 0 20.4
6 A 0 20.4
7 P 0 28.6
8 P 0 33.7
9 P 0 19.7
10 P 0 31.1

Applying gather()

tlc.l<-gather(tlc, key = time, value = y, lead0:lead6)
kable(tlc.l[1:10,], caption="Dataset looks like when Transposing Wide to Long") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Dataset looks like when Transposing Wide to Long
id group time y
1 P lead0 30.8
2 A lead0 26.5
3 A lead0 25.8
4 P lead0 24.7
5 A lead0 20.4
6 A lead0 20.4
7 P lead0 28.6
8 P lead0 33.7
9 P lead0 19.7
10 P lead0 31.1

Using pivot_longer() due to the retirement of gather()

tlc.l<-pivot_longer(tlc, names_to = "time", values_to = "y", lead0:lead6)

tlc.l<-separate(tlc.l, time, into = c("lead", "time"), sep = 4)%>%
  select(-lead)

kable(tlc.l[1:10,], caption="Dataset looks like when Transposing Wide to Long") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Dataset looks like when Transposing Wide to Long
id group time y
1 P 0 30.8
1 P 1 26.9
1 P 4 25.8
1 P 6 23.8
2 A 0 26.5
2 A 1 14.8
2 A 4 19.5
2 A 6 21.0
3 A 0 25.8
3 A 1 23.0

I have gone further a little by using separate() to eliminate “lead” word in value of variable time, so it completely returned to it-should-be.

Bonus

  • Especially, if we use melt() in reshape (or reshape2), the data should be treated as data.frame()
tlc.dta.frm<-as.data.frame(tlc)
tlc.l<-melt(tlc.dta.frm, id=c("id","group"))
head(tlc.l)
##   id group variable value
## 1  1     P    lead0  30.8
## 2  2     A    lead0  26.5
## 3  3     A    lead0  25.8
## 4  4     P    lead0  24.7
## 5  5     A    lead0  20.4
## 6  6     A    lead0  20.4
(dw <- read.table(header=T, text='
 sbj f1.avg f1.sd f2.avg f2.sd  blabla
  A   10    6     50     10      bA
  B   12    5     70     11      bB
  C   20    7     20     8       bC
  D   22    8     22     9       bD'))
##   sbj f1.avg f1.sd f2.avg f2.sd blabla
## 1   A     10     6     50    10     bA
## 2   B     12     5     70    11     bB
## 3   C     20     7     20     8     bC
## 4   D     22     8     22     9     bD
melt(setDT(dw), measure.vars=list(c(2,4), c(3,5)), 
     variable.name='var', 
     value.name=c('avg', 'sd'))[,var:= paste0('f',var)][order(sbj)]
##    sbj blabla var avg sd
## 1:   A     bA  f1  10  6
## 2:   A     bA  f2  50 10
## 3:   B     bB  f1  12  5
## 4:   B     bB  f2  70 11
## 5:   C     bC  f1  20  7
## 6:   C     bC  f2  20  8
## 7:   D     bD  f1  22  8
## 8:   D     bD  f2  22  9
dw %>% 
  gather(v, value, f1.avg:f2.sd) %>% 
  separate(v, c("var", "col")) %>% 
  arrange(sbj) %>% 
  spread(col, value)
##   sbj blabla var avg sd
## 1   A     bA  f1  10  6
## 2   A     bA  f2  50 10
## 3   B     bB  f1  12  5
## 4   B     bB  f2  70 11
## 5   C     bC  f1  20  7
## 6   C     bC  f2  20  8
## 7   D     bD  f1  22  8
## 8   D     bD  f2  22  9

A note of comparison of tidyr and reshape2

  • They are very similar!
  • cast() in reshape2 can work on matrix/array, while gather() in tidyr can only work on data.frame.
  • reshape2 can do aggregation, while tidyr is not designed for this purpose.
  • colsplit() in reshape2 operates only on a single column while separate() in tidyr performs all the operation at once.

Reference

Donald Hedeker, Robert D. Gibbons. 2006. Longitudinal Data Analysis. Book. WILEY, A JOHN WILEY & SONS, INC.,

Garrett M. Fitzmaurice, James H. Ware, Nan M. Laird. 2011. Applied Longitudinal Analysis. Book. Canada: WILEY, A JOHN WILEY & SONS, INC.,

Reisby, Niels, Lars F. Gram, Per Bech, Adam Nagy, Gorm Odden Petersen, Jørgen Ortmann, Ilse Ibsen, et al. 1977. “Imipramine: Clinical Effects and Pharmacokinetic Variability.” Journal Article. Psychopharmacology 54 (3): 263–72. doi:10.1007/BF00426574.

Wickham, Hadley. 2014. “Tidy Data.” Journal Article. The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.