The Evolution of Reshaping Data: stats

Load packages

tidyr now is in the tidyverse

#knitr::opts_chunk$set(echo = TRUE)
library(readr) # function: read_table
library(kableExtra)
#library(tidyr)
library(tidyverse) # tidyr in tidyverse
library(reshape2) # use melt()
library(data.table) # setDT

History

– reshape2 is an R package written by Hadley Wickham that makes it easy to transform data between wide and long formats.
– tidyr replaces reshape2 (2010-2014) and reshape (2005-2010).
– tidyr is designed specifically for tidying data, not general reshaping (reshape2), or the general aggregation (reshape).

Functional

Method	Wide \(\rightarrow\) Long	Long \(\rightarrow\) Wide
`reshape2`	melt	cast
`tidyr`	gather	spread
spreadsheet	unpivot	pivot
database	fold	unfold

Usage of Long or Wide form of Data

Long form	Wide form
– has a column for possible variable types and a column for the values of those variables. – not necessarily only two columns. – widely used in many tools, such as `ggplot2`, `plyr`, lm(), glm(), etc.	– is easier and intuitive for recording data.

Structure of Long or Wide form Data

Long form

I used an example from Riesby and al. (Reisby et al. 1977) examined the relationship between Imipramine (IMI) and Desipramine (DMI) plasma levels and clinical response in 66 depressed inpatients (37 endogenous and 29 non-endogenous), patients received 225 mg/day doses of imipramine for 4 weeks (Donald Hedeker 2006). Dataset is including:

ID: study identification
HamD: Hamilton Depression Rating Scale (HDRS scores). These represent the dependent variable that is measured across time. Higher scores on the HDRS represent higher levels of depression and lower scores indicate less depression
Endog: 0=‘NonEndog’ 1=‘Endog’
Week: 0=‘week 0’ 1=‘week 1’ 2=‘week 2’ 3=‘week 3’ 4=‘week 4’ 5=‘week 5’

one<-read_table(url("https://hedeker.people.uic.edu/RIESBY.DAT.txt"), col_names=FALSE, col_types=cols(.default = col_double()))
colnames(one)<-c("NULL","id", "hamd", "intcpt", "week", "endog", "endweek")
one<-one[-dim(one)[1],-c(1,4,7)]

kable(one[1:10,], caption="Long form of Riesby data") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Long form of Riesby data
id	hamd	week
101	26	0
101	22	1
101	18	2
101	7	3
101	4	4
101	3	5
103	33	0
103	24	1
103	15	2
103	24	3

Wide form

Another example is the Treatment of Lead Exposed Children Trial (N=100, TLC trial) showed the blood lead level from week 0 (baseline) to week 1, week 4, and week 6 for the succimer and placebo groups from Applied Longitudinal Analysis, 2nd Edition by Garrett Fitzmaurice, Nan Laird & James Ware (Garrett M. Fitzmaurice 2011)

tlc<-read_table(url("https://content.sph.harvard.edu/fitzmaur/ala2e/tlc-data.txt"), col_names=FALSE)
colnames(tlc)<-c('id', 'group', 'lead0', 'lead1', 'lead4', 'lead6')

kable(tlc[1:10,], caption="Wide form of TLC data") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Wide form of TLC data
id	group	lead0	lead1	lead4	lead6
1	P	30.8	26.9	25.8	23.8
2	A	26.5	14.8	19.5	21.0
3	A	25.8	23.0	19.1	23.2
4	P	24.7	24.5	22.0	22.5
5	A	20.4	2.8	3.2	9.4
6	A	20.4	5.4	4.5	11.9
7	P	28.6	20.8	19.2	18.4
8	P	33.7	31.6	28.5	25.1
9	P	19.7	14.9	15.3	14.7
10	P	31.1	31.2	29.2	30.1

Show case

Long to Wide

#one$Week<-as.character(one$Week)
one.w<-stats::reshape(one,
                      idvar = c("id","endog"),
                      timevar = "week",
                      v.names = "hamd", 
                      direction = "wide")

kable(one.w[1:10,], caption="Dataset looks like when Transposing Long to Wide") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Dataset looks like when Transposing Long to Wide
id	endog	hamd.c(0, 1, 2, 3, 4, 5)
101	0	NA
103	0	NA
104	1	NA
105	0	NA
106	1	NA
107	1	NA
108	1	NA
113	0	NA
114	0	NA
115	1	NA

Obviously, reshape function did not recognize the dimensions of data via the timevar when turning long to wide. Thus, we got the wrong wide form!?!

Rethinking of data: A dataset is a collection of values. Every value belongs to a variable and an observation (Wickham 2014).

one.w <- tidyr::spread(one, key = week, value = hamd, sep = "")
kable(one.w[1:10,], caption="Dataset looks like when Transposing Long to Wide using `spread` function") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Dataset looks like when Transposing Long to Wide using `spread` function
id	endog	week0	week1	week2	week3	week4	week5
101	0	26	22	18	7	4	3
103	0	33	24	15	24	15	13
104	1	29	22	18	13	19	0
105	0	22	12	16	16	13	9
106	1	21	25	23	18	20	NA
107	1	21	21	16	19	NA	6
108	1	21	22	11	9	9	7
113	0	21	23	19	23	23	NA
114	0	NA	17	11	13	7	7
115	1	NA	16	16	16	16	11

Currently, spread has been retired so we conveniently use the pivot_wider

one.w <- tidyr::pivot_wider(one, names_from = week, values_from = hamd, names_prefix = "w")
kable(one.w[1:10,], caption="Dataset looks like when Transposing Long to Wide using `pivot_wider` function") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Dataset looks like when Transposing Long to Wide using `pivot_wider` function
id	endog	w0	w1	w2	w3	w4	w5
101	0	26	22	18	7	4	3
103	0	33	24	15	24	15	13
104	1	29	22	18	13	19	0
105	0	22	12	16	16	13	9
106	1	21	25	23	18	20	NA
107	1	21	21	16	19	NA	6
108	1	21	22	11	9	9	7
113	0	21	23	19	23	23	NA
114	0	NA	17	11	13	7	7
115	1	NA	16	16	16	16	11

Wide to Long

As the conventional way

tlc.l <- stats::reshape(tlc,
                        varying = c("lead0", "lead1", "lead4", "lead6"),
                        v.names = "y",
                        timevar = "time",
                        times = c("0", "1", "4", "6"),
                        new.row.names = 1:1000,
                        direction = "long")

#tlc.l.sort <- tlc.l[order(tlc.l$id),]
kable(tlc.l[1:10,], caption="Dataset looks like when Transposing Wide to Long") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Dataset looks like when Transposing Wide to Long
id	group	y
1	P	30.8
2	A	26.5
3	A	25.8
4	P	24.7
5	A	20.4
6	A	20.4
7	P	28.6
8	P	33.7
9	P	19.7
10	P	31.1

Applying gather()

tlc.l<-gather(tlc, key = time, value = y, lead0:lead6)
kable(tlc.l[1:10,], caption="Dataset looks like when Transposing Wide to Long") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Dataset looks like when Transposing Wide to Long
id	group	time	y
1	P	lead0	30.8
2	A	lead0	26.5
3	A	lead0	25.8
4	P	lead0	24.7
5	A	lead0	20.4
6	A	lead0	20.4
7	P	lead0	28.6
8	P	lead0	33.7
9	P	lead0	19.7
10	P	lead0	31.1

Using pivot_longer() due to the retirement of gather()

tlc.l<-pivot_longer(tlc, names_to = "time", values_to = "y", lead0:lead6)

tlc.l<-separate(tlc.l, time, into = c("lead", "time"), sep = 4)%>%
  select(-lead)

kable(tlc.l[1:10,], caption="Dataset looks like when Transposing Wide to Long") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Dataset looks like when Transposing Wide to Long
id	group	time	y
1	P	0	30.8
1	P	1	26.9
1	P	4	25.8
1	P	6	23.8
2	A	0	26.5
2	A	1	14.8
2	A	4	19.5
2	A	6	21.0
3	A	0	25.8
3	A	1	23.0

I have gone further a little by using separate() to eliminate “lead” word in value of variable time, so it completely returned to it-should-be.

Bonus

Especially, if we use melt() in reshape (or reshape2), the data should be treated as data.frame()

tlc.dta.frm<-as.data.frame(tlc)
tlc.l<-melt(tlc.dta.frm, id=c("id","group"))
head(tlc.l)

##   id group variable value
## 1  1     P    lead0  30.8
## 2  2     A    lead0  26.5
## 3  3     A    lead0  25.8
## 4  4     P    lead0  24.7
## 5  5     A    lead0  20.4
## 6  6     A    lead0  20.4

Reshaping wide to long with multiple values columns, i.e. need to keep multiple fields for each record

(dw <- read.table(header=T, text='
 sbj f1.avg f1.sd f2.avg f2.sd  blabla
  A   10    6     50     10      bA
  B   12    5     70     11      bB
  C   20    7     20     8       bC
  D   22    8     22     9       bD'))

##   sbj f1.avg f1.sd f2.avg f2.sd blabla
## 1   A     10     6     50    10     bA
## 2   B     12     5     70    11     bB
## 3   C     20     7     20     8     bC
## 4   D     22     8     22     9     bD

melt(setDT(dw), measure.vars=list(c(2,4), c(3,5)), 
     variable.name='var', 
     value.name=c('avg', 'sd'))[,var:= paste0('f',var)][order(sbj)]

##    sbj blabla var avg sd
## 1:   A     bA  f1  10  6
## 2:   A     bA  f2  50 10
## 3:   B     bB  f1  12  5
## 4:   B     bB  f2  70 11
## 5:   C     bC  f1  20  7
## 6:   C     bC  f2  20  8
## 7:   D     bD  f1  22  8
## 8:   D     bD  f2  22  9

dw %>% 
  gather(v, value, f1.avg:f2.sd) %>% 
  separate(v, c("var", "col")) %>% 
  arrange(sbj) %>% 
  spread(col, value)

##   sbj blabla var avg sd
## 1   A     bA  f1  10  6
## 2   A     bA  f2  50 10
## 3   B     bB  f1  12  5
## 4   B     bB  f2  70 11
## 5   C     bC  f1  20  7
## 6   C     bC  f2  20  8
## 7   D     bD  f1  22  8
## 8   D     bD  f2  22  9

A note of comparison of tidyr and reshape2

They are very similar!
cast() in reshape2 can work on matrix/array, while gather() in tidyr can only work on data.frame.
reshape2 can do aggregation, while tidyr is not designed for this purpose.
colsplit() in reshape2 operates only on a single column while separate() in tidyr performs all the operation at once.

Reference

Donald Hedeker, Robert D. Gibbons. 2006. Longitudinal Data Analysis. Book. WILEY, A JOHN WILEY & SONS, INC.,

Garrett M. Fitzmaurice, James H. Ware, Nan M. Laird. 2011. Applied Longitudinal Analysis. Book. Canada: WILEY, A JOHN WILEY & SONS, INC.,

Reisby, Niels, Lars F. Gram, Per Bech, Adam Nagy, Gorm Odden Petersen, Jørgen Ortmann, Ilse Ibsen, et al. 1977. “Imipramine: Clinical Effects and Pharmacokinetic Variability.” Journal Article. Psychopharmacology 54 (3): 263–72. doi:10.1007/BF00426574.

Wickham, Hadley. 2014. “Tidy Data.” Journal Article. The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.

The Evolution of Reshaping Data: stats — tidyr

Hai Nguyen

June 22, 2020