tidyr now is in the tidyverse
#knitr::opts_chunk$set(echo = TRUE)
library(readr) # function: read_table
library(kableExtra)
#library(tidyr)
library(tidyverse) # tidyr in tidyverse
library(reshape2) # use melt()
library(data.table) # setDT
– reshape2 is an R package written by Hadley Wickham that makes it easy to transform data between wide and long formats.
– tidyr replaces reshape2 (2010-2014) and reshape (2005-2010).
– tidyr is designed specifically for tidying data, not general reshaping (reshape2), or the general aggregation (reshape).
| Method | Wide \(\rightarrow\) Long | Long \(\rightarrow\) Wide |
|---|---|---|
reshape2 |
melt | cast |
tidyr |
gather | spread |
| spreadsheet | unpivot | pivot |
| database | fold | unfold |
| Long form | Wide form |
|---|---|
| – has a column for possible variable types and a column for the values of those variables. – not necessarily only two columns. – widely used in many tools, such as ggplot2, plyr, lm(), glm(), etc. |
– is easier and intuitive for recording data. |
I used an example from Riesby and al. (Reisby et al. 1977) examined the relationship between Imipramine (IMI) and Desipramine (DMI) plasma levels and clinical response in 66 depressed inpatients (37 endogenous and 29 non-endogenous), patients received 225 mg/day doses of imipramine for 4 weeks (Donald Hedeker 2006). Dataset is including:
one<-read_table(url("https://hedeker.people.uic.edu/RIESBY.DAT.txt"), col_names=FALSE, col_types=cols(.default = col_double()))
colnames(one)<-c("NULL","id", "hamd", "intcpt", "week", "endog", "endweek")
one<-one[-dim(one)[1],-c(1,4,7)]
kable(one[1:10,], caption="Long form of Riesby data") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| id | hamd | week | endog |
|---|---|---|---|
| 101 | 26 | 0 | 0 |
| 101 | 22 | 1 | 0 |
| 101 | 18 | 2 | 0 |
| 101 | 7 | 3 | 0 |
| 101 | 4 | 4 | 0 |
| 101 | 3 | 5 | 0 |
| 103 | 33 | 0 | 0 |
| 103 | 24 | 1 | 0 |
| 103 | 15 | 2 | 0 |
| 103 | 24 | 3 | 0 |
Another example is the Treatment of Lead Exposed Children Trial (N=100, TLC trial) showed the blood lead level from week 0 (baseline) to week 1, week 4, and week 6 for the succimer and placebo groups from Applied Longitudinal Analysis, 2nd Edition by Garrett Fitzmaurice, Nan Laird & James Ware (Garrett M. Fitzmaurice 2011)
tlc<-read_table(url("https://content.sph.harvard.edu/fitzmaur/ala2e/tlc-data.txt"), col_names=FALSE)
colnames(tlc)<-c('id', 'group', 'lead0', 'lead1', 'lead4', 'lead6')
kable(tlc[1:10,], caption="Wide form of TLC data") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| id | group | lead0 | lead1 | lead4 | lead6 |
|---|---|---|---|---|---|
| 1 | P | 30.8 | 26.9 | 25.8 | 23.8 |
| 2 | A | 26.5 | 14.8 | 19.5 | 21.0 |
| 3 | A | 25.8 | 23.0 | 19.1 | 23.2 |
| 4 | P | 24.7 | 24.5 | 22.0 | 22.5 |
| 5 | A | 20.4 | 2.8 | 3.2 | 9.4 |
| 6 | A | 20.4 | 5.4 | 4.5 | 11.9 |
| 7 | P | 28.6 | 20.8 | 19.2 | 18.4 |
| 8 | P | 33.7 | 31.6 | 28.5 | 25.1 |
| 9 | P | 19.7 | 14.9 | 15.3 | 14.7 |
| 10 | P | 31.1 | 31.2 | 29.2 | 30.1 |
#one$Week<-as.character(one$Week)
one.w<-stats::reshape(one,
idvar = c("id","endog"),
timevar = "week",
v.names = "hamd",
direction = "wide")
kable(one.w[1:10,], caption="Dataset looks like when Transposing Long to Wide") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| id | endog | hamd.c(0, 1, 2, 3, 4, 5) |
|---|---|---|
| 101 | 0 | NA |
| 103 | 0 | NA |
| 104 | 1 | NA |
| 105 | 0 | NA |
| 106 | 1 | NA |
| 107 | 1 | NA |
| 108 | 1 | NA |
| 113 | 0 | NA |
| 114 | 0 | NA |
| 115 | 1 | NA |
Obviously, reshape function did not recognize the dimensions of data via the timevar when turning long to wide. Thus, we got the wrong wide form!?!
Rethinking of data: A dataset is a collection of values. Every value belongs to a variable and an observation (Wickham 2014).
one.w <- tidyr::spread(one, key = week, value = hamd, sep = "")
kable(one.w[1:10,], caption="Dataset looks like when Transposing Long to Wide using `spread` function") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| id | endog | week0 | week1 | week2 | week3 | week4 | week5 |
|---|---|---|---|---|---|---|---|
| 101 | 0 | 26 | 22 | 18 | 7 | 4 | 3 |
| 103 | 0 | 33 | 24 | 15 | 24 | 15 | 13 |
| 104 | 1 | 29 | 22 | 18 | 13 | 19 | 0 |
| 105 | 0 | 22 | 12 | 16 | 16 | 13 | 9 |
| 106 | 1 | 21 | 25 | 23 | 18 | 20 | NA |
| 107 | 1 | 21 | 21 | 16 | 19 | NA | 6 |
| 108 | 1 | 21 | 22 | 11 | 9 | 9 | 7 |
| 113 | 0 | 21 | 23 | 19 | 23 | 23 | NA |
| 114 | 0 | NA | 17 | 11 | 13 | 7 | 7 |
| 115 | 1 | NA | 16 | 16 | 16 | 16 | 11 |
Currently, spread has been retired so we conveniently use the pivot_wider
one.w <- tidyr::pivot_wider(one, names_from = week, values_from = hamd, names_prefix = "w")
kable(one.w[1:10,], caption="Dataset looks like when Transposing Long to Wide using `pivot_wider` function") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| id | endog | w0 | w1 | w2 | w3 | w4 | w5 |
|---|---|---|---|---|---|---|---|
| 101 | 0 | 26 | 22 | 18 | 7 | 4 | 3 |
| 103 | 0 | 33 | 24 | 15 | 24 | 15 | 13 |
| 104 | 1 | 29 | 22 | 18 | 13 | 19 | 0 |
| 105 | 0 | 22 | 12 | 16 | 16 | 13 | 9 |
| 106 | 1 | 21 | 25 | 23 | 18 | 20 | NA |
| 107 | 1 | 21 | 21 | 16 | 19 | NA | 6 |
| 108 | 1 | 21 | 22 | 11 | 9 | 9 | 7 |
| 113 | 0 | 21 | 23 | 19 | 23 | 23 | NA |
| 114 | 0 | NA | 17 | 11 | 13 | 7 | 7 |
| 115 | 1 | NA | 16 | 16 | 16 | 16 | 11 |
As the conventional way
tlc.l <- stats::reshape(tlc,
varying = c("lead0", "lead1", "lead4", "lead6"),
v.names = "y",
timevar = "time",
times = c("0", "1", "4", "6"),
new.row.names = 1:1000,
direction = "long")
#tlc.l.sort <- tlc.l[order(tlc.l$id),]
kable(tlc.l[1:10,], caption="Dataset looks like when Transposing Wide to Long") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| id | group | time | y |
|---|---|---|---|
| 1 | P | 0 | 30.8 |
| 2 | A | 0 | 26.5 |
| 3 | A | 0 | 25.8 |
| 4 | P | 0 | 24.7 |
| 5 | A | 0 | 20.4 |
| 6 | A | 0 | 20.4 |
| 7 | P | 0 | 28.6 |
| 8 | P | 0 | 33.7 |
| 9 | P | 0 | 19.7 |
| 10 | P | 0 | 31.1 |
Applying gather()
tlc.l<-gather(tlc, key = time, value = y, lead0:lead6)
kable(tlc.l[1:10,], caption="Dataset looks like when Transposing Wide to Long") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| id | group | time | y |
|---|---|---|---|
| 1 | P | lead0 | 30.8 |
| 2 | A | lead0 | 26.5 |
| 3 | A | lead0 | 25.8 |
| 4 | P | lead0 | 24.7 |
| 5 | A | lead0 | 20.4 |
| 6 | A | lead0 | 20.4 |
| 7 | P | lead0 | 28.6 |
| 8 | P | lead0 | 33.7 |
| 9 | P | lead0 | 19.7 |
| 10 | P | lead0 | 31.1 |
Using pivot_longer() due to the retirement of gather()
tlc.l<-pivot_longer(tlc, names_to = "time", values_to = "y", lead0:lead6)
tlc.l<-separate(tlc.l, time, into = c("lead", "time"), sep = 4)%>%
select(-lead)
kable(tlc.l[1:10,], caption="Dataset looks like when Transposing Wide to Long") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| id | group | time | y |
|---|---|---|---|
| 1 | P | 0 | 30.8 |
| 1 | P | 1 | 26.9 |
| 1 | P | 4 | 25.8 |
| 1 | P | 6 | 23.8 |
| 2 | A | 0 | 26.5 |
| 2 | A | 1 | 14.8 |
| 2 | A | 4 | 19.5 |
| 2 | A | 6 | 21.0 |
| 3 | A | 0 | 25.8 |
| 3 | A | 1 | 23.0 |
I have gone further a little by using separate() to eliminate “lead” word in value of variable time, so it completely returned to it-should-be.
melt() in reshape (or reshape2), the data should be treated as data.frame()tlc.dta.frm<-as.data.frame(tlc)
tlc.l<-melt(tlc.dta.frm, id=c("id","group"))
head(tlc.l)
## id group variable value
## 1 1 P lead0 30.8
## 2 2 A lead0 26.5
## 3 3 A lead0 25.8
## 4 4 P lead0 24.7
## 5 5 A lead0 20.4
## 6 6 A lead0 20.4
(dw <- read.table(header=T, text='
sbj f1.avg f1.sd f2.avg f2.sd blabla
A 10 6 50 10 bA
B 12 5 70 11 bB
C 20 7 20 8 bC
D 22 8 22 9 bD'))
## sbj f1.avg f1.sd f2.avg f2.sd blabla
## 1 A 10 6 50 10 bA
## 2 B 12 5 70 11 bB
## 3 C 20 7 20 8 bC
## 4 D 22 8 22 9 bD
melt(setDT(dw), measure.vars=list(c(2,4), c(3,5)),
variable.name='var',
value.name=c('avg', 'sd'))[,var:= paste0('f',var)][order(sbj)]
## sbj blabla var avg sd
## 1: A bA f1 10 6
## 2: A bA f2 50 10
## 3: B bB f1 12 5
## 4: B bB f2 70 11
## 5: C bC f1 20 7
## 6: C bC f2 20 8
## 7: D bD f1 22 8
## 8: D bD f2 22 9
dw %>%
gather(v, value, f1.avg:f2.sd) %>%
separate(v, c("var", "col")) %>%
arrange(sbj) %>%
spread(col, value)
## sbj blabla var avg sd
## 1 A bA f1 10 6
## 2 A bA f2 50 10
## 3 B bB f1 12 5
## 4 B bB f2 70 11
## 5 C bC f1 20 7
## 6 C bC f2 20 8
## 7 D bD f1 22 8
## 8 D bD f2 22 9
cast() in reshape2 can work on matrix/array, while gather() in tidyr can only work on data.frame.colsplit() in reshape2 operates only on a single column while separate() in tidyr performs all the operation at once.Donald Hedeker, Robert D. Gibbons. 2006. Longitudinal Data Analysis. Book. WILEY, A JOHN WILEY & SONS, INC.,
Garrett M. Fitzmaurice, James H. Ware, Nan M. Laird. 2011. Applied Longitudinal Analysis. Book. Canada: WILEY, A JOHN WILEY & SONS, INC.,
Reisby, Niels, Lars F. Gram, Per Bech, Adam Nagy, Gorm Odden Petersen, Jørgen Ortmann, Ilse Ibsen, et al. 1977. “Imipramine: Clinical Effects and Pharmacokinetic Variability.” Journal Article. Psychopharmacology 54 (3): 263–72. doi:10.1007/BF00426574.
Wickham, Hadley. 2014. “Tidy Data.” Journal Article. The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.