Generally, reshape() is a pretty handy function (note that this function is in the stats package; there is also another package called reshape2, but it doesn’t have a function called reshape). I recently noticed that it doesn’t work quite the way I was expecting, though…
Let’s say I have some data from a super boring longitudinal study with 5 subjects assessed on two measures (a and b) over three timepoints. And let’s say I only actually assessed measure b at times 1 and 3, so there’s no data for b at time 2.
n <- 5 # sample size
# generate random data for measure A at all three timepoints, and measure B at times 1 and 3
data <- data.frame(subj=1:n, a_1=rnorm(n), a_2=rnorm(n), a_3=rnorm(n), b_1=rnorm(n), b_3=rnorm(n))
I want to reshape the data to long format, with one column for each measure (a and b) and three rows for each participant, for the three timepoints. All participants should have NA for measure b on time 2, since it was not collected.
# reshape(data, varying=2:6, sep="_", direction="long")
I commented it out because running the above command fails with the error “Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, : ‘varying’ arguments must be the same length”. It fails because measure b has only two columns (times 1 and 3) while measure a has three. I can fix that by adding an extra column of NAs for b_2:
data$b_2 <- NA
head(data)
## subj a_1 a_2 a_3 b_1 b_3 b_2
## 1 1 0.1130759 1.3168352 0.6017997 -2.3610015 -0.49012850 NA
## 2 2 2.0009421 0.7875619 -0.1601370 1.1072672 -0.50052236 NA
## 3 3 -0.6132944 1.1021857 0.2626477 0.6605574 0.02425462 NA
## 4 4 0.5200335 0.4082823 1.5374979 -0.1106323 0.03988946 NA
## 5 5 0.7429704 2.7038044 0.3422189 -0.9910181 0.23605807 NA
reshape(data, varying=2:7, sep="_", direction="long")
## subj time a b id
## 1.1 1 1 0.1130759 -2.36100146 1
## 2.1 2 1 2.0009421 1.10726721 2
## 3.1 3 1 -0.6132944 0.66055737 3
## 4.1 4 1 0.5200335 -0.11063231 4
## 5.1 5 1 0.7429704 -0.99101809 5
## 1.2 1 2 1.3168352 -0.49012850 1
## 2.2 2 2 0.7875619 -0.50052236 2
## 3.2 3 2 1.1021857 0.02425462 3
## 4.2 4 2 0.4082823 0.03988946 4
## 5.2 5 2 2.7038044 0.23605807 5
## 1.3 1 3 0.6017997 NA 1
## 2.3 2 3 -0.1601370 NA 2
## 3.3 3 3 0.2626477 NA 3
## 4.3 4 3 1.5374979 NA 4
## 5.3 5 3 0.3422189 NA 5
This returns a result that looks correct at first glance (the data are now in long format), but sep="_" didn’t correctly retain the times. Time 3, not time 2, is listed as all NAs for measure b, and the b values shown at time 2 are actually the b_3 values. It appears that reshape() assigns columns to times by their order within each measure rather than by the value after the separator, and b_2 was added last, so it ended up in the third position. The help documentation says that sep is “used for guessing v.names and times arguments based on the names in varying,” which suggested to me that each column’s time would be taken from its name. But not so. User beware.
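If the problem really is column order, two workarounds should help. The sketch below rebuilds the toy data and then either sorts the varying columns before reshaping, or spells out the column-to-time mapping in a list. The seed and the names varying_cols, data_sorted, long and long_explicit are my own additions for illustration. Note that the sort() approach assumes single-digit time suffixes, since a_10 would sort before a_2.

```r
# Rebuild the toy data; the seed is only here to make the sketch reproducible
set.seed(1)
n <- 5
data <- data.frame(subj = 1:n, a_1 = rnorm(n), a_2 = rnorm(n), a_3 = rnorm(n),
                   b_1 = rnorm(n), b_3 = rnorm(n))
data$b_2 <- NA_real_  # placeholder for the uncollected measure, appended last

# Option 1: put the varying columns in time order within each measure
varying_cols <- sort(setdiff(names(data), "subj"))  # a_1 a_2 a_3 b_1 b_2 b_3
data_sorted <- data[, c("subj", varying_cols)]
long <- reshape(data_sorted, varying = varying_cols, sep = "_",
                direction = "long")

# Option 2: leave the columns where they are and state the mapping explicitly
long_explicit <- reshape(data,
                         varying = list(c("a_1", "a_2", "a_3"),
                                        c("b_1", "b_2", "b_3")),
                         v.names = c("a", "b"),
                         timevar = "time", times = 1:3,
                         direction = "long")

# In both results, measure b should be NA at time 2 and only at time 2
long[long$time == 2, c("subj", "time", "a", "b")]
```

Either way, it is worth checking a few rows of the long data against the wide columns before trusting the time variable.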