rethink reshape()

Generally, reshape() is a pretty handy function (note that it lives in the stats package; there is also a separate package called reshape2, but it doesn’t have a function called reshape). I recently noticed that it doesn’t work quite the way I expected, though…

Let’s say I have some data from a super boring longitudinal study with 5 subjects assessed on two measures (a and b) over three timepoints. And let’s say I only actually assessed measure b at times 1 and 3, so there’s no data for b at time 2.

n <- 5 # sample size
# generate random data for measure A at all three timepoints, and measure B at times 1 and 3
data <- data.frame(subj=1:n, a_1=rnorm(n), a_2=rnorm(n), a_3=rnorm(n), b_1=rnorm(n), b_3=rnorm(n))

I want to reshape the data to long format, with one column for each measure (a and b) and three rows for each participant, one per timepoint. All participants should have NA for measure b at time 2, since it was not collected.

# reshape(data, varying=2:6, sep="_", direction="long")

I commented it out, but if you run the above command, it returns the error “Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, : ‘varying’ arguments must be the same length”. It fails because there are only two timepoints for b but three for a, so the groups of columns it guesses from the names have different lengths. I can fix that by adding an extra column of NAs for b_2:

data$b_2 <- NA
head(data)

##   subj        a_1       a_2        a_3        b_1         b_3 b_2
## 1    1  0.1130759 1.3168352  0.6017997 -2.3610015 -0.49012850  NA
## 2    2  2.0009421 0.7875619 -0.1601370  1.1072672 -0.50052236  NA
## 3    3 -0.6132944 1.1021857  0.2626477  0.6605574  0.02425462  NA
## 4    4  0.5200335 0.4082823  1.5374979 -0.1106323  0.03988946  NA
## 5    5  0.7429704 2.7038044  0.3422189 -0.9910181  0.23605807  NA

reshape(data, varying=2:7, sep="_", direction="long")

##     subj time          a           b id
## 1.1    1    1  0.1130759 -2.36100146  1
## 2.1    2    1  2.0009421  1.10726721  2
## 3.1    3    1 -0.6132944  0.66055737  3
## 4.1    4    1  0.5200335 -0.11063231  4
## 5.1    5    1  0.7429704 -0.99101809  5
## 1.2    1    2  1.3168352 -0.49012850  1
## 2.2    2    2  0.7875619 -0.50052236  2
## 3.2    3    2  1.1021857  0.02425462  3
## 4.2    4    2  0.4082823  0.03988946  4
## 5.2    5    2  2.7038044  0.23605807  5
## 1.3    1    3  0.6017997          NA  1
## 2.3    2    3 -0.1601370          NA  2
## 3.3    3    3  0.2626477          NA  3
## 4.3    4    3  1.5374979          NA  4
## 5.3    5    3  0.3422189          NA  5

This returns a result that at first glance looks correct (it got the data into long format for me), but you’ll notice that sep="_" didn’t correctly retain the times. Time 3, not time 2, is listed as all NAs for measure b, and the b values shown at time 2 are really the b_3 values. It appears to use the order of the columns to determine the numbers for time rather than the value after the separator: b_2 was appended after b_3, so it gets treated as the third timepoint. In the help documentation, it says that sep is “used for guessing v.names and times arguments based on the names in varying,” which suggests to me that it should take the times from the varying variable names. But not so. User beware.
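One possible workaround, given that column order seems to be what matters, is to put the time-varying columns in order before calling reshape(). The following is only a sketch under some assumptions: every varying column is named measure_time, the time suffixes are numeric, and the names `varying_cols`, `parts`, `ord`, `data_sorted`, and `long` are made up for illustration. The seed is arbitrary and only there to make the example reproducible.

```r
set.seed(1)  # arbitrary seed, not part of the original example
n <- 5
data <- data.frame(subj = 1:n, a_1 = rnorm(n), a_2 = rnorm(n), a_3 = rnorm(n),
                   b_1 = rnorm(n), b_3 = rnorm(n))
data$b_2 <- NA_real_  # placeholder for the timepoint that was not collected

# Split each varying column name into its measure and its time suffix
varying_cols <- setdiff(names(data), "subj")
parts <- do.call(rbind, strsplit(varying_cols, "_", fixed = TRUE))

# Order by measure, then by numeric time, so that b_2 sits between b_1 and b_3
ord <- order(parts[, 1], as.numeric(parts[, 2]))
varying_cols <- varying_cols[ord]
data_sorted <- data[, c("subj", varying_cols)]

long <- reshape(data_sorted, varying = varying_cols, sep = "_",
                idvar = "subj", direction = "long")

# Check, per timepoint, whether measure b is entirely missing
with(long, tapply(is.na(b), time, all))
```

If the reordering did its job, the final check should flag only time 2 as all NA for b. Sorting on the numeric suffix rather than on the raw names is a deliberate choice here, since a plain alphabetical sort would put a hypothetical a_10 before a_2.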