Separating data

A common issue in processing data is getting it into a clean format. Survey data in particular can be challenging. A very frequent data processing task we are faced with is when multiple observations are contained in one row with data separated by e.g. a comma or a semi-colon.

I think there are three relatively straightforward ways of turning this type of data into new rows. One new one just came out today with the recent version (v0.5) of tidyr.

Get Data

These are some example data from our lab. We have three variables. The first is a timestamp - when the event occurred. The second is “Actor” this is the mouse (or mice) that won a contest. Each individual mouse has a number (1-12). The third is “Recipient” - this is the mouse or mice that lost a contest. We sometimes have two animals being a recipient at once - in the data they are separated like this “4, 10”. Very occassionally we have rows where both Actor and Recipient columns both have two mice in them.

df = read.csv("https://gist.githubusercontent.com/jalapic/9decd56936d112172c176e6e09e906ff/raw/caecc0c5b26c5cf3e497d7bf1b6c8a142796faf5/mouseexample.csv",stringsAsFactors = F)


head(df)

##            Timestamp Actor Recipient
## 1 10/8/2015 12:02:06     5         9
## 2 10/8/2015 12:02:36     5         9
## 3 10/8/2015 12:03:53     5         2
## 4 10/8/2015 12:04:03     5         9
## 5 10/8/2015 12:06:45     5         8
## 6 10/8/2015 12:08:24     5         4

df[297:302,]

##               Timestamp Actor Recipient
## 297 10/17/2015 13:15:24    12     2, 10
## 298 10/17/2015 13:16:48    12         4
## 299 10/17/2015 13:17:38    12         9
## 300 10/17/2015 13:18:40    12     4, 10
## 301 10/17/2015 13:19:05     3         4
## 302 10/17/2015 13:19:37    12         1

table(df$Recipient)

## 
##     1    10    11    12     2 2, 10     3     4 4, 10     5  5, 6  5, 9 
##    47   118    39    80    64     1    11    76     1   137     1     1 
##     6     7 7, 10     8     9 9, 10 
##    52   107     2    45   100     1

As you can see from the results of table as well as the subsetted data there are several instances of multiple individuals in the Recipient column. We want each of these to have their own row with the same information in the other variables replicated.

Here are the ways to do this:

First load the packages -

#we'll use these packages.
library(tidyr)
library(dplyr)
library(data.table)
library(splitstackshape)

## option 1. tidyr::separate_rows()

df %>% separate_rows(Recipient, sep=", ") -> df1

nrow(df)

## [1] 883

nrow(df1)

## [1] 890

table(df$Recipient)

## 
##     1    10    11    12     2 2, 10     3     4 4, 10     5  5, 6  5, 9 
##    47   118    39    80    64     1    11    76     1   137     1     1 
##     6     7 7, 10     8     9 9, 10 
##    52   107     2    45   100     1

table(df1$Recipient)

## 
##   1  10  11  12   2   3   4   5   6   7   8   9 
##  47 123  39  80  65  11  77 139  53 109  45 102

df1[297:310,]

## Source: local data frame [14 x 3]
## 
##              Timestamp Actor Recipient
##                  (chr) (int)     (chr)
## 1  10/17/2015 13:14:16     3         5
## 2  10/17/2015 13:15:24    12         2
## 3  10/17/2015 13:15:24    12        10
## 4  10/17/2015 13:16:48    12         4
## 5  10/17/2015 13:17:38    12         9
## 6  10/17/2015 13:18:40    12         4
## 7  10/17/2015 13:18:40    12        10
## 8  10/17/2015 13:19:05     3         4
## 9  10/17/2015 13:19:37    12         1
## 10 10/17/2015 13:24:49    12         5
## 11 10/17/2015 13:26:41    10         7
## 12 10/17/2015 13:27:40     3         2
## 13 10/17/2015 13:30:44     5        10
## 14 10/17/2015 13:37:07     5        10

You can see here that the events at 1.15pm and 1.18pm that were in the same row are now on separate rows.

## option 2. splitstackshape::cSplit()   (uses data.table)

df2 <- splitstackshape::cSplit(df, splitCols="Recipient", sep=", ", direction="long")

df2

##                Timestamp Actor Recipient
##   1:  10/8/2015 12:02:06     5         9
##   2:  10/8/2015 12:02:36     5         9
##   3:  10/8/2015 12:03:53     5         2
##   4:  10/8/2015 12:04:03     5         9
##   5:  10/8/2015 12:06:45     5         8
##  ---                                    
## 886: 10/29/2015 13:37:54     6         7
## 887: 10/29/2015 13:42:32     3         9
## 888: 10/29/2015 13:44:17    12        11
## 889: 10/29/2015 13:48:21     3        12
## 890: 10/29/2015 13:57:31     3        12

table(df1$Recipient)

## 
##   1  10  11  12   2   3   4   5   6   7   8   9 
##  47 123  39  80  65  11  77 139  53 109  45 102

table(df2$Recipient)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
##  47  65  11  77 139  53 109  45 102 123  39  80

## option 3. base  

df3 <- do.call(rbind, with(df, Map(expand.grid, 
                         Timestamp = Timestamp,
                         Actor = Actor,
                         Recipient = strsplit(Recipient, ", ")
            ))) 


table(df1$Recipient)

## 
##   1  10  11  12   2   3   4   5   6   7   8   9 
##  47 123  39  80  65  11  77 139  53 109  45 102

table(df3$Recipient)

## 
##   9   2   8   4   6   5   7  11   3   1  12  10 
## 102  65  45  77  53 139 109  39  11  47  80 123

There we have it ! Choose the one that fits the workflow the best.

Just as a final thing - the reason we want every observation to have its own row here is so we can look at things like the sociomatrix - how often each individual won against every other individual:

table(df2$Actor, df2$Recipient)

##     
##       1  2  3  4  5  6  7  8  9 10 11 12
##   1   0  1  7  0  0  1  1  1  1  0  0  0
##   2   3  0  0  2  3  4  4  2  0 13  2  0
##   3  15 21  0 44 44 17 35 17 22 36 10 77
##   4   4  1  0  0  7  1  1  0 22  7  2  0
##   5   4  9  1  2  0  5 11  4  9 10  5  3
##   6   3  0  0  0  0  0  4  0  0  1  1  0
##   8   0  0  0  0  0  0  1  0  0  0  0  0
##   9   0  0  0  1  0  0  2  1  0  3  2  0
##   10  7  4  0  4 26  9 21  4  8  0  3  0
##   11  1  0  0  0  2  2  1  0  0  0  0  0
##   12 10 29  3 24 57 14 28 16 40 53 14  0

Separating data

James P. Curley

June 13, 2016