Editing Text Columns by Selecting One Element of Text

When cleaning a dataset, I often find myself needing to edit categorical variables. Usually, I need to split the character variable on a space, comma, or other character and select a specific element of the split as a variable for classification.

For example, if my dataset includes a datetime column, I may wish to create a new variable that includes just the date (and not the time) in order to segment my data by day. This requires me to parse the datetime column by splitting it on a space and selecting the first element (the date).

I used to proceed as follows:

data$NewColumn <- sapply(strsplit(as.character(data$ColumnToSplit), " "), function(x){x[1]})

Writing this out every time I need to create a new variable based on splitting text can be annoying and error prone. I have done it enough times that writing a function seemed best. Below is a function that performs the same operation as above. The function asks you to select a character to split on and the element you would like to select from that split.

#Function that splits a column on a specific character and returns the nth element of that split

parse <- function(data, ..., element, fixed = FALSE){
  output <- sapply(strsplit(data, ..., fixed = fixed), function(x){x[element]})
  return(output)
}

The strsplit() function has a fixed argument. The parse() function above sets the fixed argument to FALSE, as strsplit() does by default. Setting it to TRUE disables regex matching, which is useful when splitting on regex special characters such as “.” or “|”.
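
To make the difference concrete, here is a small sketch using plain strsplit() from base R. The input string is made up for illustration.

```r
# Illustrative input (not from a real dataset)
x <- "test.senario"

# Default (fixed = FALSE): "." is read as a regex that matches any character,
# so the split does not return the two pieces we are after
regex_split <- strsplit(x, ".")[[1]]

# fixed = TRUE treats "." as a literal character
fixed_split <- strsplit(x, ".", fixed = TRUE)[[1]]

# Escaping the dot in the regex is an alternative that gives the same pieces
escaped_split <- strsplit(x, "\\.")[[1]]

print(fixed_split)
print(escaped_split)
```

Using fixed = TRUE is usually the simpler choice when the separator is a single literal character, since it avoids having to remember which characters need escaping.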

Examples of Parse Function in Practice

First is an example of the function working on simple text.

#returns second element.  Must specify element = 2 explicitly
parse(c("the quick brown fox", "Bears Beets Battlestargalactica"), " ", element = 2)
## [1] "quick" "Beets"
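
One behavior worth knowing about, sketched below with made-up strings: when a string has fewer pieces than the element requested, R's vector indexing returns NA for that entry instead of raising an error. The function is repeated here only so the snippet runs on its own.

```r
# Same function as above, repeated so this snippet is self-contained
parse <- function(data, ..., element, fixed = FALSE){
  output <- sapply(strsplit(data, ..., fixed = fixed), function(x){x[element]})
  return(output)
}

# The second string has only one piece, so asking for element 2 yields NA
result <- parse(c("the quick brown fox", "Bears"), " ", element = 2)
print(result)

# Rows that could not be parsed can then be located with is.na()
unparsed <- which(is.na(result))
print(unparsed)
```

Checking for NA after parsing a real column is a cheap way to spot rows that do not follow the expected format.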

The parse function can be applied to an entire column of a large dataset. Below is an example of its use on a data frame. The new column holds the parsed text.

#Create a data frame
data <- data.frame(a = "test_senario", b = 11000:11050, stringsAsFactors = F)

data$NewColumn <- parse(data$a, "_", element = 1) #result should be test for all results in new column

head(data)
##              a     b NewColumn
## 1 test_senario 11000      test
## 2 test_senario 11001      test
## 3 test_senario 11002      test
## 4 test_senario 11003      test
## 5 test_senario 11004      test
## 6 test_senario 11005      test
The next example sets fixed to TRUE so that the split happens on a literal ".".

#set fixed to TRUE to split on "."
data <- data.frame(a = "test.senario", b = 11000:11050, stringsAsFactors = F)

data$NewColumn <- parse(data$a, ".", element = 2, fixed = T) #result should be senario for all results in new column

head(data)
##              a     b NewColumn
## 1 test.senario 11000   senario
## 2 test.senario 11001   senario
## 3 test.senario 11002   senario
## 4 test.senario 11003   senario
## 5 test.senario 11004   senario
## 6 test.senario 11005   senario
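
Returning to the datetime case from the introduction, the same function can pull out the date. This is a sketch with a made-up data frame; the names events, Timestamp, and Date are hypothetical, and the function is repeated so the snippet runs on its own.

```r
# Same function as above, repeated so this snippet is self-contained
parse <- function(data, ..., element, fixed = FALSE){
  output <- sapply(strsplit(data, ..., fixed = fixed), function(x){x[element]})
  return(output)
}

# Hypothetical data: datetimes stored as text, date and time separated by a space
events <- data.frame(
  Timestamp = c("2020-01-15 08:30:00", "2020-01-15 17:45:10", "2020-01-16 09:00:00"),
  stringsAsFactors = FALSE
)

# Keep only the date part (the first element of the split)
events$Date <- parse(events$Timestamp, " ", element = 1)

# The new column can now be used to segment the data by day
day_counts <- table(events$Date)

print(events)
print(day_counts)
```

Note that the result is still a character column. If you need date arithmetic, converting it with as.Date() would be a natural next step.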