Parsing and Concatenating

Parsing splits a string into its component parts, much like the text-to-column function in Microsoft Excel.
Concatenating does the reverse: it stitches individual elements together into a single string.
Let’s download the dataset that we will be using in this session.

# Download and untar the archive only if the data file is not already present
if(!file.exists("gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0/CHOL.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt")) {
  download.file(url="http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/CHOL/20160128/gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz", destfile="gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz", method="curl")

  # Untar file
  untar(tarfile="gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz")
}

# Read data frame
df <- read.table("gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0/CHOL.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt", header=TRUE, sep="\t", stringsAsFactors=FALSE)

# Extract the IDs, which are stored as the column names
df <- data.frame("ID"=names(df))

# Remove the first row, which is not an ID
df <- df[-1, , drop=FALSE]
head(df)

##                             ID
## 2 TCGA.3X.AAV9.01A.72R.A41I.07
## 3 TCGA.3X.AAVA.01A.11R.A41I.07
## 4 TCGA.3X.AAVB.01A.31R.A41I.07
## 5 TCGA.3X.AAVC.01A.21R.A41I.07
## 6 TCGA.3X.AAVE.01A.11R.A41I.07
## 7 TCGA.4G.AAZO.01A.12R.A41I.07

The drop=FALSE argument keeps the result as a data frame. Without it, subsetting a data frame that has only one column returns a vector.
Refer to Subsetting Data Frames for more info on subsetting data frames.
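
As a minimal sketch of what drop=FALSE changes, the toy data frame below is subsetted both ways. The name toy and its contents are made up for illustration and are not part of the dataset.

```r
# Hypothetical one-column data frame, for illustration only
toy <- data.frame(ID=c("Hybridization.REF", "TCGA.3X.AAV9", "TCGA.3X.AAVA"))

# Without drop=FALSE, the single remaining column collapses to a vector
as_vector <- toy[-1, ]
is.data.frame(as_vector)   # FALSE

# With drop=FALSE, the result stays a data frame
as_df <- toy[-1, , drop=FALSE]
is.data.frame(as_df)       # TRUE
```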

Parsing

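The strsplit() function splits each element of a character vector at a user-defined delimiter.
In this example, we will parse the elements using the period (“.”) as the delimiter. Because strsplit() interprets the split argument as a regular expression, in which an unescaped period matches any character, the period is written as "\\." to match it literally.
The function takes in character elements. Depending on the R version, data.frame() may have stored the IDs as a factor (this was the default before R 4.0.0), so we convert them to characters first using the as.character() function. The conversion is harmless if they are already characters.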

ID_split <- strsplit(as.character(df$ID), split="\\.")
head(ID_split)

## [[1]]
## [1] "TCGA" "3X"   "AAV9" "01A"  "72R"  "A41I" "07"  
## 
## [[2]]
## [1] "TCGA" "3X"   "AAVA" "01A"  "11R"  "A41I" "07"  
## 
## [[3]]
## [1] "TCGA" "3X"   "AAVB" "01A"  "31R"  "A41I" "07"  
## 
## [[4]]
## [1] "TCGA" "3X"   "AAVC" "01A"  "21R"  "A41I" "07"  
## 
## [[5]]
## [1] "TCGA" "3X"   "AAVE" "01A"  "11R"  "A41I" "07"  
## 
## [[6]]
## [1] "TCGA" "4G"   "AAZO" "01A"  "12R"  "A41I" "07"
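
The escaping used above matters because split is treated as a regular expression. The sketch below, on a single shortened ID (example_id is a hypothetical name), compares the escaped pattern with the fixed=TRUE argument of strsplit(), which matches the delimiter as a plain string, and shows what happens when the period is left unescaped.

```r
# Shortened ID, for illustration only
example_id <- "TCGA.3X.AAV9"

# Escaped period: the regular expression matches a literal "."
escaped <- strsplit(example_id, split="\\.")[[1]]

# fixed=TRUE: the delimiter is matched as a plain string, so no escaping is needed
literal <- strsplit(example_id, split=".", fixed=TRUE)[[1]]

identical(escaped, literal)   # TRUE

# Unescaped period: "." matches every character, leaving only empty strings
unescaped <- strsplit(example_id, split=".")[[1]]
all(unescaped == "")          # TRUE
```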

Retrieving parsed elements

strsplit() returns a list of vectors, with each element of the list holding the now-separated parts of the corresponding original element.
The next task is to retrieve the specific elements of interest.
Here, we would like to retrieve the first 3 elements of each vector.

# Create a function to retrieve the 1st element
retrieve_first_element <- function(x) {
  x[1]
}

# Create a function to retrieve the 2nd element
retrieve_second_element <- function(x) {
  x[2]
}

# Create a function to retrieve the 3rd element
retrieve_third_element <- function(x) {
  x[3]
}

# Use the functions created to retrieve the 1st, 2nd, and 3rd elements
first_element <- sapply(ID_split, retrieve_first_element)

second_element <- sapply(ID_split, retrieve_second_element)

third_element <- sapply(ID_split, retrieve_third_element)

head(first_element); head(second_element); head(third_element)

## [1] "TCGA" "TCGA" "TCGA" "TCGA" "TCGA" "TCGA"

## [1] "3X" "3X" "3X" "3X" "3X" "4G"

## [1] "AAV9" "AAVA" "AAVB" "AAVC" "AAVE" "AAZO"
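
The three helper functions differ only in the index they use, so the same result can be obtained more compactly. The sketch below passes the subsetting operator "[" directly to sapply(). ID_split_demo is a hypothetical stand-in for ID_split, built from two of the IDs shown earlier so that the block runs on its own.

```r
# Hypothetical stand-in for ID_split, built from two of the IDs shown earlier
ID_split_demo <- strsplit(c("TCGA.3X.AAV9.01A.72R.A41I.07",
                            "TCGA.4G.AAZO.01A.12R.A41I.07"), split="\\.")

# "[" is itself a function; the extra argument is the index to extract
first_demo  <- sapply(ID_split_demo, "[", 1)
second_demo <- sapply(ID_split_demo, "[", 2)
third_demo  <- sapply(ID_split_demo, "[", 3)

# vapply() does the same, but also checks that each result is one character string
third_checked <- vapply(ID_split_demo, function(x) x[3], character(1))
```

Either form gives the same vectors as the named helper functions. The explicit functions are easier to read for newcomers, while the "[" form avoids repeating near-identical definitions.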

Concatenating

Let’s use the parsed elements from the previous section for concatenation.
The paste() function may be used for concatenation.
We will use a dash (“-”) as the delimiter in this example, but any other string may be used. Use the sep argument to specify the delimiter.

combined <- paste(first_element, second_element, third_element, sep="-")

# Create a new column based on the new IDs
df$ID2 <- combined
head(df)

##                             ID          ID2
## 2 TCGA.3X.AAV9.01A.72R.A41I.07 TCGA-3X-AAV9
## 3 TCGA.3X.AAVA.01A.11R.A41I.07 TCGA-3X-AAVA
## 4 TCGA.3X.AAVB.01A.31R.A41I.07 TCGA-3X-AAVB
## 5 TCGA.3X.AAVC.01A.21R.A41I.07 TCGA-3X-AAVC
## 6 TCGA.3X.AAVE.01A.11R.A41I.07 TCGA-3X-AAVE
## 7 TCGA.4G.AAZO.01A.12R.A41I.07 TCGA-4G-AAZO

The default delimiter, i.e. when none is specified, is a single space (" ").

combined2 <- paste(first_element, second_element, third_element)
df$ID3 <- combined2
head(df)

##                             ID          ID2          ID3
## 2 TCGA.3X.AAV9.01A.72R.A41I.07 TCGA-3X-AAV9 TCGA 3X AAV9
## 3 TCGA.3X.AAVA.01A.11R.A41I.07 TCGA-3X-AAVA TCGA 3X AAVA
## 4 TCGA.3X.AAVB.01A.31R.A41I.07 TCGA-3X-AAVB TCGA 3X AAVB
## 5 TCGA.3X.AAVC.01A.21R.A41I.07 TCGA-3X-AAVC TCGA 3X AAVC
## 6 TCGA.3X.AAVE.01A.11R.A41I.07 TCGA-3X-AAVE TCGA 3X AAVE
## 7 TCGA.4G.AAZO.01A.12R.A41I.07 TCGA-4G-AAZO TCGA 4G AAZO
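
Two related options may also be useful, sketched here on made-up vectors (parts_a and parts_b are hypothetical names): paste0() concatenates with no delimiter at all, and the collapse argument of paste() joins the resulting elements into one string.

```r
# Made-up vectors, for illustration only
parts_a <- c("TCGA", "TCGA")
parts_b <- c("3X", "4G")

# paste0() behaves like paste() with sep=""
no_delim <- paste0(parts_a, parts_b)
no_delim     # "TCGA3X" "TCGA4G"

# sep joins element-wise across vectors; collapse then joins the results
one_string <- paste(parts_a, parts_b, sep="-", collapse=";")
one_string   # "TCGA-3X;TCGA-4G"
```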