Parsing and Concatenating

# Define the archive name, the download URL, and the path to the extracted data file
archive_dir <- "gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0"
tar_file <- paste0(archive_dir, ".tar.gz")
data_file <- file.path(archive_dir, "CHOL.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt")
base_url <- "http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/CHOL/20160128/"

# Download and untar the archive only if the data file is not already present
if(!file.exists(data_file)) {
  download.file(url=paste0(base_url, tar_file), destfile=tar_file, method="curl")
  untar(tarfile=tar_file)
}

# Read data frame
df <- read.table(data_file, header=TRUE, sep="\t", stringsAsFactors=FALSE)

# Extract the IDs
df <- data.frame("ID"=names(df))

# Remove the first row, which is not an ID
df <- df[-1, , drop=FALSE]
head(df)
##                             ID
## 2 TCGA.3X.AAV9.01A.72R.A41I.07
## 3 TCGA.3X.AAVA.01A.11R.A41I.07
## 4 TCGA.3X.AAVB.01A.31R.A41I.07
## 5 TCGA.3X.AAVC.01A.21R.A41I.07
## 6 TCGA.3X.AAVE.01A.11R.A41I.07
## 7 TCGA.4G.AAZO.01A.12R.A41I.07
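
Whether the ID column ends up as a factor or as a character vector depends on the R version: before R 4.0.0, data.frame() converted strings to factors by default, whereas later versions keep them as character. The following sketch uses two IDs copied from the output above and hypothetical variable names (ids, df_factor, df_char) to show how to check the column type and why it matters for string functions.

```r
ids <- c("TCGA.3X.AAV9.01A.72R.A41I.07", "TCGA.3X.AAVA.01A.11R.A41I.07")

# Force each behaviour explicitly so the result does not depend on the R version
df_factor <- data.frame(ID = ids, stringsAsFactors = TRUE)
df_char   <- data.frame(ID = ids, stringsAsFactors = FALSE)

class(df_factor$ID)
class(df_char$ID)

# strsplit() rejects factors, so this call fails and is caught by try()
split_attempt <- try(strsplit(df_factor$ID, split = "\\."), silent = TRUE)
inherits(split_attempt, "try-error")

# as.character() is safe for both column types and gives the same strings
identical(as.character(df_factor$ID), as.character(df_char$ID))
```

Calling as.character() unconditionally, as the next section does, therefore works regardless of how the column was created.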

Parsing

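  • The strsplit() function splits each element of a character vector at a user-defined delimiter.
  • In this example, we will split the IDs at the period (“.”). Because strsplit() treats the delimiter as a regular expression, in which a period matches any character, the period is escaped as "\\." in the code.
  • The function accepts only character input. If the ID column is a factor (the default for data.frame() before R 4.0.0), convert it first with the as.character() function; the conversion is harmless if the column is already character.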
ID_split <- strsplit(as.character(df$ID), split="\\.")
head(ID_split)
## [[1]]
## [1] "TCGA" "3X"   "AAV9" "01A"  "72R"  "A41I" "07"  
## 
## [[2]]
## [1] "TCGA" "3X"   "AAVA" "01A"  "11R"  "A41I" "07"  
## 
## [[3]]
## [1] "TCGA" "3X"   "AAVB" "01A"  "31R"  "A41I" "07"  
## 
## [[4]]
## [1] "TCGA" "3X"   "AAVC" "01A"  "21R"  "A41I" "07"  
## 
## [[5]]
## [1] "TCGA" "3X"   "AAVE" "01A"  "11R"  "A41I" "07"  
## 
## [[6]]
## [1] "TCGA" "4G"   "AAZO" "01A"  "12R"  "A41I" "07"
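
Because the split argument of strsplit() is interpreted as a regular expression by default, an unescaped period would match every character rather than a literal dot. The sketch below, using one ID from the output above and hypothetical variable names, compares the unescaped pattern with two equivalent ways of matching a literal period: escaping it, or passing fixed=TRUE.

```r
id <- "TCGA.3X.AAV9.01A.72R.A41I.07"

# Unescaped: "." is a regex wildcard, so every character acts as a delimiter
unescaped <- strsplit(id, split = ".")[[1]]

# Escaped: "\\." matches a literal period
escaped <- strsplit(id, split = "\\.")[[1]]

# fixed = TRUE: the pattern is taken literally, so no escaping is needed
literal <- strsplit(id, split = ".", fixed = TRUE)[[1]]

all(unescaped == "")          # only empty strings are left
escaped
identical(escaped, literal)
```

Either of the last two forms can be used; fixed=TRUE avoids the double backslash when the delimiter is a plain string.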

Retrieving parsed elements

  • strsplit() returns a list of character vectors, one per original element, each holding the pieces of that element after splitting.
  • The next task is to retrieve the specific elements of interest.
  • Here, we would like to retrieve the first 3 elements.
# Create a function to retrieve the 1st element
retrieve_first_element <- function(x) {
  x[1]
}

# Create a function to retrieve the 2nd element
retrieve_second_element <- function(x) {
  x[2]
}

# Create a function to retrieve the 3rd element
retrieve_third_element <- function(x) {
  x[3]
}

# Use the functions created to retrieve the 1st, 2nd, and 3rd elements
first_element <- sapply(ID_split, retrieve_first_element)

second_element <- sapply(ID_split, retrieve_second_element)

third_element <- sapply(ID_split, retrieve_third_element)

head(first_element); head(second_element); head(third_element)
## [1] "TCGA" "TCGA" "TCGA" "TCGA" "TCGA" "TCGA"
## [1] "3X" "3X" "3X" "3X" "3X" "4G"
## [1] "AAV9" "AAVA" "AAVB" "AAVC" "AAVE" "AAZO"
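
The three helper functions above differ only in the index they use. As an alternative sketch (not the approach used in this tutorial), the subsetting function "[" can be passed to sapply() directly, or the whole list can be bound into a matrix when every ID splits into the same number of pieces. The sample IDs are copied from the output above; the names third_element_checked, ID_matrix and first_three are hypothetical.

```r
ids <- c("TCGA.3X.AAV9.01A.72R.A41I.07",
         "TCGA.3X.AAVA.01A.11R.A41I.07",
         "TCGA.4G.AAZO.01A.12R.A41I.07")
ID_split <- strsplit(ids, split = "\\.")

# `[` is itself a function, so the index can be supplied as an extra argument
second_element <- sapply(ID_split, `[`, 2)

# vapply() additionally checks that each result is a single character string
third_element_checked <- vapply(ID_split, function(x) x[3], character(1))

# Bind the pieces into a matrix: one row per ID, one column per field
ID_matrix <- do.call(rbind, ID_split)
first_three <- ID_matrix[, 1:3]

second_element
first_three
```

The matrix form is convenient when several fields are needed at once, but it assumes that all IDs contain the same number of fields.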

Concatenating

  • Let’s use the parsed elements from the previous section for concatenation.
  • The paste() function may be used for concatenation.
  • We will use a dash (“-”) as the delimiter in this example, but any other string may be used. Use the sep argument to specify the delimiter.
combined <- paste(first_element, second_element, third_element, sep="-")

# Create a new column from the new IDs
df$ID2 <- combined
head(df)
##                             ID          ID2
## 2 TCGA.3X.AAV9.01A.72R.A41I.07 TCGA-3X-AAV9
## 3 TCGA.3X.AAVA.01A.11R.A41I.07 TCGA-3X-AAVA
## 4 TCGA.3X.AAVB.01A.31R.A41I.07 TCGA-3X-AAVB
## 5 TCGA.3X.AAVC.01A.21R.A41I.07 TCGA-3X-AAVC
## 6 TCGA.3X.AAVE.01A.11R.A41I.07 TCGA-3X-AAVE
## 7 TCGA.4G.AAZO.01A.12R.A41I.07 TCGA-4G-AAZO
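
The parsing and concatenation steps can also be combined. The sketch below is one possible shortcut, assuming the first three fields are always the ones wanted: it pastes the first three pieces of each split ID using the collapse argument, which joins the elements of a single vector, in contrast to sep, which joins corresponding elements across several vectors. The IDs are copied from the output above and the variable names are hypothetical.

```r
ids <- c("TCGA.3X.AAV9.01A.72R.A41I.07", "TCGA.4G.AAZO.01A.12R.A41I.07")
ID_split <- strsplit(ids, split = "\\.")

# For each ID, keep pieces 1 to 3 and join them with a dash
combined_one_step <- sapply(ID_split, function(x) paste(x[1:3], collapse = "-"))
combined_one_step
```
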
  • The default delimiter, i.e. when none is specified, is a single space (" ").
combined2 <- paste(first_element, second_element, third_element)
df$ID3 <- combined2
head(df)
##                             ID          ID2          ID3
## 2 TCGA.3X.AAV9.01A.72R.A41I.07 TCGA-3X-AAV9 TCGA 3X AAV9
## 3 TCGA.3X.AAVA.01A.11R.A41I.07 TCGA-3X-AAVA TCGA 3X AAVA
## 4 TCGA.3X.AAVB.01A.31R.A41I.07 TCGA-3X-AAVB TCGA 3X AAVB
## 5 TCGA.3X.AAVC.01A.21R.A41I.07 TCGA-3X-AAVC TCGA 3X AAVC
## 6 TCGA.3X.AAVE.01A.11R.A41I.07 TCGA-3X-AAVE TCGA 3X AAVE
## 7 TCGA.4G.AAZO.01A.12R.A41I.07 TCGA-4G-AAZO TCGA 4G AAZO
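
To concatenate without any delimiter, set sep to an empty string or use paste0(), which is shorthand for that. A minimal sketch, using values taken from the output above and hypothetical names for the results:

```r
first_element  <- c("TCGA", "TCGA")
second_element <- c("3X", "4G")
third_element  <- c("AAV9", "AAZO")

# An empty sep removes the default single space between the pieces
no_delim_sep <- paste(first_element, second_element, third_element, sep = "")

# paste0() is equivalent to paste(..., sep = "")
no_delim_paste0 <- paste0(first_element, second_element, third_element)

no_delim_sep
identical(no_delim_sep, no_delim_paste0)
```
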