Parsing and Concatenating
- Parsing splits a string into separate pieces, much like the Text to Columns feature in Microsoft Excel.
- Concatenating stitches individual elements back together into a single string.
- Let’s download the dataset that we will be using in this session.
# Download the archive if the data file is not already present
if(!file.exists("gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0/CHOL.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt")) {
download.file(url="http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/CHOL/20160128/gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz", destfile="gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz", method="curl")
}
# Untar file
untar(tarfile="gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz")
# Read data frame
df <- read.table("gdac.broadinstitute.org_CHOL.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0/CHOL.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt", header=TRUE, sep="\t", stringsAsFactors=FALSE)
# Extract the IDs, which are the column names of the data frame
df <- data.frame("ID"=names(df))
# Remove the first row which is not an ID
df <- df[-1, , drop=FALSE]
head(df)
## ID
## 2 TCGA.3X.AAV9.01A.72R.A41I.07
## 3 TCGA.3X.AAVA.01A.11R.A41I.07
## 4 TCGA.3X.AAVB.01A.31R.A41I.07
## 5 TCGA.3X.AAVC.01A.21R.A41I.07
## 6 TCGA.3X.AAVE.01A.11R.A41I.07
## 7 TCGA.4G.AAZO.01A.12R.A41I.07
- The drop=FALSE argument keeps the result as a data frame. Without it, subsetting rows of a data frame that has only one column returns a vector.
- Refer to Subsetting Data Frames for more info on subsetting data frames.
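The effect of drop=FALSE can be checked on a small example. This is a minimal sketch using a made-up one-column data frame (toy, as_vector, and as_frame are hypothetical names, not part of the dataset above).

```r
# Hypothetical one-column data frame standing in for df
toy <- data.frame(ID = c("header", "a.b", "c.d"), stringsAsFactors = FALSE)

# Default behaviour: with a single column, the result collapses to a vector
as_vector <- toy[-1, ]

# With drop=FALSE the result stays a one-column data frame
as_frame <- toy[-1, , drop = FALSE]

class(as_vector)
class(as_frame)
```

Keeping the data frame structure matters here because new columns are added to df with the $ operator later on.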
Parsing
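- The strsplit() function splits each element wherever a user-defined delimiter occurs.
- In this example, we will parse the elements using the period (“.”) as the delimiter. The split argument is interpreted as a regular expression, in which an unescaped period matches any character, so the period is written as "\\.".
- The function takes character input. In R versions before 4.0.0, data.frame() converts strings to factors by default, so df$ID may be a factor. The as.character() function converts it to character first and has no effect if the column is already character.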
ID_split <- strsplit(as.character(df$ID), split="\\.")
head(ID_split)
## [[1]]
## [1] "TCGA" "3X" "AAV9" "01A" "72R" "A41I" "07"
##
## [[2]]
## [1] "TCGA" "3X" "AAVA" "01A" "11R" "A41I" "07"
##
## [[3]]
## [1] "TCGA" "3X" "AAVB" "01A" "31R" "A41I" "07"
##
## [[4]]
## [1] "TCGA" "3X" "AAVC" "01A" "21R" "A41I" "07"
##
## [[5]]
## [1] "TCGA" "3X" "AAVE" "01A" "11R" "A41I" "07"
##
## [[6]]
## [1] "TCGA" "4G" "AAZO" "01A" "12R" "A41I" "07"
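To see why the period is escaped, it may help to compare three ways of writing the delimiter on a single shortened ID. This is a small sketch; the variable names are hypothetical, and fixed=TRUE is an alternative to escaping, not what the example above uses.

```r
# Shortened ID used only for illustration
id <- "TCGA.3X.AAV9"

# Unescaped: "." is a regular expression matching any character,
# so every character is treated as a delimiter and only empty strings remain
unescaped <- strsplit(id, split = ".")[[1]]

# Escaped: "\\." matches a literal period
escaped <- strsplit(id, split = "\\.")[[1]]

# fixed=TRUE treats the delimiter as plain text, so no escaping is needed
literal <- strsplit(id, split = ".", fixed = TRUE)[[1]]

unescaped
escaped
literal
```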
Retrieving parsed elements
- strsplit() returns a list of vectors. Each element of the list holds the now-separated pieces of the corresponding original element.
- The next task is to retrieve the specific elements of interest.
- Here, we would like to retrieve the first 3 elements.
# Create a function to retrieve the 1st element
retrieve_first_element <- function(x) {
x[1]
}
# Create a function to retrieve the 2nd element
retrieve_second_element <- function(x) {
x[2]
}
# Create a function to retrieve the 3rd element
retrieve_third_element <- function(x) {
x[3]
}
# Use the functions created to retrieve the 1st, 2nd, and 3rd elements
first_element <- sapply(ID_split, retrieve_first_element)
second_element <- sapply(ID_split, retrieve_second_element)
third_element <- sapply(ID_split, retrieve_third_element)
head(first_element); head(second_element); head(third_element)
## [1] "TCGA" "TCGA" "TCGA" "TCGA" "TCGA" "TCGA"
## [1] "3X" "3X" "3X" "3X" "3X" "4G"
## [1] "AAV9" "AAVA" "AAVB" "AAVC" "AAVE" "AAZO"
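The three helper functions differ only in the index they use, so they could be folded into one helper that takes the position as an argument. The sketch below is one possible variant, not a replacement for the approach above; retrieve_element and toy_split are hypothetical names, with toy_split standing in for ID_split.

```r
# Hypothetical stand-in for ID_split, built from two IDs shown above
toy_split <- list(
  c("TCGA", "3X", "AAV9", "01A", "72R", "A41I", "07"),
  c("TCGA", "4G", "AAZO", "01A", "12R", "A41I", "07")
)

# Hypothetical helper: return the i-th piece of one split ID
retrieve_element <- function(x, i) {
  x[i]
}

# Extra arguments given to sapply() are passed on to the helper
first_toy  <- sapply(toy_split, retrieve_element, i = 1)
second_toy <- sapply(toy_split, retrieve_element, i = 2)
third_toy  <- sapply(toy_split, retrieve_element, i = 3)

first_toy; second_toy; third_toy
```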
Concatenating
- Let’s use the parsed elements from the previous section for concatenation.
- The paste() function may be used for concatenation.
- We will use a dash (“-”) as the delimiter in this example, but any other string may be used. Use the sep argument to specify the delimiter.
combined <- paste(first_element, second_element, third_element, sep="-")
# Create a new column from the new IDs
df$ID2 <- combined
head(df)
## ID ID2
## 2 TCGA.3X.AAV9.01A.72R.A41I.07 TCGA-3X-AAV9
## 3 TCGA.3X.AAVA.01A.11R.A41I.07 TCGA-3X-AAVA
## 4 TCGA.3X.AAVB.01A.31R.A41I.07 TCGA-3X-AAVB
## 5 TCGA.3X.AAVC.01A.21R.A41I.07 TCGA-3X-AAVC
## 6 TCGA.3X.AAVE.01A.11R.A41I.07 TCGA-3X-AAVE
## 7 TCGA.4G.AAZO.01A.12R.A41I.07 TCGA-4G-AAZO
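The sep argument joins separate vectors element by element. When the pieces sit inside a single vector, as they do within each element of the strsplit() result, paste() also has a collapse argument that joins them into one string. The sketch below illustrates the difference under that assumption; ids, pieces, and short_ids are hypothetical names.

```r
# Two IDs taken from the output above
ids <- c("TCGA.3X.AAV9.01A.72R.A41I.07", "TCGA.4G.AAZO.01A.12R.A41I.07")
pieces <- strsplit(ids, split = "\\.")

# sep has nothing to separate when only one vector is supplied,
# so the three pieces come back unchanged
unchanged <- paste(pieces[[1]][1:3], sep = "-")

# collapse joins the elements of that one vector into a single string
short_ids <- sapply(pieces, function(x) paste(x[1:3], collapse = "-"))

unchanged
short_ids
```

This builds the shortened ID directly from each split vector, without creating first_element, second_element, and third_element as intermediate steps.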
- The default delimiter, i.e. when none is specified, is a single space (" ").
combined2 <- paste(first_element, second_element, third_element)
df$ID3 <- combined2
head(df)
## ID ID2 ID3
## 2 TCGA.3X.AAV9.01A.72R.A41I.07 TCGA-3X-AAV9 TCGA 3X AAV9
## 3 TCGA.3X.AAVA.01A.11R.A41I.07 TCGA-3X-AAVA TCGA 3X AAVA
## 4 TCGA.3X.AAVB.01A.31R.A41I.07 TCGA-3X-AAVB TCGA 3X AAVB
## 5 TCGA.3X.AAVC.01A.21R.A41I.07 TCGA-3X-AAVC TCGA 3X AAVC
## 6 TCGA.3X.AAVE.01A.11R.A41I.07 TCGA-3X-AAVE TCGA 3X AAVE
## 7 TCGA.4G.AAZO.01A.12R.A41I.07 TCGA-4G-AAZO TCGA 4G AAZO
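If no delimiter is wanted at all, sep can be set to an empty string, or the paste0() function can be used as a shorthand. A brief sketch with made-up vectors (a and b are hypothetical names):

```r
# Small vectors used only for illustration
a <- c("TCGA", "TCGA")
b <- c("3X", "4G")

with_space <- paste(a, b)            # default sep is a single space
no_delim   <- paste(a, b, sep = "")  # explicit empty delimiter
shorthand  <- paste0(a, b)           # same result as sep = ""

with_space
no_delim
shorthand
```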