This page documents the data preparation, including splitting the data into training and test sets and ranking the miRNAs for feature selection.
First, load the miRNA .RData
file and the clinical .RData
file.
load("../Data/kirc_mirna.RData")
load("../Data/kirc_clin.RData")
Add a column called “stage” to the clinical data frame that is just a copy of the tumor stage column. Recode that column so that stages I and II are labeled “early” and stages III and IV are labeled “late.”"
kirc_clin$stage <- kirc_clin$ajcc_pathologic_tumor_stage
kirc_clin[kirc_clin$stage == "Stage I" | kirc_clin$stage == "Stage II", "stage"] <- "early"
kirc_clin[kirc_clin$stage == "Stage III" | kirc_clin$stage == "Stage IV", "stage"] <- "late"
Now we’ll merge the data so that we have one data frame with all of the clinical and expression data together.
all_data <- merge(kirc_clin[, c("bcr_patient_barcode", "stage")], kirc_mirna)
# rename id column
names(all_data)[1] <- "id"
First, create a subset of the data that contains just the rows that are in the early stage.
early_data <- all_data[all_data$stage == "early", ]
Create vector of integers from 1 to the number of rows in the data frame and randomly shuffle those integers.
set.seed(10); r_early_rows <- sample(1:nrow(early_data), nrow(early_data))
Shuffle the early data frame.
r_early_data <- early_data[r_early_rows, ]
Split the early data into a training set (80%) and test set (20%).
early_train <- r_early_data[1:round(.8*nrow(r_early_data)), ]
early_test <- r_early_data[(round(.8*nrow(r_early_data)) + 1): nrow(r_early_data), ]
Now do the same thing for the remaining subset of the data, corresponding to the late stage.
late_data <- all_data[all_data$stage == "late", ]
set.seed(23); r_late_rows <- sample(1:nrow(late_data), nrow(late_data))
r_late_data <- late_data[r_late_rows, ]
late_train <- r_late_data[1:round(.8*nrow(r_late_data)), ]
late_test <- r_late_data[(round(.8*nrow(r_late_data)) + 1):nrow(r_late_data), ]
Finally, combine the training sets with each other and the test sets with each other.
train_data <- rbind(early_train, late_train)
test_data <- rbind(early_test, late_test)
We can safely remove any miRNAs in the training set that have zero variance.
no_variance <- sapply(train_data[, -c(1:2)], function(column){
v <- var(column, na.rm = TRUE)
if(v == 0){TRUE}else{FALSE}
})
train_data <- train_data[, c(TRUE, TRUE, !no_variance)]
One way to select miRNAs is to use a supervised method. So the outcome data (tumor stage) would be used to select the predictors. For each miRNA, we’ll look at the expression value for each patient. For example, let’s look at the first miRNA.
head(train_data[, 1:3])
## id stage hsa.let.7a.1
## 262 TCGA-BP-4169 early 4131.461
## 169 TCGA-B0-5120 early 3645.426
## 221 TCGA-B4-5844 early 9432.902
## 353 TCGA-BP-5004 early 3910.602
## 32 TCGA-A3-3362 early 7474.009
## 84 TCGA-AK-3458 early 5323.429
The miRNA labeled hsa.let.7a.1
can be split into two categories, depending on the tumor stage of each patient in the clinical data frame. Here’s a box plot showing the different distributions of miRNA expression level for hsa.let.7a.1
in early and late stage tumors in the training set.
boxplot(hsa.let.7a.1 ~ stage, data = train_data)
And we can get the p-value for the hypothesis that there is no difference in the mean of these two distributions.
early_t <- train_data[, 3][train_data$stage == "early"]
late_t <- train_data[, 3][train_data$stage == "late"]
t.test(early_t, late_t)$p.value
## [1] 0.08847883
It appears we shouldn’t reject the null hypothesis, which makes sense based on the boxplot.
Now we run the t-test for all of the miRNAs.
# I use an apply() function to loop through the columns of the mir_tumor
# data frame
pvalue <- apply(train_data[, -c(1:2)], 2, function(column, stage){
early <- column[stage == "early"]
late <- column[stage == "late"]
t.test(early, late)$p.value
}, stage = train_data$stage)
So now we can rank the mir_tumor
table for selecting the number of miRNAs that is ideal for each algorithm.
miRNA_pvalue <- data.frame(miRNA_id = names(train_data)[-c(1:2)], pvalue, stringsAsFactors = FALSE)
miRNA_pvalue <- miRNA_pvalue[order(miRNA_pvalue$pvalue),]
head(miRNA_pvalue)
## miRNA_id pvalue
## hsa.mir.139 hsa.mir.139 1.385515e-09
## hsa.mir.28 hsa.mir.28 1.770146e-08
## hsa.mir.21 hsa.mir.21 4.796884e-08
## hsa.mir.142 hsa.mir.142 1.922263e-07
## hsa.let.7i hsa.let.7i 2.474180e-07
## hsa.mir.10a hsa.mir.10a 4.630312e-07
We can also look at the distribution of variance and see if there are features that have very low variance, which we can consider removing (if we determine a cutoff). Here I calculate the coefficient of variation, which is the standard deviation divided by the mean.
train_coef_var <- sapply(train_data[, -c(1:2)], function(column){
sd(column, na.rm = TRUE)/mean(column, na.rm = TRUE)
})
train_coef_var <- train_coef_var[order(train_coef_var)]
summary(train_coef_var)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3066 0.9647 2.4110 4.9410 6.3800 20.5900
boxplot(train_coef_var)
The training set, test set, and ranked miRNAs will be kept a file called ‘split_data.RData’. The removeCorrFeatures()
function will be kept in a file called ‘functions.RData’.
save(train_data, test_data, miRNA_pvalue,
file = "../Data/split_data.RData")
save(removeCorrFeatures, file = "../Data/functions.RData")