Machine Learning Project: Data Preparation

This page documents the data preparation, including splitting the data into training and test sets and ranking the miRNAs for feature selection.

Load, Code Tumor Stage, and Merge

First, load the miRNA .RData file and the clinical .RData file.

load("../Data/kirc_mirna.RData")
load("../Data/kirc_clin.RData")

Add a column called “stage” to the clinical data frame that is just a copy of the tumor stage column. Recode that column so that stages I and II are labeled “early” and stages III and IV are labeled “late.”"

kirc_clin$stage <- kirc_clin$ajcc_pathologic_tumor_stage
kirc_clin[kirc_clin$stage == "Stage I" | kirc_clin$stage == "Stage II", "stage"] <- "early"
kirc_clin[kirc_clin$stage == "Stage III" | kirc_clin$stage == "Stage IV", "stage"] <- "late"

Now we’ll merge the data so that we have one data frame with all of the clinical and expression data together.

all_data <- merge(kirc_clin[, c("bcr_patient_barcode", "stage")], kirc_mirna)

# rename id column
names(all_data)[1] <- "id"

Split into Training and Test Sets

First, create a subset of the data that contains just the rows that are in the early stage.

early_data <- all_data[all_data$stage == "early", ]

Create vector of integers from 1 to the number of rows in the data frame and randomly shuffle those integers.

set.seed(10); r_early_rows <- sample(1:nrow(early_data), nrow(early_data))

Shuffle the early data frame.

r_early_data <- early_data[r_early_rows, ]

Split the early data into a training set (80%) and test set (20%).

early_train <- r_early_data[1:round(.8*nrow(r_early_data)), ]
early_test <- r_early_data[(round(.8*nrow(r_early_data)) + 1): nrow(r_early_data), ]

Now do the same thing for the remaining subset of the data, corresponding to the late stage.

late_data <- all_data[all_data$stage == "late", ]
set.seed(23); r_late_rows <- sample(1:nrow(late_data), nrow(late_data))
r_late_data <- late_data[r_late_rows, ]
late_train <- r_late_data[1:round(.8*nrow(r_late_data)), ]
late_test <- r_late_data[(round(.8*nrow(r_late_data)) + 1):nrow(r_late_data), ]

Finally, combine the training sets with each other and the test sets with each other.

train_data <- rbind(early_train, late_train)
test_data <- rbind(early_test, late_test)

Remove Zero Variance Features

We can safely remove any miRNAs in the training set that have zero variance.

no_variance <- sapply(train_data[, -c(1:2)], function(column){
  v <- var(column, na.rm = TRUE)
  if(v == 0){TRUE}else{FALSE}
  })
train_data <- train_data[, c(TRUE, TRUE, !no_variance)]

Rank miRNAs

One way to select miRNAs is to use a supervised method. So the outcome data (tumor stage) would be used to select the predictors. For each miRNA, we’ll look at the expression value for each patient. For example, let’s look at the first miRNA.

head(train_data[, 1:3])

##               id stage hsa.let.7a.1
## 262 TCGA-BP-4169 early     4131.461
## 169 TCGA-B0-5120 early     3645.426
## 221 TCGA-B4-5844 early     9432.902
## 353 TCGA-BP-5004 early     3910.602
## 32  TCGA-A3-3362 early     7474.009
## 84  TCGA-AK-3458 early     5323.429

The miRNA labeled hsa.let.7a.1 can be split into two categories, depending on the tumor stage of each patient in the clinical data frame. Here’s a box plot showing the different distributions of miRNA expression level for hsa.let.7a.1 in early and late stage tumors in the training set.

boxplot(hsa.let.7a.1 ~ stage, data = train_data)

And we can get the p-value for the hypothesis that there is no difference in the mean of these two distributions.

early_t <- train_data[, 3][train_data$stage == "early"]
late_t <- train_data[, 3][train_data$stage == "late"]
t.test(early_t, late_t)$p.value

## [1] 0.08847883

It appears we shouldn’t reject the null hypothesis, which makes sense based on the boxplot.

Now we run the t-test for all of the miRNAs.

# I use an apply() function to loop through the columns of the mir_tumor
# data frame
pvalue <- apply(train_data[, -c(1:2)], 2, function(column, stage){
  early <- column[stage == "early"]
  late <- column[stage == "late"]
  t.test(early, late)$p.value
}, stage = train_data$stage)

So now we can rank the mir_tumor table for selecting the number of miRNAs that is ideal for each algorithm.

miRNA_pvalue <- data.frame(miRNA_id = names(train_data)[-c(1:2)], pvalue, stringsAsFactors = FALSE)
miRNA_pvalue <- miRNA_pvalue[order(miRNA_pvalue$pvalue),]
head(miRNA_pvalue)

##                miRNA_id       pvalue
## hsa.mir.139 hsa.mir.139 1.385515e-09
## hsa.mir.28   hsa.mir.28 1.770146e-08
## hsa.mir.21   hsa.mir.21 4.796884e-08
## hsa.mir.142 hsa.mir.142 1.922263e-07
## hsa.let.7i   hsa.let.7i 2.474180e-07
## hsa.mir.10a hsa.mir.10a 4.630312e-07

Low Variance

We can also look at the distribution of variance and see if there are features that have very low variance, which we can consider removing (if we determine a cutoff). Here I calculate the coefficient of variation, which is the standard deviation divided by the mean.

train_coef_var <- sapply(train_data[, -c(1:2)], function(column){
  sd(column, na.rm = TRUE)/mean(column, na.rm = TRUE)
  })
train_coef_var <- train_coef_var[order(train_coef_var)]
summary(train_coef_var)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3066  0.9647  2.4110  4.9410  6.3800 20.5900

boxplot(train_coef_var)

Remove Correlated Features

We can also remove highly correlated miRNAs. This would be an unsupervised method.

Here I create a function that takes a data frame of normalized features and calculates the correlation matrix for those features. It then finds the two highest correlated features. If that correlation is above a threshold then one of the features is dropped (the one with the highest mean correlation with all of the other features in the matrix). After the feature is removed, a correlation matrix is calculated for the remaining features and the process is repeated until the highest correlation is below the threshold.

removeCorrFeatures <- function(data, cutoff){ 
  # get correlation matrix
  cor_m <- cor(data, use = "pairwise.complete.obs")
  # replace diagonal of 1s with diagonal of 0s so that the max correlation isn't
  # a feature's correlation with itself
  cor_m[cor_m == 1] = 0
  # find the column with the maximum correlation
  col.max <- which.max(apply(cor_m, 2, max))
  # find the row of the max correlation in that column
  row.max <- which.max(cor_m[, col.max])
  # get the max value
  max <- cor_m[names(row.max), names(col.max)]
  # loop while the max correlation is above the threshold
  while(max > cutoff){
    # get the column for the first in the pair of highly correlated miRNAs
    mirna1 <- cor_m[, names(col.max)]
    # remove the pair from the values in that column
    mirna1 <- mirna1[!(names(mirna1) %in% c(names(col.max), names(row.max)))]
    # get the mean of correlations with all other features
    mean1 <- mean(mirna1)
    # repeat for the second in the pair
    mirna2 <- cor_m[, names(row.max)]
    mirna2 <- mirna2[!(names(mirna2) %in% c(names(col.max), names(row.max)))]
    mean2 <- mean(mirna2)
    # if the mean of the first in the pair is greater, remove that one--otherwise
    # remove the second
    if(mean1 > mean2)
    {
      data <- data[names(data) != names(col.max)]
    }else{
      data <- data[names(data) != names(row.max)]
    }
    # restart the process until you get the max value
    cor_m <- cor(data, use = "pairwise.complete.obs")
    cor_m[cor_m == 1] = 0
    col.max <- which.max(apply(cor_m, 2, max))
    row.max <- which.max(cor_m[, col.max])
    max <- cor_m[names(row.max), names(col.max)]
  }
  # return the trimmed data frame
  data
}

This function can be used to trim the data down to 713 features if the cutoff is set to .75 (it takes about 3 minutes to run)

train_trimmed <- removeCorrFeatures(train_data[, -c(1:2)], cutoff = .75)

Save Data and Function

The training set, test set, and ranked miRNAs will be kept a file called ‘split_data.RData’. The removeCorrFeatures() function will be kept in a file called ‘functions.RData’.

save(train_data, test_data, miRNA_pvalue, 
     file = "../Data/split_data.RData")
save(removeCorrFeatures, file = "../Data/functions.RData")