Training the Models

This page documents the model training process. For documentation on how the data was prepared, see this page.

Preprocessing

First, we library all of the necessary packages and load the training and test sets.

library(kernlab)
library(caret)
library(pROC)
# library the genefilter package (install from bioconductor first if it's not 
# installed already)
if(!require(genefilter)){
  source("http://bioconductor.org/biocLite.R")
  biocLite("genefilter")
  library(genefilter)
}
load("../Data/split_data.RData")
ls()

## [1] "miRNA_pvalue" "test_data"    "train_data"

Then we make sure that the stage column in those data frames are factors.

train_data$stage <- as.factor(train_data$stage)
test_data$stage <- as.factor(test_data$stage)

We’ll do two kinds of preprocessing:

Filter based on the intensity of the miRNA expression and the coefficient of variation, and
Filter based on the p-value from the t-test in the data preparation stage and then transformation using principal components.

Preprocessing: Method 1

Here we create a function that filters out miRNAs that do not have at least 20% of the sample expression value at 100 or greater. Then it filters out miRNAs that do not have coefficient of variation (sd/mean) between 0.7 and 10. (https://www.biostars.org/p/86981/)

exprFilterFun <- filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))

Now we filter the training and test data sets.

# make a matrix from the train_data data frame
train_matrix <- train_data[, -c(1:2)]
# transpose the matrix so that rows are miRNAs and columns are samples
train_matrix_t <- t(train_matrix)
# create a logical vector that indicates TRUE for features that meet the 
# filter requirments
expr_filter <- genefilter(train_matrix_t, exprFilterFun)
# see how many features are left
sum(expr_filter)

## [1] 45

# subset the training data down to the features that meet the requirements
train_data_filtered <- train_data[, c(TRUE, TRUE, expr_filter)]
# subset test data
test_data_filtered <- test_data[, names(train_data_filtered)]

Preprocessing: Method 2

For this method, we narrow the features down by choosing a p-value threshold, and then using principal component analysis to transform the data into linear combinations of the miRNA expression data.

Here we choose a threshold of 0.1 for the p-values.

sig_rna <- miRNA_pvalue[miRNA_pvalue$pvalue <= .1, "miRNA_id"]
train_data_sig <- data.frame(train_data[, 1:2], 
                             train_data[, sig_rna])
test_data_sig <- data.frame(test_data[, 1:2],
                            test_data[, sig_rna])

Then we apply PCA to the remaining data set. We’ll choose a threshold of 95% (choose components that account for 95% of the variation in the data).

trans <- preProcess(train_data_sig[, -c(1:2)],
                   method = c("center", "scale", "pca"),
                   thresh = 0.95)
trans

## 
## Call:
## preProcess.default(x = train_data_sig[, -c(1:2)], method =
##  c("center", "scale", "pca"), thresh = 0.95)
## 
## Created from 424 samples and 221 variables
## Pre-processing: centered, scaled, principal component signal extraction 
## 
## PCA needed 117 components to capture 95 percent of the variance

117 components will be used

Splitting

We’ll be using the training samples, which totals 424 observations. To train the parameters for each model, we’ll use two data splitting methods. 5- fold cross validation and 10 fold cross validation.

5-fold Cross Validation

Here we set the training control for 5-fold cross validation.

ctrl5f <- trainControl(method = "cv", number = 5,
                      summaryFunction = twoClassSummary,
                      classProbs = TRUE)

10-fold Cross Validation

And here we set the training control for 10-fold cross validation.

ctrl10f <- trainControl(method = "cv", number = 10,
                      summaryFunction = twoClassSummary,
                      classProbs = TRUE)

Support Vector Machine (SVM)

The first algorithm is the support vector machine.

SVM/PreProc1/5-foldCV

The first model we’ll train is the support vector machine, using the first preproccessing method and 5-fold cross validation.

First we make a tuning grid that covers a range of sigma and cost values (Kuhn and Johnson 2013)

svmrGrid <- expand.grid(sigma = c(.00007, .00009, .0001, .0002),
                        C = 2^(-3:8))

Now we fit the model and report the run time. ROC is used to select the optimal model.

then <- Sys.time()
set.seed(477)
svm_pre1_5cv <- train(stage ~ .,
             data = train_data_filtered[, -1],
             method = "svmRadial",
             tuneGrid = svmrGrid,
             metric = "ROC",
             trControl = ctrl5f)
Sys.time() - then

## Time difference of 27.81267 secs

We predict using the test set and evaluate the performance.

svm_pre1_5cv_pred <- predict(svm_pre1_5cv, 
                            newdata = test_data_filtered[, - c(1:2)])
svm_pre1_5cv_prob <- predict(svm_pre1_5cv, 
                            newdata = test_data_filtered[, - c(1:2)],
                            type = "prob")

svm_pre1_5cv.df <- data.frame(stage = test_data_filtered$stage, 
                             svm_pre1_5cv_pred, 
                             early_prob = svm_pre1_5cv_prob$early,
                             late_prob = svm_pre1_5cv_prob$late)

head(svm_pre1_5cv.df)

##   stage svm_pre1_5cv_pred early_prob late_prob
## 1 early             early  0.7546434 0.2453566
## 2 early              late  0.3258466 0.6741534
## 3 early             early  0.8322489 0.1677511
## 4 early             early  0.7042966 0.2957034
## 5 early             early  0.5331362 0.4668638
## 6 early             early  0.7421521 0.2578479

confusionMatrix(data = svm_pre1_5cv.df$svm_pre1_5cv_pred,
                reference = test_data_filtered$stage,
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    57   21
##      late      8   20
##                                           
##                Accuracy : 0.7264          
##                  95% CI : (0.6313, 0.8085)
##     No Information Rate : 0.6132          
##     P-Value [Acc > NIR] : 0.009744        
##                                           
##                   Kappa : 0.3874          
##  Mcnemar's Test P-Value : 0.025858        
##                                           
##             Sensitivity : 0.4878          
##             Specificity : 0.8769          
##          Pos Pred Value : 0.7143          
##          Neg Pred Value : 0.7308          
##              Prevalence : 0.3868          
##          Detection Rate : 0.1887          
##    Detection Prevalence : 0.2642          
##       Balanced Accuracy : 0.6824          
##                                           
##        'Positive' Class : late            
##

svm_pre1_5cv_rocCurve <- roc(response = test_data_filtered$stage,
                predictor = svm_pre1_5cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_filtered$stage)))

auc(svm_pre1_5cv_rocCurve)

## Area under the curve: 0.7996

ci.roc(svm_pre1_5cv_rocCurve)

## 95% CI: 0.7136-0.8857 (DeLong)

plot(svm_pre1_5cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_filtered$stage, predictor = svm_pre1_5cv.df$early_prob,     levels = rev(levels(test_data_filtered$stage)))
## 
## Data: svm_pre1_5cv.df$early_prob in 41 controls (test_data_filtered$stage late) < 65 cases (test_data_filtered$stage early).
## Area under the curve: 0.7996

save(svm_pre1_5cv, svm_pre1_5cv_pred, svm_pre1_5cv_prob,
     file = "../Results/svm_pre1_5cv.RData")

SVM/PreProc2/5-foldCV

SVM, preprocessing method 2, and 5 fold cross validation. (The default threshold for “pca” in the preProc argument is 95%)

then <- Sys.time()
set.seed(477)
svm_pre2_5cv <- train(stage ~ .,
             data = train_data_sig[, -1],
             method = "svmRadial",
             tuneGrid = svmrGrid,
             preProc = c("center", "scale", "pca"),
             metric = "ROC",
             trControl = ctrl5f)

## maximum number of iterations reached 3.491983e-05 3.484089e-05maximum number of iterations reached 2.17924e-05 2.174739e-05

Sys.time() - then

## Time difference of 1.239041 mins

svm_pre2_5cv_pred <- predict(svm_pre2_5cv, 
                            newdata = test_data_sig[, - c(1:2)])
svm_pre2_5cv_prob <- predict(svm_pre2_5cv, 
                            newdata = test_data_sig[, - c(1:2)],
                            type = "prob")

svm_pre2_5cv.df <- data.frame(stage = test_data_sig$stage, 
                             svm_pre2_5cv_pred, 
                             early_prob = svm_pre2_5cv_prob$early,
                             late_prob = svm_pre2_5cv_prob$late)

head(svm_pre2_5cv.df)

##   stage svm_pre2_5cv_pred early_prob late_prob
## 1 early             early  0.7733647 0.2266353
## 2 early             early  0.5800195 0.4199805
## 3 early             early  0.7451924 0.2548076
## 4 early              late  0.4315871 0.5684129
## 5 early             early  0.6153190 0.3846810
## 6 early             early  0.7932442 0.2067558

confusionMatrix(data = svm_pre2_5cv.df$svm_pre2_5cv_pred,
                reference = test_data_sig$stage,
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    58   25
##      late      7   16
##                                           
##                Accuracy : 0.6981          
##                  95% CI : (0.6013, 0.7835)
##     No Information Rate : 0.6132          
##     P-Value [Acc > NIR] : 0.043386        
##                                           
##                   Kappa : 0.3075          
##  Mcnemar's Test P-Value : 0.002654        
##                                           
##             Sensitivity : 0.3902          
##             Specificity : 0.8923          
##          Pos Pred Value : 0.6957          
##          Neg Pred Value : 0.6988          
##              Prevalence : 0.3868          
##          Detection Rate : 0.1509          
##    Detection Prevalence : 0.2170          
##       Balanced Accuracy : 0.6413          
##                                           
##        'Positive' Class : late            
##

svm_pre2_5cv_rocCurve <- roc(response = test_data_sig$stage,
                predictor = svm_pre2_5cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_sig$stage)))

auc(svm_pre2_5cv_rocCurve)

## Area under the curve: 0.6604

ci.roc(svm_pre2_5cv_rocCurve)

## 95% CI: 0.5478-0.773 (DeLong)

plot(svm_pre2_5cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_sig$stage, predictor = svm_pre2_5cv.df$early_prob,     levels = rev(levels(test_data_sig$stage)))
## 
## Data: svm_pre2_5cv.df$early_prob in 41 controls (test_data_sig$stage late) < 65 cases (test_data_sig$stage early).
## Area under the curve: 0.6604

save(svm_pre2_5cv, svm_pre2_5cv_pred, svm_pre2_5cv_prob,
     file = "../Results/svm_pre2_5cv.RData")

SVM/PreProc1/10-foldCV

SVM, preproccessing method 1, and 10 fold cross validation.

then <- Sys.time()
set.seed(477)
svm_pre1_10cv <- train(stage ~ .,
             data = train_data_filtered[, -1],
             method = "svmRadial",
             tuneGrid = svmrGrid,
             metric = "ROC",
             trControl = ctrl10f)
Sys.time() - then

## Time difference of 54.44594 secs

We predict using the test set and evaluate the performance.

svm_pre1_10cv_pred <- predict(svm_pre1_10cv, 
                            newdata = test_data_filtered[, - c(1:2)])
svm_pre1_10cv_prob <- predict(svm_pre1_10cv, 
                            newdata = test_data_filtered[, - c(1:2)],
                            type = "prob")

svm_pre1_10cv.df <- data.frame(stage = test_data_filtered[, "stage"], 
                             svm_pre1_10cv_pred, 
                             early_prob = svm_pre1_10cv_prob$early,
                             late_prob = svm_pre1_10cv_prob$late)

head(svm_pre1_10cv.df)

##   stage svm_pre1_10cv_pred early_prob late_prob
## 1 early              early  0.8349623 0.1650377
## 2 early               late  0.4175511 0.5824489
## 3 early              early  0.8705517 0.1294483
## 4 early              early  0.7254903 0.2745097
## 5 early              early  0.6109064 0.3890936
## 6 early              early  0.7569519 0.2430481

confusionMatrix(data = svm_pre1_10cv.df$svm_pre1_10cv_pred,
                reference = test_data_filtered[, "stage"],
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    57   25
##      late      8   16
##                                           
##                Accuracy : 0.6887          
##                  95% CI : (0.5914, 0.7751)
##     No Information Rate : 0.6132          
##     P-Value [Acc > NIR] : 0.065898        
##                                           
##                   Kappa : 0.2893          
##  Mcnemar's Test P-Value : 0.005349        
##                                           
##             Sensitivity : 0.3902          
##             Specificity : 0.8769          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.6951          
##              Prevalence : 0.3868          
##          Detection Rate : 0.1509          
##    Detection Prevalence : 0.2264          
##       Balanced Accuracy : 0.6336          
##                                           
##        'Positive' Class : late            
##

svm_pre1_10cv_rocCurve <- roc(response = test_data_filtered[, "stage"],
                predictor = svm_pre1_10cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_filtered[, "stage"])))

auc(svm_pre1_10cv_rocCurve)

## Area under the curve: 0.7644

ci.roc(svm_pre1_10cv_rocCurve)

## 95% CI: 0.6682-0.8605 (DeLong)

plot(svm_pre1_10cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_filtered[, "stage"], predictor = svm_pre1_10cv.df$early_prob,     levels = rev(levels(test_data_filtered[, "stage"])))
## 
## Data: svm_pre1_10cv.df$early_prob in 41 controls (test_data_filtered[, "stage"] late) < 65 cases (test_data_filtered[, "stage"] early).
## Area under the curve: 0.7644

save(svm_pre1_10cv, svm_pre1_10cv_pred, svm_pre1_10cv_prob,
     file = "../Results/svm_pre1_10cv.RData")

SVM/PreProc2/10-foldCV

SVM, preprocessing method 2, and 10-fold cross validation.

then <- Sys.time()
set.seed(477)
svm_pre2_10cv <- train(stage ~ .,
             data = train_data_sig[, -1],
             method = "svmRadial",
             tuneGrid = svmrGrid,
             preProc = c("center", "scale", "pca"),
             metric = "ROC",
             trControl = ctrl10f)
Sys.time() - then

## Time difference of 2.62259 mins

svm_pre2_10cv_pred <- predict(svm_pre2_10cv, 
                            newdata = test_data_sig[, - c(1:2)])
svm_pre2_10cv_prob <- predict(svm_pre2_10cv, 
                            newdata = test_data_sig[, - c(1:2)],
                            type = "prob")

svm_pre2_10cv.df <- data.frame(stage = test_data_sig[, "stage"], 
                             svm_pre2_10cv_pred, 
                             early_prob = svm_pre2_10cv_prob$early,
                             late_prob = svm_pre2_10cv_prob$late)

head(svm_pre2_10cv.df)

##   stage svm_pre2_10cv_pred early_prob late_prob
## 1 early              early  0.8124083 0.1875917
## 2 early               late  0.4760781 0.5239219
## 3 early              early  0.7788731 0.2211269
## 4 early               late  0.3690614 0.6309386
## 5 early              early  0.5924416 0.4075584
## 6 early              early  0.7155900 0.2844100

confusionMatrix(data = svm_pre2_10cv.df$svm_pre2_10cv_pred,
                reference = test_data_sig[, "stage"],
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    55   24
##      late     10   17
##                                           
##                Accuracy : 0.6792          
##                  95% CI : (0.5816, 0.7666)
##     No Information Rate : 0.6132          
##     P-Value [Acc > NIR] : 0.09639         
##                                           
##                   Kappa : 0.2783          
##  Mcnemar's Test P-Value : 0.02578         
##                                           
##             Sensitivity : 0.4146          
##             Specificity : 0.8462          
##          Pos Pred Value : 0.6296          
##          Neg Pred Value : 0.6962          
##              Prevalence : 0.3868          
##          Detection Rate : 0.1604          
##    Detection Prevalence : 0.2547          
##       Balanced Accuracy : 0.6304          
##                                           
##        'Positive' Class : late            
##

svm_pre2_10cv_rocCurve <- roc(response = test_data_sig[, "stage"],
                predictor = svm_pre2_10cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_sig[, "stage"])))

auc(svm_pre2_10cv_rocCurve)

## Area under the curve: 0.7081

ci.roc(svm_pre2_10cv_rocCurve)

## 95% CI: 0.6066-0.8096 (DeLong)

plot(svm_pre2_10cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_sig[, "stage"], predictor = svm_pre2_10cv.df$early_prob,     levels = rev(levels(test_data_sig[, "stage"])))
## 
## Data: svm_pre2_10cv.df$early_prob in 41 controls (test_data_sig[, "stage"] late) < 65 cases (test_data_sig[, "stage"] early).
## Area under the curve: 0.7081

save(svm_pre2_10cv, svm_pre2_10cv_pred, svm_pre2_10cv_prob,
     file = "../Results/svm_pre2_10cv.RData")

Random Forest (RF)

The second algorithm is random forest.

RF/PreProc1/5-foldCV

Random forest, preprocessing method 1, 5 fold cross validation.

For random forest, we tune the mtry parameter using 5 values evenly spaced from 2 to p, where p = # of predictors. We also use an ensemble of 1000 trees (pg. 387, Kuhn and Johnson 2013)

p <- ncol(train_data_filtered) - 2
spaced.values <- round(seq(2, p, length.out = 7))
mtryValues <- spaced.values[2:6]

then <- Sys.time()
set.seed(477)
rf_pre1_5cv <- train(x = train_data_filtered[, -c(1:2)], 
               y = train_data_filtered$stage,
               method = "rf",
               ntree = 1000,
               tuneGrid = data.frame(mtry = mtryValues),
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl5f)
Sys.time() - then

## Time difference of 38.28141 secs

rf_pre1_5cv_pred <- predict(rf_pre1_5cv, 
                            newdata = test_data_filtered[, - c(1:2)])
rf_pre1_5cv_prob <- predict(rf_pre1_5cv, 
                            newdata = test_data_filtered[, - c(1:2)],
                            type = "prob")

rf_pre1_5cv.df <- data.frame(stage = test_data_filtered$stage, 
                             rf_pre1_5cv_pred, 
                             early_prob = rf_pre1_5cv_prob$early,
                             late_prob = rf_pre1_5cv_prob$late)

head(rf_pre1_5cv.df)

##   stage rf_pre1_5cv_pred early_prob late_prob
## 1 early            early      0.671     0.329
## 2 early             late      0.252     0.748
## 3 early            early      0.926     0.074
## 4 early            early      0.673     0.327
## 5 early            early      0.635     0.365
## 6 early            early      0.862     0.138

confusionMatrix(data = rf_pre1_5cv.df$rf_pre1_5cv_pred,
                reference = test_data_filtered$stage,
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    58   21
##      late      7   20
##                                           
##                Accuracy : 0.7358          
##                  95% CI : (0.6413, 0.8168)
##     No Information Rate : 0.6132          
##     P-Value [Acc > NIR] : 0.005444        
##                                           
##                   Kappa : 0.4057          
##  Mcnemar's Test P-Value : 0.014019        
##                                           
##             Sensitivity : 0.4878          
##             Specificity : 0.8923          
##          Pos Pred Value : 0.7407          
##          Neg Pred Value : 0.7342          
##              Prevalence : 0.3868          
##          Detection Rate : 0.1887          
##    Detection Prevalence : 0.2547          
##       Balanced Accuracy : 0.6901          
##                                           
##        'Positive' Class : late            
##

rf_pre1_5cv_rocCurve <- roc(response = test_data_filtered$stage,
                predictor = rf_pre1_5cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_filtered$stage)))

auc(rf_pre1_5cv_rocCurve)

## Area under the curve: 0.831

ci.roc(rf_pre1_5cv_rocCurve)

## 95% CI: 0.7499-0.912 (DeLong)

plot(rf_pre1_5cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_filtered$stage, predictor = rf_pre1_5cv.df$early_prob,     levels = rev(levels(test_data_filtered$stage)))
## 
## Data: rf_pre1_5cv.df$early_prob in 41 controls (test_data_filtered$stage late) < 65 cases (test_data_filtered$stage early).
## Area under the curve: 0.831

save(rf_pre1_5cv, rf_pre1_5cv_pred, rf_pre1_5cv_prob,
     file = "../Results/rf_pre1_5cv.RData")

RF/PreProc2/5-foldCV

Random forest, preprocessing method 2, 5 fold cross validation.

then <- Sys.time()
set.seed(477)
rf_pre2_5cv <- train(x = train_data_sig[, -c(1:2)], 
               y = train_data_sig$stage,
               method = "rf",
               ntree = 1000,
               tuneGrid = data.frame(mtry = mtryValues),
               preProc = c("center", "scale", "pca"),
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl5f)
Sys.time() - then

## Time difference of 1.135894 mins

rf_pre2_5cv_pred <- predict(rf_pre2_5cv, 
                            newdata = test_data_sig[, - c(1:2)])
rf_pre2_5cv_prob <- predict(rf_pre2_5cv, 
                            newdata = test_data_sig[, - c(1:2)],
                            type = "prob")

rf_pre2_5cv.df <- data.frame(stage = test_data_sig$stage, 
                             rf_pre2_5cv_pred, 
                             early_prob = rf_pre2_5cv_prob$early,
                             late_prob = rf_pre2_5cv_prob$late)

head(rf_pre2_5cv.df)

##   stage rf_pre2_5cv_pred early_prob late_prob
## 1 early            early      0.757     0.243
## 2 early             late      0.493     0.507
## 3 early            early      0.868     0.132
## 4 early            early      0.542     0.458
## 5 early            early      0.563     0.437
## 6 early            early      0.685     0.315

confusionMatrix(data = rf_pre2_5cv.df$rf_pre2_5cv_pred,
                reference = test_data_sig$stage,
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    57   19
##      late      8   22
##                                           
##                Accuracy : 0.7453          
##                  95% CI : (0.6514, 0.8249)
##     No Information Rate : 0.6132          
##     P-Value [Acc > NIR] : 0.002909        
##                                           
##                   Kappa : 0.4351          
##  Mcnemar's Test P-Value : 0.054292        
##                                           
##             Sensitivity : 0.5366          
##             Specificity : 0.8769          
##          Pos Pred Value : 0.7333          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.3868          
##          Detection Rate : 0.2075          
##    Detection Prevalence : 0.2830          
##       Balanced Accuracy : 0.7068          
##                                           
##        'Positive' Class : late            
##

rf_pre2_5cv_rocCurve <- roc(response = test_data_sig$stage,
                predictor = rf_pre2_5cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_sig$stage)))

auc(rf_pre2_5cv_rocCurve)

## Area under the curve: 0.7884

ci.roc(rf_pre2_5cv_rocCurve)

## 95% CI: 0.6979-0.8788 (DeLong)

plot(rf_pre2_5cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_sig$stage, predictor = rf_pre2_5cv.df$early_prob,     levels = rev(levels(test_data_sig$stage)))
## 
## Data: rf_pre2_5cv.df$early_prob in 41 controls (test_data_sig$stage late) < 65 cases (test_data_sig$stage early).
## Area under the curve: 0.7884

save(rf_pre2_5cv, rf_pre2_5cv_pred, rf_pre2_5cv_prob,
     file = "../Results/rf_pre2_5cv.RData")

RF/PreProc1/10-foldCV

Random forest, preprocessing method 1, 10 fold cross validataion.

then <- Sys.time()
set.seed(477)
rf_pre1_10cv <- train(x = train_data_filtered[, -c(1:2)], 
               y = train_data_filtered[, "stage"],
               method = "rf",
               ntree = 1000,
               tuneGrid = data.frame(mtry = mtryValues),
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl10f)
Sys.time() - then

## Time difference of 1.410013 mins

rf_pre1_10cv_pred <- predict(rf_pre1_10cv, 
                            newdata = test_data_filtered[, - c(1:2)])
rf_pre1_10cv_prob <- predict(rf_pre1_10cv, 
                            newdata = test_data_filtered[, - c(1:2)],
                            type = "prob")

rf_pre1_10cv.df <- data.frame(stage = test_data_filtered[, "stage"], 
                             rf_pre1_10cv_pred, 
                             early_prob = rf_pre1_10cv_prob$early,
                             late_prob = rf_pre1_10cv_prob$late)

head(rf_pre1_10cv.df)

##   stage rf_pre1_10cv_pred early_prob late_prob
## 1 early             early      0.666     0.334
## 2 early              late      0.259     0.741
## 3 early             early      0.941     0.059
## 4 early             early      0.680     0.320
## 5 early             early      0.582     0.418
## 6 early             early      0.855     0.145

confusionMatrix(data = rf_pre1_10cv.df$rf_pre1_10cv_pred,
                reference = test_data_filtered[, "stage"],
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    58   20
##      late      7   21
##                                           
##                Accuracy : 0.7453          
##                  95% CI : (0.6514, 0.8249)
##     No Information Rate : 0.6132          
##     P-Value [Acc > NIR] : 0.002909        
##                                           
##                   Kappa : 0.4297          
##  Mcnemar's Test P-Value : 0.020921        
##                                           
##             Sensitivity : 0.5122          
##             Specificity : 0.8923          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.7436          
##              Prevalence : 0.3868          
##          Detection Rate : 0.1981          
##    Detection Prevalence : 0.2642          
##       Balanced Accuracy : 0.7023          
##                                           
##        'Positive' Class : late            
##

rf_pre1_10cv_rocCurve <- roc(response = test_data_filtered[, "stage"],
                predictor = rf_pre1_10cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_filtered[, "stage"])))

auc(rf_pre1_10cv_rocCurve)

## Area under the curve: 0.821

ci.roc(rf_pre1_10cv_rocCurve)

## 95% CI: 0.737-0.905 (DeLong)

plot(rf_pre1_10cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_filtered[, "stage"], predictor = rf_pre1_10cv.df$early_prob,     levels = rev(levels(test_data_filtered[, "stage"])))
## 
## Data: rf_pre1_10cv.df$early_prob in 41 controls (test_data_filtered[, "stage"] late) < 65 cases (test_data_filtered[, "stage"] early).
## Area under the curve: 0.821

save(rf_pre1_10cv, rf_pre1_10cv_pred, rf_pre1_10cv_prob,
     file = "../Results/rf_pre1_10cv.RData")

RF/PreProc2/10-foldCV

Random forest, preprocessing method, 10 fold cross validation

then <- Sys.time()
set.seed(477)
rf_pre2_10cv <- train(x = train_data_sig[, -c(1:2)], 
               y = train_data_sig[, "stage"],
               method = "rf",
               ntree = 1000,
               tuneGrid = data.frame(mtry = mtryValues),
               preProc = c("center", "scale", "pca"),
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl10f)
Sys.time() - then

## Time difference of 2.616941 mins

rf_pre2_10cv_pred <- predict(rf_pre2_10cv, 
                            newdata = test_data_sig[, - c(1:2)])
rf_pre2_10cv_prob <- predict(rf_pre2_10cv, 
                            newdata = test_data_sig[, - c(1:2)],
                            type = "prob")

rf_pre2_10cv.df <- data.frame(stage = test_data_sig[, "stage"], 
                             rf_pre2_10cv_pred, 
                             early_prob = rf_pre2_10cv_prob$early,
                             late_prob = rf_pre2_10cv_prob$late)

head(rf_pre2_10cv.df)

##   stage rf_pre2_10cv_pred early_prob late_prob
## 1 early             early      0.750     0.250
## 2 early              late      0.488     0.512
## 3 early             early      0.816     0.184
## 4 early             early      0.543     0.457
## 5 early             early      0.549     0.451
## 6 early             early      0.644     0.356

confusionMatrix(data = rf_pre2_10cv.df$rf_pre2_10cv_pred,
                reference = test_data_sig[, "stage"],
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    59   21
##      late      6   20
##                                           
##                Accuracy : 0.7453          
##                  95% CI : (0.6514, 0.8249)
##     No Information Rate : 0.6132          
##     P-Value [Acc > NIR] : 0.002909        
##                                           
##                   Kappa : 0.4241          
##  Mcnemar's Test P-Value : 0.007054        
##                                           
##             Sensitivity : 0.4878          
##             Specificity : 0.9077          
##          Pos Pred Value : 0.7692          
##          Neg Pred Value : 0.7375          
##              Prevalence : 0.3868          
##          Detection Rate : 0.1887          
##    Detection Prevalence : 0.2453          
##       Balanced Accuracy : 0.6977          
##                                           
##        'Positive' Class : late            
##

rf_pre2_10cv_rocCurve <- roc(response = test_data_sig[, "stage"],
                predictor = rf_pre2_10cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_sig[, "stage"])))

auc(rf_pre2_10cv_rocCurve)

## Area under the curve: 0.7767

ci.roc(rf_pre2_10cv_rocCurve)

## 95% CI: 0.6849-0.8686 (DeLong)

plot(rf_pre2_10cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_sig[, "stage"], predictor = rf_pre2_10cv.df$early_prob,     levels = rev(levels(test_data_sig[, "stage"])))
## 
## Data: rf_pre2_10cv.df$early_prob in 41 controls (test_data_sig[, "stage"] late) < 65 cases (test_data_sig[, "stage"] early).
## Area under the curve: 0.7767

save(rf_pre2_10cv, rf_pre2_10cv_pred, rf_pre2_10cv_prob,
     file = "../Results/rf_pre2_10cv.RData")

Naive Bayes (NB)

The last algorithm is naive Bayes.

NB/PreProc1/5-foldCV

Naive Bayes, preprocessing method 1, 5 fold cross validation.

then <- Sys.time()
set.seed(477)
nb_pre1_5cv <- train(x = train_data_filtered[, -c(1:2)], 
               y = train_data_filtered$stage,
               method = "nb",
               ntree = 1000,
               tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl5f)
Sys.time() - then

## Time difference of 6.616614 secs

nb_pre1_5cv_pred <- predict(nb_pre1_5cv, 
                            newdata = test_data_filtered[, - c(1:2)])
nb_pre1_5cv_prob <- predict(nb_pre1_5cv, 
                            newdata = test_data_filtered[, - c(1:2)],
                            type = "prob")

nb_pre1_5cv.df <- data.frame(stage = test_data_filtered$stage, 
                             nb_pre1_5cv_pred, 
                             early_prob = nb_pre1_5cv_prob$early,
                             late_prob = nb_pre1_5cv_prob$late)

head(nb_pre1_5cv.df)

##   stage nb_pre1_5cv_pred   early_prob    late_prob
## 1 early            early 9.749844e-01 2.501558e-02
## 2 early             late 1.134862e-13 1.000000e+00
## 3 early            early 1.000000e+00 6.178386e-09
## 4 early             late 7.773875e-27 1.000000e+00
## 5 early            early 1.000000e+00 6.718722e-13
## 6 early            early 1.000000e+00 1.464735e-19

confusionMatrix(data = nb_pre1_5cv.df$nb_pre1_5cv_pred,
                reference = test_data_filtered$stage,
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    48   21
##      late     17   20
##                                           
##                Accuracy : 0.6415          
##                  95% CI : (0.5426, 0.7323)
##     No Information Rate : 0.6132          
##     P-Value [Acc > NIR] : 0.3112          
##                                           
##                   Kappa : 0.2304          
##  Mcnemar's Test P-Value : 0.6265          
##                                           
##             Sensitivity : 0.4878          
##             Specificity : 0.7385          
##          Pos Pred Value : 0.5405          
##          Neg Pred Value : 0.6957          
##              Prevalence : 0.3868          
##          Detection Rate : 0.1887          
##    Detection Prevalence : 0.3491          
##       Balanced Accuracy : 0.6131          
##                                           
##        'Positive' Class : late            
##

nb_pre1_5cv_rocCurve <- roc(response = test_data_filtered$stage,
                predictor = nb_pre1_5cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_filtered$stage)))

auc(nb_pre1_5cv_rocCurve)

## Area under the curve: 0.672

ci.roc(nb_pre1_5cv_rocCurve)

## 95% CI: 0.5642-0.7798 (DeLong)

plot(nb_pre1_5cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_filtered$stage, predictor = nb_pre1_5cv.df$early_prob,     levels = rev(levels(test_data_filtered$stage)))
## 
## Data: nb_pre1_5cv.df$early_prob in 41 controls (test_data_filtered$stage late) < 65 cases (test_data_filtered$stage early).
## Area under the curve: 0.672

save(nb_pre1_5cv, nb_pre1_5cv_pred, nb_pre1_5cv_prob,
     file = "../Results/nb_pre1_5cv.RData")

NB/PreProc2/5-foldCV

Naive Bayes, preprocessing method 2, 5 fold cross validation.

then <- Sys.time()
set.seed(477)
nb_pre2_5cv <- train(x = train_data_sig[, -c(1:2)], 
               y = train_data_sig$stage,
               method = "nb",
               ntree = 1000,
               tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
               preProc = c("center", "scale", "pca"),
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl5f)
Sys.time() - then

## Time difference of 16.84182 secs

nb_pre2_5cv_pred <- predict(nb_pre2_5cv, 
                            newdata = test_data_sig[, - c(1:2)])
nb_pre2_5cv_prob <- predict(nb_pre2_5cv, 
                            newdata = test_data_sig[, - c(1:2)],
                            type = "prob")

nb_pre2_5cv.df <- data.frame(stage = test_data_sig$stage, 
                             nb_pre2_5cv_pred, 
                             early_prob = nb_pre2_5cv_prob$early,
                             late_prob = nb_pre2_5cv_prob$late)

head(nb_pre2_5cv.df)

##   stage nb_pre2_5cv_pred   early_prob   late_prob
## 1 early            early 9.989010e-01 0.001098964
## 2 early             late 4.247094e-01 0.575290621
## 3 early            early 9.975956e-01 0.002404399
## 4 early             late 5.295538e-05 0.999947045
## 5 early             late 4.391418e-01 0.560858178
## 6 early            early 9.620655e-01 0.037934535

confusionMatrix(data = nb_pre2_5cv.df$nb_pre2_5cv_pred,
                reference = test_data_sig$stage,
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    48   23
##      late     17   18
##                                          
##                Accuracy : 0.6226         
##                  95% CI : (0.5233, 0.715)
##     No Information Rate : 0.6132         
##     P-Value [Acc > NIR] : 0.4633         
##                                          
##                   Kappa : 0.1824         
##  Mcnemar's Test P-Value : 0.4292         
##                                          
##             Sensitivity : 0.4390         
##             Specificity : 0.7385         
##          Pos Pred Value : 0.5143         
##          Neg Pred Value : 0.6761         
##              Prevalence : 0.3868         
##          Detection Rate : 0.1698         
##    Detection Prevalence : 0.3302         
##       Balanced Accuracy : 0.5887         
##                                          
##        'Positive' Class : late           
##

nb_pre2_5cv_rocCurve <- roc(response = test_data_sig$stage,
                predictor = nb_pre2_5cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_sig$stage)))

auc(nb_pre2_5cv_rocCurve)

## Area under the curve: 0.6484

ci.roc(nb_pre2_5cv_rocCurve)

## 95% CI: 0.542-0.7549 (DeLong)

plot(nb_pre2_5cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_sig$stage, predictor = nb_pre2_5cv.df$early_prob,     levels = rev(levels(test_data_sig$stage)))
## 
## Data: nb_pre2_5cv.df$early_prob in 41 controls (test_data_sig$stage late) < 65 cases (test_data_sig$stage early).
## Area under the curve: 0.6484

save(nb_pre2_5cv, nb_pre2_5cv_pred, nb_pre2_5cv_prob,
     file = "../Results/nb_pre2_5cv.RData")

NB/PreProc1/10-foldCV

Naive Bayes, preprocessing method 1, 10 fold cross validation.

then <- Sys.time()
set.seed(477)
nb_pre1_10cv <- train(x = train_data_filtered[, -c(1:2)], 
               y = train_data_filtered[, "stage"],
               method = "nb",
               ntree = 1000,
               tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl10f)
Sys.time() - then

## Time difference of 7.104017 secs

nb_pre1_10cv_pred <- predict(nb_pre1_10cv, 
                            newdata = test_data_filtered[, - c(1:2)])
nb_pre1_10cv_prob <- predict(nb_pre1_10cv, 
                            newdata = test_data_filtered[, - c(1:2)],
                            type = "prob")

nb_pre1_10cv.df <- data.frame(stage = test_data_filtered[, "stage"], 
                             nb_pre1_10cv_pred, 
                             early_prob = nb_pre1_10cv_prob$early,
                             late_prob = nb_pre1_10cv_prob$late)

head(nb_pre1_10cv.df)

##   stage nb_pre1_10cv_pred   early_prob    late_prob
## 1 early             early 0.9999999930 7.034877e-09
## 2 early              late 0.0001013995 9.998986e-01
## 3 early             early 0.9999998822 1.178449e-07
## 4 early             early 0.9999027000 9.729995e-05
## 5 early              late 0.0455040776 9.544959e-01
## 6 early             early 1.0000000000 2.024158e-12

confusionMatrix(data = nb_pre1_10cv.df$nb_pre1_10cv_pred,
                reference = test_data_filtered[, "stage"],
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    59   30
##      late      6   11
##                                          
##                Accuracy : 0.6604         
##                  95% CI : (0.562, 0.7496)
##     No Information Rate : 0.6132         
##     P-Value [Acc > NIR] : 0.1851589      
##                                          
##                   Kappa : 0.1973         
##  Mcnemar's Test P-Value : 0.0001264      
##                                          
##             Sensitivity : 0.2683         
##             Specificity : 0.9077         
##          Pos Pred Value : 0.6471         
##          Neg Pred Value : 0.6629         
##              Prevalence : 0.3868         
##          Detection Rate : 0.1038         
##    Detection Prevalence : 0.1604         
##       Balanced Accuracy : 0.5880         
##                                          
##        'Positive' Class : late           
##

nb_pre1_10cv_rocCurve <- roc(response = test_data_filtered[, "stage"],
                predictor = nb_pre1_10cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_filtered[, "stage"])))

auc(nb_pre1_10cv_rocCurve)

## Area under the curve: 0.6935

ci.roc(nb_pre1_10cv_rocCurve)

## 95% CI: 0.5846-0.8023 (DeLong)

plot(nb_pre1_10cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_filtered[, "stage"], predictor = nb_pre1_10cv.df$early_prob,     levels = rev(levels(test_data_filtered[, "stage"])))
## 
## Data: nb_pre1_10cv.df$early_prob in 40 controls (test_data_filtered[, "stage"] late) < 65 cases (test_data_filtered[, "stage"] early).
## Area under the curve: 0.6935

save(nb_pre1_10cv, nb_pre1_10cv_pred, nb_pre1_10cv_prob,
     file = "../Results/nb_pre1_10cv.RData")

nb/PreProc2/10-foldCV

Random forest, preprocessing method, 10 fold cross validation.

then <- Sys.time()
set.seed(477)
nb_pre2_10cv <- train(x = train_data_sig[, -c(1:2)], 
               y = train_data_sig[, "stage"],
               method = "nb",
               ntree = 1000,
               tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
               preProc = c("center", "scale", "pca"),
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl10f)
Sys.time() - then

## Time difference of 20.5454 secs

nb_pre2_10cv_pred <- predict(nb_pre2_10cv, 
                            newdata = test_data_sig[, - c(1:2)])
nb_pre2_10cv_prob <- predict(nb_pre2_10cv, 
                            newdata = test_data_sig[, - c(1:2)],
                            type = "prob")

nb_pre2_10cv.df <- data.frame(stage = test_data_sig[, "stage"], 
                             nb_pre2_10cv_pred, 
                             early_prob = nb_pre2_10cv_prob$early,
                             late_prob = nb_pre2_10cv_prob$late)

head(nb_pre2_10cv.df)

##   stage nb_pre2_10cv_pred early_prob    late_prob
## 1 early             early  0.9989512 0.0010488276
## 2 early             early  0.8834962 0.1165037761
## 3 early             early  0.9997455 0.0002545083
## 4 early             early  0.7958403 0.2041597150
## 5 early             early  0.8731985 0.1268015222
## 6 early             early  0.9781878 0.0218122298

confusionMatrix(data = nb_pre2_10cv.df$nb_pre2_10cv_pred,
                reference = test_data_sig[, "stage"],
                positive = "late")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction early late
##      early    51   26
##      late     14   15
##                                          
##                Accuracy : 0.6226         
##                  95% CI : (0.5233, 0.715)
##     No Information Rate : 0.6132         
##     P-Value [Acc > NIR] : 0.46329        
##                                          
##                   Kappa : 0.1591         
##  Mcnemar's Test P-Value : 0.08199        
##                                          
##             Sensitivity : 0.3659         
##             Specificity : 0.7846         
##          Pos Pred Value : 0.5172         
##          Neg Pred Value : 0.6623         
##              Prevalence : 0.3868         
##          Detection Rate : 0.1415         
##    Detection Prevalence : 0.2736         
##       Balanced Accuracy : 0.5752         
##                                          
##        'Positive' Class : late           
##

nb_pre2_10cv_rocCurve <- roc(response = test_data_sig[, "stage"],
                predictor = nb_pre2_10cv.df$early_prob,
                ## This function assumes that the second
                ## class is the event of interest, so we
                ## reverse the labels
                levels = rev(levels(test_data_sig[, "stage"])))

auc(nb_pre2_10cv_rocCurve)

## Area under the curve: 0.6081

ci.roc(nb_pre2_10cv_rocCurve)

## 95% CI: 0.4985-0.7177 (DeLong)

plot(nb_pre2_10cv_rocCurve, legacy.axes = TRUE)

## 
## Call:
## roc.default(response = test_data_sig[, "stage"], predictor = nb_pre2_10cv.df$early_prob,     levels = rev(levels(test_data_sig[, "stage"])))
## 
## Data: nb_pre2_10cv.df$early_prob in 40 controls (test_data_sig[, "stage"] late) < 65 cases (test_data_sig[, "stage"] early).
## Area under the curve: 0.6081

save(nb_pre2_10cv, nb_pre2_10cv_pred, nb_pre2_10cv_prob,
     file = "../Results/nb_pre2_10cv.RData")

Session Information

print(sessionInfo(), locale = FALSE)

## R version 3.1.3 (2015-03-09)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] klaR_0.6-12         MASS_7.3-40         randomForest_4.6-10
## [4] genefilter_1.48.1   pROC_1.7.3          caret_6.0-41       
## [7] ggplot2_1.0.1       lattice_0.20-30     kernlab_0.9-20     
## 
## loaded via a namespace (and not attached):
##  [1] annotate_1.44.0      AnnotationDbi_1.28.2 Biobase_2.26.0      
##  [4] BiocGenerics_0.12.1  BradleyTerry2_1.0-6  brglm_0.5-9         
##  [7] car_2.0-25           class_7.3-12         codetools_0.2-11    
## [10] colorspace_1.2-6     combinat_0.0-8       compiler_3.1.3      
## [13] DBI_0.3.1            digest_0.6.8         e1071_1.6-4         
## [16] evaluate_0.5.5       foreach_1.4.2        formatR_1.1         
## [19] GenomeInfoDb_1.2.5   grid_3.1.3           gtable_0.1.2        
## [22] gtools_3.4.1         htmltools_0.2.6      IRanges_2.0.1       
## [25] iterators_1.0.7      knitr_1.9            lme4_1.1-7          
## [28] Matrix_1.2-0         mgcv_1.8-6           minqa_1.2.4         
## [31] munsell_0.4.2        nlme_3.1-120         nloptr_1.0.4        
## [34] nnet_7.3-9           parallel_3.1.3       pbkrtest_0.4-2      
## [37] plyr_1.8.1           proto_0.3-10         quantreg_5.11       
## [40] Rcpp_0.11.5          reshape2_1.4.1       rmarkdown_0.3.11    
## [43] RSQLite_1.0.0        S4Vectors_0.4.0      scales_0.2.4        
## [46] SparseM_1.6          splines_3.1.3        stats4_3.1.3        
## [49] stringr_0.6.2        survival_2.38-1      tools_3.1.3         
## [52] XML_3.98-1.1         xtable_1.7-4         yaml_2.1.13