This page documents the model training process. For documentation on how the data was prepared, see this page.
First, we library all of the necessary packages and load the training and test sets.
library(kernlab)
library(caret)
library(pROC)
# library the genefilter package (install from bioconductor first if it's not
# installed already)
if(!require(genefilter)){
source("http://bioconductor.org/biocLite.R")
biocLite("genefilter")
library(genefilter)
}
load("../Data/split_data.RData")
ls()
## [1] "miRNA_pvalue" "test_data" "train_data"
Then we make sure that the stage column in those data frames are factors.
train_data$stage <- as.factor(train_data$stage)
test_data$stage <- as.factor(test_data$stage)
We’ll do two kinds of preprocessing:
Here we create a function that filters out miRNAs that do not have at least 20% of the sample expression value at 100 or greater. Then it filters out miRNAs that do not have coefficient of variation (sd/mean) between 0.7 and 10. (https://www.biostars.org/p/86981/)
exprFilterFun <- filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
Now we filter the training and test data sets.
# make a matrix from the train_data data frame
train_matrix <- train_data[, -c(1:2)]
# transpose the matrix so that rows are miRNAs and columns are samples
train_matrix_t <- t(train_matrix)
# create a logical vector that indicates TRUE for features that meet the
# filter requirments
expr_filter <- genefilter(train_matrix_t, exprFilterFun)
# see how many features are left
sum(expr_filter)
## [1] 45
# subset the training data down to the features that meet the requirements
train_data_filtered <- train_data[, c(TRUE, TRUE, expr_filter)]
# subset test data
test_data_filtered <- test_data[, names(train_data_filtered)]
For this method, we narrow the features down by choosing a p-value threshold, and then using principal component analysis to transform the data into linear combinations of the miRNA expression data.
Here we choose a threshold of 0.1 for the p-values.
sig_rna <- miRNA_pvalue[miRNA_pvalue$pvalue <= .1, "miRNA_id"]
train_data_sig <- data.frame(train_data[, 1:2],
train_data[, sig_rna])
test_data_sig <- data.frame(test_data[, 1:2],
test_data[, sig_rna])
Then we apply PCA to the remaining data set. We’ll choose a threshold of 95% (choose components that account for 95% of the variation in the data).
trans <- preProcess(train_data_sig[, -c(1:2)],
method = c("center", "scale", "pca"),
thresh = 0.95)
trans
##
## Call:
## preProcess.default(x = train_data_sig[, -c(1:2)], method =
## c("center", "scale", "pca"), thresh = 0.95)
##
## Created from 424 samples and 221 variables
## Pre-processing: centered, scaled, principal component signal extraction
##
## PCA needed 117 components to capture 95 percent of the variance
117 components will be used
We’ll be using the training samples, which totals 424 observations. To train the parameters for each model, we’ll use two data splitting methods. 5- fold cross validation and 10 fold cross validation.
Here we set the training control for 5-fold cross validation.
ctrl5f <- trainControl(method = "cv", number = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE)
And here we set the training control for 10-fold cross validation.
ctrl10f <- trainControl(method = "cv", number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE)
The first algorithm is the support vector machine.
The first model we’ll train is the support vector machine, using the first preproccessing method and 5-fold cross validation.
First we make a tuning grid that covers a range of sigma and cost values (Kuhn and Johnson 2013)
svmrGrid <- expand.grid(sigma = c(.00007, .00009, .0001, .0002),
C = 2^(-3:8))
Now we fit the model and report the run time. ROC is used to select the optimal model.
then <- Sys.time()
set.seed(477)
svm_pre1_5cv <- train(stage ~ .,
data = train_data_filtered[, -1],
method = "svmRadial",
tuneGrid = svmrGrid,
metric = "ROC",
trControl = ctrl5f)
Sys.time() - then
## Time difference of 27.81267 secs
We predict using the test set and evaluate the performance.
svm_pre1_5cv_pred <- predict(svm_pre1_5cv,
newdata = test_data_filtered[, - c(1:2)])
svm_pre1_5cv_prob <- predict(svm_pre1_5cv,
newdata = test_data_filtered[, - c(1:2)],
type = "prob")
svm_pre1_5cv.df <- data.frame(stage = test_data_filtered$stage,
svm_pre1_5cv_pred,
early_prob = svm_pre1_5cv_prob$early,
late_prob = svm_pre1_5cv_prob$late)
head(svm_pre1_5cv.df)
## stage svm_pre1_5cv_pred early_prob late_prob
## 1 early early 0.7546434 0.2453566
## 2 early late 0.3258466 0.6741534
## 3 early early 0.8322489 0.1677511
## 4 early early 0.7042966 0.2957034
## 5 early early 0.5331362 0.4668638
## 6 early early 0.7421521 0.2578479
confusionMatrix(data = svm_pre1_5cv.df$svm_pre1_5cv_pred,
reference = test_data_filtered$stage,
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 57 21
## late 8 20
##
## Accuracy : 0.7264
## 95% CI : (0.6313, 0.8085)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.009744
##
## Kappa : 0.3874
## Mcnemar's Test P-Value : 0.025858
##
## Sensitivity : 0.4878
## Specificity : 0.8769
## Pos Pred Value : 0.7143
## Neg Pred Value : 0.7308
## Prevalence : 0.3868
## Detection Rate : 0.1887
## Detection Prevalence : 0.2642
## Balanced Accuracy : 0.6824
##
## 'Positive' Class : late
##
svm_pre1_5cv_rocCurve <- roc(response = test_data_filtered$stage,
predictor = svm_pre1_5cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_filtered$stage)))
auc(svm_pre1_5cv_rocCurve)
## Area under the curve: 0.7996
ci.roc(svm_pre1_5cv_rocCurve)
## 95% CI: 0.7136-0.8857 (DeLong)
plot(svm_pre1_5cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_filtered$stage, predictor = svm_pre1_5cv.df$early_prob, levels = rev(levels(test_data_filtered$stage)))
##
## Data: svm_pre1_5cv.df$early_prob in 41 controls (test_data_filtered$stage late) < 65 cases (test_data_filtered$stage early).
## Area under the curve: 0.7996
save(svm_pre1_5cv, svm_pre1_5cv_pred, svm_pre1_5cv_prob,
file = "../Results/svm_pre1_5cv.RData")
SVM, preprocessing method 2, and 5 fold cross validation. (The default threshold for “pca” in the preProc
argument is 95%)
then <- Sys.time()
set.seed(477)
svm_pre2_5cv <- train(stage ~ .,
data = train_data_sig[, -1],
method = "svmRadial",
tuneGrid = svmrGrid,
preProc = c("center", "scale", "pca"),
metric = "ROC",
trControl = ctrl5f)
## maximum number of iterations reached 3.491983e-05 3.484089e-05maximum number of iterations reached 2.17924e-05 2.174739e-05
Sys.time() - then
## Time difference of 1.239041 mins
svm_pre2_5cv_pred <- predict(svm_pre2_5cv,
newdata = test_data_sig[, - c(1:2)])
svm_pre2_5cv_prob <- predict(svm_pre2_5cv,
newdata = test_data_sig[, - c(1:2)],
type = "prob")
svm_pre2_5cv.df <- data.frame(stage = test_data_sig$stage,
svm_pre2_5cv_pred,
early_prob = svm_pre2_5cv_prob$early,
late_prob = svm_pre2_5cv_prob$late)
head(svm_pre2_5cv.df)
## stage svm_pre2_5cv_pred early_prob late_prob
## 1 early early 0.7733647 0.2266353
## 2 early early 0.5800195 0.4199805
## 3 early early 0.7451924 0.2548076
## 4 early late 0.4315871 0.5684129
## 5 early early 0.6153190 0.3846810
## 6 early early 0.7932442 0.2067558
confusionMatrix(data = svm_pre2_5cv.df$svm_pre2_5cv_pred,
reference = test_data_sig$stage,
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 58 25
## late 7 16
##
## Accuracy : 0.6981
## 95% CI : (0.6013, 0.7835)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.043386
##
## Kappa : 0.3075
## Mcnemar's Test P-Value : 0.002654
##
## Sensitivity : 0.3902
## Specificity : 0.8923
## Pos Pred Value : 0.6957
## Neg Pred Value : 0.6988
## Prevalence : 0.3868
## Detection Rate : 0.1509
## Detection Prevalence : 0.2170
## Balanced Accuracy : 0.6413
##
## 'Positive' Class : late
##
svm_pre2_5cv_rocCurve <- roc(response = test_data_sig$stage,
predictor = svm_pre2_5cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_sig$stage)))
auc(svm_pre2_5cv_rocCurve)
## Area under the curve: 0.6604
ci.roc(svm_pre2_5cv_rocCurve)
## 95% CI: 0.5478-0.773 (DeLong)
plot(svm_pre2_5cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_sig$stage, predictor = svm_pre2_5cv.df$early_prob, levels = rev(levels(test_data_sig$stage)))
##
## Data: svm_pre2_5cv.df$early_prob in 41 controls (test_data_sig$stage late) < 65 cases (test_data_sig$stage early).
## Area under the curve: 0.6604
save(svm_pre2_5cv, svm_pre2_5cv_pred, svm_pre2_5cv_prob,
file = "../Results/svm_pre2_5cv.RData")
SVM, preproccessing method 1, and 10 fold cross validation.
then <- Sys.time()
set.seed(477)
svm_pre1_10cv <- train(stage ~ .,
data = train_data_filtered[, -1],
method = "svmRadial",
tuneGrid = svmrGrid,
metric = "ROC",
trControl = ctrl10f)
Sys.time() - then
## Time difference of 54.44594 secs
We predict using the test set and evaluate the performance.
svm_pre1_10cv_pred <- predict(svm_pre1_10cv,
newdata = test_data_filtered[, - c(1:2)])
svm_pre1_10cv_prob <- predict(svm_pre1_10cv,
newdata = test_data_filtered[, - c(1:2)],
type = "prob")
svm_pre1_10cv.df <- data.frame(stage = test_data_filtered[, "stage"],
svm_pre1_10cv_pred,
early_prob = svm_pre1_10cv_prob$early,
late_prob = svm_pre1_10cv_prob$late)
head(svm_pre1_10cv.df)
## stage svm_pre1_10cv_pred early_prob late_prob
## 1 early early 0.8349623 0.1650377
## 2 early late 0.4175511 0.5824489
## 3 early early 0.8705517 0.1294483
## 4 early early 0.7254903 0.2745097
## 5 early early 0.6109064 0.3890936
## 6 early early 0.7569519 0.2430481
confusionMatrix(data = svm_pre1_10cv.df$svm_pre1_10cv_pred,
reference = test_data_filtered[, "stage"],
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 57 25
## late 8 16
##
## Accuracy : 0.6887
## 95% CI : (0.5914, 0.7751)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.065898
##
## Kappa : 0.2893
## Mcnemar's Test P-Value : 0.005349
##
## Sensitivity : 0.3902
## Specificity : 0.8769
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.6951
## Prevalence : 0.3868
## Detection Rate : 0.1509
## Detection Prevalence : 0.2264
## Balanced Accuracy : 0.6336
##
## 'Positive' Class : late
##
svm_pre1_10cv_rocCurve <- roc(response = test_data_filtered[, "stage"],
predictor = svm_pre1_10cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_filtered[, "stage"])))
auc(svm_pre1_10cv_rocCurve)
## Area under the curve: 0.7644
ci.roc(svm_pre1_10cv_rocCurve)
## 95% CI: 0.6682-0.8605 (DeLong)
plot(svm_pre1_10cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_filtered[, "stage"], predictor = svm_pre1_10cv.df$early_prob, levels = rev(levels(test_data_filtered[, "stage"])))
##
## Data: svm_pre1_10cv.df$early_prob in 41 controls (test_data_filtered[, "stage"] late) < 65 cases (test_data_filtered[, "stage"] early).
## Area under the curve: 0.7644
save(svm_pre1_10cv, svm_pre1_10cv_pred, svm_pre1_10cv_prob,
file = "../Results/svm_pre1_10cv.RData")
SVM, preprocessing method 2, and 10-fold cross validation.
then <- Sys.time()
set.seed(477)
svm_pre2_10cv <- train(stage ~ .,
data = train_data_sig[, -1],
method = "svmRadial",
tuneGrid = svmrGrid,
preProc = c("center", "scale", "pca"),
metric = "ROC",
trControl = ctrl10f)
Sys.time() - then
## Time difference of 2.62259 mins
svm_pre2_10cv_pred <- predict(svm_pre2_10cv,
newdata = test_data_sig[, - c(1:2)])
svm_pre2_10cv_prob <- predict(svm_pre2_10cv,
newdata = test_data_sig[, - c(1:2)],
type = "prob")
svm_pre2_10cv.df <- data.frame(stage = test_data_sig[, "stage"],
svm_pre2_10cv_pred,
early_prob = svm_pre2_10cv_prob$early,
late_prob = svm_pre2_10cv_prob$late)
head(svm_pre2_10cv.df)
## stage svm_pre2_10cv_pred early_prob late_prob
## 1 early early 0.8124083 0.1875917
## 2 early late 0.4760781 0.5239219
## 3 early early 0.7788731 0.2211269
## 4 early late 0.3690614 0.6309386
## 5 early early 0.5924416 0.4075584
## 6 early early 0.7155900 0.2844100
confusionMatrix(data = svm_pre2_10cv.df$svm_pre2_10cv_pred,
reference = test_data_sig[, "stage"],
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 55 24
## late 10 17
##
## Accuracy : 0.6792
## 95% CI : (0.5816, 0.7666)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.09639
##
## Kappa : 0.2783
## Mcnemar's Test P-Value : 0.02578
##
## Sensitivity : 0.4146
## Specificity : 0.8462
## Pos Pred Value : 0.6296
## Neg Pred Value : 0.6962
## Prevalence : 0.3868
## Detection Rate : 0.1604
## Detection Prevalence : 0.2547
## Balanced Accuracy : 0.6304
##
## 'Positive' Class : late
##
svm_pre2_10cv_rocCurve <- roc(response = test_data_sig[, "stage"],
predictor = svm_pre2_10cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_sig[, "stage"])))
auc(svm_pre2_10cv_rocCurve)
## Area under the curve: 0.7081
ci.roc(svm_pre2_10cv_rocCurve)
## 95% CI: 0.6066-0.8096 (DeLong)
plot(svm_pre2_10cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_sig[, "stage"], predictor = svm_pre2_10cv.df$early_prob, levels = rev(levels(test_data_sig[, "stage"])))
##
## Data: svm_pre2_10cv.df$early_prob in 41 controls (test_data_sig[, "stage"] late) < 65 cases (test_data_sig[, "stage"] early).
## Area under the curve: 0.7081
save(svm_pre2_10cv, svm_pre2_10cv_pred, svm_pre2_10cv_prob,
file = "../Results/svm_pre2_10cv.RData")
The second algorithm is random forest.
Random forest, preprocessing method 1, 5 fold cross validation.
For random forest, we tune the mtry parameter using 5 values evenly spaced from 2 to p, where p = # of predictors. We also use an ensemble of 1000 trees (pg. 387, Kuhn and Johnson 2013)
p <- ncol(train_data_filtered) - 2
spaced.values <- round(seq(2, p, length.out = 7))
mtryValues <- spaced.values[2:6]
then <- Sys.time()
set.seed(477)
rf_pre1_5cv <- train(x = train_data_filtered[, -c(1:2)],
y = train_data_filtered$stage,
method = "rf",
ntree = 1000,
tuneGrid = data.frame(mtry = mtryValues),
importance = TRUE,
metric = "ROC",
trControl = ctrl5f)
Sys.time() - then
## Time difference of 38.28141 secs
rf_pre1_5cv_pred <- predict(rf_pre1_5cv,
newdata = test_data_filtered[, - c(1:2)])
rf_pre1_5cv_prob <- predict(rf_pre1_5cv,
newdata = test_data_filtered[, - c(1:2)],
type = "prob")
rf_pre1_5cv.df <- data.frame(stage = test_data_filtered$stage,
rf_pre1_5cv_pred,
early_prob = rf_pre1_5cv_prob$early,
late_prob = rf_pre1_5cv_prob$late)
head(rf_pre1_5cv.df)
## stage rf_pre1_5cv_pred early_prob late_prob
## 1 early early 0.671 0.329
## 2 early late 0.252 0.748
## 3 early early 0.926 0.074
## 4 early early 0.673 0.327
## 5 early early 0.635 0.365
## 6 early early 0.862 0.138
confusionMatrix(data = rf_pre1_5cv.df$rf_pre1_5cv_pred,
reference = test_data_filtered$stage,
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 58 21
## late 7 20
##
## Accuracy : 0.7358
## 95% CI : (0.6413, 0.8168)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.005444
##
## Kappa : 0.4057
## Mcnemar's Test P-Value : 0.014019
##
## Sensitivity : 0.4878
## Specificity : 0.8923
## Pos Pred Value : 0.7407
## Neg Pred Value : 0.7342
## Prevalence : 0.3868
## Detection Rate : 0.1887
## Detection Prevalence : 0.2547
## Balanced Accuracy : 0.6901
##
## 'Positive' Class : late
##
rf_pre1_5cv_rocCurve <- roc(response = test_data_filtered$stage,
predictor = rf_pre1_5cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_filtered$stage)))
auc(rf_pre1_5cv_rocCurve)
## Area under the curve: 0.831
ci.roc(rf_pre1_5cv_rocCurve)
## 95% CI: 0.7499-0.912 (DeLong)
plot(rf_pre1_5cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_filtered$stage, predictor = rf_pre1_5cv.df$early_prob, levels = rev(levels(test_data_filtered$stage)))
##
## Data: rf_pre1_5cv.df$early_prob in 41 controls (test_data_filtered$stage late) < 65 cases (test_data_filtered$stage early).
## Area under the curve: 0.831
save(rf_pre1_5cv, rf_pre1_5cv_pred, rf_pre1_5cv_prob,
file = "../Results/rf_pre1_5cv.RData")
Random forest, preprocessing method 2, 5 fold cross validation.
then <- Sys.time()
set.seed(477)
rf_pre2_5cv <- train(x = train_data_sig[, -c(1:2)],
y = train_data_sig$stage,
method = "rf",
ntree = 1000,
tuneGrid = data.frame(mtry = mtryValues),
preProc = c("center", "scale", "pca"),
importance = TRUE,
metric = "ROC",
trControl = ctrl5f)
Sys.time() - then
## Time difference of 1.135894 mins
rf_pre2_5cv_pred <- predict(rf_pre2_5cv,
newdata = test_data_sig[, - c(1:2)])
rf_pre2_5cv_prob <- predict(rf_pre2_5cv,
newdata = test_data_sig[, - c(1:2)],
type = "prob")
rf_pre2_5cv.df <- data.frame(stage = test_data_sig$stage,
rf_pre2_5cv_pred,
early_prob = rf_pre2_5cv_prob$early,
late_prob = rf_pre2_5cv_prob$late)
head(rf_pre2_5cv.df)
## stage rf_pre2_5cv_pred early_prob late_prob
## 1 early early 0.757 0.243
## 2 early late 0.493 0.507
## 3 early early 0.868 0.132
## 4 early early 0.542 0.458
## 5 early early 0.563 0.437
## 6 early early 0.685 0.315
confusionMatrix(data = rf_pre2_5cv.df$rf_pre2_5cv_pred,
reference = test_data_sig$stage,
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 57 19
## late 8 22
##
## Accuracy : 0.7453
## 95% CI : (0.6514, 0.8249)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.002909
##
## Kappa : 0.4351
## Mcnemar's Test P-Value : 0.054292
##
## Sensitivity : 0.5366
## Specificity : 0.8769
## Pos Pred Value : 0.7333
## Neg Pred Value : 0.7500
## Prevalence : 0.3868
## Detection Rate : 0.2075
## Detection Prevalence : 0.2830
## Balanced Accuracy : 0.7068
##
## 'Positive' Class : late
##
rf_pre2_5cv_rocCurve <- roc(response = test_data_sig$stage,
predictor = rf_pre2_5cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_sig$stage)))
auc(rf_pre2_5cv_rocCurve)
## Area under the curve: 0.7884
ci.roc(rf_pre2_5cv_rocCurve)
## 95% CI: 0.6979-0.8788 (DeLong)
plot(rf_pre2_5cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_sig$stage, predictor = rf_pre2_5cv.df$early_prob, levels = rev(levels(test_data_sig$stage)))
##
## Data: rf_pre2_5cv.df$early_prob in 41 controls (test_data_sig$stage late) < 65 cases (test_data_sig$stage early).
## Area under the curve: 0.7884
save(rf_pre2_5cv, rf_pre2_5cv_pred, rf_pre2_5cv_prob,
file = "../Results/rf_pre2_5cv.RData")
Random forest, preprocessing method 1, 10 fold cross validataion.
then <- Sys.time()
set.seed(477)
rf_pre1_10cv <- train(x = train_data_filtered[, -c(1:2)],
y = train_data_filtered[, "stage"],
method = "rf",
ntree = 1000,
tuneGrid = data.frame(mtry = mtryValues),
importance = TRUE,
metric = "ROC",
trControl = ctrl10f)
Sys.time() - then
## Time difference of 1.410013 mins
rf_pre1_10cv_pred <- predict(rf_pre1_10cv,
newdata = test_data_filtered[, - c(1:2)])
rf_pre1_10cv_prob <- predict(rf_pre1_10cv,
newdata = test_data_filtered[, - c(1:2)],
type = "prob")
rf_pre1_10cv.df <- data.frame(stage = test_data_filtered[, "stage"],
rf_pre1_10cv_pred,
early_prob = rf_pre1_10cv_prob$early,
late_prob = rf_pre1_10cv_prob$late)
head(rf_pre1_10cv.df)
## stage rf_pre1_10cv_pred early_prob late_prob
## 1 early early 0.666 0.334
## 2 early late 0.259 0.741
## 3 early early 0.941 0.059
## 4 early early 0.680 0.320
## 5 early early 0.582 0.418
## 6 early early 0.855 0.145
confusionMatrix(data = rf_pre1_10cv.df$rf_pre1_10cv_pred,
reference = test_data_filtered[, "stage"],
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 58 20
## late 7 21
##
## Accuracy : 0.7453
## 95% CI : (0.6514, 0.8249)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.002909
##
## Kappa : 0.4297
## Mcnemar's Test P-Value : 0.020921
##
## Sensitivity : 0.5122
## Specificity : 0.8923
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.7436
## Prevalence : 0.3868
## Detection Rate : 0.1981
## Detection Prevalence : 0.2642
## Balanced Accuracy : 0.7023
##
## 'Positive' Class : late
##
rf_pre1_10cv_rocCurve <- roc(response = test_data_filtered[, "stage"],
predictor = rf_pre1_10cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_filtered[, "stage"])))
auc(rf_pre1_10cv_rocCurve)
## Area under the curve: 0.821
ci.roc(rf_pre1_10cv_rocCurve)
## 95% CI: 0.737-0.905 (DeLong)
plot(rf_pre1_10cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_filtered[, "stage"], predictor = rf_pre1_10cv.df$early_prob, levels = rev(levels(test_data_filtered[, "stage"])))
##
## Data: rf_pre1_10cv.df$early_prob in 41 controls (test_data_filtered[, "stage"] late) < 65 cases (test_data_filtered[, "stage"] early).
## Area under the curve: 0.821
save(rf_pre1_10cv, rf_pre1_10cv_pred, rf_pre1_10cv_prob,
file = "../Results/rf_pre1_10cv.RData")
Random forest, preprocessing method, 10 fold cross validation
then <- Sys.time()
set.seed(477)
rf_pre2_10cv <- train(x = train_data_sig[, -c(1:2)],
y = train_data_sig[, "stage"],
method = "rf",
ntree = 1000,
tuneGrid = data.frame(mtry = mtryValues),
preProc = c("center", "scale", "pca"),
importance = TRUE,
metric = "ROC",
trControl = ctrl10f)
Sys.time() - then
## Time difference of 2.616941 mins
rf_pre2_10cv_pred <- predict(rf_pre2_10cv,
newdata = test_data_sig[, - c(1:2)])
rf_pre2_10cv_prob <- predict(rf_pre2_10cv,
newdata = test_data_sig[, - c(1:2)],
type = "prob")
rf_pre2_10cv.df <- data.frame(stage = test_data_sig[, "stage"],
rf_pre2_10cv_pred,
early_prob = rf_pre2_10cv_prob$early,
late_prob = rf_pre2_10cv_prob$late)
head(rf_pre2_10cv.df)
## stage rf_pre2_10cv_pred early_prob late_prob
## 1 early early 0.750 0.250
## 2 early late 0.488 0.512
## 3 early early 0.816 0.184
## 4 early early 0.543 0.457
## 5 early early 0.549 0.451
## 6 early early 0.644 0.356
confusionMatrix(data = rf_pre2_10cv.df$rf_pre2_10cv_pred,
reference = test_data_sig[, "stage"],
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 59 21
## late 6 20
##
## Accuracy : 0.7453
## 95% CI : (0.6514, 0.8249)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.002909
##
## Kappa : 0.4241
## Mcnemar's Test P-Value : 0.007054
##
## Sensitivity : 0.4878
## Specificity : 0.9077
## Pos Pred Value : 0.7692
## Neg Pred Value : 0.7375
## Prevalence : 0.3868
## Detection Rate : 0.1887
## Detection Prevalence : 0.2453
## Balanced Accuracy : 0.6977
##
## 'Positive' Class : late
##
rf_pre2_10cv_rocCurve <- roc(response = test_data_sig[, "stage"],
predictor = rf_pre2_10cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_sig[, "stage"])))
auc(rf_pre2_10cv_rocCurve)
## Area under the curve: 0.7767
ci.roc(rf_pre2_10cv_rocCurve)
## 95% CI: 0.6849-0.8686 (DeLong)
plot(rf_pre2_10cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_sig[, "stage"], predictor = rf_pre2_10cv.df$early_prob, levels = rev(levels(test_data_sig[, "stage"])))
##
## Data: rf_pre2_10cv.df$early_prob in 41 controls (test_data_sig[, "stage"] late) < 65 cases (test_data_sig[, "stage"] early).
## Area under the curve: 0.7767
save(rf_pre2_10cv, rf_pre2_10cv_pred, rf_pre2_10cv_prob,
file = "../Results/rf_pre2_10cv.RData")
The last algorithm is naive Bayes.
Naive Bayes, preprocessing method 1, 5 fold cross validation.
then <- Sys.time()
set.seed(477)
nb_pre1_5cv <- train(x = train_data_filtered[, -c(1:2)],
y = train_data_filtered$stage,
method = "nb",
ntree = 1000,
tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
importance = TRUE,
metric = "ROC",
trControl = ctrl5f)
Sys.time() - then
## Time difference of 6.616614 secs
nb_pre1_5cv_pred <- predict(nb_pre1_5cv,
newdata = test_data_filtered[, - c(1:2)])
nb_pre1_5cv_prob <- predict(nb_pre1_5cv,
newdata = test_data_filtered[, - c(1:2)],
type = "prob")
nb_pre1_5cv.df <- data.frame(stage = test_data_filtered$stage,
nb_pre1_5cv_pred,
early_prob = nb_pre1_5cv_prob$early,
late_prob = nb_pre1_5cv_prob$late)
head(nb_pre1_5cv.df)
## stage nb_pre1_5cv_pred early_prob late_prob
## 1 early early 9.749844e-01 2.501558e-02
## 2 early late 1.134862e-13 1.000000e+00
## 3 early early 1.000000e+00 6.178386e-09
## 4 early late 7.773875e-27 1.000000e+00
## 5 early early 1.000000e+00 6.718722e-13
## 6 early early 1.000000e+00 1.464735e-19
confusionMatrix(data = nb_pre1_5cv.df$nb_pre1_5cv_pred,
reference = test_data_filtered$stage,
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 48 21
## late 17 20
##
## Accuracy : 0.6415
## 95% CI : (0.5426, 0.7323)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.3112
##
## Kappa : 0.2304
## Mcnemar's Test P-Value : 0.6265
##
## Sensitivity : 0.4878
## Specificity : 0.7385
## Pos Pred Value : 0.5405
## Neg Pred Value : 0.6957
## Prevalence : 0.3868
## Detection Rate : 0.1887
## Detection Prevalence : 0.3491
## Balanced Accuracy : 0.6131
##
## 'Positive' Class : late
##
nb_pre1_5cv_rocCurve <- roc(response = test_data_filtered$stage,
predictor = nb_pre1_5cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_filtered$stage)))
auc(nb_pre1_5cv_rocCurve)
## Area under the curve: 0.672
ci.roc(nb_pre1_5cv_rocCurve)
## 95% CI: 0.5642-0.7798 (DeLong)
plot(nb_pre1_5cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_filtered$stage, predictor = nb_pre1_5cv.df$early_prob, levels = rev(levels(test_data_filtered$stage)))
##
## Data: nb_pre1_5cv.df$early_prob in 41 controls (test_data_filtered$stage late) < 65 cases (test_data_filtered$stage early).
## Area under the curve: 0.672
save(nb_pre1_5cv, nb_pre1_5cv_pred, nb_pre1_5cv_prob,
file = "../Results/nb_pre1_5cv.RData")
Naive Bayes, preprocessing method 2, 5 fold cross validation.
then <- Sys.time()
set.seed(477)
nb_pre2_5cv <- train(x = train_data_sig[, -c(1:2)],
y = train_data_sig$stage,
method = "nb",
ntree = 1000,
tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
preProc = c("center", "scale", "pca"),
importance = TRUE,
metric = "ROC",
trControl = ctrl5f)
Sys.time() - then
## Time difference of 16.84182 secs
nb_pre2_5cv_pred <- predict(nb_pre2_5cv,
newdata = test_data_sig[, - c(1:2)])
nb_pre2_5cv_prob <- predict(nb_pre2_5cv,
newdata = test_data_sig[, - c(1:2)],
type = "prob")
nb_pre2_5cv.df <- data.frame(stage = test_data_sig$stage,
nb_pre2_5cv_pred,
early_prob = nb_pre2_5cv_prob$early,
late_prob = nb_pre2_5cv_prob$late)
head(nb_pre2_5cv.df)
## stage nb_pre2_5cv_pred early_prob late_prob
## 1 early early 9.989010e-01 0.001098964
## 2 early late 4.247094e-01 0.575290621
## 3 early early 9.975956e-01 0.002404399
## 4 early late 5.295538e-05 0.999947045
## 5 early late 4.391418e-01 0.560858178
## 6 early early 9.620655e-01 0.037934535
confusionMatrix(data = nb_pre2_5cv.df$nb_pre2_5cv_pred,
reference = test_data_sig$stage,
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 48 23
## late 17 18
##
## Accuracy : 0.6226
## 95% CI : (0.5233, 0.715)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.4633
##
## Kappa : 0.1824
## Mcnemar's Test P-Value : 0.4292
##
## Sensitivity : 0.4390
## Specificity : 0.7385
## Pos Pred Value : 0.5143
## Neg Pred Value : 0.6761
## Prevalence : 0.3868
## Detection Rate : 0.1698
## Detection Prevalence : 0.3302
## Balanced Accuracy : 0.5887
##
## 'Positive' Class : late
##
nb_pre2_5cv_rocCurve <- roc(response = test_data_sig$stage,
predictor = nb_pre2_5cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_sig$stage)))
auc(nb_pre2_5cv_rocCurve)
## Area under the curve: 0.6484
ci.roc(nb_pre2_5cv_rocCurve)
## 95% CI: 0.542-0.7549 (DeLong)
plot(nb_pre2_5cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_sig$stage, predictor = nb_pre2_5cv.df$early_prob, levels = rev(levels(test_data_sig$stage)))
##
## Data: nb_pre2_5cv.df$early_prob in 41 controls (test_data_sig$stage late) < 65 cases (test_data_sig$stage early).
## Area under the curve: 0.6484
save(nb_pre2_5cv, nb_pre2_5cv_pred, nb_pre2_5cv_prob,
file = "../Results/nb_pre2_5cv.RData")
Naive Bayes, preprocessing method 1, 10 fold cross validation.
then <- Sys.time()
set.seed(477)
nb_pre1_10cv <- train(x = train_data_filtered[, -c(1:2)],
y = train_data_filtered[, "stage"],
method = "nb",
ntree = 1000,
tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
importance = TRUE,
metric = "ROC",
trControl = ctrl10f)
Sys.time() - then
## Time difference of 7.104017 secs
nb_pre1_10cv_pred <- predict(nb_pre1_10cv,
newdata = test_data_filtered[, - c(1:2)])
nb_pre1_10cv_prob <- predict(nb_pre1_10cv,
newdata = test_data_filtered[, - c(1:2)],
type = "prob")
nb_pre1_10cv.df <- data.frame(stage = test_data_filtered[, "stage"],
nb_pre1_10cv_pred,
early_prob = nb_pre1_10cv_prob$early,
late_prob = nb_pre1_10cv_prob$late)
head(nb_pre1_10cv.df)
## stage nb_pre1_10cv_pred early_prob late_prob
## 1 early early 0.9999999930 7.034877e-09
## 2 early late 0.0001013995 9.998986e-01
## 3 early early 0.9999998822 1.178449e-07
## 4 early early 0.9999027000 9.729995e-05
## 5 early late 0.0455040776 9.544959e-01
## 6 early early 1.0000000000 2.024158e-12
confusionMatrix(data = nb_pre1_10cv.df$nb_pre1_10cv_pred,
reference = test_data_filtered[, "stage"],
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 59 30
## late 6 11
##
## Accuracy : 0.6604
## 95% CI : (0.562, 0.7496)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.1851589
##
## Kappa : 0.1973
## Mcnemar's Test P-Value : 0.0001264
##
## Sensitivity : 0.2683
## Specificity : 0.9077
## Pos Pred Value : 0.6471
## Neg Pred Value : 0.6629
## Prevalence : 0.3868
## Detection Rate : 0.1038
## Detection Prevalence : 0.1604
## Balanced Accuracy : 0.5880
##
## 'Positive' Class : late
##
nb_pre1_10cv_rocCurve <- roc(response = test_data_filtered[, "stage"],
predictor = nb_pre1_10cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_filtered[, "stage"])))
auc(nb_pre1_10cv_rocCurve)
## Area under the curve: 0.6935
ci.roc(nb_pre1_10cv_rocCurve)
## 95% CI: 0.5846-0.8023 (DeLong)
plot(nb_pre1_10cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_filtered[, "stage"], predictor = nb_pre1_10cv.df$early_prob, levels = rev(levels(test_data_filtered[, "stage"])))
##
## Data: nb_pre1_10cv.df$early_prob in 40 controls (test_data_filtered[, "stage"] late) < 65 cases (test_data_filtered[, "stage"] early).
## Area under the curve: 0.6935
save(nb_pre1_10cv, nb_pre1_10cv_pred, nb_pre1_10cv_prob,
file = "../Results/nb_pre1_10cv.RData")
Random forest, preprocessing method, 10 fold cross validation.
then <- Sys.time()
set.seed(477)
nb_pre2_10cv <- train(x = train_data_sig[, -c(1:2)],
y = train_data_sig[, "stage"],
method = "nb",
ntree = 1000,
tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
preProc = c("center", "scale", "pca"),
importance = TRUE,
metric = "ROC",
trControl = ctrl10f)
Sys.time() - then
## Time difference of 20.5454 secs
nb_pre2_10cv_pred <- predict(nb_pre2_10cv,
newdata = test_data_sig[, - c(1:2)])
nb_pre2_10cv_prob <- predict(nb_pre2_10cv,
newdata = test_data_sig[, - c(1:2)],
type = "prob")
nb_pre2_10cv.df <- data.frame(stage = test_data_sig[, "stage"],
nb_pre2_10cv_pred,
early_prob = nb_pre2_10cv_prob$early,
late_prob = nb_pre2_10cv_prob$late)
head(nb_pre2_10cv.df)
## stage nb_pre2_10cv_pred early_prob late_prob
## 1 early early 0.9989512 0.0010488276
## 2 early early 0.8834962 0.1165037761
## 3 early early 0.9997455 0.0002545083
## 4 early early 0.7958403 0.2041597150
## 5 early early 0.8731985 0.1268015222
## 6 early early 0.9781878 0.0218122298
confusionMatrix(data = nb_pre2_10cv.df$nb_pre2_10cv_pred,
reference = test_data_sig[, "stage"],
positive = "late")
## Confusion Matrix and Statistics
##
## Reference
## Prediction early late
## early 51 26
## late 14 15
##
## Accuracy : 0.6226
## 95% CI : (0.5233, 0.715)
## No Information Rate : 0.6132
## P-Value [Acc > NIR] : 0.46329
##
## Kappa : 0.1591
## Mcnemar's Test P-Value : 0.08199
##
## Sensitivity : 0.3659
## Specificity : 0.7846
## Pos Pred Value : 0.5172
## Neg Pred Value : 0.6623
## Prevalence : 0.3868
## Detection Rate : 0.1415
## Detection Prevalence : 0.2736
## Balanced Accuracy : 0.5752
##
## 'Positive' Class : late
##
nb_pre2_10cv_rocCurve <- roc(response = test_data_sig[, "stage"],
predictor = nb_pre2_10cv.df$early_prob,
## This function assumes that the second
## class is the event of interest, so we
## reverse the labels
levels = rev(levels(test_data_sig[, "stage"])))
auc(nb_pre2_10cv_rocCurve)
## Area under the curve: 0.6081
ci.roc(nb_pre2_10cv_rocCurve)
## 95% CI: 0.4985-0.7177 (DeLong)
plot(nb_pre2_10cv_rocCurve, legacy.axes = TRUE)
##
## Call:
## roc.default(response = test_data_sig[, "stage"], predictor = nb_pre2_10cv.df$early_prob, levels = rev(levels(test_data_sig[, "stage"])))
##
## Data: nb_pre2_10cv.df$early_prob in 40 controls (test_data_sig[, "stage"] late) < 65 cases (test_data_sig[, "stage"] early).
## Area under the curve: 0.6081
save(nb_pre2_10cv, nb_pre2_10cv_pred, nb_pre2_10cv_prob,
file = "../Results/nb_pre2_10cv.RData")
print(sessionInfo(), locale = FALSE)
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] klaR_0.6-12 MASS_7.3-40 randomForest_4.6-10
## [4] genefilter_1.48.1 pROC_1.7.3 caret_6.0-41
## [7] ggplot2_1.0.1 lattice_0.20-30 kernlab_0.9-20
##
## loaded via a namespace (and not attached):
## [1] annotate_1.44.0 AnnotationDbi_1.28.2 Biobase_2.26.0
## [4] BiocGenerics_0.12.1 BradleyTerry2_1.0-6 brglm_0.5-9
## [7] car_2.0-25 class_7.3-12 codetools_0.2-11
## [10] colorspace_1.2-6 combinat_0.0-8 compiler_3.1.3
## [13] DBI_0.3.1 digest_0.6.8 e1071_1.6-4
## [16] evaluate_0.5.5 foreach_1.4.2 formatR_1.1
## [19] GenomeInfoDb_1.2.5 grid_3.1.3 gtable_0.1.2
## [22] gtools_3.4.1 htmltools_0.2.6 IRanges_2.0.1
## [25] iterators_1.0.7 knitr_1.9 lme4_1.1-7
## [28] Matrix_1.2-0 mgcv_1.8-6 minqa_1.2.4
## [31] munsell_0.4.2 nlme_3.1-120 nloptr_1.0.4
## [34] nnet_7.3-9 parallel_3.1.3 pbkrtest_0.4-2
## [37] plyr_1.8.1 proto_0.3-10 quantreg_5.11
## [40] Rcpp_0.11.5 reshape2_1.4.1 rmarkdown_0.3.11
## [43] RSQLite_1.0.0 S4Vectors_0.4.0 scales_0.2.4
## [46] SparseM_1.6 splines_3.1.3 stats4_3.1.3
## [49] stringr_0.6.2 survival_2.38-1 tools_3.1.3
## [52] XML_3.98-1.1 xtable_1.7-4 yaml_2.1.13