0.1 Summary

0.1.1 Introduction

The purpose of this study is to estimate and predict the risk (1-year survival rate) of thoracic surgery. It is important to take post-surgery risk into account in patient selection for surgery. In particular, this is a classification problem with a binary outcome. Traditional statistical methods, including Kaplan–Meier survival curves, hierarchical statistical models, multivariable logistic regression and Cox proportional hazards regression, are easy to interpret but have certain limitations. In this project, KNN (k-nearest neighbors), SVM (support vector machine) and GBM (gradient boosting machine) with 10-fold repeated cross validation (CV) are applied to predict the 1-year survival status of patients after thoracic surgery. The final dataset, which has no missing values, contains 470 observations, and the imbalance ratio is 5.71 (F vs. T). To address this imbalance, SMOTE (Synthetic Minority Over-sampling Technique) sampling is also applied, to test whether the sampling technique can improve model performance (AUC, area under the ROC curve).

0.1.2 Take a look at the dataset

The dataset contains 16 features and one binary outcome, “Risk1Yr”. There are 3 continuous variables: “PRE4”, “PRE5” and “AGE”. Among these, “PRE4” and “PRE5” are highly skewed. There are no strong correlations among the continuous variables, and the outcome cannot be well classified on the basis of these 3 continuous variables alone. “DGN”, “PRE6”, “PRE7”, “PRE8”, “PRE9”, “PRE10”, “PRE11”, “PRE14”, “PRE17”, “PRE19”, “PRE25”, “PRE30” and “PRE32” are categorical variables.
The continuous variables are normalized, centered and scaled to feed KNN (the “tsdaKNN” dataset). From “tsdaKNN”, all categorical variables are converted to dummy variables to feed SVM, while the outcome is kept as a categorical variable (the “num.svm.out” dataset). The original dataset (“tsda”) is used as the input to the GBM model. Feature selection is done by lasso on the “num.svm.out” data. However, due to the lack of a holdout test dataset, it is difficult to assess the impact of feature selection on the GBM model, since the GBM model already reaches the upper limit of AUC on the training data (AUC = 1).

0.1.3 Results

The final results are shown in the table. t-SNE is applied to improve the performance of KNN. The seeds list is used to keep the results reproducible while carrying out the parallel processing.
Table 1, the comparison of the models’ performance, is at the end of the document (section “The comparison table”).

0.1.4 Conclusion

In conclusion, the GBM model (n.trees = 3500, interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10) without SMOTE sampling performs the best. Overall, SMOTE sampling is not necessary in this study. Part of the reason could be that the outcome is not rare: 70 positives out of 470. The t-SNE method can improve the performance of KNN in this case. Models without SMOTE sampling generally have a Youden-optimized threshold near the population prevalence (the perfectly fitting GBM is the exception). This analysis also suggests that a holdout dataset is necessary for model performance validation.
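As a quick arithmetic check of the prevalence point (a sketch using only numbers reported in Table 1):

## prevalence of the positive class
70 / 470                       ## ~0.149
## Youden thresholds of the no-SMOTE models: KNN 0.1667, SVM 0.1403, Lasso 0.1569;
## GBM sits at 0.5 because it separates the training data perfectly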

0.2 The exploratory data analysis

require(foreign, quietly = TRUE)
require(caret, quietly = TRUE)
require(e1071, quietly = TRUE)
## Read in data with read.arff
tsda <- read.arff('http://archive.ics.uci.edu/ml/machine-learning-databases/00277/ThoraricSurgery.arff')

## DATA preprocess for the model
summary(tsda)
##    DGN           PRE4            PRE5          PRE6     PRE7    PRE8   
##  DGN1:  1   Min.   :1.440   Min.   : 0.960   PRZ0:130   F:439   F:402  
##  DGN2: 52   1st Qu.:2.600   1st Qu.: 1.960   PRZ1:313   T: 31   T: 68  
##  DGN3:349   Median :3.160   Median : 2.400   PRZ2: 27                  
##  DGN4: 47   Mean   :3.282   Mean   : 4.569                             
##  DGN5: 15   3rd Qu.:3.808   3rd Qu.: 3.080                             
##  DGN6:  4   Max.   :6.300   Max.   :86.300                             
##  DGN8:  2                                                              
##  PRE9    PRE10   PRE11    PRE14     PRE17   PRE19   PRE25   PRE30  
##  F:439   F:147   F:392   OC11:177   F:435   F:468   F:462   F: 84  
##  T: 31   T:323   T: 78   OC12:257   T: 35   T:  2   T:  8   T:386  
##                          OC13: 19                                  
##                          OC14: 17                                  
##                                                                    
##                                                                    
##                                                                    
##  PRE32        AGE        Risk1Yr
##  F:468   Min.   :21.00   F:400  
##  T:  2   1st Qu.:57.00   T: 70  
##          Median :62.00          
##          Mean   :62.53          
##          3rd Qu.:69.00          
##          Max.   :87.00          
## 
class(tsda)
## [1] "data.frame"
typeof(tsda)
## [1] "list"
str(tsda)
## 'data.frame':    470 obs. of  17 variables:
##  $ DGN    : Factor w/ 7 levels "DGN1","DGN2",..: 2 3 3 3 3 3 3 2 3 3 ...
##  $ PRE4   : num  2.88 3.4 2.76 3.68 2.44 2.48 4.36 3.19 3.16 2.32 ...
##  $ PRE5   : num  2.16 1.88 2.08 3.04 0.96 1.88 3.28 2.5 2.64 2.16 ...
##  $ PRE6   : Factor w/ 3 levels "PRZ0","PRZ1",..: 2 1 2 1 3 2 2 2 3 2 ...
##  $ PRE7   : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PRE8   : Factor w/ 2 levels "F","T": 1 1 1 1 2 1 1 1 1 1 ...
##  $ PRE9   : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PRE10  : Factor w/ 2 levels "F","T": 2 1 2 1 2 2 2 2 2 2 ...
##  $ PRE11  : Factor w/ 2 levels "F","T": 2 1 1 1 2 1 1 1 2 1 ...
##  $ PRE14  : Factor w/ 4 levels "OC11","OC12",..: 4 2 1 1 1 1 2 1 1 1 ...
##  $ PRE17  : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 2 1 1 1 ...
##  $ PRE19  : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PRE25  : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 2 1 1 ...
##  $ PRE30  : Factor w/ 2 levels "F","T": 2 2 2 1 2 1 2 2 2 2 ...
##  $ PRE32  : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AGE    : num  60 51 59 54 73 51 59 66 68 54 ...
##  $ Risk1Yr: Factor w/ 2 levels "F","T": 1 1 1 1 2 1 2 2 1 1 ...
fea.tsda <- names(tsda)
print(fea.tsda)
##  [1] "DGN"     "PRE4"    "PRE5"    "PRE6"    "PRE7"    "PRE8"    "PRE9"   
##  [8] "PRE10"   "PRE11"   "PRE14"   "PRE17"   "PRE19"   "PRE25"   "PRE30"  
## [15] "PRE32"   "AGE"     "Risk1Yr"
## the outcome is Risk1Yr
conti.tsda <- c("PRE4", "PRE5", "AGE")  ## PRE4, PRE5, AGE
cate.tsda <- c("DGN", "PRE6", "PRE7", "PRE8", "PRE9",
               "PRE10", "PRE11", "PRE14", "PRE17", "PRE19",
               "PRE25", "PRE30", "PRE32")
## Categorical DGN, PRE6, PRE7, PRE8, PRE9, 
## PRE10, PRE11, PRE14, PRE17, PRE19, PRE25, PRE30, PRE32
skewtest <- apply(tsda[,c(2,3,16)], 2, skewness)
skewtest
##       PRE4       PRE5        AGE 
##  0.5417132  5.5975836 -0.1899413
## correlation analysis 
## the featurePlot is only for continuous features
featurePlot( x = tsda[ , c(2,3,16)],
             y = tsda$Risk1Yr,
             plot = "ellipse",
             ## Add a key at the top
            auto.key = list(columns = 2)
             )

## correlation of continuous variables
require(corrplot)
## Loading required package: corrplot
## corrplot 0.84 loaded
tsda.corM <- cor(tsda[,c(2,3,16)])
corrplot(tsda.corM, method = "number", type = "upper")
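For reference, the imbalance ratio quoted in the summary can be read directly off the outcome table (a quick sketch):

## class balance of the outcome
table(tsda$Risk1Yr)            ## F: 400, T: 70
round(400/70, 2)               ## 5.71, the F vs T imbalance ratio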

0.3 Three methods and SMOTE sampling (and a sniff of t-SNE)

require(caret, quietly = TRUE)
require(pROC)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## set seeds for parallel caret
set.seed(123)
## 10 fold 3 repeats
seeds.knn <- vector(mode = "list", length = 31)
## 10 seeds per resample: one for each of the 10 values of k in knn
## (the svmRadialCost grid also has 10 values of C)
for(i in 1:30) seeds.knn[[i]]<- sample.int(n=1000, 10)
## for the final model
seeds.knn[[31]]<-sample.int(1000, 1)
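A structural check (sketch): caret expects length(seeds) to equal the number of resamples plus one, with one integer per tuning candidate in each resample element.

length(seeds.knn)              ## 31 = 10 folds x 3 repeats + 1 final-model seed
lengths(seeds.knn)[1:3]        ## 10 each, one per candidate k in knnGrid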

## the trainControl objects
## this one sets no seeds
## (gbmGrid below contains 60 tuning-parameter combinations per resample)
fitControl <- trainControl(## 10 fold repeated CV
                    method = "repeatedcv",
                    number = 10,
                    repeats = 3,
                    allowParallel = TRUE,
                    classProbs = TRUE,
                    # sampling = "smote",
                    summaryFunction = twoClassSummary,
                    verboseIter = FALSE)

# no sampling for imbalance class
print("the trControl for caret 10 cross-fold 3 repeats no smote")
## [1] "the trControl for caret 10 cross-fold 3 repeats no smote"
fitControl.knn <- trainControl(## 10 fold repeated CV
                    method = "repeatedcv",
                    number = 10,
                    repeats = 3,
                    allowParallel = TRUE,
                    classProbs = TRUE,
                    seeds = seeds.knn,
                    summaryFunction = twoClassSummary,
                    verboseIter = FALSE)

# smote sampling for imbalance class
print("the trControl for caret 10 cross-fold 3 repeats with smote sampling")
## [1] "the trControl for caret 10 cross-fold 3 repeats with smote sampling"
fitControlSmote.knn <- trainControl(
                         method = "repeatedcv",
                         number = 10,
                         repeats = 3,
                         allowParallel = TRUE,
                         seeds = seeds.knn,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary,
                         verboseIter = FALSE,
                         sampling = "smote")

## Parallel processing
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores = 6)

## the preprocessing for KNN
preKNN <- preProcess(tsda, method = c("BoxCox", "center", "scale"))
tsdaKNN <- predict(preKNN, tsda)
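A minimal sanity check (a sketch; column names as in the dataset) that preProcess transformed only the numeric columns: after Box-Cox, centering and scaling they should have mean ~0 and sd ~1, while the factors pass through unchanged.

round(sapply(tsdaKNN[, c("PRE4", "PRE5", "AGE")], mean), 3)
round(sapply(tsdaKNN[, c("PRE4", "PRE5", "AGE")], sd), 3)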

## the preprocess for lasso svm and tsne
dmy <- dummyVars(" Risk1Yr~ .", data = tsdaKNN)
numx.lasso <- data.frame(predict(dmy, tsdaKNN))

## keep the outcome as a categorical variable
num.svm.out <- data.frame(numx.lasso, tsda$Risk1Yr)
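A quick dimension check (sketch): dummyVars with its default fullRank = FALSE expands the 13 factors plus 3 numeric features into 37 numeric columns, and num.svm.out appends the factor outcome as column 38, named tsda.Risk1Yr (the name used in the SVM formulas below).

dim(numx.lasso)                ## 470 37
names(num.svm.out)[38]         ## "tsda.Risk1Yr"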

## t-SNE transformed data for better separation in KNN and SVM
require(Rtsne)
## Loading required package: Rtsne
rtsne.tsda <- Rtsne(tsdaKNN[,-17], dims = 5, check_duplicates = F)
rtsne.knn <- data.frame(rtsne.tsda$Y, tsdaKNN$Risk1Yr)
## all numeric tsne
rtsne.tsda.num <- Rtsne(num.svm.out[,-38], dims = 5, check_duplicates = F)
rtsne.knn.num <- data.frame(rtsne.tsda.num$Y, tsda$Risk1Yr)
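An optional visual check (a sketch, not part of the model pipeline): plotting the first two of the five t-SNE dimensions, colored by outcome, gives a rough feel for the separation KNN will see.

plot(rtsne.tsda$Y[, 1:2], col = as.integer(tsda$Risk1Yr), pch = 19,
     xlab = "t-SNE 1", ylab = "t-SNE 2")
legend("topright", legend = levels(tsda$Risk1Yr), col = 1:2, pch = 19)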

## the tuning Grid for KNN
knnGrid <- expand.grid(k = (1:10))

par(mfrow = c(1,3))
## caret train the KNN model no smote sampling 
print("the KNN withOUT smote and WITH normalization, Model1 KNN1")
## [1] "the KNN withOUT smote and WITH normalization, Model1 KNN1"
caretknn <- train(Risk1Yr ~., data = tsdaKNN, 
                   method = "knn",
                   #distribution = "bernoulli",
                   #verbose = F,
                   metric = "ROC",
                   trControl = fitControl.knn,
                   tuneGrid = knnGrid)
caretknn
## k-Nearest Neighbors 
## 
## 470 samples
##  16 predictors
##   2 classes: 'F', 'T' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 423, 423, 423, 423, 423, 423, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec      
##    1  0.5397619  0.9033333  0.17619048
##    2  0.5731548  0.8950000  0.15238095
##    3  0.6001190  0.9608333  0.07142857
##    4  0.5857143  0.9558333  0.07619048
##    5  0.5976190  0.9841667  0.01904762
##    6  0.5945833  0.9841667  0.01428571
##    7  0.5789881  0.9941667  0.01428571
##    8  0.5827976  0.9933333  0.03333333
##    9  0.5779762  0.9958333  0.01904762
##   10  0.5708333  0.9950000  0.00952381
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
plot(caretknn)

## roc and youden
pre.c.k <- predict(caretknn, tsdaKNN, type = "prob")
roc.m1k <- roc(tsdaKNN$Risk1Yr, pre.c.k$T)
print("The model1 KNN without smote sampling and with normalization" )
## [1] "The model1 KNN without smote sampling and with normalization"
print(roc.m1k)
## 
## Call:
## roc.default(response = tsdaKNN$Risk1Yr, predictor = pre.c.k$T)
## 
## Data: pre.c.k$T in 400 controls (tsdaKNN$Risk1Yr F) < 70 cases (tsdaKNN$Risk1Yr T).
## Area under the curve: 0.9269
you.m1k <- coords(roc.m1k, x = "best", 
                   best.method = "youden", 
                   ret = c("threshold", "specificity", 
                           "sensitivity", "accuracy")) 
you.m1k
##   threshold specificity sensitivity    accuracy 
##   0.1666667   0.8025000   1.0000000   0.8319149
## KNN model with smote sampling 
print("the KNN with smote and NO normalization, Model2 KNN2")
## [1] "the KNN with smote and NO normalization, Model2 KNN2"
caretknn.s.nopre <- train(Risk1Yr ~ ., data = tsda,  ## this is nonnormalized data
                    method = "knn",
                    #distribution = "bernoulli",
                    #verbose = F,
                    metric = "ROC",
                    trControl = fitControlSmote.knn,
                    tuneGrid = knnGrid)
## Loading required package: grid
caretknn.s.nopre
## k-Nearest Neighbors 
## 
## 470 samples
##  16 predictors
##   2 classes: 'F', 'T' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 423, 423, 423, 423, 423, 423, ... 
## Addtional sampling using SMOTE
## 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    1  0.5354167  0.6475000  0.4238095
##    2  0.5139881  0.6741667  0.3238095
##    3  0.5101786  0.6708333  0.3333333
##    4  0.4900000  0.6441667  0.3285714
##    5  0.5142857  0.6983333  0.3428571
##    6  0.5548214  0.6808333  0.3809524
##    7  0.5081548  0.6816667  0.3095238
##    8  0.5300000  0.6808333  0.3238095
##    9  0.5444048  0.6958333  0.3571429
##   10  0.5466071  0.6991667  0.3904762
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 6.
#knn_fit <- train(V1 ~., data = training, method = "knn",
#trControl=trctrl,
#preProcess = c("center", "scale"),
#tuneLength = 10)

print("the KNN with smote and normalization, Model3 KNN3")
## [1] "the KNN with smote and normalization, Model3 KNN3"
caretknn.s <- train(Risk1Yr ~ ., data = tsdaKNN,  ## normalized data
                    method = "knn",
                    #distribution = "bernoulli",
                    #verbose = F,
                    metric = "ROC",
                    trControl = fitControlSmote.knn,
                    tuneGrid = knnGrid)
caretknn.s
## k-Nearest Neighbors 
## 
## 470 samples
##  16 predictors
##   2 classes: 'F', 'T' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 423, 423, 423, 423, 423, 423, ... 
## Addtional sampling using SMOTE
## 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    1  0.5547619  0.7333333  0.3761905
##    2  0.5663690  0.7500000  0.3523810
##    3  0.5791071  0.7508333  0.3619048
##    4  0.5711905  0.7600000  0.3571429
##    5  0.5902381  0.7966667  0.3714286
##    6  0.5651190  0.7666667  0.3142857
##    7  0.5654762  0.8033333  0.3095238
##    8  0.5636310  0.7741667  0.3523810
##    9  0.5495238  0.7766667  0.3285714
##   10  0.5601190  0.7616667  0.3333333
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(caretknn.s)

print("the data with smote sampling and normalization of continuous variables produces better model")
## [1] "the data with smote sampling and normalization of continuous variables produces better model"
## predict value and roc, auc, youden, coords
## using the best model 3
# c for caret, k for knn and s for smote

pre.c.k.s <- predict(caretknn.s, tsdaKNN, type = "prob")
roc.m3k <- roc(tsdaKNN$Risk1Yr, pre.c.k.s$T)
print("The model3 KNN with smote sampling and normalization" )
## [1] "The model3 KNN with smote sampling and normalization"
print(roc.m3k)
## 
## Call:
## roc.default(response = tsdaKNN$Risk1Yr, predictor = pre.c.k.s$T)
## 
## Data: pre.c.k.s$T in 400 controls (tsdaKNN$Risk1Yr F) < 70 cases (tsdaKNN$Risk1Yr T).
## Area under the curve: 0.8464
you.m3k <- coords(roc.m3k, x = "best" , best.method = "youden", 
                   ret = c("threshold", "specificity", 
                           "sensitivity", "accuracy"))
you.m3k
##   threshold specificity sensitivity    accuracy 
##   0.2250000   0.6050000   0.9571429   0.6574468
## creating dummy variables with model.matrix requires dropping the intercept
## potential improvement with dummy variables and a fully numeric dataset
## Scaling before applying SVM is very important.
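A sketch of the model.matrix route mentioned in the comment above (an alternative that is not used in this report): dropping the intercept column yields treatment-coded dummies, one column fewer per factor than the full encoding produced by dummyVars.

mm <- model.matrix(Risk1Yr ~ ., data = tsdaKNN)[, -1]  ## drop "(Intercept)"
dim(mm)                        ## 470 24: (levels - 1) columns per factor + 3 numeric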

print("the KNN with t-SNE, smote and normalization, Model4 KNN4")
## [1] "the KNN with t-SNE, smote and normalization, Model4 KNN4"
caretknn.s.rtsne <- train(tsdaKNN.Risk1Yr ~ ., 
                    data = rtsne.knn,  ## normalized data
                    method = "knn",
                    #distribution = "bernoulli",
                    #verbose = F,
                    metric = "ROC",
                    trControl = fitControlSmote.knn,
                    tuneGrid = knnGrid)
caretknn.s.rtsne
## k-Nearest Neighbors 
## 
## 470 samples
##   5 predictors
##   2 classes: 'F', 'T' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 423, 423, 423, 423, 423, 423, ... 
## Addtional sampling using SMOTE
## 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    1  0.5895833  0.7458333  0.4333333
##    2  0.5922024  0.7366667  0.3857143
##    3  0.5764286  0.7016667  0.4238095
##    4  0.5695238  0.6916667  0.4238095
##    5  0.5408333  0.6883333  0.3761905
##    6  0.5623214  0.6941667  0.3809524
##    7  0.5623214  0.6883333  0.4000000
##    8  0.5709524  0.6925000  0.3761905
##    9  0.5551786  0.6941667  0.3761905
##   10  0.5711905  0.6608333  0.4095238
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 2.
plot(caretknn.s.rtsne)

preKNN.rtsne <-  predict(caretknn.s.rtsne, rtsne.knn, type = "prob")
roc.m4k <-  roc(tsda$Risk1Yr, preKNN.rtsne$T)
print("The model4 KNN with tsne, smote sampling and normalization: ")
## [1] "The model4 KNN with tsne, smote sampling and normalization: "
print(roc.m4k)
## 
## Call:
## roc.default(response = tsda$Risk1Yr, predictor = preKNN.rtsne$T)
## 
## Data: preKNN.rtsne$T in 400 controls (tsda$Risk1Yr F) < 70 cases (tsda$Risk1Yr T).
## Area under the curve: 0.9304
you.m4k <- coords(roc.m4k, x = "best",best.method = "youden", 
                   ret = c("threshold", "specificity", 
                           "sensitivity", "accuracy"))
you.m4k
##   threshold specificity sensitivity    accuracy 
##   0.1250000   0.7450000   1.0000000   0.7829787
## all numeric
print("the KNN with t-SNE, smote and normalization, all numeric, Model5 KNN5")
## [1] "the KNN with t-SNE, smote and normalization, all numeric, Model5 KNN5"
caretknn.s.rtsne.num <- train(tsda.Risk1Yr ~ ., 
                    data = rtsne.knn.num,  ## normalized data
                    method = "knn",
                    #distribution = "bernoulli",
                    #verbose = F,
                    metric = "ROC",
                    trControl = fitControlSmote.knn,
                    tuneGrid = knnGrid)
caretknn.s.rtsne.num
## k-Nearest Neighbors 
## 
## 470 samples
##   5 predictors
##   2 classes: 'F', 'T' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 423, 423, 423, 423, 423, 423, ... 
## Addtional sampling using SMOTE
## 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    1  0.5801190  0.7316667  0.4285714
##    2  0.5676190  0.7200000  0.3666667
##    3  0.5958333  0.7058333  0.4523810
##    4  0.5610714  0.7016667  0.4047619
##    5  0.5908333  0.6950000  0.3904762
##    6  0.5875000  0.6991667  0.4190476
##    7  0.5778571  0.6958333  0.3952381
##    8  0.5755357  0.6925000  0.4000000
##    9  0.5607143  0.6975000  0.3333333
##   10  0.5614286  0.6808333  0.3761905
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
preKNN.rtsne.num <-  predict(caretknn.s.rtsne.num, rtsne.knn.num, type = "prob")
knnrtsne.knn.num <-  roc(tsda$Risk1Yr, preKNN.rtsne.num$T)
print("The model5 KNN with tsne, smote sampling and normalization, numeric: ")
## [1] "The model5 KNN with tsne, smote sampling and normalization, numeric: "
print(knnrtsne.knn.num)
## 
## Call:
## roc.default(response = tsda$Risk1Yr, predictor = preKNN.rtsne.num$T)
## 
## Data: preKNN.rtsne.num$T in 400 controls (tsda$Risk1Yr F) < 70 cases (tsda$Risk1Yr T).
## Area under the curve: 0.8976
print("in this case, t-sne does not numeric features as input ")
## [1] "in this case, t-sne does not numeric features as input "
## the support vector machine model 
## SVM requires that each data instance be represented as a vector of real
## numbers. Hence, if there are categorical attributes, we first have to
## convert them into numeric data.

## the tuning Grid for svmRadialCost
svmGrid <- expand.grid(C = seq(0.07, 0.115, by= 0.005))

## Parallel processing
library(doMC)
registerDoMC(cores = 6)
## train with numeric dataset
print("no smote sampling, Model1 SVM")
## [1] "no smote sampling, Model1 SVM"
svm.m1 <- train( tsda.Risk1Yr  ~ ., 
                    data = num.svm.out,  ## normalized data, scaling and num
                    method = "svmRadialCost",
                    #distribution = "bernoulli",
                    #verbose = F,
                    metric = "ROC",
                    trControl = fitControl.knn,
                    tuneGrid = svmGrid)
svm.m1
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 470 samples
##  37 predictors
##   2 classes: 'F', 'T' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 423, 423, 423, 423, 423, 423, ... 
## Resampling results across tuning parameters:
## 
##   C      ROC        Sens       Spec       
##   0.070  0.6370238  0.9983333  0.004761905
##   0.075  0.6338095  0.9983333  0.004761905
##   0.080  0.6384524  0.9966667  0.000000000
##   0.085  0.6359524  0.9966667  0.000000000
##   0.090  0.6126190  0.9958333  0.004761905
##   0.095  0.6363095  0.9958333  0.004761905
##   0.100  0.6353571  0.9966667  0.000000000
##   0.105  0.6351190  0.9991667  0.000000000
##   0.110  0.6322619  0.9950000  0.000000000
##   0.115  0.6326190  0.9966667  0.000000000
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.08.
print("small C means soft margin")
## [1] "small C means soft margin"
pre.m1s <- predict(svm.m1,num.svm.out ,type = "prob")
roc.m1s <- roc(tsda$Risk1Yr, pre.m1s$T)
print(roc.m1s)
## 
## Call:
## roc.default(response = tsda$Risk1Yr, predictor = pre.m1s$T)
## 
## Data: pre.m1s$T in 400 controls (tsda$Risk1Yr F) < 70 cases (tsda$Risk1Yr T).
## Area under the curve: 0.8805
you.m1s <- coords(roc.m1s, x = "best",best.method = "youden", 
                   ret = c("threshold", "specificity", 
                           "sensitivity", "accuracy"))
you.m1s
##   threshold specificity sensitivity    accuracy 
##   0.1403008   0.9700000   0.7285714   0.9340426
## train with numeric dataset
## with smote sampling 
print("smote sampling, Model2 SVM")
## [1] "smote sampling, Model2 SVM"
svm.m2 <- train( tsda.Risk1Yr  ~ ., 
                    data = num.svm.out,  ## normalized data, scaling and num
                    method = "svmRadialCost",
                    #distribution = "bernoulli",
                    #verbose = F,
                    metric = "ROC",
                    trControl = fitControlSmote.knn,
                    tuneGrid = svmGrid)
svm.m2
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 470 samples
##  37 predictors
##   2 classes: 'F', 'T' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 423, 423, 423, 423, 423, 423, ... 
## Addtional sampling using SMOTE
## 
## Resampling results across tuning parameters:
## 
##   C      ROC        Sens       Spec     
##   0.070  0.5750000  0.7183333  0.3761905
##   0.075  0.5572619  0.7166667  0.3952381
##   0.080  0.5651190  0.7091667  0.3571429
##   0.085  0.5908333  0.7000000  0.4047619
##   0.090  0.5650000  0.7008333  0.3809524
##   0.095  0.5650000  0.6958333  0.3714286
##   0.100  0.5751190  0.7433333  0.3428571
##   0.105  0.5807143  0.7158333  0.3857143
##   0.110  0.5817857  0.7200000  0.3619048
##   0.115  0.5729762  0.7225000  0.3619048
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.085.
print("small C means soft margin")
## [1] "small C means soft margin"
pre.m2s <- predict(svm.m2,num.svm.out ,type = "prob")
roc.m2s <- roc(tsda$Risk1Yr, pre.m2s$T)
print(roc.m2s)
## 
## Call:
## roc.default(response = tsda$Risk1Yr, predictor = pre.m2s$T)
## 
## Data: pre.m2s$T in 400 controls (tsda$Risk1Yr F) < 70 cases (tsda$Risk1Yr T).
## Area under the curve: 0.6865
you.m2s <- coords(roc.m2s, x = "best",best.method = "youden", 
                   ret = c("threshold", "specificity", 
                           "sensitivity", "accuracy"))
you.m2s
##   threshold specificity sensitivity    accuracy 
##   0.2703502   0.4750000   0.8428571   0.5297872
## the gbm tune Grid

gbmGrid <-  expand.grid(interaction.depth = c(1, 3, 5),
                        n.trees = (1:10)*500,
                        shrinkage = c(0.1, 0.05),
                        n.minobsinnode = 10)
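
## A quick check (sketch) that the grid size matches the 60 seeds drawn per resample below:
nrow(gbmGrid)                  ## 60 = 10 tree counts x 3 depths x 2 shrinkages x 1 minobsinnode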

## set seeds for parallel caret
set.seed(123)
## 10 fold 3 repeats
seeds.gbm <- vector(mode = "list", length = 31)
## 60 seeds per resample: one for each of the 60 gbm tuning combinations
for(i in 1:30) seeds.gbm[[i]]<- sample.int(n=1000, 60)
## for the final model
seeds.gbm[[31]]<-sample.int(1000, 1)

## trControl for gbm 
print("the trControl for caret 10 cross-fold 3 repeats no smote")
## [1] "the trControl for caret 10 cross-fold 3 repeats no smote"
fitControl.gbm <- trainControl(## 10 fold repeated CV
                    method = "repeatedcv",
                    number = 10,
                    repeats = 3,
                    allowParallel = TRUE,
                    classProbs = TRUE,
                    seeds = seeds.gbm,
                    summaryFunction = twoClassSummary,
                    verboseIter = FALSE)

# smote sampling for imbalance class
print("the trControl for caret 10 cross-fold 3 repeats with smote sampling")
## [1] "the trControl for caret 10 cross-fold 3 repeats with smote sampling"
fitControlSmote.gbm <- trainControl(
                         method = "repeatedcv",
                         number = 10,
                         repeats = 3,
                         allowParallel = TRUE,
                         seeds = seeds.gbm,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary,
                         verboseIter = FALSE,
                         sampling = "smote")

## Parallel processing
library(doMC)
registerDoMC(cores = 6)

## train with original dataset, no smote 
print("no smote sampling, Model1 gbm")
## [1] "no smote sampling, Model1 gbm"
gbm.m1 <- train( Risk1Yr  ~ ., 
                    data = tsda,  
                    method = "gbm",
                    distribution = "bernoulli",
                    verbose = F,
                    metric = "ROC",
                    trControl = fitControl.gbm,
                    tuneGrid = gbmGrid)
gbm.m1
## Stochastic Gradient Boosting 
## 
## 470 samples
##  16 predictors
##   2 classes: 'F', 'T' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 423, 423, 423, 423, 423, 423, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  ROC        Sens       Spec       
##   0.05       1                   500     0.5944048  0.9783333  0.004761905
##   0.05       1                  1000     0.5811905  0.9683333  0.047619048
##   0.05       1                  1500     0.5736905  0.9625000  0.047619048
##   0.05       1                  2000     0.5761905  0.9625000  0.052380952
##   0.05       1                  2500     0.5785714  0.9566667  0.052380952
##   0.05       1                  3000     0.5826190  0.9558333  0.076190476
##   0.05       1                  3500     0.5780952  0.9550000  0.066666667
##   0.05       1                  4000     0.5750000  0.9541667  0.085714286
##   0.05       1                  4500     0.5757143  0.9533333  0.095238095
##   0.05       1                  5000     0.5730952  0.9500000  0.090476190
##   0.05       3                   500     0.5978571  0.9583333  0.057142857
##   0.05       3                  1000     0.5925000  0.9358333  0.090476190
##   0.05       3                  1500     0.5992857  0.9258333  0.119047619
##   0.05       3                  2000     0.5945238  0.9241667  0.119047619
##   0.05       3                  2500     0.5976190  0.9200000  0.114285714
##   0.05       3                  3000     0.5965476  0.9158333  0.119047619
##   0.05       3                  3500     0.5976190  0.9175000  0.123809524
##   0.05       3                  4000     0.6001190  0.9133333  0.123809524
##   0.05       3                  4500     0.5986905  0.9125000  0.128571429
##   0.05       3                  5000     0.6028571  0.9091667  0.128571429
##   0.05       5                   500     0.6005952  0.9525000  0.080952381
##   0.05       5                  1000     0.6042857  0.9375000  0.090476190
##   0.05       5                  1500     0.6066667  0.9325000  0.090476190
##   0.05       5                  2000     0.6055952  0.9333333  0.095238095
##   0.05       5                  2500     0.6039286  0.9283333  0.100000000
##   0.05       5                  3000     0.6070238  0.9275000  0.119047619
##   0.05       5                  3500     0.6064286  0.9266667  0.128571429
##   0.05       5                  4000     0.6105952  0.9233333  0.128571429
##   0.05       5                  4500     0.6109524  0.9225000  0.128571429
##   0.05       5                  5000     0.6127381  0.9208333  0.128571429
##   0.10       1                   500     0.5832143  0.9675000  0.038095238
##   0.10       1                  1000     0.5755952  0.9608333  0.042857143
##   0.10       1                  1500     0.5765476  0.9558333  0.071428571
##   0.10       1                  2000     0.5707143  0.9558333  0.085714286
##   0.10       1                  2500     0.5657143  0.9475000  0.100000000
##   0.10       1                  3000     0.5639286  0.9441667  0.119047619
##   0.10       1                  3500     0.5636905  0.9416667  0.128571429
##   0.10       1                  4000     0.5670238  0.9375000  0.142857143
##   0.10       1                  4500     0.5683333  0.9375000  0.133333333
##   0.10       1                  5000     0.5708333  0.9325000  0.133333333
##   0.10       3                   500     0.5944048  0.9375000  0.100000000
##   0.10       3                  1000     0.5957143  0.9258333  0.090476190
##   0.10       3                  1500     0.6061905  0.9175000  0.114285714
##   0.10       3                  2000     0.6028571  0.9150000  0.119047619
##   0.10       3                  2500     0.5986905  0.9141667  0.128571429
##   0.10       3                  3000     0.6038095  0.9133333  0.123809524
##   0.10       3                  3500     0.6027381  0.9133333  0.133333333
##   0.10       3                  4000     0.6042857  0.9141667  0.138095238
##   0.10       3                  4500     0.6066667  0.9125000  0.142857143
##   0.10       3                  5000     0.6048810  0.9100000  0.133333333
##   0.10       5                   500     0.5972619  0.9383333  0.090476190
##   0.10       5                  1000     0.6026190  0.9283333  0.119047619
##   0.10       5                  1500     0.6052381  0.9208333  0.119047619
##   0.10       5                  2000     0.6116667  0.9191667  0.128571429
##   0.10       5                  2500     0.6129762  0.9166667  0.128571429
##   0.10       5                  3000     0.6150000  0.9183333  0.123809524
##   0.10       5                  3500     0.6160119  0.9166667  0.128571429
##   0.10       5                  4000     0.6143452  0.9175000  0.133333333
##   0.10       5                  4500     0.6150595  0.9133333  0.133333333
##   0.10       5                  5000     0.6152381  0.9125000  0.133333333
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 3500,
##  interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10.
##  roc, youden
pre.m1g <- predict(gbm.m1, tsda, type = "prob")
roc.m1g <- roc(tsda$Risk1Yr, pre.m1g$T)
print(roc.m1g)
## 
## Call:
## roc.default(response = tsda$Risk1Yr, predictor = pre.m1g$T)
## 
## Data: pre.m1g$T in 400 controls (tsda$Risk1Yr F) < 70 cases (tsda$Risk1Yr T).
## Area under the curve: 1
you.m1g <- coords(roc.m1g, x = "best",best.method = "youden", 
                   ret = c("threshold", "specificity", 
                           "sensitivity", "accuracy"))
you.m1g
##   threshold specificity sensitivity    accuracy 
##   0.5000059   1.0000000   1.0000000   1.0000000
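Note that the AUC of 1 above is a resubstitution (training-data) estimate. The honest figure is the cross-validated ROC that caret used for tuning, which can be pulled from the fit (a sketch); its gap to 1 is why the summary calls for a holdout set.

max(gbm.m1$results$ROC)        ## ~0.616, the best cross-validated AUC in the table above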
## train with original dataset, with smote sampling
print("smote sampling, Model2 gbm")
## [1] "smote sampling, Model2 gbm"
gbm.m2 <- train( Risk1Yr  ~ ., 
                    data = tsda,  
                    method = "gbm",
                    distribution = "bernoulli",
                    verbose = F,
                    metric = "ROC",
                    trControl = fitControlSmote.gbm,
                    tuneGrid = gbmGrid)
gbm.m2
## Stochastic Gradient Boosting 
## 
## 470 samples
##  16 predictors
##   2 classes: 'F', 'T' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 423, 423, 423, 423, 423, 423, ... 
## Addtional sampling using SMOTE
## 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  ROC        Sens       Spec     
##   0.05       1                   500     0.6166667  0.9033333  0.1666667
##   0.05       1                  1000     0.6070238  0.8875000  0.1761905
##   0.05       1                  1500     0.6071429  0.8775000  0.2095238
##   0.05       1                  2000     0.6090476  0.8700000  0.2190476
##   0.05       1                  2500     0.6059524  0.8600000  0.2285714
##   0.05       1                  3000     0.6063095  0.8491667  0.2285714
##   0.05       1                  3500     0.6025000  0.8491667  0.2523810
##   0.05       1                  4000     0.6029762  0.8416667  0.2571429
##   0.05       1                  4500     0.6051190  0.8408333  0.2714286
##   0.05       1                  5000     0.6080952  0.8300000  0.2619048
##   0.05       3                   500     0.6204762  0.8650000  0.2476190
##   0.05       3                  1000     0.6110714  0.8483333  0.2714286
##   0.05       3                  1500     0.6061905  0.8441667  0.2904762
##   0.05       3                  2000     0.6121429  0.8400000  0.3000000
##   0.05       3                  2500     0.6105952  0.8308333  0.3190476
##   0.05       3                  3000     0.6122619  0.8291667  0.3000000
##   0.05       3                  3500     0.6122619  0.8258333  0.3142857
##   0.05       3                  4000     0.6102381  0.8283333  0.3095238
##   0.05       3                  4500     0.6108333  0.8216667  0.3142857
##   0.05       3                  5000     0.6095238  0.8216667  0.3190476
##   0.05       5                   500     0.5932143  0.8408333  0.2476190
##   0.05       5                  1000     0.5941667  0.8225000  0.2619048
##   0.05       5                  1500     0.5958333  0.8233333  0.2857143
##   0.05       5                  2000     0.5976190  0.8183333  0.3000000
##   0.05       5                  2500     0.5978571  0.8091667  0.3095238
##   0.05       5                  3000     0.5988095  0.8091667  0.3095238
##   0.05       5                  3500     0.5984524  0.8075000  0.3190476
##   0.05       5                  4000     0.5972619  0.8075000  0.3190476
##   0.05       5                  4500     0.5973810  0.8033333  0.3142857
##   0.05       5                  5000     0.5992857  0.8050000  0.3142857
##   0.10       1                   500     0.5944048  0.8983333  0.1619048
##   0.10       1                  1000     0.5798810  0.8666667  0.2000000
##   0.10       1                  1500     0.5721429  0.8525000  0.2190476
##   0.10       1                  2000     0.5764286  0.8525000  0.2047619
##   0.10       1                  2500     0.5804762  0.8433333  0.2238095
##   0.10       1                  3000     0.5822619  0.8325000  0.2333333
##   0.10       1                  3500     0.5859524  0.8208333  0.2380952
##   0.10       1                  4000     0.5800000  0.8225000  0.2285714
##   0.10       1                  4500     0.5820238  0.8141667  0.2333333
##   0.10       1                  5000     0.5782143  0.8066667  0.2333333
##   0.10       3                   500     0.6022619  0.8383333  0.2666667
##   0.10       3                  1000     0.5995238  0.8241667  0.2809524
##   0.10       3                  1500     0.6005952  0.8241667  0.2809524
##   0.10       3                  2000     0.5978571  0.8125000  0.2809524
##   0.10       3                  2500     0.5996429  0.8116667  0.2857143
##   0.10       3                  3000     0.6030952  0.8100000  0.2809524
##   0.10       3                  3500     0.6019048  0.8050000  0.2809524
##   0.10       3                  4000     0.6025000  0.7991667  0.2857143
##   0.10       3                  4500     0.6000000  0.7966667  0.2714286
##   0.10       3                  5000     0.6008333  0.7966667  0.2761905
##   0.10       5                   500     0.6132143  0.8308333  0.2904762
##   0.10       5                  1000     0.6178571  0.8216667  0.3333333
##   0.10       5                  1500     0.6179762  0.8200000  0.3285714
##   0.10       5                  2000     0.6130952  0.8125000  0.3333333
##   0.10       5                  2500     0.6076190  0.8108333  0.3285714
##   0.10       5                  3000     0.6103571  0.8075000  0.3380952
##   0.10       5                  3500     0.6083929  0.8066667  0.3333333
##   0.10       5                  4000     0.6097619  0.8066667  0.3285714
##   0.10       5                  4500     0.6107143  0.8041667  0.3285714
##   0.10       5                  5000     0.6113690  0.8066667  0.3333333
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 500,
##  interaction.depth = 3, shrinkage = 0.05 and n.minobsinnode = 10.
##  roc, youden
pre.m2g <- predict(gbm.m2, tsda, type = "prob")
roc.m2g <- roc(tsda$Risk1Yr, pre.m2g$T)
print(roc.m2g)
## 
## Call:
## roc.default(response = tsda$Risk1Yr, predictor = pre.m2g$T)
## 
## Data: pre.m2g$T in 400 controls (tsda$Risk1Yr F) < 70 cases (tsda$Risk1Yr T).
## Area under the curve: 0.9206
you.m2g <- coords(roc.m2g, x = "best",best.method = "youden", 
                   ret = c("threshold", "specificity", 
                           "sensitivity", "accuracy"))
you.m2g
##   threshold specificity sensitivity    accuracy 
##   0.3121200   0.8275000   0.9142857   0.8404255

0.4 Feature selection

require(glmnet, quietly = TRUE)
## Loaded glmnet 2.0-13
## 
## Attaching package: 'glmnet'
## The following object is masked from 'package:pROC':
## 
##     auc
require(randomForest, quietly = T)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
require(pROC)
require(caret)
## feature selection with lasso, 10-fold CV for the minimum lambda
set.seed(666)
cv.lassotsda <- cv.glmnet(data.matrix(numx.lasso), tsda[, 17], 
    type.measure = "auc", family = "binomial")
min(cv.lassotsda$cvm)
## [1] 0.4557077
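One caveat (sketched below): with type.measure = "auc", cv.glmnet stores AUC in cvm, so the quantity of interest along the lambda path is the maximum, not the minimum printed above; lambda.min is the lambda that attains it.

max(cv.lassotsda$cvm)                             ## best cross-validated AUC
cv.lassotsda$lambda[which.max(cv.lassotsda$cvm)]  ## equals cv.lassotsda$lambda.min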
## the cross-validation curve of the lasso
plot(cv.lassotsda)

paste("the lambda min of cv.lasso is: ", cv.lassotsda$lambda.min)
## [1] "the lambda min of cv.lasso is:  0.0118639728912261"
lasso_feamin <- coef(cv.lassotsda, s = "lambda.min")
lasso_fea1 <- coef(cv.lassotsda, s = "lambda.1se")

# the features selected by lasso: alpha = 1 gives the lasso, and the
# minimal lambda is used; type.measure = "auc" applies only to 2-class
# problems, and lambda.min may give a complex model
lasso_feamin
## 38 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  1.973673e-01
## DGN.DGN1     .           
## DGN.DGN2     1.280324e-02
## DGN.DGN3    -2.811909e-01
## DGN.DGN4     .           
## DGN.DGN5     1.336389e+00
## DGN.DGN6     .           
## DGN.DGN8     1.866230e+00
## PRE4         .           
## PRE5        -1.521228e-01
## PRE6.PRZ0    .           
## PRE6.PRZ1    .           
## PRE6.PRZ2    5.003554e-02
## PRE7.F      -1.624513e-01
## PRE7.T       .           
## PRE8.F       .           
## PRE8.T       .           
## PRE9.F      -8.761517e-01
## PRE9.T       5.608992e-04
## PRE10.F     -1.309453e-01
## PRE10.T      1.442531e-13
## PRE11.F     -2.353568e-01
## PRE11.T      .           
## PRE14.OC11  -2.543296e-01
## PRE14.OC12   .           
## PRE14.OC13   5.416508e-01
## PRE14.OC14   1.106736e+00
## PRE17.F     -6.136537e-01
## PRE17.T      3.376202e-13
## PRE19.F      .           
## PRE19.T      .           
## PRE25.F      .           
## PRE25.T      .           
## PRE30.F     -5.405700e-01
## PRE30.T      .           
## PRE32.F      .           
## PRE32.T      .           
## AGE          .
# The simplest model that has comparable error to the best
# model given the uncertainty: lambda.1se
lasso_fea1
## 38 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -7.834264e-01
## DGN.DGN1     .           
## DGN.DGN2     .           
## DGN.DGN3    -1.561285e-01
## DGN.DGN4     .           
## DGN.DGN5     1.073922e+00
## DGN.DGN6     .           
## DGN.DGN8     3.613873e-01
## PRE4         .           
## PRE5        -4.732449e-02
## PRE6.PRZ0    .           
## PRE6.PRZ1    .           
## PRE6.PRZ2    .           
## PRE7.F       .           
## PRE7.T       .           
## PRE8.F       .           
## PRE8.T       .           
## PRE9.F      -5.017091e-01
## PRE9.T       2.471236e-03
## PRE10.F     -3.161768e-02
## PRE10.T      6.368082e-14
## PRE11.F     -8.939617e-02
## PRE11.T      .           
## PRE14.OC11  -1.122957e-01
## PRE14.OC12   .           
## PRE14.OC13   3.155575e-01
## PRE14.OC14   8.905196e-01
## PRE17.F     -3.622279e-01
## PRE17.T      2.444644e-13
## PRE19.F      .           
## PRE19.T      .           
## PRE25.F      .           
## PRE25.T      .           
## PRE30.F     -2.134856e-01
## PRE30.T      .           
## PRE32.F      .           
## PRE32.T      .           
## AGE          .
## the AUC of the predicted values (probabilities)
pre.la.min <- predict(cv.lassotsda, data.matrix(numx.lasso), 
    s = "lambda.min", type = "response")
pre.la.1 <- predict(cv.lassotsda, data.matrix(numx.lasso), s = "lambda.1se", 
    type = "response")
## roc of the lasso from the predicted probabilities of the
## models
roc.lamin <- roc(tsdaKNN$Risk1Yr, pre.la.min)
## Warning in roc.default(tsdaKNN$Risk1Yr, pre.la.min): Deprecated use a
## matrix as predictor. Unexpected results may be produced, please pass a
## numeric vector.
roc.la1 <- roc(tsdaKNN$Risk1Yr, pre.la.1)
## Warning in roc.default(tsdaKNN$Risk1Yr, pre.la.1): Deprecated use a matrix
## as predictor. Unexpected results may be produced, please pass a numeric
## vector.
print(roc.lamin)
## 
## Call:
## roc.default(response = tsdaKNN$Risk1Yr, predictor = pre.la.min)
## 
## Data: pre.la.min in 400 controls (tsdaKNN$Risk1Yr F) < 70 cases (tsdaKNN$Risk1Yr T).
## Area under the curve: 0.7589
print(roc.la1)
## 
## Call:
## roc.default(response = tsdaKNN$Risk1Yr, predictor = pre.la.1)
## 
## Data: pre.la.1 in 400 controls (tsdaKNN$Risk1Yr F) < 70 cases (tsdaKNN$Risk1Yr T).
## Area under the curve: 0.7477
print("the more complex model obtained at lambda.min has a better AUC")
## [1] "the more complex model obtained at lambda.min has a better AUC"
# choose the operating point on the ROC curve with a probability
# threshold, e.g. Youden's index

you_lamin <- coords(roc.lamin, x = "best", best.method = "youden", 
    ret = c("threshold", "specificity", "sensitivity", "accuracy"))
you_lamin
##   threshold specificity sensitivity    accuracy 
##   0.1569123   0.7800000   0.6142857   0.7553191
## get the non-zero features at lambda.min
lambdamin.lassofeature <- rownames(which(lasso_feamin != 0, arr.ind = T))[-1]
print("The final features from lasso.lambda.min: ")
## [1] "The final features from lasso.lambda.min: "
print(lambdamin.lassofeature)
##  [1] "DGN.DGN2"   "DGN.DGN3"   "DGN.DGN5"   "DGN.DGN8"   "PRE5"      
##  [6] "PRE6.PRZ2"  "PRE7.F"     "PRE9.F"     "PRE9.T"     "PRE10.F"   
## [11] "PRE10.T"    "PRE11.F"    "PRE14.OC11" "PRE14.OC13" "PRE14.OC14"
## [16] "PRE17.F"    "PRE17.T"    "PRE30.F"

0.5 The comparison table

## the final comparison table
lasso <- c(roc.lamin$auc, you_lamin)
knn1 <- c(roc.m1k$auc, you.m1k)
knn2 <- c(roc.m3k$auc, you.m3k)
knn3 <- c(roc.m4k$auc, you.m4k)
svm1 <- c(roc.m1s$auc, you.m1s)
svm2 <- c(roc.m2s$auc, you.m2s)
gbm1 <- c(roc.m1g$auc, you.m1g)
gbm2 <- c(roc.m2g$auc, you.m2g)
modtab <- rbind(knn1, knn2, knn3, svm1, svm2, gbm1, gbm2, lasso)
colnames(modtab)[1] <- c("AUC")
rownames(modtab) <- c("KNN", "KNN.SMOTE", "KNN.SMOTE.t-SNE", 
    "SVM", "SVM.SMOTE", "GBM", "GBM.SMOTE", "Lasso")
modtab <- round(modtab, 4)
modtab
##                    AUC threshold specificity sensitivity accuracy
## KNN             0.9269    0.1667      0.8025      1.0000   0.8319
## KNN.SMOTE       0.8464    0.2250      0.6050      0.9571   0.6574
## KNN.SMOTE.t-SNE 0.9304    0.1250      0.7450      1.0000   0.7830
## SVM             0.8805    0.1403      0.9700      0.7286   0.9340
## SVM.SMOTE       0.6865    0.2704      0.4750      0.8429   0.5298
## GBM             1.0000    0.5000      1.0000      1.0000   1.0000
## GBM.SMOTE       0.9206    0.3121      0.8275      0.9143   0.8404
## Lasso           0.7589    0.1569      0.7800      0.6143   0.7553
require(knitr, quietly = TRUE)  ## kable lives in knitr
kable(modtab, caption = "Table 1: Comparison of Different Models")
Table 1: Comparison of Different Models

                    AUC  threshold  specificity  sensitivity  accuracy
KNN              0.9269     0.1667       0.8025       1.0000    0.8319
KNN.SMOTE        0.8464     0.2250       0.6050       0.9571    0.6574
KNN.SMOTE.t-SNE  0.9304     0.1250       0.7450       1.0000    0.7830
SVM              0.8805     0.1403       0.9700       0.7286    0.9340
SVM.SMOTE        0.6865     0.2704       0.4750       0.8429    0.5298
GBM              1.0000     0.5000       1.0000       1.0000    1.0000
GBM.SMOTE        0.9206     0.3121       0.8275       0.9143    0.8404
Lasso            0.7589     0.1569       0.7800       0.6143    0.7553

0.6 The Code

0.6.1 Exploratory analysis

> require(foreign, quietly = TRUE)
> require(caret, quietly = TRUE)
> require(e1071, quietly = TRUE)
> ## Read in data with read.arff
> tsda <- read.arff('http://archive.ics.uci.edu/ml/machine-learning-databases/00277/ThoraricSurgery.arff')
> 
> ## DATA preprocess for the model
> summary(tsda)
> class(tsda)
> typeof(tsda)
> str(tsda)
> fea.tsda <- names(tsda)
> print(fea.tsda)
> 
> ## the outcome is Risk1Yr
> conti.tsda <- c("PRE4", "PRE5", "AGE")  ## PRE4, PRE5, AGE
> cate.tsda <- c("DGN", "PRE6", "PRE7", "PRE8", "PRE9",
>                "PRE10", "PRE11", "PRE14", "PRE17", "PRE19",
>                "PRE25", "PRE30", "PRE32")
> ## Categorical DGN, PRE6, PRE7, PRE8, PRE9, 
> ## PRE10, PRE11, PRE14, PRE17, PRE19, PRE25, PRE30, PRE32
> skewtest <- apply(tsda[,c(2,3,16)], 2, skewness)
> skewtest
> 
> ## correlation analysis 
> ## the featurePlot is only for continuous features
> featurePlot( x = tsda[ , c(2,3,16)],
>              y = tsda$Risk1Yr,
>              plot = "ellipse",
>              ## Add a key at the top
>             auto.key = list(columns = 2)
>              )
> 
> ## correlation of continuous variables
> require(corrplot)
> tsda.corM <- cor(tsda[,c(2,3,16)])
> corrplot(tsda.corM, method = "number", type = "upper")

0.6.2 Models

> require(caret, quietly = TRUE)
> require(pROC)
> 
> ## set seeds for parallel caret
> set.seed(123)
> ## 10 fold 3 repeats
> seeds.knn <- vector(mode = "list", length = 31)
> ## 10 seeds per resample: one for each of the 10 values of k in knn
> ## (the svmRadialCost grid also has 10 values of C)
> for(i in 1:30) seeds.knn[[i]]<- sample.int(n=1000, 10)
> ## for the final model
> seeds.knn[[31]]<-sample.int(1000, 1)
> 
> ## the trainControl objects
> ## this one sets no seeds
> ## (gbmGrid below contains 60 tuning-parameter combinations per resample)
> fitControl <- trainControl(## 10 fold repeated CV
>                     method = "repeatedcv",
>                     number = 10,
>                     repeats = 3,
>                     allowParallel = TRUE,
>                     classProbs = TRUE,
>                     # sampling = "smote",
>                     summaryFunction = twoClassSummary,
>                     verboseIter = FALSE)
> 
> # no sampling for imbalance class
> print("the trControl for caret 10 cross-fold 3 repeats no smote")
> fitControl.knn <- trainControl(## 10 fold repeated CV
>                     method = "repeatedcv",
>                     number = 10,
>                     repeats = 3,
>                     allowParallel = TRUE,
>                     classProbs = TRUE,
>                     seeds = seeds.knn,
>                     summaryFunction = twoClassSummary,
>                     verboseIter = FALSE)
> 
> # smote sampling for imbalance class
> print("the trControl for caret 10 cross-fold 3 repeats with smote sampling")
> fitControlSmote.knn <- trainControl(
>                          method = "repeatedcv",
>                          number = 10,
>                          repeats = 3,
>                          allowParallel = TRUE,
>                          seeds = seeds.knn,
>                          classProbs = TRUE,
>                          summaryFunction = twoClassSummary,
>                          verboseIter = FALSE,
>                          sampling = "smote")
> 
> ## Parallel processing
> library(doMC)
> registerDoMC(cores = 6)
> 
> ## the preprocessing for KNN
> preKNN <- preProcess(tsda, method = c("BoxCox", "center", "scale"))
> tsdaKNN <- predict(preKNN, tsda)
> 
> ## the preprocess for lasso svm and tsne
> dmy <- dummyVars(" Risk1Yr~ .", data = tsdaKNN)
> numx.lasso <- data.frame(predict(dmy, tsdaKNN))
> Risk1Yr <- rep.int(0, 470)
> Risk1Yr[tsdaKNN$Risk1Yr == "T"] <- 1
> 
> ## all numeric including the outcome
> num.svm <- data.frame(numx.lasso, Risk1Yr)
> dim(num.svm)
> 
> ## keep the outcome as a categorical variable
> num.svm.out <- data.frame(numx.lasso, tsda$Risk1Yr)
> 
> ## t-SNE transformed data for better separation in KNN and SVM
> require(Rtsne)
> rtsne.tsda <- Rtsne(tsdaKNN[,-17], dims = 5, check_duplicates = F)
> rtsne.knn <- data.frame(rtsne.tsda$Y, tsdaKNN$Risk1Yr)
> ## all numeric tsne
> rtsne.tsda.num <- Rtsne(num.svm[,-37], dims = 5, check_duplicates = F)
> rtsne.knn.num <- data.frame(rtsne.tsda.num$Y, tsda$Risk1Yr)
> 
> ## the tuning Grid for KNN
> knnGrid <- expand.grid(k = (1:10))
> 
> par(mfrow = c(1,3))
> ## caret train the KNN model no smote sampling 
> print("the KNN withOUT smote and WITH normalization, Model1 KNN1")
> caretknn <- train(Risk1Yr ~., data = tsdaKNN, 
>                    method = "knn",
>                    #distribution = "bernoulli",
>                    #verbose = F,
>                    metric = "ROC",
>                    trControl = fitControl.knn,
>                    tuneGrid = knnGrid)
> caretknn
> plot(caretknn)
> ## roc and youden
> pre.c.k <- predict(caretknn, tsdaKNN, type = "prob")
> roc.m1k <- roc(tsdaKNN$Risk1Yr, pre.c.k$T)
> print("The model1 KNN without smote sampling and with normalization" )
> print(roc.m1k)
> 
> you.m1k <- coords(roc.m1k, x = "best", 
>                    best.method = "youden", 
>                    ret = c("threshold", "specificity", 
>                            "sensitivity", "accuracy")) 
> you.m1k
> 
> ## KNN model with smote sampling 
> print("the KNN with smote and NO normalization, Model2 KNN2")
> caretknn.s.nopre <- train(Risk1Yr ~ ., data = tsda,  ## this is nonnormalized data
>                     method = "knn",
>                     #distribution = "bernoulli",
>                     #verbose = F,
>                     metric = "ROC",
>                     trControl = fitControlSmote.knn,
>                     tuneGrid = knnGrid)
> caretknn.s.nopre
> 
> #knn_fit <- train(V1 ~., data = training, method = "knn",
> #trControl=trctrl,
> #preProcess = c("center", "scale"),
> #tuneLength = 10)
> 
> print("the KNN with smote and normalization, Model3 KNN3")
> caretknn.s <- train(Risk1Yr ~ ., data = tsdaKNN,  ## normalized data
>                     method = "knn",
>                     #distribution = "bernoulli",
>                     #verbose = F,
>                     metric = "ROC",
>                     trControl = fitControlSmote.knn,
>                     tuneGrid = knnGrid)
> caretknn.s
> plot(caretknn.s)
> print("the data with smote sampling and normalization of continuous variables produces better model")
> ## predict value and roc, auc, youden, coords
> ## using the best model 3
> # c for caret, k for knn and s for smote
> 
> pre.c.k.s <- predict(caretknn.s, tsdaKNN, type = "prob")
> roc.m3k <- roc(tsdaKNN$Risk1Yr, pre.c.k.s$T)
> print("The model3 KNN with smote sampling and normalization" )
> print(roc.m3k)
> 
> you.m3k <- coords(roc.m3k, x = "best" , best.method = "youden", 
>                    ret = c("threshold", "specificity", 
>                            "sensitivity", "accuracy"))
> you.m3k
> 
> ## creating dummy variables with model.matrix requires dropping the intercept
> ## potential improvement with dummy variables and a fully numeric dataset
> ## Scaling before applying SVM is very important.
> 
> print("the KNN with t-SNE, smote and normalization, Model4 KNN4")
> caretknn.s.rtsne <- train(tsdaKNN.Risk1Yr ~ ., 
>                     data = rtsne.knn,  ## normalized data
>                     method = "knn",
>                     #distribution = "bernoulli",
>                     #verbose = F,
>                     metric = "ROC",
>                     trControl = fitControlSmote.knn,
>                     tuneGrid = knnGrid)
> caretknn.s.rtsne
> plot(caretknn.s.rtsne)
> preKNN.rtsne <-  predict(caretknn.s.rtsne, rtsne.knn, type = "prob")
> roc.m4k <-  roc(tsda$Risk1Yr, preKNN.rtsne$T)
> print("The model4 KNN with tsne, smote sampling and normalization: ")
> print(roc.m4k)
> 
> you.m4k <- coords(roc.m4k, x = "best",best.method = "youden", 
>                    ret = c("threshold", "specificity", 
>                            "sensitivity", "accuracy"))
> you.m4k
> 
> ## all numeric
> print("the KNN with t-SNE, smote and normalization, all numeric, Model5 KNN5")
> caretknn.s.rtsne.num <- train(tsda.Risk1Yr ~ ., 
>                     data = rtsne.knn.num,  ## normalized data
>                     method = "knn",
>                     #distribution = "bernoulli",
>                     #verbose = F,
>                     metric = "ROC",
>                     trControl = fitControlSmote.knn,
>                     tuneGrid = knnGrid)
> caretknn.s.rtsne.num
> preKNN.rtsne.num <-  predict(caretknn.s.rtsne.num, rtsne.knn.num, type = "prob")
> knnrtsne.knn.num <-  roc(tsda$Risk1Yr, preKNN.rtsne.num$T)
> print("The model5 KNN with tsne, smote sampling and normalization, numeric: ")
> print(knnrtsne.knn.num)
> 
> print("in this case, t-sne does not numeric features as input ")
> 
> ## the support vector machine model 
> ## SVM requires that each data instance be represented as a vector of real
> ## numbers. Hence, if there are categorical attributes, we first have to
> ## convert them into numeric data.
> 
> ## the tuning Grid for svmRadialCost
> svmGrid <- expand.grid(C = seq(0.07, 0.115, by= 0.005))
> 
> ## Parallel processing
> library(doMC)
> registerDoMC(cores = 6)
> ## train with numeric dataset
> print("no smote sampling, Model1 SVM")
> svm.m1 <- train( tsda.Risk1Yr  ~ ., 
>                     data = num.svm.out,  ## normalized data, scaling and num
>                     method = "svmRadialCost",
>                     #distribution = "bernoulli",
>                     #verbose = F,
>                     metric = "ROC",
>                     trControl = fitControl.knn,
>                     tuneGrid = svmGrid)
> svm.m1
> print("small C means soft margin")
> pre.m1s <- predict(svm.m1,num.svm.out ,type = "prob")
> roc.m1s <- roc(tsda$Risk1Yr, pre.m1s$T)
> print(roc.m1s)
> you.m1s <- coords(roc.m1s, x = "best",best.method = "youden", 
>                    ret = c("threshold", "specificity", 
>                            "sensitivity", "accuracy"))
> you.m1s
> 
> ## train with numeric dataset
> ## with smote sampling 
> print("smote sampling, Model2 SVM")
> svm.m2 <- train( tsda.Risk1Yr  ~ ., 
>                     data = num.svm.out,  ## normalized data, scaling and num
>                     method = "svmRadialCost",
>                     #distribution = "bernoulli",
>                     #verbose = F,
>                     metric = "ROC",
>                     trControl = fitControlSmote.knn,
>                     tuneGrid = svmGrid)
> svm.m2
> print("small C means soft margin")
> pre.m2s <- predict(svm.m2,num.svm.out ,type = "prob")
> roc.m2s <- roc(tsda$Risk1Yr, pre.m2s$T)
> print(roc.m2s)
> you.m2s <- coords(roc.m2s, x = "best",best.method = "youden", 
>                    ret = c("threshold", "specificity", 
>                            "sensitivity", "accuracy"))
> you.m2s
> 
> 
> ## the gbm tune Grid
> 
> gbmGrid <-  expand.grid(interaction.depth = c(1, 3, 5),
>                         n.trees = (1:10)*500,
>                         shrinkage = c(0.1, 0.05),
>                         n.minobsinnode = 10)
> 
> ## set seeds for parallel caret
> set.seed(123)
> ## 10 fold 3 repeats
> seeds.gbm <- vector(mode = "list", length = 31)
> ## 60 seeds per resample: one for each of the 60 gbm tuning combinations
> for(i in 1:30) seeds.gbm[[i]]<- sample.int(n=1000, 60)
> ## for the final model
> seeds.gbm[[31]]<-sample.int(1000, 1)
> 
> ## trControl for gbm 
> print("the trControl for caret 10 cross-fold 3 repeats no smote")
> fitControl.gbm <- trainControl(## 10 fold repeated CV
>                     method = "repeatedcv",
>                     number = 10,
>                     repeats = 3,
>                     allowParallel = TRUE,
>                     classProbs = TRUE,
>                     seeds = seeds.gbm,
>                     summaryFunction = twoClassSummary,
>                     verboseIter = FALSE)
> 
> # smote sampling for imbalance class
> print("the trControl for caret 10 cross-fold 3 repeats with smote sampling")
> fitControlSmote.gbm <- trainControl(
>                          method = "repeatedcv",
>                          number = 10,
>                          repeats = 3,
>                          allowParallel = TRUE,
>                          seeds = seeds.gbm,
>                          classProbs = TRUE,
>                          summaryFunction = twoClassSummary,
>                          verboseIter = FALSE,
>                          sampling = "smote")
> 
> ## Parallel processing
> library(doMC)
> registerDoMC(cores = 6)
> 
> ## train with original dataset, no smote 
> print("no smote sampling, Model1 gbm")
> gbm.m1 <- train( Risk1Yr  ~ ., 
>                     data = tsda,  
>                     method = "gbm",
>                     distribution = "bernoulli",
>                     verbose = F,
>                     metric = "ROC",
>                     trControl = fitControl.gbm,
>                     tuneGrid = gbmGrid)
> gbm.m1
> ##  roc, youden
> pre.m1g <- predict(gbm.m1, tsda, type = "prob")
> roc.m1g <- roc(tsda$Risk1Yr, pre.m1g$T)
> print(roc.m1g)
> you.m1g <- coords(roc.m1g, x = "best",best.method = "youden", 
>                    ret = c("threshold", "specificity", 
>                            "sensitivity", "accuracy"))
> you.m1g
> ## train with original dataset, with smote sampling
> print("smote sampling, Model2 gbm")
> gbm.m2 <- train( Risk1Yr  ~ ., 
>                     data = tsda,  
>                     method = "gbm",
>                     distribution = "bernoulli",
>                     verbose = F,
>                     metric = "ROC",
>                     trControl = fitControlSmote.gbm,
>                     tuneGrid = gbmGrid)
> gbm.m2
> ##  roc, youden
> pre.m2g <- predict(gbm.m2, tsda, type = "prob")
> roc.m2g <- roc(tsda$Risk1Yr, pre.m2g$T)
> print(roc.m2g)
> you.m2g <- coords(roc.m2g, x = "best",best.method = "youden", 
>                    ret = c("threshold", "specificity", 
>                            "sensitivity", "accuracy"))
> you.m2g

0.6.3 Lasso

> require(glmnet, quietly = TRUE)
> require(randomForest, quietly = T)
> require(pROC)
> require(caret)
> ## feature selection with lasso, 10-fold CV for the minimum lambda
> set.seed(666)
> cv.lassotsda <- cv.glmnet(data.matrix(numx.lasso), tsda[, 17], 
+     type.measure = "auc", family = "binomial")
> min(cv.lassotsda$cvm)
> ## the cross-validation curve of the lasso
> plot(cv.lassotsda)
> paste("the lambda min of cv.lasso is: ", cv.lassotsda$lambda.min)
> lasso_feamin <- coef(cv.lassotsda, s = "lambda.min")
> lasso_fea1 <- coef(cv.lassotsda, s = "lambda.1se")
> 
> # the features selected by lasso: alpha = 1 gives the lasso, and the
> # minimal lambda is used; type.measure = "auc" applies only to 2-class
> # problems, and lambda.min may give a complex model
> lasso_feamin
> # The simplest model that has comparable error to the best
> # model given the uncertainty: lambda.1se
> lasso_fea1
> 
> ## the AUC of the predicted values (probabilities)
> pre.la.min <- predict(cv.lassotsda, data.matrix(numx.lasso), 
+     s = "lambda.min", type = "response")
> pre.la.1 <- predict(cv.lassotsda, data.matrix(numx.lasso), s = "lambda.1se", 
+     type = "response")
> ## roc of the lasso from the predicted probabilities of the
> ## models
> roc.lamin <- roc(tsdaKNN$Risk1Yr, pre.la.min)
> roc.la1 <- roc(tsdaKNN$Risk1Yr, pre.la.1)
> 
> print(roc.lamin)
> print(roc.la1)
> print("the more complex model obtained at lambda.min has a better AUC")
> # choose the operating point on the ROC curve with a probability
> # threshold, e.g. Youden's index
> 
> you_lamin <- coords(roc.lamin, x = "best", best.method = "youden", 
+     ret = c("threshold", "specificity", "sensitivity", "accuracy"))
> you_lamin
> 
> ## get the non-zero features at lambda.min
> lambdamin.lassofeature <- rownames(which(lasso_feamin != 0, arr.ind = T))[-1]
> print("The final features from lasso.lambda.min: ")
> print(lambdamin.lassofeature)