Team Information

This report was prepared for ChemsRUs by Evan Hernandez (hernae5) for the team “We Don’t Zinc So”. My team members are Jessica Coco (cocoj2), Tamia Brefo (brefot), Morgan Ford (fordm2), and Jeffrey Chai (chaij).

Our team used http://codalab.idea.rpi.edu/competitions/37

Introduction

Chems-R-Us has created an entry to the challenge at https://competitions.codalab.org/competitions/22892 based on logistic regression (LR). Their entry is in the file ‘FinalProjChemsRUs.Rmd’. Based on the leaderboard results under bennek, their entry is not performing feature selection well. Chems-R-Us tried LR with feature selection based on the logistic regression coefficients, using p-values to determine importance.

The purpose of this report is to investigate alternative approaches that may help achieve high AUC scores on the testing set while correctly identifying the relevant features, as measured by balanced accuracy.

Methods Used

The first classification method that I used was randomForest. A random forest builds many decision trees, each on a bootstrap sample of the data, tries a random subset of the variables at each split, and combines the votes of the trees into a prediction. The second method I used was a support vector machine (SVM), which finds the boundary that separates the two classes with the widest margin; the points closest to that boundary are the support vectors. The first feature selection method that I used was Recursive Feature Elimination (RFE), and the second was variable importance from the caret method ‘rpart’.

Data Description:

  1. The dataset had 168 attributes.
  2. The data was divided as follows:
     a. The test data was divided independently.
     b. The training and validation data were made from a 90:10 split of the feature, class, and result data provided.
  3. The testing set had 400 data points per feature, the validation set had 105 points per feature, and the training set had 905 data points per feature.
  4. The scaling method used was the preProcess function with the scale method.

Results Using All Features

Across the board, the models fit on the training set were more accurate than the models fit on the validation set. The likely reason is that the training set holds nine times as much data as the validation set, which gives each model more examples to learn from and lets it classify the results better. The difference was not stark: about 4-8% at most. This suggests diminishing returns from additional data, so both models leave room to trade off runtime, amount of data used, and accuracy.

Results Using Feature Selection

The two feature selection methods that I used were Recursive Feature Elimination (RFE) and rpart Variable Importance. RFE used a multi-step process with k-fold cross-validation to repeatedly fit the model and evaluate which variables were the most important. rpart Variable Importance follows a similar process, and only the top 5 features it selected were used. For both classification methods, RFE performed better than all features on the training and validation sets, but did not perform as well on the challenge site or test data. rpart Variable Importance performed the worst on the training, validation, and test sets for both classifiers, which leads me to believe that it is not viable for data that needs extreme accuracy. However, rpart ran significantly faster than RFE.
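The elimination loop behind RFE can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the caret implementation used in this report: the data is simulated, the function name `rfe_sketch` is hypothetical, and features are ranked by the absolute z-statistic of a logistic regression instead of random forest importance.

```r
# Simulated data (an assumption for illustration): 6 features, only X1 and X2 carry signal
set.seed(1)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(x) <- paste0("X", 1:6)
y <- as.integer(x$X1 - x$X2 + rnorm(n) > 0)

# Hypothetical helper: drop the weakest feature until `keep` features remain
rfe_sketch <- function(x, y, keep) {
  feats <- names(x)
  while (length(feats) > keep) {
    # Refit the model on the surviving features
    fit <- glm(y ~ ., data = cbind(x[feats], y = y), family = binomial)
    # Rank features by |z|; row 1 is the intercept, so skip it
    z <- abs(coef(summary(fit))[-1, "z value"])
    # Eliminate the least important feature
    feats <- setdiff(feats, names(z)[which.min(z)])
  }
  feats
}

selected <- rfe_sketch(x, y, keep = 2)
print(selected)
```

caret's `rfe()` wraps the same idea in cross-validation and can compare several subset sizes, which is why it runs much longer than a single rpart fit.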

Results Comparison

knitr::opts_chunk$set(cache = TRUE)

# Set the correct default repository
r = getOption("repos")
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)


# These will install required packages if they are not already installed
if (!require("ggplot2")) {
   install.packages("ggplot2", dependencies = TRUE)
   library(ggplot2)
}
if (!require("knitr")) {
   install.packages("knitr", dependencies = TRUE)
   library(knitr)
}
if (!require("xtable")) {
   install.packages("xtable", dependencies = TRUE)
   library(xtable)
}
if (!require("pander")) {
   install.packages("pander", dependencies = TRUE)
   library(pander)
}

if (!require("devtools")) {
  install.packages("devtools",dependencies = TRUE ) 
  library(devtools)
}

if (!require("usethis")) {
  install.packages("usethis" ) 
  library(usethis)
}

if (!require("e1071")) {
 install.packages("e1071" ) 
  library(e1071)
}

if (!require("pROC")){
  install.packages("pROC")
   library(pROC)
} 

if (!require("dplyr")) {
   install.packages("dplyr", dependencies = TRUE)
   library(dplyr)
}

if (!require("tidyverse")) {
   install.packages("tidyverse", dependencies = TRUE)
   library(tidyverse)
}

if (!require("caret")) {
   install.packages("caret", dependencies = TRUE)
   library(caret)
}
if (!require('randomForest')){
   install.packages('randomForest')
   library(randomForest)
}
if (!require('InformationValue')){
   install.packages('InformationValue')
   library(InformationValue)
}
if (!require('DALEX')){
   install.packages('DALEX')
   library(DALEX)
}
if (!require('mlbench')){
   install.packages('mlbench')
   library(mlbench)
}
if (!require('kernlab')){
   install.packages('kernlab')
   library(kernlab)
}
knitr::opts_chunk$set(echo = TRUE)

The best method for prediction on the validation set was SVM. The best feature selection choice was all features for random forest and recursive feature elimination for SVM. The strength of random forest is that it can quickly parse through many variables to draw conclusions. This explains why feature elimination decreased random forest effectiveness across the board: there were fewer variables to split on and therefore less that the forest could learn. Therefore, unless the selection keeps a large share of the data, e.g. 50% of the features are significant, I’d advise against using feature selection with random forest. Support vector machines work differently: the boundary is determined by individual points, so outlying points affect the accuracy of the model. Recursive Feature Elimination, the better of the two selection methods, showed a 5% increase in AUC score (0.87 to 0.92) and a 2% increase in balanced accuracy (0.93 to 0.95) over the standard all-variables approach. In contrast, the numbers dropped drastically with rpart variable importance (AUC 0.74, balanced accuracy 0.85). Overall, however, I would recommend random forest with all features as my go-to method. SVM produced very accurate results on the validation set but was very inconsistent: it returned AUC scores of roughly 50% on the testing data and 70% on the training data, while random forest with all features stayed about the same throughout (±2%).

Additional Analysis

val.results <- matrix(c('rF-All',168,.81,.93,'rF-rfe',20,.62,.88,'rf-rpart',5,.61,.61,'SVM-All',168,.87,.93,'SVM-rfe',20,.92,.95,'SVM-rpart',5,.74,.85),ncol=4,byrow=TRUE)
colnames(val.results) <- c("Classification & Feature","Dimension","AUC Score",'Balanced Accuracy')
rownames(val.results) <- c('1','2','3','4','5','6')
val.results <- as.table(val.results)
val.results
  Classification & Feature Dimension AUC Score Balanced Accuracy
1 rF-All                   168       0.81      0.93             
2 rF-rfe                   20        0.62      0.88             
3 rf-rpart                 5         0.61      0.61             
4 SVM-All                  168       0.87      0.93             
5 SVM-rfe                  20        0.92      0.95             
6 SVM-rpart                5         0.74      0.85             

For my additional analysis, I provided a table with false positive and false negative rates for each method. Overall, random forest had the lower false positive rates (better true negative accuracy), and SVM had the lower false negative rates (better true positive accuracy). Also, rf-rpart had extreme difficulty correctly classifying positive data, with only a 34% true positive rate. Therefore, if one type of error is more costly than the other, i.e. you would rather have false positives than false negatives, then it is viable to select one method over the other.
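The trade-off between the two error types can be made concrete by weighting them. The sketch below copies the false positive and false negative rates from the table in this report and picks the method with the lowest weighted error. The cost weights and the helper name `best_method` are assumptions for illustration, and the calculation treats both classes as equally common.

```r
# False positive and false negative rates, copied from the table in this report
rates <- data.frame(
  method = c("rF-All", "rF-rfe", "rf-rpart", "SVM-All", "SVM-rfe", "SVM-rpart"),
  fpr = c(0.03, 0.05, 0.12, 0.10, 0.06, 0.18),
  fnr = c(0.12, 0.19, 0.66, 0.04, 0.03, 0.11)
)

# Hypothetical helper: name of the method with the lowest weighted error
best_method <- function(rates, cost_fp, cost_fn) {
  cost <- cost_fp * rates$fpr + cost_fn * rates$fnr
  rates$method[which.min(cost)]
}

fn_costly <- best_method(rates, cost_fp = 1, cost_fn = 3)  # false negatives cost more
fp_costly <- best_method(rates, cost_fp = 5, cost_fn = 1)  # false positives cost more
print(c(fn_costly, fp_costly))
```

With these illustrative weights, the first call selects SVM-rfe and the second selects rF-All.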

Challenge Prediction

My challenge ID is hernae5, with an AUC score of .82 for prediction and a balanced accuracy of .50 for feature selection. I used random forest as my final method, as it is the most consistent method and returned the highest challenge results of all the entries that I produced.

For my challenge entries, SVM consistently returned only a 50% AUC score, while random forest returned 70-80%. All feature selection methods were fairly pedestrian, each returning only about 50-53%. Therefore, all random forest methods were more effective than SVM, and random forest with all features was clearly the best choice, posting an AUC score of .82 compared to .77 for both entries with feature selection.

Conclusion

In conclusion, SVM can be a very accurate tool for classification modeling, but I prefer random forest due to its overall consistency. SVM worked best with recursive feature elimination, and random forest worked best with all features. Overall, rpart variable importance only hindered both classification methods, making it a non-viable feature selection method for the task at hand.

Project Background

The European REACH regulation requires information on ready biodegradation, which is a screening test to assess the biodegradability of chemicals. At the same time REACH encourages the use of alternatives to animal testing.

Our contract is to build a classification model to predict ready biodegradation of chemicals and to predict the classification of 400 newly-developed molecules. This will help Chems-R-Us to design new materials with desired biodegradation properties more quickly and for lower cost.

Chems-R-Us Challenge Objective

The Chems-R-Us Challenge consists of two problems:

  • Binary classification: Each data row is labeled -1 or 1. We must train a predictive model on the train dataset to be able to find (as best we can) the labels of the test dataset.
  • Feature selection: Scattered among the 168 features there are fake features. These are randomly generated variables which don’t help predicting the class. The goal of this problem is to classify features between fake (0) and real (1).
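Balanced accuracy, the score used for the feature selection problem, is the mean of the true positive rate and the true negative rate. The sketch below applies that standard definition to a made-up ground truth of 8 features; the challenge's own scoring code is not shown in this report, so the function name and data are assumptions.

```r
# Hypothetical ground truth and submission for 8 features (1 = real, 0 = fake)
truth    <- c(1, 1, 1, 0, 0, 0, 0, 1)
selected <- c(1, 1, 0, 0, 0, 1, 0, 1)

# Balanced accuracy = mean of the true positive rate and the true negative rate
balanced_accuracy <- function(truth, pred) {
  tpr <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  tnr <- sum(pred == 0 & truth == 0) / sum(truth == 0)
  (tpr + tnr) / 2
}

ba_selected <- balanced_accuracy(truth, selected)
ba_keep_all <- balanced_accuracy(truth, rep(1, length(truth)))
print(c(selected = ba_selected, keep_all = ba_keep_all))
```

Keeping every feature finds all real features but rejects no fake ones, so it always scores 0.5 under this definition.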

Chems-R-Us Training & Testing Data

  • Experimental values of 1055 chemicals were collected.
  • The training dataset consists of these 1055 chemicals; whether they were readily biodegradable (1 = yes, -1 = no); and 168 molecular descriptors.
  • Molecules and molecular descriptors are proprietary. No details are provided except cryptic names in column headers.
  • A testing set of 400 molecules with unknown biodegradability is provided

Chems-R-Us Challenge Files (Detail)

  • TRAINING DATA is divided into three files:

    • chems_train.data.csv: Training data matrix with no response labels (1018 samples x 168 feature values)
    • chems_feat.name.csv: Name of the 168 attributes (168 x 1 features names).
    • chems_train.solution.csv: Training target values (1018 lines x 1 column)
  • EXTERNAL TEST DATA is one file

    • chems_test.data.csv: Test data matrix (437 samples x 168 features values)

Reading the Data

# Prepare biodegradability data 
#get feature names 
featurenames <- read.csv("~/MATP-4400/data/chems_feat.name.csv",
                         header=FALSE, 
                         colClasses = "character")
featurenames
# get training data and rename with feature names
cdata.df <-read.csv("~/MATP-4400/data/chems_train.data.csv",
                    header=FALSE)
colnames(cdata.df) <- featurenames$V1
cdata.df
# get external testing data and rename with feature names
tdata.df <-read.csv("~/MATP-4400/data/chems_test.data.csv",
                    header=FALSE) 

colnames(tdata.df) <- featurenames$V1

class <- read.csv("~/MATP-4400/data/chems_train.solution.csv",
                  header=FALSE, 
                  colClasses = "factor") 
class <- class$V1

Preparing the Data: Create Training and Validation datasets

We split the data into 90% train and 10% validation datasets.

#ss will be the number of data points in the training set
n <- nrow(cdata.df)

ss <- ceiling(n*0.90)
# Set random seed for reproducibility
set.seed(200)
train.perm <- sample(1:n,ss)

#Split training and validation data
train <- cdata.df %>% slice(train.perm) 
validation <- cdata.df %>% slice(-train.perm) 

Next, we create a scaler to normalize the data so that features with large values do not have an outsized impact on the results.

# Fit the scaler on the training data only
scaler <- preProcess(train, method = "scale")

# Apply the same scaling to the training, validation, and test data
train <- predict(scaler, train)
validation <- predict(scaler, validation)
test <- predict(scaler, tdata.df)

# Split the output classes

classtrain <- class[train.perm]
classval <-class[-train.perm]

Fitting Data

First, we create dataframes combining the variables and the class results.

# Fit model to classify all the variables
train.df <- cbind(train,classtrain)
val.df <- cbind(validation, classval)

Then we write our methods for feature selection, both for the training and validation sets.

#Create training for rpart variable importance
set.seed(100)
rPartMod <- train(classtrain ~ ., data=train.df, method="rpart")
rpartImp <- varImp(rPartMod)

#Select most important variables, and select those from dataframe to use
variable.select <- rpartImp$importance %>%
   dplyr::select(Overall) > 0
variable.fs <- train %>%
   select_if(variable.select)
train.df.rpart <- cbind(variable.fs, classtrain)
set.seed(100)
rPartMod.val <- train(classval ~ ., data=val.df, method="rpart")
rpartImp.val <- varImp(rPartMod.val)
variable.select.val <- rpartImp.val$importance %>%
   dplyr::select(Overall) > 0
variable.fs.val <- validation %>%
   select_if(variable.select.val)
val.df.rpart <- cbind(variable.fs.val, classval)
set.seed(7)
library(mlbench)
library(caret)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(train.df[1:168], train.df$classtrain, sizes=20, rfeControl=control)

# Names of the predictors kept by RFE
sig.rfe <- predictors(results)

# all_of() selects columns named in an external character vector
train.df.rfe <- train.df %>% select(all_of(sig.rfe))
train.df.rfe <- cbind(train.df.rfe, classtrain)
set.seed(7)
library(mlbench)
library(caret)
# define the control using a random forest selection function
control.val <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results.val <- rfe(val.df[1:168], val.df$classval, sizes=20, rfeControl=control.val)

# Names of the predictors kept by RFE on the validation data
sig.rfe.val <- predictors(results.val)

# all_of() selects columns named in an external character vector
val.df.rfe <- val.df %>% select(all_of(sig.rfe.val))
val.df.rfe <- cbind(val.df.rfe, classval)

Then, we plug in our training and validation dataframes to the training algorithms for both classifications and all feature combinations. We return values for balanced accuracy, and a confusion matrix of data.
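The chunks in this section call `sensitivity_from_confmat()` and `specificity_from_confmat()`, which are not defined in the code shown here. A minimal sketch of what such helpers could look like follows; it is an assumption, not the original definitions. It assumes a 2×2 matrix with predictions in rows, actual classes in columns, and the classes ordered -1 then 1, with 1 as the positive class. `randomForest` stores `$confusion` with the actual classes in rows, so under this convention that matrix would need `t()` first.

```r
# Assumed layout: rows = predicted class, columns = actual class, order c("-1", "1")
sensitivity_from_confmat <- function(confmat) {
  # True positives / all actual positives
  confmat[2, 2] / sum(confmat[, 2])
}
specificity_from_confmat <- function(confmat) {
  # True negatives / all actual negatives
  confmat[1, 1] / sum(confmat[, 1])
}

# Toy predictions and labels (made up for illustration)
actual <- factor(c(-1, -1, -1, -1, 1, 1, 1, 1, 1, 1), levels = c(-1, 1))
pred   <- factor(c(-1, -1, -1,  1, 1, 1, 1, 1, -1, -1), levels = c(-1, 1))
confmat <- table(pred, actual, dnn = c("Prediction", "Actual"))

sens <- sensitivity_from_confmat(confmat)
spec <- specificity_from_confmat(confmat)
print((sens + spec) / 2)  # balanced accuracy
```

Fixing the factor levels explicitly keeps the row and column order stable even when one class is never predicted.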


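# Random forest using all features (training and validation)
rf <- randomForest(
   classtrain~.,
   data = train.df
)
rf.val <- randomForest(
   classval~.,
   data = val.df
)

# Random forest using the RFE-selected features
rf.rfe <- randomForest(
   classtrain~.,
   data = train.df.rfe
)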

confmat.rf.rfe <- rf.rfe$confusion
confmat.rf.rfe <- subset(confmat.rf.rfe, select = -c(class.error) )
# True Positive Rate or Sensitivity
Sensitivity.rf.rfe <- sensitivity_from_confmat(confmat.rf.rfe)
# True Negative Rate or Specificity
Specificity.rf.rfe <- specificity_from_confmat(confmat.rf.rfe)
BalancedAccuracy.rf.rfe <- (Sensitivity.rf.rfe+Specificity.rf.rfe)/2

rf.rfe.val <- randomForest(
   classval~.,
   data = val.df.rfe,
)

confmat.rf.rfe.val <- rf.rfe.val$confusion
confmat.rf.rfe.val <- subset(confmat.rf.rfe.val, select = -c(class.error) )
# True Positive Rate or Sensitivity
Sensitivity.rf.rfe.val <- sensitivity_from_confmat(confmat.rf.rfe.val)
# True Negative Rate or Specificity
Specificity.rf.rfe.val <- specificity_from_confmat(confmat.rf.rfe.val)
BalancedAccuracy.rf.rfe.val <- (Sensitivity.rf.rfe.val+Specificity.rf.rfe.val)/2
BalancedAccuracy.rf.rfe.val
[1] 0.886
# Random forest using rpart variable importance (training set)
rf.rpart <- randomForest(
   classtrain~.,
   data = train.df.rpart,
)

confmat.rf.rpart <- rf.rpart$confusion
confmat.rf.rpart <- subset(confmat.rf.rpart, select = -c(class.error) )
# True Positive Rate or Sensitivity
Sensitivity.rf.rpart <- sensitivity_from_confmat(confmat.rf.rpart)
# True Negative Rate or Specificity
Specificity.rf.rpart <- specificity_from_confmat(confmat.rf.rpart)
BalancedAccuracy.rf.rpart <- (Sensitivity.rf.rpart+Specificity.rf.rpart)/2
# Random forest using rpart variable importance (validation set)
rf.rpart.val <- randomForest(
   classval~.,
   data = val.df.rpart,
)

confmat.rf.rpart.val <- rf.rpart.val$confusion
confmat.rf.rpart.val <- subset(confmat.rf.rpart.val, select = -c(class.error) )
# True Positive Rate or Sensitivity
Sensitivity.rf.rpart.val <- sensitivity_from_confmat(confmat.rf.rpart.val)
# True Negative Rate or Specificity
Specificity.rf.rpart.val <- specificity_from_confmat(confmat.rf.rpart.val)
BalancedAccuracy.rf.rpart.val <- (Sensitivity.rf.rpart.val+Specificity.rf.rpart.val)/2
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm <- train(classtrain~., data=train.df, method="svmRadial", metric=metric, trControl=control)
pred.svm <- predict(fit.svm, train.df)


confmat.svm <- table(pred.svm, train.df$classtrain, dnn=c("Prediction", "Actual"))   
# True Positive Rate or Sensitivity
Sensitivity.svm <- sensitivity_from_confmat(confmat.svm)
# True Negative Rate or Specificity
Specificity.svm <- specificity_from_confmat(confmat.svm)
BalancedAccuracy.svm <- (Sensitivity.svm+Specificity.svm)/2
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.val <- train(classval~., data=val.df, method="svmRadial", metric=metric, trControl=control)
Variable(s) `' constant. Cannot scale data.
pred.svm.val <- predict(fit.svm, val.df)


confmat.svm.val <- table(pred.svm.val, val.df$classval, dnn=c("Prediction", "Actual"))   
# True Positive Rate or Sensitivity
Sensitivity.svm.val <- sensitivity_from_confmat(confmat.svm.val)
# True Negative Rate or Specificity
Specificity.svm.val <- specificity_from_confmat(confmat.svm.val)
BalancedAccuracy.svm.val <- (Sensitivity.svm.val+Specificity.svm.val)/2
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.rfe <- train(classtrain~., data=train.df.rfe, method="svmRadial", metric=metric, trControl=control)
pred.svm.rfe <- predict(fit.svm.rfe, train.df.rfe)


confmat.svm.rfe <- table(pred.svm.rfe, train.df$classtrain, dnn=c("Prediction", "Actual"))  
# True Positive Rate or Sensitivity
Sensitivity.svm.rfe <- sensitivity_from_confmat(confmat.svm.rfe)
# True Negative Rate or Specificity
Specificity.svm.rfe <- specificity_from_confmat(confmat.svm.rfe)
BalancedAccuracy.svm.rfe <- (Sensitivity.svm.rfe+Specificity.svm.rfe)/2
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.rfe.val <- train(classval~., data=val.df.rfe, method="svmRadial", metric=metric, trControl=control)
pred.svm.rfe.val <- predict(fit.svm.rfe.val, val.df.rfe)


confmat.svm.rfe.val <- table(pred.svm.rfe.val, val.df$classval, dnn=c("Prediction", "Actual"))  
# True Positive Rate or Sensitivity
Sensitivity.svm.rfe.val <- sensitivity_from_confmat(confmat.svm.rfe.val)
# True Negative Rate or Specificity
Specificity.svm.rfe.val <- specificity_from_confmat(confmat.svm.rfe.val)
BalancedAccuracy.svm.rfe.val <- (Sensitivity.svm.rfe.val+Specificity.svm.rfe.val)/2
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.rpart <- train(classtrain~., data=train.df.rpart, method="svmRadial", metric=metric, trControl=control)
pred.svm.rpart <- predict(fit.svm.rpart, train.df.rpart)

confmat.svm.rpart <- table(pred.svm.rpart, train.df$classtrain, dnn=c("Prediction", "Actual"))
# True Positive Rate or Sensitivity
Sensitivity.svm.rpart <- sensitivity_from_confmat(confmat.svm.rpart)
# True Negative Rate or Specificity
Specificity.svm.rpart <- specificity_from_confmat(confmat.svm.rpart)
BalancedAccuracy.svm.rpart <- (Sensitivity.svm.rpart+Specificity.svm.rpart)/2
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.rpart.val <- train(classval~., data=val.df.rpart, method="svmRadial", metric=metric, trControl=control)
pred.svm.rpart.val <- predict(fit.svm.rpart.val, val.df.rpart)

confmat.svm.rpart.val <- table(pred.svm.rpart.val, val.df$classval, dnn=c("Prediction", "Actual"))
# True Positive Rate or Sensitivity
Sensitivity.svm.rpart.val <- sensitivity_from_confmat(confmat.svm.rpart.val)
# True Negative Rate or Specificity
Specificity.svm.rpart.val <- specificity_from_confmat(confmat.svm.rpart.val)
BalancedAccuracy.svm.rpart.val <- (Sensitivity.svm.rpart.val+Specificity.svm.rpart.val)/2

ROC Curves

roc.data <- data.frame("Class" = classval, 
                       "No Selection" = pred.svm.val,
                       "With Selection" = pred.svm.rfe.val)
roc.data$Class <- as.numeric(roc.data$Class)
roc.data$No.Selection <- as.numeric(roc.data$No.Selection)
roc.data$With.Selection <- as.numeric(roc.data$With.Selection)
roc.list <- roc(Class ~., 
                data = roc.data)
no.selection.auc <- round(auc(Class ~ No.Selection, data = roc.data), digits = 3)
with.selection.auc <- round(auc(Class ~ With.Selection, data = roc.data), digits = 3)
roc_plot <- ggroc(roc.list) + 
   ggtitle("ROC Curves (Validation Set)", subtitle = "SVM without Feature Selection versus SVM with RFE") + 
   scale_color_discrete(name = "Model", 
                        labels = c(paste("No Selection\nAUC :", no.selection.auc), paste("With Selection\nAUC :", with.selection.auc)))
roc_plot

roc.data <- data.frame("Class" = classval, 
                       "No Selection" = pred.svm.val,
                       "With Selection" = pred.svm.rpart.val)
roc.data$Class <- as.numeric(roc.data$Class)
roc.data$No.Selection <- as.numeric(roc.data$No.Selection)
roc.data$With.Selection <- as.numeric(roc.data$With.Selection)
roc.list <- roc(Class ~., 
                data = roc.data)
no.selection.auc <- round(auc(Class ~ No.Selection, data = roc.data), digits = 3)
with.selection.auc <- round(auc(Class ~ With.Selection, data = roc.data), digits = 3)
roc_plot <- ggroc(roc.list) + 
   ggtitle("ROC Curves (Validation Set)", subtitle = "SVM without Feature Selection versus SVM with rpart Variable Importance") + 
   scale_color_discrete(name = "Model", 
                        labels = c(paste("No Selection\nAUC :", no.selection.auc), paste("With Selection\nAUC :", with.selection.auc)))
roc_plot

pred.rf <- predict(rf.val, validation, type = 'response')
pred.rf.rfe <- predict(rf.rfe.val, validation, type = 'response')
pred.rf.rpart <- predict(rf.rpart.val, validation, type = 'response')
roc.data <- data.frame("Class" = classval, 
                       "No Selection" = pred.rf,
                       "With Selection" = pred.rf.rfe)
roc.data$Class <- as.numeric(roc.data$Class)
roc.data$No.Selection <- as.numeric(roc.data$No.Selection)
roc.data$With.Selection <- as.numeric(roc.data$With.Selection)
roc.list <- roc(Class ~., 
                data = roc.data)
no.selection.auc <- round(auc(Class ~ No.Selection, data = roc.data), digits = 3)
with.selection.auc <- round(auc(Class ~ With.Selection, data = roc.data), digits = 3)
roc_plot <- ggroc(roc.list) + 
   ggtitle("ROC Curves (Validation Set)", subtitle = "Random Forest without Feature Selection versus Random Forest with RFE") + 
   scale_color_discrete(name = "Model", 
                        labels = c(paste("No Selection\nAUC :", no.selection.auc), paste("With Selection\nAUC :", with.selection.auc)))
roc_plot

pred.rf <- predict(rf.val, validation, type = 'response')
pred.rf.rfe <- predict(rf.rfe.val, validation, type = 'response')
pred.rf.rpart <- predict(rf.rpart.val, validation, type = 'response')
roc.data <- data.frame("Class" = classval, 
                       "No Selection" = pred.rf,
                       "With Selection" = pred.rf.rpart)
roc.data$Class <- as.numeric(roc.data$Class)
roc.data$No.Selection <- as.numeric(roc.data$No.Selection)
roc.data$With.Selection <- as.numeric(roc.data$With.Selection)
roc.list <- roc(Class ~., 
                data = roc.data)
no.selection.auc <- round(auc(Class ~ No.Selection, data = roc.data), digits = 3)
with.selection.auc <- round(auc(Class ~ With.Selection, data = roc.data), digits = 3)
roc_plot <- ggroc(roc.list) + 
   ggtitle("ROC Curves (Validation Set)", subtitle = "Random Forest without Feature Selection versus Random Forest with rpart Variable Importance") + 
   scale_color_discrete(name = "Model", 
                        labels = c(paste("No Selection\nAUC :", no.selection.auc), paste("With Selection\nAUC :", with.selection.auc)))
roc_plot

Interpreting the ROC Results

The validation results show that randomForest with feature selection generalizes slightly worse (lower accuracy on the validation set) than our original model using all the variables.

In contrast, SVM with RFE selection generalizes better on the data than SVM with all features.
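The ROC code above passes predicted class labels, converted to numbers, as the predictor. AUC is usually computed from a continuous score such as a class probability, because hard labels give only a single threshold. The sketch below uses the rank-based (Mann-Whitney) formula for AUC in base R to show the difference on made-up data; `auc_rank` is a hypothetical helper, not part of pROC.

```r
# Rank-based AUC: probability that a random positive outranks a random negative
auc_rank <- function(labels, scores) {
  r <- rank(scores)                # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels (1 = positive) and model scores
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.60, 0.35, 0.80)

auc_scores <- auc_rank(labels, scores)                    # continuous scores
auc_labels <- auc_rank(labels, as.numeric(scores > 0.5))  # hard 0/1 predictions
print(c(scores = auc_scores, labels = auc_labels))
```

On this toy data the continuous scores give an AUC of 0.75, while thresholding them into labels first gives 0.5, so the two inputs can rank the same model quite differently.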

Competition Entry: Saving & Uploading Your Predictions

The following code creates a valid entry for the contest:

  • Predict the test data and put the ranking in ranking_rf
    • The ranking can be any number leading to a classification like log odds.
    • This means values greater than 0 mean class 1 and values less than 0 mean class -1.
    • The results will be ranked by AUC.
  • Then write the results to the CSV file named classification.csv (You must use this filename)
  • NOTE: You may need to execute this code chunk (and the chunks above it) individually for write.table() to work.
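One way to obtain a ranking that follows this sign convention is to convert class probabilities to log-odds. This is a minimal sketch with made-up probabilities: `p_class1` is a placeholder for the predicted probability of class 1 from whichever model is used, and the file is written to a temporary directory here, not to the working directory.

```r
# Placeholder probabilities of class 1 (made up for illustration)
p_class1 <- c(0.90, 0.20, 0.65, 0.40)

# Log-odds: positive when p > 0.5 (class 1), negative when p < 0.5 (class -1)
ranking <- qlogis(p_class1)   # same as log(p / (1 - p))

# Write one value per line, without row or column names
out_file <- file.path(tempdir(), "classification.csv")
write.table(ranking, file = out_file, row.names = FALSE, col.names = FALSE)
```

Because AUC depends only on the order of the values, any monotone transformation of the probabilities would score the same; log-odds simply puts the class boundary at 0.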
ss.analysis <- matrix(c('rF-All',.03,.12,'rF-rfe',.05,.19,'rf-rpart',.12,.66,'SVM-All',.10,.04,'SVM-rfe',.06,.03,'SVM-rpart',.18,.11),ncol=3,byrow=TRUE)
colnames(ss.analysis) <- c("Classification & Feature", 'False Positive Rate', 'False Negative Rate')
rownames(ss.analysis) <- c('1','2','3','4','5','6')
ss.analysis <- as.table(ss.analysis)
ss.analysis
  Classification & Feature False Positive Rate False Negative Rate
1 rF-All                   0.03                0.12               
2 rF-rfe                   0.05                0.19               
3 rf-rpart                 0.12                0.66               
4 SVM-All                  0.1                 0.04               
5 SVM-rfe                  0.06                0.03               
6 SVM-rpart                0.18                0.11               

Storing Feature Selection Results

  • Store your prediction for the features.
    • This should be binary, where 1 means keep the feature and 0 means don’t keep feature.
    • The results will be ranked by balanced accuracy.
  • Then write the results into the CSV file named selection.csv (You must use this filename)
  • NOTE: You may need to execute this code chunk (and the chunks above it) individually for write.table() to work.
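The binary vector can be built directly from the names of the selected features. A minimal sketch with hypothetical feature names follows; in this report the full list would come from `featurenames$V1` and, for RFE, the selected names from `predictors(results)`. The output path is a temporary directory for illustration.

```r
# Hypothetical feature names and a hypothetical set of selected features
all_features <- c("X1", "X2", "X3", "X4", "X5")
selected     <- c("X2", "X5")

# 1 = keep the feature, 0 = drop it, in the original column order
features <- as.integer(all_features %in% selected)

# Write one 0/1 value per line, without row or column names
out_file <- file.path(tempdir(), "selection.csv")
write.table(features, file = out_file, row.names = FALSE, col.names = FALSE)
```

Matching by name keeps the vector aligned with the column order of the data even if the selection method returns features in order of importance.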
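# Predict the test data (predict() returns class labels here, not log-odds)
ranking_rf <- predict(fit.svm, test)
# as.numeric() maps the factor levels to the integer codes 1 and 2
ranking_rf <- as.numeric(ranking_rf)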

# no need to convert to 0 and 1 since ranking needed for AUC.
write.table(ranking_rf,file = "classification.csv", row.names=F, col.names=F)

Zipping and Submitting Your Results to the Challenge

  • Zip your classification.csv and selection.csv files – we must use these exact names! – into a single archive to generate the file MyEntry.csv.zip to enter the contest.
  • The name of your zip file is not important, but should not include spaces or characters like ( etc. The following code creates a zip filename that will always be unique.
  • NOTE:
    • You may need to execute this code chunk (and the chunks above it) individually for system() to work.
    • This code creates a zip with a filename based on time that will always be unique. This will result in many zips accumulating in your working directory!
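The zipping step described above could look like the following sketch. The function names and the time-based filename pattern are assumptions, and `utils::zip()` relies on a zip program being installed on the system.

```r
# Build a unique archive name from the current time (the format is an assumption)
entry_zip_name <- function(time = Sys.time()) {
  paste0("MyEntry_", format(time, "%H%M%S"), ".csv.zip")
}

# Zip the two required files found in `dir` and return the archive name
make_entry <- function(dir = ".") {
  old <- setwd(dir)
  on.exit(setwd(old))
  zip_name <- entry_zip_name()
  utils::zip(zip_name, files = c("classification.csv", "selection.csv"))
  zip_name
}

zip_name <- entry_zip_name()
print(zip_name)
```

Calling `make_entry()` from the directory that holds classification.csv and selection.csv creates the archive; each call made at a different second produces a new file.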
# Here is the mean prediction file for submission to the website 
# features should be a column vector of 0's and 1's. 
# 1 = keep feature, 0 = don't
features<-matrix(0,nrow=(ncol(train)),ncol=1)
# Set the ones we want to keep to 1
features[variable.select] <- 1
write.table(features,file = "selection.csv", row.names=F, col.names=F)
---
title: 'MATP-4400 Final Project Notebook (2021)'
subtitle: 'Predicting Biodegradability Challenge'
author: "Evan Hernandez"
date: "May 2021"
output:
  html_notebook:
    theme: united
    toc: yes
  html_document:
    df_print: paged
    header-includes: \usepackage{color}
    toc: yes
  pdf_document:
    toc: yes
---

```{r, include=FALSE}
# Set the random seed for reproducibility
set.seed(20)
knitr::opts_chunk$set(cache = TRUE)

# Set the correct default repository
r = getOption("repos")
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)


# These will install required packages if they are not already installed
if (!require("ggplot2")) {
   install.packages("ggplot2", dependencies = TRUE)
   library(ggplot2)
}
if (!require("knitr")) {
   install.packages("knitr", dependencies = TRUE)
   library(knitr)
}
if (!require("xtable")) {
   install.packages("xtable", dependencies = TRUE)
   library(xtable)
}
if (!require("pander")) {
   install.packages("pander", dependencies = TRUE)
   library(pander)
}

if (!require("devtools")) {
  install.packages("devtools",dependencies = TRUE ) 
  library(devtools)
}

if (!require("usethis")) {
  install.packages("usethis" ) 
  library(usethis)
}

if (!require("e1071")) {
 install.packages("e1071" ) 
  library(e1071)
}

if (!require("pROC")){
  install.packages("pROC")
   library(pROC)
} 

if (!require("dplyr")) {
   install.packages("dplyr", dependencies = TRUE)
   library(dplyr)
}

if (!require("tidyverse")) {
   install.packages("tidyverse", dependencies = TRUE)
   library(tidyverse)
}

if (!require("caret")) {
   install.packages("caret", dependencies = TRUE)
   library(caret)
}
if (!require('randomForest')){
   install.packages('randomForest')
   library(randomForest)
}
if (!require('InformationValue')){
   install.packages('InformationValue')
   library(InformationValue)
}
if (!require('DALEX')){
   install.packages('DALEX')
   library(DALEX)
}
if (!require('mlbench')){
   install.packages('mlbench')
   library(mlbench)
}
if (!require('kernlab')){
   install.packages('kernlab')
   library(kernlab)
}
knitr::opts_chunk$set(echo = TRUE)
```


# Team Information

This report was prepared for ChemsRUs by  Evan Hernandez, hernae5
for the team "We Don't Zinc So"
My team members are: Jessica Coco, Tamia Brefo, Morgan Ford, and Jeffrey Chai  cocoj2, brefot, fordm2, chaij 

Our team used http://codalab.idea.rpi.edu/competitions/37
   

# Introduction

Chems-R-Us has created an entry to the challenge at https://competitions.codalab.org/competitions/22892 based on logistic regression (LR). Their entry is in the file 'FinalProjChemsRUs.Rmd'. Based on the information in the leaderboard under bennek, their entry is not performing feature selection well. Chems-R-Us used LR and selected features from the logistic regression coefficients, using p-values to determine importance.

The purpose of this report is to investigate alternative approaches that may help achieve high AUC scores on the testing set while correctly identifying the relevant features, as measured by balanced accuracy.


# Methods Used

The first classification method that I used was random forest, via the `randomForest` package. A random forest builds many decision trees, each on a bootstrap sample of the training data and with a random subset of the variables considered at each split, and classifies a point by majority vote across the trees. The second method I used was a support vector machine (SVM) with a radial kernel (caret method 'svmRadial'), which finds the boundary that separates the two classes with the largest margin; the training points closest to that boundary are the support vectors. The first feature selection method that I used was Recursive Feature Elimination (RFE), and the second was variable importance using the caret method 'rpart'.


# Data Description:

1. The dataset had 168 attributes.
2. The test data were provided separately. The training and validation data were made from a 90:10 split of the feature, class, and result data provided.
3. The testing set had 400 data points per feature, the validation set had 105 data points per feature, and the training set had 905 data points per feature.
4. The scaling method used was the `preProcess` function with the `scale` method.


# Results Using All Features

Across the board, the models fit on the training data were more accurate than those fit on the validation data. The training set had nine times as many observations as the validation set, so the models had more information to learn from and could better classify the results. The difference was not stark, only about 4-8% at most. This suggests diminishing returns from additional data, so there is room to trade off runtime, data used, and accuracy for both models.


# Results Using Feature Selection

The two feature selection methods that I used were Recursive Feature Elimination (RFE) and rpart Variable Importance. RFE used a multi-step process with 10-fold cross-validation to repeatedly fit the model and evaluate which variables were the most determining; 20 features were kept. rpart Variable Importance uses a similar process, and only the top 5 features it selected were used. RFE for both classification methods performed better than all features on the training and validation sets, but did not perform as well on the challenge site or test data. rpart Variable Importance performed the worst on the training, validation, and test sets for both classifiers, which leads me to believe that it is not very viable for data that needs extreme accuracy. However, rpart ran significantly faster than RFE.


# Results Comparison

```{r}
val.results <- matrix(c('rF-All',168,.81,.93,'rF-rfe',20,.62,.88,'rf-rpart',5,.61,.61,'SVM-All',168,.87,.93,'SVM-rfe',20,.92,.95,'SVM-rpart',5,.74,.85),ncol=4,byrow=TRUE)
colnames(val.results) <- c("Classification & Feature","Dimension","AUC Score",'Balanced Accuracy')
rownames(val.results) <- c('1','2','3','4','5','6')
val.results <- as.table(val.results)
val.results
```

The best method for prediction overall was SVM. The best feature selection choices were all features for random forest and recursive feature elimination for SVM. The strength of random forest is its ability to quickly work through many variables to draw conclusions. It therefore makes sense that feature elimination decreased random forest effectiveness across the board: there were fewer variables to split on and therefore less for the forest to learn from. Unless the selection keeps a large fraction of the variables (for example, when about 50% of them are significant), I would advise against using feature selection with random forest.

Support vector machines work differently: they use a point-based method, so outlying points affect the accuracy of the model. Recursive Feature Elimination, the better of the two selection methods, showed a 5-point increase in AUC score (.87 to .92) and a 2-point increase in balanced accuracy (.93 to .95) over the all-variables approach. In contrast, the numbers dropped drastically with rpart variable importance.

Overall, however, I would recommend random forest with all features as my go-to method. SVM produced very accurate results on the validation set but was very inconsistent: its AUC scores were roughly 50% on the testing data and 70% on the training data, while random forest with all features stayed within about ±2% throughout.


# Additional Analysis 
```{r}
ss.analysis <- matrix(c('rF-All',.03,.12,'rF-rfe',.05,.19,'rf-rpart',.12,.66,'SVM-All',.10,.04,'SVM-rfe',.06,.03,'SVM-rpart',.18,.11),ncol=3,byrow=TRUE)
colnames(ss.analysis) <- c("Classification & Feature", 'False Positive Rate', 'False Negative Rate')
rownames(ss.analysis) <- c('1','2','3','4','5','6')
ss.analysis <- as.table(ss.analysis)
ss.analysis
```

For my additional analysis, I provide a table of false positive and false negative rates for each method. By these rates, random forest had the lower false positive rates (better true negative accuracy), and SVM had the lower false negative rates (better true positive accuracy). Also, rf-rpart had extreme difficulty correctly classifying positive data, with a false negative rate of .66, i.e. only a 34% true positive rate. Therefore, if one type of error is more costly than the other, i.e. you would rather have false positives than false negatives, it is reasonable to select one method over the other.
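
The rates in the table relate to sensitivity and specificity as sketched below. This is a minimal illustration in base R, not the code used for the report: the confusion matrix counts are made up, the helper name `rates_from_confmat` is hypothetical, and it assumes rows hold predictions, columns hold actual classes, and "1" is the positive class.

```r
# Hypothetical confusion matrix: rows are predictions, columns are actual
# classes, with "1" as the positive class (counts are made up)
confmat <- matrix(c(40, 5,
                    10, 45),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Prediction = c("-1", "1"),
                                  Actual = c("-1", "1")))

rates_from_confmat <- function(cm, positive = "1", negative = "-1") {
  tp <- cm[positive, positive]
  tn <- cm[negative, negative]
  fp <- cm[positive, negative]   # predicted positive, actually negative
  fn <- cm[negative, positive]   # predicted negative, actually positive
  sensitivity <- tp / (tp + fn)  # true positive rate
  specificity <- tn / (tn + fp)  # true negative rate
  c(sensitivity = sensitivity,
    specificity = specificity,
    false_positive_rate = 1 - specificity,
    false_negative_rate = 1 - sensitivity,
    balanced_accuracy = (sensitivity + specificity) / 2)
}

rates_from_confmat(confmat)
```

Note that the orientation matters: a matrix with actual classes in the rows would need the `fp` and `fn` lookups swapped.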


# Challenge Prediction

My challenge ID is hernae5, with an AUC score of .82 for prediction and a balanced accuracy of .50 for feature selection. I used random forest as my final selection method, as it was the most consistent method and returned the highest results in the challenge of all the entries that I produced.

For my challenge entries, SVM consistently returned an AUC score of only 50%, while random forest returned 70-80%. All feature selection methods were fairly pedestrian, returning only about 50-53% for each entry's feature selection score. Therefore, all random forest methods were more effective than SVM, and random forest with all features was clearly the best choice, posting an AUC score of .82 compared to .77 for both entries with feature selection.


# Conclusion

In conclusion, SVM can be a very accurate tool for classification modeling, but I prefer random forest due to its overall consistency. SVM worked best with recursive feature elimination, and random forest with all features. Overall, rpart variable importance only hindered both classification methods, making it a non-viable feature selection method for the task at hand.


## Project Background

The European REACH regulation requires information on ready biodegradation, which is a screening test to assess the biodegradability of chemicals. At the same time REACH encourages the use of alternatives to animal testing. 

Our contract is to build a classification model to predict ready biodegradation of chemicals and to predict the classification of 400 newly-developed molecules. This will help Chems-R-Us to design new materials with desired biodegradation properties more quickly and for lower cost.


## Chems-R-Us Challenge Objective

The Chems-R-Us Challenge consists of two problems:

   * Binary classification: Each data row is labeled -1 or 1. We must train a predictive model on the train dataset to be able to find (as best we can) the labels of the test dataset.
   * Feature selection: Scattered among the 168 features there are fake features. These are randomly generated variables that do not help predict the class. The goal of this problem is to classify features as fake (0) or real (1).


## Chems-R-Us Training & Testing Data

   * Experimental values of 1055 chemicals were collected. 
   * The training dataset consists of these 1055 chemicals; whether they were readily biodegradable (1= yes , -1 = no); and 168 molecular descriptors.  
   * Molecules and Molecular descriptors are proprietary.  No details are provided except _cryptic names_ in column headers
   * A testing set of 400 molecules with unknown biodegradability is provided


## Chems-R-Us Challenge Files (Detail)

* TRAINING DATA is divided into three files: 

   * `chems_train.data.csv`: Training data matrix with no response labels (1018 samples x 168 feature values)
   * `chems_feat.name.csv`: Name of the 168 attributes (168 x 1 features names).
   * `chems_train.solution.csv`: Training target values (1018 lines x 1 column)

* EXTERNAL TEST DATA is one file

   * `chems_test.data.csv`: Test data matrix (437 samples x 168 features values)


### Reading the Data

```{r data.read}
# Prepare biodegradability data 
#get feature names 
featurenames <- read.csv("~/MATP-4400/data/chems_feat.name.csv",
                         header=FALSE, 
                         colClasses = "character")
featurenames
# get training data and rename with feature names
cdata.df <-read.csv("~/MATP-4400/data/chems_train.data.csv",
                    header=FALSE)
colnames(cdata.df) <- featurenames$V1
cdata.df
# get external testing data and rename with feature names
tdata.df <-read.csv("~/MATP-4400/data/chems_test.data.csv",
                    header=FALSE) 

colnames(tdata.df) <- featurenames$V1

class <- read.csv("~/MATP-4400/data/chems_train.solution.csv",
                  header=FALSE, 
                  colClasses = "factor") 
class <- class$V1
```

### Preparing the Data: Create Training and Validation datasets

We split the data into **90% train** and **10% validation** datasets.

```{r data.split}
#ss will be the number of data points in the training set
n <- nrow(cdata.df)

ss <- ceiling(n*0.90)
# Set random seed for reproducibility
set.seed(200)
train.perm <- sample(1:n,ss)

#Split training and validation data
train <- cdata.df %>% slice(train.perm) 
validation <- cdata.df %>% slice(-train.perm) 
```
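
The arithmetic of the split can be checked in isolation. The sketch below assumes the 1018 training rows given in the challenge file listing; in the report itself the value comes from `nrow(cdata.df)`, and the names `n_val`, `train_idx`, and `val_idx` are illustrative only.

```r
# Sketch: sizes produced by a 90:10 split.
# n = 1018 is an assumption taken from the challenge file listing;
# substitute nrow(cdata.df) when the data are loaded.
n <- 1018
ss <- ceiling(n * 0.90)        # number of training rows
n_val <- n - ss                # number of validation rows

set.seed(200)
train_idx <- sample(seq_len(n), ss)
val_idx <- setdiff(seq_len(n), train_idx)

# The two index sets are disjoint and together cover every row
stopifnot(length(intersect(train_idx, val_idx)) == 0,
          length(train_idx) + length(val_idx) == n)
c(train = ss, validation = n_val)
```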

Next, we create a scaler to normalize the data so that features measured on very different scales do not dominate the results.

```{r scaler}
# Fit the scaler on the training data only, so that no information from
# the validation or test sets leaks into the preprocessing step
scaler <- preProcess(train, method = "scale")

# Apply the same training-set scaling to all three datasets
train <- predict(scaler, train)
validation <- predict(scaler, validation)
test <- predict(scaler, tdata.df)

# Split the output classes
classtrain <- class[train.perm]
classval <- class[-train.perm]


```
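
To make the scaling step concrete, here is a base R sketch of what it does, under the assumption that caret's `preProcess(method = "scale")` divides each column by its standard deviation without centering. The toy data and the helper `scale_with` are hypothetical; the point is that the standard deviations are estimated on the training rows and then reused for the held-out rows.

```r
# Toy data standing in for the molecular descriptors (hypothetical values)
set.seed(1)
toy <- data.frame(x1 = rnorm(20, mean = 5, sd = 10),
                  x2 = rnorm(20, mean = 0, sd = 0.1))
toy_train <- toy[1:15, ]
toy_val   <- toy[16:20, ]

# Standard deviations are estimated from the training rows only
train_sd <- vapply(toy_train, sd, numeric(1))

# Divide each column by its training standard deviation (no centering)
scale_with <- function(df, sds) {
  as.data.frame(sweep(as.matrix(df), 2, sds, "/"))
}
toy_train_scaled <- scale_with(toy_train, train_sd)
toy_val_scaled   <- scale_with(toy_val, train_sd)

# After scaling, every training column has unit standard deviation
vapply(toy_train_scaled, sd, numeric(1))
```

The validation columns will generally not have a standard deviation of exactly 1, which is expected: they are scaled with the training statistics.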

### Fitting Data

First, we create dataframes combining the variables and the class results.

```{r train.df.init, warning=FALSE}
# Fit model to classify all the variables
train.df <- cbind(train,classtrain)
val.df <- cbind(validation, classval)
```

Then we write our methods for feature selection, both for the training and validation sets. 

```{r rpart.featureselect}
#Create training for rpart variable importance
set.seed(100)
rPartMod <- train(classtrain ~ ., data=train.df, method="rpart")
rpartImp <- varImp(rPartMod)

#Select most important variables, and select those from dataframe to use
variable.select <- rpartImp$importance %>%
   dplyr::select(Overall) > 0
variable.fs <- train %>%
   select_if(variable.select)
train.df.rpart <- cbind(variable.fs, classtrain)
```

```{r rpart.featureselect.val}
set.seed(100)
rPartMod.val <- train(classval ~ ., data=val.df, method="rpart")
rpartImp.val <- varImp(rPartMod.val)
variable.select.val <- rpartImp.val$importance %>%
   dplyr::select(Overall) > 0
variable.fs.val <- validation %>%
   select_if(variable.select.val)
val.df.rpart <- cbind(variable.fs.val, classval)
```

```{r rfe.featureselect}
set.seed(7)
library(mlbench)
library(caret)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(train.df[1:168], train.df$classtrain, sizes=20, rfeControl=control)

sig.rfe <- predictors(results)
sig.rfe <- as.factor(sig.rfe)

train.df.rfe <- train.df %>% select(sig.rfe)
train.df.rfe <- cbind(train.df.rfe, classtrain)
```

```{r rfe.featureselect.val}
set.seed(7)
library(mlbench)
library(caret)
# define the control using a random forest selection function
control.val <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results.val <- rfe(val.df[1:168], val.df$classval, sizes=20, rfeControl=control.val)

sig.rfe.val <- predictors(results.val)
sig.rfe.val <- as.factor(sig.rfe.val)

val.df.rfe <- val.df %>% select(sig.rfe.val)
val.df.rfe <- cbind(val.df.rfe, classval)
```

Then, we plug in our training and validation dataframes to the training algorithms for both classifications and all feature combinations. We return values for balanced accuracy, and a confusion matrix of data. 

```{r rf.train, echo=FALSE}
#Create classification of data via randomForest w/ all features
rf <- randomForest(
   classtrain ~ .,
   data = train.df
)

# Out-of-bag confusion matrix, without the class.error column
confmat.rf <- rf$confusion
confmat.rf <- subset(confmat.rf, select = -c(class.error))
# True Positive Rate or Sensitivity
Sensitivity.rf <- sensitivity_from_confmat(confmat.rf)
# True Negative Rate or Specificity
Specificity.rf <- specificity_from_confmat(confmat.rf)
BalancedAccuracy.rf <- (Sensitivity.rf + Specificity.rf) / 2
```

```{r rf.val, echo=FALSE}
#Create classification of data via randomForest w/ all features
rf.val <- randomForest(
   classval ~ .,
   data = val.df,
   ntree = 300
)

# Use the confusion matrix of the validation model (rf.val), not rf
confmat.rf.val <- rf.val$confusion
confmat.rf.val <- subset(confmat.rf.val, select = -c(class.error))
# True Positive Rate or Sensitivity
Sensitivity.rf.val <- sensitivity_from_confmat(confmat.rf.val)
# True Negative Rate or Specificity
Specificity.rf.val <- specificity_from_confmat(confmat.rf.val)
BalancedAccuracy.rf.val <- (Sensitivity.rf.val + Specificity.rf.val) / 2
```

```{r rf.rfe.train}

# Random forest using the features kept by RFE
rf.rfe <- randomForest(
   classtrain ~ .,
   data = train.df.rfe
)

confmat.rf.rfe <- rf.rfe$confusion
confmat.rf.rfe <- subset(confmat.rf.rfe, select = -c(class.error))
# True Positive Rate or Sensitivity
Sensitivity.rf.rfe <- sensitivity_from_confmat(confmat.rf.rfe)
# True Negative Rate or Specificity
Specificity.rf.rfe <- specificity_from_confmat(confmat.rf.rfe)
BalancedAccuracy.rf.rfe <- (Sensitivity.rf.rfe + Specificity.rf.rfe) / 2
```

```{r rf.rfe.val}

# Random forest using the features kept by RFE (validation data)
rf.rfe.val <- randomForest(
   classval ~ .,
   data = val.df.rfe
)

confmat.rf.rfe.val <- rf.rfe.val$confusion
confmat.rf.rfe.val <- subset(confmat.rf.rfe.val, select = -c(class.error))
# True Positive Rate or Sensitivity
Sensitivity.rf.rfe.val <- sensitivity_from_confmat(confmat.rf.rfe.val)
# True Negative Rate or Specificity
Specificity.rf.rfe.val <- specificity_from_confmat(confmat.rf.rfe.val)
BalancedAccuracy.rf.rfe.val <- (Sensitivity.rf.rfe.val + Specificity.rf.rfe.val) / 2
BalancedAccuracy.rf.rfe.val
```

```{r rf.rpart.train}
# Random forest using rpart variable importance
rf.rpart <- randomForest(
   classtrain ~ .,
   data = train.df.rpart
)

confmat.rf.rpart <- rf.rpart$confusion
confmat.rf.rpart <- subset(confmat.rf.rpart, select = -c(class.error))
# True Positive Rate or Sensitivity
Sensitivity.rf.rpart <- sensitivity_from_confmat(confmat.rf.rpart)
# True Negative Rate or Specificity
Specificity.rf.rpart <- specificity_from_confmat(confmat.rf.rpart)
BalancedAccuracy.rf.rpart <- (Sensitivity.rf.rpart + Specificity.rf.rpart) / 2
```

```{r rf.rpart.val}
# Random forest using rpart variable importance (validation data)
rf.rpart.val <- randomForest(
   classval ~ .,
   data = val.df.rpart
)

confmat.rf.rpart.val <- rf.rpart.val$confusion
confmat.rf.rpart.val <- subset(confmat.rf.rpart.val, select = -c(class.error))
# True Positive Rate or Sensitivity
Sensitivity.rf.rpart.val <- sensitivity_from_confmat(confmat.rf.rpart.val)
# True Negative Rate or Specificity
Specificity.rf.rpart.val <- specificity_from_confmat(confmat.rf.rpart.val)
BalancedAccuracy.rf.rpart.val <- (Sensitivity.rf.rpart.val + Specificity.rf.rpart.val) / 2
```

```{r svm.train}
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm <- train(classtrain~., data=train.df, method="svmRadial", metric=metric, trControl=control)
pred.svm <- predict(fit.svm, train.df)


confmat.svm <- table(pred.svm, train.df$classtrain, dnn=c("Prediction", "Actual"))   
# True Positive Rate or Sensitivity
Sensitivity.svm <- sensitivity_from_confmat(confmat.svm)
# True Negative Rate or Specificity
Specificity.svm <- specificity_from_confmat(confmat.svm)
BalancedAccuracy.svm <- (Sensitivity.svm+Specificity.svm)/2
```

```{r svm.val}
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.val <- train(classval~., data=val.df, method="svmRadial", metric=metric, trControl=control)
pred.svm.val <- predict(fit.svm, val.df)


confmat.svm.val <- table(pred.svm.val, val.df$classval, dnn=c("Prediction", "Actual"))   
# True Positive Rate or Sensitivity
Sensitivity.svm.val <- sensitivity_from_confmat(confmat.svm.val)
# True Negative Rate or Specificity
Specificity.svm.val <- specificity_from_confmat(confmat.svm.val)
BalancedAccuracy.svm.val <- (Sensitivity.svm.val+Specificity.svm.val)/2
```

```{r svm.rfe.train}
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.rfe <- train(classtrain~., data=train.df.rfe, method="svmRadial", metric=metric, trControl=control)
pred.svm.rfe <- predict(fit.svm.rfe, train.df.rfe)


confmat.svm.rfe <- table(pred.svm.rfe, train.df$classtrain, dnn=c("Prediction", "Actual"))  
# True Positive Rate or Sensitivity
Sensitivity.svm.rfe <- sensitivity_from_confmat(confmat.svm.rfe)
# True Negative Rate or Specificity
Specificity.svm.rfe <- specificity_from_confmat(confmat.svm.rfe)
BalancedAccuracy.svm.rfe <- (Sensitivity.svm.rfe+Specificity.svm.rfe)/2
```

```{r svm.rfe.val}
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.rfe.val <- train(classval~., data=val.df.rfe, method="svmRadial", metric=metric, trControl=control)
pred.svm.rfe.val <- predict(fit.svm.rfe.val, val.df.rfe)


confmat.svm.rfe.val <- table(pred.svm.rfe.val, val.df$classval, dnn=c("Prediction", "Actual"))  
# True Positive Rate or Sensitivity
Sensitivity.svm.rfe.val <- sensitivity_from_confmat(confmat.svm.rfe.val)
# True Negative Rate or Specificity
Specificity.svm.rfe.val <- specificity_from_confmat(confmat.svm.rfe.val)
BalancedAccuracy.svm.rfe.val <- (Sensitivity.svm.rfe.val+Specificity.svm.rfe.val)/2
```

```{r svm.rpart.train}
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.rpart <- train(classtrain~., data=train.df.rpart, method="svmRadial", metric=metric, trControl=control)
pred.svm.rpart <- predict(fit.svm.rpart, train.df.rpart)

confmat.svm.rpart <- table(pred.svm.rpart, train.df$classtrain, dnn=c("Prediction", "Actual"))
# True Positive Rate or Sensitivity
Sensitivity.svm.rpart <- sensitivity_from_confmat(confmat.svm.rpart)
# True Negative Rate or Specificity
Specificity.svm.rpart <- specificity_from_confmat(confmat.svm.rpart)
BalancedAccuracy.svm.rpart <- (Sensitivity.svm.rpart+Specificity.svm.rpart)/2
```

```{r svm.rpart.val}
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(7)
fit.svm.rpart.val <- train(classval~., data=val.df.rpart, method="svmRadial", metric=metric, trControl=control)
pred.svm.rpart.val <- predict(fit.svm.rpart.val, val.df.rpart)

confmat.svm.rpart.val <- table(pred.svm.rpart.val, val.df$classval, dnn=c("Prediction", "Actual"))
# True Positive Rate or Sensitivity
Sensitivity.svm.rpart.val <- sensitivity_from_confmat(confmat.svm.rpart.val)
# True Negative Rate or Specificity
Specificity.svm.rpart.val <- specificity_from_confmat(confmat.svm.rpart.val)
BalancedAccuracy.svm.rpart.val <- (Sensitivity.svm.rpart.val+Specificity.svm.rpart.val)/2
```

### ROC Curves 


```{r svm.rfe, message=FALSE, warning=FALSE}
roc.data <- data.frame("Class" = classval, 
                       "No Selection" = pred.svm.val,
                       "With Selection" = pred.svm.rfe.val)
roc.data$Class <- as.numeric(roc.data$Class)
roc.data$No.Selection <- as.numeric(roc.data$No.Selection)
roc.data$With.Selection <- as.numeric(roc.data$With.Selection)
roc.list <- roc(Class ~., 
                data = roc.data)
no.selection.auc <- round(auc(Class ~ No.Selection, data = roc.data), digits = 3)
with.selection.auc <- round(auc(Class ~ With.Selection, data = roc.data), digits = 3)
roc_plot <- ggroc(roc.list) + 
   ggtitle("ROC Curves (Validation Set)", subtitle = "SVM without Feature Selection versus SVM with RFE") + 
   scale_color_discrete(name = "Model", 
                        labels = c(paste("No Selection\nAUC :", no.selection.auc), paste("With Selection\nAUC :", with.selection.auc)))
roc_plot
```

```{r svm.rpart, message=FALSE, warning=FALSE}
roc.data <- data.frame("Class" = classval, 
                       "No Selection" = pred.svm.val,
                       "With Selection" = pred.svm.rpart.val)
roc.data$Class <- as.numeric(roc.data$Class)
roc.data$No.Selection <- as.numeric(roc.data$No.Selection)
roc.data$With.Selection <- as.numeric(roc.data$With.Selection)
roc.list <- roc(Class ~., 
                data = roc.data)
no.selection.auc <- round(auc(Class ~ No.Selection, data = roc.data), digits = 3)
with.selection.auc <- round(auc(Class ~ With.Selection, data = roc.data), digits = 3)
roc_plot <- ggroc(roc.list) + 
   ggtitle("ROC Curves (Validation Set)", subtitle = "SVM without Feature Selection versus SVM with rpart Variable Importance") + 
   scale_color_discrete(name = "Model", 
                        labels = c(paste("No Selection\nAUC :", no.selection.auc), paste("With Selection\nAUC :", with.selection.auc)))
roc_plot
```

```{r rf.rfe, message=FALSE, warning=FALSE}
pred.rf <- predict(rf.val, validation, type = 'response')
pred.rf.rfe <- predict(rf.rfe.val, validation, type = 'response')
pred.rf.rpart <- predict(rf.rpart.val, validation, type = 'response')
roc.data <- data.frame("Class" = classval, 
                       "No Selection" = pred.rf,
                       "With Selection" = pred.rf.rfe)
roc.data$Class <- as.numeric(roc.data$Class)
roc.data$No.Selection <- as.numeric(roc.data$No.Selection)
roc.data$With.Selection <- as.numeric(roc.data$With.Selection)
roc.list <- roc(Class ~., 
                data = roc.data)
no.selection.auc <- round(auc(Class ~ No.Selection, data = roc.data), digits = 3)
with.selection.auc <- round(auc(Class ~ With.Selection, data = roc.data), digits = 3)
roc_plot <- ggroc(roc.list) + 
   ggtitle("ROC Curves (Validation Set)", subtitle = "Random Forest without Feature Selection versus Random Forest with RFE") + 
   scale_color_discrete(name = "Model", 
                        labels = c(paste("No Selection\nAUC :", no.selection.auc), paste("With Selection\nAUC :", with.selection.auc)))
roc_plot
```

```{r rf.rpart, message=FALSE, warning=FALSE}
pred.rf <- predict(rf.val, validation, type = 'response')
pred.rf.rfe <- predict(rf.rfe.val, validation, type = 'response')
pred.rf.rpart <- predict(rf.rpart.val, validation, type = 'response')
roc.data <- data.frame("Class" = classval, 
                       "No Selection" = pred.rf,
                       "With Selection" = pred.rf.rpart)
roc.data$Class <- as.numeric(roc.data$Class)
roc.data$No.Selection <- as.numeric(roc.data$No.Selection)
roc.data$With.Selection <- as.numeric(roc.data$With.Selection)
roc.list <- roc(Class ~., 
                data = roc.data)
no.selection.auc <- round(auc(Class ~ No.Selection, data = roc.data), digits = 3)
with.selection.auc <- round(auc(Class ~ With.Selection, data = roc.data), digits = 3)
roc_plot <- ggroc(roc.list) + 
   ggtitle("ROC Curves (Validation Set)", subtitle = "Random Forest without Feature Selection versus Random Forest with rpart Variable Importance") + 
   scale_color_discrete(name = "Model", 
                        labels = c(paste("No Selection\nAUC :", no.selection.auc), paste("With Selection\nAUC :", with.selection.auc)))
roc_plot
```
### Interpreting the ROC Results

The validation results show that randomForest with feature selection generalizes slightly worse (lower accuracy on the validation set) than our original model using all the variables.

In contrast, when using SVM with RFE selection, the feature-selected model generalizes better on the data.
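
The ROC chunks above feed predicted class labels, converted to numbers, into the AUC calculation. As an aside, the sketch below shows how AUC can be computed from ranks in base R and how it can differ between continuous scores and thresholded labels. The function `auc_from_scores` and the toy values are hypothetical and are not part of the analysis above.

```r
# AUC as the probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative case (ties count one half)
auc_from_scores <- function(scores, labels, positive = "1") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks handle ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical validation labels and model outputs
labels <- factor(c("-1", "-1", "-1", "1", "1", "1"), levels = c("-1", "1"))
prob_scores <- c(0.10, 0.40, 0.60, 0.55, 0.80, 0.90)  # class probabilities
hard_labels <- as.numeric(prob_scores > 0.5)          # thresholded classes

auc_from_scores(prob_scores, labels)   # 8/9, about 0.889
auc_from_scores(hard_labels, labels)   # 7.5/9, about 0.833
```

In this toy case the thresholded labels lose ranking information, so the AUC is lower than with the underlying scores.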

## Competition Entry: Saving & Uploading Your Predictions

The following code creates a valid entry for the contest: 

   * Predict the test data and put the ranking in `ranking_rf`
      * The ranking can be any number *leading to a classification like log odds*.
      * This means **values greater than 0 mean class 1** and **values less than 0 mean class -1**. 
      * The results will be **ranked by AUC**.
   * Then write the results to the CSV file named `classification.csv` (You _must_ use this filename)
   * NOTE: You may need to execute this code chunk (and the chunks above it) individually for `write.table()` to work.


```{r cache=FALSE}
# Predict the test data. predict() returns class labels here, and
# as.numeric() converts them to the factor's integer codes for ranking.
ranking_rf <- predict(fit.svm, test)
ranking_rf <- as.numeric(ranking_rf)

# no need to convert to 0 and 1 since ranking needed for AUC.
write.table(ranking_rf,file = "classification.csv", row.names=F, col.names=F)
```

## Storing Feature Selection Results

   * Store your prediction for the features.
      * This should be **binary**, where **1 means keep the feature** and **0 means don't keep feature**. 
      * The results will be ranked by **balanced accuracy**.
   * Then write the results into the CSV file named `selection.csv` (You _must_ use this filename)
   * NOTE: You may need to execute this code chunk (and the chunks above it) individually for `write.table()` to work.


```{r cache=FALSE}
# Here is the mean prediction file for submission to the website 
# features should be a column vector of 0's and 1's. 
# 1 = keep feature, 0 = don't
features<-matrix(0,nrow=(ncol(train)),ncol=1)
# Set the ones we want to keep to 1
features[variable.select] <- 1
write.table(features,file = "selection.csv", row.names=F, col.names=F)
```
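
The same 0/1 column can also be built by matching feature names, which is convenient when a selection method returns names rather than a logical mask (for example, the output of `predictors()` after RFE). The sketch below uses made-up names; `all_features` stands in for `featurenames$V1` and `selected` for the chosen feature names, both assumptions for illustration.

```r
# Hypothetical feature names and a hypothetical set of selected features
all_features <- c("f1", "f2", "f3", "f4", "f5")
selected     <- c("f4", "f2")

# 1 = keep feature, 0 = don't; row order follows all_features,
# regardless of the order in which the selected names are listed
features <- matrix(as.integer(all_features %in% selected), ncol = 1)
features
```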

## Zipping and Submitting Your Results to the Challenge

   * Zip your `classification.csv` and `selection.csv` files -- we must use these exact names! -- into a single archive to generate the file `MyEntry.csv.zip` to enter the contest.
   * The name of your zip file is not important, but **should not include spaces or characters like `(` etc**. The following code creates a zip filename that will always be unique. 
   * NOTE:
      * You may need to execute this code chunk (and the chunks above it) individually for `system()` to work.
      * This code creates a zip with a filename based on time that will always be unique. This will result in many zips accumulating in your working directory!


```{r cache=FALSE}
# get time
time <-  format(Sys.time(), "%H%M%S")

time # verify a new value generated

#This automatically generates a compressed (zip) file 
system(paste0("zip -u MyEntry-", time, ".csv.zip classification.csv"))
system(paste0("zip -u MyEntry-", time, ".csv.zip selection.csv"))

paste0("The name of your entry file: MyEntry-", time, ".csv.zip")
```

   
  

