Assignment 3

Author

Semyon Toybis

Assignment

Assignment 3 requires training Support Vector Machine algorithms on the “Bank Marketing” data set from the UC Irvine Machine Learning Repository and comparing the output to the algorithms used in assignment 2. I have saved the data set to my working directory and will import it from there. Below I load the libraries that I initially expect to need. I will skip exploratory data analysis, as this was done in assignment 1. Additionally, I conducted pre-processing in assignment 2, where I used kNN to infer values for observations that had the value “unknown” for the features “contact”, “education”, and “job”. I saved the updated data to a csv and will work directly from that updated data set in this assignment.

preproccessed_data <- read.csv('bank_data_preproc_for_unknowns.csv')

Pre-processing

The original data set is fairly large, which creates an issue for training more complex algorithms in terms of run time and computing resources:

dim(preproccessed_data)
[1] 45211    17

In order to manage run time and computing resources, I will take a random sample of the full data set, which I will then use for training and test splits.

set.seed(00)
sample_data <- preproccessed_data |> sample_n(size = 1000)

The data set still maintains roughly the same imbalance as the original data set:

prop.table(table(preproccessed_data$y))

       no       yes 
0.8830152 0.1169848 
prop.table(table(sample_data$y))

   no   yes 
0.893 0.107 

The data set is imbalanced, so I need to use the SMOTE algorithm (discussed later) to create a balanced training set. SMOTE works on numeric inputs, so I will have to convert the categorical variables to dummy variables. I will also need these dummy variables for training the SVM models.

dummy_variables <- dummyVars(~.-y, data = sample_data, fullRank = T)
dummy_data <- predict(dummy_variables, newdata = sample_data)
sample_data <- cbind(as.data.frame(dummy_data), y = sample_data$y)
colnames(sample_data)
 [1] "age"                "jobblue-collar"     "jobentrepreneur"   
 [4] "jobhousemaid"       "jobmanagement"      "jobretired"        
 [7] "jobself-employed"   "jobservices"        "jobstudent"        
[10] "jobtechnician"      "jobunemployed"      "maritalmarried"    
[13] "maritalsingle"      "educationsecondary" "educationtertiary" 
[16] "defaultyes"         "balance"            "housingyes"        
[19] "loanyes"            "contacttelephone"   "day"               
[22] "monthaug"           "monthdec"           "monthfeb"          
[25] "monthjan"           "monthjul"           "monthjun"          
[28] "monthmar"           "monthmay"           "monthnov"          
[31] "monthoct"           "monthsep"           "duration"          
[34] "campaign"           "pdays"              "previous"          
[37] "poutcomeother"      "poutcomesuccess"    "poutcomeunknown"   
[40] "y"                 

Below is the description of the “duration” feature from the data set description:

last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Thus, I will remove this feature from the data set:

sample_data$duration <- NULL
str(sample_data)
'data.frame':   1000 obs. of  39 variables:
 $ age               : num  32 29 36 27 40 47 41 42 36 45 ...
 $ jobblue-collar    : num  0 0 0 0 0 0 0 0 0 1 ...
 $ jobentrepreneur   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ jobhousemaid      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ jobmanagement     : num  0 0 0 0 0 0 0 0 1 0 ...
 $ jobretired        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ jobself-employed  : num  0 0 0 0 0 0 1 0 0 0 ...
 $ jobservices       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ jobstudent        : num  0 0 0 1 0 0 0 0 0 0 ...
 $ jobtechnician     : num  0 0 1 0 0 0 0 0 0 0 ...
 $ jobunemployed     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ maritalmarried    : num  0 0 0 0 1 1 0 1 0 1 ...
 $ maritalsingle     : num  1 1 1 1 0 0 1 0 1 0 ...
 $ educationsecondary: num  1 1 1 0 1 1 0 1 0 1 ...
 $ educationtertiary : num  0 0 0 1 0 0 1 0 1 0 ...
 $ defaultyes        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ balance           : num  976 65 0 5741 -253 ...
 $ housingyes        : num  1 1 0 0 1 1 0 0 0 1 ...
 $ loanyes           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ contacttelephone  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ day               : num  28 14 8 21 9 14 18 25 29 7 ...
 $ monthaug          : num  0 0 0 1 0 0 0 0 0 0 ...
 $ monthdec          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ monthfeb          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ monthjan          : num  0 0 0 0 0 0 0 1 0 0 ...
 $ monthjul          : num  1 0 1 0 1 0 0 0 1 0 ...
 $ monthjun          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ monthmar          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ monthmay          : num  0 1 0 0 0 0 0 0 0 1 ...
 $ monthnov          : num  0 0 0 0 0 0 1 0 0 0 ...
 $ monthoct          : num  0 0 0 0 0 1 0 0 0 0 ...
 $ monthsep          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ campaign          : num  3 6 1 1 1 1 1 1 7 1 ...
 $ pdays             : num  -1 -1 -1 88 -1 826 -1 -1 -1 -1 ...
 $ previous          : num  0 0 0 1 0 1 0 0 0 0 ...
 $ poutcomeother     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcomesuccess   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcomeunknown   : num  1 1 1 0 1 0 1 1 1 1 ...
 $ y                 : chr  "no" "no" "no" "no" ...

Next, I will convert the target to a factor:

sample_data$y <- factor(sample_data$y, levels = c("yes", "no"))

Train-test split

As a reminder, our data set has an imbalance in the target variable:

prop.table(table(sample_data$y))

  yes    no 
0.107 0.893 

Thus, I will apply the SMOTE algorithm to the training set to avoid biasing the model toward the majority class.

First, I create the train-test split using an 80%/20% proportion.

set.seed(123)

sample_set <- sample(nrow(sample_data), round(nrow(sample_data)*0.8), replace = FALSE)

train_set <- sample_data[sample_set,]
test_set <- sample_data[-sample_set,]

Below I check the proportions of the target in each set.

Original data set:

round(prop.table(table(select(sample_data, y), exclude = NULL)), 4) * 100
y
 yes   no 
10.7 89.3 

Training data:

round(prop.table(table(select(train_set, y), exclude = NULL)), 4) * 100
y
  yes    no 
11.75 88.25 

Test data:

round(prop.table(table(select(test_set, y), exclude = NULL)), 4) * 100
y
 yes   no 
 6.5 93.5 

Next, I apply the SMOTE algorithm to generate synthetic minority observations so that the training set is balanced in the target.
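The dup_size argument controls (roughly) how many synthetic observations SMOTE generates per original minority observation; a value near 6.5 follows from the class ratio in the training split. A quick sketch of that calculation:

# ratio of majority to minority observations in the training split, minus 1,
# gives the approximate dup_size needed to balance the classes
class_counts <- table(train_set$y)
unname(class_counts["no"] / class_counts["yes"]) - 1  # approximately 6.5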

library(smotefamily)
set.seed(456)

smote_result<- SMOTE(X = train_set[, -39], target = train_set$y, 
                      K = 5, dup_size = 6.5)
train_set<-data.frame(smote_result$data)
names(train_set)[ncol(train_set)] <- "y"
train_set$y <- as.factor(train_set$y)
round(prop.table(table(select(train_set, y), exclude = NULL)), 4) * 100
y
   no   yes 
51.76 48.24 

The training set now has about an equal split in the target.

I will change the factor levels to match the order created above:

train_set$y <- factor(train_set$y, levels = c("yes", "no"))

Lastly, I will convert the dummy variables to Boolean values:

train_set <- train_set |> 
  mutate(across(where(is.numeric) & !all_of(c("age", "balance", "day", "pdays", "previous", "y")), ~ . == 1))
test_set <- test_set |> 
  mutate(across(where(is.numeric) & !all_of(c("age", "balance", "day", "pdays", "previous", "y")), ~ . == 1))
str(train_set)
'data.frame':   1364 obs. of  39 variables:
 $ age               : num  55 24 24 22 59 41 48 40 39 63 ...
 $ jobblue.collar    : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ jobentrepreneur   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobhousemaid      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobmanagement     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobretired        : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ jobself.employed  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobservices       : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ jobstudent        : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
 $ jobtechnician     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobunemployed     : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ maritalmarried    : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ maritalsingle     : logi  FALSE TRUE TRUE TRUE FALSE FALSE ...
 $ educationsecondary: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ educationtertiary : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ defaultyes        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ balance           : num  1580 1123 9883 71 1047 ...
 $ housingyes        : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ loanyes           : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ contacttelephone  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ day               : num  19 13 8 13 18 2 11 6 30 27 ...
 $ monthaug          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthdec          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthfeb          : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
 $ monthjan          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthjul          : logi  FALSE TRUE TRUE TRUE FALSE FALSE ...
 $ monthjun          : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ monthmar          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthmay          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthnov          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthoct          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthsep          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ campaign          : logi  TRUE TRUE FALSE TRUE FALSE TRUE ...
 $ pdays             : num  -1 412 -1 181 -1 -1 -1 98 -1 180 ...
 $ previous          : num  0 1 0 2 0 0 0 6 0 7 ...
 $ poutcomeother     : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
 $ poutcomesuccess   : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ poutcomeunknown   : logi  TRUE FALSE TRUE FALSE TRUE TRUE ...
 $ y                 : Factor w/ 2 levels "yes","no": 1 1 1 1 1 1 1 1 1 1 ...
str(test_set)
'data.frame':   200 obs. of  39 variables:
 $ age               : num  32 36 41 36 50 29 39 33 49 39 ...
 $ jobblue-collar    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobentrepreneur   : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ jobhousemaid      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobmanagement     : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ jobretired        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobself-employed  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ jobservices       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobstudent        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ jobtechnician     : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
 $ jobunemployed     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ maritalmarried    : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ maritalsingle     : logi  TRUE TRUE TRUE TRUE FALSE TRUE ...
 $ educationsecondary: logi  TRUE TRUE FALSE FALSE TRUE FALSE ...
 $ educationtertiary : logi  FALSE FALSE TRUE TRUE FALSE TRUE ...
 $ defaultyes        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ balance           : num  976 0 0 8 854 261 0 -452 852 512 ...
 $ housingyes        : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ loanyes           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ contacttelephone  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ day               : num  28 8 18 29 16 17 25 29 20 4 ...
 $ monthaug          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthdec          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthfeb          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthjan          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthjul          : logi  TRUE TRUE FALSE TRUE FALSE FALSE ...
 $ monthjun          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthmar          : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ monthmay          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthnov          : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ monthoct          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ monthsep          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ campaign          : logi  FALSE TRUE TRUE FALSE TRUE FALSE ...
 $ pdays             : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ previous          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcomeother     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ poutcomesuccess   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ poutcomeunknown   : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ y                 : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 1 2 2 ...

I also align the column names between the training and test sets, since converting the SMOTE output with data.frame() replaced the hyphens in two column names with periods:

colnames(test_set)[which(names(test_set) == "jobblue-collar")] <- "jobblue.collar"
colnames(test_set)[which(names(test_set) == "jobself-employed")] <- "jobself.employed"

setdiff(names(test_set), names(train_set))
character(0)

Assignment 2 models - random forest and xgBoost

Since my approach is a little different than in assignment 2 (I am working with a random sample of the data), I will re-run the random forest and xgBoost models from assignment 2 for comparison with the SVM models later in this assignment.

I will use parallel computing to improve run time.

Since I have a (synthetically) balanced training set, I will use accuracy as the metric for the training set and F1 as the metric for the test set.
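As a reminder, F1 is the harmonic mean of precision and recall, which makes it a more informative summary than accuracy on the imbalanced test set. A minimal sketch of the calculation (the helper function below is purely illustrative; I use MLmetrics::F1_Score for the actual scoring):

# illustrative helper: F1 from confusion-matrix counts
f1_from_counts <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}
f1_from_counts(tp = 4, fp = 15, fn = 9)  # 0.25, matching the random forest result below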

library(foreach)

Attaching package: 'foreach'
The following objects are masked from 'package:purrr':

    accumulate, when
library(doParallel)
Loading required package: iterators
Loading required package: parallel

Below I create data frames to track training set and test set performance.

library(MLmetrics)

Attaching package: 'MLmetrics'
The following objects are masked from 'package:caret':

    MAE, RMSE
The following object is masked from 'package:base':

    Recall
performance_metrics_training <- data.frame(Model=character(),
                                     Accuracy = numeric(),
                                     Kappa = numeric())
performance_metrics_test <- data.frame(Model=character(),
                                     Accuracy = numeric(),
                                     F1 = numeric(),
                                     Kappa = numeric(),
                                     Recall = numeric(),
                                     Precision = numeric())

Random forest

Below are the tunable parameters for the random forest model:

modelLookup('rf')
  model parameter                         label forReg forClass probModel
1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE

There is only one tunable parameter - mtry, which is the number of randomly selected features to consider at each split.

Below I train a random forest model that tries different values for the mtry parameter. It is recommended to try several evenly spaced values between 2 and the number of predictors.
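As a sketch, the evenly spaced grid used below can be derived directly from the number of predictors in the training set:

# evenly spaced mtry candidates between 2 and the number of predictors (38)
n_predictors <- ncol(train_set) - 1
seq(2, n_predictors, by = 8)  # 2, 10, 18, 26, 34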

Below I start the cluster, train the model, and close the cluster:

cluster0 <- makeCluster(detectCores()-2)
registerDoParallel(cluster0)

set.seed(789)
rf <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'rf',
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = expand.grid(.mtry=seq(2,38,8))
)

stopCluster(cluster0)
rf
Random Forest 

1364 samples
  38 predictor
   2 classes: 'yes', 'no' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1227, 1227, 1228, 1229, 1227, 1227, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9120430  0.8234633
  10    0.9318372  0.8632475
  18    0.9303666  0.8603544
  26    0.9230243  0.8456494
  34    0.9186287  0.8368510

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 10.

An mtry value of ten resulted in the highest accuracy.

I record the training set performance:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c(
  'rf',
  round(getTrainPerf(rf)[1], 4),
  round(getTrainPerf(rf)[2], 4)
)

Next, I try predicting the test set:

rf_predictions <- predict(rf, test_set, type='raw')

Below are the performance metrics:

rfCM <- confusionMatrix(rf_predictions, test_set$y, positive = 'yes')
rfCM
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   4  15
       no    9 172
                                          
               Accuracy : 0.88            
                 95% CI : (0.8267, 0.9216)
    No Information Rate : 0.935           
    P-Value [Acc > NIR] : 0.9986          
                                          
                  Kappa : 0.1873          
                                          
 Mcnemar's Test P-Value : 0.3074          
                                          
            Sensitivity : 0.3077          
            Specificity : 0.9198          
         Pos Pred Value : 0.2105          
         Neg Pred Value : 0.9503          
             Prevalence : 0.0650          
         Detection Rate : 0.0200          
   Detection Prevalence : 0.0950          
      Balanced Accuracy : 0.6137          
                                          
       'Positive' Class : yes             
                                          

I also look at the F1 score:

rfF1 <- F1_Score(y_pred = rf_predictions, y_true = test_set$y, positive = "yes")
rfF1
[1] 0.25

I add the metrics to the performance table:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c(
  'rf',
  round(rfCM$overall[1], 4),
  round(rfF1, 4),
  round(rfCM$overall[2], 4),
  round(rfCM$byClass[1], 4),
  round(rfCM$byClass[3], 4)
)

xgBoost

Next, I will train the xgBoost model.

modelLookup('xgbTree')
    model        parameter                          label forReg forClass
1 xgbTree          nrounds          # Boosting Iterations   TRUE     TRUE
2 xgbTree        max_depth                 Max Tree Depth   TRUE     TRUE
3 xgbTree              eta                      Shrinkage   TRUE     TRUE
4 xgbTree            gamma         Minimum Loss Reduction   TRUE     TRUE
5 xgbTree colsample_bytree     Subsample Ratio of Columns   TRUE     TRUE
6 xgbTree min_child_weight Minimum Sum of Instance Weight   TRUE     TRUE
7 xgbTree        subsample           Subsample Percentage   TRUE     TRUE
  probModel
1      TRUE
2      TRUE
3      TRUE
4      TRUE
5      TRUE
6      TRUE
7      TRUE

There are seven tunable parameters for the xgbTree model. Tuning all of them could take a significant amount of time and computing power, so I will tune two: nrounds, the number of boosting iterations, and max_depth, the maximum tree depth. I will use default values for the other parameters in order to manage run time.

tune_grid_xgb <- expand.grid(
  nrounds = c(100, 200, 300),       
  max_depth = c(3, 6, 9, 12),         
  eta = 0.3,            
  gamma = 0.01,              
  colsample_bytree = 1,
  min_child_weight = 1,   
  subsample = 1        
)
cluster1 <- makeCluster(detectCores()-2)
registerDoParallel(cluster1)

set.seed(1213)


xgbTree <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'xgbTree',
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = tune_grid_xgb
)

stopCluster(cluster1)
xgbTree
eXtreme Gradient Boosting 

1364 samples
  38 predictor
   2 classes: 'yes', 'no' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1227, 1228, 1228, 1227, 1228, 1228, ... 
Resampling results across tuning parameters:

  max_depth  nrounds  Accuracy   Kappa    
   3         100      0.9171157  0.8338250
   3         200      0.9193055  0.8382208
   3         300      0.9185702  0.8367794
   6         100      0.9288590  0.8574013
   6         200      0.9303242  0.8603373
   6         300      0.9325247  0.8647488
   9         100      0.9288482  0.8573748
   9         200      0.9303242  0.8603582
   9         300      0.9310541  0.8618176
  12         100      0.9317948  0.8632804
  12         200      0.9325301  0.8647384
  12         300      0.9325301  0.8647384

Tuning parameter 'eta' was held constant at a value of 0.3
Tuning

Tuning parameter 'min_child_weight' was held constant at a value of 1

Tuning parameter 'subsample' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 200, max_depth = 12, eta
 = 0.3, gamma = 0.01, colsample_bytree = 1, min_child_weight = 1 and
 subsample = 1.

The model with the highest accuracy had an nrounds value of 200 and a max_depth of 12.

I record the performance on the training data:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c(
  'xgbTree',
  round(getTrainPerf(xgbTree)[1], 4),
  round(getTrainPerf(xgbTree)[2], 4)
)

Next, I try predicting the test set

xgbTree_predictions <- predict(xgbTree, test_set, type='raw')

Below are the performance metrics:

xgbTreeCM <- confusionMatrix(xgbTree_predictions, test_set$y, positive = 'yes')
xgbTreeCM
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   3  14
       no   10 173
                                          
               Accuracy : 0.88            
                 95% CI : (0.8267, 0.9216)
    No Information Rate : 0.935           
    P-Value [Acc > NIR] : 0.9986          
                                          
                  Kappa : 0.1364          
                                          
 Mcnemar's Test P-Value : 0.5403          
                                          
            Sensitivity : 0.2308          
            Specificity : 0.9251          
         Pos Pred Value : 0.1765          
         Neg Pred Value : 0.9454          
             Prevalence : 0.0650          
         Detection Rate : 0.0150          
   Detection Prevalence : 0.0850          
      Balanced Accuracy : 0.5780          
                                          
       'Positive' Class : yes             
                                          

F1 score:

xgbTreeF1 <- F1_Score(y_pred = xgbTree_predictions, y_true = test_set$y, positive = "yes")
xgbTreeF1
[1] 0.2

I add the metrics to the performance table:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c(
  'xgbTree',
  round(xgbTreeCM$overall[1], 4),
  round(xgbTreeF1, 4),
  round(xgbTreeCM$overall[2], 4),
  round(xgbTreeCM$byClass[1], 4),
  round(xgbTreeCM$byClass[3], 4)
)

SVM

First, I look up the model parameters for SVM. The caret package offers several SVM model types, differentiated by the kernel: the function used to map the data into a higher-dimensional space in which the classes can be separated.

Three commonly used kernels are linear, polynomial, and radial basis.

modelLookup('svmLinear')
      model parameter label forReg forClass probModel
1 svmLinear         C  Cost   TRUE     TRUE      TRUE

The tunable parameter, C, controls the trade-off between margin maximization and error tolerance: higher values of C penalize misclassifications more heavily, which can lead to over-fitting.
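Concretely, under the standard soft-margin formulation, the linear SVM solves

$$
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w\rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to } y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,
$$

so larger values of C place more weight on the slack (misclassification) terms relative to the margin term.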

modelLookup('svmPoly')
    model parameter             label forReg forClass probModel
1 svmPoly    degree Polynomial Degree   TRUE     TRUE      TRUE
2 svmPoly     scale             Scale   TRUE     TRUE      TRUE
3 svmPoly         C              Cost   TRUE     TRUE      TRUE

There are two additional parameters here, degree and scale. The degree controls the flexibility of the decision boundary (how curvy it can be), with higher values corresponding to more over-fitting. Scale is the coefficient applied to the inner product between points before it is raised to the degree of the polynomial.

modelLookup('svmRadial')
      model parameter label forReg forClass probModel
1 svmRadial     sigma Sigma   TRUE     TRUE      TRUE
2 svmRadial         C  Cost   TRUE     TRUE      TRUE

There is one additional parameter here, sigma, which controls how much influence a single training example has. In kernlab's parameterization of the RBF kernel, exp(-sigma * ||x - x'||^2), smaller values of sigma mean that points farther apart are still considered similar.
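To make the three kernels concrete, below is a small illustrative sketch using kernlab (the package caret calls for these SVM methods); the input vectors and hyperparameter values are arbitrary and chosen only for demonstration:

# illustrative only: evaluate each kernel function on two arbitrary vectors
library(kernlab)
x1 <- c(1, 0, 2)
x2 <- c(0.5, 1, 1.5)

vanilladot()(x1, x2)                        # linear kernel: <x1, x2>
polydot(degree = 2, scale = 0.005)(x1, x2)  # polynomial: (scale * <x1, x2> + offset)^degree
rbfdot(sigma = 0.08)(x1, x2)                # radial basis: exp(-sigma * ||x1 - x2||^2)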

I will record training set and test set performance, similar to assignment 2. Since the training set is synthetically balanced, I will focus on accuracy as the evaluation metric. For the test set, the metric will be F1.

SVM Linear

First, I train an SVM model with a linear kernel. I will use parallel computing to improve run time.

There is only one parameter to tune, C, for which I will use a tuning grid.

cluster2 <- makeCluster(detectCores()-2)
registerDoParallel(cluster2)
set.seed(789)

svm_lin <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'svmLinear',
  preProcess = c('center','scale'),
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = expand.grid(.C=c(0.1, 1, 10, 100))
)

stopCluster(cluster2)
svm_lin
Support Vector Machines with Linear Kernel 

1364 samples
  38 predictor
   2 classes: 'yes', 'no' 

Pre-processing: centered (38), scaled (38) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1227, 1227, 1228, 1229, 1227, 1227, ... 
Resampling results across tuning parameters:

  C      Accuracy   Kappa    
    0.1  0.9223051  0.8440541
    1.0  0.9237648  0.8469637
   10.0  0.9237648  0.8469637
  100.0  0.9237595  0.8469399

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 1.

The model with the highest accuracy had a C value of 1.

I record the performance on the training data:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c(
  'svm_lin',
  round(getTrainPerf(svm_lin)[1], 4),
  round(getTrainPerf(svm_lin)[2], 4)
)

Next, I try predicting the test set:

svm_lin_predictions <- predict(svm_lin, test_set, type='raw')

Below are the performance metrics:

svm_lin_CM <- confusionMatrix(svm_lin_predictions, test_set$y, positive = 'yes')
svm_lin_CM
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   6  10
       no    7 177
                                          
               Accuracy : 0.915           
                 95% CI : (0.8674, 0.9497)
    No Information Rate : 0.935           
    P-Value [Acc > NIR] : 0.8979          
                                          
                  Kappa : 0.3685          
                                          
 Mcnemar's Test P-Value : 0.6276          
                                          
            Sensitivity : 0.4615          
            Specificity : 0.9465          
         Pos Pred Value : 0.3750          
         Neg Pred Value : 0.9620          
             Prevalence : 0.0650          
         Detection Rate : 0.0300          
   Detection Prevalence : 0.0800          
      Balanced Accuracy : 0.7040          
                                          
       'Positive' Class : yes             
                                          

F1 score:

svm_lin_F1 <- F1_Score(y_pred = svm_lin_predictions, y_true = test_set$y, positive = "yes")
svm_lin_F1
[1] 0.4137931

I add the metrics to the performance table:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c(
  'svm_lin',
  round(svm_lin_CM$overall[1], 4),
  round(svm_lin_F1, 4),
  round(svm_lin_CM$overall[2], 4),
  round(svm_lin_CM$byClass[1], 4),
  round(svm_lin_CM$byClass[3], 4)
)

SVM Polynomial

Next, I train an SVM model with a polynomial kernel. I will use parallel computing to improve run time.

There are three parameters to tune: C, degree, and scale. I will use a tuning grid.

cluster3 <- makeCluster(detectCores()-2)
registerDoParallel(cluster3)
set.seed(1011)

svm_poly <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'svmPoly',
  preProcess = c('center','scale'),
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = expand.grid(.degree= c(2,3,4, 5),
                         .scale = c(0.005, 0.01, 0.1, 1, 10),
                         .C=c(0.1, 1, 10, 100))
)

stopCluster(cluster3)
svm_poly
Support Vector Machines with Polynomial Kernel 

1364 samples
  38 predictor
   2 classes: 'yes', 'no' 

Pre-processing: centered (38), scaled (38) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1227, 1228, 1228, 1228, 1228, 1227, ... 
Resampling results across tuning parameters:

  degree  scale  C      Accuracy   Kappa    
  2       5e-03    0.1  0.9156451  0.8304752
  2       5e-03    1.0  0.9185809  0.8364469
  2       5e-03   10.0  0.9222413  0.8439052
  2       5e-03  100.0  0.9119848  0.8234396
  2       1e-02    0.1  0.9193216  0.8379155
  2       1e-02    1.0  0.9149098  0.8291333
  2       1e-02   10.0  0.9171157  0.8336513
  2       1e-02  100.0  0.9009661  0.8015161
  2       1e-01    0.1  0.9171157  0.8336164
  2       1e-01    1.0  0.8980303  0.7957246
  2       1e-01   10.0  0.8701803  0.7403325
  2       1e-01  100.0  0.8672714  0.7347511
  2       1e+00    0.1  0.8694558  0.7388408
  2       1e+00    1.0  0.8628757  0.7259047
  2       1e+00   10.0  0.8563117  0.7129285
  2       1e+00  100.0  0.8408813  0.6824982
  2       1e+01    0.1  0.8497155  0.6997424
  2       1e+01    1.0  0.8401514  0.6807907
  2       1e+01   10.0  0.8401514  0.6807907
  2       1e+01  100.0  0.8401514  0.6807907
  3       5e-03    0.1  0.9193216  0.8378879
  3       5e-03    1.0  0.9163751  0.8320398
  3       5e-03   10.0  0.9178456  0.8351233
  3       5e-03  100.0  0.9031666  0.8059308
  3       1e-02    0.1  0.9149152  0.8290137
  3       1e-02    1.0  0.9222413  0.8438822
  3       1e-02   10.0  0.9090436  0.8176162
  3       1e-02  100.0  0.8899957  0.7797604
  3       1e-01    0.1  0.8921694  0.7839530
  3       1e-01    1.0  0.8775386  0.7550480
  3       1e-01   10.0  0.8782686  0.7565587
  3       1e-01  100.0  0.8504186  0.7013836
  3       1e+00    0.1  0.8621189  0.7248938
  3       1e+00    1.0  0.8643087  0.7292762
  3       1e+00   10.0  0.8643087  0.7292762
  3       1e+00  100.0  0.8643087  0.7292762
  3       1e+01    0.1  0.8621136  0.7249634
  3       1e+01    1.0  0.8621136  0.7249634
  3       1e+01   10.0  0.8621136  0.7249634
  3       1e+01  100.0  0.8621136  0.7249634
  4       5e-03    0.1  0.9193162  0.8379204
  4       5e-03    1.0  0.9171264  0.8335730
  4       5e-03   10.0  0.9141799  0.8278452
  4       5e-03  100.0  0.9017067  0.8030596
  4       1e-02    0.1  0.9119901  0.8231547
  4       1e-02    1.0  0.9127308  0.8248739
  4       1e-02   10.0  0.9031559  0.8059172
  4       1e-02  100.0  0.8826589  0.7652971
  4       1e-01    0.1  0.8855839  0.7710341
  4       1e-01    1.0  0.8834049  0.7669066
  4       1e-01   10.0  0.8562742  0.7130706
  4       1e-01  100.0  0.8548197  0.7101670
  4       1e+00    0.1  0.8643409  0.7290661
  4       1e+00    1.0  0.8643409  0.7290661
  4       1e+00   10.0  0.8643409  0.7290661
  4       1e+00  100.0  0.8643409  0.7290661
  4       1e+01    0.1  0.8592261  0.7188975
  4       1e+01    1.0  0.8592261  0.7188975
  4       1e+01   10.0  0.8592261  0.7188975
  4       1e+01  100.0  0.8592261  0.7188975
  5       5e-03    0.1  0.9178510  0.8349529
  5       5e-03    1.0  0.9215167  0.8424312
  5       5e-03   10.0  0.9127147  0.8249236
  5       5e-03  100.0  0.8907364  0.7812290
  5       1e-02    0.1  0.9163965  0.8320561
  5       1e-02    1.0  0.9127201  0.8248911
  5       1e-02   10.0  0.8951052  0.7899309
  5       1e-02  100.0  0.8789985  0.7580891
  5       1e-01    0.1  0.8899957  0.7800643
  5       1e-01    1.0  0.8782900  0.7568108
  5       1e-01   10.0  0.8606859  0.7220868
  5       1e-01  100.0  0.8606859  0.7220868
  5       1e+00    0.1  0.8701964  0.7409418
  5       1e+00    1.0  0.8701964  0.7409418
  5       1e+00   10.0  0.8701964  0.7409418
  5       1e+00  100.0  0.8701964  0.7409418
  5       1e+01    0.1  0.8643517  0.7293678
  5       1e+01    1.0  0.8643517  0.7293678
  5       1e+01   10.0  0.8643517  0.7293678
  5       1e+01  100.0  0.8643517  0.7293678

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 2, scale = 0.005 and C = 10.

The model with the highest accuracy had a degree of 2, a scale of 0.005, and a C of 10.

I record the performance on the training data:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c(
  'svm_poly',
  round(getTrainPerf(svm_poly)[1], 4),
  round(getTrainPerf(svm_poly)[2], 4)
)

Next, I try predicting the test set:

svm_poly_predictions <- predict(svm_poly, test_set, type='raw')

Below are the performance metrics:

svm_poly_CM <- confusionMatrix(svm_poly_predictions, test_set$y, positive = 'yes')
svm_poly_CM
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   6  11
       no    7 176
                                          
               Accuracy : 0.91            
                 95% CI : (0.8615, 0.9458)
    No Information Rate : 0.935           
    P-Value [Acc > NIR] : 0.9368          
                                          
                  Kappa : 0.3523          
                                          
 Mcnemar's Test P-Value : 0.4795          
                                          
            Sensitivity : 0.4615          
            Specificity : 0.9412          
         Pos Pred Value : 0.3529          
         Neg Pred Value : 0.9617          
             Prevalence : 0.0650          
         Detection Rate : 0.0300          
   Detection Prevalence : 0.0850          
      Balanced Accuracy : 0.7014          
                                          
       'Positive' Class : yes             
                                          

F1 score:

svm_poly_F1 <- F1_Score(y_pred = svm_poly_predictions, y_true = test_set$y, positive = "yes")
svm_poly_F1
[1] 0.4

I add the metrics to the performance table:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c(
  'svm_poly',
  round(svm_poly_CM$overall[1], 4),
  round(svm_poly_F1, 4),
  round(svm_poly_CM$overall[2], 4),
  round(svm_poly_CM$byClass[1], 4),
  round(svm_poly_CM$byClass[3], 4)
)

SVM Radial Basis

Next, I train an SVM model with a radial basis function kernel. I will use parallel computing to improve run time.

There are two parameters to tune: sigma and C. I will use a tuning grid. A rule of thumb for sigma is between 3/k and 6/k, where k is the number of predictors.
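With the 38 predictors in this training set, that rule of thumb gives roughly the range covered by the grid below:

# rule-of-thumb range for sigma: 3/k to 6/k with k = 38 predictors
k <- ncol(train_set) - 1
c(3 / k, 6 / k)  # approximately 0.079 to 0.158, hence the grid of 0.08 to 0.16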

cluster4 <- makeCluster(detectCores()-2)
registerDoParallel(cluster4)
set.seed(1011)

svm_rbf <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'svmRadial',
  preProcess = c('center','scale'),
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = expand.grid(.sigma= c(0.08, 0.10, 0.12, 0.14, 0.16),
                         .C=c(0.1, 1, 10, 100))
)

stopCluster(cluster4)
svm_rbf
Support Vector Machines with Radial Basis Function Kernel 

1364 samples
  38 predictor
   2 classes: 'yes', 'no' 

Pre-processing: centered (38), scaled (38) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1227, 1228, 1228, 1228, 1228, 1227, ... 
Resampling results across tuning parameters:

  sigma  C      Accuracy   Kappa    
  0.08     0.1  0.8753274  0.7482129
  0.08     1.0  0.9178564  0.8347534
  0.08    10.0  0.9068645  0.8130909
  0.08   100.0  0.8980786  0.7955049
  0.10     0.1  0.8592100  0.7153203
  0.10     1.0  0.9134553  0.8258446
  0.10    10.0  0.9098164  0.8189271
  0.10   100.0  0.9039395  0.8071987
  0.12     0.1  0.8430979  0.6823337
  0.12     1.0  0.9119848  0.8228207
  0.12    10.0  0.9134661  0.8261885
  0.12   100.0  0.9090812  0.8174496
  0.14     0.1  0.8320846  0.6597756
  0.14     1.0  0.9090436  0.8169244
  0.14    10.0  0.9105303  0.8201916
  0.14   100.0  0.9046801  0.8085103
  0.16     0.1  0.8174270  0.6296384
  0.16     1.0  0.9105249  0.8198994
  0.16    10.0  0.9090597  0.8172172
  0.16   100.0  0.9046801  0.8084783

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.08 and C = 1.

The model with the highest accuracy had a sigma value of 0.08 and a C value of 1.

I record the performance on the training set:

performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c(
  'svm_rbf',
  round(getTrainPerf(svm_rbf)[1], 4),
  round(getTrainPerf(svm_rbf)[2], 4)
)

Next, I try predicting the test set:

svm_rbf_predictions <- predict(svm_rbf, test_set, type='raw')

Below are the performance metrics:

svm_rbf_CM <- confusionMatrix(svm_rbf_predictions, test_set$y, positive = 'yes')
svm_rbf_CM
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   3   5
       no   10 182
                                          
               Accuracy : 0.925           
                 95% CI : (0.8793, 0.9574)
    No Information Rate : 0.935           
    P-Value [Acc > NIR] : 0.7697          
                                          
                  Kappa : 0.2485          
                                          
 Mcnemar's Test P-Value : 0.3017          
                                          
            Sensitivity : 0.2308          
            Specificity : 0.9733          
         Pos Pred Value : 0.3750          
         Neg Pred Value : 0.9479          
             Prevalence : 0.0650          
         Detection Rate : 0.0150          
   Detection Prevalence : 0.0400          
      Balanced Accuracy : 0.6020          
                                          
       'Positive' Class : yes             
                                          

F1 score:

svm_rbf_F1 <- F1_Score(y_pred = svm_rbf_predictions, y_true = test_set$y, positive = "yes")
svm_rbf_F1
[1] 0.2857143

I add the metrics to the performance table:

performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c(
  'svm_rbf',
  round(svm_rbf_CM$overall[1], 4),
  round(svm_rbf_F1, 4),
  round(svm_rbf_CM$overall[2], 4),
  round(svm_rbf_CM$byClass[1], 4),
  round(svm_rbf_CM$byClass[3], 4)
)

Conclusion

Below is the output of the training set performance:

performance_metrics_training
     Model Accuracy  Kappa
1       rf   0.9318 0.8632
2  xgbTree   0.9325 0.8647
3  svm_lin   0.9238 0.8470
4 svm_poly   0.9222 0.8439
5  svm_rbf   0.9179 0.8348

and test set:

performance_metrics_test
     Model Accuracy     F1  Kappa Recall Precision
1       rf     0.88   0.25 0.1873 0.3077    0.2105
2  xgbTree     0.88    0.2 0.1364 0.2308    0.1765
3  svm_lin    0.915 0.4138 0.3685 0.4615     0.375
4 svm_poly     0.91    0.4 0.3523 0.4615    0.3529
5  svm_rbf    0.925 0.2857 0.2485 0.2308     0.375

These models showed strong performance on the training set but poor performance on the test set, which suggests low bias but high variance (over-fitting).

While all of the models had poor test set performance as measured by F1, the SVM linear and SVM polynomial models did see a notable improvement over the random forest and xgBoost models. Somewhat surprisingly, the RBF kernel performed the worst; I had suspected that the decision boundary is not linear and that RBF would therefore outperform the linear and polynomial kernels. The poor performance of the random forest and, especially, the xgbTree model was also surprising: these are ensemble methods, yet they performed worse than a single SVM.

While the F1 scores were still poor overall (below 0.5), the SVM linear and polynomial models were notably better than the other models. Using an ensemble method, such as bagging, may improve performance further. Additionally, different pre-processing steps may improve performance. For example, it is possible that the synthetically balanced training set translates poorly to unseen data. It could be worth training the SVM models on the imbalanced data with class weights, rather than on synthetically balanced data, to improve prediction of the minority class.
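As a rough sketch of that last idea (untested here), kernlab's ksvm accepts a class.weights argument, so a linear SVM could be fit on the original, imbalanced training split with the minority class up-weighted. The object train_set_imbalanced below is a hypothetical stand-in for the pre-SMOTE training split, and the weights are illustrative:

# illustrative sketch, not run in this assignment
library(kernlab)
wts <- c(yes = 88.25 / 11.75, no = 1)  # e.g., roughly the inverse of the pre-SMOTE class proportions
svm_weighted <- ksvm(y ~ ., data = train_set_imbalanced,  # hypothetical pre-SMOTE training split
                     kernel = "vanilladot", C = 1,
                     class.weights = wts)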