preproccessed_data <- read.csv('bank_data_preproc_for_unknowns.csv')
Assignment 3
Assignment
Assignment 3 requires training Support Vector Machine algorithms on the "Bank Marketing" data set from the UC Irvine Machine Learning Repository and comparing the output to the algorithms used in assignment 2. I have saved the data set to my working directory and will import it from there. Below I load the libraries that I initially think will be necessary. I will skip exploratory data analysis, as this was done in assignment 1. Additionally, I conducted pre-processing in assignment 2, where I used kNN to infer values for observations that had the value "unknown" for the features "contact", "education", and "job". I saved the updated data to a csv and will work directly from that updated data set in this assignment.
Pre-processing
The original data set is fairly large, which creates an issue for training more complex algorithms in terms of run time and computing resources:
dim(preproccessed_data)
[1] 45211 17
In order to manage run time and computing resources, I will take a random sample of the full data set, which I will then use for training and test splits.
set.seed(00)
sample_data <- preproccessed_data |> sample_n(size = 1000)
The sample maintains roughly the same class imbalance as the original data set:
prop.table(table(preproccessed_data$y))
no yes
0.8830152 0.1169848
prop.table(table(sample_data$y))
no yes
0.893 0.107
The data set is imbalanced, so I need to use the SMOTE algorithm (discussed later) to create a balanced training set. For this, I will have to convert the categorical variables to dummy variables. I will also need these dummy variables for training the SVM models.
dummy_variables <- dummyVars(~ . - y, data = sample_data, fullRank = TRUE)
dummy_data <- predict(dummy_variables, newdata = sample_data)
sample_data <- cbind(as.data.frame(dummy_data), y = sample_data$y)
colnames(sample_data)
[1] "age" "jobblue-collar" "jobentrepreneur"
[4] "jobhousemaid" "jobmanagement" "jobretired"
[7] "jobself-employed" "jobservices" "jobstudent"
[10] "jobtechnician" "jobunemployed" "maritalmarried"
[13] "maritalsingle" "educationsecondary" "educationtertiary"
[16] "defaultyes" "balance" "housingyes"
[19] "loanyes" "contacttelephone" "day"
[22] "monthaug" "monthdec" "monthfeb"
[25] "monthjan" "monthjul" "monthjun"
[28] "monthmar" "monthmay" "monthnov"
[31] "monthoct" "monthsep" "duration"
[34] "campaign" "pdays" "previous"
[37] "poutcomeother" "poutcomesuccess" "poutcomeunknown"
[40] "y"
Below is the description of the "duration" feature from the data set documentation:
last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Thus, I will remove this feature from the data set:
sample_data$duration <- NULL
str(sample_data)
'data.frame': 1000 obs. of 39 variables:
$ age : num 32 29 36 27 40 47 41 42 36 45 ...
$ jobblue-collar : num 0 0 0 0 0 0 0 0 0 1 ...
$ jobentrepreneur : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobhousemaid : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobmanagement : num 0 0 0 0 0 0 0 0 1 0 ...
$ jobretired : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobself-employed : num 0 0 0 0 0 0 1 0 0 0 ...
$ jobservices : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobstudent : num 0 0 0 1 0 0 0 0 0 0 ...
$ jobtechnician : num 0 0 1 0 0 0 0 0 0 0 ...
$ jobunemployed : num 0 0 0 0 0 0 0 0 0 0 ...
$ maritalmarried : num 0 0 0 0 1 1 0 1 0 1 ...
$ maritalsingle : num 1 1 1 1 0 0 1 0 1 0 ...
$ educationsecondary: num 1 1 1 0 1 1 0 1 0 1 ...
$ educationtertiary : num 0 0 0 1 0 0 1 0 1 0 ...
$ defaultyes : num 0 0 0 0 0 0 0 0 0 0 ...
$ balance : num 976 65 0 5741 -253 ...
$ housingyes : num 1 1 0 0 1 1 0 0 0 1 ...
$ loanyes : num 0 0 0 0 0 0 0 0 0 0 ...
$ contacttelephone : num 0 0 0 0 0 0 0 0 0 0 ...
$ day : num 28 14 8 21 9 14 18 25 29 7 ...
$ monthaug : num 0 0 0 1 0 0 0 0 0 0 ...
$ monthdec : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthfeb : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthjan : num 0 0 0 0 0 0 0 1 0 0 ...
$ monthjul : num 1 0 1 0 1 0 0 0 1 0 ...
$ monthjun : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthmar : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthmay : num 0 1 0 0 0 0 0 0 0 1 ...
$ monthnov : num 0 0 0 0 0 0 1 0 0 0 ...
$ monthoct : num 0 0 0 0 0 1 0 0 0 0 ...
$ monthsep : num 0 0 0 0 0 0 0 0 0 0 ...
$ campaign : num 3 6 1 1 1 1 1 1 7 1 ...
$ pdays : num -1 -1 -1 88 -1 826 -1 -1 -1 -1 ...
$ previous : num 0 0 0 1 0 1 0 0 0 0 ...
$ poutcomeother : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcomesuccess : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcomeunknown : num 1 1 1 0 1 0 1 1 1 1 ...
$ y : chr "no" "no" "no" "no" ...
Next, I will convert the target to a factor:
sample_data$y <- factor(sample_data$y, levels = c("yes", "no"))
Train-test split
As a reminder, the data set has an imbalance in the target variable:
prop.table(table(sample_data$y))
yes no
0.107 0.893
Thus, I will apply the SMOTE algorithm to the training set to avoid biasing the model toward the majority class.
First, I create the train-test split using an 80%/20% proportion.
set.seed(123)
sample_set <- sample(nrow(sample_data), round(nrow(sample_data) * 0.8), replace = FALSE)
train_set <- sample_data[sample_set, ]
test_set <- sample_data[-sample_set, ]
Below I check the proportions of the target in each set.
original data set:
round(prop.table(table(select(sample_data, y), exclude = NULL)), 4) * 100
y
yes no
10.7 89.3
training data
round(prop.table(table(select(train_set, y), exclude = NULL)), 4) * 100
y
yes no
11.75 88.25
test data
round(prop.table(table(select(test_set, y), exclude = NULL)), 4) * 100
y
yes no
6.5 93.5
Next, I apply the SMOTE algorithm to generate synthetic data so that the training set is balanced in the target:
library(smotefamily)
set.seed(456)
smote_result <- SMOTE(X = train_set[, -39], target = train_set$y,
                      K = 5, dup_size = 6.5)
train_set <- data.frame(smote_result$data)
names(train_set)[ncol(train_set)] <- "y"
train_set$y <- as.factor(train_set$y)
round(prop.table(table(select(train_set, y), exclude = NULL)), 4) * 100
y
no yes
51.76 48.24
The training set now has about an equal split in the target.
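For context on the dup_size value of 6.5 used above, below is a rough back-of-the-envelope check based on the class proportions of the training split, under the assumption that each unit of dup_size adds roughly one synthetic example per original minority example in smotefamily:
# Rough check of dup_size (assumption: one unit of dup_size ~ one synthetic
# example per original minority example)
0.8825 / 0.1175 - 1   # majority-to-minority ratio minus one, ~6.5
0.1175 * (1 + 6.5)    # implied minority count relative to the original total,
                      # ~0.88, i.e., close to the majority share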
I will change the factor levels to match the order created above:
train_set$y <- factor(train_set$y, levels = c("yes", "no"))
Lastly, I will convert the dummy variables to Boolean values:
train_set <- train_set |>
  mutate(across(where(is.numeric) & !all_of(c("age", "balance", "day", "pdays", "previous", "y")), ~ . == 1))
test_set <- test_set |>
  mutate(across(where(is.numeric) & !all_of(c("age", "balance", "day", "pdays", "previous", "y")), ~ . == 1))
str(train_set)
'data.frame': 1364 obs. of 39 variables:
$ age : num 55 24 24 22 59 41 48 40 39 63 ...
$ jobblue.collar : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ jobentrepreneur : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobhousemaid : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobmanagement : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobretired : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ jobself.employed : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobservices : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ jobstudent : logi FALSE TRUE FALSE TRUE FALSE FALSE ...
$ jobtechnician : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobunemployed : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ maritalmarried : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ maritalsingle : logi FALSE TRUE TRUE TRUE FALSE FALSE ...
$ educationsecondary: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ educationtertiary : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ defaultyes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ balance : num 1580 1123 9883 71 1047 ...
$ housingyes : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ loanyes : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ contacttelephone : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ day : num 19 13 8 13 18 2 11 6 30 27 ...
$ monthaug : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthdec : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthfeb : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
$ monthjan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthjul : logi FALSE TRUE TRUE TRUE FALSE FALSE ...
$ monthjun : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ monthmar : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthmay : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthnov : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthoct : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthsep : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ campaign : logi TRUE TRUE FALSE TRUE FALSE TRUE ...
$ pdays : num -1 412 -1 181 -1 -1 -1 98 -1 180 ...
$ previous : num 0 1 0 2 0 0 0 6 0 7 ...
$ poutcomeother : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
$ poutcomesuccess : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ poutcomeunknown : logi TRUE FALSE TRUE FALSE TRUE TRUE ...
$ y : Factor w/ 2 levels "yes","no": 1 1 1 1 1 1 1 1 1 1 ...
str(test_set)
'data.frame': 200 obs. of 39 variables:
$ age : num 32 36 41 36 50 29 39 33 49 39 ...
$ jobblue-collar : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobentrepreneur : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ jobhousemaid : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobmanagement : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ jobretired : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobself-employed : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ jobservices : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobstudent : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ jobtechnician : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
$ jobunemployed : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ maritalmarried : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ maritalsingle : logi TRUE TRUE TRUE TRUE FALSE TRUE ...
$ educationsecondary: logi TRUE TRUE FALSE FALSE TRUE FALSE ...
$ educationtertiary : logi FALSE FALSE TRUE TRUE FALSE TRUE ...
$ defaultyes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ balance : num 976 0 0 8 854 261 0 -452 852 512 ...
$ housingyes : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ loanyes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ contacttelephone : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ day : num 28 8 18 29 16 17 25 29 20 4 ...
$ monthaug : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthdec : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthfeb : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthjan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthjul : logi TRUE TRUE FALSE TRUE FALSE FALSE ...
$ monthjun : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthmar : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ monthmay : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthnov : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ monthoct : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ monthsep : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ campaign : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcomeother : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ poutcomesuccess : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ poutcomeunknown : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ y : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 1 2 2 ...
I also align the column naming conventions between the training and test sets:
colnames(test_set)[which(names(test_set) == "jobblue-collar")] <- "jobblue.collar"
colnames(test_set)[which(names(test_set) == "jobself-employed")] <- "jobself.employed"
setdiff(names(test_set), names(train_set))
character(0)
Assignment 2 models - random forest and xgBoost
Since my approach is a little different than in assignment 2 (I am working with a sample of the data), I will re-run the random forest and xgBoost models from assignment 2 for comparison with the SVM models later in this assignment.
I will use parallel computing to improve run time.
Since I have a (synthetically) balanced training set, I will use accuracy as the metric for the training set and F1 as the metric for the test set.
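For reference, F1 is the harmonic mean of precision and recall. Below is a minimal sketch of the calculation using hypothetical confusion-matrix counts (not values from this assignment):
# F1 from hypothetical confusion-matrix counts (illustrative only)
tp <- 8; fp <- 4; fn <- 2
precision <- tp / (tp + fp)                      # 0.667
recall    <- tp / (tp + fn)                      # 0.8
2 * precision * recall / (precision + recall)    # ~0.727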
library(foreach)
Attaching package: 'foreach'
The following objects are masked from 'package:purrr':
accumulate, when
library(doParallel)
Loading required package: iterators
Loading required package: parallel
Below I create data frames to track training set and test set performance:
library(MLmetrics)
Attaching package: 'MLmetrics'
The following objects are masked from 'package:caret':
MAE, RMSE
The following object is masked from 'package:base':
Recall
performance_metrics_training <- data.frame(Model = character(),
                                           Accuracy = numeric(),
                                           Kappa = numeric())
performance_metrics_test <- data.frame(Model = character(),
                                       Accuracy = numeric(),
                                       F1 = numeric(),
                                       Kappa = numeric(),
                                       Recall = numeric(),
                                       Precision = numeric())
Random forest
Below are the tunable parameters for the random forest model:
modelLookup('rf')
model parameter label forReg forClass probModel
1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
There is only one tunable parameter - mtry, which is the number of randomly selected features to consider at each split.
Below I train a random forest model that tries different values for the mtry parameter. It is recommended to try several evenly spaced values between 2 and the number of predictors.
Below I start the cluster, train the model, and close the cluster:
cluster0 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster0)
set.seed(789)
rf <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'rf',
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = expand.grid(.mtry = seq(2, 38, 8))
)
stopCluster(cluster0)
rf
Random Forest
1364 samples
38 predictor
2 classes: 'yes', 'no'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1227, 1227, 1228, 1229, 1227, 1227, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9120430 0.8234633
10 0.9318372 0.8632475
18 0.9303666 0.8603544
26 0.9230243 0.8456494
34 0.9186287 0.8368510
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 10.
An mtry value of ten resulted in the highest accuracy.
I record the training set performance:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('rf',
  round(getTrainPerf(rf)[1], 4),
  round(getTrainPerf(rf)[2], 4))
Next, I try predicting the test set:
rf_predictions <- predict(rf, test_set, type = 'raw')
Below are the performance metrics:
rfCM <- confusionMatrix(rf_predictions, test_set$y, positive = 'yes')
rfCM
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 4 15
no 9 172
Accuracy : 0.88
95% CI : (0.8267, 0.9216)
No Information Rate : 0.935
P-Value [Acc > NIR] : 0.9986
Kappa : 0.1873
Mcnemar's Test P-Value : 0.3074
Sensitivity : 0.3077
Specificity : 0.9198
Pos Pred Value : 0.2105
Neg Pred Value : 0.9503
Prevalence : 0.0650
Detection Rate : 0.0200
Detection Prevalence : 0.0950
Balanced Accuracy : 0.6137
'Positive' Class : yes
I also look at the F1 score:
rfF1 <- F1_Score(y_pred = rf_predictions, y_true = test_set$y, positive = "yes")
rfF1
[1] 0.25
I add the metrics to the performance table:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('rf',
  round(rfCM$overall[1], 4),
  round(rfF1, 4),
  round(rfCM$overall[2], 4),
  round(rfCM$byClass[1], 4),
  round(rfCM$byClass[3], 4))
xgBoost
Next, I will train the xgBoost model.
modelLookup('xgbTree')
model parameter label forReg forClass
1 xgbTree nrounds # Boosting Iterations TRUE TRUE
2 xgbTree max_depth Max Tree Depth TRUE TRUE
3 xgbTree eta Shrinkage TRUE TRUE
4 xgbTree gamma Minimum Loss Reduction TRUE TRUE
5 xgbTree colsample_bytree Subsample Ratio of Columns TRUE TRUE
6 xgbTree min_child_weight Minimum Sum of Instance Weight TRUE TRUE
7 xgbTree subsample Subsample Percentage TRUE TRUE
probModel
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 TRUE
7 TRUE
There are seven tunable parameters for the xgbTree model. Tuning all of them would take a significant amount of time and computing power, so I will tune two: nrounds, the number of boosting iterations, and max_depth, the maximum tree depth. I will use default values for the other parameters to manage run time.
tune_grid_xgb <- expand.grid(
  nrounds = c(100, 200, 300),
  max_depth = c(3, 6, 9, 12),
  eta = 0.3,
  gamma = 0.01,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
cluster1 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster1)
set.seed(1213)
xgbTree <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'xgbTree',
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = tune_grid_xgb
)
stopCluster(cluster1)
xgbTree
eXtreme Gradient Boosting
1364 samples
38 predictor
2 classes: 'yes', 'no'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1227, 1228, 1228, 1227, 1228, 1228, ...
Resampling results across tuning parameters:
max_depth nrounds Accuracy Kappa
3 100 0.9171157 0.8338250
3 200 0.9193055 0.8382208
3 300 0.9185702 0.8367794
6 100 0.9288590 0.8574013
6 200 0.9303242 0.8603373
6 300 0.9325247 0.8647488
9 100 0.9288482 0.8573748
9 200 0.9303242 0.8603582
9 300 0.9310541 0.8618176
12 100 0.9317948 0.8632804
12 200 0.9325301 0.8647384
12 300 0.9325301 0.8647384
Tuning parameter 'eta' was held constant at a value of 0.3
Tuning parameter 'gamma' was held constant at a value of 0.01
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 200, max_depth = 12, eta
= 0.3, gamma = 0.01, colsample_bytree = 1, min_child_weight = 1 and
subsample = 1.
The model with the highest accuracy had an nrounds value of 200 and a max_depth of 12.
I record the performance on the training data:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('xgbTree',
  round(getTrainPerf(xgbTree)[1], 4),
  round(getTrainPerf(xgbTree)[2], 4))
Next, I try predicting the test set:
xgbTree_predictions <- predict(xgbTree, test_set, type = 'raw')
Below are the performance metrics:
xgbTreeCM <- confusionMatrix(xgbTree_predictions, test_set$y, positive = 'yes')
xgbTreeCM
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 3 14
no 10 173
Accuracy : 0.88
95% CI : (0.8267, 0.9216)
No Information Rate : 0.935
P-Value [Acc > NIR] : 0.9986
Kappa : 0.1364
Mcnemar's Test P-Value : 0.5403
Sensitivity : 0.2308
Specificity : 0.9251
Pos Pred Value : 0.1765
Neg Pred Value : 0.9454
Prevalence : 0.0650
Detection Rate : 0.0150
Detection Prevalence : 0.0850
Balanced Accuracy : 0.5780
'Positive' Class : yes
F1 score:
xgbTreeF1 <- F1_Score(y_pred = xgbTree_predictions, y_true = test_set$y, positive = "yes")
xgbTreeF1
[1] 0.2
I add the metrics to the performance table:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('xgbTree',
  round(xgbTreeCM$overall[1], 4),
  round(xgbTreeF1, 4),
  round(xgbTreeCM$overall[2], 4),
  round(xgbTreeCM$byClass[1], 4),
  round(xgbTreeCM$byClass[3], 4))
SVM
First, I look up the model parameters for SVM. The caret package offers several SVM model types, differentiated by the kernel used, i.e., the function that maps the data into a higher-dimensional space to create class separation.
Three commonly used kernels are linear, polynomial, and radial basis.
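For reference, below is a minimal sketch of these three kernels written as plain R functions. The parameterizations are intended to mirror kernlab (the backend caret uses for these methods); the functions are illustrative only and are not used elsewhere in this assignment.
# Illustrative kernel functions (kernlab-style parameterizations);
# x and y are numeric feature vectors of equal length.
linear_kernel <- function(x, y) {
  sum(x * y)                                  # <x, y>
}
poly_kernel <- function(x, y, degree = 2, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree        # (scale * <x, y> + offset)^degree
}
rbf_kernel <- function(x, y, sigma = 0.1) {
  exp(-sigma * sum((x - y)^2))                # exp(-sigma * ||x - y||^2)
}

# Example: similarity of two small vectors under each kernel
x <- c(1, 0, 2); y <- c(0.5, 1, 1.5)
c(linear = linear_kernel(x, y), poly = poly_kernel(x, y), rbf = rbf_kernel(x, y))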
modelLookup('svmLinear')
model parameter label forReg forClass probModel
1 svmLinear C Cost TRUE TRUE TRUE
The tunable parameter, C, controls the trade-off between margin maximization and error tolerance: higher values of C penalize misclassified training points more heavily, which tends toward over-fitting.
modelLookup('svmPoly')
model parameter label forReg forClass probModel
1 svmPoly degree Polynomial Degree TRUE TRUE TRUE
2 svmPoly scale Scale TRUE TRUE TRUE
3 svmPoly C Cost TRUE TRUE TRUE
There are two additional parameters here, degree and scale. The degree controls the flexibility of the decision boundary (i.e., how curved the boundary can be, with higher values corresponding to more over-fitting). Scale is the multiplier applied to the dot product between two points before it is raised to the degree of the polynomial.
modelLookup('svmRadial')
model parameter label forReg forClass probModel
1 svmRadial sigma Sigma TRUE TRUE TRUE
2 svmRadial C Cost TRUE TRUE TRUE
There is one additional parameter here, sigma, which controls how much influence a single training example has. In the parameterization used by kernlab (caret's backend), larger values of sigma make the kernel decay faster, so only points that are close together are considered similar, producing a more flexible decision boundary.
I will record training set and test set performance, similar to assignment 2. Since the training set is synthetically balanced, I will focus on accuracy as the evaluation metric. For the test set, the metric will be F1.
SVM Linear
First, I train an SVM model with a linear kernel. I will use parallel computing to improve run time.
There is only one parameter to tune, C, for which I will use a tuning grid.
cluster2 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster2)
set.seed(789)
svm_lin <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'svmLinear',
  preProcess = c('center', 'scale'),
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = expand.grid(.C = c(0.1, 1, 10, 100))
)
stopCluster(cluster2)
svm_lin
Support Vector Machines with Linear Kernel
1364 samples
38 predictor
2 classes: 'yes', 'no'
Pre-processing: centered (38), scaled (38)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1227, 1227, 1228, 1229, 1227, 1227, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.1 0.9223051 0.8440541
1.0 0.9237648 0.8469637
10.0 0.9237648 0.8469637
100.0 0.9237595 0.8469399
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 1.
The model with the highest accuracy had a C value of 1.
I record the performance on the training data:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('svm_lin',
  round(getTrainPerf(svm_lin)[1], 4),
  round(getTrainPerf(svm_lin)[2], 4))
Next, I try predicting the test set:
svm_lin_predictions <- predict(svm_lin, test_set, type = 'raw')
Below are the performance metrics:
svm_lin_CM <- confusionMatrix(svm_lin_predictions, test_set$y, positive = 'yes')
svm_lin_CM
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 6 10
no 7 177
Accuracy : 0.915
95% CI : (0.8674, 0.9497)
No Information Rate : 0.935
P-Value [Acc > NIR] : 0.8979
Kappa : 0.3685
Mcnemar's Test P-Value : 0.6276
Sensitivity : 0.4615
Specificity : 0.9465
Pos Pred Value : 0.3750
Neg Pred Value : 0.9620
Prevalence : 0.0650
Detection Rate : 0.0300
Detection Prevalence : 0.0800
Balanced Accuracy : 0.7040
'Positive' Class : yes
F1 score:
svm_lin_F1 <- F1_Score(y_pred = svm_lin_predictions, y_true = test_set$y, positive = "yes")
svm_lin_F1
[1] 0.4137931
I add the metrics to the performance table:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('svm_lin',
  round(svm_lin_CM$overall[1], 4),
  round(svm_lin_F1, 4),
  round(svm_lin_CM$overall[2], 4),
  round(svm_lin_CM$byClass[1], 4),
  round(svm_lin_CM$byClass[3], 4))
SVM Polynomial
Next, I train an SVM model with a polynomial kernel. I will use parallel computing to improve run time.
There are three parameters to tune: C, degree, and scale. I will use a tuning grid.
cluster3 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster3)
set.seed(1011)
svm_poly <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'svmPoly',
  preProcess = c('center', 'scale'),
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = expand.grid(.degree = c(2, 3, 4, 5),
                         .scale = c(0.005, 0.01, 0.1, 1, 10),
                         .C = c(0.1, 1, 10, 100))
)
stopCluster(cluster3)
svm_poly
Support Vector Machines with Polynomial Kernel
1364 samples
38 predictor
2 classes: 'yes', 'no'
Pre-processing: centered (38), scaled (38)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1227, 1228, 1228, 1228, 1228, 1227, ...
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
2 5e-03 0.1 0.9156451 0.8304752
2 5e-03 1.0 0.9185809 0.8364469
2 5e-03 10.0 0.9222413 0.8439052
2 5e-03 100.0 0.9119848 0.8234396
2 1e-02 0.1 0.9193216 0.8379155
2 1e-02 1.0 0.9149098 0.8291333
2 1e-02 10.0 0.9171157 0.8336513
2 1e-02 100.0 0.9009661 0.8015161
2 1e-01 0.1 0.9171157 0.8336164
2 1e-01 1.0 0.8980303 0.7957246
2 1e-01 10.0 0.8701803 0.7403325
2 1e-01 100.0 0.8672714 0.7347511
2 1e+00 0.1 0.8694558 0.7388408
2 1e+00 1.0 0.8628757 0.7259047
2 1e+00 10.0 0.8563117 0.7129285
2 1e+00 100.0 0.8408813 0.6824982
2 1e+01 0.1 0.8497155 0.6997424
2 1e+01 1.0 0.8401514 0.6807907
2 1e+01 10.0 0.8401514 0.6807907
2 1e+01 100.0 0.8401514 0.6807907
3 5e-03 0.1 0.9193216 0.8378879
3 5e-03 1.0 0.9163751 0.8320398
3 5e-03 10.0 0.9178456 0.8351233
3 5e-03 100.0 0.9031666 0.8059308
3 1e-02 0.1 0.9149152 0.8290137
3 1e-02 1.0 0.9222413 0.8438822
3 1e-02 10.0 0.9090436 0.8176162
3 1e-02 100.0 0.8899957 0.7797604
3 1e-01 0.1 0.8921694 0.7839530
3 1e-01 1.0 0.8775386 0.7550480
3 1e-01 10.0 0.8782686 0.7565587
3 1e-01 100.0 0.8504186 0.7013836
3 1e+00 0.1 0.8621189 0.7248938
3 1e+00 1.0 0.8643087 0.7292762
3 1e+00 10.0 0.8643087 0.7292762
3 1e+00 100.0 0.8643087 0.7292762
3 1e+01 0.1 0.8621136 0.7249634
3 1e+01 1.0 0.8621136 0.7249634
3 1e+01 10.0 0.8621136 0.7249634
3 1e+01 100.0 0.8621136 0.7249634
4 5e-03 0.1 0.9193162 0.8379204
4 5e-03 1.0 0.9171264 0.8335730
4 5e-03 10.0 0.9141799 0.8278452
4 5e-03 100.0 0.9017067 0.8030596
4 1e-02 0.1 0.9119901 0.8231547
4 1e-02 1.0 0.9127308 0.8248739
4 1e-02 10.0 0.9031559 0.8059172
4 1e-02 100.0 0.8826589 0.7652971
4 1e-01 0.1 0.8855839 0.7710341
4 1e-01 1.0 0.8834049 0.7669066
4 1e-01 10.0 0.8562742 0.7130706
4 1e-01 100.0 0.8548197 0.7101670
4 1e+00 0.1 0.8643409 0.7290661
4 1e+00 1.0 0.8643409 0.7290661
4 1e+00 10.0 0.8643409 0.7290661
4 1e+00 100.0 0.8643409 0.7290661
4 1e+01 0.1 0.8592261 0.7188975
4 1e+01 1.0 0.8592261 0.7188975
4 1e+01 10.0 0.8592261 0.7188975
4 1e+01 100.0 0.8592261 0.7188975
5 5e-03 0.1 0.9178510 0.8349529
5 5e-03 1.0 0.9215167 0.8424312
5 5e-03 10.0 0.9127147 0.8249236
5 5e-03 100.0 0.8907364 0.7812290
5 1e-02 0.1 0.9163965 0.8320561
5 1e-02 1.0 0.9127201 0.8248911
5 1e-02 10.0 0.8951052 0.7899309
5 1e-02 100.0 0.8789985 0.7580891
5 1e-01 0.1 0.8899957 0.7800643
5 1e-01 1.0 0.8782900 0.7568108
5 1e-01 10.0 0.8606859 0.7220868
5 1e-01 100.0 0.8606859 0.7220868
5 1e+00 0.1 0.8701964 0.7409418
5 1e+00 1.0 0.8701964 0.7409418
5 1e+00 10.0 0.8701964 0.7409418
5 1e+00 100.0 0.8701964 0.7409418
5 1e+01 0.1 0.8643517 0.7293678
5 1e+01 1.0 0.8643517 0.7293678
5 1e+01 10.0 0.8643517 0.7293678
5 1e+01 100.0 0.8643517 0.7293678
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were degree = 2, scale = 0.005 and C = 10.
The model with the highest accuracy had a degree of 2, a scale of 0.005, and a C of 10.
I record the performance on the training data:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('svm_poly',
  round(getTrainPerf(svm_poly)[1], 4),
  round(getTrainPerf(svm_poly)[2], 4))
Next, I try predicting the test set:
svm_poly_predictions <- predict(svm_poly, test_set, type = 'raw')
Below are the performance metrics:
svm_poly_CM <- confusionMatrix(svm_poly_predictions, test_set$y, positive = 'yes')
svm_poly_CM
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 6 11
no 7 176
Accuracy : 0.91
95% CI : (0.8615, 0.9458)
No Information Rate : 0.935
P-Value [Acc > NIR] : 0.9368
Kappa : 0.3523
Mcnemar's Test P-Value : 0.4795
Sensitivity : 0.4615
Specificity : 0.9412
Pos Pred Value : 0.3529
Neg Pred Value : 0.9617
Prevalence : 0.0650
Detection Rate : 0.0300
Detection Prevalence : 0.0850
Balanced Accuracy : 0.7014
'Positive' Class : yes
F1 score:
svm_poly_F1 <- F1_Score(y_pred = svm_poly_predictions, y_true = test_set$y, positive = "yes")
svm_poly_F1
[1] 0.4
I add the metrics to the performance table:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('svm_poly',
  round(svm_poly_CM$overall[1], 4),
  round(svm_poly_F1, 4),
  round(svm_poly_CM$overall[2], 4),
  round(svm_poly_CM$byClass[1], 4),
  round(svm_poly_CM$byClass[3], 4))
SVM Radial Basis
Next, I train an SVM model with a Radial Basis Function kernel. I will use parallel computing to improve run time.
There are two parameters to tune: sigma and C. I will use a tuning grid. A rule of thumb for sigma is between 3/k and 6/k, where k is the number of input fields.
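As a quick check of this rule of thumb against the tuning grid below, with k = 38 predictors the suggested range is roughly 0.08 to 0.16:
# Rule-of-thumb range for sigma with k = 38 predictors
k <- 38
c(3 / k, 6 / k)   # roughly 0.079 to 0.158, which motivates the 0.08-0.16 grid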
cluster4 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster4)
set.seed(1011)
svm_rbf <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'svmRadial',
  preProcess = c('center', 'scale'),
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = expand.grid(.sigma = c(0.08, 0.10, 0.12, 0.14, 0.16),
                         .C = c(0.1, 1, 10, 100))
)
stopCluster(cluster4)
svm_rbf
Support Vector Machines with Radial Basis Function Kernel
1364 samples
38 predictor
2 classes: 'yes', 'no'
Pre-processing: centered (38), scaled (38)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1227, 1228, 1228, 1228, 1228, 1227, ...
Resampling results across tuning parameters:
sigma C Accuracy Kappa
0.08 0.1 0.8753274 0.7482129
0.08 1.0 0.9178564 0.8347534
0.08 10.0 0.9068645 0.8130909
0.08 100.0 0.8980786 0.7955049
0.10 0.1 0.8592100 0.7153203
0.10 1.0 0.9134553 0.8258446
0.10 10.0 0.9098164 0.8189271
0.10 100.0 0.9039395 0.8071987
0.12 0.1 0.8430979 0.6823337
0.12 1.0 0.9119848 0.8228207
0.12 10.0 0.9134661 0.8261885
0.12 100.0 0.9090812 0.8174496
0.14 0.1 0.8320846 0.6597756
0.14 1.0 0.9090436 0.8169244
0.14 10.0 0.9105303 0.8201916
0.14 100.0 0.9046801 0.8085103
0.16 0.1 0.8174270 0.6296384
0.16 1.0 0.9105249 0.8198994
0.16 10.0 0.9090597 0.8172172
0.16 100.0 0.9046801 0.8084783
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.08 and C = 1.
The model with the highest accuracy had a sigma value of 0.08 and a C value of 1.
I record the performance on the training set:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('svm_rbf',
  round(getTrainPerf(svm_rbf)[1], 4),
  round(getTrainPerf(svm_rbf)[2], 4))
Next, I try predicting the test set:
svm_rbf_predictions <- predict(svm_rbf, test_set, type = 'raw')
Below are the performance metrics:
svm_rbf_CM <- confusionMatrix(svm_rbf_predictions, test_set$y, positive = 'yes')
svm_rbf_CM
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 3 5
no 10 182
Accuracy : 0.925
95% CI : (0.8793, 0.9574)
No Information Rate : 0.935
P-Value [Acc > NIR] : 0.7697
Kappa : 0.2485
Mcnemar's Test P-Value : 0.3017
Sensitivity : 0.2308
Specificity : 0.9733
Pos Pred Value : 0.3750
Neg Pred Value : 0.9479
Prevalence : 0.0650
Detection Rate : 0.0150
Detection Prevalence : 0.0400
Balanced Accuracy : 0.6020
'Positive' Class : yes
F1 score:
svm_rbf_F1 <- F1_Score(y_pred = svm_rbf_predictions, y_true = test_set$y, positive = "yes")
svm_rbf_F1
[1] 0.2857143
I add the metrics to the performance table:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('svm_rbf',
  round(svm_rbf_CM$overall[1], 4),
  round(svm_rbf_F1, 4),
  round(svm_rbf_CM$overall[2], 4),
  round(svm_rbf_CM$byClass[1], 4),
  round(svm_rbf_CM$byClass[3], 4))
Conclusion
Below is the output of the training set performance:
performance_metrics_training
Model Accuracy Kappa
1 rf 0.9318 0.8632
2 xgbTree 0.9325 0.8647
3 svm_lin 0.9238 0.8470
4 svm_poly 0.9222 0.8439
5 svm_rbf 0.9179 0.8348
and test set:
performance_metrics_test
Model Accuracy F1 Kappa Recall Precision
1 rf 0.88 0.25 0.1873 0.3077 0.2105
2 xgbTree 0.88 0.2 0.1364 0.2308 0.1765
3 svm_lin 0.915 0.4138 0.3685 0.4615 0.375
4 svm_poly 0.91 0.4 0.3523 0.4615 0.3529
5 svm_rbf 0.925 0.2857 0.2485 0.2308 0.375
These models showed strong performance on the training set but poor performance on the test set, which suggests low bias but high variance (i.e., over-fitting).
While all of the models had poor test set performance as measured by F1, the SVM linear and SVM polynomial models did see a notable improvement over the random forest and xgBoost models. Somewhat surprisingly, the RBF kernel performed the worst; I had suspected that the decision boundary was not linear and that RBF would therefore out-perform a linear or polynomial kernel. The poor performance of the random forest and, especially, the xgbTree model was also surprising, as these are ensemble methods yet they performed worse than a single SVM model.
While the F1 score was still poor overall (below 0.5), the SVM linear and polynomial models were notably better than the other models. Using an ensemble method, such as bagging, may improve performance further. Additionally, different pre-processing steps may improve performance; for example, it is possible that the synthetically balanced training set translates poorly to unseen data. It could be worth training the SVM models on the imbalanced data with class weights, rather than on synthetically balanced data, to better predict the minority class.
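As a rough illustration of the class-weight idea (a sketch only, not run as part of this assignment), kernlab's ksvm accepts a class.weights argument that could be applied to the original imbalanced training split; the object name train_set_raw and the weight values below are hypothetical and would need tuning:
# Sketch: weight the minority class instead of oversampling it.
# train_set_raw is a hypothetical, imbalanced (pre-SMOTE) training split.
library(kernlab)
svm_weighted <- ksvm(y ~ ., data = train_set_raw,
                     kernel = "vanilladot", C = 1,
                     class.weights = c(yes = 8, no = 1))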