Assignment 2
Assignment 2 requires conducting experiments on the “Bank Marketing” data set from the UC Irvine Machine Learning Repository by training decision tree, random forest, and Adaboost (or Gradient boost) models and varying the approach within each to compare the results. I have saved the data-set to my working directory and will be importing it from there. Below I load the libraries that I initially think will be necessary. I will skip exploratory data analysis, as this was done in assignment 1.
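The exact library chunk is inferred from the functions used later in this document; a plausible set (treat the specific list as an assumption) is:
library(caret)      # train(), dummyVars(), trainControl(), confusionMatrix()
library(tidyverse)  # dplyr, tidyr, and ggplot2 for data manipulation and plotting
library(psych)      # describe() for summary statistics
library(rpart.plot) # rpart.plot() for decision tree visualizations
The remaining packages (VIM, foreach, doParallel, smotefamily, and MLmetrics) are loaded where they are first used below.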
Below, I import the data set:
bank_data <- read.csv('bank-full.csv', sep = ';')
Pre-processing
As discussed in assignment 1, there are a few pre-processing steps:
Imputing values for the “unknown” observations for features “contact”, “education”, and “job”
Converting categorical variables to dummy variables
Checking for features with near zero variance
Transforming continuous variables to control for outliers
Centering and scaling features will be done when fitting a model via the “caret” package
preproccessed_data <- bank_data
Converting unknown values
First, I replace the “unknown” observations in the “contact”, “education”, and “job” columns with NA:
preproccessed_data$contact[preproccessed_data$contact=='unknown'] <- NA
preproccessed_data$education[preproccessed_data$education=='unknown'] <- NA
preproccessed_data$job[preproccessed_data$job=='unknown'] <- NA
Next, I convert the categorical columns to factors:
preproccessed_data <- preproccessed_data |> mutate(across(where(is.character), as.factor))
Next, I perform KNN imputation to replace the NAs
library(VIM)
I use parallel computing later in the code to improve the run time.
library(foreach)
library(doParallel)
preproccessed_data <- kNN(data = preproccessed_data, variable = c('contact', 'education', 'job'), k = 5, imp_var = F)
prop.table(table(preproccessed_data$contact))
cellular telephone
0.93052576 0.06947424
prop.table(table(preproccessed_data$education))
primary secondary tertiary
0.1586782 0.5382318 0.3030900
prop.table(table(preproccessed_data$job))
admin. blue-collar entrepreneur housemaid management
0.11486143 0.21651810 0.03302294 0.02789144 0.21050187
retired self-employed services student technician
0.05091681 0.03505784 0.09247749 0.02081352 0.16891907
unemployed
0.02901949
Dummy variables
The decision tree algorithm in the “rpart” package can handle categorical variables directly without conversion, so normally I could forgo converting the categorical features.
However, the data set is imbalanced and I need to use the SMOTE algorithm (discussed later) to create a balanced training set. Because SMOTE operates on numeric features, I will have to convert the categorical variables to dummy variables.
dummy_variables <- dummyVars(~.-y, data = preproccessed_data, fullRank = T)
dummy_data <- predict(dummy_variables, newdata = preproccessed_data)
preproccessed_data <- cbind(as.data.frame(dummy_data), y = preproccessed_data$y)
colnames(preproccessed_data)
[1] "age" "job.blue-collar" "job.entrepreneur"
[4] "job.housemaid" "job.management" "job.retired"
[7] "job.self-employed" "job.services" "job.student"
[10] "job.technician" "job.unemployed" "marital.married"
[13] "marital.single" "education.secondary" "education.tertiary"
[16] "default.yes" "balance" "housing.yes"
[19] "loan.yes" "contact.telephone" "day"
[22] "month.aug" "month.dec" "month.feb"
[25] "month.jan" "month.jul" "month.jun"
[28] "month.mar" "month.may" "month.nov"
[31] "month.oct" "month.sep" "duration"
[34] "campaign" "pdays" "previous"
[37] "poutcome.other" "poutcome.success" "poutcome.unknown"
[40] "y"
Near zero variance
For a model like logistic regression, I would exclude variables with near zero variance. However, for this tree-based model, I will forgo doing so, as a decision tree inherently focuses on the variables that most reduce impurity and does not estimate coefficients for each variable.
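For reference, such a check could be done with caret's nearZeroVar() function; a minimal sketch (not run as part of this analysis) is:
# Flag (but do not remove) features with near zero variance
nzv <- nearZeroVar(preproccessed_data, saveMetrics = TRUE)
nzv[nzv$nzv, ]  # inspect the flagged features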
Transforming continuous variables
First, I look at the continuous variables besides “day”
preproccessed_data |> select(age, balance, duration, campaign, previous) |> describe()
vars n mean sd median trimmed mad min max range
age 1 45211 40.94 10.62 39 40.25 10.38 18 95 77
balance 2 45211 1362.27 3044.77 448 767.21 664.20 -8019 102127 110146
duration 3 45211 258.16 257.53 180 210.87 137.88 0 4918 4918
campaign 4 45211 2.76 3.10 2 2.12 1.48 1 63 62
previous 5 45211 0.58 2.30 0 0.13 0.00 0 275 275
skew kurtosis se
age 0.68 0.32 0.05
balance 8.36 140.73 14.32
duration 3.14 18.15 1.21
campaign 4.90 39.24 0.01
previous 41.84 4506.16 0.01
preproccessed_data |> select(age, balance, duration, campaign, previous) |>
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value') |>
  ggplot(aes(x = variable, y = value)) + geom_boxplot() + facet_wrap(vars(variable), scales = 'free_y')
As discussed above, these features have outliers. However, decision trees are able to handle data sets with outliers - thus I will leave the features as they currently are without any transformations.
Below is the description of the “duration” feature from the data set description:
last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Thus, I will remove the feature from the data set
preproccessed_data$duration <- NULL
Centering and scaling data can be done during training via the Caret package.
Train-Test Split
As a reminder, our data-set has an imbalance for the target variable:
prop.table(table(preproccessed_data$y))
no yes
0.8830152 0.1169848
Thus, I will apply the SMOTE algorithm to the training set to avoid biasing the model toward the majority class.
First, I create the train-test split using an 80%/20% proportion.
set.seed(123)
sample_set <- sample(nrow(preproccessed_data), round(nrow(preproccessed_data)*0.8), replace = FALSE)
train_set <- preproccessed_data[sample_set,]
test_set <- preproccessed_data[-sample_set,]
Below I check the proportions of the target
original data set:
round(prop.table(table(select(preproccessed_data, y), exclude = NULL)), 4) * 100
y
no yes
88.3 11.7
training data
round(prop.table(table(select(train_set, y), exclude = NULL)), 4) * 100
y
no yes
88.28 11.72
test data
round(prop.table(table(select(test_set, y), exclude = NULL)), 4) * 100
y
no yes
88.39 11.61
Next, I apply the SMOTE algorithm to generate synthetic data to make the training set balanced in the target
library(smotefamily)
set.seed(456)
smote_result <- SMOTE(X = train_set[, -39], target = train_set$y,
                      K = 5, dup_size = 6.5)
train_set <- data.frame(smote_result$data)
names(train_set)[ncol(train_set)] <- "y"
train_set$y <- as.factor(train_set$y)
round(prop.table(table(select(train_set, y), exclude = NULL)), 4) * 100
y
no yes
51.83 48.17
The training set now has about an equal split in the target.
Lastly, I will convert the dummy variables to Boolean values:
train_set <- train_set |>
  mutate(across(where(is.numeric) & !all_of(c("age", "balance", "day", "pdays", "previous", "y")), ~ . == 1))
test_set <- test_set |>
  mutate(across(where(is.numeric) & !all_of(c("age", "balance", "day", "pdays", "previous", "y")), ~ . == 1))
str(train_set)
'data.frame': 61603 obs. of 39 variables:
$ age : num 65 54 35 26 30 60 43 67 48 41 ...
$ job.blue.collar : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.entrepreneur : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.housemaid : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.management : logi FALSE TRUE FALSE TRUE FALSE FALSE ...
$ job.retired : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ job.self.employed : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ job.services : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.student : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.technician : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.unemployed : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ marital.married : logi TRUE TRUE FALSE FALSE FALSE TRUE ...
$ marital.single : logi FALSE FALSE TRUE TRUE TRUE FALSE ...
$ education.secondary: logi TRUE FALSE TRUE FALSE FALSE FALSE ...
$ education.tertiary : logi FALSE TRUE FALSE TRUE TRUE TRUE ...
$ default.yes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ balance : num 952 1303 127 3704 3343 ...
$ housing.yes : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ loan.yes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ contact.telephone : logi FALSE FALSE TRUE FALSE FALSE TRUE ...
$ day : num 6 3 14 19 1 26 2 11 28 5 ...
$ month.aug : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ month.dec : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.feb : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
$ month.jan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.jul : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.jun : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ month.mar : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.may : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ month.nov : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ month.oct : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.sep : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ campaign : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
$ pdays : num 96 -1 -1 -1 -1 -1 -1 91 92 90 ...
$ previous : num 1 0 0 0 0 0 0 3 12 7 ...
$ poutcome.other : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ poutcome.success : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ poutcome.unknown : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
$ y : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
str(test_set)
'data.frame': 9042 obs. of 39 variables:
$ age : num 58 33 58 29 32 57 25 36 50 36 ...
$ job.blue-collar : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ job.entrepreneur : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
$ job.housemaid : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.management : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ job.retired : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ job.self-employed : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.services : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.student : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ job.technician : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ job.unemployed : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ marital.married : logi TRUE TRUE TRUE FALSE FALSE FALSE ...
$ marital.single : logi FALSE FALSE FALSE TRUE TRUE FALSE ...
$ education.secondary: logi FALSE TRUE FALSE TRUE FALSE TRUE ...
$ education.tertiary : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ default.yes : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ balance : num 2143 2 121 390 23 ...
$ housing.yes : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ loan.yes : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
$ contact.telephone : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ day : num 5 5 5 5 5 5 5 5 5 5 ...
$ month.aug : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.dec : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.feb : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.jan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.jul : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.jun : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.mar : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.may : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ month.nov : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.oct : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ month.sep : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ campaign : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome.other : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ poutcome.success : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ poutcome.unknown : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Decision tree
The first experiment will involve two decision tree models.
For this experiment, I will test two models - one that does not preprocess the data by centering and scaling and one that does.
My hypothesis is that there is no meaningful difference between the two models in terms of performance, because decision trees are able to handle continuous variables without scaling.
The evaluation metric for the training set will be accuracy, since we have a roughly balanced data set. The metric for the test set will be F1, since the test set is unbalanced.
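As a reminder, F1 is the harmonic mean of precision and recall. A small illustrative helper (equivalent in spirit to the MLmetrics::F1_Score() function used below, and only a sketch) is:
# Illustrative only: F1 from confusion-matrix counts for the positive class
f1_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}
f1_score(tp = 561, fp = 2379, fn = 489)  # reproduces Tree1's F1 of ~0.28 from the confusion matrix in the first experiment below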
First experiment
Below I train a decision tree model without centering and scaling:
tree1 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'rpart',
  trControl = trainControl(method = 'cv', number = 10)
)
tree1
CART
61603 samples
38 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 55443, 55442, 55443, 55442, 55442, 55443, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.05961649 0.7419124 0.4842386
0.09355306 0.7034886 0.4103832
0.34971186 0.5832351 0.1465104
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.05961649.
As seen above, the chosen model had a complexity parameter of about 0.06 and an accuracy of 0.74. Below is a visualization of the tree:
rpart.plot::rpart.plot(tree1$finalModel)
I will record the performance of the final model on the training data for comparison with other models
performance_metrics_training <- data.frame(Model = character(),
                                           Accuracy = numeric(),
                                           Kappa = numeric())
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('Tree1', round(getTrainPerf(tree1)[1],4),
                                                                             round(getTrainPerf(tree1)[2],4))
Next, I try predicting the values in the test set; however, I run into an issue. Two column names differ between the training and test sets:
setdiff(names(train_set), names(test_set))
[1] "job.blue.collar" "job.self.employed"
setdiff(names(test_set), names(train_set))
[1] "job.blue-collar" "job.self-employed"
It seems the mismatch is caused by an inconsistency in the column names (period vs. hyphen), likely because wrapping the SMOTE output in data.frame() converted the hyphens in the training set's column names to periods. I will change the names in the test set to match the training set.
colnames(test_set)[which(names(test_set) == "job.blue-collar")] <- "job.blue.collar"
colnames(test_set)[which(names(test_set) == "job.self-employed")] <- "job.self.employed"
setdiff(names(test_set), names(train_set))
character(0)
Next, I try predicting again:
tree1_predictions <- predict(tree1, test_set, type='raw')
Below are the performance metrics:
tree1CM <- confusionMatrix(tree1_predictions, test_set$y, positive = 'yes')
tree1CM
Confusion Matrix and Statistics
Reference
Prediction no yes
no 5613 489
yes 2379 561
Accuracy : 0.6828
95% CI : (0.6731, 0.6924)
No Information Rate : 0.8839
P-Value [Acc > NIR] : 1
Kappa : 0.1328
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.53429
Specificity : 0.70233
Pos Pred Value : 0.19082
Neg Pred Value : 0.91986
Prevalence : 0.11612
Detection Rate : 0.06204
Detection Prevalence : 0.32515
Balanced Accuracy : 0.61831
'Positive' Class : yes
I also look at the F1 score:
library(MLmetrics)
tree1F1 <- F1_Score(y_pred = tree1_predictions, y_true = test_set$y, positive = "yes")
tree1F1
[1] 0.281203
This model has poor performance - the accuracy of 0.68 is worse than just guessing “no” for all predictions, the kappa score is near zero and the F1 score is low as well.
I will create a data frame to track the test-set performance metrics for comparison across models:
performance_metrics_test <- data.frame(Model = character(),
                                       Accuracy = numeric(),
                                       F1 = numeric(),
                                       Kappa = numeric(),
                                       Recall = numeric(),
                                       Precision = numeric())
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('Tree1', round(tree1CM$overall[1],4),
                                                                    round(tree1F1,4),
                                                                    round(tree1CM$overall[2],4),
                                                                    round(tree1CM$byClass[1],4),
                                                                    round(tree1CM$byClass[3],4))
performance_metrics_test
Model Accuracy F1 Kappa Recall Precision
1 Tree1 0.6828 0.2812 0.1328 0.5343 0.1908
Second experiment
Below I train a decision tree model with centering and scaling:
tree2 <- train(
  y ~ .,
  data = train_set,
  preProcess = c('center','scale'),
  metric = 'Accuracy',
  method = 'rpart',
  trControl = trainControl(method = 'cv', number = 10)
)
tree2
CART
61603 samples
38 predictor
2 classes: 'no', 'yes'
Pre-processing: centered (38), scaled (38)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 55442, 55442, 55443, 55443, 55442, 55443, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.05961649 0.7424314 0.4852607
0.09355306 0.7032606 0.4099243
0.34971186 0.6169986 0.2222288
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.05961649.
The model has a similar complexity parameter of 0.06 and a cross-validated accuracy on the training set of 0.74.
Below is a plot of the model
rpart.plot::rpart.plot(tree2$finalModel)
I record the performance metrics on the training set:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('Tree2', round(getTrainPerf(tree2)[1],4),
                                                                             round(getTrainPerf(tree2)[2],4))
Next, I try predicting the test set:
tree2_predictions <- predict(tree2, test_set, type='raw')
Below are the performance metrics:
tree2CM <- confusionMatrix(tree2_predictions, test_set$y, positive = 'yes')
tree2CM
Confusion Matrix and Statistics
Reference
Prediction no yes
no 5613 489
yes 2379 561
Accuracy : 0.6828
95% CI : (0.6731, 0.6924)
No Information Rate : 0.8839
P-Value [Acc > NIR] : 1
Kappa : 0.1328
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.53429
Specificity : 0.70233
Pos Pred Value : 0.19082
Neg Pred Value : 0.91986
Prevalence : 0.11612
Detection Rate : 0.06204
Detection Prevalence : 0.32515
Balanced Accuracy : 0.61831
'Positive' Class : yes
Below is the F1 score:
tree2F1 <- F1_Score(y_pred = tree2_predictions, y_true = test_set$y, positive = "yes")
tree2F1
[1] 0.281203
I add the metrics to the performance data frame:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('Tree2', round(tree2CM$overall[1],4),
                                                                    round(tree2F1,4),
                                                                    round(tree2CM$overall[2],4),
                                                                    round(tree2CM$byClass[1],4),
                                                                    round(tree2CM$byClass[3],4))
performance_metrics_test
Model Accuracy F1 Kappa Recall Precision
1 Tree1 0.6828 0.2812 0.1328 0.5343 0.1908
2 Tree2 0.6828 0.2812 0.1328 0.5343 0.1908
My hypothesis was correct - the two models (one without centering and scaling of data and one with) have essentially the same performance.
Random Forest
The next experiment involves random forest models. Below, I look up the tunable parameters for the model.
modelLookup('rf')
model parameter label forReg forClass probModel
1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
There is only one tunable parameter - mtry, which is the number of randomly selected features to consider at each split.
The default value for mtry is the square root of the number of features in the data set when working on classification problems.
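With the 38 predictors in this training set, that works out to six; a quick check:
# Default mtry for classification: floor of the square root of the number of predictors
floor(sqrt(ncol(train_set) - 1))  # 38 predictors -> 6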
For this experiment, I will compare the output of a model with the default value for mtry vs one that tries several different values for mtry via hyper-parameter tuning.
A small value of mtry corresponds to having a wider variety of trees with significant differentiation between them.
My hypothesis is that the value of mtry from hyper-parameter tuning will be near the default value of six.
The evaluation metric for the training set will be accuracy, since we have a roughly balanced data set. The metric for the test set will be F1, since the test set is unbalanced.
First experiment
First, I will try training the random forest model without parameter tuning, using the square root of the number of features as the mtry parameter. Because I am not doing hyper-parameter tuning, I don’t need to do resampling. I will also use parallel computing to speed up the training process.
cluster00 <- makeCluster(detectCores()-2)
registerDoParallel(cluster00)
set.seed(1011)
rf0 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'rf',
  trControl = trainControl(method = 'none'),
  tuneGrid = expand.grid(.mtry = 6)
)
stopCluster(cluster00)
rf0
Random Forest
61603 samples
38 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: None
While there is no single tree to visualize, I can visualize variable importance:
plot(varImp(rf0))
I record the performance metrics on the training data:
rf0_train_predictions <- predict(rf0, newdata = train_set)
rf0_train_cm <- confusionMatrix(rf0_train_predictions, train_set$y, positive = 'yes')
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('rf0', round(rf0_train_cm$overall[1],4),
                                                                             round(rf0_train_cm$overall[2],4))
Next, I try predicting the test set with the model:
rf0_predictions <- predict(rf0, test_set, type='raw')
Below are the performance metrics:
rf0CM <- confusionMatrix(rf0_predictions, test_set$y, positive = 'yes')
rf0CM
Confusion Matrix and Statistics
Reference
Prediction no yes
no 7648 727
yes 344 323
Accuracy : 0.8816
95% CI : (0.8747, 0.8881)
No Information Rate : 0.8839
P-Value [Acc > NIR] : 0.7605
Kappa : 0.3144
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.30762
Specificity : 0.95696
Pos Pred Value : 0.48426
Neg Pred Value : 0.91319
Prevalence : 0.11612
Detection Rate : 0.03572
Detection Prevalence : 0.07377
Balanced Accuracy : 0.63229
'Positive' Class : yes
Below is the F1 score:
rf0F1 <- F1_Score(y_pred = rf0_predictions, y_true = test_set$y, positive = "yes")
rf0F1
[1] 0.3762376
I add the metrics to the performance table:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('RF0', round(rf0CM$overall[1],4),
                                                                    round(rf0F1,4),
                                                                    round(rf0CM$overall[2],4),
                                                                    round(rf0CM$byClass[1],4),
                                                                    round(rf0CM$byClass[3],4))
Second experiment
Below I train a random forest model that tries different values for the mtry parameter.
Below I start the cluster, train the model, and close the cluster:
cluster0 <- makeCluster(detectCores()-2)
registerDoParallel(cluster0)
set.seed(789)
rf1 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'rf',
  trControl = trainControl(method = 'cv', number = 10)
)
stopCluster(cluster0)
rf1
Random Forest
61603 samples
38 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 55443, 55443, 55442, 55443, 55443, 55443, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8913527 0.7814750
20 0.9245654 0.8486363
38 0.9209617 0.8414435
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 20.
The model with the best performance metrics was one with mtry = 20. This model has better performance metrics on the training set than the decision tree models.
While there is no single tree to visualize, I can visualize variable importance:
plot(varImp(rf1))
I record the performance on the training set:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('rf1', round(getTrainPerf(rf1)[1],4),
                                                                             round(getTrainPerf(rf1)[2],4))
Next, I try predicting the test set:
rf1_predictions <- predict(rf1, test_set, type='raw')
Below are the performance metrics:
rf1CM <- confusionMatrix(rf1_predictions, test_set$y, positive = 'yes')
rf1CM
Confusion Matrix and Statistics
Reference
Prediction no yes
no 7645 708
yes 347 342
Accuracy : 0.8833
95% CI : (0.8765, 0.8899)
No Information Rate : 0.8839
P-Value [Acc > NIR] : 0.5732
Kappa : 0.3318
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.32571
Specificity : 0.95658
Pos Pred Value : 0.49637
Neg Pred Value : 0.91524
Prevalence : 0.11612
Detection Rate : 0.03782
Detection Prevalence : 0.07620
Balanced Accuracy : 0.64115
'Positive' Class : yes
I also look at the F1 score:
rf1F1 <- F1_Score(y_pred = rf1_predictions, y_true = test_set$y, positive = "yes")
rf1F1
[1] 0.3933295
I add the metrics to the performance table:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('RF1', round(rf1CM$overall[1],4),
                                                                    round(rf1F1,4),
                                                                    round(rf1CM$overall[2],4),
                                                                    round(rf1CM$byClass[1],4),
                                                                    round(rf1CM$byClass[3],4))
performance_metrics_test
Model Accuracy F1 Kappa Recall Precision
1 Tree1 0.6828 0.2812 0.1328 0.5343 0.1908
2 Tree2 0.6828 0.2812 0.1328 0.5343 0.1908
3 RF0 0.8816 0.3762 0.3144 0.3076 0.4843
4 RF1 0.8833 0.3933 0.3318 0.3257 0.4964
Interestingly, parameter tuning picked a model with mtry = 20 rather than a value near the default of six. Both models still performed poorly on the test set, though the model with parameter tuning performed better than the model without tuning, and both random forest models performed better than the decision tree models. (Note that the two training-set figures recorded above are not directly comparable: rf0's accuracy was measured by predicting back on the training data, while rf1's comes from cross-validation.)
Boosted models
The next experiment involves boosted models. Below, I look up the tunable parameters for the model:
modelLookup('xgbTree')
    model        parameter                          label forReg forClass probModel
1 xgbTree          nrounds          # Boosting Iterations   TRUE     TRUE      TRUE
2 xgbTree        max_depth                 Max Tree Depth   TRUE     TRUE      TRUE
3 xgbTree              eta                      Shrinkage   TRUE     TRUE      TRUE
4 xgbTree            gamma         Minimum Loss Reduction   TRUE     TRUE      TRUE
5 xgbTree colsample_bytree     Subsample Ratio of Columns   TRUE     TRUE      TRUE
6 xgbTree min_child_weight Minimum Sum of Instance Weight   TRUE     TRUE      TRUE
7 xgbTree        subsample           Subsample Percentage   TRUE     TRUE      TRUE
There are seven tunable parameters for the xgbTree model. Tuning all of these parameters on a fairly large training set may take a significant amount of time and computing power, thus I will try tuning two parameters: nrounds, which is the number of boosting iterations, and max_depth, which is the maximum tree depth. I will use default values for the other parameters in order to manage run time.
For this experiment I will try two different cross-validation methods: k-fold cross-validation and random cross-validation (also known as leave-group-out cross-validation). The difference is that random cross-validation draws a new validation set at each iteration, whereas k-fold cross-validation partitions the data into validation folds before iterating. Because the random splits are drawn independently of one another, the validation sets from different iterations can overlap (i.e., an instance may be used for validation more than once).
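In caret, the two approaches correspond to different trainControl() settings; for reference, the controls used in the two experiments below look like this (the object names here are only for illustration, as the controls are passed directly to train() later):
# k-fold CV: partition the data into 10 folds up front; each fold is used exactly once for validation
ctrl_kfold <- trainControl(method = 'cv', number = 10)
# Random CV (LGOCV): 10 independent random splits; p = 0.1 keeps 10% of the data for model fitting
# in each repetition, so validation sets can overlap across repetitions
ctrl_lgocv <- trainControl(method = 'LGOCV', p = 0.1, number = 10)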
My hypothesis is that there will be no difference in model performance on either the training or the test set between the two cross-validation methods. I suspect this because the training set is large and synthetically balanced. The metric for the training set will be accuracy, and the metric for the test set will be F1, since the test set is unbalanced.
First experiment
First, I will try training the xgbTree model using k-fold cross validation. I will use the default values for the tuning parameters besides nrounds and max depth. I will also use parallel computing to improve the run time.
tune_grid <- expand.grid(
  nrounds = c(100, 200),
  max_depth = c(3, 6),
  eta = 0.3,
  gamma = 0.01,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
cluster1 <- makeCluster(detectCores()-2)
registerDoParallel(cluster1)
set.seed(1213)
xgbTree1 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'xgbTree',
  trControl = trainControl(method = 'cv', number = 10),
  tuneGrid = tune_grid
)
stopCluster(cluster1)
xgbTree1
eXtreme Gradient Boosting
61603 samples
38 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 55443, 55443, 55443, 55442, 55443, 55443, ...
Resampling results across tuning parameters:
max_depth nrounds Accuracy Kappa
3 100 0.9171633 0.8337416
3 200 0.9279257 0.8553095
6 100 0.9308152 0.8611369
6 200 0.9328118 0.8651448
Tuning parameter 'eta' was held constant at a value of 0.3
Tuning parameter 'gamma' was held constant at a value of 0.01
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 200, max_depth = 6, eta
= 0.3, gamma = 0.01, colsample_bytree = 1, min_child_weight = 1 and
subsample = 1.
Below I plot the variable importance:
plot(varImp(xgbTree1))
I record the performance on the training data:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('xgbTree1', round(getTrainPerf(xgbTree1)[1],4),
                                                                             round(getTrainPerf(xgbTree1)[2],4))
Next, I try predicting the test set
xgbTree1_predictions <- predict(xgbTree1, test_set, type='raw')
Below are the performance metrics:
xgbTree1CM <- confusionMatrix(xgbTree1_predictions, test_set$y, positive = 'yes')
xgbTree1CM
Confusion Matrix and Statistics
Reference
Prediction no yes
no 7722 744
yes 270 306
Accuracy : 0.8879
95% CI : (0.8812, 0.8943)
No Information Rate : 0.8839
P-Value [Acc > NIR] : 0.1216
Kappa : 0.3205
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.29143
Specificity : 0.96622
Pos Pred Value : 0.53125
Neg Pred Value : 0.91212
Prevalence : 0.11612
Detection Rate : 0.03384
Detection Prevalence : 0.06370
Balanced Accuracy : 0.62882
'Positive' Class : yes
F1 score:
xgbTree1F1 <- F1_Score(y_pred = xgbTree1_predictions, y_true = test_set$y, positive = "yes")
xgbTree1F1
[1] 0.3763838
I add the metrics to the performance table:
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('xgbTree1', round(xgbTree1CM$overall[1],4),
                                                                    round(xgbTree1F1,4),
                                                                    round(xgbTree1CM$overall[2],4),
                                                                    round(xgbTree1CM$byClass[1],4),
                                                                    round(xgbTree1CM$byClass[3],4))
Second experiment
Below, I re-run the model from above but use random cross validation.
cluster2 <- makeCluster(detectCores()-2)
registerDoParallel(cluster2)
set.seed(1213)
xgbTree2 <- train(
  y ~ .,
  data = train_set,
  metric = 'Accuracy',
  method = 'xgbTree',
  trControl = trainControl(method = 'LGOCV', p = 0.1, number = 10),
  tuneGrid = tune_grid
)
stopCluster(cluster2)
xgbTree2
eXtreme Gradient Boosting
61603 samples
38 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (10 reps, 10%)
Summary of sample sizes: 6161, 6161, 6161, 6161, 6161, 6161, ...
Resampling results across tuning parameters:
max_depth nrounds Accuracy Kappa
3 100 0.9096389 0.8186690
3 200 0.9161881 0.8318128
6 100 0.9156849 0.8308264
6 200 0.9156921 0.8308465
Tuning parameter 'eta' was held constant at a value of 0.3
Tuning parameter 'gamma' was held constant at a value of 0.01
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 200, max_depth = 3, eta
= 0.3, gamma = 0.01, colsample_bytree = 1, min_child_weight = 1 and
subsample = 1.
Variable importance:
plot(varImp(xgbTree2))
I record the performance on the training data:
performance_metrics_training[nrow(performance_metrics_training) + 1, ] <- c('xgbTree2', round(getTrainPerf(xgbTree2)[1],4),
                                                                             round(getTrainPerf(xgbTree2)[2],4))
Predicting the test set:
xgbTree2_predictions <- predict(xgbTree2, test_set, type='raw')
Performance metrics:
xgbTree2CM <- confusionMatrix(xgbTree2_predictions, test_set$y, positive = 'yes')
xgbTree2CM
Confusion Matrix and Statistics
Reference
Prediction no yes
no 7697 732
yes 295 318
Accuracy : 0.8864
95% CI : (0.8797, 0.8929)
No Information Rate : 0.8839
P-Value [Acc > NIR] : 0.2307
Kappa : 0.3246
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.30286
Specificity : 0.96309
Pos Pred Value : 0.51876
Neg Pred Value : 0.91316
Prevalence : 0.11612
Detection Rate : 0.03517
Detection Prevalence : 0.06779
Balanced Accuracy : 0.63297
'Positive' Class : yes
F1 score:
xgbTree2F1 <- F1_Score(y_pred = xgbTree2_predictions, y_true = test_set$y, positive = "yes")
xgbTree2F1
[1] 0.3824414
Performance table
performance_metrics_test[nrow(performance_metrics_test) + 1, ] <- c('xgbTree2', round(xgbTree2CM$overall[1],4),
                                                                    round(xgbTree2F1,4),
                                                                    round(xgbTree2CM$overall[2],4),
                                                                    round(xgbTree2CM$byClass[1],4),
                                                                    round(xgbTree2CM$byClass[3],4))
performance_metrics_test
Model Accuracy F1 Kappa Recall Precision
1 Tree1 0.6828 0.2812 0.1328 0.5343 0.1908
2 Tree2 0.6828 0.2812 0.1328 0.5343 0.1908
3 RF0 0.8816 0.3762 0.3144 0.3076 0.4843
4 RF1 0.8833 0.3933 0.3318 0.3257 0.4964
5 xgbTree1 0.8879 0.3764 0.3205 0.2914 0.5312
6 xgbTree2 0.8864 0.3824 0.3246 0.3029 0.5188
There is essentially no performance difference between the two models on the test set.
Conclusion
Below are the final data frames showing the performance of each model on the training and test sets.
Training:
performance_metrics_training
Model Accuracy Kappa
1 Tree1 0.7419 0.4842
2 Tree2 0.7424 0.4853
3 rf0 0.9559 0.9115
4 rf1 0.9246 0.8486
5 xgbTree1 0.9328 0.8651
6 xgbTree2 0.9162 0.8318
Test:
performance_metrics_test
Model Accuracy F1 Kappa Recall Precision
1 Tree1 0.6828 0.2812 0.1328 0.5343 0.1908
2 Tree2 0.6828 0.2812 0.1328 0.5343 0.1908
3 RF0 0.8816 0.3762 0.3144 0.3076 0.4843
4 RF1 0.8833 0.3933 0.3318 0.3257 0.4964
5 xgbTree1 0.8879 0.3764 0.3205 0.2914 0.5312
6 xgbTree2 0.8864 0.3824 0.3246 0.3029 0.5188
The results of the models are seen above. Relative to the hypothesis, the conclusion for each experiment is:
Experiment 1: the hypothesis that there is no meaningful difference between a model that centers and scales the data and one that does not was correct
Experiment 2: the hypothesis that hyper-parameter tuning would select an mtry value near the default of six was incorrect (tuning selected mtry = 20); the model that tuned the mtry hyper-parameter had a lower recorded accuracy on the training set but a slightly better F1 score on the test set
Experiment 3: the hypothesis that there would be no difference between a model that uses k-fold cross validation and random cross validation was correct for the test set, though the model with k-fold cross validation had better performance on the training set
The decision tree models had mediocre performance on the training set, while the random forest and xgbTree models had strong performance on the training set. However, all of the models had poor performance on the test set, which was imbalanced. Given the imbalance, we were interested in a model that can accurately predict the minority class (“yes”) and chose the F1 score, which combines recall (the proportion of actual positive instances the model correctly predicted) and precision (the proportion of positive predictions that were correct). All of the models had an F1 score below 0.5, indicating poor performance. The model that performed best on the test set was RF1, with an F1 score of about 0.39. This suggests that the models have low bias but high variance. Although RF1 had the highest performance on the test set, it still performed poorly overall, and thus I would not recommend the model. Further experiments could consider more hyper-parameter tuning. Additionally, I am curious how the models would have performed if I had not created a synthetically balanced training set; it is possible that there was too much synthetic data, which poorly represented the minority class in the test set. I would also try keeping the categorical variables as single factor columns rather than creating dummy variables, since tree-based models can handle categorical data directly.