I. Literature Review

In our project, we use logistic regression and random forests to determine the best predictors of whether a client will sign a long-term bank deposit. We then compare five classification models by their classification accuracy rate to determine which is the most accurate: Logistic Regression, k-Nearest Neighbors (kNN), two versions of tree-based classification (one built using rpart and the other using randomForest), and Support Vector Machine (SVM).

Logistic regression is a method commonly used in statistics. In the article “Logistic Regression”, Michael P. LaValley explains that logistic regression is used to analyze binary outcomes, that is, outcomes with two mutually exclusive levels. It is especially useful because it allows for multiple predictors, both continuous and categorical. Logistic regression is generally preferred over linear regression for binary outcomes: a linear regression can produce predicted values greater than 1 or less than 0, and it assumes constant variance in the errors, which does not hold for a binary outcome. Thus logistic regression is the more appropriate choice for binary outcomes.

The kNN method is also commonly used in statistics and data mining because it is simple yet effective. kNN can be used for classification or regression. The algorithm assumes that similar observations lie close to one another. In either case, the input is the k closest training samples, but the outputs differ: in classification, an observation is assigned to the class most common among its nearest neighbors, while in regression the output is the average of the values of the k closest neighbors. In the article “Efficient kNN Classification With Different Numbers of Nearest Neighbors”, Shichao Zhang et al. examine kNN’s weaknesses and how to improve upon them. Traditionally, a fixed k value is used for all testing samples. Researchers have tried to choose different k values with cross validation, but this is too time consuming. The authors instead propose adding a training component that builds a kTree to find optimal k values. With this component added, kNN achieved better accuracy: the study found that the kTree method increased classification accuracy by roughly 4% on average.

In the article “Random Forest Classifier for Remote Sensing Classification”, Mahesh Pal compares the effectiveness and accuracy of Random Forest classification and SVM. First, he discusses Random Forest. This method is popular in data mining because of its simple application. A Random Forest is an ensemble of decision tree classifiers whose output is the class chosen by the most trees (or the average prediction in a regression setting). Pal used the bagging method to create the training data for the individual trees. To design the decision trees, he utilized a pruning method called the attribute selection measure, chosen because pruning methods affect the performance of tree-based classifiers. This resulted in 88.37% accuracy with 100 trees. When the number of trees was increased to 1,200, the accuracy fell only slightly, to 88.02%, suggesting that Random Forest is not particularly sensitive to overfitting as more trees are added. Next, the author used SVM, which is commonly used in machine learning. SVM is based on statistical learning theory, with the goal of finding the decision boundary that best separates the classes. The model works most naturally when there are only two outcome classes; it can be extended to multi-class data either by a one-against-the-rest approach or by taking pairs of classes and classifying one against one. In this paper, the one-against-the-rest method is used, and the resulting accuracy was 87.9%. Therefore, there is not a major difference between the accuracy of Random Forest and SVM.

II. Methods

Our project seeks to answer two questions. The first is which predictors are the most important for predicting whether an individual will sign a long-term deposit. The second is which binary classification method most accurately classifies who will sign a deposit. The first step is to clean the bank dataset. After reading the documentation on the data, we determined that columns 11-15 are unnecessary for our analysis. Variable 11, duration, is highly correlated with the outcome variable: if duration equals zero, then the outcome is always no. Due to this relationship, we removed duration. Variables 12-15 pertain to the previous campaign; because this information is not relevant to the current campaign, we removed these columns as well. Finally, we omitted any rows with incomplete data and recoded the outcome variable to 1 for “yes” and 0 for “no”.

#drop duration and the previous-campaign columns
bank <- bank[,-(11:15)]
#drop rows with missing values
bank <- na.omit(bank)
#recode the outcome as a 0/1 factor
bank$y <- ifelse(bank$y == "yes", 1, 0)
bank$y <- as.factor(bank$y)

Next we created the train and test samples.

#70/30 split into training and test indices
trainbankindices  <- sample(1:nrow(bank), .7*nrow(bank))
trainbank <- bank[trainbankindices,]
testbank <- bank[-trainbankindices,]

Question 1

To determine which predictors are the most important for classification, we utilized tree-based classification. Random Forest is very useful for identifying feature importance. We used two different packages to build these models: “rpart”, which fits a single classification tree, and “randomForest”, which fits an ensemble of trees.

#build models
rpartmodel <- rpart(y~., data = trainbank, method = 'class')
randomforestmodel <- randomForest(y~. ,data = trainbank, importance = T)

#predict using test data
prediction <- predict(rpartmodel, testbank)
treeT <- table(predict(rpartmodel, testbank, type = 'class'), testbank$y)
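
As a quick check on which predictors the rpart tree relies on, the fitted object also stores an importance score for each variable, which can be viewed directly; a minimal sketch:

#variable importance scores stored in the fitted rpart object
rpart_importance <- rpartmodel$variable.importance
#show the highest-scoring predictors
head(sort(rpart_importance, decreasing = TRUE))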

Question 2

In order to find the best classification method, we compared the classification accuracy of Logistic Regression, K-Nearest Neighbor (KNN), Random Forest, and Support Vector Machines (SVM).

Logistic Regression

The first method we used is the simplest, logistic regression. It is a good starting point for binary classification, as it is computationally inexpensive to run and more easily interpretable than “black box” prediction algorithms. When running the logistic regression, we can see the individual variables and their significance in the regression output. This gives us an idea of which customer characteristics matter when predicting whether a client signs a deposit, rather than just an accuracy rate and a confusion matrix.

\[ log\left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 + \ldots + \beta_{n}X_n \]
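
Equivalently, solving for the probability itself gives the logistic form below; this predicted probability p is what predict() returns when type = 'response'.

\[ p = \frac{1}{1 + e^{-\left(\beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 + \ldots + \beta_{n}X_n\right)}} \]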

We then used a stepwise variable selection process to narrow down the predictors in the regression and interpret which factors are most important for accurately classifying the response variable. The algorithm used for this process can search in both directions rather than moving only forwards or only backwards; since we started the selection from the full initial model, it effectively performs backward selection to narrow down our set of predictors.

#build models
logreg <- glm(y~., data = trainbank, family = 'binomial')

#stepwise AIC-based variable selection starting from the full model (stepAIC is from the MASS package)
logreg_steps <- stepAIC(logreg, trace = FALSE)

#prediction on test data
log.predict <- predict(logreg, newdata = testbank, type = 'response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
step.predict <- predict(logreg_steps, newdata = testbank, type = 'response')
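
Because part of the appeal of logistic regression is interpretability, the coefficients of the stepwise model can be examined directly; a brief sketch (exponentiated coefficients read as odds ratios):

#coefficient estimates, standard errors, and p-values for the retained predictors
summary(logreg_steps)$coefficients
#exponentiated coefficients are interpretable as odds ratios
exp(coef(logreg_steps))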

KNN

After using logistic regression as a baseline, we turned toward more advanced machine learning methods, starting with K-Nearest Neighbors (KNN). Although our data are multivariate and not as easy to visualize as the low-dimensional settings where KNN is typically illustrated, KNN is computationally inexpensive and gives us another baseline for the accuracy rates we can expect.

To build the KNN model, we had to reformat the data frame. First, the outcome variable was separated into its own object. The numeric columns were then scaled, and the categorical variables were converted into dummy (indicator) variables.

knnbank <- bank

#outcome
outcome <- knnbank %>% dplyr::select(y)
knnbank <- knnbank %>% dplyr::select(-y)


##numeric scaling
knnbank[,c("age","emp.var.rate", "cons.price.idx", "euribor3m", "nr.employed")] <- scale(knnbank[,c("age","emp.var.rate", "cons.price.idx", "euribor3m", "nr.employed")])

## binary variables
knnbank$contact <- ifelse(knnbank$contact == "cellular", 1, 0)

##recode categorical variables as dummy variables (dummy.code is from the psych package)
knnbank$job <- as.data.frame(dummy.code(knnbank$job))
knnbank$marital <- as.data.frame(dummy.code(knnbank$marital))
knnbank$education <- as.data.frame(dummy.code(knnbank$education))
knnbank$default <- as.data.frame(dummy.code(knnbank$default))
knnbank$housing <- as.data.frame(dummy.code(knnbank$housing))
knnbank$loan <- as.data.frame(dummy.code(knnbank$loan))
knnbank$month <- as.data.frame(dummy.code(knnbank$month))
knnbank$day_of_week <- as.data.frame(dummy.code(knnbank$day_of_week))

Because the data had to be reformatted, we created new test and training samples.

smp_size <- floor(0.7 * nrow(knnbank))
train_ind <- sample(seq_len(nrow(knnbank)), size = smp_size)

knn_train <- knnbank[train_ind, ]
knn_test <- knnbank[-train_ind, ]

outcome_train <- outcome[train_ind, ]
outcome_test <- outcome[-train_ind, ]

We then used our reformatted dataset to run the KNN algorithm.

#knn (from the class package) with k = 20 nearest neighbors
pred_knn <- knn(train = knn_train, test = knn_test, cl = outcome_train, k=20)
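
The choice of k = 20 is a judgment call; a small sketch of one way to sanity-check it, comparing test-set accuracy over a few arbitrary candidate values of k:

#compare test-set accuracy for a few candidate k values
for (k in c(5, 10, 20, 50)) {
  pred_k <- knn(train = knn_train, test = knn_test, cl = outcome_train, k = k)
  print(c(k = k, accuracy = mean(pred_k == outcome_test)))
}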

Random Forest

Since this method can be used for both regression and classification, it is a valid method for our problem. The steps to create these models are shown above; we used both ‘rpart’ and ‘randomForest’.

SVM

Lastly, we wanted to try out a method we hadn’t seen in class before, and decided to try support vector machines, or SVM, as it is a method that is almost exclusively designed for binary classification problems. We elaborate more on the method itself in the extra credit section.

We first built the SVM model via 10-fold cross validation, then selected the best model. We used this best model to predict the test set.

#build model: tune.svm (from the e1071 package) with 10-fold cross validation at fixed gamma and cost
tuning.para <- tune.svm(y~., 
                        data = trainbank, gamma = 10^-2, 
                        cost = 10, tunecontrol = tune.control(cross=10))
svm.model <- tuning.para$best.model

#predict model
svm.pred <- predict(svm.model, testbank)
svm.pred <- as.data.frame(svm.pred)

III. Results

Question 1

To answer our first question, we used both rpart and randomForest to build the tree-based models described in the methods section and measure their prediction accuracy.

#Rpart
accuracyrpart <- (treeT[1,1]+treeT[2,2])/(sum(treeT))
printcp(rpartmodel)
## 
## Classification tree:
## rpart(formula = y ~ ., data = trainbank, method = "class")
## 
## Variables actually used in tree construction:
## [1] contact      emp.var.rate euribor3m    nr.employed 
## 
## Root node error: 3292/28831 = 0.11418
## 
## n= 28831 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.020960      0   1.00000 1.00000 0.016404
## 2 0.012151      3   0.93712 0.95170 0.016052
## 3 0.010000      4   0.92497 0.93682 0.015942

Rpart used the variables contact, emp.var.rate, euribor3m, and nr.employed to construct the tree. These are the variables that the algorithm detected as the most important: how the customer was contacted, the employment variation rate, the 3-month Euribor rate, and the number of employees (a quarterly employment indicator).

#RandomForest
varImpPlot(randomforestmodel)

treeT1 <- table(predict(randomforestmodel, testbank, type = 'class'), testbank$y)
accuracyrf <- (treeT1[1,1]+treeT1[2,2])/(sum(treeT1))

In contrast, using randomForest, the five most important variables in terms of mean decrease in accuracy are age, euribor3m, day_of_week, education, and nr.employed, while the most important variables in terms of mean decrease in Gini (node impurity) are euribor3m, age, nr.employed, job, education, and day_of_week. Although the orderings differ, the two lists largely overlap, so we conclude that age, euribor3m, day_of_week, education, and nr.employed are the most important predictors of the outcome variable.
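
The rankings above were read off the varImpPlot output; the underlying numbers can also be pulled directly from the fitted model, for example:

#importance matrix underlying varImpPlot
imp <- importance(randomforestmodel)
#rank predictors by mean decrease in accuracy and by mean decrease in Gini
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]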

Question 2

In this section, we compare the accuracy rates of logistic regression (with and without stepwise selection), KNN, Random Forest, the Random Forest built using rpart, and SVM to see which has the lowest classification error.

Logistic Regression

#confusion matrices: classify as 1 when the predicted probability exceeds 0.5
log.prediction.rd <- ifelse(log.predict > 0.5, 1, 0)
step.prediction <- ifelse(step.predict > .5, 1, 0)

table_2 <- table(log.prediction.rd, testbank$y)
table_3 <- table(step.prediction, testbank$y)

accuracylogreg <- sum(diag(table_2))/(sum(table_2))
accuracystep <- sum(diag(table_3))/(sum(table_3))

KNN

outcome_test <- data.frame(outcome_test)

class_comparison <- data.frame(pred_knn, outcome_test)
names(class_comparison) <- c("Predicted", "Observed")

CrossTable(x = class_comparison$Observed, y = class_comparison$Predicted, prop.chisq=FALSE, prop.c = FALSE, prop.r = FALSE, prop.t = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  12357 
## 
##  
##                           | class_comparison$Predicted 
## class_comparison$Observed |         0 |         1 | Row Total | 
## --------------------------|-----------|-----------|-----------|
##                         0 |     10625 |       320 |     10945 | 
## --------------------------|-----------|-----------|-----------|
##                         1 |      1078 |       334 |      1412 | 
## --------------------------|-----------|-----------|-----------|
##              Column Total |     11703 |       654 |     12357 | 
## --------------------------|-----------|-----------|-----------|
## 
## 
#accuracy from the KNN confusion table (diagonal / total)
knn_table <- table(class_comparison$Predicted, class_comparison$Observed)
accuracyknn <- sum(diag(knn_table))/sum(knn_table)

Random Forest

accuracyrf 
## [1] 0.8949583
accuracyrpart
## [1] 0.8947965

SVM

svm.table <- table(svm.pred$svm.pred, testbank$y)
accuracysvm <- sum(diag(svm.table))/(sum(svm.table))
labels <- c("Logistic Regression", "LogReg with Stepwise", "KNN", "Rpart", "Random Forest", "SVM")
values <- c(accuracylogreg, accuracystep, accuracyknn, accuracyrpart, accuracyrf, accuracysvm)

accuracytable <- data.frame("Model" = labels, "Accuracy Rate" = values)
accuracytable
##                  Model Accuracy.Rate
## 1  Logistic Regression     0.8913976
## 2 LogReg with Stepwise     0.8923687
## 3                  KNN     0.8897791
## 4                Rpart     0.8947965
## 5        Random Forest     0.8949583
## 6                  SVM     0.8935826

Based on the classification accuracy rates, the Random Forest built with randomForest has the highest classification accuracy rate, 0.8950, with the rpart model essentially tied at 0.8948. However, all of the models have very similar accuracy, so any of them would be appropriate for predicting whether a client will sign a long-term deposit.

However, SVM is very computationally expensive and KNN requires the entire data frame to be reformatted. So, in terms of ease of use, it makes the most sense to use one of the tree-based models; the model built with rpart is simple to fit and its accuracy is essentially tied with the best.

IV. Extra Credit

For the extra credit portion of this project, we employed support vector machines (SVM), as seen above. SVM identifies support vectors, the training observations that lie closest to the boundary between the two groups, and uses them to draw the hyperplane that best separates the groups. In a two-dimensional setting this boundary is just a line, but in higher dimensions it is harder to visualize.
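
In the separable case, the hyperplane SVM searches for can be written as the solution to a margin-maximization problem; a standard textbook formulation (not necessarily the exact objective the e1071 implementation optimizes, which also allows for misclassified points) is:

\[ \min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\left(w^{T}x_i + b\right) \geq 1, \quad i = 1, \ldots, n \]

The cost parameter in our code controls how heavily violations of these constraints are penalized when the two classes cannot be separated perfectly.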

SVM also has multiple tuning parameters that can be adjusted. We used 10-fold cross-validation through tune.svm to evaluate the model at our chosen cost and gamma values and select the best fitted model, which we then used to predict on the test set. Despite SVM being computationally expensive, especially the cross-validation portion, it was fascinating to read about and learn a new method from scratch and to implement code that did not come from lecture or discussion. It was a different and intellectually stimulating challenge, but a rewarding one. Figuring out how to debug and implement code from various sources for an original data question is an invaluable skill, and we are glad we got the opportunity to do that sort of work on this project.
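
Our call to tune.svm evaluated a single gamma and cost value; if one wanted the cross-validation to actually search over a grid, ranges can be supplied instead, roughly as in this sketch (the grid values here are arbitrary and the search is correspondingly slower):

#search a small grid of gamma and cost values with 10-fold cross validation
tuning.grid <- tune.svm(y~., data = trainbank,
                        gamma = 10^(-3:-1), cost = 10^(0:2),
                        tunecontrol = tune.control(cross = 10))
tuning.grid$best.parameters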

Bibliography

LaValley, M. P. (2008, May 6). Logistic regression. Circulation. Retrieved December 13, 2020, from https://www.ahajournals.org/doi/full/10.1161/CIRCULATIONAHA.106.682658

Pal, M. (2005). Random forest classifier for remote sensing classification. International Journal of Remote Sensing, 26(1), 217-222. https://doi.org/10.1080/01431160412331269698

Zhang, S., Li, X., Zong, M., Zhu, X., & Wang, R. (2018). Efficient kNN classification with different numbers of nearest neighbors. IEEE Transactions on Neural Networks and Learning Systems, 29(5), 1774-1785. https://doi.org/10.1109/TNNLS.2017.2673241