In our project, we use kNN, random forest, and SVM to determine which method produces the most accurate classifications. We also use logistic regression to identify the best predictors of the likelihood of making a bank deposit.
The kNN method is commonly used in statistics and data mining because it is both simple and effective. kNN can be used for classification and regression, and it rests on the assumption that similar observations lie close to one another. In both cases the input is the k closest training samples, but the outputs differ: in classification the output is a class membership, assigned by a vote of the nearest neighbors, while in regression the output is the average of the values of the k closest neighbors. In “Efficient kNN Classification with Different Numbers of Nearest Neighbors,” Shichao Zhang and his co-authors examine kNN’s flaws and ways to improve it. Traditionally, a fixed k value is used for all test samples. Researchers have tried to learn different k values using cross validation, but that approach proved too time consuming. To fix this, the authors propose adding a training stage that builds a kTree and yields optimal k values for each test sample, while still maintaining comparable accuracy. In the study, the kTree method increased classification accuracy by about 4% on average.
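As a quick illustration of that classification/regression distinction, here is a toy sketch in R (using the class package on simulated data, not our bank data): the classification output is a voted class label, and the regression output is an average over the nearest neighbors.

#toy kNN sketch: 20 simulated training points with 2 features
library(class)
set.seed(1)
train_x <- matrix(rnorm(40), ncol = 2)
train_class <- factor(rep(c("A", "B"), each = 10))
test_x <- matrix(rnorm(10), ncol = 2)
#classification: each test point takes the majority class of its 3 nearest neighbors
knn(train = train_x, test = test_x, cl = train_class, k = 3)
#regression flavour: predict the first test point as the average outcome of its 3 nearest neighbors
train_y <- rnorm(20)
dists <- sqrt(rowSums((train_x - matrix(test_x[1, ], nrow = 20, ncol = 2, byrow = TRUE))^2))
mean(train_y[order(dists)[1:3]])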
In the article “Random Forest Classifier for Remote Sensing Classification,” Mahesh Pal compares random forest classification and SVMs to test which is more effective and accurate. First, he uses random forest classification, a method commonly used in data mining because of its simple application. A random forest is a combination of decision trees (tree classifiers) built at training time; for classification the output is the class chosen by the most trees, and for regression it is the average prediction. The study used the bagging method to create the training datasets for the individual trees. To design the decision trees, he could use either a pruning method or an attribute selection measure; he chose attribute selection because pruning methods affect the performance of tree-based classifiers. He obtained 88.37% accuracy with 100 trees. When he increased the number of trees to 1,200, he got 88.02% accuracy, which demonstrates that random forest classification is not sensitive to overfitting and makes it an attractive method as well. Next, the author used a support vector machine (SVM), a method commonly used in machine learning. SVM is based on statistical learning theory and aims to locate the decision boundaries that create the best separation of classes. The model is best suited to problems with only two classes; it can be applied to multi-class data either as one-against-the-rest or by choosing pairs of classes and doing one-against-one. In this study he used the one-against-the-rest approach and obtained 87.9% accuracy. For this dataset there is essentially no difference between random forest and SVM accuracy, so either could be used with decent results.
Logistic regression is a method commonly used in statistics. In the article “Logistic Regression,” Michael P. LaValley explains that logistic regression is the analysis of binary outcomes, that is, outcomes with two mutually exclusive levels. It is especially useful because it allows for continuous or categorical predictors as well as multiple predictors. Logistic regression is often compared with linear regression to show where it is better suited: with a binary outcome, linear regression can produce predicted values above 1 or below 0, and it assumes constant variability, which does not hold for a binary outcome. Thus logistic regression is the better choice in our setting.
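A small simulated example (not our bank data) makes that point concrete: fitting both models to a binary outcome shows linear regression predicting outside the 0-1 range while logistic regression stays inside it.

#toy illustration: binary outcome generated from a logistic model
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(2 * x))
lin <- lm(y ~ x)
log_fit <- glm(y ~ x, family = "binomial")
range(predict(lin))                         #can fall below 0 or above 1
range(predict(log_fit, type = "response"))  #always stays within (0, 1)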
In conclusion, we found that all of these methods produced similar results in terms of classification accuracy, but the method that gave us the best results was the tree model built using rpart.
Our project seeks to answer two questions. The first is which predictors are the most important for predicting whether an individual will sign a long-term deposit. The second is which binary classification method most accurately classifies who will sign a deposit. The first step is to clean the bank dataset. After reading the documentation on the data, we determined that columns 11-15 are unnecessary to our analysis. Variable 11, duration, is highly correlated with the outcome variable: if duration equals zero, the outcome is always no. Because of this relationship, we removed duration. Variables 12-15 pertain to the previous campaign; since that information is not relevant to the current campaign, we removed those columns as well. Finally, we omitted any rows with incomplete data and recoded the outcome variable as 1 for “yes” and 0 for “no”.
#drop columns 11-15 (duration and the previous-campaign variables)
bank <- bank[,-(11:15)]
#remove rows with missing values
bank <- na.omit(bank)
#recode the outcome as a binary factor: 1 = "yes", 0 = "no"
bank$y <- ifelse(bank$y == "yes", 1, 0)
bank$y <- as.factor(bank$y)
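A quick sanity check we could add after cleaning is to confirm how many observations remain and how imbalanced the outcome is:

#dimensions after cleaning and the proportion of yes/no outcomes
dim(bank)
prop.table(table(bank$y))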
Next, we created the training and test samples.
trainbankindices <- sample(1:nrow(bank), .7*nrow(bank))
trainbank <- bank[trainbankindices,]
testbank <- bank[-trainbankindices,]
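For reproducibility, a seed could be set immediately before the sampling step above so the same 70/30 split is produced on every run; the value below is arbitrary.

#fix the RNG seed so the train/test split is reproducible
set.seed(2020)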
In order to determine which predictors are the most important for classification, we used tree-based classification; random forests in particular are very useful for identifying feature importance. We used two different packages that build tree classifiers, “rpart” and “randomForest”.
#build models (rpart and randomForest packages)
rpartmodel <- rpart(y~., data = trainbank, method = 'class')
randomforestmodel <- randomForest(y~., data = trainbank, importance = T)
#predict using test data
prediction <- predict(rpartmodel, testbank)   #class probabilities for each test case
treeT <- table(predict(rpartmodel, testbank, type = 'class'), testbank$y)   #confusion table of predicted vs observed
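To see which splits the rpart tree actually makes, the rpart.plot package (an extra dependency, not loaded elsewhere in our script) provides a one-line visualization:

#optional: plot the fitted classification tree (requires the rpart.plot package)
library(rpart.plot)
rpart.plot(rpartmodel)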
In order to find the best classification method, we compared the classification accuracy of Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, and Support Vector Machines (SVM).
The first method we used is the simplest: logistic regression. It is a good starting point for binary classification, as it is inexpensive to run and more easily interpretable than other “black box” prediction algorithms. When running the logistic regression, we can see the individual variables and their significance in the output of the regression. This gives us an idea of which customer characteristics matter when predicting whether they sign a deposit, instead of just handing us an accuracy rate and a confusion matrix.
\[ log\left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 + \ldots + \beta_{n}X_n \]
We then used a stepwise variable selection process to narrow down the predictors in the regression and interpret which factors matter most for accurately classifying the response variable. The stepAIC algorithm considers moves in both directions by default, rather than only adding or only removing predictors; because we started from the full initial model, the selection effectively works backwards, dropping predictors that do not improve the AIC.
#build models (stepAIC is from the MASS package; %>% is the magrittr/dplyr pipe)
logreg <- glm(y~., data = trainbank, family = 'binomial')
logreg_steps <- glm(y~., data = trainbank, family = 'binomial') %>% stepAIC(trace = F)
step.model <- stepAIC(logreg, trace = FALSE)   #equivalent stepwise fit starting from logreg
#prediction on test data
log.predict <- predict(logreg, newdata = testbank, type = 'response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
step.predict <- predict(logreg_steps, newdata = testbank, type = 'response')
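Because the model is fit on the log-odds scale, exponentiating a coefficient gives an odds ratio, which is the easiest way to read off which customer characteristics matter; a small addition to the code above shows both the significance tests and the odds ratios for the stepwise-selected model.

#coefficient estimates and significance for the stepwise-selected model
summary(logreg_steps)
#odds ratios for the retained predictors
exp(coef(logreg_steps))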
Before building the KNN model, we had to reformat the dataframe. First, the outcome variable was separated into its own object. Then the numeric columns were scaled, and the categorical variables were converted into binary (dummy) variables.
knnbank = bank
#outcome
outcome <- knnbank %>% dplyr::select(y)
knnbank <- knnbank %>% dplyr::select(-y)
##numeric scaling
knnbank[,c("age","emp.var.rate", "cons.price.idx", "euribor3m", "nr.employed")] <- scale(knnbank[,c("age","emp.var.rate", "cons.price.idx", "euribor3m", "nr.employed")])
## binary variables
knnbank$contact <- ifelse(knnbank$contact == "cellular", 1, 0)
##recode categorical predictors as dummy variables (dummy.code is from the psych package)
knnbank$job <- as.data.frame(dummy.code(knnbank$job))
knnbank$marital <- as.data.frame(dummy.code(knnbank$marital))
knnbank$education <- as.data.frame(dummy.code(knnbank$education))
knnbank$default <- as.data.frame(dummy.code(knnbank$default))
knnbank$housing <- as.data.frame(dummy.code(knnbank$housing))
knnbank$loan <- as.data.frame(dummy.code(knnbank$loan))
knnbank$month <- as.data.frame(dummy.code(knnbank$month))
knnbank$day_of_week <- as.data.frame(dummy.code(knnbank$day_of_week))
Because the data had to be reformatted, we created new test and training samples.
smp_size <- floor(0.7 * nrow(knnbank))
train_ind <- sample(seq_len(nrow(knnbank)), size = smp_size)
knn_train <- knnbank[train_ind, ]
knn_test <- knnbank[-train_ind, ]
outcome_train <- outcome[train_ind, ]
outcome_test <- outcome[-train_ind, ]
After using logistic regression as a baseline, we turned toward more advanced machine learning methods, starting with K-Nearest Neighbors (KNN). Although our dataset is multivariate and not easily visualized the way KNN examples ideally are, KNN is computationally inexpensive to try and gave us another baseline for our accuracy rates.
pred_knn <- knn(train = knn_train, test = knn_test, cl = outcome_train, k=20)
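We used k = 20; since the Zhang et al. article stresses that the choice of k matters, a rough sensitivity check is to refit over a few values of k and compare test accuracy. This sketch reuses the objects above, and the candidate k values are illustrative.

#quick check of how sensitive accuracy is to the choice of k
for (k in c(5, 10, 20, 40)) {
  pred_k <- knn(train = knn_train, test = knn_test, cl = outcome_train, k = k)
  cat("k =", k, " accuracy =", round(mean(pred_k == outcome_test), 4), "\n")
}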
We also included random forest in the comparison: since it can be used for both regression and classification, it is a valid method for this problem. The steps to create these models are shown above; we used both ‘rpart’ and ‘randomForest’.
Lastly, we wanted to try a method we had not seen in class before and decided on support vector machines (SVM), a method designed almost exclusively for binary classification problems. SVM uses support vectors, the training points lying closest to the boundary between the two groups, to draw the best separating hyperplane between them. In a two-dimensional setting this is just a line, but in higher dimensions it is harder to visualize.
Although SVM is computationally expensive, it was fascinating to read about and learn a new method from scratch and to implement code that did not come from lecture or discussion at all. It was a different and intellectually stimulating challenge for us, but a rewarding one. Figuring out how to debug and implement code from various sources for an original data question is an invaluable skill, and we are glad we got the opportunity to do that kind of work on this project.
#build model (tune.svm and tune.control are from the e1071 package)
tuning.para <- tune.svm(y~., data = trainbank,
                        gamma = 10^-2, cost = 10,
                        tunecontrol = tune.control(cross = 10))
svm.model <- tuning.para$best.model
#predict model
svm.pred <- predict(svm.model, testbank)
svm.pred <- as.data.frame(svm.pred)
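Our tuning call fixes gamma and cost at single values; tune.svm can also search over grids of both, at the price of a much longer run time on this dataset. The grid values below are illustrative, not the settings we actually used.

#broader (and slower) hyperparameter search over gamma and cost
tuning.grid <- tune.svm(y~., data = trainbank,
                        gamma = 10^(-3:-1), cost = 10^(0:2),
                        tunecontrol = tune.control(cross = 5))
summary(tuning.grid)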
In order to answer our first question, we used the randomForest and rpart packages to build the tree-based models shown above.
#Rpart
accuracyrpart <- (treeT[1,1]+treeT[2,2])/(sum(treeT))
printcp(rpartmodel)
##
## Classification tree:
## rpart(formula = y ~ ., data = trainbank, method = "class")
##
## Variables actually used in tree construction:
## [1] contact emp.var.rate nr.employed
##
## Root node error: 3257/28831 = 0.11297
##
## n= 28831
##
## CP nsplit rel error xerror xstd
## 1 0.016273 0 1.00000 1.0000 0.016503
## 2 0.010000 3 0.95118 0.9521 0.016152
Rpart used the variables contact, emp.var.rate, and nr.employed to construct the tree; these are the variables the algorithm identified as most important. They correspond to how the customer was contacted, the employment variation rate, and the number of employees the bank currently has.
#RandomForest
varImpPlot(randomforestmodel)
treeT1 <- table(predict(randomforestmodel, testbank, type = 'class'), testbank$y)
accuracyrf <- (treeT1[1,1]+treeT1[2,2])/(sum(treeT1))
In contrast, using randomForest, the five most important variables by mean decrease in accuracy are age, euribor3m, day_of_week, education, and nr.employed, while the variables ranked highest by mean decrease in Gini (node purity) are euribor3m, age, nr.employed, job, education, and day_of_week. Although the orderings differ and job appears only in the second list, the two rankings largely overlap, so we conclude that age, euribor3m, day_of_week, education, and nr.employed are the most important predictors of the outcome variable.
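The same rankings can also be read numerically instead of from the plot, using the importance() accessor from randomForest (a small addition to the code above):

#numeric importance scores, sorted by mean decrease in accuracy
imp <- importance(randomforestmodel)
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]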
In this section, we compare the accuracy rates of the logistic regression models, KNN, the random forest, the rpart tree, and SVM to see which has the lowest classification error.
#confusion matrix
log.prediction.rd <- ifelse(log.predict > 0.5, 1, 0)
step.prediction <- ifelse(step.predict > .5, 1, 0)
table_2 <- table(log.prediction.rd, testbank$y)
table_3 <- table(step.prediction, testbank$y)
accuracylogreg <- sum(diag(table_2))/(sum(table_2))
accuracystep <- sum(diag(table_3))/(sum(table_3))
outcome_test <- data.frame(outcome_test)
class_comparison <- data.frame(pred_knn, outcome_test)
names(class_comparison) <- c("Predicted", "Observed")
#cross-tabulate observed vs predicted classes (CrossTable is from the gmodels package)
CrossTable(x = class_comparison$Observed, y = class_comparison$Predicted, prop.chisq=FALSE, prop.c = FALSE, prop.r = FALSE, prop.t = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## |-------------------------|
##
##
## Total Observations in Table: 12357
##
##
## | class_comparison$Predicted
## class_comparison$Observed | 0 | 1 | Row Total |
## --------------------------|-----------|-----------|-----------|
## 0 | 10675 | 289 | 10964 |
## --------------------------|-----------|-----------|-----------|
## 1 | 1066 | 327 | 1393 |
## --------------------------|-----------|-----------|-----------|
## Column Total | 11741 | 616 | 12357 |
## --------------------------|-----------|-----------|-----------|
##
##
#compute KNN accuracy directly from the predictions rather than hard-coding cell counts
accuracyknn <- mean(class_comparison$Predicted == class_comparison$Observed)
accuracyrf
## [1] 0.8951202
accuracyrpart
## [1] 0.8959294
svm.table <- table(svm.pred$svm.pred, testbank$y)
accuracysvm <- sum(diag(svm.table))/(sum(svm.table))
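The accuracy calculation from a confusion table is repeated several times above; a small helper function makes the pattern explicit (a stylistic sketch, not part of our original script).

#accuracy from a confusion table = proportion of cases on the diagonal
table_accuracy <- function(tab) sum(diag(tab)) / sum(tab)
table_accuracy(svm.table)   #reproduces accuracysvm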
labels <- c("Logisitic Regression", "LogReg with Stepwise", "KNN", "Rpart", "Random Forest", "SVM")
values <- c(accuracylogreg, accuracystep, accuracyknn,accuracyrpart, accuracyrf, accuracysvm)
accuracytable <- data.frame("Model" = labels, "Accuracy Rate" = values)
accuracytable
## Model Accuracy.Rate
## 1 Logistic Regression 0.8938254
## 2 LogReg with Stepwise 0.8920450
## 3 KNN 0.8897791
## 4 Rpart 0.8959294
## 5 Random Forest 0.8951202
## 6 SVM 0.8956057
Based on the classification accuracy rates, the tree built using rpart has the highest accuracy, at roughly 0.896. However, all of the models have very similar accuracy, so any of them would be appropriate for predicting whether a client will sign a long-term deposit.
However, SVM is very computationally expensive, and KNN requires the entire dataframe to be reformatted. So, in terms of ease of use, it makes the most sense to use the rpart tree, since it also performs the best.
LaValley, M. P. (2008). Logistic regression. Circulation. Retrieved December 13, 2020, from https://www.ahajournals.org/doi/full/10.1161/CIRCULATIONAHA.106.682658
Pal, M. (2005). Random forest classifier for remote sensing classification. International Journal of Remote Sensing, 26(1), 217-222. doi: 10.1080/01431160412331269698
Zhang, S., Li, X., Zong, M., Zhu, X., & Wang, R. (2018). Efficient kNN classification with different numbers of nearest neighbors. IEEE Transactions on Neural Networks and Learning Systems, 29(5), 1774-1785. doi: 10.1109/TNNLS.2017.2673241