Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework.
Based on articles:
Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise.
(1) Which algorithm is recommended to get more accurate results?
(2) Is it better for classification or regression scenarios?
(3) Do you agree with the recommendations? Why?
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.1.3
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom 0.7.12 v rsample 0.1.1
## v dials 0.1.1 v tune 0.2.0
## v infer 1.0.0 v workflows 0.2.6
## v modeldata 0.1.1 v workflowsets 0.2.1
## v parsnip 0.2.1 v yardstick 0.0.9
## v recipes 0.2.0
## Warning: package 'dials' was built under R version 4.1.3
## Warning: package 'infer' was built under R version 4.1.3
## Warning: package 'modeldata' was built under R version 4.1.3
## Warning: package 'parsnip' was built under R version 4.1.3
## Warning: package 'rsample' was built under R version 4.1.3
## Warning: package 'tune' was built under R version 4.1.3
## Warning: package 'workflows' was built under R version 4.1.3
## Warning: package 'workflowsets' was built under R version 4.1.3
## Warning: package 'yardstick' was built under R version 4.1.3
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Search for functions across packages at https://www.tidymodels.org/find/
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
library(e1071)
## Warning: package 'e1071' was built under R version 4.1.3
##
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
##
## tune
## The following object is masked from 'package:rsample':
##
## permutations
## The following object is masked from 'package:parsnip':
##
## tune
library(car)
## Warning: package 'car' was built under R version 4.1.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
data <- read.csv("https://raw.githubusercontent.com/jconno/R-data/master/Churn_Modelling.csv")
bank.churn <- data
# Removing redundant variables; id, etc.
bank.churn <- bank.churn %>% select(Exited,CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary)
bank.churn[c("Exited", "Geography", "Gender", "Tenure", "NumOfProducts" ,"IsActiveMember")] <- lapply(bank.churn[c("Exited", "Geography", "Gender", "Tenure", "NumOfProducts" ,"IsActiveMember")], factor)
summary(bank.churn)
## Exited CreditScore Geography Gender Age
## 0:7963 Min. :350.0 France :5014 Female:4543 Min. :18.00
## 1:2037 1st Qu.:584.0 Germany:2509 Male :5457 1st Qu.:32.00
## Median :652.0 Spain :2477 Median :37.00
## Mean :650.5 Mean :38.92
## 3rd Qu.:718.0 3rd Qu.:44.00
## Max. :850.0 Max. :92.00
##
## Tenure Balance NumOfProducts HasCrCard IsActiveMember
## 2 :1048 Min. : 0 1:5084 Min. :0.0000 0:4849
## 1 :1035 1st Qu.: 0 2:4590 1st Qu.:0.0000 1:5151
## 7 :1028 Median : 97199 3: 266 Median :1.0000
## 8 :1025 Mean : 76486 4: 60 Mean :0.7055
## 5 :1012 3rd Qu.:127644 3rd Qu.:1.0000
## 3 :1009 Max. :250898 Max. :1.0000
## (Other):3843
## EstimatedSalary
## Min. : 11.58
## 1st Qu.: 51002.11
## Median :100193.91
## Mean :100090.24
## 3rd Qu.:149388.25
## Max. :199992.48
##
No missing values are found, so no imputation is necessary.
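The absence of `NA` rows in the summary can also be confirmed with a direct count. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical and only stand in for `bank.churn`):

```r
# Hypothetical toy data with two deliberately missing entries
toy <- data.frame(
  CreditScore = c(619, 608, NA, 699),
  Age         = c(42, 41, 42, 39),
  Geography   = c("France", "Spain", "France", NA)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Single TRUE/FALSE answer for the whole data frame
any_missing <- anyNA(toy)
print(any_missing)
```

Running `colSums(is.na(bank.churn))` on the real data should return zero for every column if the statement above holds.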
# Categorical: plot.geog, plot.gender, plot.tenure, plot.numProd, plot.active, plot.crCard
# Quantitative: cred score, age, balance, salary
plot.cred.score <- ggplot(bank.churn, aes(x = CreditScore, color = Exited)) + geom_density(na.rm = TRUE, bw = 0.3) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.geog <- ggplot(bank.churn, aes(x = Geography, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.gender <- ggplot(bank.churn, aes(x = Gender, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.age <- ggplot(bank.churn, aes(x = Age, color = Exited)) + geom_density(na.rm = TRUE, bw = 0.3) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.tenure <- ggplot(bank.churn, aes(x = Tenure, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.balance <- ggplot(bank.churn, aes(x = Balance, color = Exited)) + geom_density(na.rm = TRUE, bw = 0.3) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.numProd <- ggplot(bank.churn, aes(x = NumOfProducts, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.crCard <- ggplot(bank.churn, aes(x = HasCrCard, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.active <- ggplot(bank.churn, aes(x = IsActiveMember, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
plot.est <- ggplot(bank.churn, aes(x = EstimatedSalary, color = Exited)) + geom_density(na.rm = TRUE, bw = 0.3) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
ggarrange(plot.cred.score, plot.age, plot.balance, plot.est + rremove("x.text"), ncol = 1, nrow = 4)
# Categorical: plot.geog, plot.gender, plot.tenure, plot.numProd, plot.active, plot.crCard,
ggarrange(plot.geog, plot.gender, plot.tenure, plot.numProd, plot.active, plot.crCard + rremove("x.text"), ncol = 2, nrow = 3)
# cred score, age, balance, salary
corrplot::corrplot(cor(bank.churn[c("CreditScore", "Age", "Balance", "EstimatedSalary")], use = "na.or.complete"), method = "number", type = "upper")
data.model <- bank.churn
glm1 <- glm(Exited ~. -Tenure -IsActiveMember, family = binomial, data.model)
head(data.model)
## Exited CreditScore Geography Gender Age Tenure Balance NumOfProducts
## 1 1 619 France Female 42 2 0.00 1
## 2 0 608 Spain Female 41 1 83807.86 1
## 3 1 502 France Female 42 8 159660.80 3
## 4 0 699 France Female 39 1 0.00 2
## 5 0 850 Spain Female 43 2 125510.82 1
## 6 1 645 Spain Male 44 8 113755.78 2
## HasCrCard IsActiveMember EstimatedSalary
## 1 1 1 101348.88
## 2 0 1 112542.58
## 3 1 0 113931.57
## 4 0 0 93826.63
## 5 1 1 79084.10
## 6 1 0 149756.71
marginalModelPlots(glm1, ~ CreditScore + Age + Balance + EstimatedSalary + Geography + Gender + NumOfProducts + HasCrCard, layout = c(2,3))
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
## A term has fewer unique covariate combinations than specified maximum degrees of freedom
Transformations of CreditScore and Balance could help the model. Having a credit card (HasCrCard) does not register in the plots.
The distribution of Age appears normal for both values of Exited (0 and 1).
# If the variance of a variable differs between the response classes, then a squared term should be added.
data.frame(Age.variance.E0 = c(var(data.model$Age[data.model$Exited == 0])),
Age.variance.E1 = c(var(data.model$Age[data.model$Exited == 1])))
## Age.variance.E0 Age.variance.E1
## 1 102.523 95.28808
var.test(Age ~ Exited, data.model, alternative = "two.sided")
##
## F test to compare two variances
##
## data: Age by Exited
## F = 1.0759, num df = 7962, denom df = 2036, p-value = 0.03925
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.003625 1.151795
## sample estimates:
## ratio of variances
## 1.075926
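The F statistic reported by `var.test` is the ratio of the two group variances, and the two-sided p-value follows from the F distribution with the listed degrees of freedom. A minimal sketch that recomputes both from the rounded variances printed above (because of the rounding, the last digits may differ slightly from the test output):

```r
# Group variances of Age as printed above (rounded values)
var_e0 <- 102.523    # Exited == 0
var_e1 <- 95.28808   # Exited == 1

# Degrees of freedom from the var.test output
df_num <- 7962
df_den <- 2036

# F statistic: ratio of the two sample variances
f_stat <- var_e0 / var_e1

# Two-sided p-value: twice the smaller tail probability
p_lower <- pf(f_stat, df_num, df_den)
p_two_sided <- 2 * min(p_lower, 1 - p_lower)

print(c(F = f_stat, p = p_two_sided))
```

With a p-value below 0.05, the test rejects equal variances at the 5% level, which is the motivation for the squared Age term.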
data.model$Age_squared <- data.model$Age^2
data.model$CreditScore_log <- log(data.model$CreditScore)
# Balance contains zeros (see the summary above), so log1p() is used to avoid -Inf from log(0)
data.model$Balance_log <- log1p(data.model$Balance)
glm.full <- glm(Exited ~. - Tenure, family = binomial, bank.churn)
summary(glm.full)
##
## Call:
## glm(formula = Exited ~ . - Tenure, family = binomial, data = bank.churn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4896 -0.5848 -0.3615 -0.1789 3.2559
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.897e+00 2.474e-01 -11.709 <2e-16 ***
## CreditScore -6.951e-04 3.032e-04 -2.292 0.0219 *
## GeographyGermany 9.502e-01 7.292e-02 13.032 <2e-16 ***
## GeographySpain 6.073e-02 7.602e-02 0.799 0.4243
## GenderMale -5.236e-01 5.896e-02 -8.881 <2e-16 ***
## Age 7.121e-02 2.760e-03 25.797 <2e-16 ***
## Balance -7.080e-07 5.696e-07 -1.243 0.2139
## NumOfProducts2 -1.547e+00 7.121e-02 -21.722 <2e-16 ***
## NumOfProducts3 2.566e+00 1.798e-01 14.271 <2e-16 ***
## NumOfProducts4 1.632e+01 1.756e+02 0.093 0.9260
## HasCrCard -6.432e-02 6.413e-02 -1.003 0.3159
## IsActiveMember1 -1.101e+00 6.227e-02 -17.684 <2e-16 ***
## EstimatedSalary 4.338e-07 5.134e-07 0.845 0.3981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 7432.4 on 9987 degrees of freedom
## AIC: 7458.4
##
## Number of Fisher Scoring iterations: 14
rmse <- function(error){
sqrt(mean(error^2))
}
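# glm.full$residuals holds working residuals; use response-scale residuals (observed - fitted probability)
error <- residuals(glm.full, type = "response")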
predictionRMSE <- rmse(error)
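For a binomial `glm`, the `$residuals` component stores working residuals, which are on a different scale from the difference between the observed 0/1 outcome and the fitted probability. A minimal sketch on simulated data (the variables `x`, `y`, and `fit` are hypothetical and unrelated to the churn data) showing how the two RMSE values differ:

```r
set.seed(42)

# Simulated binary outcome driven by one predictor
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))
fit <- glm(y ~ x, family = binomial)

rmse <- function(error) sqrt(mean(error^2))

# Response-scale error: observed outcome minus fitted probability
resp_err <- y - fitted(fit)
rmse_response <- rmse(resp_err)

# Working residuals rescale that error by 1 / (mu * (1 - mu)),
# so their RMSE is much larger and not comparable to a probability error
rmse_working <- rmse(fit$residuals)

print(c(response = rmse_response, working = rmse_working))
```

The response-scale value is bounded by 1 and is the one that can be compared against other classifiers' probability errors.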
# SVM
svm1 <- svm(Exited~., bank.churn)
summary(svm1)
##
## Call:
## svm(formula = Exited ~ ., data = bank.churn)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 3684
##
## ( 1793 1891 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
bank.train = bank.churn %>% sample_frac(0.5)
svmfit <- svm(Exited~., data = bank.train, kernel = "radial", gamma = 1, cost =1e5)
summary(svmfit)
##
## Call:
## svm(formula = Exited ~ ., data = bank.train, kernel = "radial", gamma = 1,
## cost = 1e+05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1e+05
##
## Number of Support Vectors: 4155
##
## ( 952 3203 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
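The 50% sample drawn with `sample_frac` leaves the other half of the rows unused, so nothing is held out for evaluation. A minimal sketch of an index-based split that keeps the complementary rows as a test set, shown on simulated data (the `toy` data frame, its columns, and the cost/gamma values are assumptions for illustration only):

```r
library(e1071)

set.seed(1)

# Simulated two-class data with a non-linear boundary
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Exited <- factor(as.integer(toy$x1^2 + toy$x2^2 > 1.5))

# Split by row index so the test set is exactly the rows not used for training
train_idx <- sample(seq_len(n), size = floor(0.5 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Fit on the training half only
fit <- svm(Exited ~ ., data = toy_train, kernel = "radial", gamma = 1, cost = 1)

# Evaluate on rows the model has never seen
conf_test <- table(true = toy_test$Exited, pred = predict(fit, newdata = toy_test))
test_accuracy <- sum(diag(conf_test)) / sum(conf_test)

print(conf_test)
print(test_accuracy)
```

Calling `set.seed` before the split, rather than after it, also makes the sampled rows reproducible.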
set.seed(1)
tune_out = tune(svm, Exited~., data = bank.train, kernel = "radial", ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5, 1, 2, 3, 4)))
best_model = tune_out$best.model
summary(best_model)
##
## Call:
## best.tune(method = svm, train.x = Exited ~ ., data = bank.train,
## ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5,
## 1, 2, 3, 4)), kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 2856
##
## ( 921 1935 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
table(true = bank.train$Exited, pred = predict(tune_out$best.model, newdata = bank.train))
## pred
## true 0 1
## 0 4007 32
## 1 331 630
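The confusion matrix above can be summarized as accuracy, sensitivity, and specificity. Note that the predictions were made on `bank.train`, the same data the model was tuned on, so these are in-sample figures and likely optimistic. A short sketch using the counts exactly as printed:

```r
# Counts copied from the table above (rows = true class, columns = predicted class)
conf <- matrix(c(4007, 331, 32, 630), nrow = 2,
               dimnames = list(true = c("0", "1"), pred = c("0", "1")))

# Share of all customers classified correctly
accuracy <- sum(diag(conf)) / sum(conf)

# Share of actual churners (Exited == 1) that were caught
sensitivity <- conf["1", "1"] / sum(conf["1", ])

# Share of non-churners (Exited == 0) correctly left alone
specificity <- conf["0", "0"] / sum(conf["0", ])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

This gives an accuracy of 0.9274, a sensitivity of 0.6556, and a specificity of 0.9921: the model misses roughly a third of the churners even on the data it was trained on.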
While working on this assignment, it quickly became clear to me that decision trees produce results efficiently. SVM, on the other hand, is cumbersome and slow to produce results, particularly with “tune” from the e1071 package. On speed alone, SVM could easily be dismissed as too slow. By the same token, however, decision trees can produce inaccurate results and often do not perform well on a complicated data set.
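Much of the tuning time comes from the size of the search: the grid used earlier has 5 x 5 = 25 cost/gamma combinations, and `tune.control` defaults to 10-fold cross-validation, which means roughly 250 SVM fits. A minimal sketch of a cheaper search with a smaller grid and fewer folds, on simulated data (the `toy` data, the grid values, and the choice of 3 folds are assumptions, not recommendations):

```r
library(e1071)

set.seed(1)

# Simulated two-class data
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- factor(as.integer(toy$x1 + toy$x2 > 0))

# 2 x 2 grid and 3 folds: 12 cross-validation fits instead of 250
small_grid <- list(cost = c(1, 10), gamma = c(0.5, 1))

timing <- system.time(
  tuned <- e1071::tune(svm, y ~ ., data = toy, kernel = "radial",
                       ranges = small_grid,
                       tunecontrol = tune.control(sampling = "cross", cross = 3))
)

print(tuned$best.parameters)
print(timing[["elapsed"]])
```

A coarse search like this can locate a promising region, after which a finer grid around the best values can be run on the full data.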
According to [this article published in the Hindawi journal Complexity](https://www.hindawi.com/journals/complexity/2021/5550344/), the researchers applied standard decision tree ensembles to imbalanced data sets, in this case results from a COVID-19 lab. The best result, measured by AUROC, was produced by random forest, but overall the authors report that decision tree ensembles produced the best results for imbalanced datasets. While random forests are related to decision trees, they are not the same; random forest models are not as quick as single decision trees and are therefore more computationally expensive.
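AUROC is a common choice for imbalanced data because accuracy alone can look good for a model that never predicts the minority class. A minimal base R sketch using the rank (Mann-Whitney) formulation of AUROC; the `auroc` helper and the example scores are hypothetical and not taken from the article:

```r
# AUROC via ranks: probability that a random positive scores above a random negative
auroc <- function(scores, labels) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks, so ties count as half
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Small worked example: three of the four positive/negative pairs are ordered correctly
auc_small <- auroc(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))

# Imbalanced case (80% negatives): a constant score always predicts class 0
labels_imb <- c(rep(0, 8), rep(1, 2))
const_scores <- rep(0.2, 10)
acc_majority <- mean(labels_imb == 0)              # accuracy of "always 0"
auc_majority <- auroc(const_scores, labels_imb)    # no ranking ability

print(c(auc_small = auc_small, acc_majority = acc_majority, auc_majority = auc_majority))
```

The constant predictor reaches 0.8 accuracy but only 0.5 AUROC, which mirrors the roughly 80/20 split of Exited in the churn data.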
In [this article published in PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/), the authors highlight the use of support vector machines (SVM) and their usefulness in classification, particularly in helping cardiologists improve diagnostics for patients. Compared to other machine learning models, SVM produces results with accuracies ranging from 57.85% to 91.83%.
SVM is effective not only in the medical field but also in the realm of cyber security. In [this paper published by the IEEE](https://ieeexplore.ieee.org/abstract/document/9142183), researchers developed an SVM model for detecting DDoS attacks in software-defined networks. SVM is used as the main classifier for predicting malicious traffic, particularly for attack classification, because, at least according to these researchers, SVM is the most effective and accurate classification method.
Another use of SVM in cyber security is exemplified in [this paper, which discusses network intrusion detection systems, or NIDS, and the role of SVM](https://ieeexplore.ieee.org/abstract/document/8463474). Among the many shallow machine learning methods available, such as random forest, naïve Bayes, and SVM, SVM was found to be the best. Unsupervised learning methods, in turn, are often used for feature extraction and representation, to improve the quality of the data and the effectiveness of classification. The researchers in this case developed a self-taught, deep-learning-based intrusion detection system that uses SVM for network intrusion detection. For both binary and multiclass classification, this system outperformed the other classification methods and intrusion detection approaches it was compared against.
In addition to medicine and cyber security, SVM is also used for [rock stars](https://www.mdpi.com/2071-1050/12/6/2229), in the context of predicting the brittleness of rocks. Like the researchers mentioned above, these rockers found that SVM models produced the most accurate results, and all four SVM model variations indicated the same least impactful feature, allowing for clear elimination of that feature and thus better predictability.
In conclusion, SVM seems to reign in its effectiveness across a plethora of industries, not limited to the ones discussed in this essay. Although some claim that in the business world it is sometimes better to make a quick decision than a good one, SVM shows that it is worthwhile to take the time to build an effective model and make a good decision.