Data 622, Homework 3

Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework.

Based on articles:

1: Decision Tree
2: SVM

Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise.

(1) Which algorithm is recommended to get more accurate results?

(2) Is it better for classification or regression scenarios?

(3) Do you agree with the recommendations? Why?

Libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.1.3
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom        0.7.12     v rsample      0.1.1 
## v dials        0.1.1      v tune         0.2.0 
## v infer        1.0.0      v workflows    0.2.6 
## v modeldata    0.1.1      v workflowsets 0.2.1 
## v parsnip      0.2.1      v yardstick    0.0.9 
## v recipes      0.2.0
## Warning: package 'dials' was built under R version 4.1.3
## Warning: package 'infer' was built under R version 4.1.3
## Warning: package 'modeldata' was built under R version 4.1.3
## Warning: package 'parsnip' was built under R version 4.1.3
## Warning: package 'rsample' was built under R version 4.1.3
## Warning: package 'tune' was built under R version 4.1.3
## Warning: package 'workflows' was built under R version 4.1.3
## Warning: package 'workflowsets' was built under R version 4.1.3
## Warning: package 'yardstick' was built under R version 4.1.3
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Search for functions across packages at https://www.tidymodels.org/find/
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.3
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
library(e1071)
## Warning: package 'e1071' was built under R version 4.1.3
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
## 
##     tune
## The following object is masked from 'package:rsample':
## 
##     permutations
## The following object is masked from 'package:parsnip':
## 
##     tune
library(car)
## Warning: package 'car' was built under R version 4.1.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some

Data

data <- read.csv("https://raw.githubusercontent.com/jconno/R-data/master/Churn_Modelling.csv")

bank.churn <- data

# Removing redundant variables; id, etc. 

bank.churn <- bank.churn %>% select(Exited,CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary)


bank.churn[c("Exited", "Geography", "Gender", "Tenure", "NumOfProducts" ,"IsActiveMember")] <- lapply(bank.churn[c("Exited", "Geography", "Gender", "Tenure", "NumOfProducts" ,"IsActiveMember")], factor)
summary(bank.churn)
##  Exited    CreditScore      Geography       Gender          Age       
##  0:7963   Min.   :350.0   France :5014   Female:4543   Min.   :18.00  
##  1:2037   1st Qu.:584.0   Germany:2509   Male  :5457   1st Qu.:32.00  
##           Median :652.0   Spain  :2477                 Median :37.00  
##           Mean   :650.5                                Mean   :38.92  
##           3rd Qu.:718.0                                3rd Qu.:44.00  
##           Max.   :850.0                                Max.   :92.00  
##                                                                       
##      Tenure        Balance       NumOfProducts   HasCrCard      IsActiveMember
##  2      :1048   Min.   :     0   1:5084        Min.   :0.0000   0:4849        
##  1      :1035   1st Qu.:     0   2:4590        1st Qu.:0.0000   1:5151        
##  7      :1028   Median : 97199   3: 266        Median :1.0000                 
##  8      :1025   Mean   : 76486   4:  60        Mean   :0.7055                 
##  5      :1012   3rd Qu.:127644                 3rd Qu.:1.0000                 
##  3      :1009   Max.   :250898                 Max.   :1.0000                 
##  (Other):3843                                                                 
##  EstimatedSalary    
##  Min.   :    11.58  
##  1st Qu.: 51002.11  
##  Median :100193.91  
##  Mean   :100090.24  
##  3rd Qu.:149388.25  
##  Max.   :199992.48  
## 

No missing values were found, so no imputation is necessary.
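
A quick sanity check confirms this (a minimal sketch using base R):

# Count missing values per column; all zeros mean no imputation is needed
colSums(is.na(bank.churn))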

Distribution Plots

# Categorical: plot.geog, plot.gender, plot.tenure, plot.numProd, plot.active, plot.crCard

# Quantitative: plot.cred.score, plot.age, plot.balance, plot.est

# Density plots use the default bandwidth; a fixed bw of 0.3 would be far too narrow for these scales
plot.cred.score <- ggplot(bank.churn, aes(x = CreditScore, color = Exited)) + geom_density(na.rm = TRUE) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot.geog <- ggplot(bank.churn, aes(x = Geography, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot.gender <- ggplot(bank.churn, aes(x = Gender, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot.age <- ggplot(bank.churn, aes(x = Age, color = Exited)) + geom_density(na.rm = TRUE) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot.tenure <- ggplot(bank.churn, aes(x = Tenure, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot.balance <- ggplot(bank.churn, aes(x = Balance, color = Exited)) + geom_density(na.rm = TRUE) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot.numProd <- ggplot(bank.churn, aes(x = NumOfProducts, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot.crCard <- ggplot(bank.churn, aes(x = HasCrCard, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot.active <- ggplot(bank.churn, aes(x = IsActiveMember, color = Exited)) + geom_bar(position = position_dodge()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot.est <- ggplot(bank.churn, aes(x = EstimatedSalary, color = Exited)) + geom_density(na.rm = TRUE) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
ggarrange(plot.cred.score,  plot.age, plot.balance, plot.est + rremove("x.text"), ncol = 1, nrow = 4)

  • Balance appears to be a useful predictor of customer churn.
# Categorical: plot.geog, plot.gender, plot.tenure, plot.numProd, plot.active, plot.crCard
ggarrange(plot.geog, plot.gender, plot.tenure, plot.numProd, plot.active, plot.crCard + rremove("x.text"), ncol = 2, nrow = 3)

  • Geography, the number of products, and whether one has a credit card show noticeable differences between exit statuses (0, 1). Tenure and active membership show little separation in these plots.

Correlation Plot

# Quantitative: CreditScore, Age, Balance, EstimatedSalary

corrplot::corrplot(cor(bank.churn[c("CreditScore", "Age", "Balance", "EstimatedSalary")], use = "na.or.complete"), method = "number", type = "upper")

  • There is no meaningful correlation among the quantitative variables; therefore, multicollinearity is not an issue.

SVM Model

data.model <- bank.churn
glm1 <- glm(Exited ~. -Tenure -IsActiveMember, family = binomial, data.model)
head(data.model)
##   Exited CreditScore Geography Gender Age Tenure   Balance NumOfProducts
## 1      1         619    France Female  42      2      0.00             1
## 2      0         608     Spain Female  41      1  83807.86             1
## 3      1         502    France Female  42      8 159660.80             3
## 4      0         699    France Female  39      1      0.00             2
## 5      0         850     Spain Female  43      2 125510.82             1
## 6      1         645     Spain   Male  44      8 113755.78             2
##   HasCrCard IsActiveMember EstimatedSalary
## 1         1              1       101348.88
## 2         0              1       112542.58
## 3         1              0       113931.57
## 4         0              0        93826.63
## 5         1              1        79084.10
## 6         1              0       149756.71
marginalModelPlots(glm1, ~ CreditScore + Age + Balance + EstimatedSalary + Geography + Gender + NumOfProducts + HasCrCard, layout = c(2,3))
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : 
##   A term has fewer unique covariate combinations than specified maximum degrees of freedom
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : 
##   A term has fewer unique covariate combinations than specified maximum degrees of freedom
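
The smoothing errors above are consistent with low-cardinality predictors in the plot formula (HasCrCard and the factor terms offer too few unique values for the default smoother). A possible workaround, an assumption rather than the original call, is to restrict the marginal plots to the continuous predictors:

# Hedged workaround: smooth only the continuous predictors, since factor-like
# terms (e.g., HasCrCard) have too few unique values for the default smoother
marginalModelPlots(glm1, ~ CreditScore + Age + Balance + EstimatedSalary, layout = c(2, 3))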

  • Transformations of CreditScore and Balance could help the model. Having a credit card does not register in the plots.

  • The distribution of Age is approximately normal for both exit statuses (0, 1).

# If the variance of the variable differs between the response classes, a squared term should be added.

data.frame(Age.variance.E0 = c(var(data.model$Age[data.model$Exited == 0])),
           Age.variance.E1 = c(var(data.model$Age[data.model$Exited == 1])))
##   Age.variance.E0 Age.variance.E1
## 1         102.523        95.28808
var.test(Age ~ Exited, data.model, alternative = "two.sided")
## 
##  F test to compare two variances
## 
## data:  Age by Exited
## F = 1.0759, num df = 7962, denom df = 2036, p-value = 0.03925
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.003625 1.151795
## sample estimates:
## ratio of variances 
##           1.075926
  • Since the F test indicates the variances differ, a squared term will be added to see whether it improves the model fit.
data.model$Age_squared <- data.model$Age^2
  • For the variables CreditScore and Balance, transformations will be attempted
data.model$CreditScore_log <- log(data.model$CreditScore)
# Balance contains zeros, so log1p() is used to avoid -Inf values
data.model$Balance_log <- log1p(data.model$Balance)
  • Retesting the model after these transformations: after several attempts, the transformations did not improve the fit, so the untransformed variables are retained. One way to check this is sketched below.
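
One plausible retest (a sketch; the exact formulas tried are not shown above) refits the logistic model with the transformed terms and compares AIC:

# Hedged sketch: compare glm1 against one plausible transformed specification;
# a lower AIC would favor the transformation
glm.trans <- glm(Exited ~ CreditScore_log + Geography + Gender + Age + Age_squared +
                   Balance_log + NumOfProducts + HasCrCard + EstimatedSalary,
                 family = binomial, data = data.model)
AIC(glm1, glm.trans)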

Building the model

glm.full <- glm(Exited ~. - Tenure, family = binomial, bank.churn)
summary(glm.full)
## 
## Call:
## glm(formula = Exited ~ . - Tenure, family = binomial, data = bank.churn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4896  -0.5848  -0.3615  -0.1789   3.2559  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.897e+00  2.474e-01 -11.709   <2e-16 ***
## CreditScore      -6.951e-04  3.032e-04  -2.292   0.0219 *  
## GeographyGermany  9.502e-01  7.292e-02  13.032   <2e-16 ***
## GeographySpain    6.073e-02  7.602e-02   0.799   0.4243    
## GenderMale       -5.236e-01  5.896e-02  -8.881   <2e-16 ***
## Age               7.121e-02  2.760e-03  25.797   <2e-16 ***
## Balance          -7.080e-07  5.696e-07  -1.243   0.2139    
## NumOfProducts2   -1.547e+00  7.121e-02 -21.722   <2e-16 ***
## NumOfProducts3    2.566e+00  1.798e-01  14.271   <2e-16 ***
## NumOfProducts4    1.632e+01  1.756e+02   0.093   0.9260    
## HasCrCard        -6.432e-02  6.413e-02  -1.003   0.3159    
## IsActiveMember1  -1.101e+00  6.227e-02 -17.684   <2e-16 ***
## EstimatedSalary   4.338e-07  5.134e-07   0.845   0.3981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  7432.4  on 9987  degrees of freedom
## AIC: 7458.4
## 
## Number of Fisher Scoring iterations: 14
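
For later comparison with the SVM fits, a quick in-sample confusion matrix for the logistic model can be computed (a sketch using a 0.5 probability cutoff, which is an assumed rather than tuned threshold):

# Classify at a 0.5 cutoff on the fitted probabilities (in-sample)
glm.pred <- factor(ifelse(predict(glm.full, type = "response") > 0.5, 1, 0), levels = c(0, 1))
table(true = bank.churn$Exited, pred = glm.pred)
mean(glm.pred == bank.churn$Exited)  # in-sample accuracy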

SVM: Default Model (Radial Kernel)

# RMSE helper used as a rough baseline measure for the logistic model
rmse <- function(error){
  sqrt(mean(error^2))
}

# Working residuals from the logistic fit; a coarse yardstick rather than a
# classification metric
error <- glm.full$residuals
predictionRMSE <- rmse(error)
# SVM

svm1 <- svm(Exited ~ ., data = bank.churn)
summary(svm1)
## 
## Call:
## svm(formula = Exited ~ ., data = bank.churn)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  3684
## 
##  ( 1793 1891 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
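
As a point of comparison with the logistic fit, the training confusion matrix and accuracy of this default model can be checked directly (a sketch; the model was fit on the full data, so this measures training accuracy, not generalization):

# In-sample confusion matrix and accuracy for the default radial SVM
svm1.pred <- predict(svm1, bank.churn)
table(true = bank.churn$Exited, pred = svm1.pred)
mean(svm1.pred == bank.churn$Exited)  # training accuracy only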

SVM: Radial Kernel with Tuning

bank.train <- bank.churn %>% sample_frac(0.5)
svmfit <- svm(Exited~., data = bank.train, kernel = "radial", gamma = 1, cost =1e5)

summary(svmfit)
## 
## Call:
## svm(formula = Exited ~ ., data = bank.train, kernel = "radial", gamma = 1, 
##     cost = 1e+05)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1e+05 
## 
## Number of Support Vectors:  4155
## 
##  ( 952 3203 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
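
With gamma = 1 and cost = 1e5 the kernel gets almost no regularization, so this fit is likely to memorize the training half. One hedged way to verify is to redo the split by row index (sample_frac alone does not keep the complement) and compare training versus held-out accuracy; the names and seed below are illustrative assumptions, not part of the original analysis:

# Illustrative overfitting check
set.seed(42)                                    # arbitrary seed for a reproducible split
idx      <- sample(nrow(bank.churn), nrow(bank.churn) %/% 2)
churn.tr <- bank.churn[idx, ]
churn.te <- bank.churn[-idx, ]
svm.over <- svm(Exited ~ ., data = churn.tr, kernel = "radial", gamma = 1, cost = 1e5)
mean(predict(svm.over, churn.tr) == churn.tr$Exited)  # near-perfect on training data
mean(predict(svm.over, churn.te) == churn.te$Exited)  # expected to drop on held-out data
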
set.seed(1)

tune_out <- tune(svm, Exited ~ ., data = bank.train, kernel = "radial", ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5, 1, 2, 3, 4)))
best_model <- tune_out$best.model

summary(best_model)
## 
## Call:
## best.tune(method = svm, train.x = Exited ~ ., data = bank.train, 
##     ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5, 
##         1, 2, 3, 4)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  2856
## 
##  ( 921 1935 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
table(true = bank.train$Exited, pred = predict(tune_out$best.model, newdata = bank.train))
##     pred
## true    0    1
##    0 4007   32
##    1  331  630
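
From the table above (which is computed on bank.train itself, so these are training numbers), the headline metrics follow directly:

cm <- table(true = bank.train$Exited, pred = predict(tune_out$best.model, newdata = bank.train))
sum(diag(cm)) / sum(cm)        # accuracy: (4007 + 630) / 5000 = 0.927
cm["1", "1"] / sum(cm["1", ])  # sensitivity for churners: 630 / 961, about 0.656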

Conclusion

While working on this assignment, it quickly became clear that decision trees are efficient at producing results. SVM, on the other hand, is cumbersome and slow to produce any results, particularly with tune() from the e1071 package. In terms of speed alone, SVM could easily be dismissed. By the same token, though, decision trees can give inaccurate results, and they often do not perform well on complicated data sets.

According to [this article published in a Hindawi journal](https://www.hindawi.com/journals/complexity/2021/5550344/), the researchers used standard decision tree ensembles on imbalanced data sets, such as the paper's dataset of COVID-19 lab results. The single best result, measured by AUROC, was produced by a random forest, but overall the authors report that decision tree ensembles gave the best results on imbalanced datasets. While random forests are built from decision trees, they are not the same; in fact, a random forest is slower to fit than a single decision tree, and is therefore more computationally expensive.

In this article [published in PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/), the authors highlight the usefulness of support vector machines (SVMs) in classification, particularly in helping cardiologists improve diagnostics for their patients. Compared with other machine learning models, SVM achieved accuracies ranging from 57.85% to 91.83%.

SVM is effective not only in the medical field but also in the realm of cyber security. [In this paper published by the IEEE](https://ieeexplore.ieee.org/abstract/document/9142183), researchers developed an SVM model for detecting DDoS attacks in software-defined networks. SVM is used as the main classifier for predicting malicious traffic, particularly for attack classification, because, at least according to these researchers, SVM is the most effective and accurate classification method.

Another use of SVM in cyber security is exemplified in [this paper, which discusses network intrusion detection systems (NIDS) and the role of SVM](https://ieeexplore.ieee.org/abstract/document/8463474). Among the many shallow machine learning methods considered, such as random forest, naïve Bayes, and SVM, SVM was found to be the best. Unsupervised learning methods are often used for feature extraction and representation, improving the quality of the data and thus the effectiveness of classification. The researchers in this case developed a self-taught, deep-learning-based intrusion detection system that uses SVM for network detection; on both binary and multiclass classification, its performance was superior to the other classification and intrusion detection methods compared.

In addition to medicine and cyber security, SVM is also used for [rock stars](https://www.mdpi.com/2071-1050/12/6/2229), in the context of predicting the brittleness of rocks. Like the other researchers mentioned above, these rockers found that SVM models gave the most accurate results, and all four SVM variants identified the same least impactful feature, allowing for clear and concise feature elimination and thus better predictability.

In conclusion, SVM's effectiveness seems to reign across a plethora of industries, not limited to the ones discussed in this essay. Despite the claim that in the business world it is sometimes better to make a quick decision than a good one, SVM shows that it is worthwhile to take one's time and build an effective model.