Project Overview

Perform an analysis of the dataset used in Homework #2 using the SVM algorithm and compare the results with the results from the previous homework. Based on the articles https://www.hindawi.com/journals/complexity/2021/5550344/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/, search for academic content (at least 3 articles) that compares the use of decision trees vs. SVMs in your current area of expertise. Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?

Data Description

For this project, I use the Loan application data set from Kaggle: https://www.kaggle.com/datasets/angadgupta/loanapplicantdata. The data set is structured with 13 variables and 614 observations. Among the 13 variables, the first variable, Loan_ID, is deleted because it is not an influential factor in the data set. The variable Loan_Status is the target variable. The objective of this project is to determine the factors that affect whether a loan application is approved or rejected.

Loan_ID - Unique Loan ID

Gender - Male/Female

Married - Applicant Married (Y/N)

Dependents - Number of Dependents

Education - Applicant Education (Graduate/Not Graduate)

Self_Employed - Self-Employed (Y/N)

ApplicantIncome - Applicant Income

CoapplicantIncome - Co-applicant Income

LoanAmount - Loan Amount in Thousands

Loan_Amount_Term - Term of Loan in Months

Credit_History - Credit History (1 for Yes, 0 for No)

Property_Area - Property Area (Semiurban, Urban, Rural)

Loan_Status - Loan Status (Y/N)

Libraries

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(skimr)
library(rpart)
library(rpart.plot)
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(ggplot2)
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ tidyr   1.2.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x randomForest::combine() masks dplyr::combine()
## x dplyr::filter()         masks stats::filter()
## x dplyr::lag()            masks stats::lag()
## x purrr::lift()           masks caret::lift()
## x randomForest::margin()  masks ggplot2::margin()
library(e1071)

Import and Subset Data

Loan<-read_csv("loan.csv", col_types = 'fffnffnnnnfff')

skim(Loan)
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".

## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".

## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".

## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
Data summary
Name Loan
Number of rows 614
Number of columns 13
_______________________
Column type frequency:
factor 8
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Loan_ID 0 1 FALSE 614 LP0: 1, LP0: 1, LP0: 1, LP0: 1
Gender 0 1 FALSE 3 Mal: 489, Fem: 112, emp: 13
Married 0 1 FALSE 3 Yes: 398, No: 213, emp: 3
Education 0 1 FALSE 2 Gra: 480, Not: 134
Self_Employed 0 1 FALSE 3 No: 500, Yes: 82, emp: 32
Credit_History 0 1 FALSE 3 1: 475, 0: 89, emp: 50
Property_Area 0 1 FALSE 3 Sem: 233, Urb: 202, Rur: 179
Loan_Status 0 1 FALSE 2 Y: 422, N: 192

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Dependents 15 0.98 0.76 1.02 0 0.0 0.0 2.00 3 ▇▂▁▂▁
ApplicantIncome 0 1.00 5403.46 6109.04 150 2877.5 3812.5 5795.00 81000 ▇▁▁▁▁
CoapplicantIncome 0 1.00 1621.25 2926.25 0 0.0 1188.5 2297.25 41667 ▇▁▁▁▁
LoanAmount 22 0.96 146.41 85.59 9 100.0 128.0 168.00 700 ▇▃▁▁▁
Loan_Amount_Term 14 0.98 342.00 65.12 12 360.0 360.0 360.00 480 ▁▁▁▇▁

The skim() output shows that there are missing values in the Dependents, LoanAmount, and Loan_Amount_Term columns, and that the target variable Loan_Status has 422 'Y' and 192 'N' observations. This is a sign of class imbalance, which will be handled with the SMOTE() function.
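As a quick cross-check of the missing values reported by skim(), the NA count per column can be listed directly (a minimal sketch on the imported Loan data frame):

# Number of missing values in each column of the imported data
colSums(is.na(Loan))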

ggplot(Loan, aes(x = Loan_Status, y = ApplicantIncome))+geom_boxplot()+ylim(0,30000)
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).

ggplot(Loan, aes(x = Loan_Status, y = CoapplicantIncome))+geom_boxplot()+ylim(0,10000)
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).

According to the box plots, the median income of applicants who were approved and of those who were rejected appears to be about the same. However, the median income of co-applicants who were approved is higher than that of co-applicants who were rejected.
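The medians behind this reading can be computed directly (a small dplyr sketch; missing values are ignored):

# Median applicant and co-applicant income by loan status
Loan %>%
  group_by(Loan_Status) %>%
  summarise(median_applicant   = median(ApplicantIncome, na.rm = TRUE),
            median_coapplicant = median(CoapplicantIncome, na.rm = TRUE))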

ggplot(Loan, aes(x=Loan_Status, y= Education, color = Education)) +
  geom_bar(stat="identity")

ggplot(Loan, aes(x=Loan_Status, y= Property_Area, color = Property_Area)) +
  geom_bar(stat="identity")

The bar plots show that a higher proportion of Graduate applicants had their loans approved compared to the Not Graduate group. Also, applicants with property in an Urban area have a higher chance of getting the loan approved, followed by Rural and Semiurban.
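The approval rates behind the bar plots can also be tabulated as row proportions (a minimal sketch; each row sums to 1):

# Share of approved (Y) vs. rejected (N) loans within each education level
round(prop.table(table(Loan$Education, Loan$Loan_Status), margin = 1), 2)
# ...and within each property area
round(prop.table(table(Loan$Property_Area, Loan$Loan_Status), margin = 1), 2)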

Feature Engineering

First, remove the unnecessary variable Loan_ID. The decision tree model is very convenient to build because it handles noise and missing values by itself. The only concern is the class imbalance problem: as the skim() output shows, Loan_Status contains 422 'Y' and 192 'N'. Therefore, SMOTE() is used to fix the class imbalance.

Loan<-Loan[,-1]
round(prop.table(table(select(Loan, Loan_Status),exclude = NULL)),4)*100
## 
##     Y     N 
## 68.73 31.27
set.seed(1234)
Loan<-SMOTE(factor(Loan_Status)~., data.frame(Loan),perc.over = 100,perc.under = 200)
round(prop.table(table(select(Loan, Loan_Status),exclude = NULL)),4)*100
## 
##  Y  N 
## 50 50

Now the proportion of Y to N is 50/50.

Split the data set into a 75/25 training set and test set.

set.seed(1234)
split <- sample(nrow(Loan), round(nrow(Loan)*0.75), replace = F) 
train <- Loan[split,]
test <- Loan[-split,]
round(prop.table(table(select(train, Loan_Status),exclude = NULL)),4)*100
## 
##     Y     N 
## 49.83 50.17
round(prop.table(table(select(test, Loan_Status),exclude = NULL)),4)*100
## 
##     Y     N 
## 50.52 49.48

With the 75/25 split, there are 576 observations in the training data and 192 observations in the test data.
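These counts can be verified directly from the split objects (a quick check using the train and test objects created above):

# Size of the training and test sets after the 75/25 split
nrow(train)
nrow(test)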

Decision Tree Modeling

mod<-rpart(Loan_Status~.,method = 'class' ,data = Loan)
rpart.plot(mod)

According to the rpart.plot output, the key variable at the root node is Credit_History: if there is no credit history, the decision goes directly to no approval. The second key variable is the co-applicant's income, followed by whether the property area is urban. Since Credit_History and the co-applicant's income are the key factors, I'm going to build another tree model without these two variables to compare the results.
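To back this reading up, the importance scores that rpart accumulates while growing the tree can be printed (a minimal sketch using the mod object above):

# Variable importance recorded by rpart; larger values indicate variables
# that contribute more to the splits of the fitted tree
round(mod$variable.importance, 2)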

loan<-Loan[,-c(7,10)]
mod2<-rpart(Loan_Status~.,method = 'class' ,data = loan)
rpart.plot(mod2)

Without the credit history and co-applicant income information, LoanAmount becomes the root node, and the key variables that follow are marital status, applicant income, and property location.

Decision Tree Model Evaluation

mod_pred <- predict(mod, test, type = 'class')
mod_table<-table(test$Loan_Status, mod_pred)
mod_table
##    mod_pred
##      Y  N
##   Y 89  8
##   N 33 62
mod_accuracy<- sum(diag(mod_table))/nrow(test)
print(paste('The first tree model accuracy is',mod_accuracy))
## [1] "The first tree model accuracy is 0.786458333333333"

The first decision tree model achieves 78.65% accuracy against the test data. Let's try the second decision tree model to see how the accuracy changes.

mod2_pred <- predict(mod2, test, type = 'class')
mod2_table<-table(test$Loan_Status, mod2_pred)
mod2_table
##    mod2_pred
##      Y  N
##   Y 80 17
##   N 37 58
mod2_accuracy<- sum(diag(mod2_table))/nrow(test)
print(paste('The second tree model accuracy is',mod2_accuracy))
## [1] "The second tree model accuracy is 0.71875"

The accuracies show that the model that includes Credit_History and CoapplicantIncome is about 7 percentage points more accurate against the test data.

Random Forest Modeling and Evaluation

forest_mod <- randomForest(Loan_Status ~ .,ntree = 2000, importance = T,data = train, na.action = na.roughfix)

forest_pred <- predict(forest_mod, test)

confusionMatrix(forest_pred, test$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Y  N
##          Y 74 27
##          N 10 61
##                                           
##                Accuracy : 0.7849          
##                  95% CI : (0.7159, 0.8438)
##     No Information Rate : 0.5116          
##     P-Value [Acc > NIR] : 1.25e-13        
##                                           
##                   Kappa : 0.5715          
##                                           
##  Mcnemar's Test P-Value : 0.008529        
##                                           
##             Sensitivity : 0.8810          
##             Specificity : 0.6932          
##          Pos Pred Value : 0.7327          
##          Neg Pred Value : 0.8592          
##              Prevalence : 0.4884          
##          Detection Rate : 0.4302          
##    Detection Prevalence : 0.5872          
##       Balanced Accuracy : 0.7871          
##                                           
##        'Positive' Class : Y               
## 
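Since the forest was grown with importance = T, its variable importance scores can also be inspected (a short optional sketch using the forest_mod object above):

# Permutation (mean decrease in accuracy) and Gini importance from the fitted forest
importance(forest_mod)
varImpPlot(forest_mod, main = "Random forest variable importance")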

There are 20 rows in the test set that contain NA values, so I drop them before evaluating the SVM model.
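As a quick check (a minimal sketch using the test object from the split above), the number of incomplete test rows can be counted with complete.cases():

# Number of test rows that contain at least one NA
sum(!complete.cases(test))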

SVM and Evaluation

test_set<-drop_na(test)

### Support Vector Machine

svm_mod<-svm(Loan_Status~., data = train)
svm_pred<-predict(svm_mod, newdata = test_set)
confusionMatrix(svm_pred,test_set$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Y  N
##          Y 83 51
##          N  1 37
##                                           
##                Accuracy : 0.6977          
##                  95% CI : (0.6231, 0.7653)
##     No Information Rate : 0.5116          
##     P-Value [Acc > NIR] : 5.488e-07       
##                                           
##                   Kappa : 0.4031          
##                                           
##  Mcnemar's Test P-Value : 1.083e-11       
##                                           
##             Sensitivity : 0.9881          
##             Specificity : 0.4205          
##          Pos Pred Value : 0.6194          
##          Neg Pred Value : 0.9737          
##              Prevalence : 0.4884          
##          Detection Rate : 0.4826          
##    Detection Prevalence : 0.7791          
##       Balanced Accuracy : 0.7043          
##                                           
##        'Positive' Class : Y               
## 
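The svm() call above relies on the package defaults (a radial kernel with cost = 1). As a possible follow-up, here is a hedged sketch of tuning the cost and gamma parameters with e1071::tune.svm(); the grid values are illustrative, and tuning is run on the complete cases of the training data to avoid NA handling issues during cross-validation:

# Tune cost and gamma for the radial kernel on NA-free training data (sketch)
train_cc <- drop_na(train)
set.seed(1234)
svm_tuned <- tune.svm(Loan_Status ~ ., data = train_cc,
                      gamma = c(0.01, 0.1, 1), cost = c(0.1, 1, 10))
summary(svm_tuned)
# Evaluate the best model found by the grid search on the NA-free test set
confusionMatrix(predict(svm_tuned$best.model, newdata = test_set),
                test_set$Loan_Status)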

Conclusion

In this case, the random forest reaches an accuracy of about 78% (78.49%), essentially on par with the first decision tree model (78.65%) and well above the SVM model at about 70%. Tree-based methods are considered non-parametric, making no assumptions about the distribution of the data or the structure of the true model. They require less data cleaning and are not strongly affected by noise or multi-collinearity. SVM, on the other hand, requires more construction before a model can be built: the data usually needs to be transformed, for example with kernel functions, and the algorithm selects a hyperplane that maximizes the margin between the plane and the classes.

Comparing the two, the SVM uses the kernel trick to solve non-linear problems such as this multi-variable data set, whereas a decision tree derives hyper-rectangles in the input space. In general, decision trees perform better on categorical data and deal with collinearity better than SVM does. The target variable in this project is categorical, and the accuracy of the SVM versus the decision tree confirms the advantage of the decision tree over the SVM for classification.

The model I would recommend, however, is the random forest. Random forest is a tree-based machine learning algorithm that leverages the power of multiple decision trees: it combines the output of many (randomly created) decision trees to generate the final output. The key difference between a decision tree and a random forest is that the random forest does not rely on the feature importance given by a single decision tree. Therefore, among the models that have been created, I recommend the random forest, because it is suitable for larger data sets and interpretability is less important here, since the main objective of this project is simply to identify whether a loan application should be approved or rejected.
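For reference, the test accuracies discussed above can be collected side by side (a rough summary sketch reusing the objects created earlier; note that the random forest and SVM confusion matrices are based only on the test rows without missing values):

# Side-by-side test accuracy of the three models fitted above
rf_cm  <- confusionMatrix(forest_pred, test$Loan_Status)
svm_cm <- confusionMatrix(svm_pred, test_set$Loan_Status)
data.frame(
  model    = c("Decision tree (all predictors)", "Random forest", "SVM (radial)"),
  accuracy = c(sum(diag(mod_table)) / nrow(test),
               rf_cm$overall["Accuracy"],
               svm_cm$overall["Accuracy"])
)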

References:

https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222

https://towardsdatascience.com/a-complete-view-of-decision-trees-and-svm-in-machine-learning-f9f3d19a337b

https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/