library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(class)
library(rpart)
library(rpart.plot)
library(naniar)
For this assignment, we will be working with a dataset on loan approval status. 'Loan_Status' is the target variable here -
Loan Approval Status Data Dictionary
dataset <- read_csv('Loan_approval.csv')
head(dataset) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="100%",height="300px")
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LP001002 | Male | No | 0 | Graduate | No | 5849 | 0 | NA | 360 | 1 | Urban | Y |
LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508 | 128 | 360 | 1 | Rural | N |
LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0 | 66 | 360 | 1 | Urban | Y |
LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358 | 120 | 360 | 1 | Urban | Y |
LP001008 | Male | No | 0 | Graduate | No | 6000 | 0 | 141 | 360 | 1 | Urban | Y |
LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196 | 267 | 360 | 1 | Urban | Y |
summary(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Length:614 | Length:614 | Length:614 | Length:614 | Length:614 | Length:614 | Min. : 150 | Min. : 0 | Min. : 9.0 | Min. : 12 | Min. :0.0000 | Length:614 | Length:614 | |
Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 2878 | 1st Qu.: 0 | 1st Qu.:100.0 | 1st Qu.:360 | 1st Qu.:1.0000 | Class :character | Class :character | |
Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 3812 | Median : 1188 | Median :128.0 | Median :360 | Median :1.0000 | Mode :character | Mode :character | |
NA | NA | NA | NA | NA | NA | Mean : 5403 | Mean : 1621 | Mean :146.4 | Mean :342 | Mean :0.8422 | NA | NA | |
NA | NA | NA | NA | NA | NA | 3rd Qu.: 5795 | 3rd Qu.: 2297 | 3rd Qu.:168.0 | 3rd Qu.:360 | 3rd Qu.:1.0000 | NA | NA | |
NA | NA | NA | NA | NA | NA | Max. :81000 | Max. :41667 | Max. :700.0 | Max. :480 | Max. :1.0000 | NA | NA | |
NA | NA | NA | NA | NA | NA | NA | NA | NA’s :22 | NA’s :14 | NA’s :50 | NA | NA |
Based on the above descriptive summary, quite a few variables have missing values, so we analyzed the missingness of each attribute to identify a proper imputation technique.
## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2, function(x) {sum(is.na(x)) / length(x) * 100}))

dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL

dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature | NA_Count | NA_Percentage |
---|---|---|
Credit_History | 50 | 8.1433225 |
Self_Employed | 32 | 5.2117264 |
LoanAmount | 22 | 3.5830619 |
Dependents | 15 | 2.4429967 |
Loan_Amount_Term | 14 | 2.2801303 |
Gender | 13 | 2.1172638 |
Married | 3 | 0.4885993 |
ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) +
geom_bar(stat = 'identity', fill = 'steelblue') +
geom_label(aes(label = NA_Count)) +
labs(title = 'Missing Counts') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())
# Use the naniar package to plot missing value patterns
gg_miss_upset(dataset)
Based on the above missing-value analysis, we are going to impute the data using the mice package with the random forest ('rf') method. Before that, we converted all categorical variables into factors -
# Transformation
# Loan_ID should be removed before imputing, since mice uses all columns to impute
dataset <- dataset %>%
  select(-'Loan_ID') %>%
  mutate(
    Gender = as.factor(Gender),
    Married = as.factor(Married),
    Dependents = as.factor(Dependents),
    Education = as.factor(Education),
    Self_Employed = as.factor(Self_Employed),
    Credit_History = as.factor(Credit_History),
    Property_Area = as.factor(Property_Area),
    Loan_Status = as.factor(Loan_Status)
  )
# Imputation using the random forest method ('rf')
init <- mice(dataset, maxit = 0)
predM <- init$predictorMatrix
set.seed(123)
imputed <- mice(dataset, method = 'rf', predictorMatrix = predM, m = 5)
##
## iter imp variable
## 1 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 1 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 1 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 1 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 1 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
dataset <- complete(imputed)
summary(dataset)
## Gender Married Dependents Education Self_Employed
## Female:113 No :214 0 :355 Graduate :480 No :529
## Male :501 Yes:400 1 :102 Not Graduate:134 Yes: 85
## 2 :104
## 3+: 53
##
##
## ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## Min. : 150 Min. : 0 Min. : 9.0 Min. : 12.0
## 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0 1st Qu.:360.0
## Median : 3812 Median : 1188 Median :127.5 Median :360.0
## Mean : 5403 Mean : 1621 Mean :146.7 Mean :341.7
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:169.5 3rd Qu.:360.0
## Max. :81000 Max. :41667 Max. :700.0 Max. :480.0
## Credit_History Property_Area Loan_Status
## 0: 91 Rural :179 N:192
## 1:523 Semiurban:233 Y:422
## Urban :202
##
##
##
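As a quick sanity check (our addition, not part of the original output), we can confirm that the completed dataset has no remaining missing values:
# Should return 0 after imputation
sum(is.na(dataset))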
We also checked for the presence of any degenerate (near-zero variance) variables and found none in our dataset -
# none of the variables meet the condition to be a degenerate feature
nearZeroVar(dataset)
## integer(0)
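For reference, a small sketch of our own: nearZeroVar() can also return the diagnostics it uses, the frequency ratio and the percentage of unique values, for every feature.
# Inspect the near-zero-variance diagnostics per feature
nearZeroVar(dataset, saveMetrics = TRUE)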
We analyzed the categorical and continuous variables separately -
cat_vars <- dataset %>%
  dplyr::select(-c('ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term')) %>%
  gather(key = 'predictor_variable', value = 'value', -Loan_Status)
# Plot a bar chart of counts for each categorical predictor variable
ggplot(cat_vars) +
  geom_bar(aes(x = value, fill = Loan_Status)) +
  labs(title = 'Distributions of Categorical Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(. ~ predictor_variable, scales = 'free', ncol = 3)
cont_vars <- dataset %>%
  dplyr::select(ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Loan_Status) %>%
  gather(key = 'predictor_variable', value = 'value', -Loan_Status)
# Plot a histogram for each continuous predictor variable
ggplot(cont_vars) +
geom_histogram(aes(x = value, y = ..density.., fill = Loan_Status), bins = 30) +
labs(title = 'Log Distributions of Continuous Variables') +
scale_x_log10() +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(. ~predictor_variable, scales = 'free', ncol = 2)
# Plot a histogram of applicant income faceted by a pair of predictor variables
bp <- ggplot(dataset, aes(x = ApplicantIncome)) +
  geom_histogram(bins = 30, color = "darkblue", fill = "lightblue") +
  labs(title = 'Distributions of Applicant Income By Gender and Marriage Status') +
  theme(plot.title = element_text(hjust = 0.5))
bp + facet_grid(Gender ~ Married, labeller = label_both)
The plot above shows that married males submitted the most loan applications, while among females, unmarried applicants applied for more loans than married ones.
# Plot a histogram of applicant income faceted by education and self-employment status
bp <- ggplot(dataset, aes(x = ApplicantIncome)) +
  geom_histogram(bins = 30, color = "darkblue", fill = "lightblue") +
  labs(title = 'Distributions of Applicant Income By Education and Self-Employed') +
  theme(plot.title = element_text(hjust = 0.5))
bp + facet_grid(Education ~ Self_Employed, labeller = label_both)
The plot above shows that salaried graduates applied for more loans than self-employed and non-graduate applicants.
# Plot a histogram of applicant income faceted by dependents and property area
bp <- ggplot(dataset, aes(x = ApplicantIncome)) +
  geom_histogram(bins = 30, color = "darkblue", fill = "lightblue") +
  labs(title = 'Distributions of Applicant Income By Dependents and Property_Area') +
  theme(plot.title = element_text(hjust = 0.5))
bp + facet_grid(Dependents ~ Property_Area, labeller = label_both)
From the plot above, applicants with no dependents submitted the most loan applications, and applicants living in semiurban areas also applied for more loans.
Converting all categorical variables present in the dataset to numeric codes -
# Alternative: caret's dummyVars() for one-hot encoding
# dummy <- dummyVars(" ~ .", data = dataset)
# newdata <- data.frame(predict(dummy, newdata = dataset))

# Converting categorical variables to numeric codes
newdata <- dataset %>% mutate_if(is.factor, as.numeric)
Next, we’ll have to work through a few transformations for our highly skewed continuous data. We will use log transformation to normalize the data.
newdata <- newdata %>%
  mutate(
    ApplicantIncome = log(ApplicantIncome),
    CoapplicantIncome = log(CoapplicantIncome),
    LoanAmount = log(LoanAmount),
    Loan_Amount_Term = log(Loan_Amount_Term))

# log(0) produces -Inf for applicants with no co-applicant income; floor those values at 0
newdata <- newdata %>%
  mutate(CoapplicantIncome = ifelse(CoapplicantIncome < 0, 0, CoapplicantIncome))
head(newdata) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width="100%",height="300px")
Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1 | 1 | 1 | 1 | 8.674026 | 0.000000 | 5.231109 | 5.886104 | 2 | 3 | 2 |
2 | 2 | 2 | 1 | 1 | 8.430109 | 7.318539 | 4.852030 | 5.886104 | 2 | 1 | 1 |
2 | 2 | 1 | 1 | 2 | 8.006368 | 0.000000 | 4.189655 | 5.886104 | 2 | 3 | 2 |
2 | 2 | 1 | 2 | 1 | 7.856707 | 7.765569 | 4.787492 | 5.886104 | 2 | 3 | 2 |
2 | 1 | 1 | 1 | 1 | 8.699515 | 0.000000 | 4.948760 | 5.886104 | 2 | 3 | 2 |
2 | 2 | 3 | 1 | 2 | 8.597297 | 8.341887 | 5.587249 | 5.886104 | 2 | 3 | 2 |
corrMatrix <- round(cor(newdata), 4)

corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order = "hclust", addrect = 4, rect.col = "black", rect.lwd = 5, cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))
Based on the above plot, there is no severe multicollinearity in the dataset. A few pairs, such as LoanAmount & ApplicantIncome and Credit_History & Loan_Status, show higher correlations for obvious reasons.
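As an additional check (a small sketch of our own using caret, which is already loaded), we can flag any predictors whose pairwise absolute correlation exceeds a chosen cutoff:
# Returns the names of columns that could be dropped due to high pairwise correlation
caret::findCorrelation(corrMatrix, cutoff = 0.75, names = TRUE)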
We are going to do a 75/25 split for training and test purposes.
sample = sample.split(newdata$Loan_Status, SplitRatio = 0.75)
train = subset(newdata, sample == TRUE)
test = subset(newdata, sample == FALSE)

#head(train) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

# Creating a separate data frame for the 'Loan_Status' feature, which is our target
train.loan_labels <- train[,12]
test.loan_labels <- test[,12]

train1 <- train[,-12]
test1 <- test[,-12]
Before building the LDA model, we applied a basic transformation.
# Convert the Loan_Status variable to a factor to work with the LDA model
train$Loan_Status <- factor(train$Loan_Status)
We built the LDA model using caret’s train() method -
library(caret)
# Create the LDA model using caret's train()
lda <- caret::train(Loan_Status ~ ., data = train,
                    method = 'lda', trControl = trainControl(method = "cv"))
# Output lda
lda
## Linear Discriminant Analysis
##
## 460 samples
## 11 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 414, 414, 414, 413, 414, 414, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8087532 0.4742248
Below is the summary of the generated model -
# Run summary on the LDA model
summary(lda)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 22 -none- numeric
## scaling 11 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## xNames 11 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
Then we checked the predictions against the actual values in tabular form -
# Predict on the test data using the LDA model we built
pred.Loan_Status = predict(lda, test)
# Organize results in a table
table(pred.Loan_Status, test$Loan_Status)
##
## pred.Loan_Status 1 2
## 1 23 2
## 2 25 104
We calculated the prediction accuracy -
# Calculate prediction accuracy
pred.accuracy = round(mean(pred.Loan_Status == test$Loan_Status) * 100, 2)
pred.accuracy
## [1] 82.47
The calculated accuracy is 82.47%, which is a good result.
We record the summary of the LDA model metrics in a data frame -
lda_model <- confusionMatrix(table(pred.Loan_Status, test$Loan_Status))$byClass
lda_accuracy <- confusionMatrix(table(pred.Loan_Status, test$Loan_Status))$overall['Accuracy']
lda_model <- data.frame(lda_model)
lda_model <- rbind("Accuracy" = lda_accuracy, lda_model)
For the kNN model, we need to identify an appropriate value for K, which represents the number of nearby neighbors used to determine a sample point's class. A rule of thumb is to start with the square root of the number of observations (i.e., rows or records). This rule typically leads to slightly higher-than-optimal values, but it is a good starting point. Later we will search a range of K values (and show how the caret package can help) to find the optimal K.
# Find the number of observations
NROW(train$Loan_Status)
## [1] 460
So, we have 460 observations in our training data set.
# initial value for K
sqrt(nrow(train))
## [1] 21.44761
The square root of 460 is around 21.45, so we'll create two models: one with K = 21 and the other with K = 22.
knn.21 <- knn(train = train1, test = test1, cl = train.loan_labels, k = 21)
knn.22 <- knn(train = train1, test = test1, cl = train.loan_labels, k = 22)
# Calculate the proportion of correct classifications for k = 21, 22
ACC.21 <- 100 * sum(test.loan_labels == knn.21) / NROW(test.loan_labels)
ACC.22 <- 100 * sum(test.loan_labels == knn.22) / NROW(test.loan_labels)

# Print accuracy scores
ACC.21
## [1] 70.12987
ACC.22
## [1] 70.77922
# Check predictions against actual values in tabular form for k = 21
table(knn.21, test.loan_labels)
## test.loan_labels
## knn.21 1 2
## 1 2 0
## 2 46 106
# Check predictions against actual values in tabular form for k = 22
confusionMatrix(table(knn.22, test.loan_labels))
## Confusion Matrix and Statistics
##
## test.loan_labels
## knn.22 1 2
## 1 3 0
## 2 45 106
##
## Accuracy : 0.7078
## 95% CI : (0.6292, 0.7782)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.3351
##
## Kappa : 0.0841
##
## Mcnemar's Test P-Value : 5.412e-11
##
## Sensitivity : 0.06250
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.70199
## Prevalence : 0.31169
## Detection Rate : 0.01948
## Detection Prevalence : 0.01948
## Balanced Accuracy : 0.53125
##
## 'Positive' Class : 1
##
Our resulting confusion matrix compares the kNN-predicted loan approval status to the actual loan status. K = 22 achieved an overall accuracy of 70.78% on our test set. However, this choice of K might be sub-optimal, so next we will search for the optimal value of K.
To improve the accuracy of the model, one can use techniques such as the elbow method and a maximum-percentage-accuracy graph. In the code snippet below, we loop over 'K' values from 1 to 25 and calculate the accuracy of the kNN model for each, which shows which 'K' value yields the most accurate model:
i = 1
k.optm = 1
for (i in 1:25) {
  knn.mod <- knn(train = train1, test = test1, cl = train.loan_labels, k = i)
  k.optm[i] <- 100 * sum(test.loan_labels == knn.mod) / NROW(test.loan_labels)
  k = i
  cat(k, '=', k.optm[i], '\n')
}
## 1 = 72.72727
## 2 = 70.12987
## 3 = 72.72727
## 4 = 72.07792
## 5 = 72.07792
## 6 = 71.42857
## 7 = 74.02597
## 8 = 73.37662
## 9 = 72.07792
## 10 = 73.37662
## 11 = 71.42857
## 12 = 72.07792
## 13 = 71.42857
## 14 = 71.42857
## 15 = 71.42857
## 16 = 70.77922
## 17 = 71.42857
## 18 = 72.07792
## 19 = 70.77922
## 20 = 70.77922
## 21 = 70.12987
## 22 = 69.48052
## 23 = 70.12987
## 24 = 70.12987
## 25 = 69.48052
#Accuracy plot
plot(k.optm, type="b", xlab="K- Value",ylab="Accuracy level")
Based on the above plot, K=7 seems to perform best in terms of accuracy.
knn.7 <- knn(train = train1, test = test1, cl = train.loan_labels, k = 7)

# Calculate the proportion of correct classifications for k = 7
ACC.7 <- 100 * sum(test.loan_labels == knn.7) / NROW(test.loan_labels)

# Print accuracy score
ACC.7
## [1] 74.02597
# Check predictions against actual values in tabular form for k = 7
table(knn.7, test.loan_labels)
## test.loan_labels
## knn.7 1 2
## 1 11 3
## 2 37 103
# Confusion matrix and statistics for k = 7
confusionMatrix(table(knn.7, test.loan_labels))
## Confusion Matrix and Statistics
##
## test.loan_labels
## knn.7 1 2
## 1 11 3
## 2 37 103
##
## Accuracy : 0.7403
## 95% CI : (0.6635, 0.8075)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.0945
##
## Kappa : 0.2491
##
## Mcnemar's Test P-Value : 1.811e-07
##
## Sensitivity : 0.22917
## Specificity : 0.97170
## Pos Pred Value : 0.78571
## Neg Pred Value : 0.73571
## Prevalence : 0.31169
## Detection Rate : 0.07143
## Detection Prevalence : 0.09091
## Balanced Accuracy : 0.60043
##
## 'Positive' Class : 1
##
From our output, we can see that K = 7 is the optimal value for our model based on the calculated accuracy and Kappa values. Given this, K = 7 may ultimately perform better than our initial kNN model with K = 22, especially on new data.
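As an alternative to the manual loop, caret (alluded to earlier) can tune K directly via cross-validation. A minimal sketch of our own; note the resampled accuracies will differ somewhat from the hold-out accuracies above:
# Let caret's train() search a grid of K values with 10-fold CV on the training set
set.seed(123)
knn_cv <- caret::train(Loan_Status ~ ., data = train,
                       method = 'knn', tuneLength = 25,
                       trControl = trainControl(method = "cv"))
knn_cv$bestTune  # the K with the best resampled accuracy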
We record the summary of the kNN model metrics in a data frame -
knn_model <- confusionMatrix(table(knn.7, test.loan_labels))$byClass
knn_accuracy <- confusionMatrix(table(knn.7, test.loan_labels))$overall['Accuracy']
knn_model <- data.frame(knn_model)
knn_model <- rbind("Accuracy" = knn_accuracy, knn_model)

knn_model
Metric | knn_model |
---|---|
Accuracy | 0.7402597 |
Sensitivity | 0.2291667 |
Specificity | 0.9716981 |
Pos Pred Value | 0.7857143 |
Neg Pred Value | 0.7357143 |
Precision | 0.7857143 |
Recall | 0.2291667 |
F1 | 0.3548387 |
Prevalence | 0.3116883 |
Detection Rate | 0.0714286 |
Detection Prevalence | 0.0909091 |
Balanced Accuracy | 0.6004324 |
Prior to building our decision tree, we need to ensure that all categorical variables in our train and test sets are coded as factors:
train2 <- train %>%
  mutate(
    Gender = as.factor(Gender),
    Married = as.factor(Married),
    Dependents = as.factor(Dependents),
    Education = as.factor(Education),
    Self_Employed = as.factor(Self_Employed),
    Credit_History = as.factor(Credit_History),
    Property_Area = as.factor(Property_Area),
    Loan_Status = as.factor(Loan_Status))

test2 <- test %>%
  mutate(
    Gender = as.factor(Gender),
    Married = as.factor(Married),
    Dependents = as.factor(Dependents),
    Education = as.factor(Education),
    Self_Employed = as.factor(Self_Employed),
    Credit_History = as.factor(Credit_History),
    Property_Area = as.factor(Property_Area),
    Loan_Status = as.factor(Loan_Status))
dt <- rpart(Loan_Status ~ ., data = train2, method = "class")
rpart.plot(dt, nn = TRUE)
Here are some of the details of the tree:
summary(dt)
## Call:
## rpart(formula = Loan_Status ~ ., data = train2, method = "class")
## n= 460
##
## CP nsplit rel error xerror xstd
## 1 0.3888889 0 1.0000000 1.0000000 0.06906903
## 2 0.0150463 1 0.6111111 0.6111111 0.05858297
## 3 0.0100000 7 0.5208333 0.6666667 0.06052641
##
## Variable importance
## Credit_History ApplicantIncome LoanAmount CoapplicantIncome
## 63 13 7 6
## Property_Area Dependents Married Gender
## 5 3 2 1
##
## Node number 1: 460 observations, complexity param=0.3888889
## predicted class=2 expected loss=0.3130435 P(node) =1
## class counts: 144 316
## probabilities: 0.313 0.687
## left son=2 (66 obs) right son=3 (394 obs)
## Primary splits:
## Credit_History splits as LR, improve=57.570600, (0 missing)
## Property_Area splits as LRL, improve= 5.345705, (0 missing)
## Loan_Amount_Term < 6.029945 to the right, improve= 3.867370, (0 missing)
## Education splits as RL, improve= 2.381620, (0 missing)
## LoanAmount < 5.096771 to the right, improve= 2.014418, (0 missing)
## Surrogate splits:
## ApplicantIncome < 10.58584 to the right, agree=0.861, adj=0.03, (0 split)
##
## Node number 2: 66 observations
## predicted class=1 expected loss=0.07575758 P(node) =0.1434783
## class counts: 61 5
## probabilities: 0.924 0.076
##
## Node number 3: 394 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.2106599 P(node) =0.8565217
## class counts: 83 311
## probabilities: 0.211 0.789
## left son=6 (246 obs) right son=7 (148 obs)
## Primary splits:
## Property_Area splits as LRL, improve=3.758429, (0 missing)
## LoanAmount < 5.102795 to the right, improve=2.923957, (0 missing)
## Loan_Amount_Term < 6.029945 to the right, improve=2.803773, (0 missing)
## ApplicantIncome < 7.615516 to the left, improve=1.380457, (0 missing)
## CoapplicantIncome < 8.916276 to the right, improve=1.367244, (0 missing)
## Surrogate splits:
## ApplicantIncome < 10.25357 to the left, agree=0.632, adj=0.02, (0 split)
##
## Node number 6: 246 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.2642276 P(node) =0.5347826
## class counts: 65 181
## probabilities: 0.264 0.736
## left son=12 (54 obs) right son=13 (192 obs)
## Primary splits:
## LoanAmount < 5.150393 to the right, improve=2.1504070, (0 missing)
## CoapplicantIncome < 6.408961 to the left, improve=1.6364930, (0 missing)
## ApplicantIncome < 7.615516 to the left, improve=1.5026190, (0 missing)
## Education splits as RL, improve=0.7206106, (0 missing)
## Loan_Amount_Term < 5.592211 to the right, improve=0.6046146, (0 missing)
## Surrogate splits:
## ApplicantIncome < 8.890988 to the right, agree=0.866, adj=0.389, (0 split)
## CoapplicantIncome < 8.594171 to the right, agree=0.805, adj=0.111, (0 split)
##
## Node number 7: 148 observations
## predicted class=2 expected loss=0.1216216 P(node) =0.3217391
## class counts: 18 130
## probabilities: 0.122 0.878
##
## Node number 12: 54 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.3888889 P(node) =0.1173913
## class counts: 21 33
## probabilities: 0.389 0.611
## left son=24 (28 obs) right son=25 (26 obs)
## Primary splits:
## Dependents splits as LRRR, improve=2.5073260, (0 missing)
## Loan_Amount_Term < 5.794943 to the right, improve=2.4326240, (0 missing)
## CoapplicantIncome < 8.590765 to the left, improve=1.3079710, (0 missing)
## ApplicantIncome < 8.671111 to the left, improve=0.9798622, (0 missing)
## Property_Area splits as L-R, improve=0.6611722, (0 missing)
## Surrogate splits:
## Married splits as LR, agree=0.667, adj=0.308, (0 split)
## CoapplicantIncome < 7.645629 to the right, agree=0.611, adj=0.192, (0 split)
## ApplicantIncome < 8.190675 to the right, agree=0.593, adj=0.154, (0 split)
## Property_Area splits as L-R, agree=0.593, adj=0.154, (0 split)
## LoanAmount < 5.212181 to the left, agree=0.574, adj=0.115, (0 split)
##
## Node number 13: 192 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.2291667 P(node) =0.4173913
## class counts: 44 148
## probabilities: 0.229 0.771
## left son=26 (93 obs) right son=27 (99 obs)
## Primary splits:
## CoapplicantIncome < 6.408961 to the left, improve=3.9141410, (0 missing)
## ApplicantIncome < 7.806656 to the left, improve=1.9390670, (0 missing)
## LoanAmount < 4.740565 to the left, improve=1.3235290, (0 missing)
## Education splits as RL, improve=0.8888889, (0 missing)
## Dependents splits as RLRR, improve=0.5333333, (0 missing)
## Surrogate splits:
## ApplicantIncome < 8.131384 to the right, agree=0.698, adj=0.376, (0 split)
## Married splits as LR, agree=0.661, adj=0.301, (0 split)
## LoanAmount < 4.412633 to the left, agree=0.646, adj=0.269, (0 split)
## Gender splits as LR, agree=0.620, adj=0.215, (0 split)
## Dependents splits as LRRL, agree=0.568, adj=0.108, (0 split)
##
## Node number 24: 28 observations, complexity param=0.0150463
## predicted class=1 expected loss=0.4642857 P(node) =0.06086957
## class counts: 15 13
## probabilities: 0.536 0.464
## left son=48 (13 obs) right son=49 (15 obs)
## Primary splits:
## LoanAmount < 5.395689 to the right, improve=2.6465200, (0 missing)
## CoapplicantIncome < 7.869991 to the left, improve=1.0729560, (0 missing)
## ApplicantIncome < 9.066183 to the right, improve=0.3670741, (0 missing)
## Property_Area splits as L-R, improve=0.2387319, (0 missing)
## Married splits as LR, improve=0.0952381, (0 missing)
## Surrogate splits:
## ApplicantIncome < 9.066183 to the right, agree=0.857, adj=0.692, (0 split)
## CoapplicantIncome < 7.659594 to the left, agree=0.679, adj=0.308, (0 split)
## Loan_Amount_Term < 5.794943 to the right, agree=0.571, adj=0.077, (0 split)
## Property_Area splits as R-L, agree=0.571, adj=0.077, (0 split)
##
## Node number 25: 26 observations
## predicted class=2 expected loss=0.2307692 P(node) =0.05652174
## class counts: 6 20
## probabilities: 0.231 0.769
##
## Node number 26: 93 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.3333333 P(node) =0.2021739
## class counts: 31 62
## probabilities: 0.333 0.667
## left son=52 (8 obs) right son=53 (85 obs)
## Primary splits:
## ApplicantIncome < 7.806656 to the left, improve=5.1362750, (0 missing)
## LoanAmount < 4.795756 to the left, improve=2.0751370, (0 missing)
## Loan_Amount_Term < 5.592211 to the right, improve=0.9737179, (0 missing)
## Education splits as RL, improve=0.7780392, (0 missing)
## Dependents splits as RLLR, improve=0.6289855, (0 missing)
## Surrogate splits:
## LoanAmount < 3.622114 to the left, agree=0.925, adj=0.125, (0 split)
##
## Node number 27: 99 observations
## predicted class=2 expected loss=0.1313131 P(node) =0.2152174
## class counts: 13 86
## probabilities: 0.131 0.869
##
## Node number 48: 13 observations
## predicted class=1 expected loss=0.2307692 P(node) =0.02826087
## class counts: 10 3
## probabilities: 0.769 0.231
##
## Node number 49: 15 observations
## predicted class=2 expected loss=0.3333333 P(node) =0.0326087
## class counts: 5 10
## probabilities: 0.333 0.667
##
## Node number 52: 8 observations
## predicted class=1 expected loss=0.125 P(node) =0.0173913
## class counts: 7 1
## probabilities: 0.875 0.125
##
## Node number 53: 85 observations
## predicted class=2 expected loss=0.2823529 P(node) =0.1847826
## class counts: 24 61
## probabilities: 0.282 0.718
Credit history and income seem to be some of the important predictors for loan approval.
We are going to apply our tree to the test data and create a confusion matrix to evaluate the accuracy of the classifications.
# Predict the class labels on the test set
tree.pred = predict(dt, test2, type = "class")
# Confusion matrix on the training data (predict() without newdata uses the training set)
confusionMatrix(predict(dt, type = "class"), train2$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 78 9
## 2 66 307
##
## Accuracy : 0.837
## 95% CI : (0.8, 0.8695)
## No Information Rate : 0.687
## P-Value [Acc > NIR] : 1.459e-13
##
## Kappa : 0.5751
##
## Mcnemar's Test P-Value : 1.004e-10
##
## Sensitivity : 0.5417
## Specificity : 0.9715
## Pos Pred Value : 0.8966
## Neg Pred Value : 0.8231
## Prevalence : 0.3130
## Detection Rate : 0.1696
## Detection Prevalence : 0.1891
## Balanced Accuracy : 0.7566
##
## 'Positive' Class : 1
##
Our model is 83% accurate on the training data. The tree was grown to full depth, so it may be overfitting. To improve generalization, we prune the tree using cross-validation:
plotcp(dt)
dt$cptable
## CP nsplit rel error xerror xstd
## 1 0.3888889 0 1.0000000 1.0000000 0.06906903
## 2 0.0150463 1 0.6111111 0.6111111 0.05858297
## 3 0.0100000 7 0.5208333 0.6666667 0.06052641
The plot above shows the cross-validated error against the complexity parameter. The curve is at its lowest at a tree size of 2, so we prune our tree to that size. At size 2, the cross-validated error is ~0.611 and the chosen cp is 0.01157.
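A programmatic alternative (a small sketch of our own, equivalent to reading the table above): pick the cp from the cptable row with the lowest cross-validated error.
# Select the complexity parameter with the minimum cross-validated error
best_cp <- dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"]
best_cp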
prune_dt = prune(dt, cp = 0.01157)
rpart.plot(prune_dt)

Prune_pred <- predict(prune_dt, test2, type = "class")
confusionMatrix(Prune_pred, test2$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 23 6
## 2 25 100
##
## Accuracy : 0.7987
## 95% CI : (0.7266, 0.8589)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.001482
##
## Kappa : 0.4739
##
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.4792
## Specificity : 0.9434
## Pos Pred Value : 0.7931
## Neg Pred Value : 0.8000
## Prevalence : 0.3117
## Detection Rate : 0.1494
## Detection Prevalence : 0.1883
## Balanced Accuracy : 0.7113
##
## 'Positive' Class : 1
##
Pruning the tree yields a test-set accuracy of about 80%, lower than the ~84% figure above (which, note, was computed on the training data), so we will stick with the base decision tree model.
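For an apples-to-apples comparison (our addition, using tree.pred computed earlier), the unpruned tree's accuracy on the same test set can be calculated directly:
# Test-set accuracy (%) of the unpruned tree, for comparison with the pruned tree
round(mean(tree.pred == test2$Loan_Status) * 100, 2)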
We record the summary of the Decision Tree model metrics in a data frame -
dtree_model <- confusionMatrix(table(tree.pred, test2$Loan_Status))$byClass
dtree_accuracy <- confusionMatrix(table(tree.pred, test2$Loan_Status))$overall['Accuracy']
dtree_model <- data.frame(dtree_model)
dtree_model <- rbind("Accuracy" = dtree_accuracy, dtree_model)
However, single trees often do not give very good prediction errors. Therefore, we will build a random forest model, which tends to outperform a single tree in terms of prediction and misclassification errors.
In creating the best random forest model, we want to minimize the OOB error rate by finding the optimal number of variables considered at each split, known as mtry. The code below finds the optimal mtry to use in our random forest model.
# Finding the best mtry for the random forest model by minimizing the OOB error
mtry <- randomForest::tuneRF(train2[-12], train2$Loan_Status, ntreeTry = 500,
                             stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = TRUE)
## mtry = 3 OOB error = 20.22%
## Searching left ...
## mtry = 2 OOB error = 20%
## 0.01075269 0.01
## Searching right ...
## mtry = 4 OOB error = 20.43%
## -0.02173913 0.01
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
print(mtry)
## mtry OOBError
## 2.OOB 2 0.2000000
## 3.OOB 3 0.2021739
## 4.OOB 4 0.2043478
print(best.m)
## [1] 2
Once the optimal mtry value is found, we apply it to our model.
# Using the best mtry in the model, then plotting variable importance
set.seed(71)
rf <- randomForest(Loan_Status ~ ., data = train2, mtry = best.m, importance = TRUE, ntree = 500)
print(rf)
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = train2, mtry = best.m, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20%
## Confusion matrix:
## 1 2 class.error
## 1 65 79 0.54861111
## 2 13 303 0.04113924
The graph below illustrates the importance of the variables used to predict the loan status. Mean Decrease Accuracy shows how much model accuracy decreases if we drop a variable; here, Credit_History is the most important variable by a wide margin. Mean Decrease Gini ranks variables by the Gini impurity reduction they provide when used for splitting; again, Credit_History is the clear leader, though by a narrower margin, followed by LoanAmount.
#Evaluate variable importance
varImpPlot(rf)
The random forest model we end up using has an accuracy of 81.82% on the test dataset. Furthermore, the model has a 97.17% specificity rate and a 47.92% sensitivity rate.
rf_predict <- predict(rf, newdata = test2)
rf_conf_matrix <- confusionMatrix(rf_predict, test2$Loan_Status)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 23 3
## 2 25 103
##
## Accuracy : 0.8182
## 95% CI : (0.7481, 0.8757)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.0001955
##
## Kappa : 0.5155
##
## Mcnemar's Test P-Value : 7.229e-05
##
## Sensitivity : 0.4792
## Specificity : 0.9717
## Pos Pred Value : 0.8846
## Neg Pred Value : 0.8047
## Prevalence : 0.3117
## Detection Rate : 0.1494
## Detection Prevalence : 0.1688
## Balanced Accuracy : 0.7254
##
## 'Positive' Class : 1
##
We record the summary of the Random Forest model metrics in a data frame -
rf_model <- confusionMatrix(table(rf_predict, test2$Loan_Status))$byClass
rf_accuracy <- confusionMatrix(table(rf_predict, test2$Loan_Status))$overall['Accuracy']
rf_model <- data.frame(rf_model)
rf_model <- rbind("Accuracy" = rf_accuracy, rf_model)
After running the LDA, kNN, decision tree, and random forest models, we can take a look at the overall evaluation metrics for these techniques on the loan approval dataset. By collecting all of our metrics in one data frame, we can compare the outcomes below:
model_summary <- data.frame(lda_model, knn_model, dtree_model, rf_model)

model_summary %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="450px")
Metric | lda_model | knn_model | dtree_model | rf_model |
---|---|---|---|---|
Accuracy | 0.8246753 | 0.7402597 | 0.7987013 | 0.8181818 |
Sensitivity | 0.4791667 | 0.2291667 | 0.4791667 | 0.4791667 |
Specificity | 0.9811321 | 0.9716981 | 0.9433962 | 0.9716981 |
Pos Pred Value | 0.9200000 | 0.7857143 | 0.7931034 | 0.8846154 |
Neg Pred Value | 0.8062016 | 0.7357143 | 0.8000000 | 0.8046875 |
Precision | 0.9200000 | 0.7857143 | 0.7931034 | 0.8846154 |
Recall | 0.4791667 | 0.2291667 | 0.4791667 | 0.4791667 |
F1 | 0.6301370 | 0.3548387 | 0.5974026 | 0.6216216 |
Prevalence | 0.3116883 | 0.3116883 | 0.3116883 | 0.3116883 |
Detection Rate | 0.1493506 | 0.0714286 | 0.1493506 | 0.1493506 |
Detection Prevalence | 0.1623377 | 0.0909091 | 0.1883117 | 0.1688312 |
Balanced Accuracy | 0.7301494 | 0.6004324 | 0.7112814 | 0.7254324 |
Consulting the output table above, we observe that the LDA model has the strongest performance on Accuracy, Specificity, Pos Pred Value, Precision, and F1.
Decision trees often raise concerns regarding over-fitting and high variance because of their simplicity, and random forests are meant to address these concerns by aggregating a collection of decision trees into a single result. We found it surprising that LDA outperformed the random forest model on many metrics (e.g., Balanced Accuracy). This may be due to how we implemented the model, or it may simply have been a poor situation for random forests. We are also dealing with imbalanced classes (recall 192 N's vs. 422 Y's), which may have affected model performance.
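As a possible follow-up (a minimal sketch of our own, not part of the original analysis), the class imbalance could be addressed by having each tree in the forest down-sample the majority class via randomForest's strata and sampsize arguments:
# Sketch: balance classes by down-sampling the majority class within each bootstrap sample
set.seed(71)
n_min <- min(table(train2$Loan_Status))  # size of the minority class
rf_bal <- randomForest(Loan_Status ~ ., data = train2, mtry = best.m, ntree = 500,
                       strata = train2$Loan_Status, sampsize = c(n_min, n_min))
confusionMatrix(predict(rf_bal, newdata = test2), test2$Loan_Status)
This typically trades some overall accuracy for better sensitivity on the minority class; whether that trade-off is worthwhile depends on the relative cost of each error type.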