library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(caret)
library(mice)
library(randomForest)
library(caTools)
library(corrplot)
library(class)
library(rpart)
library(rpart.plot)
library(naniar)
For this assignment, we will be working with a dataset on loan approval status. 'Loan_Status' is the target variable here -
Loan Approval Status Data Dictionary
dataset <- read_csv('Loan_approval.csv')
head(dataset) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width="100%",height="300px")
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LP001002 | Male | No | 0 | Graduate | No | 5849 | 0 | NA | 360 | 1 | Urban | Y |
LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508 | 128 | 360 | 1 | Rural | N |
LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0 | 66 | 360 | 1 | Urban | Y |
LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358 | 120 | 360 | 1 | Urban | Y |
LP001008 | Male | No | 0 | Graduate | No | 6000 | 0 | 141 | 360 | 1 | Urban | Y |
LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196 | 267 | 360 | 1 | Urban | Y |
summary(dataset)%>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Length:614 | Length:614 | Length:614 | Length:614 | Length:614 | Length:614 | Min. : 150 | Min. : 0 | Min. : 9.0 | Min. : 12 | Min. :0.0000 | Length:614 | Length:614 | |
Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 2878 | 1st Qu.: 0 | 1st Qu.:100.0 | 1st Qu.:360 | 1st Qu.:1.0000 | Class :character | Class :character | |
Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 3812 | Median : 1188 | Median :128.0 | Median :360 | Median :1.0000 | Mode :character | Mode :character | |
NA | NA | NA | NA | NA | NA | Mean : 5403 | Mean : 1621 | Mean :146.4 | Mean :342 | Mean :0.8422 | NA | NA | |
NA | NA | NA | NA | NA | NA | 3rd Qu.: 5795 | 3rd Qu.: 2297 | 3rd Qu.:168.0 | 3rd Qu.:360 | 3rd Qu.:1.0000 | NA | NA | |
NA | NA | NA | NA | NA | NA | Max. :81000 | Max. :41667 | Max. :700.0 | Max. :480 | Max. :1.0000 | NA | NA | |
NA | NA | NA | NA | NA | NA | NA | NA | NA’s :22 | NA’s :14 | NA’s :50 | NA | NA |
Based on the above descriptive summary, quite a few variables have missing values, so we analyzed the missingness of each attribute to identify a proper imputation technique.
## Counts of missing data per feature
dataset_missing_counts <- data.frame(apply(dataset, 2, function(x) length(which(is.na(x)))))
dataset_missing_pct <- data.frame(apply(dataset, 2, function(x) {sum(is.na(x)) / length(x) * 100}))

dataset_missing_counts <- cbind(Feature = rownames(dataset_missing_counts), dataset_missing_counts, dataset_missing_pct)
colnames(dataset_missing_counts) <- c('Feature','NA_Count','NA_Percentage')
rownames(dataset_missing_counts) <- NULL

dataset_missing_counts %>% filter(`NA_Count` != 0) %>% arrange(desc(`NA_Count`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature | NA_Count | NA_Percentage |
---|---|---|
Credit_History | 50 | 8.1433225 |
Self_Employed | 32 | 5.2117264 |
LoanAmount | 22 | 3.5830619 |
Dependents | 15 | 2.4429967 |
Loan_Amount_Term | 14 | 2.2801303 |
Gender | 13 | 2.1172638 |
Married | 3 | 0.4885993 |
ggplot(dataset_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) +
geom_bar(stat = 'identity', fill = 'steelblue') +
geom_label(aes(label = NA_Count)) +
labs(title = 'Missing Counts') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_blank(), axis.title.x = element_blank())
# Use the naniar package to plot missing value patterns
gg_miss_upset(dataset)
Based on the above missing-value analysis, we are going to impute the data using the mice package with the random forest ('rf') method. Before that, we converted all categorical variables into factors -
# Transformation
# Loan_ID should be removed before imputing, since mice uses all columns to impute
dataset <- dataset %>%
  select(-'Loan_ID') %>%
  mutate(
    Gender = as.factor(Gender),
    Married = as.factor(Married),
    Dependents = as.factor(Dependents),
    Education = as.factor(Education),
    Self_Employed = as.factor(Self_Employed),
    Credit_History = as.factor(Credit_History),
    Property_Area = as.factor(Property_Area),
    Loan_Status = as.factor(Loan_Status)
  )
# Imputation using the random forest method ('rf')
init <- mice(dataset, maxit = 0)
predM <- init$predictorMatrix
set.seed(123)
imputed <- mice(dataset, method = 'rf', predictorMatrix = predM, m = 5)
##
## iter imp variable
## 1 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 1 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 1 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 1 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 1 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 2 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 3 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 4 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 2 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 3 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 4 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 5 5 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term Credit_History
dataset <- complete(imputed)
summary(dataset)
## Gender Married Dependents Education Self_Employed
## Female:113 No :214 0 :355 Graduate :480 No :529
## Male :501 Yes:400 1 :102 Not Graduate:134 Yes: 85
## 2 :104
## 3+: 53
##
##
## ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## Min. : 150 Min. : 0 Min. : 9.0 Min. : 12.0
## 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0 1st Qu.:360.0
## Median : 3812 Median : 1188 Median :127.5 Median :360.0
## Mean : 5403 Mean : 1621 Mean :146.7 Mean :341.7
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:169.5 3rd Qu.:360.0
## Max. :81000 Max. :41667 Max. :700.0 Max. :480.0
## Credit_History Property_Area Loan_Status
## 0: 91 Rural :179 N:192
## 1:523 Semiurban:233 Y:422
## Urban :202
##
##
##
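As a quick sanity check (our addition, not part of the original output), we can confirm that the completed dataset has no remaining missing values:
# Should return 0 after imputation
sum(is.na(dataset))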
We also checked for the presence of any degenerate (near-zero variance) variables and found none in our dataset -
# none of the variables meet the condition to be a degenerate feature
nearZeroVar(dataset)
## integer(0)
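For reference, a small sketch of our own: nearZeroVar() can also return the diagnostics it uses, the frequency ratio and the percentage of unique values, for every feature.
# Inspect the near-zero-variance diagnostics per feature
nearZeroVar(dataset, saveMetrics = TRUE)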
We analyzed the categorical and continuous variables separately -
cat_vars <- dataset %>%
  dplyr::select(-c('ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term')) %>%
  gather(key = 'predictor_variable', value = 'value', -Loan_Status)
# Plot a bar chart of counts for each categorical predictor variable
ggplot(cat_vars) +
  geom_bar(aes(x = value, fill = Loan_Status)) +
  labs(title = 'Distributions of Categorical Variables') +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(. ~ predictor_variable, scales = 'free', ncol = 3)
cont_vars <- dataset %>%
  dplyr::select(ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Loan_Status) %>%
  gather(key = 'predictor_variable', value = 'value', -Loan_Status)
# Plot a histogram for each continuous predictor variable
ggplot(cont_vars) +
geom_histogram(aes(x = value, y = ..density.., fill = Loan_Status), bins = 30) +
labs(title = 'Log Distributions of Continuous Variables') +
scale_x_log10() +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(. ~predictor_variable, scales = 'free', ncol = 2)
# Plot a histogram of applicant income faceted by a pair of predictor variables
bp <- ggplot(dataset, aes(x = ApplicantIncome)) +
  geom_histogram(bins = 30, color = "darkblue", fill = "lightblue") +
  labs(title = 'Distributions of Applicant Income By Gender and Marriage Status') +
  theme(plot.title = element_text(hjust = 0.5))
bp + facet_grid(Gender ~ Married, labeller = label_both)
The plot above shows that married males submitted the most loan applications, while among females, unmarried applicants applied for more loans than married ones.
# Plot a histogram of applicant income faceted by education and self-employment status
bp <- ggplot(dataset, aes(x = ApplicantIncome)) +
  geom_histogram(bins = 30, color = "darkblue", fill = "lightblue") +
  labs(title = 'Distributions of Applicant Income By Education and Self-Employed') +
  theme(plot.title = element_text(hjust = 0.5))
bp + facet_grid(Education ~ Self_Employed, labeller = label_both)
The plot above shows that salaried graduates applied for more loans than self-employed and non-graduate applicants.
# Plot a histogram of applicant income faceted by dependents and property area
bp <- ggplot(dataset, aes(x = ApplicantIncome)) +
  geom_histogram(bins = 30, color = "darkblue", fill = "lightblue") +
  labs(title = 'Distributions of Applicant Income By Dependents and Property_Area') +
  theme(plot.title = element_text(hjust = 0.5))
bp + facet_grid(Dependents ~ Property_Area, labeller = label_both)
From the plot above, applicants with no dependents submitted the most loan applications, and applicants living in semiurban areas also applied for more loans.
Converting all categorical variables present in the dataset to numeric codes -
# Alternative: caret's dummyVars() for one-hot encoding
# dummy <- dummyVars(" ~ .", data = dataset)
# newdata <- data.frame(predict(dummy, newdata = dataset))

# Converting categorical variables to numeric codes
newdata <- dataset %>% mutate_if(is.factor, as.numeric)
Next, we’ll have to work through a few transformations for our highly skewed continuous data. We will use log transformation to normalize the data.
newdata <- newdata %>%
  mutate(
    ApplicantIncome = log(ApplicantIncome),
    CoapplicantIncome = log(CoapplicantIncome),
    LoanAmount = log(LoanAmount),
    Loan_Amount_Term = log(Loan_Amount_Term))

# log(0) produces -Inf for applicants with no co-applicant income; floor those values at 0
newdata <- newdata %>%
  mutate(CoapplicantIncome = ifelse(CoapplicantIncome < 0, 0, CoapplicantIncome))
head(newdata) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width="100%",height="300px")
Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1 | 1 | 1 | 1 | 8.674026 | 0.000000 | 5.231109 | 5.886104 | 2 | 3 | 2 |
2 | 2 | 2 | 1 | 1 | 8.430109 | 7.318539 | 4.852030 | 5.886104 | 2 | 1 | 1 |
2 | 2 | 1 | 1 | 2 | 8.006368 | 0.000000 | 4.189655 | 5.886104 | 2 | 3 | 2 |
2 | 2 | 1 | 2 | 1 | 7.856707 | 7.765569 | 4.787492 | 5.886104 | 2 | 3 | 2 |
2 | 1 | 1 | 1 | 1 | 8.699515 | 0.000000 | 4.948760 | 5.886104 | 2 | 3 | 2 |
2 | 2 | 3 | 1 | 2 | 8.597297 | 8.341887 | 5.587249 | 5.886104 | 2 | 3 | 2 |
corrMatrix <- round(cor(newdata), 4)

corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order = "hclust", addrect = 4, rect.col = "black", rect.lwd = 5, cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))
Based on the above plot, there is no severe multicollinearity in the dataset. A few pairs, such as LoanAmount & ApplicantIncome and Credit_History & Loan_Status, show higher correlations for obvious reasons.
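As an additional check (a small sketch of our own using caret, which is already loaded), we can flag any predictors whose pairwise absolute correlation exceeds a chosen cutoff:
# Returns the names of columns that could be dropped due to high pairwise correlation
caret::findCorrelation(corrMatrix, cutoff = 0.75, names = TRUE)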
We are going to do a 75/25 split for training and test purposes.
sample = sample.split(newdata$Loan_Status, SplitRatio = 0.75)
train = subset(newdata, sample == TRUE)
test = subset(newdata, sample == FALSE)

#head(train) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

# Creating a separate data frame for the 'Loan_Status' feature, which is our target
train.loan_labels <- train[,12]
test.loan_labels <- test[,12]

train1 <- train[,-12]
test1 <- test[,-12]
Before building the LDA model, we applied a basic transformation.
# Convert the Loan_Status variable to a factor to work with the LDA model
train$Loan_Status <- factor(train$Loan_Status)
We built the LDA model using caret’s train() method -
library(caret)
# Create the LDA model using caret's train()
lda <- caret::train(Loan_Status ~ ., data = train,
                    method = 'lda', trControl = trainControl(method = "cv"))
# Output lda
lda
## Linear Discriminant Analysis
##
## 460 samples
## 11 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 414, 414, 414, 413, 414, 414, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8087532 0.4742248
Below is the summary of the generated model -
# Run summary on the LDA model
summary(lda)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 22 -none- numeric
## scaling 11 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## xNames 11 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
Then we checked the predictions against the actual values in tabular form -
# Predict on the test data using the LDA model we built
pred.Loan_Status = predict(lda, test)
# Organize results in a table
table(pred.Loan_Status, test$Loan_Status)
##
## pred.Loan_Status 1 2
## 1 23 2
## 2 25 104
We calculated the prediction accuracy -
# Calculate prediction accuracy
pred.accuracy = round(mean(pred.Loan_Status == test$Loan_Status) * 100, 2)
pred.accuracy
## [1] 82.47
The calculated accuracy is 82.47%, which is a good result.
We record the summary of the LDA model metrics in a data frame -
lda_model <- confusionMatrix(table(pred.Loan_Status, test$Loan_Status))$byClass
lda_accuracy <- confusionMatrix(table(pred.Loan_Status, test$Loan_Status))$overall['Accuracy']
lda_model <- data.frame(lda_model)
lda_model <- rbind("Accuracy" = lda_accuracy, lda_model)
For the kNN model, we need to identify an appropriate value for K, which represents the number of nearby neighbors used to determine a sample point's class. A rule of thumb is to start with the square root of the number of observations (i.e., rows or records). This rule typically leads to slightly higher-than-optimal values, but it is a good starting point. Later we will search a range of K values (and show how the caret package can help) to find the optimal K.
# Find the number of observations
NROW(train$Loan_Status)
## [1] 460
So, we have 460 observations in our training data set.
# initial value for K
sqrt(nrow(train))
## [1] 21.44761
The square root of 460 is around 21.45, so we'll create two models: one with K = 21 and the other with K = 22.
knn.21 <- knn(train = train1, test = test1, cl = train.loan_labels, k = 21)
knn.22 <- knn(train = train1, test = test1, cl = train.loan_labels, k = 22)
# Calculate the proportion of correct classifications for k = 21, 22
ACC.21 <- 100 * sum(test.loan_labels == knn.21) / NROW(test.loan_labels)
ACC.22 <- 100 * sum(test.loan_labels == knn.22) / NROW(test.loan_labels)

# Print accuracy scores
ACC.21
## [1] 70.12987
ACC.22
## [1] 70.77922
# Check predictions against actual values in tabular form for k = 21
table(knn.21, test.loan_labels)
## test.loan_labels
## knn.21 1 2
## 1 2 0
## 2 46 106
# Check predictions against actual values in tabular form for k = 22
confusionMatrix(table(knn.22, test.loan_labels))
## Confusion Matrix and Statistics
##
## test.loan_labels
## knn.22 1 2
## 1 3 0
## 2 45 106
##
## Accuracy : 0.7078
## 95% CI : (0.6292, 0.7782)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.3351
##
## Kappa : 0.0841
##
## Mcnemar's Test P-Value : 5.412e-11
##
## Sensitivity : 0.06250
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.70199
## Prevalence : 0.31169
## Detection Rate : 0.01948
## Detection Prevalence : 0.01948
## Balanced Accuracy : 0.53125
##
## 'Positive' Class : 1
##
Our resulting confusion matrix compares the kNN-predicted loan approval status to the actual loan status. K = 22 achieved an overall accuracy of 70.78% on our test set. However, this choice of K might be sub-optimal, so next we will search for the optimal value of K.
To improve the accuracy of the model, one can use techniques such as the elbow method and a maximum-percentage-accuracy graph. In the code snippet below, we loop over 'K' values from 1 to 25 and calculate the accuracy of the kNN model for each, which shows which 'K' value yields the most accurate model:
i = 1
k.optm = 1
for (i in 1:25) {
  knn.mod <- knn(train = train1, test = test1, cl = train.loan_labels, k = i)
  k.optm[i] <- 100 * sum(test.loan_labels == knn.mod) / NROW(test.loan_labels)
  k = i
  cat(k, '=', k.optm[i], '\n')
}
## 1 = 72.72727
## 2 = 70.12987
## 3 = 72.72727
## 4 = 72.07792
## 5 = 72.07792
## 6 = 71.42857
## 7 = 74.02597
## 8 = 73.37662
## 9 = 72.07792
## 10 = 73.37662
## 11 = 71.42857
## 12 = 72.07792
## 13 = 71.42857
## 14 = 71.42857
## 15 = 71.42857
## 16 = 70.77922
## 17 = 71.42857
## 18 = 72.07792
## 19 = 70.77922
## 20 = 70.77922
## 21 = 70.12987
## 22 = 69.48052
## 23 = 70.12987
## 24 = 70.12987
## 25 = 69.48052
#Accuracy plot
plot(k.optm, type="b", xlab="K- Value",ylab="Accuracy level")
Based on the above plot, K=7 seems to perform best in terms of accuracy.
knn.7 <- knn(train = train1, test = test1, cl = train.loan_labels, k = 7)

# Calculate the proportion of correct classifications for k = 7
ACC.7 <- 100 * sum(test.loan_labels == knn.7) / NROW(test.loan_labels)

# Print accuracy score
ACC.7
## [1] 74.02597
# Check predictions against actual values in tabular form for k = 7
table(knn.7, test.loan_labels)
## test.loan_labels
## knn.7 1 2
## 1 11 3
## 2 37 103
# Confusion matrix and statistics for k = 7
confusionMatrix(table(knn.7, test.loan_labels))
## Confusion Matrix and Statistics
##
## test.loan_labels
## knn.7 1 2
## 1 11 3
## 2 37 103
##
## Accuracy : 0.7403
## 95% CI : (0.6635, 0.8075)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.0945
##
## Kappa : 0.2491
##
## Mcnemar's Test P-Value : 1.811e-07
##
## Sensitivity : 0.22917
## Specificity : 0.97170
## Pos Pred Value : 0.78571
## Neg Pred Value : 0.73571
## Prevalence : 0.31169
## Detection Rate : 0.07143
## Detection Prevalence : 0.09091
## Balanced Accuracy : 0.60043
##
## 'Positive' Class : 1
##
From our output, we can see that K = 7 is the optimal value for our model based on the calculated accuracy and Kappa values. Given this, K = 7 may ultimately perform better than our initial kNN model with K = 22, especially on new data.
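As an alternative to the manual loop, caret (alluded to earlier) can tune K directly via cross-validation. A minimal sketch of our own; note the resampled accuracies will differ somewhat from the hold-out accuracies above:
# Let caret's train() search a grid of K values with 10-fold CV on the training set
set.seed(123)
knn_cv <- caret::train(Loan_Status ~ ., data = train,
                       method = 'knn', tuneLength = 25,
                       trControl = trainControl(method = "cv"))
knn_cv$bestTune  # the K with the best resampled accuracy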
We record the summary of the kNN model metrics in a data frame -
knn_model <- confusionMatrix(table(knn.7, test.loan_labels))$byClass
knn_accuracy <- confusionMatrix(table(knn.7, test.loan_labels))$overall['Accuracy']
knn_model <- data.frame(knn_model)
knn_model <- rbind("Accuracy" = knn_accuracy, knn_model)

knn_model
Metric | knn_model |
---|---|
Accuracy | 0.7402597 |
Sensitivity | 0.2291667 |
Specificity | 0.9716981 |
Pos Pred Value | 0.7857143 |
Neg Pred Value | 0.7357143 |
Precision | 0.7857143 |
Recall | 0.2291667 |
F1 | 0.3548387 |
Prevalence | 0.3116883 |
Detection Rate | 0.0714286 |
Detection Prevalence | 0.0909091 |
Balanced Accuracy | 0.6004324 |
Prior to building our decision tree, we need to ensure that all categorical variables in our train and test sets are coded as factors:
train2 <- train %>%
  mutate(
    Gender = as.factor(Gender),
    Married = as.factor(Married),
    Dependents = as.factor(Dependents),
    Education = as.factor(Education),
    Self_Employed = as.factor(Self_Employed),
    Credit_History = as.factor(Credit_History),
    Property_Area = as.factor(Property_Area),
    Loan_Status = as.factor(Loan_Status))

test2 <- test %>%
  mutate(
    Gender = as.factor(Gender),
    Married = as.factor(Married),
    Dependents = as.factor(Dependents),
    Education = as.factor(Education),
    Self_Employed = as.factor(Self_Employed),
    Credit_History = as.factor(Credit_History),
    Property_Area = as.factor(Property_Area),
    Loan_Status = as.factor(Loan_Status))
dt <- rpart(Loan_Status ~ ., data = train2, method = "class")
rpart.plot(dt, nn = TRUE)
Here are some of the details of the tree:
summary(dt)
## Call:
## rpart(formula = Loan_Status ~ ., data = train2, method = "class")
## n= 460
##
## CP nsplit rel error xerror xstd
## 1 0.3888889 0 1.0000000 1.0000000 0.06906903
## 2 0.0150463 1 0.6111111 0.6111111 0.05858297
## 3 0.0100000 7 0.5208333 0.6666667 0.06052641
##
## Variable importance
## Credit_History ApplicantIncome LoanAmount CoapplicantIncome
## 63 13 7 6
## Property_Area Dependents Married Gender
## 5 3 2 1
##
## Node number 1: 460 observations, complexity param=0.3888889
## predicted class=2 expected loss=0.3130435 P(node) =1
## class counts: 144 316
## probabilities: 0.313 0.687
## left son=2 (66 obs) right son=3 (394 obs)
## Primary splits:
## Credit_History splits as LR, improve=57.570600, (0 missing)
## Property_Area splits as LRL, improve= 5.345705, (0 missing)
## Loan_Amount_Term < 6.029945 to the right, improve= 3.867370, (0 missing)
## Education splits as RL, improve= 2.381620, (0 missing)
## LoanAmount < 5.096771 to the right, improve= 2.014418, (0 missing)
## Surrogate splits:
## ApplicantIncome < 10.58584 to the right, agree=0.861, adj=0.03, (0 split)
##
## Node number 2: 66 observations
## predicted class=1 expected loss=0.07575758 P(node) =0.1434783
## class counts: 61 5
## probabilities: 0.924 0.076
##
## Node number 3: 394 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.2106599 P(node) =0.8565217
## class counts: 83 311
## probabilities: 0.211 0.789
## left son=6 (246 obs) right son=7 (148 obs)
## Primary splits:
## Property_Area splits as LRL, improve=3.758429, (0 missing)
## LoanAmount < 5.102795 to the right, improve=2.923957, (0 missing)
## Loan_Amount_Term < 6.029945 to the right, improve=2.803773, (0 missing)
## ApplicantIncome < 7.615516 to the left, improve=1.380457, (0 missing)
## CoapplicantIncome < 8.916276 to the right, improve=1.367244, (0 missing)
## Surrogate splits:
## ApplicantIncome < 10.25357 to the left, agree=0.632, adj=0.02, (0 split)
##
## Node number 6: 246 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.2642276 P(node) =0.5347826
## class counts: 65 181
## probabilities: 0.264 0.736
## left son=12 (54 obs) right son=13 (192 obs)
## Primary splits:
## LoanAmount < 5.150393 to the right, improve=2.1504070, (0 missing)
## CoapplicantIncome < 6.408961 to the left, improve=1.6364930, (0 missing)
## ApplicantIncome < 7.615516 to the left, improve=1.5026190, (0 missing)
## Education splits as RL, improve=0.7206106, (0 missing)
## Loan_Amount_Term < 5.592211 to the right, improve=0.6046146, (0 missing)
## Surrogate splits:
## ApplicantIncome < 8.890988 to the right, agree=0.866, adj=0.389, (0 split)
## CoapplicantIncome < 8.594171 to the right, agree=0.805, adj=0.111, (0 split)
##
## Node number 7: 148 observations
## predicted class=2 expected loss=0.1216216 P(node) =0.3217391
## class counts: 18 130
## probabilities: 0.122 0.878
##
## Node number 12: 54 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.3888889 P(node) =0.1173913
## class counts: 21 33
## probabilities: 0.389 0.611
## left son=24 (28 obs) right son=25 (26 obs)
## Primary splits:
## Dependents splits as LRRR, improve=2.5073260, (0 missing)
## Loan_Amount_Term < 5.794943 to the right, improve=2.4326240, (0 missing)
## CoapplicantIncome < 8.590765 to the left, improve=1.3079710, (0 missing)
## ApplicantIncome < 8.671111 to the left, improve=0.9798622, (0 missing)
## Property_Area splits as L-R, improve=0.6611722, (0 missing)
## Surrogate splits:
## Married splits as LR, agree=0.667, adj=0.308, (0 split)
## CoapplicantIncome < 7.645629 to the right, agree=0.611, adj=0.192, (0 split)
## ApplicantIncome < 8.190675 to the right, agree=0.593, adj=0.154, (0 split)
## Property_Area splits as L-R, agree=0.593, adj=0.154, (0 split)
## LoanAmount < 5.212181 to the left, agree=0.574, adj=0.115, (0 split)
##
## Node number 13: 192 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.2291667 P(node) =0.4173913
## class counts: 44 148
## probabilities: 0.229 0.771
## left son=26 (93 obs) right son=27 (99 obs)
## Primary splits:
## CoapplicantIncome < 6.408961 to the left, improve=3.9141410, (0 missing)
## ApplicantIncome < 7.806656 to the left, improve=1.9390670, (0 missing)
## LoanAmount < 4.740565 to the left, improve=1.3235290, (0 missing)
## Education splits as RL, improve=0.8888889, (0 missing)
## Dependents splits as RLRR, improve=0.5333333, (0 missing)
## Surrogate splits:
## ApplicantIncome < 8.131384 to the right, agree=0.698, adj=0.376, (0 split)
## Married splits as LR, agree=0.661, adj=0.301, (0 split)
## LoanAmount < 4.412633 to the left, agree=0.646, adj=0.269, (0 split)
## Gender splits as LR, agree=0.620, adj=0.215, (0 split)
## Dependents splits as LRRL, agree=0.568, adj=0.108, (0 split)
##
## Node number 24: 28 observations, complexity param=0.0150463
## predicted class=1 expected loss=0.4642857 P(node) =0.06086957
## class counts: 15 13
## probabilities: 0.536 0.464
## left son=48 (13 obs) right son=49 (15 obs)
## Primary splits:
## LoanAmount < 5.395689 to the right, improve=2.6465200, (0 missing)
## CoapplicantIncome < 7.869991 to the left, improve=1.0729560, (0 missing)
## ApplicantIncome < 9.066183 to the right, improve=0.3670741, (0 missing)
## Property_Area splits as L-R, improve=0.2387319, (0 missing)
## Married splits as LR, improve=0.0952381, (0 missing)
## Surrogate splits:
## ApplicantIncome < 9.066183 to the right, agree=0.857, adj=0.692, (0 split)
## CoapplicantIncome < 7.659594 to the left, agree=0.679, adj=0.308, (0 split)
## Loan_Amount_Term < 5.794943 to the right, agree=0.571, adj=0.077, (0 split)
## Property_Area splits as R-L, agree=0.571, adj=0.077, (0 split)
##
## Node number 25: 26 observations
## predicted class=2 expected loss=0.2307692 P(node) =0.05652174
## class counts: 6 20
## probabilities: 0.231 0.769
##
## Node number 26: 93 observations, complexity param=0.0150463
## predicted class=2 expected loss=0.3333333 P(node) =0.2021739
## class counts: 31 62
## probabilities: 0.333 0.667
## left son=52 (8 obs) right son=53 (85 obs)
## Primary splits:
## ApplicantIncome < 7.806656 to the left, improve=5.1362750, (0 missing)
## LoanAmount < 4.795756 to the left, improve=2.0751370, (0 missing)
## Loan_Amount_Term < 5.592211 to the right, improve=0.9737179, (0 missing)
## Education splits as RL, improve=0.7780392, (0 missing)
## Dependents splits as RLLR, improve=0.6289855, (0 missing)
## Surrogate splits:
## LoanAmount < 3.622114 to the left, agree=0.925, adj=0.125, (0 split)
##
## Node number 27: 99 observations
## predicted class=2 expected loss=0.1313131 P(node) =0.2152174
## class counts: 13 86
## probabilities: 0.131 0.869
##
## Node number 48: 13 observations
## predicted class=1 expected loss=0.2307692 P(node) =0.02826087
## class counts: 10 3
## probabilities: 0.769 0.231
##
## Node number 49: 15 observations
## predicted class=2 expected loss=0.3333333 P(node) =0.0326087
## class counts: 5 10
## probabilities: 0.333 0.667
##
## Node number 52: 8 observations
## predicted class=1 expected loss=0.125 P(node) =0.0173913
## class counts: 7 1
## probabilities: 0.875 0.125
##
## Node number 53: 85 observations
## predicted class=2 expected loss=0.2823529 P(node) =0.1847826
## class counts: 24 61
## probabilities: 0.282 0.718
Credit history and income seem to be some of the important predictors for loan approval.
We are going to apply our tree to the test data and create a confusion matrix to evaluate the accuracy of the classifications.
# Predict the class labels on the test set
tree.pred = predict(dt, test2, type = "class")
# Confusion matrix on the training data (predict() without newdata uses the training set)
confusionMatrix(predict(dt, type = "class"), train2$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 78 9
## 2 66 307
##
## Accuracy : 0.837
## 95% CI : (0.8, 0.8695)
## No Information Rate : 0.687
## P-Value [Acc > NIR] : 1.459e-13
##
## Kappa : 0.5751
##
## Mcnemar's Test P-Value : 1.004e-10
##
## Sensitivity : 0.5417
## Specificity : 0.9715
## Pos Pred Value : 0.8966
## Neg Pred Value : 0.8231
## Prevalence : 0.3130
## Detection Rate : 0.1696
## Detection Prevalence : 0.1891
## Balanced Accuracy : 0.7566
##
## 'Positive' Class : 1
##
Our model is 83% accurate on the training data. The tree was grown to full depth, so it may be overfitting. To improve generalization, we prune the tree using cross-validation:
plotcp(dt)
dt$cptable
## CP nsplit rel error xerror xstd
## 1 0.3888889 0 1.0000000 1.0000000 0.06906903
## 2 0.0150463 1 0.6111111 0.6111111 0.05858297
## 3 0.0100000 7 0.5208333 0.6666667 0.06052641
The plot above shows the cross-validated error against the complexity parameter. The curve is at its lowest at a tree size of 2, so we prune our tree to that size. At size 2, the cross-validated error is ~0.611 and the chosen cp is 0.01157.
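A programmatic alternative (a small sketch of our own, equivalent to reading the table above): pick the cp from the cptable row with the lowest cross-validated error.
# Select the complexity parameter with the minimum cross-validated error
best_cp <- dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"]
best_cp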
prune_dt = prune(dt, cp = 0.01157)
rpart.plot(prune_dt)

Prune_pred <- predict(prune_dt, test2, type = "class")
confusionMatrix(Prune_pred, test2$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 23 6
## 2 25 100
##
## Accuracy : 0.7987
## 95% CI : (0.7266, 0.8589)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.001482
##
## Kappa : 0.4739
##
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.4792
## Specificity : 0.9434
## Pos Pred Value : 0.7931
## Neg Pred Value : 0.8000
## Prevalence : 0.3117
## Detection Rate : 0.1494
## Detection Prevalence : 0.1883
## Balanced Accuracy : 0.7113
##
## 'Positive' Class : 1
##
Pruning the tree yields a test-set accuracy of about 80%, lower than the ~84% figure above (which, note, was computed on the training data), so we will stick with the base decision tree model.
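For an apples-to-apples comparison (our addition, using tree.pred computed earlier), the unpruned tree's accuracy on the same test set can be calculated directly:
# Test-set accuracy (%) of the unpruned tree, for comparison with the pruned tree
round(mean(tree.pred == test2$Loan_Status) * 100, 2)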
We record the summary of the Decision Tree model metrics in a data frame -
dtree_model <- confusionMatrix(table(tree.pred, test2$Loan_Status))$byClass
dtree_accuracy <- confusionMatrix(table(tree.pred, test2$Loan_Status))$overall['Accuracy']
dtree_model <- data.frame(dtree_model)
dtree_model <- rbind("Accuracy" = dtree_accuracy, dtree_model)
However, single trees often do not give very good prediction errors. Therefore, we will build a random forest model, which tends to outperform a single tree in terms of prediction and misclassification errors.
In creating the best random forest model, we want to minimize the OOB error rate by finding the optimal number of variables considered at each split, known as mtry. The code below finds the optimal mtry to use in our random forest model.
# Finding the best mtry for the random forest model by minimizing the OOB error
mtry <- randomForest::tuneRF(train2[-12], train2$Loan_Status, ntreeTry = 500,
                             stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = TRUE)
## mtry = 3 OOB error = 20.22%
## Searching left ...
## mtry = 2 OOB error = 20%
## 0.01075269 0.01
## Searching right ...
## mtry = 4 OOB error = 20.43%
## -0.02173913 0.01
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
print(mtry)
## mtry OOBError
## 2.OOB 2 0.2000000
## 3.OOB 3 0.2021739
## 4.OOB 4 0.2043478
print(best.m)
## [1] 2
Once the optimal mtry value is found, we apply it to our model.
# Using the best mtry in the model, then plotting variable importance
set.seed(71)
rf <- randomForest(Loan_Status ~ ., data = train2, mtry = best.m, importance = TRUE, ntree = 500)
print(rf)
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = train2, mtry = best.m, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20%
## Confusion matrix:
## 1 2 class.error
## 1 65 79 0.54861111
## 2 13 303 0.04113924
The graph below illustrates the importance of the variables used to predict the loan status. Mean Decrease Accuracy shows how much model accuracy decreases if we drop a variable; here, Credit_History is the most important variable by a wide margin. Mean Decrease Gini ranks variables by the Gini impurity reduction they provide when used for splitting; again, Credit_History is the clear leader, though by a narrower margin, followed by LoanAmount.
#Evaluate variable importance
varImpPlot(rf)
The random forest model we end up using has an accuracy of 81.82% on the test dataset. Furthermore, the model has a 97.17% specificity rate and a 47.92% sensitivity rate.
rf_predict <- predict(rf, newdata = test2)
rf_conf_matrix <- confusionMatrix(rf_predict, test2$Loan_Status)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 23 3
## 2 25 103
##
## Accuracy : 0.8182
## 95% CI : (0.7481, 0.8757)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.0001955
##
## Kappa : 0.5155
##
## Mcnemar's Test P-Value : 7.229e-05
##
## Sensitivity : 0.4792
## Specificity : 0.9717
## Pos Pred Value : 0.8846
## Neg Pred Value : 0.8047
## Prevalence : 0.3117
## Detection Rate : 0.1494
## Detection Prevalence : 0.1688
## Balanced Accuracy : 0.7254
##
## 'Positive' Class : 1
##
We record the summary of the Random Forest model metrics in a data frame -
rf_model <- confusionMatrix(table(rf_predict, test2$Loan_Status))$byClass
rf_accuracy <- confusionMatrix(table(rf_predict, test2$Loan_Status))$overall['Accuracy']
rf_model <- data.frame(rf_model)
rf_model <- rbind("Accuracy" = rf_accuracy, rf_model)
After running the LDA, kNN, decision tree, and random forest models, we can take a look at the overall evaluation metrics for these techniques on the loan approval dataset. By collecting all of our metrics in one data frame, we can compare the outcomes below:
model_summary <- data.frame(lda_model, knn_model, dtree_model, rf_model)

model_summary %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="450px")
Metric | lda_model | knn_model | dtree_model | rf_model |
---|---|---|---|---|
Accuracy | 0.8246753 | 0.7402597 | 0.7987013 | 0.8181818 |
Sensitivity | 0.4791667 | 0.2291667 | 0.4791667 | 0.4791667 |
Specificity | 0.9811321 | 0.9716981 | 0.9433962 | 0.9716981 |
Pos Pred Value | 0.9200000 | 0.7857143 | 0.7931034 | 0.8846154 |
Neg Pred Value | 0.8062016 | 0.7357143 | 0.8000000 | 0.8046875 |
Precision | 0.9200000 | 0.7857143 | 0.7931034 | 0.8846154 |
Recall | 0.4791667 | 0.2291667 | 0.4791667 | 0.4791667 |
F1 | 0.6301370 | 0.3548387 | 0.5974026 | 0.6216216 |
Prevalence | 0.3116883 | 0.3116883 | 0.3116883 | 0.3116883 |
Detection Rate | 0.1493506 | 0.0714286 | 0.1493506 | 0.1493506 |
Detection Prevalence | 0.1623377 | 0.0909091 | 0.1883117 | 0.1688312 |
Balanced Accuracy | 0.7301494 | 0.6004324 | 0.7112814 | 0.7254324 |
Consulting the output table above, we observe that the LDA model has the strongest performance on Accuracy, Specificity, Pos Pred Value, Precision, and F1.
Decision trees often raise concerns regarding over-fitting and high variance because of their simplicity, and random forests are meant to address these concerns by aggregating a collection of decision trees into a single result. We found it surprising that LDA outperformed the random forest model on many metrics (e.g., Balanced Accuracy). This may be due to how we implemented the model, or it may simply have been a poor situation for random forests. We are also dealing with imbalanced classes (recall 192 N's vs. 422 Y's), which may have affected model performance.
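As a possible follow-up (a minimal sketch of our own, not part of the original analysis), the class imbalance could be addressed by having each tree in the forest down-sample the majority class via randomForest's strata and sampsize arguments:
# Sketch: balance classes by down-sampling the majority class within each bootstrap sample
set.seed(71)
n_min <- min(table(train2$Loan_Status))  # size of the minority class
rf_bal <- randomForest(Loan_Status ~ ., data = train2, mtry = best.m, ntree = 500,
                       strata = train2$Loan_Status, sampsize = c(n_min, n_min))
confusionMatrix(predict(rf_bal, newdata = test2), test2$Loan_Status)
This typically trades some overall accuracy for better sensitivity on the minority class; whether that trade-off is worthwhile depends on the relative cost of each error type.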