Data

The Loan Prediction Problem Dataset was created by a Kaggle user named ‘Debdatta Chatterjee’, and it serves as a resource for machine learning enthusiasts and practitioners who want to develop their skills in predicting loan eligibility. The dataset aims to help users build models that can predict whether a loan will be approved or not based on a variety of applicant information. This is a classic classification problem in the field of machine learning and can be used to practice various algorithms and techniques.

The data was collected from real-life loan applications but has been anonymized to protect the privacy of the individuals involved. The purpose of collecting this data was to facilitate the development of predictive models that can streamline the loan approval process, making it more efficient for financial institutions and improving the user experience for loan applicants.

Dataset Attributes

The dataset consists of 614 records and 13 attributes, which include:

  1. Loan_ID: A unique identification number for each loan application.

  2. Gender: The gender of the applicant (Male or Female).

  3. Married: The marital status of the applicant (Yes or No).

  4. Dependents: The number of dependents of the applicant (0, 1, 2, or 3+).

  5. Education: The level of education of the applicant (Graduate or Not Graduate).

  6. Self_Employed: Whether the applicant is self-employed or not (Yes or No).

  7. ApplicantIncome: The monthly income of the applicant.

  8. CoapplicantIncome: The monthly income of the co-applicant.

  9. LoanAmount: The loan amount requested (in thousands).

  10. Loan_Amount_Term: The term of the loan, in months.

  11. Credit_History: A binary variable representing the applicant’s credit history (1 if they have a credit history that meets the guidelines, 0 otherwise).

  12. Property_Area: The category of the property area where the applicant resides (Urban, Semiurban, or Rural).

  13. Loan_Status: The final decision on the loan application (Y for approved, N for not approved).

Representation Decisions

The representation decisions made for this dataset involve encoding categorical variables with string labels, such as Gender, Married, Education, Self_Employed, Property_Area, and Loan_Status. The use of string labels makes the dataset more interpretable but requires preprocessing, such as one-hot encoding, when applying machine learning algorithms. Additionally, the Credit_History attribute is a binary variable, which simplifies the representation of credit history but may not capture the full complexity of an applicant’s credit background.
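
As an illustration of that preprocessing step, base R’s model.matrix() expands factor columns into indicator (dummy) variables; a minimal sketch, assuming the tr data frame that is read in with stringsAsFactors = TRUE later in this report:

# Expand the string-labelled factors into 0/1 indicator columns
# (treatment coding: the first level of each factor is the reference)
X <- model.matrix(~ Gender + Married + Education + Property_Area, data = tr)
head(X)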

By referring to the “Datasheets for Datasets” paper by Gebru et al. (2018), users working with this dataset can gain a deeper understanding of its attributes, limitations, and potential biases, ensuring responsible and effective use of the data in developing predictive models for loan eligibility.

Questions

The next step is to examine the data we’re working with. In practice, most datasets, including those from reputable sources, contain errors or inconsistencies, and it’s crucial to identify and address these issues before investing time in analysis. Based on the information in the dataset, we can consider addressing the following questions:

  • Are there any noticeable errors or inconsistencies in the data?
  • Do any ambiguous or unclear variables exist in the dataset?
  • Are there any variables that require modification or removal?
  • How does the credit history of an applicant affect the loan approval rate, and how important is it in predicting loan approval?
  • How does the level of education of the applicant affect the loan approval rate, and is there a correlation between education level and other factors such as income or employment status?
  • Does the gender of the applicant have an impact on loan approval, and is there any gender bias in the loan approval process?

Let’s begin by importing the data using the read.csv() function and display the initial segment of the dataset to gain a better understanding of its structure:

setwd("/Users/mikexzibit/Documents/FinalProject")
tr <- read.csv('train.csv', header = TRUE, stringsAsFactors=TRUE)
head(tr)

By reviewing the initial segment of the dataset, we can identify any obvious errors or inconsistencies. Furthermore, it’s important to perform a more comprehensive assessment of the data quality. First, let’s investigate summary statistics and data distributions to detect potential outliers, missing values, or unusual patterns.

summary(tr)
##      Loan_ID       Gender    Married   Dependents        Education  
##  LP001002:  1         : 13      :  3     : 15     Graduate    :480  
##  LP001003:  1   Female:112   No :213   0 :345     Not Graduate:134  
##  LP001005:  1   Male  :489   Yes:398   1 :102                       
##  LP001006:  1                          2 :101                       
##  LP001008:  1                          3+: 51                       
##  LP001011:  1                                                       
##  (Other) :608                                                       
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##     : 32       Min.   :  150   Min.   :    0     Min.   :  9.0  
##  No :500       1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  Yes: 82       Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50

Visualization

Upon examining the output, we can draw several conclusions and identify potential issues that require further investigation or remediation:

  1. There are 51 records whose Dependents value is recorded as “3+”. This string label must be recoded (for example, to “3”) before the variable can be treated numerically.
  2. The mean of the Credit_History variable is 0.8422. Since the variable is binary (1 for applicants whose credit history meets the guidelines, 0 otherwise), this mean is simply the share of qualifying applicants, which reveals a pronounced class imbalance. It also suggests the column should be converted to a factor rather than left numeric.
  3. Missing or blank values have been identified in several categorical variables, including Gender, Married, Dependents, and Self_Employed. These should be addressed by either imputing appropriate values or removing records with missing data, depending on the chosen strategy.
  4. There are missing values (NAs) in the LoanAmount, Loan_Amount_Term, and Credit_History variables. Similar to the categorical variables, these missing values should be handled using suitable imputation techniques or by removing the affected records, depending on the impact on the overall dataset.

To resolve these issues and ensure the dataset’s integrity, it is essential to conduct further examination and preprocessing, such as removing or imputing missing values, correcting any inconsistencies, and understanding the context behind any unusual patterns or values. This process will help ensure that the dataset is reliable and well-suited for subsequent analysis or modeling tasks.

# Re-read the data, treating empty strings as missing values (NA)
tr <- read.csv(file = "train.csv", na.strings = c("", "NA"), header = TRUE)
# Recode the "3+" level of Dependents to "3"
library(plyr)
tr$Dependents <- revalue(tr$Dependents, c("3+" = "3"))

Now, let’s have a closer look at the missing data:

sapply(tr, function(x) sum(is.na(x)))
##           Loan_ID            Gender           Married        Dependents 
##                 0                13                 3                15 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                32                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                22                14                50                 0 
##       Loan_Status 
##                 0
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
mice_plot <- aggr(tr, col=c('navyblue','red'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(tr), cex.axis=.7,
                  gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##           Variable       Count
##     Credit_History 0.081433225
##      Self_Employed 0.052117264
##         LoanAmount 0.035830619
##         Dependents 0.024429967
##   Loan_Amount_Term 0.022801303
##             Gender 0.021172638
##            Married 0.004885993
##            Loan_ID 0.000000000
##          Education 0.000000000
##    ApplicantIncome 0.000000000
##  CoapplicantIncome 0.000000000
##      Property_Area 0.000000000
##        Loan_Status 0.000000000

From the chart and table, we can identify that there are seven variables with missing data. To better understand the dataset, it’s essential to analyze the distribution of the data, particularly for numerical variables such as LoanAmount and ApplicantIncome. Visualizing these variables using histograms and boxplots can provide valuable insights into their distribution and potential outliers.

  • Begin by plotting histograms for LoanAmount and ApplicantIncome to visualize the data distribution and identify potential skewness or unusual patterns.

  • Next, create boxplots for LoanAmount and ApplicantIncome to gain a clearer understanding of the central tendency, spread, and potential outliers in the data.

# Distribution of the requested loan amounts
hist(tr$LoanAmount, 
     main="Histogram for Loan Amount", 
     xlab="Loan Amount", 
     border="blue", 
     col="maroon",
     xlim=c(0,700),
     breaks=20)

hist(tr$ApplicantIncome, 
     main="Histogram for Applicant Income", 
     xlab="Income", 
     border="blue", 
     col="maroon",
     xlim=c(0,80000),
     breaks=50)

## Histograms & Boxplots
par(mfrow=c(2,2))
hist(tr$LoanAmount, 
     main="Histogram for LoanAmount", 
     xlab="Loan Amount", 
     border="blue", 
     col="maroon",
     las=1, 
     breaks=20, prob = TRUE)
# A density overlay here would need na.rm = TRUE, since LoanAmount contains NAs:
# lines(density(tr$LoanAmount, na.rm = TRUE), col='black', lwd=3)
boxplot(tr$LoanAmount, col='maroon',xlab = 'LoanAmount', main = 'Box Plot for Loan Amount')


hist(tr$ApplicantIncome, 
     main="Histogram for Applicant Income", 
     xlab="Income", 
     border="blue", 
     col="maroon",
     las=1, 
     breaks=50, prob = TRUE)
#lines(density(tr$ApplicantIncome), col='black', lwd=3)
boxplot(tr$ApplicantIncome, col='maroon',xlab = 'ApplicantIncome', main = 'Box Plot for Applicant Income')

Here, we observe that there are a few extreme values in both LoanAmount and ApplicantIncome variables.
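
To put a number on what the boxplots suggest, the 1.5 × IQR rule used by boxplot.stats() counts the flagged points (a quick sketch):

# Count the points the boxplots flag as outliers (1.5 * IQR rule)
length(boxplot.stats(tr$ApplicantIncome)$out)
length(boxplot.stats(tr$LoanAmount)$out)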

To further explore the dataset, let’s examine whether the distribution of loan amounts for applicants is influenced by their educational level:

library(ggplot2)
ggplot(data=tr, aes(x=LoanAmount, fill=Education)) +
  geom_density() +
  facet_grid(Education~.)
## Warning: Removed 22 rows containing non-finite values (`stat_density()`).

We observe that graduates exhibit a greater number of outliers, and their loan amount distribution is broader compared to non-graduates. To further enhance our understanding of the dataset, let’s explore the categorical variables:

par(mfrow=c(2,3))
counts <- table(tr$Loan_Status, tr$Gender)
barplot(counts, main="Loan Status by Gender",
        xlab="Gender", col=c("darkgrey","maroon"),
        legend = rownames(counts))
counts2 <- table(tr$Loan_Status, tr$Education)
barplot(counts2, main="Loan Status by Education",
        xlab="Education", col=c("darkgrey","maroon"),
        legend = rownames(counts2))
counts3 <- table(tr$Loan_Status, tr$Married)
barplot(counts3, main="Loan Status by Married",
        xlab="Married", col=c("darkgrey","maroon"),
        legend = rownames(counts3))
counts4 <- table(tr$Loan_Status, tr$Self_Employed)
barplot(counts4, main="Loan Status by Self Employed",
        xlab="Self_Employed", col=c("darkgrey","maroon"),
        legend = rownames(counts4))
counts5 <- table(tr$Loan_Status, tr$Property_Area)
barplot(counts5, main="Loan Status by Property_Area",
        xlab="Property_Area", col=c("darkgrey","maroon"),
        legend = rownames(counts5))
counts6 <- table(tr$Loan_Status, tr$Credit_History)
barplot(counts6, main="Loan Status by Credit_History",
        xlab="Credit_History", col=c("darkgrey","maroon"),
        legend = rownames(counts6))

Upon examining the Gender graph, we observe that males constitute a larger proportion of the dataset, with more than half of their loan applications approved. Although there are fewer female applicants, a significant portion of their applications were also approved. The remaining charts can be read in the same way to gauge how each category fares with respect to loan approval.
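
Column-wise proportion tables put numbers on these visual impressions; for example, the approval rate within each gender (a quick sketch):

# Share of approved (Y) and rejected (N) applications within each gender
prop.table(table(tr$Loan_Status, tr$Gender), margin = 2)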

Data Cleaning

Before proceeding with our analysis, we must address the identified issues within the dataset. Let’s recap the main concerns:

  • Missing values in some variables - We need to decide on an appropriate method for handling missing data based on the significance of each variable.

  • Outliers in ApplicantIncome and LoanAmount - We must determine how to handle these outliers and whether they result from measurement errors, recording errors, or genuine anomalies.

Handling Missing Values

In this dataset, we’ll assume the values are missing at random: the gaps appear across several variables, in both numerical and categorical columns, without an obvious pattern tied to the outcome. The ‘mice’ package in R helps impute missing values with plausible ones, inferred from a distribution designed for each missing data point. From the missing data plot, we observe that 78% of the records contain no missing information, about 7% are missing only the Credit_History value, and the remaining records exhibit other missing patterns.
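
For reference, here is a minimal sketch of what mice-based imputation could look like (illustrative settings; it assumes the categorical columns are factors, and the cleaning steps below ultimately take the simpler route of dropping incomplete records):

# Chained-equations imputation; mice picks a default method per column type
# (pmm for numeric columns, logreg/polyreg for factor variables)
imp <- mice(subset(tr, select = -Loan_ID), m = 5, seed = 42, printFlag = FALSE)
tr_imputed <- complete(imp, 1)  # extract the first completed dataset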

train <- read.csv('train.csv', header = TRUE, stringsAsFactors=TRUE)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
loan_clean <- train %>% 
  select(-Loan_ID) %>%                                  # drop the identifier
  mutate(Credit_History = as.factor(Credit_History))    # treat 0/1 as a factor

colSums(is.na(loan_clean))
##            Gender           Married        Dependents         Education 
##                 0                 0                 0                 0 
##     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
##                 0                 0                 0                22 
##  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
##                14                50                 0                 0

It’s worth noting that ‘mice’ stands for Multiple Imputation by Chained Equations; after running mice(), its complete() function merges the imputed values back into a usable dataset. For simplicity, however, this analysis takes the more conservative route of dropping the records that still contain missing values, and then verifying that none remain:

loan_clean <- loan_clean %>% 
  filter(complete.cases(.))
colSums(is.na(loan_clean))
##            Gender           Married        Dependents         Education 
##                 0                 0                 0                 0 
##     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
##                 0                 0                 0                 0 
##  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
##                 0                 0                 0                 0

Now, let’s address the extreme values in our dataset. When examining the LoanAmount variable, it’s plausible that some customers may apply for larger loan amounts due to various reasons. To mitigate the impact of these extreme values and normalize the data, we can apply a log transformation:

tr <- loan_clean 
tr$LogLoanAmount <- log(tr$LoanAmount)
par(mfrow=c(1,2))
hist(tr$LogLoanAmount, 
     main="Histogram for Log Loan Amount", 
     xlab="Log Loan Amount", 
     border="blue", 
     col="maroon",
     las=1, 
     breaks=20, prob = TRUE)
lines(density(tr$LogLoanAmount), col='black', lwd=3)
boxplot(tr$LogLoanAmount, col='maroon', xlab = 'Log Loan Amount', main = 'Box Plot for Loan Amount')

The distribution now looks much closer to normal, and the effect of the extreme values has subsided considerably.
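
A normal Q-Q plot offers a quick visual check of that claim (a sketch):

# Points lying close to the reference line indicate approximate normality
qqnorm(tr$LogLoanAmount, main = "Q-Q Plot: Log Loan Amount")
qqline(tr$LogLoanAmount, col = "maroon", lwd = 2)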

Turning to income, a sensible approach is to combine ApplicantIncome and CoapplicantIncome into a single total-income variable and then apply the same log transformation to the combined figure. As the plots below show, the transformed distribution is likewise much closer to normal.

tr$Income <- tr$ApplicantIncome + tr$CoapplicantIncome
tr$ApplicantIncome <- NULL
tr$CoapplicantIncome <- NULL

tr$LogIncome <- log(tr$Income)
par(mfrow=c(1,2))
hist(tr$LogIncome, 
     main="Histogram for Log Total Income", 
     xlab="Log Income", 
     border="blue", 
     col="maroon",
     las=1, 
     breaks=50, prob = TRUE)
lines(density(tr$LogIncome), col='black', lwd=3)
boxplot(tr$LogIncome, col='maroon', xlab = 'Log Income', main = 'Box Plot for Total Income')

Models

Now it’s time to take the next crucial step in our analysis, which is splitting the data into training and test sets.

A training set is the subset of the data we use to fit our models, while a test set is a randomly held-out subset that the models never see during fitting. The test set stands in for unforeseen data, providing a measure of how well our models generalize to new, unseen cases.

In a relatively small dataset like ours, overfitting is a significant concern. Overfitting occurs when a model learns the training set so well that it struggles with cases it has never seen before. To address this issue, we score the models using the test set, which serves as a proxy for real-world, unseen data and helps us estimate the model’s performance in practice. Once we split the data, we treat the test set as if it no longer exists, ensuring that we don’t unintentionally use it during model training.

Now, let’s split the data into training and test sets:

# Hold out 30% of the rows as a test set
set.seed(42)
sample <- sample.int(n = nrow(tr), size = floor(0.70 * nrow(tr)), replace = FALSE)
trainnew <- tr[sample, ]
testnew  <- tr[-sample, ]
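
One point worth noting: sample.int() draws a simple random split, so the Y/N mix can drift between the two subsets. If preserving the class balance matters, a stratified split is an alternative (a sketch, assuming the caret package is installed):

# Stratified 70/30 split that keeps the Loan_Status proportions in both subsets
library(caret)
set.seed(42)
idx <- createDataPartition(tr$Loan_Status, p = 0.70, list = FALSE)
trainnew_strat <- tr[idx, ]
testnew_strat  <- tr[-idx, ]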

Logistic Regression

As we begin our first logistic regression model, it is crucial to avoid overfitting the data by carefully selecting the variables to include. To determine the most important variables, we can consider their logical relevance in predicting loan approval. The likelihood of an applicant’s loan being approved could be higher if:

  1. The applicant has a credit history that meets the guidelines, which is represented by the Credit_History variable.

  2. The applicant has a higher income, which could be represented by the combined income variable we created earlier.

  3. The applicant has a higher level of education, as indicated by the Education variable.

  4. The applicant has stable employment, for which the Self_Employed variable serves as a rough proxy.

For our initial logistic regression model, we will focus on the Credit_History variable, as it appears to be a key factor in determining loan approval.

logistic1 <- glm(Loan_Status ~ Credit_History, data = trainnew, family = binomial)
summary(logistic1)
## 
## Call:
## glm(formula = Loan_Status ~ Credit_History, family = binomial, 
##     data = trainnew)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7871  -0.4408   0.6728   0.6728   2.1815  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.2824     0.4694  -4.862 1.16e-06 ***
## Credit_History1   3.6529     0.4898   7.458 8.81e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 455.38  on 369  degrees of freedom
## Residual deviance: 351.78  on 368  degrees of freedom
## AIC: 355.78
## 
## Number of Fisher Scoring iterations: 4
my_prediction_tr1 <- predict(logistic1, newdata = trainnew, type = "response")
table(trainnew$Loan_Status, my_prediction_tr1 > 0.5)
##    
##     FALSE TRUE
##   N    49   64
##   Y     5  252
logistic_test1 <- glm (Loan_Status ~ Credit_History,data = testnew, family = binomial)
summary(logistic_test1)
## 
## Call:
## glm(formula = Loan_Status ~ Credit_History, family = binomial, 
##     data = testnew)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7900  -0.4084   0.6708   0.6708   2.2475  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.4423     0.7372  -3.313 0.000923 ***
## Credit_History1   3.8193     0.7680   4.973 6.59e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 198.0  on 158  degrees of freedom
## Residual deviance: 148.6  on 157  degrees of freedom
## AIC: 152.6
## 
## Number of Fisher Scoring iterations: 5
my_prediction_te1 <- predict(logistic_test1, newdata = testnew, type = "response")
table(testnew$Loan_Status, my_prediction_te1 > 0.5)
##    
##     FALSE TRUE
##   N    23   27
##   Y     2  107

Upon examining the logistic regression output, we first observe the call, which serves as a reminder of the model we have run. Following this, the deviance residuals are displayed, which measure the model’s fit. This section showcases the distribution of deviance residuals for individual cases used in the model.

Subsequently, the output presents the coefficients, their standard errors, the z-statistic, and the associated p-values. Each term’s p-value tests the null hypothesis that the corresponding coefficient is equal to zero, indicating no effect. A low p-value (< 0.05) suggests that we can reject the null hypothesis, and the predictor is significant in the model. On the other hand, a larger p-value indicates that changes in the predictor are not associated with changes in the dependent variable, rendering it insignificant. In our case, the p-value for Credit_History is notably small, signifying its significance in the model.
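
Because the coefficients are reported on the log-odds scale, exponentiating them gives a more readable effect size (a sketch using the training fit above):

# Odds ratio for a qualifying credit history: exp(3.6529) is roughly 39,
# i.e., the odds of approval are about 39 times higher in this sample
exp(coef(logistic1)["Credit_History1"])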

To evaluate the model’s accuracy, we have generated a confusion table for both the training and test data; reading the correct classifications off the diagonals gives:

  • Train data: 81.35% accuracy (301 of 370 correct)

  • Test data: 81.76% accuracy (130 of 159 correct)
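
These figures follow directly from the confusion tables; a one-line helper makes the arithmetic explicit. One caveat: logistic_test1 above was refit on the test split, so its table reflects an in-sample fit of that split. The more conventional protocol scores the model trained on trainnew against testnew (a sketch):

# Accuracy = correct classifications / all classifications
acc <- function(tab) sum(diag(tab)) / sum(tab)
acc(table(trainnew$Loan_Status, my_prediction_tr1 > 0.5))

# Standard held-out evaluation: predict with the training fit on the test set
heldout_prob <- predict(logistic1, newdata = testnew, type = "response")
acc(table(testnew$Loan_Status, heldout_prob > 0.5))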

To further improve the model, we can incorporate additional variables and assess their impact on accuracy.

logistic2 <- glm(Loan_Status ~ Credit_History + Education + Self_Employed + Property_Area +
                   LogLoanAmount + LogIncome, data = trainnew, family = binomial)
summary(logistic2)
## 
## Call:
## glm(formula = Loan_Status ~ Credit_History + Education + Self_Employed + 
##     Property_Area + LogLoanAmount + LogIncome, family = binomial, 
##     data = trainnew)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2797  -0.3803   0.4930   0.7063   2.3992  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -3.1252     2.6579  -1.176 0.239663    
## Credit_History1          3.8503     0.5122   7.517 5.62e-14 ***
## EducationNot Graduate   -0.2141     0.3416  -0.627 0.530844    
## Self_EmployedNo         -0.5174     0.7152  -0.723 0.469423    
## Self_EmployedYes        -1.0221     0.7830  -1.305 0.191751    
## Property_AreaSemiurban   1.2127     0.3549   3.417 0.000633 ***
## Property_AreaUrban       0.4050     0.3337   1.214 0.224931    
## LogLoanAmount           -0.4011     0.4005  -1.002 0.316500    
## LogIncome                0.3089     0.4024   0.768 0.442726    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 455.38  on 369  degrees of freedom
## Residual deviance: 334.13  on 361  degrees of freedom
## AIC: 352.13
## 
## Number of Fisher Scoring iterations: 5
my_prediction_tr2 <- predict(logistic2, newdata = trainnew, type = "response")
table(trainnew$Loan_Status, my_prediction_tr2 > 0.5)
##    
##     FALSE TRUE
##   N    49   64
##   Y     5  252
logistic_test2 <- glm(Loan_Status ~ Credit_History + Education + Self_Employed + Property_Area +
                        LogLoanAmount + LogIncome, data = testnew, family = binomial)
summary(logistic_test2)
## 
## Call:
## glm(formula = Loan_Status ~ Credit_History + Education + Self_Employed + 
##     Property_Area + LogLoanAmount + LogIncome, family = binomial, 
##     data = testnew)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1433  -0.3639   0.5583   0.7322   2.3253  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -3.80182    3.72186  -1.021    0.307    
## Credit_History1         3.73115    0.78224   4.770 1.84e-06 ***
## EducationNot Graduate  -0.35954    0.53502  -0.672    0.502    
## Self_EmployedNo        -0.88816    1.09824  -0.809    0.419    
## Self_EmployedYes       -0.22080    1.31186  -0.168    0.866    
## Property_AreaSemiurban  0.64145    0.51558   1.244    0.213    
## Property_AreaUrban     -0.13494    0.52278  -0.258    0.796    
## LogLoanAmount           0.03382    0.56346   0.060    0.952    
## LogIncome               0.22246    0.48048   0.463    0.643    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 198.00  on 158  degrees of freedom
## Residual deviance: 143.29  on 150  degrees of freedom
## AIC: 161.29
## 
## Number of Fisher Scoring iterations: 5
my_prediction_te2 <- predict(logistic_test2, newdata = testnew, type = "response")
table(testnew$Loan_Status, my_prediction_te2 > 0.5)
##    
##     FALSE TRUE
##   N    23   27
##   Y     2  107
  • Train data: 81.35%

  • Test data: 81.76%

The confusion tables, and hence the accuracies at the 0.5 cutoff, are unchanged after adding the extra variables. The residual deviance does decrease on both fits, indicating a marginally better fit to the data, but the additional predictors do not change any individual classifications at this threshold. This is a useful reminder that adding plausible predictors does not automatically translate into better classification performance on unseen data.

Decision Tree

Decision trees are a powerful method for creating a series of binary splits on predictor variables, ultimately forming a tree structure that can be used to classify new observations into one of two groups. In this analysis, we will employ classical decision trees. The algorithm for this model consists of the following steps:

  • Identify the predictor variable that optimally separates the data into two distinct groups.

  • Divide the data into these two groups.

  • Continue this process until a subgroup contains fewer than a predetermined minimum number of observations.

  • To classify a new case, follow the tree structure down to a terminal node and assign the predicted outcome value based on the previous steps.

By employing this approach, we can create a robust classification model that accounts for various predictor variables to accurately determine group membership for new observations.

library(rpart)
# grow tree 
dtree <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
                 LogIncome,method="class", data=trainnew,parms=list(split="information"))
dtree$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.38938053      0 1.0000000 1.0000000 0.07840188
## 2 0.01474926      1 0.6106195 0.6106195 0.06630228
## 3 0.01000000      4 0.5663717 0.6991150 0.06975587
plotcp(dtree)

dtree.pruned <- prune(dtree, cp=.02290076)
library(rpart.plot)
prp(dtree.pruned, type = 2, extra = 104,
    fallen.leaves = TRUE, main="Decision Tree")

dtree.pred <- predict(dtree.pruned, trainnew, type="class")
dtree.perf <- table(trainnew$Loan_Status, dtree.pred,
                    dnn=c("Actual", "Predicted"))
dtree.perf
##       Predicted
## Actual   N   Y
##      N  49  64
##      Y   5 252

In R, decision trees can be grown and pruned using the rpart() function and prune() function from the rpart package. First, we grow the tree using the rpart() function, then print the tree and the summary to examine the fitted model. Oftentimes, the tree may be too large and need pruning. To determine the optimal tree size, we examine the cptable of the list returned by rpart(). This table contains data about the prediction error for different tree sizes. The complexity parameter (cp) is utilized to penalize larger trees. Tree size is defined by the number of branch splits (nsplit), with a tree having n splits resulting in n + 1 terminal nodes. The (rel error) column contains the error rate for a tree of a given size in the training sample, while the cross-validated error (xerror) is based on 10-fold cross-validation using the training sample. The (xstd) column contains the standard error of the cross-validation error.

Using the plotcp() function, we can visualize the cross-validated error plotted against the complexity parameter. To choose the final tree size, we aim to select the smallest tree whose cross-validated error is within one standard error of the minimum cross-validated error value. In our case, the minimum cross-validated error is 0.611, with a standard error of 0.066. Consequently, we seek the smallest tree with a cross-validated error within the range of 0.611 ± 0.066, or between roughly 0.545 and 0.677. According to the table, the tree with one split (cross-validated error = 0.611) meets this criterion.
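
The one-standard-error rule described above can also be applied programmatically (a sketch using the cptable just shown):

# Pick the smallest tree whose xerror is within one SE of the minimum
cp_tab <- dtree$cptable
best <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
cp_choice <- cp_tab[min(which(cp_tab[, "xerror"] <= threshold)), "CP"]
cp_choice  # here: the CP of the one-split tree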

The cptable shows that the one-split tree has a complexity parameter of 0.01474926; pruning with any cp value between that and the next-larger CP, for instance prune(dtree, cp=0.02290076) as used above, returns the one-split tree. Next, we plot the pruned tree for predicting loan status, reading the tree from the top and moving left if a condition is true and right otherwise. When an observation reaches a terminal node, it is classified. Each node displays the probability of the classes in that node, along with the percentage of the sample it contains.

Lastly, we generate a confusion table to assess the model’s accuracy, applying the same steps to the test data.

dtree_test <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
                 LogIncome,method="class", data=testnew,parms=list(split="information"))
dtree_test$cptable
##     CP nsplit rel error xerror       xstd
## 1 0.42      0      1.00   1.00 0.11709266
## 2 0.07      1      0.58   0.58 0.09738725
## 3 0.01      3      0.44   0.64 0.10111330
plotcp(dtree_test)

dtree_test.pruned <- prune(dtree_test, cp=.01639344)
prp(dtree_test.pruned, type = 2, extra = 104,
    fallen.leaves = TRUE, main="Decision Tree")

dtree_test.pred <- predict(dtree_test.pruned, testnew, type="class")
dtree_test.perf <- table(testnew$Loan_Status, dtree_test.pred,
                    dnn=c("Actual", "Predicted"))
dtree_test.perf
##       Predicted
## Actual   N   Y
##      N  30  20
##      Y   2 107

Accuracy (from the confusion tables above):

  • Train data: 81.35%

  • Test data: 86.16%

The results demonstrate an improved performance compared to the logistic regression model, indicating a more accurate prediction of loan approval status.
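
The same caveat raised for the logistic models applies here: dtree_test was grown on the test split itself. The conventional held-out check scores the tree fitted on trainnew against testnew (a sketch):

# Evaluate the pruned training tree on the untouched test split
dtree.heldout <- predict(dtree.pruned, newdata = testnew, type = "class")
table(testnew$Loan_Status, dtree.heldout, dnn = c("Actual", "Predicted"))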

Random Forest

A random forest is an ensemble learning approach to supervised learning. This approach develops multiple predictive models, and the results are aggregated to improve classification. The algorithm is as follows:

  • Grow many decision trees, each on a bootstrap sample drawn with replacement from the training data;

  • Sample m < M variables at each node as candidates for the split;

  • Grow each tree fully without pruning;

  • Assign each terminal node to a class based on the mode of the cases in that node;

  • Classify new cases by sending them down all the trees and taking a vote.

Random forests are grown with the randomForest() function from the randomForest package in R. By default it grows 500 trees, samples sqrt(M) variables at each node for classification, and uses a minimum terminal node size of 1.

library(randomForest) 
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(42) 
fit.forest <- randomForest(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
                             LogIncome, data=trainnew,
                           na.action=na.roughfix,
                           importance=TRUE)
fit.forest
## 
## Call:
##  randomForest(formula = Loan_Status ~ Credit_History + Education +      Self_Employed + Property_Area + LogLoanAmount + LogIncome,      data = trainnew, importance = TRUE, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.27%
## Confusion matrix:
##    N   Y class.error
## N 52  61  0.53982301
## Y 14 243  0.05447471
importance(fit.forest, type=2)
##                MeanDecreaseGini
## Credit_History         41.21536
## Education               2.94800
## Self_Employed           5.14608
## Property_Area           7.59871
## LogLoanAmount          28.58602
## LogIncome              31.06103
forest.pred <- predict(fit.forest, testnew)
forest.perf <- table(testnew$Loan_Status, forest.pred,
                     dnn=c("Actual", "Predicted"))
forest.perf
##       Predicted
## Actual   N   Y
##      N  25  25
##      Y   5 104

Here is the accuracy of the model:

  • Train data (out-of-bag estimate): 79.73%

  • Test data: 81.13%

Using the random forest function, we grew 500 traditional decision trees, each built on a bootstrap sample drawn with replacement from the 370 training observations. The random forest algorithm provides a natural measure of variable importance. The relative importance measure (requested with the type=2 option) represents the average decrease in node impurity when splitting on that variable, taken across all trees. In our case, the most important variable is Credit_History, while the least important is Education. After measuring the accuracy for the training sample and applying the prediction to the test sample, we observe that the accuracy for both is lower than the decision tree model’s accuracy.
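
The same importance scores can also be read off a dotchart (a quick sketch):

# Plot the mean decrease in Gini impurity for each predictor
varImpPlot(fit.forest, type = 2, main = "Variable Importance")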

We will now rerun the model, this time selecting only the top three variables in terms of importance, to see if there is any improvement in accuracy:

set.seed(42) 
fit.forest2 <- randomForest(Loan_Status ~ Credit_History+LogLoanAmount+
                             LogIncome, data=trainnew,importance=TRUE)
fit.forest2
## 
## Call:
##  randomForest(formula = Loan_Status ~ Credit_History + LogLoanAmount +      LogIncome, data = trainnew, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 19.19%
## Confusion matrix:
##    N   Y class.error
## N 49  64  0.56637168
## Y  7 250  0.02723735
forest.pred2 <- predict(fit.forest2, testnew)
forest.perf_test <- table(testnew$Loan_Status, forest.pred2,
                     dnn=c("Actual", "Predicted"))
forest.perf_test
##       Predicted
## Actual   N   Y
##      N  23  27
##      Y   2 107

Upon selecting the top three variables in terms of importance, we observe slight improvements on both samples: the out-of-bag accuracy on the training sample rises to 80.81%, and the test-sample accuracy to 81.76%. However, the decision tree model still outperforms this random forest model in terms of accuracy.

Despite this, random forests are generally accurate compared to other classification methods and can handle larger problems. Personally, I have more confidence in the results generated by random forests than by a single decision tree. One tree may be prone to overfitting, while a random forest counters it by training many trees on bootstrap samples, restricting each split to a random subset of the variables, and aggregating the trees’ votes into the final model.

Results, Analysis, and Discussion

Based on the data and analysis performed, we can draw several conclusions:

  • Credit_History appears to be the most important factor in determining loan approval. This is evident from both the decision tree and random forest models, where it ranked as the most significant variable.

  • The decision tree model performed better than the logistic model, with an accuracy of 81.35% on the train data and 86.16% on the held-out data. This indicates that the decision tree model is better at generalizing and predicting loan approval outcomes compared to the logistic model.

  • The random forest model, on the other hand, did not outperform the decision tree model in terms of accuracy. However, random forests tend to be very accurate compared to other classification methods and can handle large problems.

  • While the decision tree model showed better accuracy, it is important to consider the potential for overfitting. Random forests may be more reliable in this regard, as they train many trees on bootstrap samples with random subsets of the variables and aggregate the results.

Evidence against these conclusions could include:

  • The models used may not capture all the relevant factors that influence loan approval. There might be other factors that were not included in the dataset, such as the applicant’s job stability, which could affect the accuracy of the models.

  • The sample size may not be large enough to accurately represent the population of loan applicants.

Limitations of the analysis

  • The models used may not be the best fit for the data. There are various other machine learning models that could be explored to see if they can achieve better accuracy.

  • The analysis assumes that the dataset is representative of the population. If there is any bias in the data, this could lead to biased conclusions.

To make a better decision, additional data could be collected, such as:

  • The applicant’s job stability, which could be an important factor in determining loan approval.

  • The applicant’s debt-to-income ratio, which could provide a more accurate picture of their ability to repay the loan (a rough proxy from the existing columns is sketched below).

  • Economic factors, such as the interest rates and overall economic conditions, which could influence loan approval decisions.

By including more relevant variables and exploring other machine learning models, it may be possible to improve the accuracy of the predictions and make more informed decisions about loan approval.
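
Although the dataset contains no debt field, a rough installment-to-income proxy can already be derived from the existing columns. This is a purely illustrative sketch: the derived variable names are hypothetical, and the monthly-payment approximation ignores interest:

# Hypothetical proxy feature: requested monthly installment relative to income
emi <- (tr$LoanAmount * 1000) / tr$Loan_Amount_Term  # LoanAmount is in thousands
tr$Installment_to_Income <- emi / tr$Income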

Impact

Based on the analysis conducted in this project, the imputation methods employed have the potential to significantly impact the decision-making process within the loan application domain. If our conclusions were to be adopted in the real world, the following implications could arise:

  • Improved decision-making: By accurately imputing missing data, financial institutions could make more informed decisions when assessing loan applicants, potentially reducing the chances of granting loans to high-risk applicants and improving the overall loan approval process.

  • Reducing costs: Implementing robust imputation techniques could result in reduced costs associated with loan defaults, as financial institutions would be better equipped to identify and minimize lending risks.

  • Fairness in lending: With more complete and accurate data, financial institutions could minimize potential bias in their decision-making process. This would ensure that all applicants are treated fairly and that lending decisions are based on objective criteria.

  • Impacts on applicants: The adoption of these methods could have a positive impact on loan applicants, particularly those who may have been previously disadvantaged due to missing or incomplete data in their applications. These applicants may now have a better chance of securing loans, providing them with access to essential financial resources.

  • Shifting power dynamics: By providing more accurate and comprehensive data, our analysis could empower both financial institutions and loan applicants to make better-informed decisions, potentially shifting the power dynamic between lenders and borrowers.

However, it is important to consider the potential drawbacks associated with our methods:

  • Impact of model accuracy: If our imputation models are accurate, they could significantly improve the loan application process. However, if the models are not accurate, they could introduce additional errors and biases in the data, leading to incorrect decisions and potentially exacerbating existing disparities.

  • Overreliance on data: There is a risk that an overreliance on imputed data could lead to a disregard for other important factors in the decision-making process, such as qualitative information and individual circumstances.

In conclusion, the methods and conclusions presented in this project have the potential to create significant positive impacts in the loan application domain. However, it is crucial to ensure that the imputation techniques employed are accurate and robust, and that decision-makers do not become over-reliant on data imputation at the expense of other essential considerations.