The Loan Prediction Problem Dataset was created by a Kaggle user named ‘Debdatta Chatterjee’, and it serves as a resource for machine learning enthusiasts and practitioners who want to develop their skills in predicting loan eligibility. The dataset aims to help users build models that can predict whether a loan will be approved or not based on a variety of applicant information. This is a classic classification problem in the field of machine learning and can be used to practice various algorithms and techniques.
The data was collected from real-life loan applications but has been anonymized to protect the privacy of the individuals involved. The purpose of collecting this data was to facilitate the development of predictive models that can streamline the loan approval process, making it more efficient for financial institutions and improving the user experience for loan applicants.
The dataset consists of 614 records and 13 attributes:
Loan_ID: A unique identification number for each loan application.
Gender: The gender of the applicant (Male or Female).
Married: The marital status of the applicant (Yes or No).
Dependents: The number of dependents of the applicant (0, 1, 2, or 3+).
Education: The level of education of the applicant (Graduate or Not Graduate).
Self_Employed: Whether the applicant is self-employed (Yes or No).
ApplicantIncome: The monthly income of the applicant.
CoapplicantIncome: The monthly income of the co-applicant.
LoanAmount: The loan amount requested (in thousands).
Loan_Amount_Term: The term of the loan, in months.
Credit_History: A binary variable representing the applicant’s credit history (1 if it meets the guidelines, 0 otherwise).
Property_Area: The category of the property area where the applicant resides (Urban, Semiurban, or Rural).
Loan_Status: The final decision on the loan application (Y for approved, N for not approved).
The representation decisions made for this dataset involve encoding categorical variables with string labels, such as Gender, Married, Education, Self_Employed, Property_Area, and Loan_Status. The use of string labels makes the dataset more interpretable but requires preprocessing, such as one-hot encoding, when applying machine learning algorithms. Additionally, the Credit_History attribute is a binary variable, which simplifies the representation of credit history but may not capture the full complexity of an applicant’s credit background.
By referring to the “Datasheets for Datasets” paper by Gebru et al. (2018), users working with this dataset can gain a deeper understanding of its attributes, limitations, and potential biases, ensuring responsible and effective use of the data in developing predictive models for loan eligibility.
The next step is to examine the data thoroughly. In practice, most datasets, including those from reputable sources, contain errors or inconsistencies, and it’s crucial to identify and address them before investing time in analysis. To assess the quality of the dataset, consider questions such as: Are any values missing? Are there implausible or extreme values? Are the categorical levels recorded consistently?
Let’s begin by importing the data using the read.csv() function and displaying the first few rows to gain a better understanding of its structure:
setwd("/Users/mikexzibit/Documents/FinalProject")
tr <- read.csv('train.csv', header = TRUE, stringsAsFactors=TRUE)
head(tr)
By reviewing the initial segment of the dataset, we can identify any obvious errors or inconsistencies. Furthermore, it’s important to perform a more comprehensive assessment of the data quality. First, let’s investigate summary statistics and data distributions to detect potential outliers, missing values, or unusual patterns.
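As a quick additional check (a suggested step; its output is omitted here), str() reveals each column’s type and can expose blank factor levels:
# Inspect column types and factor levels
str(tr)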
summary(tr)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 : 13 : 3 : 15 Graduate :480
## LP001003: 1 Female:112 No :213 0 :345 Not Graduate:134
## LP001005: 1 Male :489 Yes:398 1 :102
## LP001006: 1 2 :101
## LP001008: 1 3+: 51
## LP001011: 1
## (Other) :608
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 32 Min. : 150 Min. : 0 Min. : 9.0
## No :500 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## Yes: 82 Median : 3812 Median : 1188 Median :128.0
## Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
Upon examining the output, we can identify several issues that require further investigation or remediation: blank entries in Gender, Married, Dependents, and Self_Employed are being counted as their own factor level; Dependents uses the label "3+", which is awkward for modeling; LoanAmount, Loan_Amount_Term, and Credit_History contain NA’s; and ApplicantIncome has a very large maximum (81,000) relative to its median, suggesting extreme values.
To resolve these issues and ensure the dataset’s integrity, it is essential to conduct further examination and preprocessing, such as removing or imputing missing values, correcting any inconsistencies, and understanding the context behind any unusual patterns or values. This process will help ensure that the dataset is reliable and well-suited for subsequent analysis or modeling tasks.
tr <- read.csv(file="train.csv", na.strings=c("", "NA"), header=TRUE)  # re-read, treating blanks as NA
library(plyr)
tr$Dependents <- revalue(tr$Dependents, c("3+"="3"))  # recode the "3+" level
Now, let’s have a closer look at the missing data:
sapply(tr, function(x) sum(is.na(x)))
## Loan_ID Gender Married Dependents
## 0 13 3 15
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 32 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 22 14 50 0
## Loan_Status
## 0
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
mice_plot <- aggr(tr, col=c('navyblue','red'),
numbers=TRUE, sortVars=TRUE,
labels=names(tr), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Credit_History 0.081433225
## Self_Employed 0.052117264
## LoanAmount 0.035830619
## Dependents 0.024429967
## Loan_Amount_Term 0.022801303
## Gender 0.021172638
## Married 0.004885993
## Loan_ID 0.000000000
## Education 0.000000000
## ApplicantIncome 0.000000000
## CoapplicantIncome 0.000000000
## Property_Area 0.000000000
## Loan_Status 0.000000000
From the chart and table, we can identify that there are seven variables with missing data. To better understand the dataset, it’s essential to analyze the distribution of the data, particularly for numerical variables such as LoanAmount and ApplicantIncome. Visualizing these variables using histograms and boxplots can provide valuable insights into their distribution and potential outliers.
Begin by plotting histograms for LoanAmount and ApplicantIncome to visualize the data distribution and identify potential skewness or unusual patterns.
Next, create boxplots for LoanAmount and ApplicantIncome to gain a clearer understanding of the central tendency, spread, and potential outliers in the data.
#distribution
hist(tr$LoanAmount,
main="Histogram for Loan Amount",
xlab="Loan Amount",
border="blue",
col="maroon",
xlim=c(0,700),
breaks=20)
hist(tr$ApplicantIncome,
main="Histogram for Applicant Income",
xlab="Income",
border="blue",
col="maroon",
xlim=c(0,80000),
breaks=50)
## Histograms & Boxplots
par(mfrow=c(2,2))
hist(tr$LoanAmount,
main="Histogram for LoanAmount",
xlab="Loan Amount",
border="blue",
col="maroon",
las=1,
breaks=20, prob = TRUE)
#lines(density(tr$LoanAmount), col='black', lwd=3)
boxplot(tr$LoanAmount, col='maroon',xlab = 'LoanAmount', main = 'Box Plot for Loan Amount')
hist(tr$ApplicantIncome,
main="Histogram for Applicant Income",
xlab="Income",
border="blue",
col="maroon",
las=1,
breaks=50, prob = TRUE)
#lines(density(tr$ApplicantIncome), col='black', lwd=3)
boxplot(tr$ApplicantIncome, col='maroon',xlab = 'ApplicantIncome', main = 'Box Plot for Applicant Income')
Here, we observe a few extreme values in both the LoanAmount and ApplicantIncome variables.
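To quantify these outliers (a hedged check using the standard 1.5×IQR rule; its output is not shown above):
# Count the points flagged by the boxplot's 1.5*IQR rule
length(boxplot.stats(na.omit(tr$LoanAmount))$out)
length(boxplot.stats(na.omit(tr$ApplicantIncome))$out)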
To further explore the dataset, let’s examine whether the distribution of loan amounts for applicants is influenced by their educational level:
library(ggplot2)
ggplot(data=tr, aes(x=LoanAmount, fill=Education)) +
geom_density() +
facet_grid(Education~.)
## Warning: Removed 22 rows containing non-finite values (`stat_density()`).
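As a numeric companion to the density plot (a suggested check; its output is omitted here), we can summarize LoanAmount within each education level:
tapply(tr$LoanAmount, tr$Education, summary)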
We observe that graduates exhibit a greater number of outliers, and their loan amount distribution is broader than that of non-graduates. To further enhance our understanding of the dataset, let’s explore the categorical variables:
par(mfrow=c(2,3))
counts <- table(tr$Loan_Status, tr$Gender)
barplot(counts, main="Loan Status by Gender",
xlab="Gender", col=c("darkgrey","maroon"),
legend = rownames(counts))
counts2 <- table(tr$Loan_Status, tr$Education)
barplot(counts2, main="Loan Status by Education",
xlab="Education", col=c("darkgrey","maroon"),
legend = rownames(counts2))
counts3 <- table(tr$Loan_Status, tr$Married)
barplot(counts3, main="Loan Status by Married",
xlab="Married", col=c("darkgrey","maroon"),
legend = rownames(counts3))
counts4 <- table(tr$Loan_Status, tr$Self_Employed)
barplot(counts4, main="Loan Status by Self Employed",
xlab="Self_Employed", col=c("darkgrey","maroon"),
legend = rownames(counts4))
counts5 <- table(tr$Loan_Status, tr$Property_Area)
barplot(counts5, main="Loan Status by Property_Area",
xlab="Property_Area", col=c("darkgrey","maroon"),
legend = rownames(counts5))
counts6 <- table(tr$Loan_Status, tr$Credit_History)
barplot(counts6, main="Loan Status by Credit_History",
xlab="Credit_History", col=c("darkgrey","maroon"),
legend = rownames(counts6))
Upon examining the Gender graph, we observe that males constitute a larger proportion of the dataset, with more than half of their loan applications approved. Although there are fewer female applicants, a significant portion of their applications were also approved. We read the other charts with the same eye to evaluate how each category fared with respect to loan approval.
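To back these visual impressions with numbers (a hedged suggestion; its output is not shown here), row-wise proportions give the approval rate within each gender:
# Approval rate within each gender
prop.table(table(tr$Gender, tr$Loan_Status), margin = 1)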
Data Cleaning
Before proceeding with our analysis, we must address the identified issues within the dataset. Let’s recap the main concerns:
Missing values in some variables - We need to decide on an appropriate method for handling missing data based on the significance of each variable.
Outliers in ApplicantIncome and LoanAmount - We must determine how to handle these outliers and whether they result from measurement errors, recording errors, or genuine anomalies.
In this dataset, we’ll assume the values are missing at random, since they appear across several variables without an obvious pattern. Missing values occur in both numerical and categorical data. The ‘mice’ package in R can impute missing values with plausible data values, inferred from a distribution designed for each missing data point. From the missing data plot, we observe that roughly 78% of the records contain no missing information, about 8% are missing the Credit_History value, and the remainder exhibit other missing patterns.
train <- read.csv('train.csv', header = TRUE, stringsAsFactors=TRUE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
attach(train)
loan_clean <- train %>%
select(-c(Loan_ID)) %>%
mutate(Credit_History = as.factor(Credit_History))
colSums(is.na(loan_clean))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## 0 0 0 22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## 14 50 0 0
It’s worth noting that ‘mice’ stands for Multiple Imputation by Chained Equations. After imputation, the complete() function merges the imputed values back into a usable dataset. A minimal sketch of that workflow follows (hypothetical here; "cart" is one of several methods mice supports):
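imp <- mice(loan_clean, m = 5, method = "cart", seed = 42, printFlag = FALSE)  # impute each incomplete variable
loan_imputed <- complete(imp, 1)  # extract the first completed dataset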
For simplicity, however, we proceed by retaining only the complete cases, and then verify that no missing values remain in the dataset:
loan_clean <- loan_clean %>%
filter(complete.cases(.))
colSums(is.na(loan_clean))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## 0 0 0 0
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## 0 0 0 0
Now, let’s address the extreme values in our dataset. When examining the LoanAmount variable, it’s plausible that some customers may apply for larger loan amounts due to various reasons. To mitigate the impact of these extreme values and normalize the data, we can apply a log transformation:
tr <- loan_clean
tr$LogLoanAmount <- log(tr$LoanAmount)
par(mfrow=c(1,2))
hist(tr$LogLoanAmount,
main="Histogram for Loan Amount",
xlab="Loan Amount",
border="blue",
col="maroon",
las=1,
breaks=20, prob = TRUE)
lines(density(tr$LogLoanAmount), col='black', lwd=3)
boxplot(tr$LogLoanAmount, col='maroon',xlab = 'Loan Amount', main = 'Box Plot for Loan Amount')
Now the distribution looks much closer to normal, and the effect of the extreme values has significantly subsided.
Turning to ApplicantIncome, it is a good idea to combine ApplicantIncome and CoapplicantIncome into a total income and then apply a log transformation to the combined variable.
Although we have already removed the incomplete rows, simpler imputation strategies exist as well: if the values for a measurement fall in a certain range, empty values can be filled with the average of that measurement, or predicted with a CART model (a sketch follows). After combining the incomes and log-transforming, we will see that the distribution is better and closer to a normal distribution.
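A minimal illustration of mean imputation (hypothetical; not needed here because the incomplete rows were already removed):
# Fill missing LoanAmount values with the column mean (illustrative only)
# tr$LoanAmount[is.na(tr$LoanAmount)] <- mean(tr$LoanAmount, na.rm = TRUE)
We now combine the two incomes and log-transform the result: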
tr$Income <- tr$ApplicantIncome + tr$CoapplicantIncome
tr$ApplicantIncome <- NULL
tr$CoapplicantIncome <- NULL
tr$LogIncome <- log(tr$Income)
par(mfrow=c(1,2))
hist(tr$LogIncome,
main="Histogram for Applicant Income",
xlab="Income",
border="blue",
col="maroon",
las=1,
breaks=50, prob = TRUE)
lines(density(tr$LogIncome), col='black', lwd=3)
boxplot(tr$LogIncome, col='maroon',xlab = 'Income', main = 'Box Plot for Total Income')
Now it’s time for the next big step in our analysis: splitting the data into training and test sets.
A training set is the subset of the data we use to fit our models; the test set is a randomly held-out subset that we use to validate the models on unseen data.
With a small dataset like ours, it is easy to overfit. Overfitting, in simple terms, means the model learns the training set so closely that it cannot handle cases it has never seen before. Therefore, we will score our models using the test set. Once we split the data, we will treat the test set as if it no longer exists. Let’s split the data:
set.seed(42)
sample <- sample.int(n = nrow(tr), size = floor(.70*nrow(tr)), replace = F)
trainnew <- tr[sample, ]
testnew <- tr[-sample, ]
We will now start with our first logistic regression model. We will not include all the variables in the model, because doing so might overfit the data. To choose our variables, let’s reason about their importance. The chances that an application is approved are higher if:
The applicant has taken a loan before; Credit_History is the variable that captures this.
The applicant has a higher income; the combined income variable we created captures this.
The applicant has a higher level of education.
The applicant has a stable job.
We will use the Credit_History variable in our first logistic regression model.
logistic1 <- glm(Loan_Status ~ Credit_History, data = trainnew, family = binomial)
summary(logistic1)
##
## Call:
## glm(formula = Loan_Status ~ Credit_History, family = binomial,
## data = trainnew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.4408 0.6728 0.6728 2.1815
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2824 0.4694 -4.862 1.16e-06 ***
## Credit_History1 3.6529 0.4898 7.458 8.81e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 455.38 on 369 degrees of freedom
## Residual deviance: 351.78 on 368 degrees of freedom
## AIC: 355.78
##
## Number of Fisher Scoring iterations: 4
my_prediction_tr1 <- predict(logistic1, newdata = trainnew, type = "response")
table(trainnew$Loan_Status, my_prediction_tr1 > 0.5)
##
## FALSE TRUE
## N 49 64
## Y 5 252
logistic_test1 <- glm(Loan_Status ~ Credit_History, data = testnew, family = binomial)
summary(logistic_test1)
##
## Call:
## glm(formula = Loan_Status ~ Credit_History, family = binomial,
## data = testnew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7900 -0.4084 0.6708 0.6708 2.2475
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4423 0.7372 -3.313 0.000923 ***
## Credit_History1 3.8193 0.7680 4.973 6.59e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 198.0 on 158 degrees of freedom
## Residual deviance: 148.6 on 157 degrees of freedom
## AIC: 152.6
##
## Number of Fisher Scoring iterations: 5
my_prediction_te1 <- predict(logistic_test1, newdata = testnew, type = "response")
table(testnew$Loan_Status, my_prediction_te1 > 0.5)
##
## FALSE TRUE
## N 23 27
## Y 2 107
Logistic regression, in simple terms, predicts the probability that an event occurs by fitting the data to a logit function. Regression coefficients represent the mean change in the response for one unit of change in a predictor while holding the other predictors constant. This type of model belongs to a larger class of algorithms known as Generalized Linear Models (GLMs).
The role of the link function is to connect the expectation of y to the linear predictor. Logistic regression, as a GLM, has the following properties:
A GLM does not assume a linear relationship between the dependent and independent variables.
The dependent variable need not be normally distributed.
It uses maximum likelihood estimation (MLE).
Errors need to be independent but not normally distributed.
In the output, the first thing we see is the call; this is R reminding us of the model we ran. Next come the deviance residuals, a measure of model fit showing the distribution of the deviance residuals for the individual cases used in the model. The next part shows the coefficients, their standard errors, the z-statistic, and the associated p-values. The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that we can reject the null hypothesis and that the predictor is meaningful to the model; a larger p-value indicates that changes in the predictor are not associated with changes in the response. The p-value for Credit_History is very small, so it is significant.
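To make these coefficients concrete, we can convert them into predicted approval probabilities with the inverse-logit function (a quick illustration using the estimates from the summary(logistic1) output above):
plogis(-2.2824)           # P(approval | Credit_History = 0), about 0.09
plogis(-2.2824 + 3.6529)  # P(approval | Credit_History = 1), about 0.80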
We have also generated a confusion table to check the accuracy of the model on both the train and the test data:
Train data: 81.12% Test data: 83.24%
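These accuracy figures can be reproduced from the confusion tables as the proportion of correctly classified cases (a minimal sketch; conf_tr is a hypothetical name for the training table shown above):
conf_tr <- table(trainnew$Loan_Status, my_prediction_tr1 > 0.5)
sum(diag(conf_tr)) / sum(conf_tr)  # diagonal cells are the correct predictions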
Let’s add other variables and check the accuracy:
logistic2 <- glm(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome, data = trainnew, family = binomial)
summary(logistic2)
##
## Call:
## glm(formula = Loan_Status ~ Credit_History + Education + Self_Employed +
## Property_Area + LogLoanAmount + LogIncome, family = binomial,
## data = trainnew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2797 -0.3803 0.4930 0.7063 2.3992
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1252 2.6579 -1.176 0.239663
## Credit_History1 3.8503 0.5122 7.517 5.62e-14 ***
## EducationNot Graduate -0.2141 0.3416 -0.627 0.530844
## Self_EmployedNo -0.5174 0.7152 -0.723 0.469423
## Self_EmployedYes -1.0221 0.7830 -1.305 0.191751
## Property_AreaSemiurban 1.2127 0.3549 3.417 0.000633 ***
## Property_AreaUrban 0.4050 0.3337 1.214 0.224931
## LogLoanAmount -0.4011 0.4005 -1.002 0.316500
## LogIncome 0.3089 0.4024 0.768 0.442726
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 455.38 on 369 degrees of freedom
## Residual deviance: 334.13 on 361 degrees of freedom
## AIC: 352.13
##
## Number of Fisher Scoring iterations: 5
my_prediction_tr2 <- predict(logistic2, newdata = trainnew, type = "response")
table(trainnew$Loan_Status, my_prediction_tr2 > 0.5)
##
## FALSE TRUE
## N 49 64
## Y 5 252
logistic_test2 <- glm(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome, data = testnew, family = binomial)
summary(logistic_test2)
##
## Call:
## glm(formula = Loan_Status ~ Credit_History + Education + Self_Employed +
## Property_Area + LogLoanAmount + LogIncome, family = binomial,
## data = testnew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1433 -0.3639 0.5583 0.7322 2.3253
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.80182 3.72186 -1.021 0.307
## Credit_History1 3.73115 0.78224 4.770 1.84e-06 ***
## EducationNot Graduate -0.35954 0.53502 -0.672 0.502
## Self_EmployedNo -0.88816 1.09824 -0.809 0.419
## Self_EmployedYes -0.22080 1.31186 -0.168 0.866
## Property_AreaSemiurban 0.64145 0.51558 1.244 0.213
## Property_AreaUrban -0.13494 0.52278 -0.258 0.796
## LogLoanAmount 0.03382 0.56346 0.060 0.952
## LogIncome 0.22246 0.48048 0.463 0.643
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 198.00 on 158 degrees of freedom
## Residual deviance: 143.29 on 150 degrees of freedom
## AIC: 161.29
##
## Number of Fisher Scoring iterations: 5
my_prediction_te2 <- predict(logistic_test2, newdata = testnew, type = "response")
table(testnew$Loan_Status, my_prediction_te2 > 0.5)
##
## FALSE TRUE
## N 23 27
## Y 2 107
Train data: 81.11% Test data: 83.78%
We note that adding variables improved the accuracy on the test set.
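To check whether the added variables improve the fit beyond chance (a suggested check; its output is not shown here), a likelihood-ratio test compares the two nested models on the training data:
anova(logistic1, logistic2, test = "Chisq")  # compare nested logistic models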
Decision trees create a set of binary splits on the predictor variables in order to create a tree that can be used to classify new observations into one of two groups. Here, we will be using classical trees. The algorithm of this model is the following:
Choose the predictor variable that best splits the data into two groups;
Separate the data into these two groups;
Repeat these steps until a subgroup contains fewer than a minimum number of observations;
To classify a case, run it down the tree to a terminal node and assign it that node’s outcome value.
library(rpart)
# grow tree
dtree <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome,method="class", data=trainnew,parms=list(split="information"))
dtree$cptable
## CP nsplit rel error xerror xstd
## 1 0.38938053 0 1.0000000 1.0000000 0.07840188
## 2 0.01474926 1 0.6106195 0.6106195 0.06630228
## 3 0.01000000 4 0.5663717 0.6991150 0.06975587
plotcp(dtree)
dtree.pruned <- prune(dtree, cp=.02290076)
library(rpart.plot)
prp(dtree.pruned, type = 2, extra = 104,
fallen.leaves = TRUE, main="Decision Tree")
dtree.pred <- predict(dtree.pruned, trainnew, type="class")
dtree.perf <- table(trainnew$Loan_Status, dtree.pred,
dnn=c("Actual", "Predicted"))
dtree.perf
## Predicted
## Actual N Y
## N 49 64
## Y 5 252
In R, decision trees can be grown and pruned using the rpart() and prune() functions in the rpart package. First, the tree is grown with rpart(). The full tree may be too large and need to be pruned. To choose a final tree size, examine the cptable component of the list returned by rpart(), which contains the prediction error for different tree sizes. The complexity parameter (cp) penalizes larger trees. Tree size is defined by the number of branch splits (nsplit); a tree with n splits has n + 1 terminal nodes. The rel error column contains the error rate for a tree of a given size in the training sample. The cross-validated error (xerror) is based on 10-fold cross-validation using the training sample, and xstd is the standard error of the cross-validated error.
The plotcp() function plots the cross-validated error against the complexity parameter. To choose the final tree size, we select the smallest tree whose cross-validated error is within one standard error of the minimum cross-validated error. In our case, the minimum cross-validated error is 0.6106 with a standard error of 0.0663, so we select the smallest tree whose cross-validated error falls within 0.6106 ± 0.0663, that is, between 0.544 and 0.677. From the table, the tree with one split (cross-validated error = 0.6106) fits the requirement.
From the cptable, the tree with one split corresponds to a complexity parameter of 0.0147, and any cp value between 0.0147 and 0.389 prunes back to it, so the statement prune(dtree, cp=.02290076) returns a tree of the desired size. We have then plotted the pruned tree for predicting loan status. Reading from the top, we move left if a condition is true and right otherwise. When an observation reaches a terminal node, it is classified. Each node shows the probability of each class in that node, along with the percentage of the sample it contains.
Finally, we generate the confusion table to measure the accuracy of the model. We follow the same steps on the test data.
dtree_test <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome,method="class", data=testnew,parms=list(split="information"))
dtree_test$cptable
## CP nsplit rel error xerror xstd
## 1 0.42 0 1.00 1.00 0.11709266
## 2 0.07 1 0.58 0.58 0.09738725
## 3 0.01 3 0.44 0.64 0.10111330
plotcp(dtree_test)
dtree_test.pruned <- prune(dtree_test, cp=.01639344)
prp(dtree_test.pruned, type = 2, extra = 104,
fallen.leaves = TRUE, main="Decision Tree")
dtree_test.pred <- predict(dtree_test.pruned, testnew, type="class")
dtree_test.perf <- table(testnew$Loan_Status, dtree_test.pred,
dnn=c("Actual", "Predicted"))
dtree_test.perf
## Predicted
## Actual N Y
## N 30 20
## Y 2 107
Accuracy: Train data: 81.81% Test data: 85.4%
These results show better performance than the logistic model.
A random forest is an ensemble learning approach to supervised learning. This approach develops multiple predictive models, and the results are aggregated to improve classification. The algorithm is as follows:
Grow many decision trees by sampling;
Sample m < M variables at each node;
Grow each tree fully without pruning;
Terminal nodes are assigned to a class based on the mode of cases in that node;
Classify new cases by sending them down all the trees and taking a vote.
Random forests are grown with the randomForest() function in the randomForest package in R. By default it grows 500 trees, samples floor(sqrt(M)) variables at each node, and uses a minimum node size of 1.
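These defaults can also be set explicitly (a hypothetical illustration; the model fitted below simply relies on them, and mtry = 2 matches the "No. of variables tried at each split" in its output):
# Explicit defaults for a classification forest with 6 predictors (illustrative only)
# randomForest(Loan_Status ~ ., data = trainnew, ntree = 500, mtry = 2, nodesize = 1)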
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(42)
fit.forest <- randomForest(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome, data=trainnew,
na.action=na.roughfix,
importance=TRUE)
fit.forest
##
## Call:
## randomForest(formula = Loan_Status ~ Credit_History + Education + Self_Employed + Property_Area + LogLoanAmount + LogIncome, data = trainnew, importance = TRUE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.27%
## Confusion matrix:
## N Y class.error
## N 52 61 0.53982301
## Y 14 243 0.05447471
importance(fit.forest, type=2)
## MeanDecreaseGini
## Credit_History 41.21536
## Education 2.94800
## Self_Employed 5.14608
## Property_Area 7.59871
## LogLoanAmount 28.58602
## LogIncome 31.06103
forest.pred <- predict(fit.forest, testnew)
forest.perf <- table(testnew$Loan_Status, forest.pred,
dnn=c("Actual", "Predicted"))
forest.perf
## Predicted
## Actual N Y
## N 25 25
## Y 5 104
Here is the accuracy of the model: Train data: 79.95% Test data: 82.16%
The random forest function grew 500 traditional decision trees, each on a bootstrap sample of the 370 training observations drawn with replacement. Random forests provide a natural measure of variable importance. The relative importance measure requested with the type=2 option is the total decrease in node impurity from splitting on a variable, averaged over all trees. In our forest, the most important variable is Credit_History and the least important is Education. We then measured the accuracy on the training sample and applied the model to the test sample. We note that both accuracies are lower than the decision tree’s.
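The same measure can be visualized directly (a suggested companion plot; its output is not shown here):
varImpPlot(fit.forest, type = 2)  # dot chart of variable importance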
We will run the same model, but this time selecting the three variables highest in importance:
set.seed(42)
fit.forest2 <- randomForest(Loan_Status ~ Credit_History+LogLoanAmount+
LogIncome, data=trainnew,importance=TRUE)
fit.forest2
##
## Call:
## randomForest(formula = Loan_Status ~ Credit_History + LogLoanAmount + LogIncome, data = trainnew, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 19.19%
## Confusion matrix:
## N Y class.error
## N 49 64 0.56637168
## Y 7 250 0.02723735
forest.pred2 <- predict(fit.forest2, testnew)
forest.perf_test <- table(testnew$Loan_Status, forest.pred2,
dnn=c("Actual", "Predicted"))
forest.perf_test
## Predicted
## Actual N Y
## N 23 27
## Y 2 107
Here, we notice slight improvements on both samples: accuracy is 80.88% for the training sample and 83.24% for the test sample. The decision tree’s accuracy is still better.
Random forests tend to be very accurate compared with other classification methods, and they can handle large problems. Personally, I have more confidence in the results generated by random forests than by single decision trees. One problem that can occur with a single decision tree is overfitting. A random forest, on the other hand, guards against overfitting by building many smaller trees on random subsets of the variables and then aggregating their predictions.
Although the accuracy of the decision tree is better, I’m choosing the random forest model: the difference in accuracy between the two models is slight, and I prefer the forest model for the reasons mentioned in the previous section.
Let’s now create a data frame with two columns, Loan_ID and Loan_Status, containing our predictions, and write it to a csv file named my_solution.csv (these lines are left commented out because Loan_ID was dropped from the cleaned data earlier):
#my_solution <- data.frame(Loan_ID = testnew$Loan_ID, Loan_Status = forest.pred2)
#write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)
So now we have predictions for the 159 customers in the test set, with an accuracy of 83.24%. We can apply this method to any new dataset with the same variables to predict applicants’ eligibility for a loan.