The Loan Prediction Problem Dataset was created by a Kaggle user named ‘Debdatta Chatterjee’, and it serves as a resource for machine learning enthusiasts and practitioners who want to develop their skills in predicting loan eligibility. The dataset aims to help users build models that can predict whether a loan will be approved or not based on a variety of applicant information. This is a classic classification problem in the field of machine learning and can be used to practice various algorithms and techniques.
The data was collected from real-life loan applications but has been anonymized to protect the privacy of the individuals involved. The purpose of collecting this data was to facilitate the development of predictive models that can streamline the loan approval process, making it more efficient for financial institutions and improving the user experience for loan applicants.
The dataset consists of 614 records and 13 attributes:
Loan_ID: A unique identification number for each loan application.
Gender: The gender of the applicant (Male or Female).
Married: The marital status of the applicant (Yes or No).
Dependents: The number of dependents of the applicant (0, 1, 2, or 3+).
Education: The level of education of the applicant (Graduate or Not Graduate).
Self_Employed: Whether the applicant is self-employed (Yes or No).
ApplicantIncome: The monthly income of the applicant.
CoapplicantIncome: The monthly income of the co-applicant.
LoanAmount: The loan amount requested (in thousands).
Loan_Amount_Term: The term of the loan, in months.
Credit_History: A binary variable representing the applicant’s credit history (1 if it meets the guidelines, 0 otherwise).
Property_Area: The category of the property area where the applicant resides (Urban, Semiurban, or Rural).
Loan_Status: The final decision on the loan application (Y for approved, N for not approved).
The representation decisions made for this dataset involve encoding categorical variables with string labels, such as Gender, Married, Education, Self_Employed, Property_Area, and Loan_Status. The use of string labels makes the dataset more interpretable but requires preprocessing, such as one-hot encoding, when applying machine learning algorithms. Additionally, the Credit_History attribute is a binary variable, which simplifies the representation of credit history but may not capture the full complexity of an applicant’s credit background.
By referring to the “Datasheets for Datasets” paper by Gebru et al. (2018), users working with this dataset can gain a deeper understanding of its attributes, limitations, and potential biases, ensuring responsible and effective use of the data in developing predictive models for loan eligibility.
The next step is to examine the data thoroughly. In practice, most datasets, including those from reputable sources, contain errors or inconsistencies, and it’s crucial to identify and address them before investing time in analysis. To assess the quality of the dataset, consider questions such as: Are any values missing? Are there implausible or extreme values? Are the categorical levels recorded consistently?
Let’s begin by importing the data using the read.csv() function and displaying the first few rows to gain a better understanding of its structure:
setwd("/Users/mikexzibit/Documents/FinalProject")
tr <- read.csv('train.csv', header = TRUE, stringsAsFactors=TRUE)
head(tr)
By reviewing the initial segment of the dataset, we can identify any obvious errors or inconsistencies. Furthermore, it’s important to perform a more comprehensive assessment of the data quality. First, let’s investigate summary statistics and data distributions to detect potential outliers, missing values, or unusual patterns.
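As a quick additional check (a suggested step; its output is omitted here), str() reveals each column’s type and can expose blank factor levels:
# Inspect column types and factor levels
str(tr)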
summary(tr)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 : 13 : 3 : 15 Graduate :480
## LP001003: 1 Female:112 No :213 0 :345 Not Graduate:134
## LP001005: 1 Male :489 Yes:398 1 :102
## LP001006: 1 2 :101
## LP001008: 1 3+: 51
## LP001011: 1
## (Other) :608
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 32 Min. : 150 Min. : 0 Min. : 9.0
## No :500 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## Yes: 82 Median : 3812 Median : 1188 Median :128.0
## Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
Upon examining the output, we can identify several issues that require further investigation or remediation: blank entries in Gender, Married, Dependents, and Self_Employed are being counted as their own factor level; Dependents uses the label "3+", which is awkward for modeling; LoanAmount, Loan_Amount_Term, and Credit_History contain NA’s; and ApplicantIncome has a very large maximum (81,000) relative to its median, suggesting extreme values.
To resolve these issues and ensure the dataset’s integrity, it is essential to conduct further examination and preprocessing, such as removing or imputing missing values, correcting any inconsistencies, and understanding the context behind any unusual patterns or values. This process will help ensure that the dataset is reliable and well-suited for subsequent analysis or modeling tasks.
tr <- read.csv(file="train.csv", na.strings=c("", "NA"), header=TRUE)  # re-read, treating blanks as NA
library(plyr)
tr$Dependents <- revalue(tr$Dependents, c("3+"="3"))  # recode the "3+" level
Now, let’s have a closer look at the missing data:
sapply(tr, function(x) sum(is.na(x)))
## Loan_ID Gender Married Dependents
## 0 13 3 15
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 32 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 22 14 50 0
## Loan_Status
## 0
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
mice_plot <- aggr(tr, col=c('navyblue','red'),
numbers=TRUE, sortVars=TRUE,
labels=names(tr), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Credit_History 0.081433225
## Self_Employed 0.052117264
## LoanAmount 0.035830619
## Dependents 0.024429967
## Loan_Amount_Term 0.022801303
## Gender 0.021172638
## Married 0.004885993
## Loan_ID 0.000000000
## Education 0.000000000
## ApplicantIncome 0.000000000
## CoapplicantIncome 0.000000000
## Property_Area 0.000000000
## Loan_Status 0.000000000
From the chart and table, we can identify that there are seven variables with missing data. To better understand the dataset, it’s essential to analyze the distribution of the data, particularly for numerical variables such as LoanAmount and ApplicantIncome. Visualizing these variables using histograms and boxplots can provide valuable insights into their distribution and potential outliers.
Begin by plotting histograms for LoanAmount and ApplicantIncome to visualize the data distribution and identify potential skewness or unusual patterns.
Next, create boxplots for LoanAmount and ApplicantIncome to gain a clearer understanding of the central tendency, spread, and potential outliers in the data.
#distribution
hist(tr$LoanAmount,
main="Histogram for Loan Amount",
xlab="Loan Amount",
border="blue",
col="maroon",
xlim=c(0,700),
breaks=20)
hist(tr$ApplicantIncome,
main="Histogram for Applicant Income",
xlab="Income",
border="blue",
col="maroon",
xlim=c(0,80000),
breaks=50)
## Histograms & Boxplots
par(mfrow=c(2,2))
hist(tr$LoanAmount,
main="Histogram for LoanAmount",
xlab="Loan Amount",
border="blue",
col="maroon",
las=1,
breaks=20, prob = TRUE)
#lines(density(tr$LoanAmount), col='black', lwd=3)
boxplot(tr$LoanAmount, col='maroon',xlab = 'LoanAmount', main = 'Box Plot for Loan Amount')
hist(tr$ApplicantIncome,
main="Histogram for Applicant Income",
xlab="Income",
border="blue",
col="maroon",
las=1,
breaks=50, prob = TRUE)
#lines(density(tr$ApplicantIncome), col='black', lwd=3)
boxplot(tr$ApplicantIncome, col='maroon',xlab = 'ApplicantIncome', main = 'Box Plot for Applicant Income')
Here, we observe a few extreme values in both the LoanAmount and ApplicantIncome variables.
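To quantify these outliers (a hedged check using the standard 1.5×IQR rule; its output is not shown above):
# Count the points flagged by the boxplot's 1.5*IQR rule
length(boxplot.stats(na.omit(tr$LoanAmount))$out)
length(boxplot.stats(na.omit(tr$ApplicantIncome))$out)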
To further explore the dataset, let’s examine whether the distribution of loan amounts for applicants is influenced by their educational level:
library(ggplot2)
ggplot(data=tr, aes(x=LoanAmount, fill=Education)) +
geom_density() +
facet_grid(Education~.)
## Warning: Removed 22 rows containing non-finite values (`stat_density()`).
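As a numeric companion to the density plot (a suggested check; its output is omitted here), we can summarize LoanAmount within each education level:
tapply(tr$LoanAmount, tr$Education, summary)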
We observe that graduates exhibit a greater number of outliers, and their loan amount distribution is broader than that of non-graduates. To further enhance our understanding of the dataset, let’s explore the categorical variables:
par(mfrow=c(2,3))
counts <- table(tr$Loan_Status, tr$Gender)
barplot(counts, main="Loan Status by Gender",
xlab="Gender", col=c("darkgrey","maroon"),
legend = rownames(counts))
counts2 <- table(tr$Loan_Status, tr$Education)
barplot(counts2, main="Loan Status by Education",
xlab="Education", col=c("darkgrey","maroon"),
legend = rownames(counts2))
counts3 <- table(tr$Loan_Status, tr$Married)
barplot(counts3, main="Loan Status by Married",
xlab="Married", col=c("darkgrey","maroon"),
legend = rownames(counts3))
counts4 <- table(tr$Loan_Status, tr$Self_Employed)
barplot(counts4, main="Loan Status by Self Employed",
xlab="Self_Employed", col=c("darkgrey","maroon"),
legend = rownames(counts4))
counts5 <- table(tr$Loan_Status, tr$Property_Area)
barplot(counts5, main="Loan Status by Property_Area",
xlab="Property_Area", col=c("darkgrey","maroon"),
legend = rownames(counts5))
counts6 <- table(tr$Loan_Status, tr$Credit_History)
barplot(counts6, main="Loan Status by Credit_History",
xlab="Credit_History", col=c("darkgrey","maroon"),
legend = rownames(counts6))
Upon examining the Gender graph, we observe that males constitute a larger proportion of the dataset, with more than half of their loan applications approved. Although there are fewer female applicants, a significant portion of their applications were also approved. We read the other charts with the same eye to evaluate how each category fared with respect to loan approval.
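To back these visual impressions with numbers (a hedged suggestion; its output is not shown here), row-wise proportions give the approval rate within each gender:
# Approval rate within each gender
prop.table(table(tr$Gender, tr$Loan_Status), margin = 1)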
Data Cleaning
Before proceeding with our analysis, we must address the identified issues within the dataset. Let’s recap the main concerns:
Missing values in some variables - We need to decide on an appropriate method for handling missing data based on the significance of each variable.
Outliers in ApplicantIncome and LoanAmount - We must determine how to handle these outliers and whether they result from measurement errors, recording errors, or genuine anomalies.
In this dataset, we’ll assume the values are missing at random, since they appear across several variables without an obvious pattern. Missing values occur in both numerical and categorical data. The ‘mice’ package in R can impute missing values with plausible data values, inferred from a distribution designed for each missing data point. From the missing data plot, we observe that roughly 78% of the records contain no missing information, about 8% are missing the Credit_History value, and the remainder exhibit other missing patterns.
train <- read.csv('train.csv', header = TRUE, stringsAsFactors=TRUE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
attach(train)
loan_clean <- train %>%
select(-c(Loan_ID)) %>%
mutate(Credit_History = as.factor(Credit_History))
colSums(is.na(loan_clean))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## 0 0 0 22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## 14 50 0 0
It’s worth noting that ‘mice’ stands for Multiple Imputation by Chained Equations. After imputation, the complete() function merges the imputed values back into a usable dataset. A minimal sketch of that workflow follows (hypothetical here; "cart" is one of several methods mice supports):
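imp <- mice(loan_clean, m = 5, method = "cart", seed = 42, printFlag = FALSE)  # impute each incomplete variable
loan_imputed <- complete(imp, 1)  # extract the first completed dataset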
For simplicity, however, we proceed by retaining only the complete cases, and then verify that no missing values remain in the dataset:
loan_clean <- loan_clean %>%
filter(complete.cases(.))
colSums(is.na(loan_clean))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## 0 0 0 0
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## 0 0 0 0
Now, let’s address the extreme values in our dataset. When examining the LoanAmount variable, it’s plausible that some customers may apply for larger loan amounts due to various reasons. To mitigate the impact of these extreme values and normalize the data, we can apply a log transformation:
tr <- loan_clean
tr$LogLoanAmount <- log(tr$LoanAmount)
par(mfrow=c(1,2))
hist(tr$LogLoanAmount,
main="Histogram for Loan Amount",
xlab="Loan Amount",
border="blue",
col="maroon",
las=1,
breaks=20, prob = TRUE)
lines(density(tr$LogLoanAmount), col='black', lwd=3)
boxplot(tr$LogLoanAmount, col='maroon',xlab = 'Loan Amount', main = 'Box Plot for Loan Amount')
Now the distribution looks much closer to normal, and the effect of the extreme values has significantly subsided.
Turning to ApplicantIncome, it is a good idea to combine ApplicantIncome and CoapplicantIncome into a total income and then apply a log transformation to the combined variable.
Although we have already removed the incomplete rows, simpler imputation strategies exist as well: if the values for a measurement fall in a certain range, empty values can be filled with the average of that measurement, or predicted with a CART model (a sketch follows). After combining the incomes and log-transforming, we will see that the distribution is better and closer to a normal distribution.
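A minimal illustration of mean imputation (hypothetical; not needed here because the incomplete rows were already removed):
# Fill missing LoanAmount values with the column mean (illustrative only)
# tr$LoanAmount[is.na(tr$LoanAmount)] <- mean(tr$LoanAmount, na.rm = TRUE)
We now combine the two incomes and log-transform the result: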
tr$Income <- tr$ApplicantIncome + tr$CoapplicantIncome
tr$ApplicantIncome <- NULL
tr$CoapplicantIncome <- NULL
tr$LogIncome <- log(tr$Income)
par(mfrow=c(1,2))
hist(tr$LogIncome,
main="Histogram for Applicant Income",
xlab="Income",
border="blue",
col="maroon",
las=1,
breaks=50, prob = TRUE)
lines(density(tr$LogIncome), col='black', lwd=3)
boxplot(tr$LogIncome, col='maroon',xlab = 'Income', main = 'Box Plot for Total Income')
Now it’s time for the next big step in our analysis: splitting the data into training and test sets.
A training set is the subset of the data we use to fit our models; the test set is a randomly held-out subset that we use to validate the models on unseen data.
With a small dataset like ours, it is easy to overfit. Overfitting, in simple terms, means the model learns the training set so closely that it cannot handle cases it has never seen before. Therefore, we will score our models using the test set. Once we split the data, we will treat the test set as if it no longer exists. Let’s split the data:
set.seed(42)
sample <- sample.int(n = nrow(tr), size = floor(.70*nrow(tr)), replace = F)
trainnew <- tr[sample, ]
testnew <- tr[-sample, ]
We will now start with our first logistic regression model. We will not include all the variables in the model, because doing so might overfit the data. To choose our variables, let’s reason about their importance. The chances that an application is approved are higher if:
The applicant has taken a loan before; Credit_History is the variable that captures this.
The applicant has a higher income; the combined income variable we created captures this.
The applicant has a higher level of education.
The applicant has a stable job.
We will use the Credit_History variable in our first logistic regression model.
logistic1 <- glm(Loan_Status ~ Credit_History, data = trainnew, family = binomial)
summary(logistic1)
##
## Call:
## glm(formula = Loan_Status ~ Credit_History, family = binomial,
## data = trainnew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.4408 0.6728 0.6728 2.1815
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2824 0.4694 -4.862 1.16e-06 ***
## Credit_History1 3.6529 0.4898 7.458 8.81e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 455.38 on 369 degrees of freedom
## Residual deviance: 351.78 on 368 degrees of freedom
## AIC: 355.78
##
## Number of Fisher Scoring iterations: 4
my_prediction_tr1 <- predict(logistic1, newdata = trainnew, type = "response")
table(trainnew$Loan_Status, my_prediction_tr1 > 0.5)
##
## FALSE TRUE
## N 49 64
## Y 5 252
logistic_test1 <- glm(Loan_Status ~ Credit_History, data = testnew, family = binomial)
summary(logistic_test1)
##
## Call:
## glm(formula = Loan_Status ~ Credit_History, family = binomial,
## data = testnew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7900 -0.4084 0.6708 0.6708 2.2475
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4423 0.7372 -3.313 0.000923 ***
## Credit_History1 3.8193 0.7680 4.973 6.59e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 198.0 on 158 degrees of freedom
## Residual deviance: 148.6 on 157 degrees of freedom
## AIC: 152.6
##
## Number of Fisher Scoring iterations: 5
my_prediction_te1 <- predict(logistic_test1, newdata = testnew, type = "response")
table(testnew$Loan_Status, my_prediction_te1 > 0.5)
##
## FALSE TRUE
## N 23 27
## Y 2 107
Logistic regression, in simple terms, predicts the probability that an event occurs by fitting the data to a logit function. Regression coefficients represent the mean change in the response for one unit of change in a predictor while holding the other predictors constant. This type of model belongs to a larger class of algorithms known as Generalized Linear Models (GLMs).
The role of the link function is to connect the expectation of y to the linear predictor. Logistic regression, as a GLM, has the following properties:
A GLM does not assume a linear relationship between the dependent and independent variables.
The dependent variable need not be normally distributed.
It uses maximum likelihood estimation (MLE).
Errors need to be independent but not normally distributed.
In the output, the first thing we see is the call; this is R reminding us of the model we ran. Next come the deviance residuals, a measure of model fit showing the distribution of the deviance residuals for the individual cases used in the model. The next part shows the coefficients, their standard errors, the z-statistic, and the associated p-values. The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that we can reject the null hypothesis and that the predictor is meaningful to the model; a larger p-value indicates that changes in the predictor are not associated with changes in the response. The p-value for Credit_History is very small, so it is significant.
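To make these coefficients concrete, we can convert them into predicted approval probabilities with the inverse-logit function (a quick illustration using the estimates from the summary(logistic1) output above):
plogis(-2.2824)           # P(approval | Credit_History = 0), about 0.09
plogis(-2.2824 + 3.6529)  # P(approval | Credit_History = 1), about 0.80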
We have also generated a confusion table to check the accuracy of the model on both the train and the test data:
Train data: 81.12% Test data: 83.24%
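These accuracy figures can be reproduced from the confusion tables as the proportion of correctly classified cases (a minimal sketch; conf_tr is a hypothetical name for the training table shown above):
conf_tr <- table(trainnew$Loan_Status, my_prediction_tr1 > 0.5)
sum(diag(conf_tr)) / sum(conf_tr)  # diagonal cells are the correct predictions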
Let’s add other variables and check the accuracy:
logistic2 <- glm(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome, data = trainnew, family = binomial)
summary(logistic2)
##
## Call:
## glm(formula = Loan_Status ~ Credit_History + Education + Self_Employed +
## Property_Area + LogLoanAmount + LogIncome, family = binomial,
## data = trainnew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2797 -0.3803 0.4930 0.7063 2.3992
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1252 2.6579 -1.176 0.239663
## Credit_History1 3.8503 0.5122 7.517 5.62e-14 ***
## EducationNot Graduate -0.2141 0.3416 -0.627 0.530844
## Self_EmployedNo -0.5174 0.7152 -0.723 0.469423
## Self_EmployedYes -1.0221 0.7830 -1.305 0.191751
## Property_AreaSemiurban 1.2127 0.3549 3.417 0.000633 ***
## Property_AreaUrban 0.4050 0.3337 1.214 0.224931
## LogLoanAmount -0.4011 0.4005 -1.002 0.316500
## LogIncome 0.3089 0.4024 0.768 0.442726
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 455.38 on 369 degrees of freedom
## Residual deviance: 334.13 on 361 degrees of freedom
## AIC: 352.13
##
## Number of Fisher Scoring iterations: 5
my_prediction_tr2 <- predict(logistic2, newdata = trainnew, type = "response")
table(trainnew$Loan_Status, my_prediction_tr2 > 0.5)
##
## FALSE TRUE
## N 49 64
## Y 5 252
logistic_test2 <- glm(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome, data = testnew, family = binomial)
summary(logistic_test2)
##
## Call:
## glm(formula = Loan_Status ~ Credit_History + Education + Self_Employed +
## Property_Area + LogLoanAmount + LogIncome, family = binomial,
## data = testnew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1433 -0.3639 0.5583 0.7322 2.3253
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.80182 3.72186 -1.021 0.307
## Credit_History1 3.73115 0.78224 4.770 1.84e-06 ***
## EducationNot Graduate -0.35954 0.53502 -0.672 0.502
## Self_EmployedNo -0.88816 1.09824 -0.809 0.419
## Self_EmployedYes -0.22080 1.31186 -0.168 0.866
## Property_AreaSemiurban 0.64145 0.51558 1.244 0.213
## Property_AreaUrban -0.13494 0.52278 -0.258 0.796
## LogLoanAmount 0.03382 0.56346 0.060 0.952
## LogIncome 0.22246 0.48048 0.463 0.643
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 198.00 on 158 degrees of freedom
## Residual deviance: 143.29 on 150 degrees of freedom
## AIC: 161.29
##
## Number of Fisher Scoring iterations: 5
my_prediction_te2 <- predict(logistic_test2, newdata = testnew, type = "response")
table(testnew$Loan_Status, my_prediction_te2 > 0.5)
##
## FALSE TRUE
## N 23 27
## Y 2 107
Train data: 81.11% Test data: 83.78%
We note that adding variables improved the accuracy on the test set.
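To check whether the added variables improve the fit beyond chance (a suggested check; its output is not shown here), a likelihood-ratio test compares the two nested models on the training data:
anova(logistic1, logistic2, test = "Chisq")  # compare nested logistic models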
Decision trees create a set of binary splits on the predictor variables in order to create a tree that can be used to classify new observations into one of two groups. Here, we will be using classical trees. The algorithm of this model is the following:
Choose the predictor variable that best splits the data into two groups;
Separate the data into these two groups;
Repeat these steps until a subgroup contains fewer than a minimum number of observations;
To classify a case, run it down the tree to a terminal node and assign it that node’s outcome value.
library(rpart)
# grow tree
dtree <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome,method="class", data=trainnew,parms=list(split="information"))
dtree$cptable
## CP nsplit rel error xerror xstd
## 1 0.38938053 0 1.0000000 1.0000000 0.07840188
## 2 0.01474926 1 0.6106195 0.6106195 0.06630228
## 3 0.01000000 4 0.5663717 0.6991150 0.06975587
plotcp(dtree)
dtree.pruned <- prune(dtree, cp=.02290076)
library(rpart.plot)
prp(dtree.pruned, type = 2, extra = 104,
fallen.leaves = TRUE, main="Decision Tree")
dtree.pred <- predict(dtree.pruned, trainnew, type="class")
dtree.perf <- table(trainnew$Loan_Status, dtree.pred,
dnn=c("Actual", "Predicted"))
dtree.perf
## Predicted
## Actual N Y
## N 49 64
## Y 5 252
In R, decision trees can be grown and pruned using the rpart() and prune() functions in the rpart package. First, the tree is grown with rpart(). The full tree may be too large and need to be pruned. To choose a final tree size, examine the cptable component of the list returned by rpart(), which contains the prediction error for different tree sizes. The complexity parameter (cp) penalizes larger trees. Tree size is defined by the number of branch splits (nsplit); a tree with n splits has n + 1 terminal nodes. The rel error column contains the error rate for a tree of a given size in the training sample. The cross-validated error (xerror) is based on 10-fold cross-validation using the training sample, and xstd is the standard error of the cross-validated error.
The plotcp() function plots the cross-validated error against the complexity parameter. To choose the final tree size, we select the smallest tree whose cross-validated error is within one standard error of the minimum cross-validated error. In our case, the minimum cross-validated error is 0.6106 with a standard error of 0.0663, so we select the smallest tree whose cross-validated error falls within 0.6106 ± 0.0663, that is, between 0.544 and 0.677. From the table, the tree with one split (cross-validated error = 0.6106) fits the requirement.
From the cptable, the tree with one split corresponds to a complexity parameter of 0.0147, and any cp value between 0.0147 and 0.389 prunes back to it, so the statement prune(dtree, cp=.02290076) returns a tree of the desired size. We have then plotted the pruned tree for predicting loan status. Reading from the top, we move left if a condition is true and right otherwise. When an observation reaches a terminal node, it is classified. Each node shows the probability of each class in that node, along with the percentage of the sample it contains.
Finally, we generate the confusion table to measure the accuracy of the model. We follow the same steps on the test data.
dtree_test <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome,method="class", data=testnew,parms=list(split="information"))
dtree_test$cptable
## CP nsplit rel error xerror xstd
## 1 0.42 0 1.00 1.00 0.11709266
## 2 0.07 1 0.58 0.58 0.09738725
## 3 0.01 3 0.44 0.64 0.10111330
plotcp(dtree_test)
dtree_test.pruned <- prune(dtree_test, cp=.01639344)
prp(dtree_test.pruned, type = 2, extra = 104,
fallen.leaves = TRUE, main="Decision Tree")
dtree_test.pred <- predict(dtree_test.pruned, testnew, type="class")
dtree_test.perf <- table(testnew$Loan_Status, dtree_test.pred,
dnn=c("Actual", "Predicted"))
dtree_test.perf
## Predicted
## Actual N Y
## N 30 20
## Y 2 107
Accuracy: Train data: 81.81% Test data: 85.4%
These results show better performance than the logistic model.
A random forest is an ensemble learning approach to supervised learning. This approach develops multiple predictive models, and the results are aggregated to improve classification. The algorithm is as follows:
Grow many decision trees by sampling;
Sample m < M variables at each node;
Grow each tree fully without pruning;
Terminal nodes are assigned to a class based on the mode of cases in that node;
Classify new cases by sending them down all the trees and taking a vote.
Random forests are grown with the randomForest() function in the randomForest package in R. By default it grows 500 trees, samples floor(sqrt(M)) variables at each node, and uses a minimum node size of 1.
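These defaults can also be set explicitly (a hypothetical illustration; the model fitted below simply relies on them, and mtry = 2 matches the "No. of variables tried at each split" in its output):
# Explicit defaults for a classification forest with 6 predictors (illustrative only)
# randomForest(Loan_Status ~ ., data = trainnew, ntree = 500, mtry = 2, nodesize = 1)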
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(42)
fit.forest <- randomForest(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+
LogIncome, data=trainnew,
na.action=na.roughfix,
importance=TRUE)
fit.forest
##
## Call:
## randomForest(formula = Loan_Status ~ Credit_History + Education + Self_Employed + Property_Area + LogLoanAmount + LogIncome, data = trainnew, importance = TRUE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.27%
## Confusion matrix:
## N Y class.error
## N 52 61 0.53982301
## Y 14 243 0.05447471
importance(fit.forest, type=2)
## MeanDecreaseGini
## Credit_History 41.21536
## Education 2.94800
## Self_Employed 5.14608
## Property_Area 7.59871
## LogLoanAmount 28.58602
## LogIncome 31.06103
forest.pred <- predict(fit.forest, testnew)
forest.perf <- table(testnew$Loan_Status, forest.pred,
dnn=c("Actual", "Predicted"))
forest.perf
## Predicted
## Actual N Y
## N 25 25
## Y 5 104
Here is the accuracy of the model: Train data: 79.95% Test data: 82.16%
The random forest function grew 500 traditional decision trees, each on a bootstrap sample of the 370 training observations drawn with replacement. Random forests provide a natural measure of variable importance. The relative importance measure requested with the type=2 option is the total decrease in node impurity from splitting on a variable, averaged over all trees. In our forest, the most important variable is Credit_History and the least important is Education. We then measured the accuracy on the training sample and applied the model to the test sample. We note that both accuracies are lower than the decision tree’s.
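The same measure can be visualized directly (a suggested companion plot; its output is not shown here):
varImpPlot(fit.forest, type = 2)  # dot chart of variable importance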
We will run the same model, but this time selecting the three variables highest in importance:
set.seed(42)
fit.forest2 <- randomForest(Loan_Status ~ Credit_History+LogLoanAmount+
LogIncome, data=trainnew,importance=TRUE)
fit.forest2
##
## Call:
## randomForest(formula = Loan_Status ~ Credit_History + LogLoanAmount + LogIncome, data = trainnew, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 19.19%
## Confusion matrix:
## N Y class.error
## N 49 64 0.56637168
## Y 7 250 0.02723735
forest.pred2 <- predict(fit.forest2, testnew)
forest.perf_test <- table(testnew$Loan_Status, forest.pred2,
dnn=c("Actual", "Predicted"))
forest.perf_test
## Predicted
## Actual N Y
## N 23 27
## Y 2 107
Here, we notice slight improvements on both samples: accuracy is 80.88% for the training sample and 83.24% for the test sample. The decision tree’s accuracy is still better.
Random forests tend to be very accurate compared with other classification methods, and they can handle large problems. Personally, I have more confidence in the results generated by random forests than by single decision trees. One problem that can occur with a single decision tree is overfitting. A random forest, on the other hand, guards against overfitting by building many smaller trees on random subsets of the variables and then aggregating their predictions.
Although the accuracy of the decision tree is better, I’m choosing the random forest model: the difference in accuracy between the two models is slight, and I prefer the forest model for the reasons mentioned in the previous section.
Let’s now create a data frame with two columns, Loan_ID and Loan_Status, containing our predictions, and write it to a csv file named my_solution.csv (these lines are left commented out because Loan_ID was dropped from the cleaned data earlier):
#my_solution <- data.frame(Loan_ID = testnew$Loan_ID, Loan_Status = forest.pred2)
#write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)
So now we have predictions for the 159 customers in the test set, with an accuracy of 83.24%. We can apply this method to any new dataset with the same variables to predict applicants’ eligibility for a loan.