The aim of the following analysis is to examine the relationships between an individual requesting a loan, information about the loan request, and whether the loan was approved. This analysis hopes to identify key factors that predict the probability of a loan request being accepted. Diving deep into this topic will hopefully provide useful insights for individuals who are applying for a loan and wish to estimate their chance of approval.
This dataset comes from Kaggle, a free web-based platform that data scientists and statisticians use to share both ideas and datasets. The link to the Loan Approval Classification Dataset is below:
Loan Approval Classification Dataset
This dataset may be updated over time. I downloaded this dataset on Monday, November 4th, 2024. Link to this version of the dataset is below:
R/R Markdown were used in this project because they are free and open-source, allowing users to customize their experience with libraries and features that other statistical software such as SAS does not offer. R provides an easy-to-use and comprehensive toolset of statistical analyses and tests. While the most advanced of these analyses will not be used in this project, further work could provide additional insight into the major factors behind loan approval.
This analysis seeks to answer the questions:
loan.data <- read.csv("https://raw.githubusercontent.com/EPKeep32/STA551/refs/heads/main/loan_data.csv")
Here is what the author on Kaggle had to say about the Loan Approval Classification Dataset: “This dataset is a synthetic version inspired by the original Credit Risk dataset on Kaggle and enriched with additional variables based on Financial Risk for Loan Approval data. SMOTENC was used to simulate new data points to enlarge the instances. The dataset is structured for both categorical and continuous features.”
A detailed description of the variables within the dataset is given below:
person_age: Age of the person [numerical]
person_gender: Gender of the person [categorical]
person_education: Highest education level [categorical]
person_income: Annual income [numerical]
person_emp_exp: Years of employment experience [numerical]
person_home_ownership: Home ownership status (e.g., rent, own, mortgage) [categorical]
loan_amnt: Loan amount requested [numerical]
loan_intent: Purpose of the loan [categorical]
loan_int_rate: Loan interest rate [numerical]
loan_percent_income: Loan amount as a percentage of annual income [numerical]
cb_person_cred_hist_length: Length of credit history in years [numerical]
credit_score: Credit score of the person [numerical]
previous_loan_defaults_on_file: Indicator of previous loan defaults [categorical]
loan_status: Loan approval status: 1 = approved; 0 = rejected [binary]
Below is a list of any additional variables that were created during the model building process and their descriptions.
For regression techniques, I created dummy variables for all categorical variables. The list of them can be found below:
Gender
Male: 1 if person_gender = “male”; 0 otherwise
Education
HS: 1 if person_education = “High School”; 0 otherwise
Associate: 1 if person_education = “Associate”; 0 otherwise
Bachelor: 1 if person_education = “Bachelor”; 0 otherwise
Master: 1 if person_education = “Master”; 0 otherwise
Home Ownership
Own_Other: 1 if person_home_ownership = “OTHER”; 0 otherwise
Own_Own: 1 if person_home_ownership = “OWN”; 0 otherwise
Own_Mortgage: 1 if person_home_ownership = “MORTGAGE”; 0 otherwise
Loan Intent
HomeImp: 1 if loan_intent = “HOMEIMPROVEMENT”; 0 otherwise
Debt: 1 if loan_intent = “DEBTCONSOLIDATION”; 0 otherwise
Personal: 1 if loan_intent = “PERSONAL”; 0 otherwise
Venture: 1 if loan_intent = “VENTURE”; 0 otherwise
Medical: 1 if loan_intent = “MEDICAL”; 0 otherwise
Previous Loan
Previous: 1 if previous_loan_defaults_on_file = “Yes”; 0 otherwise
It is common practice for the number of dummy variables per categorical variable to be one less than the number of groups. For example, only one dummy variable was created for person_gender. If an applicant's gender is “female”, their Male value is 0. This means that “female” is the reference category for Male.
loan.data$Male <- ifelse(loan.data$person_gender == "male", 1, 0)
loan.data$HS <- ifelse(loan.data$person_education == "High School", 1, 0)
loan.data$Associate <- ifelse(loan.data$person_education == "Associate", 1, 0)
loan.data$Bachelor <- ifelse(loan.data$person_education == "Bachelor", 1, 0)
loan.data$Master <- ifelse(loan.data$person_education == "Master", 1, 0)
loan.data$Own_Other <- ifelse(loan.data$person_home_ownership == "OTHER", 1, 0)
loan.data$Own_Own <- ifelse(loan.data$person_home_ownership == "OWN", 1, 0)
loan.data$Own_Mortgage <- ifelse(loan.data$person_home_ownership == "MORTGAGE", 1, 0)
loan.data$HomeImp <- ifelse(loan.data$loan_intent == "HOMEIMPROVEMENT", 1, 0)
loan.data$Debt <- ifelse(loan.data$loan_intent == "DEBTCONSOLIDATION", 1, 0)
loan.data$Personal <- ifelse(loan.data$loan_intent == "PERSONAL", 1, 0)
loan.data$Venture <- ifelse(loan.data$loan_intent == "VENTURE", 1, 0)
loan.data$Medical <- ifelse(loan.data$loan_intent == "MEDICAL", 1, 0)
loan.data$Previous <- ifelse(loan.data$previous_loan_defaults_on_file == "Yes", 1, 0)
The next step is to determine whether there are any missing values. Missing values can lead to biased models, ultimately causing decisions that may be incorrect.
Another issue with missing values is how to replace them. There are many ways to replace missing values, including mean imputation, median imputation, and model-based imputation.
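If any values were found to be missing, a simple fix would be median imputation, sketched below for a single numeric column (person_income is used purely as an illustration; the same pattern applies to any numeric variable):
# Median imputation sketch: replace any NA values in a numeric column with that column's median
loan.data$person_income[is.na(loan.data$person_income)] <- median(loan.data$person_income, na.rm = TRUE)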
However, I will first determine whether there are any missing values within the dataset before exploring replacement techniques.
missing_values <- is.na(loan.data)
summary(missing_values)
person_age person_gender person_education person_income
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:45000 FALSE:45000 FALSE:45000 FALSE:45000
person_emp_exp person_home_ownership loan_amnt loan_intent
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:45000 FALSE:45000 FALSE:45000 FALSE:45000
loan_int_rate loan_percent_income cb_person_cred_hist_length credit_score
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:45000 FALSE:45000 FALSE:45000 FALSE:45000
previous_loan_defaults_on_file loan_status Male HS
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:45000 FALSE:45000 FALSE:45000 FALSE:45000
Associate Bachelor Master Own_Other
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:45000 FALSE:45000 FALSE:45000 FALSE:45000
Own_Own Own_Mortgage HomeImp Debt
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:45000 FALSE:45000 FALSE:45000 FALSE:45000
Personal Venture Medical Previous
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:45000 FALSE:45000 FALSE:45000 FALSE:45000
missing_rows <- loan.data[!complete.cases(loan.data), ]
print(missing_rows)
[1] person_age person_gender
[3] person_education person_income
[5] person_emp_exp person_home_ownership
[7] loan_amnt loan_intent
[9] loan_int_rate loan_percent_income
[11] cb_person_cred_hist_length credit_score
[13] previous_loan_defaults_on_file loan_status
[15] Male HS
[17] Associate Bachelor
[19] Master Own_Other
[21] Own_Own Own_Mortgage
[23] HomeImp Debt
[25] Personal Venture
[27] Medical Previous
<0 rows> (or 0-length row.names)
The above summary table shows that none of the variables have missing values. If any of the variables had missing values, there would be an NA: n row beneath it, with n representing the number of missing values. Therefore, no imputation is necessary for this dataset.
The final dataset has 45,000 observations, each with 28 variables. All 28 variables will be kept within the dataset for any further research, regardless of their use in this analysis.
Below is a quick summary of all variables in the final dataset:
summary(loan.data)
person_age person_gender person_education person_income
Min. : 20.00 Length:45000 Length:45000 Min. : 8000
1st Qu.: 24.00 Class :character Class :character 1st Qu.: 47204
Median : 26.00 Mode :character Mode :character Median : 67048
Mean : 27.76 Mean : 80319
3rd Qu.: 30.00 3rd Qu.: 95789
Max. :144.00 Max. :7200766
person_emp_exp person_home_ownership loan_amnt loan_intent
Min. : 0.00 Length:45000 Min. : 500 Length:45000
1st Qu.: 1.00 Class :character 1st Qu.: 5000 Class :character
Median : 4.00 Mode :character Median : 8000 Mode :character
Mean : 5.41 Mean : 9583
3rd Qu.: 8.00 3rd Qu.:12237
Max. :125.00 Max. :35000
loan_int_rate loan_percent_income cb_person_cred_hist_length credit_score
Min. : 5.42 Min. :0.0000 Min. : 2.000 Min. :390.0
1st Qu.: 8.59 1st Qu.:0.0700 1st Qu.: 3.000 1st Qu.:601.0
Median :11.01 Median :0.1200 Median : 4.000 Median :640.0
Mean :11.01 Mean :0.1397 Mean : 5.867 Mean :632.6
3rd Qu.:12.99 3rd Qu.:0.1900 3rd Qu.: 8.000 3rd Qu.:670.0
Max. :20.00 Max. :0.6600 Max. :30.000 Max. :850.0
previous_loan_defaults_on_file loan_status Male
Length:45000 Min. :0.0000 Min. :0.000
Class :character 1st Qu.:0.0000 1st Qu.:0.000
Mode :character Median :0.0000 Median :1.000
Mean :0.2222 Mean :0.552
3rd Qu.:0.0000 3rd Qu.:1.000
Max. :1.0000 Max. :1.000
HS Associate Bachelor Master
Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.266 Mean :0.2673 Mean :0.2978 Mean :0.1551
3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
Own_Other Own_Own Own_Mortgage HomeImp
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
Mean :0.0026 Mean :0.06558 Mean :0.4109 Mean :0.1063
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
Debt Personal Venture Medical
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00
Mean :0.1588 Mean :0.1678 Mean :0.1738 Mean :0.19
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00
Previous
Min. :0.000
1st Qu.:0.000
Median :1.000
Mean :0.508
3rd Qu.:1.000
Max. :1.000
Next I will perform some Exploratory Data Analysis, or EDA. EDA is an important step in the overarching data analysis process, as it familiarizes the analyst with the dataset before any modeling begins.
Note that I will not be performing any EDA on the manually created dummy variables, as their distributions are covered by their corresponding original variables.
Below is a histogram of person_age:
hist(loan.data$person_age,
breaks = seq(0, 150, 10),
main = "Distribution of Applicant's Age",
xlim = c(0, 150),
xlab = "Applicant's Age (years)",
ylim = c(0, 40000),
ylab = "Number of Applicants",
col = "lightblue",
labels = TRUE)
abline(h = seq(5000, 40000, 5000), col = "gray", lty = "dotted")
The histogram above shows that person_age has a very right-skewed distribution. Over 75% of applicants in the dataset are 30 years old or younger. A few outliers may be present, with some applicants listed as over 100 years old.
Below is a pie chart of person_gender:
pst <- loan.data %>%
group_by(person_gender) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(perc = count / sum(count)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
ggplot(pst, aes(x = "", y = perc, fill = person_gender)) +
geom_col() +
geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")),
position = position_stack(vjust = (0.5))) +
coord_polar(theta = "y") +
scale_fill_manual(values = c("female" = "#FF9999", "male" = "#66B3FF"),
labels = c("Female", "Male")) +
labs(title = "Pie Chart of Applicant's Gender", y = "", x = "") +
guides(fill = guide_legend(title = "Gender"))
As shown in the pie chart above, 55% of our loan applicants are male, and 45% are female.
Below is a pie chart of person_education:
pst <- loan.data %>%
group_by(person_education) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(perc = count / sum(count)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc),
person_education = factor(person_education, levels = c("High School", "Associate", "Bachelor", "Master", "Doctorate")))
ggplot(pst, aes(x = "", y = perc, fill = person_education)) +
geom_col() +
geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")),
position = position_stack(vjust = (0.5))) +
coord_polar(theta = "y") +
scale_fill_manual(values = c("High School" = "#BBDEFB", "Associate" = "#90CAF9", "Bachelor" = "#42A5F5", "Master" = "#1E88E5", "Doctorate" = "#1565C0"),
labels = c("High School Diploma", "Associate's Degree", "Bachelor's Degree", "Master's Degree", "Doctorate's Degree")) +
labs(title = "Pie Chart of Applicant's Highest Completed Education Level", y = "", x = "") +
guides(fill = guide_legend(title = "Education Level"))
As shown in the pie chart above, the most common education level among applicants is a bachelor’s degree (29.78%), followed by an associate’s degree (26.73%).
Below is a histogram of person_income:
hist(loan.data$person_income,
breaks = seq(0, 7300000, 100000),
main = "Distribution of Applicant's Income",
xlim = c(0, 2000000),
xlab = "Applicant's Yearly Income (USD)",
ylim = c(0, 40000),
ylab = "Number of Applicants",
col = "lightgreen",
labels = TRUE)
abline(h = seq(5000, 40000, 5000), col = "gray", lty = "dotted")
The histogram above shows that person_income has a very right-skewed distribution. Over 75% of applicants in the dataset have an annual income of less than $100,000. A few outliers may be present, with some applicants earning over $1.5 million.
NOTE: For visualization purposes, 3 applicants with income values of $5,545,545.00, $5,556,399.00, and $7,200,766.00 are not shown in the chart above.
Below is a histogram of person_emp_exp:
hist(loan.data$person_emp_exp,
breaks = seq(0, 125, 5),
main = "Distribution of Applicant's Employment Experience",
xlim = c(0, 125),
xaxt = "n",
xlab = "Applicant's Employment Experience (Years)",
ylim = c(0, 40000),
ylab = "Number of Applicants",
col = "yellow",
labels = TRUE)
abline(h = seq(5000, 40000, 5000), col = "gray", lty = "dotted")
axis(1, at = seq(0, 125, 25))
The histogram above shows that person_emp_exp has a very
right-skewed distribution. Over 60% of applicants in the dataset have
less than 5 years of employment experience. A few outliers may be
present, with some applicants over 75 years of employment
experience.
Below is a bar chart of person_home_ownership:
lds <- loan.data %>%
count(person_home_ownership) %>%
mutate(perc = n / sum(n)*100,
label = paste(n, " (", round(perc, 1), "%)", sep = ""))
ggplot(lds, aes(x = reorder(person_home_ownership, n), y = n)) +
geom_bar(stat = "identity", fill = "lightpink") +
geom_text(aes(label = label), vjust = -0.5) +
labs(title = "Applicant's Home Ownership", x = "Ownership Status", y = "Count")
As shown above, over half of the applicants currently rent their home.
Below is a histogram of loan_amnt:
hist(loan.data$loan_amnt,
breaks = seq(0, 35000, 2500),
main = "Distribution of Requested Loan Amount",
xlim = c(0, 35000),
xlab = "Requested Loan Amount (USD)",
ylim = c(0, 10000),
ylab = "Number of Applicants",
col = "violet",
labels = TRUE)
abline(h = seq(1000, 10000, 1000), col = "gray", lty = "dotted")
The histogram above shows that loan_amnt has a
right-skewed distribution. A majority of applicants in the dataset
requested between $2,500 and $10,000. A few outliers may be present,
with some applicants requesting over $30,000.
Below is a bar chart of loan_intent:
lds <- loan.data %>%
count(loan_intent) %>%
mutate(perc = n / sum(n)*100,
label = paste(n, "\n(", round(perc, 1), "%)", sep = ""))
ggplot(lds, aes(x = reorder(loan_intent, n), y = n)) +
geom_bar(stat = "identity", fill = "darkred") +
geom_text(aes(label = label, y = n / 2), hjust = 0.5, color = "white") +
labs(title = "Applicant's Loan Intent", x = "Loan Intent", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
As shown above, applicants are applying for a loan for many different reasons, with the most popular being an education loan.
Below is a histogram of loan_int_rate:
hist(loan.data$loan_int_rate,
breaks = seq(0, 20, 2),
main = "Distribution of Loan Interest Rate",
xlim = c(0, 20),
xaxt = "n",
xlab = "Interest rate (%)",
ylim = c(0, 15000),
ylab = "Number of Applicants",
col = "gray50",
labels = TRUE)
abline(h = seq(2500, 15000, 2500), col = "gray", lty = "dotted")
axis(1, at = seq(0, 20, 2))
The histogram above shows that loan_int_rate has a fairly Normal distribution. Most applicants are applying for a loan with an interest rate between 10% and 12%.
Below is a histogram of loan_percent_income:
hist(loan.data$loan_percent_income,
breaks = seq(0, 1, .1),
main = "Distribution of Loan Percentage of Annual Income",
xlim = c(0, 1),
xlab = "Loan Percentage of Annual Income (%)",
ylim = c(0, 20000),
ylab = "Number of Applicants",
col = "lawngreen",
labels = TRUE)
abline(h = seq(2500, 20000, 2500), col = "gray", lty = "dotted")
The histogram above shows that loan_percent_income has a
very right-skewed distribution.
Below is a histogram of cb_person_cred_hist_length:
hist(loan.data$cb_person_cred_hist_length,
breaks = seq(0, 30, 3),
main = "Distribution of Applicant's Length of Credit History",
xlim = c(0, 30),
xlab = "Applicant's Length of Credit History (Years)",
ylim = c(0, 20000),
ylab = "Number of Applicants",
col = "gold",
labels = TRUE)
abline(h = seq(2500, 20000, 2500), col = "gray", lty = "dotted")
The histogram above shows that
cb_person_cred_hist_length has a very right-skewed
distribution.
Below is a histogram of credit_score:
hist(loan.data$credit_score,
breaks = seq(350, 900, 50),
main = "Distribution of Applicant's Credit Score",
xlim = c(350, 900),
xaxt = "n",
xlab = "Applicant's Credit Score",
ylim = c(0, 20000),
ylab = "Number of Applicants",
col = "plum",
labels = TRUE)
abline(h = seq(2500, 20000, 2500), col = "gray", lty = "dotted")
axis(1, at = seq(350, 900, 50))
The histogram above shows that credit_score has a
slightly left-skewed distribution. A majority of applicants in the
dataset have a credit score between 600 and 700.
Below is a pie chart of
previous_loan_defaults_on_file:
pst <- loan.data %>%
group_by(previous_loan_defaults_on_file) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(perc = count / sum(count)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
ggplot(pst, aes(x = "", y = perc, fill = previous_loan_defaults_on_file)) +
geom_col() +
geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")),
position = position_stack(vjust = (0.5))) +
coord_polar(theta = "y") +
scale_fill_manual(values = c("No" = "palevioletred", "Yes" = "lightgreen"),
labels = c("No", "Yes")) +
labs(title = "Pie Chart of Previous Loan Status", y = "", x = "") +
guides(fill = guide_legend(title = "Did the applicant\nhave a previous loan?"))
As shown in the pie chart above, 50.8% of our loan applicants have a previous loan default on file.
Below is a pie chart of loan_status:
pst <- loan.data %>%
group_by(loan_status) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(perc = count / sum(count)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc),
loan_status = factor(loan_status, levels = c(0, 1), labels = c("No", "Yes")))
ggplot(pst, aes(x = "", y = perc, fill = loan_status)) +
geom_col() +
geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")),
position = position_stack(vjust = (0.5))) +
coord_polar(theta = "y") +
scale_fill_manual(values = c("No" = "palevioletred", "Yes" = "lightgreen"),
labels = c("No", "Yes")) +
labs(title = "Pie Chart of Loan Status", y = "", x = "") +
guides(fill = guide_legend(title = "Did the loan get approved?"))
As shown in the pie chart above, only 22% of the loan requests were approved.
Next, I will go through 4 different modeling techniques in an attempt to find the best possible model. The 4 techniques that I will use are:
Logistic Regression
Perceptron
Decision Tree
Bagging
Each of the 4 algorithms listed above is powerful in its own way. I will find the best model with each technique, and then compare the models to one another to determine not only the best model but also the best modeling technique.
In order to later test the models, I split the dataset into two separate datasets: training data and testing data. Doing so allows the models to be evaluated on held-out data without the need for additional research or sampling. Roughly 75% of the overall dataset will be in the training dataset, and the remaining ~25% will be in the testing dataset.
set.seed(323)
index <- sample.split(Y = loan.data$loan_status, SplitRatio = 0.75)
train.data <- loan.data[index, ]
test.data <- loan.data[!index, ]
Logistic regression builds a linear model for binary classification. It estimates the probability of a binary outcome based on one or more predictor variables, using the logistic function to map the linear predictor to a probability between 0 and 1.
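In symbols, for predictors \(x_1, \dots, x_p\) with coefficients \(\beta_0, \beta_1, \dots, \beta_p\), the model estimates
\(P(loan\_status = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}\)
so the linear predictor \(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\) is the log-odds of approval.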
I will begin by running the full model with all predictor variables. Once run, I will check whether any of the variables in the model are highly correlated. If so, the no-multicollinearity assumption will be violated and the coefficient estimates will be unreliable.
If there are any highly correlated variables, I will remove them one by one and re-run the model, each time with one fewer variable. Once all correlated predictors are removed, I will remove any insignificant variables from the model until only significant variables remain. The remaining variables will make up the final model.
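A sketch of this workflow is below. The full-model fit and the use of car::vif() for the collinearity check are illustrative assumptions about the intermediate steps; the final call reproduces the model summarized in the output that follows, and the object name log.reg.12 matches the one used later for prediction.
library(car)   # assumed here for vif(); any VIF implementation would work
# Full model with all predictors (numeric variables plus the dummy variables created above)
log.reg.full <- glm(loan_status ~ person_age + person_income + person_emp_exp +
                      loan_amnt + loan_int_rate + loan_percent_income +
                      cb_person_cred_hist_length + credit_score + Male + HS +
                      Associate + Bachelor + Master + Own_Other + Own_Own +
                      Own_Mortgage + HomeImp + Debt + Personal + Venture +
                      Medical + Previous,
                    family = "binomial", data = train.data)
# Check variance inflation factors; predictors with large VIFs are dropped one at a time
vif(log.reg.full)
# Final model after removing correlated and insignificant predictors
log.reg.12 <- glm(loan_status ~ person_income + Own_Own + Own_Mortgage +
                    loan_amnt + HomeImp + Debt + Personal + Venture + Medical +
                    loan_int_rate + loan_percent_income,
                  family = "binomial", data = train.data)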
After running various models, I am left with highly significant and uncorrelated variables in the model. Below is a summary of the model:
Call:
glm(formula = loan_status ~ person_income + Own_Own + Own_Mortgage +
loan_amnt + HomeImp + Debt + Personal + Venture + Medical +
loan_int_rate + loan_percent_income, family = "binomial",
data = train.data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.6888123917 0.0965070171 -69.309 < 0.0000000000000002
person_income 0.0000006538 0.0000001973 3.313 0.000922
Own_Own -2.3850021635 0.0991391337 -24.057 < 0.0000000000000002
Own_Mortgage -0.8444037988 0.0377395933 -22.374 < 0.0000000000000002
loan_amnt -0.0001001720 0.0000037972 -26.381 < 0.0000000000000002
HomeImp 0.8623407431 0.0613081280 14.066 < 0.0000000000000002
Debt 0.9331440727 0.0533499202 17.491 < 0.0000000000000002
Personal 0.2227267684 0.0558353390 3.989 0.00006635
Venture -0.2651102364 0.0588289465 -4.506 0.00000659
Medical 0.6224907458 0.0515580668 12.074 < 0.0000000000000002
loan_int_rate 0.3326549007 0.0062012245 53.643 < 0.0000000000000002
loan_percent_income 15.7528184713 0.2740399808 57.484 < 0.0000000000000002
(Intercept) ***
person_income ***
Own_Own ***
Own_Mortgage ***
loan_amnt ***
HomeImp ***
Debt ***
Personal ***
Venture ***
Medical ***
loan_int_rate ***
loan_percent_income ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 35755 on 33749 degrees of freedom
Residual deviance: 24449 on 33738 degrees of freedom
AIC: 24473
Number of Fisher Scoring iterations: 6
Therefore, the final logistic regression model is given below:
\(\log\left(\frac{P(loan\_status = 1)}{1 - P(loan\_status = 1)}\right) = -6.6888123917 + 0.0000006538*(person\_income) - 2.3850021635*(Own\_Own) - 0.8444037988*(Own\_Mortgage) - 0.0001001720*(loan\_amnt) \\ + 0.8623407431*(HomeImp) + 0.9331440727*(Debt) + 0.2227267684*(Personal) - 0.2651102364*(Venture) + 0.6224907458*(Medical) \\ + 0.3326549007*(loan\_int\_rate) + 15.7528184713*(loan\_percent\_income)\)
Below is additional information about the effectiveness of this model on correctly predicting if an applicant’s loan will be approved:
prob.pred <- predict(log.reg.12, newdata = test.data, type = "response")
pred.class <- ifelse(prob.pred > 0.5, 1, 0)
accuracy <- mean(pred.class == test.data$loan_status)
print(paste("Accuracy: ", accuracy))
[1] "Accuracy: 0.849955555555556"
confusionMatrix(factor(pred.class), factor(test.data$loan_status))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 8306 1244
1 444 1256
Accuracy : 0.85
95% CI : (0.8432, 0.8565)
No Information Rate : 0.7778
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.5099
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.9493
Specificity : 0.5024
Pos Pred Value : 0.8697
Neg Pred Value : 0.7388
Prevalence : 0.7778
Detection Rate : 0.7383
Detection Prevalence : 0.8489
Balanced Accuracy : 0.7258
'Positive' Class : 0
The model had an overall accuracy of 85% on the testing data. Note, however, that it correctly classified about 95% of rejected loans but only about 50% of approved loans (the 'positive' class in the confusion matrix above is 0 = rejected).
An additional measure for comparing various models is the ROC curve. Associated with an ROC curve is an AUC value, or “Area Under the Curve”. The closer this value is to 1, the better the model is at prediction. Below is the ROC curve for our model:
roc_curve <- roc(test.data$loan_status, prob.pred)
plot(roc_curve)
auc(roc_curve)
Area under the curve: 0.8637
As shown in the graph above, this model’s AUC is 0.8637. This value will be used to compare other models.
A perceptron is a simple type of artificial neural network that makes predictions based on a linear combination of input features. It utilizes an activation function (commonly a step function) to make predictions.
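In symbols, given inputs \(x_1, \dots, x_p\), weights \(w_1, \dots, w_p\), and bias \(b\), a classic perceptron predicts
\(\hat{y} = 1 \text{ if } w_1 x_1 + \dots + w_p x_p + b > 0, \text{ and } \hat{y} = 0 \text{ otherwise.}\)
The neuralnet model fit below uses a smooth (logistic) activation in a single hidden layer rather than a hard step, but the idea is the same: weighted combinations of the inputs are passed through activation functions to produce a prediction.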
I will go through the process of building a simple perceptron and testing its accuracy using an ROC curve.
neuralData <- train.data[, c("person_age", "person_income", "person_emp_exp", "loan_amnt", "loan_int_rate", "loan_percent_income", "cb_person_cred_hist_length", "credit_score", "Male", "HS", "Associate", "Bachelor", "Master", "Own_Other", "Own_Own", "Own_Mortgage", "HomeImp", "Debt", "Personal", "Venture", "Medical", "Previous", "loan_status")]
## feature scaling
min.max <- function(x) (x - min(x)) / (max(x) - min(x))   # min-max scale a vector to [0, 1]
scale.cols <- c("person_age", "person_income", "person_emp_exp", "loan_amnt",
                "loan_int_rate", "loan_percent_income",
                "cb_person_cred_hist_length", "credit_score")
neuralData[scale.cols] <- lapply(neuralData[scale.cols], min.max)
neuralModelFormula <- model.matrix(loan_status ~ person_age + person_income + person_emp_exp +
loan_amnt + loan_int_rate + loan_percent_income
+ cb_person_cred_hist_length + credit_score +
Male + HS + Associate + Bachelor + Master +
Own_Other + Own_Own + Own_Mortgage + HomeImp +
Debt + Personal + Venture + Medical + Previous,
data = neuralData)
implicitFormula <- model.matrix(~., data = neuralData[, -ncol(neuralData)])
columnNames = colnames(implicitFormula)
columnList = paste(columnNames[-c(1,length(columnNames))], collapse = "+")
columnList = paste(c("loan_status", "~", columnList), collapse="")
modelFormula = formula(columnList)
Below is a plot of the fitted neural network for this model, accompanied by the estimated weights for each connection.
set.seed(323)
neuralData_small <- neuralData[sample(1:nrow(neuralData), 1000), ]
nn_model <- neuralnet(modelFormula, data = neuralData_small, linear.output = FALSE, hidden = c(5))
plot(nn_model, rep = "best")
nn_model$result.matrix
[,1]
error 22.029343917
reached.threshold 0.009677202
steps 12123.000000000
Intercept.to.1layhid1 2.343522359
person_age.to.1layhid1 -2.220192772
person_income.to.1layhid1 122.856875245
person_emp_exp.to.1layhid1 10.100241390
loan_amnt.to.1layhid1 0.887542235
loan_int_rate.to.1layhid1 -2.821403823
loan_percent_income.to.1layhid1 -10.820928021
cb_person_cred_hist_length.to.1layhid1 -0.325423014
credit_score.to.1layhid1 0.211061858
Male.to.1layhid1 -1.314749738
HS.to.1layhid1 0.909318135
Associate.to.1layhid1 0.533676683
Bachelor.to.1layhid1 0.791546840
Master.to.1layhid1 0.954568655
Own_Other.to.1layhid1 36.656582695
Own_Own.to.1layhid1 2.508295720
Own_Mortgage.to.1layhid1 -3.053846439
HomeImp.to.1layhid1 0.835072250
Debt.to.1layhid1 1.868844391
Personal.to.1layhid1 0.667489933
Venture.to.1layhid1 -3.679322721
Medical.to.1layhid1 1.955099128
Intercept.to.1layhid2 4.009026174
person_age.to.1layhid2 -63.583467746
person_income.to.1layhid2 354.463740574
person_emp_exp.to.1layhid2 65.070732368
loan_amnt.to.1layhid2 -19.847012500
loan_int_rate.to.1layhid2 -13.004886138
loan_percent_income.to.1layhid2 -10.803943667
cb_person_cred_hist_length.to.1layhid2 57.881480216
credit_score.to.1layhid2 9.583634227
Male.to.1layhid2 8.123490457
HS.to.1layhid2 -3.613416452
Associate.to.1layhid2 1.800272281
Bachelor.to.1layhid2 6.462884552
Master.to.1layhid2 6.228900184
Own_Other.to.1layhid2 30.176585490
Own_Own.to.1layhid2 50.422175134
Own_Mortgage.to.1layhid2 10.432800083
HomeImp.to.1layhid2 -17.737359476
Debt.to.1layhid2 3.433791142
Personal.to.1layhid2 45.302012108
Venture.to.1layhid2 -5.620044973
Medical.to.1layhid2 -14.217293003
Intercept.to.1layhid3 -2.034336185
person_age.to.1layhid3 -7.155403901
person_income.to.1layhid3 -61.314970365
person_emp_exp.to.1layhid3 13.302612821
loan_amnt.to.1layhid3 -1.565132721
loan_int_rate.to.1layhid3 2.329824951
loan_percent_income.to.1layhid3 7.884036983
cb_person_cred_hist_length.to.1layhid3 1.794780294
credit_score.to.1layhid3 0.863938824
Male.to.1layhid3 0.316483883
HS.to.1layhid3 -1.069344274
Associate.to.1layhid3 -0.308853290
Bachelor.to.1layhid3 -0.760741593
Master.to.1layhid3 2.279709201
Own_Other.to.1layhid3 -533.386560624
Own_Own.to.1layhid3 -4.549112998
Own_Mortgage.to.1layhid3 -4.776238141
HomeImp.to.1layhid3 0.025697369
Debt.to.1layhid3 -0.786603015
Personal.to.1layhid3 0.426890577
Venture.to.1layhid3 -1.195592676
Medical.to.1layhid3 0.563119039
Intercept.to.1layhid4 0.582345490
person_age.to.1layhid4 4.845605284
person_income.to.1layhid4 -86.769179056
person_emp_exp.to.1layhid4 27.642077475
loan_amnt.to.1layhid4 -0.009990142
loan_int_rate.to.1layhid4 0.929297967
loan_percent_income.to.1layhid4 -1.068298954
cb_person_cred_hist_length.to.1layhid4 -12.387329033
credit_score.to.1layhid4 -2.633312237
Male.to.1layhid4 6.578211269
HS.to.1layhid4 -3.471764012
Associate.to.1layhid4 2.807139912
Bachelor.to.1layhid4 -4.102273279
Master.to.1layhid4 2.164172177
Own_Other.to.1layhid4 39.461744394
Own_Own.to.1layhid4 -3.634502257
Own_Mortgage.to.1layhid4 2.926108097
HomeImp.to.1layhid4 -1.466014010
Debt.to.1layhid4 -12.061799004
Personal.to.1layhid4 -2.954685917
Venture.to.1layhid4 4.207105140
Medical.to.1layhid4 -1.233650834
Intercept.to.1layhid5 -1.444722328
person_age.to.1layhid5 18.003004345
person_income.to.1layhid5 82.664696946
person_emp_exp.to.1layhid5 -35.137088857
loan_amnt.to.1layhid5 -2.297558086
loan_int_rate.to.1layhid5 12.600861025
loan_percent_income.to.1layhid5 -19.928828490
cb_person_cred_hist_length.to.1layhid5 -4.624702274
credit_score.to.1layhid5 -4.575712719
Male.to.1layhid5 1.398126046
HS.to.1layhid5 -0.376239601
Associate.to.1layhid5 -1.633584357
Bachelor.to.1layhid5 -0.052538748
Master.to.1layhid5 -4.044571737
Own_Other.to.1layhid5 -531.847963859
Own_Own.to.1layhid5 2.803626744
Own_Mortgage.to.1layhid5 6.058164217
HomeImp.to.1layhid5 2.188492187
Debt.to.1layhid5 1.471973705
Personal.to.1layhid5 -1.625761996
Venture.to.1layhid5 -1.811582060
Medical.to.1layhid5 -1.972843936
Intercept.to.loan_status -0.029427523
1layhid1.to.loan_status -95.295865591
1layhid2.to.loan_status -48.306557045
1layhid3.to.loan_status 136.430114803
1layhid4.to.loan_status -50.033306216
1layhid5.to.loan_status 83.774193492
Since the perceptron does not result in a simple linear formula for prediction, it can be difficult to compare the model to others. This is why an ROC curve is necessary to calculate an AUC value, which can be compared across model types.
Below is the ROC Curve with the model’s AUC value:
predNN = predict(nn_model, newdata = test.data, linear.output = FALSE)
category = test.data$loan_status == 1
ROCobj.NN <- roc(category, predNN)
NNAUC = ROCobj.NN$auc
sen.NN = ROCobj.NN$sensitivities
fnr.NN = 1 - ROCobj.NN$specificities
plot(fnr.NN, sen.NN, type = "l", lwd = 2, col = "red",
xlim = c(0, 1),
ylim = c(0, 1),
xlab = "1 - specificity",
ylab = "sensitivity",
main = "ROC Curve of Model")
text(0.87, 0.10, paste("AUC = ", round(NNAUC,4)), cex = 0.7, adj = 1)
As shown in the graph above, this model’s AUC is 0.5, meaning its predictions are no better than random guessing. This value will be used to compare other models.
A decision tree is a tree-like structure that splits the data recursively based on feature values to create a set of decision rules. Each internal node represents a decision on a feature, each branch represents the outcome of that decision, and each leaf node represents a class label (for classification) or a numeric value (for regression).
I will create 6 different decision trees and compare them all against one another using their AUC values. The 6 decision trees will have varying node purity measures, false positive weights, and false negative weights.
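For reference, the two node purity measures used below are the Gini index and entropy (information). For a node with class proportions \(p_0\) and \(p_1\),
\(Gini = 1 - p_0^2 - p_1^2 \qquad Entropy = -p_0\log_2(p_0) - p_1\log_2(p_1)\)
Both equal 0 for a pure node and are largest when the two classes are evenly mixed, so splits are chosen to reduce these measures as much as possible.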
tree.builder = function(in.data, fp, fn, purity){
tree = rpart(loan_status ~ .,
data = in.data,
na.action = na.rpart,
method = "class",
model = FALSE,
x = FALSE,
y = TRUE,
parms = list(loss = matrix(c(0, fp, fn, 0), ncol = 2, byrow = TRUE),
split = purity),
control = rpart.control(
minsplit = 10,
minbucket = 10,
cp = 0.01,
xval = 10
))
}
gini.tree.1.1 = tree.builder(in.data = train.data, fp = 1, fn = 1, purity = "gini")
info.tree.1.1 = tree.builder(in.data = train.data, fp = 1, fn = 1, purity = "information")
gini.tree.1.10 = tree.builder(in.data = train.data, fp = 1, fn = 10, purity = "gini")
info.tree.1.10 = tree.builder(in.data = train.data, fp = 1, fn = 10, purity = "information")
gini.tree.10.1 = tree.builder(in.data = train.data, fp = 10, fn = 1, purity = "gini")
info.tree.10.1 = tree.builder(in.data = train.data, fp = 10, fn = 1, purity = "information")
Below are the 6 different decision trees:
rpart.plot(gini.tree.1.1, main = "Tree with Gini index: non-penalization")
rpart.plot(info.tree.1.1, main = "Tree with entropy: non-penalization")
rpart.plot(gini.tree.1.10, main = "Tree with Gini index: false-negative penalization")
rpart.plot(info.tree.1.10, main = "Tree with entropy: false-negative penalization")
rpart.plot(gini.tree.10.1, main = "Tree with Gini index: false-positive penalization")
rpart.plot(info.tree.10.1, main = "Tree with entropy: false-positive penalization")
To determine which of the 6 decision trees is the best, I compare their AUC values. Below is an overlapping graph of all 6 ROC curves and their AUC values:
SenSpe = function(in.data, fp, fn, purity){
cutoff = seq(0,1, length = 20)
model = tree.builder(in.data, fp, fn, purity)
pred = predict(model, newdata = in.data, type = "prob")
senspe.mtx = matrix(0, ncol = length(cutoff), nrow = 2, byrow = FALSE)
accuracy = numeric(length(cutoff))   # initialize so accuracy[i] can be assigned inside the loop
for (i in 1:length(cutoff)){
pred.out = ifelse(pred[,2] >= cutoff[i], 1, 0)
TP = sum(pred.out == 1 & in.data$loan_status == 1)
TN = sum(pred.out == 0 & in.data$loan_status == 0)
FP = sum(pred.out == 1 & in.data$loan_status == 0)
FN = sum(pred.out == 0 & in.data$loan_status == 1)
senspe.mtx[1,i] = TP/(TP + FN)
senspe.mtx[2,i] = TN/(TN + FP)
accuracy[i] = (TP + TN)/(TP + TN + FP + FN)
}
prediction = pred[, 2]
category = in.data$loan_status == 1
ROCobj <- roc(category, prediction)
AUC = auc(ROCobj)
list(senspe.mtx= senspe.mtx, AUC = round(AUC,5))
}
giniROC11 = SenSpe(in.data = train.data, fp=1, fn=1, purity="gini")
infoROC11 = SenSpe(in.data = train.data, fp=1, fn=1, purity="information")
giniROC110 = SenSpe(in.data = train.data, fp=1, fn=10, purity="gini")
infoROC110 = SenSpe(in.data = train.data, fp=1, fn=10, purity="information")
giniROC101 = SenSpe(in.data = train.data, fp=10, fn=1, purity="gini")
infoROC101 = SenSpe(in.data = train.data, fp=10, fn=1, purity="information")
par(pty = "s")
colors = c("#008B8B", "#00008B", "#8B008B", "#8B0000", "#8B8B00", "#8B4500")
plot(1 - giniROC11$senspe.mtx[2,], giniROC11$senspe.mtx[1,],
type = "l",
xlim = c(0,1),
ylim = c(0,1),
xlab = "1 - specificity: FPR", ylab = "Sensitivity: TPR",
col = colors[1],
lwd = 2,
main = "ROC Curves of Decision Trees",
cex.main = 0.9,
col.main = "navy")
abline(0,1, lty = 2, col = "orchid4", lwd = 2)
lines(1 - infoROC11$senspe.mtx[2, ], infoROC11$senspe.mtx[1, ],
col = colors[2], lwd = 2, lty=2)
lines(1 - giniROC110$senspe.mtx[2, ], giniROC110$senspe.mtx[1, ],
col = colors[3], lwd = 2)
lines(1 - infoROC110$senspe.mtx[2, ], infoROC110$senspe.mtx[1, ],
col = colors[4], lwd = 2, lty=2)
lines(1 - giniROC101$senspe.mtx[2, ], giniROC101$senspe.mtx[1, ],
col = colors[5], lwd = 2, lty = 4)
lines(1 - infoROC101$senspe.mtx[2, ], infoROC101$senspe.mtx[1, ],
col = colors[6], lwd = 2, lty=2)
legend("bottomright", c(paste("gini.1.1, AUC =", giniROC11$AUC),
paste("info.1.1, AUC =",infoROC11$AUC),
paste("gini.1.10, AUC =",giniROC110$AUC),
paste("info.1.10, AUC =",infoROC110$AUC),
paste("gini.10.1, AUC =",giniROC101$AUC),
paste("info.10.1, AUC =",infoROC101$AUC)),
col=colors,
lty = c(1, 2, 1, 2, 4, 2), lwd = rep(2, 6), cex = 0.8, bty = "n")
As shown by the ROC plot above, the gini.1.1 decision tree performed the best, with an AUC of 0.94138, making it the best performing of the available trees. A copy of this decision tree is below:
rpart.plot(gini.tree.1.1, main = "Tree with Gini index: non-penalization")
Bagging is an ensemble method that aims to reduce variance by training multiple models on different subsets of the data and then aggregating their predictions. It works by bootstrapping the dataset and fitting a model on each bootstrap sample.
I will train the bagging model to create an ensemble of decision trees whose parameters may or may not differ from those of the decision tree above. With the given parameters, I will evaluate the performance of the bagged model and ultimately provide an ROC curve with an AUC value.
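To make the bootstrap-aggregation idea concrete, here is a minimal hand-rolled sketch, separate from the ipred::bagging call used below: each tree is fit on a bootstrap resample of the training data, and the predicted probabilities are averaged across trees. The number of trees and the default rpart settings here are illustrative only.
# Hand-rolled bagging sketch: fit classification trees on bootstrap resamples and average their predictions
set.seed(323)
n.trees <- 25
boot.preds <- matrix(0, nrow = nrow(test.data), ncol = n.trees)
for (b in 1:n.trees) {
  boot.idx <- sample(1:nrow(train.data), replace = TRUE)                  # bootstrap resample
  boot.tree <- rpart(loan_status ~ ., data = train.data[boot.idx, ], method = "class")
  boot.preds[, b] <- predict(boot.tree, newdata = test.data, type = "prob")[, 2]   # P(loan_status = 1)
}
bagged.prob <- rowMeans(boot.preds)   # aggregate by averaging the predicted probabilities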
train = train.data
test = test.data
Loan.bag.train <- bagging(as.factor(loan_status) ~ .,
data = train,
nbagg = 150,
coob = TRUE,
parms = list(loss = matrix(c(0, 10, 1, 0),
ncol = 2,
byrow = TRUE),
split = "gini"),
control = rpart.control(minsplit = 10,
cp = 0.02))
pred = predict(Loan.bag.train, test, type = "prob")
cut.prob = seq(0, 1, length = 20)
senspe.mtx = matrix(0, ncol = length(cut.prob), nrow = 3, byrow = FALSE)
for (i in 1:length(cut.prob)){
pred.out = ifelse(pred[, "1"] >= cut.prob[i], "1", "0")
TP = sum(pred.out == "1" & test$loan_status == 1)
TN = sum(pred.out == "0" & test$loan_status == 0)
FP = sum(pred.out == "1" & test$loan_status == 0)
FN = sum(pred.out == "0" & test$loan_status == 1)
senspe.mtx[1, i] = TP / (TP + FN)
senspe.mtx[2, i] = TN / (TN + FP)
senspe.mtx[3, i] = (TP + TN) / (TP + FN + TN + FP)
}
prediction = pred[, "1"]
category = test$loan_status == 1
ROCobj <- roc(category, prediction)
AUC = auc(ROCobj)
AUC = round(as.vector(AUC[1]), 3)
n = length(senspe.mtx[3,])
idx = which(senspe.mtx[3,] == max(senspe.mtx[3,]))
tick.label = as.character(round(cut.prob,2))
par(mfrow = c(1, 2))
plot(1 - senspe.mtx[2, ], senspe.mtx[1, ],
type = "l", lwd = 2, col = "navy", xlim = c(0, 1),
ylim = c(0, 1), xlab = "1 - Specificity",
ylab = "Sensitivity", main = "ROC (Testing Data)",
cex.main = 0.8)
segments(0, 0, 1, 1, lty = 2, col = "red")
legend("bottomright", c(paste("AUC =", AUC)), bty = "n", cex = 0.8)
plot(1:length(cut.prob), senspe.mtx[3,],
xlab = "Cut-off Probability", ylab = "Accuracy",
ylim = c(min(senspe.mtx[3,]), 1), axes = FALSE,
main = "Cut-off vs Accuracy", cex.main = 0.9, col.main = "navy")
axis(1, at = 1:20, label = round(cut.prob, 2), las = 2)
axis(2)
points(idx, senspe.mtx[3,][idx], pch = 19, col = "red")
segments(idx, min(senspe.mtx[3,]), idx, senspe.mtx[3,][idx], col = "red")
legend("topright", c(paste("Optimal cut-off prob = ", round(median(cut.prob[idx]), 3), sep = ""),
paste("Accuracy = ", round(mean(senspe.mtx[3,][idx]), 3), sep = "")),
cex = 0.8, bty = "n", adj = -0)
As shown in the graph above, this model’s AUC is 0.866. This value will be used to compare other models.
Below is a list of the different modeling techniques and their corresponding AUC values, in order from greatest to least:
Decision Tree: 0.94138
Bagging: 0.866
Logistic Regression: 0.8637
Perceptron: 0.5
Therefore, the most effective model for our given dataset is the decision tree model, with an AUC of 0.94138. A copy of the decision tree is below:
rpart.plot(gini.tree.1.1, main = "Tree with Gini index: non-penalization")
Now that the best model has been selected, I will address the questions asked at the beginning of the report.
As shown in the decision tree above, the following variables are significant in predicting an applicant’s chances of getting their loan approved:
previous_loan_defaults_on_file
loan_percent_income
loan_int_rate
person_income
person_home_ownership
The only demographic variables that influence the chances of a loan request being approved are previous_loan_defaults_on_file, person_income, and person_home_ownership. Other demographic variables such as person_age and person_gender did not have significant influence.
While the decision tree was best for the provided dataset, instances may occur where different modeling techniques are found to be more effective. I encourage everyone to do their own research when determining the best modeling technique.