Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.
Switch variables to generate 2 decision trees and compare the results.
Create a random forest for regression and analyze the results.
Based on real cases where decision trees went wrong, and 'the bad & ugly' aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
For this project, I will use the Loan Applicant data set from Kaggle: https://www.kaggle.com/datasets/angadgupta/loanapplicantdata. The data set contains 13 variables and 614 observations. The first variable, Loan_ID, is deleted because it has no influence on the outcome. The variable Loan_Status is the target variable. The objective of this project is to determine the factors that affect whether a loan application is approved or rejected.
Loan_ID - Unique Loan ID
Gender - Male/Female
Married - Applicant married (Y/N)
Dependents - Number of dependents
Education - Applicant education (Graduate/Not Graduate)
Self_Employed - Self-employed (Y/N)
ApplicantIncome - Applicant income
CoapplicantIncome - Co-applicant income
LoanAmount - Loan amount in thousands
Loan_Amount_Term - Term of loan in months
Credit_History - Credit history (1 for Yes, 0 for No)
Property_Area - Property area (Semiurban, Urban, Rural)
Loan_Status - Loan status (Y/N)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(skimr)
library(rpart)
library(rpart.plot)
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(ggplot2)
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
Loan <- read_csv("https://storage.googleapis.com/kagglesdsdata/datasets/1419016/2350351/LoanApplicantData.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20220330%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20220330T005456Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=01e66c108411ae0cdb4712ae4dc548d6fe1488d024b0e2c21cddce41fe945c8b48d335b56ddb323cecee6f1d10cf8843c47a19395cf61a4ca0d6304a8b5f7a03cee7f4d3a052aef3cb05a80f06c3ab7827d63a79c0b92d3179a946e78c10b27fc404897c8840838b50c78f93bf0f4ea0cc93ba459d8b8abe5b8702f81cc017a7cc28f21bfbd4792fb0e29a0a37354d237cc3b067e93f97a4008bb64abb993354bc15e620970e86273bc75093de351005319a1c2cdaac60ff35b71f6914b2c520b92dd1ccf5c8d0f77635a232a26ed7e60a383ef41de08f3501dba19bf8fe652dfe141ca5ee76d6b03fffe7f465df51ae7940b204222e03df508bbf0f50e8aa94", col_types = 'fffnffnnnnfff')
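The signed storage URL above is temporary (note the X-Goog-Expires parameter), so this chunk stops working once the link expires. A more durable alternative, assuming the CSV has been downloaded from the Kaggle page into the working directory under the (assumed) name LoanApplicantData.csv, is to read the local copy with the same column specification:
# Read a local copy of the Kaggle file using the same column types as above
Loan <- read_csv("LoanApplicantData.csv", col_types = 'fffnffnnnnfff')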
skim(Loan)
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
| Name | Loan |
|---|---|
| Number of rows | 614 |
| Number of columns | 13 |
| Column type frequency: | |
| factor | 8 |
| numeric | 5 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Loan_ID | 0 | 1 | FALSE | 614 | LP0: 1, LP0: 1, LP0: 1, LP0: 1 |
| Gender | 0 | 1 | FALSE | 3 | Mal: 489, Fem: 112, emp: 13 |
| Married | 0 | 1 | FALSE | 3 | Yes: 398, No: 213, emp: 3 |
| Education | 0 | 1 | FALSE | 2 | Gra: 480, Not: 134 |
| Self_Employed | 0 | 1 | FALSE | 3 | No: 500, Yes: 82, emp: 32 |
| Credit_History | 0 | 1 | FALSE | 3 | 1: 475, 0: 89, emp: 50 |
| Property_Area | 0 | 1 | FALSE | 3 | Sem: 233, Urb: 202, Rur: 179 |
| Loan_Status | 0 | 1 | FALSE | 2 | Y: 422, N: 192 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Dependents | 15 | 0.98 | 0.76 | 1.02 | 0 | 0.0 | 0.0 | 2.00 | 3 | ▇▂▁▂▁ |
| ApplicantIncome | 0 | 1.00 | 5403.46 | 6109.04 | 150 | 2877.5 | 3812.5 | 5795.00 | 81000 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1.00 | 1621.25 | 2926.25 | 0 | 0.0 | 1188.5 | 2297.25 | 41667 | ▇▁▁▁▁ |
| LoanAmount | 22 | 0.96 | 146.41 | 85.59 | 9 | 100.0 | 128.0 | 168.00 | 700 | ▇▃▁▁▁ |
| Loan_Amount_Term | 14 | 0.98 | 342.00 | 65.12 | 12 | 360.0 | 360.0 | 360.00 | 480 | ▁▁▁▇▁ |
The skim() output shows that there are missing values in the Dependents, LoanAmount, and Loan_Amount_Term columns (the warnings also indicate that several factor columns contain empty strings, which skim reports as an "empty" level). The target variable Loan_Status has 422 'Y' and 192 'N' values, a sign of class imbalance that will be handled with the SMOTE() function.
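As a quick cross-check of the missing values reported by skim(), the per-column NA counts and the class balance can be tallied directly (a minimal sketch):
# Count missing values in each column; Dependents, LoanAmount and Loan_Amount_Term
# should show the 15, 22 and 14 NAs reported by skim()
colSums(is.na(Loan))
# Class balance of the target variable
table(Loan$Loan_Status)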
ggplot(Loan, aes(x = Loan_Status, y = ApplicantIncome)) + geom_boxplot() + ylim(0, 30000)
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
ggplot(Loan, aes(x = Loan_Status, y = CoapplicantIncome)) + geom_boxplot() + ylim(0, 10000)
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
According to the box plots, the median applicant income is roughly the same for approved and rejected applications. However, the median co-applicant income is higher for approved applications than for rejected ones.
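The medians behind the box plots can also be computed directly to support this reading (a sketch using the already-loaded dplyr):
# Median applicant and co-applicant income by loan outcome
Loan %>%
  group_by(Loan_Status) %>%
  summarise(median_applicant = median(ApplicantIncome),
            median_coapplicant = median(CoapplicantIncome))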
ggplot(Loan, aes(x = Loan_Status, y = Education, color = Education)) +
  geom_bar(stat = "identity")
ggplot(Loan, aes(x = Loan_Status, y = Property_Area, color = Property_Area)) +
  geom_bar(stat = "identity")
The bar plots show that a higher proportion of Graduate applicants had their loans approved compared to the Not Graduate group. Likewise, applicants with property in an Urban area appear most likely to be approved, followed by Rural and Semiurban.
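Because raw counts can be misleading when the groups differ in size, the approval rate within each group makes the comparison more direct (a minimal sketch, not part of the original output):
# Share of approved (Y) and rejected (N) applications within each education level
round(prop.table(table(Loan$Education, Loan$Loan_Status), margin = 1), 2)
# Share of approved and rejected applications within each property area
round(prop.table(table(Loan$Property_Area, Loan$Loan_Status), margin = 1), 2)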
First, remove the unnecessary variable Loan_ID. The decision tree model is convenient to build because rpart handles noise and missing values on its own. The only remaining concern is class imbalance: as the skim() output shows, Loan_Status contains 422 Y and 192 N. Therefore, SMOTE() is used to fix the class imbalance.
Loan <- Loan[, -1]
round(prop.table(table(select(Loan, Loan_Status), exclude = NULL)), 4) * 100
##
## Y N
## 68.73 31.27
set.seed(1234)
Loan <- SMOTE(factor(Loan_Status) ~ ., data.frame(Loan), perc.over = 100, perc.under = 200)
round(prop.table(table(select(Loan, Loan_Status), exclude = NULL)), 4) * 100
##
## Y N
## 50 50
Now the proportions of Y and N are 50/50. With perc.over = 100, each of the 192 minority (N) cases gains one synthetic counterpart (384 N in total), and perc.under = 200 keeps twice that many majority (Y) cases (384 Y), so the balanced data set has 768 rows.
Split the data set 75/25 into a training set and a test set.
set.seed(1234)
split <- sample(nrow(Loan), round(nrow(Loan)*0.75), replace = F)
train <- Loan[split,]
test <- Loan[-split,]
round(prop.table(table(select(train, Loan_Status), exclude = NULL)), 4) * 100
##
## Y N
## 49.83 50.17
round(prop.table(table(select(test, Loan_Status), exclude = NULL)), 4) * 100
##
## Y N
## 50.52 49.48
With the 75/25 split, there are 576 observations in the training data and 192 observations in the test data.
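The partition sizes can be verified directly (a quick sanity check):
# Number of rows in each partition after the 75/25 split
nrow(train)
nrow(test)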
mod <- rpart(Loan_Status ~ ., method = 'class', data = Loan)
rpart.plot(mod)
According to the rpart.plot output, the key variable at the root node is Credit_History: if there is no credit history, the tree goes directly to rejection. The next key variable is the co-applicant's income, followed by whether the property is located in an urban area. Since Credit_History and the co-applicant's income are the key factors, I will build another tree model without these two variables and compare the results.
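The ranking of key variables suggested by the plot can also be read off the fitted object's variable importance scores (a sketch; the exact values depend on the fitted tree):
# Variable importance accumulated over primary and surrogate splits
mod$variable.importance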
loan <- Loan[, -c(7, 10)]
mod2 <- rpart(Loan_Status ~ ., method = 'class', data = loan)
rpart.plot(mod2)
Without the credit history and co-applicant income information, LoanAmount becomes the root node, followed by marital status, the applicant's income, and the property location.
mod_pred <- predict(mod, test, type = 'class')
mod_table <- table(test$Loan_Status, mod_pred)
mod_table
## mod_pred
## Y N
## Y 89 8
## N 33 62
mod_accuracy <- sum(diag(mod_table)) / nrow(test)
print(paste('The first tree model accuracy is', mod_accuracy))
## [1] "The first tree model accuracy is 0.786458333333333"
The first decision tree model achieves 78.65% accuracy on the test data. Let's try the second decision tree model to see how the accuracy changes.
mod2_pred <- predict(mod2, test, type = 'class')
mod2_table <- table(test$Loan_Status, mod2_pred)
mod2_table
## mod2_pred
## Y N
## Y 80 17
## N 37 58
mod2_accuracy <- sum(diag(mod2_table)) / nrow(test)
print(paste('The second tree model accuracy is', mod2_accuracy))
## [1] "The second tree model accuracy is 0.71875"
The comparison shows that the model that includes Credit_History and CoapplicantIncome is about 7 percentage points more accurate on the test data (78.65% vs. 71.88%).
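Since caret is already loaded, the same two comparisons can optionally be summarised with confusionMatrix(), which also reports sensitivity and specificity for each tree (a sketch, not part of the original output):
# Detailed metrics for the first tree (with Credit_History and CoapplicantIncome)
confusionMatrix(mod_pred, test$Loan_Status)
# Detailed metrics for the second tree (without those two variables)
confusionMatrix(mod2_pred, test$Loan_Status)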
forest_mod <- randomForest(Loan_Status ~ ., ntree = 2000, importance = T, data = train, na.action = na.roughfix)
forest_pred <- predict(forest_mod, test)
confusionMatrix(forest_pred, test$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Y N
## Y 74 27
## N 10 61
##
## Accuracy : 0.7849
## 95% CI : (0.7159, 0.8438)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : 1.25e-13
##
## Kappa : 0.5715
##
## Mcnemar's Test P-Value : 0.008529
##
## Sensitivity : 0.8810
## Specificity : 0.6932
## Pos Pred Value : 0.7327
## Neg Pred Value : 0.8592
## Prevalence : 0.4884
## Detection Rate : 0.4302
## Detection Prevalence : 0.5872
## Balanced Accuracy : 0.7871
##
## 'Positive' Class : Y
##
Setting na.action = na.roughfix lets the random forest replace missing values with the column median (for numeric variables) or the most frequent level (for factors). In this case the random forest reaches an accuracy of 78.49% on the test data, essentially the same as the first decision tree model (78.65%).
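Because the forest was fitted with importance = T, the variable importance measures can be extracted to see which predictors drive the approval decision (a sketch):
# Mean decrease in accuracy and in Gini impurity for each predictor
importance(forest_mod)
# Dot chart of the same importance measures
varImpPlot(forest_mod)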
Based on the article, decision trees can become very complicated and fragile when the controlling variables change. Therefore, when creating and maintaining the decision tree model that supports loan-approval decisions, all of the variables that affect the approval should be considered from the beginning, so that later changes do not cause the decision process to fail.
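One practical way to counter the "overly complicated tree" criticism from the article is to prune the tree using the complexity-parameter table that rpart already records during cross-validation; a minimal sketch using the first tree:
# Cross-validated error for each candidate tree size
printcp(mod)
# Pick the complexity parameter with the lowest cross-validated error and prune
best_cp <- mod$cptable[which.min(mod$cptable[, "xerror"]), "CP"]
mod_pruned <- prune(mod, cp = best_cp)
rpart.plot(mod_pruned)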