Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.
Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
Format: document with screen captures & analysis.
I will use the loan approval status dataset that I uploaded to GitHub.
## [1] 614 13
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LP001002 | Male | No | 0 | Graduate | No | 5849 | 0 | NA | 360 | 1 | Urban | Y |
LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508 | 128 | 360 | 1 | Rural | N |
LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0 | 66 | 360 | 1 | Urban | Y |
LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358 | 120 | 360 | 1 | Urban | Y |
LP001008 | Male | No | 0 | Graduate | No | 6000 | 0 | 141 | 360 | 1 | Urban | Y |
LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196 | 267 | 360 | 1 | Urban | Y |
For this dataset, the target variable for my analysis is Loan_Status.
Below is the summary of the missing values per feature.
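A minimal sketch of how a per-feature count of missing values could be produced in base R. The small `loans` data frame is made up for illustration. Blank strings are counted separately because `is.na()` does not flag them, and several categorical columns in this data appear to contain blanks rather than `NA`.

```r
# Toy stand-in for the loan data (values are made up for illustration).
loans <- data.frame(
  Gender         = c("Male", "", "Female", "Male"),
  LoanAmount     = c(NA, 128, 66, NA),
  Credit_History = c(1, NA, 1, 0),
  stringsAsFactors = FALSE
)

# Count NA values per column.
na_counts <- colSums(is.na(loans))

# Count blank strings per column; is.na() does not treat "" as missing.
blank_counts <- sapply(loans, function(x) sum(!is.na(x) & as.character(x) == ""))

missing_summary <- data.frame(na = na_counts, blank = blank_counts)
print(missing_summary)
```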
I will remove the Loan_ID variable, convert the categorical variables into factors, then impute the missing values with the mice package using its random forest method.
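The preprocessing could look roughly like the sketch below. The `raw` data frame and the helper name `impute_rf` are hypothetical, and turning blank strings into `NA` is an extra step that is not part of the workflow described here. The `mice` call is wrapped in a function so that it is only evaluated when called; it assumes the mice package and a random forest backend are installed.

```r
# Toy stand-in for the raw data (values are made up for illustration).
raw <- data.frame(
  Loan_ID     = c("LP1", "LP2", "LP3", "LP4"),
  Gender      = c("Male", "", "Female", "Male"),
  LoanAmount  = c(NA, 128, 66, 120),
  Loan_Status = c("Y", "N", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Drop the identifier, which carries no predictive information.
loans <- raw[, setdiff(names(raw), "Loan_ID"), drop = FALSE]

# Turn blank strings into NA (extra step, see the note above) and
# convert character columns to factors.
is_chr <- vapply(loans, is.character, logical(1))
loans[is_chr] <- lapply(loans[is_chr], function(x) {
  x[!is.na(x) & x == ""] <- NA
  factor(x)
})

# Random forest imputation with mice (hypothetical helper).
# m = 5 imputations and maxit = 5 iterations are the mice defaults.
impute_rf <- function(df, seed = 123) {
  imp <- mice::mice(df, m = 5, maxit = 5, method = "rf",
                    seed = seed, printFlag = FALSE)
  mice::complete(imp, 1)  # first of the m completed data sets
}
```

Calling `impute_rf(loans)` on the full data would return one completed data set. With `printFlag = TRUE`, the default, mice prints one line per iteration and imputation.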
##
## iter imp variable
## 1 1 LoanAmount Loan_Amount_Term Credit_History
## 1 2 LoanAmount Loan_Amount_Term Credit_History
## 1 3 LoanAmount Loan_Amount_Term Credit_History
## 1 4 LoanAmount Loan_Amount_Term Credit_History
## 1 5 LoanAmount Loan_Amount_Term Credit_History
## 2 1 LoanAmount Loan_Amount_Term Credit_History
## 2 2 LoanAmount Loan_Amount_Term Credit_History
## 2 3 LoanAmount Loan_Amount_Term Credit_History
## 2 4 LoanAmount Loan_Amount_Term Credit_History
## 2 5 LoanAmount Loan_Amount_Term Credit_History
## 3 1 LoanAmount Loan_Amount_Term Credit_History
## 3 2 LoanAmount Loan_Amount_Term Credit_History
## 3 3 LoanAmount Loan_Amount_Term Credit_History
## 3 4 LoanAmount Loan_Amount_Term Credit_History
## 3 5 LoanAmount Loan_Amount_Term Credit_History
## 4 1 LoanAmount Loan_Amount_Term Credit_History
## 4 2 LoanAmount Loan_Amount_Term Credit_History
## 4 3 LoanAmount Loan_Amount_Term Credit_History
## 4 4 LoanAmount Loan_Amount_Term Credit_History
## 4 5 LoanAmount Loan_Amount_Term Credit_History
## 5 1 LoanAmount Loan_Amount_Term Credit_History
## 5 2 LoanAmount Loan_Amount_Term Credit_History
## 5 3 LoanAmount Loan_Amount_Term Credit_History
## 5 4 LoanAmount Loan_Amount_Term Credit_History
## 5 5 LoanAmount Loan_Amount_Term Credit_History
Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
---|---|---|---|---|---|---|---|---|---|---|---|
(blank): 13 | (blank): 3 | (blank): 15 | Graduate :480 | (blank): 32 | Min. : 150 | Min. : 0 | Min. : 9.0 | Min. : 12 | 0: 92 | Rural :179 | N:192 |
Female:112 | No :213 | 0 :345 | Not Graduate:134 | No :500 | 1st Qu.: 2878 | 1st Qu.: 0 | 1st Qu.:100.0 | 1st Qu.:360 | 1:522 | Semiurban:233 | Y:422 |
Male :489 | Yes:398 | 1 :102 | NA | Yes: 82 | Median : 3812 | Median : 1188 | Median :128.0 | Median :360 | NA | Urban :202 | NA |
NA | NA | 2 :101 | NA | NA | Mean : 5403 | Mean : 1621 | Mean :146.1 | Mean :342 | NA | NA | NA |
NA | NA | 3+: 51 | NA | NA | 3rd Qu.: 5795 | 3rd Qu.: 2297 | 3rd Qu.:167.8 | 3rd Qu.:360 | NA | NA | NA |
NA | NA | NA | NA | NA | Max. :81000 | Max. :41667 | Max. :700.0 | Max. :480 | NA | NA | NA |
I will split the data into training and testing sets in an 80:20 ratio.
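The code used for the split is not shown, so the following is only a sketch under assumptions. It draws 80% of the rows within each class, rounding up, which is one way to get a stratified split (the kind `caret::createDataPartition` is designed to produce). The labels are a toy vector built with the class counts from the summary above (192 N, 422 Y), and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Toy labels with the same class counts as the data (192 N, 422 Y).
status <- factor(rep(c("N", "Y"), times = c(192, 422)))

# Sample 80% of the row indices within each class, rounding up.
train_idx <- unlist(lapply(split(seq_along(status), status), function(idx) {
  idx[sample.int(length(idx), ceiling(0.8 * length(idx)))]
}))

# Everything not sampled for training is held out for testing.
test_idx <- setdiff(seq_along(status), train_idx)

table(status[train_idx])  # class counts in the training part
length(test_idx)          # size of the held-out part
```

Under these assumptions the training part has 154 N and 338 Y rows (492 in total) and the held-out part has 122 rows.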
## Call:
## rpart(formula = Loan_Status ~ ., data = train_factored, method = "class")
## n= 492
##
## CP nsplit rel error xerror xstd
## 1 0.4090909 0 1.0000000 1.0000000 0.06679061
## 2 0.0100000 1 0.5909091 0.5909091 0.05592289
##
## Variable importance
## Credit_History
## 100
##
## Node number 1: 492 observations, complexity param=0.4090909
## predicted class=Y expected loss=0.3130081 P(node) =1
## class counts: 154 338
## probabilities: 0.313 0.687
## left son=2 (69 obs) right son=3 (423 obs)
## Primary splits:
## Credit_History splits as LR, improve=66.469020, (0 missing)
## Property_Area splits as LRL, improve= 3.966673, (0 missing)
## Education splits as RL, improve= 2.301810, (0 missing)
## ApplicantIncome < 1903 to the left, improve= 2.256965, (0 missing)
## Loan_Amount_Term < 420 to the right, improve= 1.797663, (0 missing)
##
## Node number 2: 69 observations
## predicted class=N expected loss=0.04347826 P(node) =0.1402439
## class counts: 66 3
## probabilities: 0.957 0.043
##
## Node number 3: 423 observations
## predicted class=Y expected loss=0.2080378 P(node) =0.8597561
## class counts: 88 335
## probabilities: 0.208 0.792
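Credit_History is the only variable the tree actually uses: the single split on it accounts for 100% of the variable importance, and its improvement (66.47) is far ahead of the next best candidate, Property_Area (3.97).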
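The `improve` value reported for the Credit_History split can be checked by hand. The sketch below assumes the split criterion is the Gini index, which is the rpart default for classification, and uses the class counts printed for nodes 1, 2 and 3.

```r
# Gini impurity of a node given its class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Class counts (N, Y) taken from the rpart summary above.
root  <- c(N = 154, Y = 338)  # node 1
left  <- c(N = 66,  Y = 3)    # node 2 (left son)
right <- c(N = 88,  Y = 335)  # node 3 (right son)

# Impurity reduction, with each node weighted by its number of observations.
improve <- sum(root) * gini(root) -
  (sum(left) * gini(left) + sum(right) * gini(right))

round(improve, 3)  # 66.469
```

The result agrees with the `improve=66.469020` shown for the Credit_History split.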
I will now apply the decision tree to the test data and create a confusion table to evaluate the accuracy of the classifications.
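A minimal sketch of this step on made-up data, assuming the rpart package is available. The important detail is that both the predictions and the reference labels come from the held-out rows. The toy data are built so that `Income` is unrelated to the outcome, and holding out one income group is only a convenient way to get a deterministic split.

```r
library(rpart)

# Toy data (made up): one 20-row pattern repeated for five income values,
# so Income carries no information about Loan_Status.
pattern <- data.frame(
  Credit_History = factor(rep(c("0", "1"), times = c(4, 16))),
  Loan_Status    = factor(c(rep("N", 4), rep("N", 3), rep("Y", 13)))
)
toy <- do.call(rbind, lapply(1:5 * 1000, function(inc) cbind(pattern, Income = inc)))

# Hold out one income group as the test set (80 training rows, 20 test rows).
train <- toy[toy$Income < 5000, ]
test  <- toy[toy$Income == 5000, ]

fit <- rpart(Loan_Status ~ ., data = train, method = "class")

# Predicted classes for the held-out rows only.
pred <- predict(fit, newdata = test, type = "class")

# Confusion table: predictions against the true test labels.
conf <- table(Prediction = pred, Reference = test$Loan_Status)
print(conf)
```

If caret is installed, `caret::confusionMatrix(pred, test$Loan_Status)` adds accuracy, sensitivity and the other statistics to the same table.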
## N Y
## 1 0.2080378 0.79196217
## 8 0.9565217 0.04347826
## 9 0.2080378 0.79196217
## 10 0.2080378 0.79196217
## 21 0.9565217 0.04347826
## 34 0.2080378 0.79196217
## 39 0.2080378 0.79196217
## 41 0.2080378 0.79196217
## 42 0.2080378 0.79196217
## 45 0.2080378 0.79196217
## 47 0.2080378 0.79196217
## 48 0.2080378 0.79196217
## 50 0.2080378 0.79196217
## 52 0.2080378 0.79196217
## 54 0.2080378 0.79196217
## 57 0.2080378 0.79196217
## 59 0.2080378 0.79196217
## 70 0.9565217 0.04347826
## 74 0.9565217 0.04347826
## 77 0.2080378 0.79196217
## 80 0.2080378 0.79196217
## 86 0.2080378 0.79196217
## 88 0.2080378 0.79196217
## 98 0.2080378 0.79196217
## 114 0.2080378 0.79196217
## 118 0.2080378 0.79196217
## 123 0.9565217 0.04347826
## 136 0.2080378 0.79196217
## 139 0.9565217 0.04347826
## 144 0.2080378 0.79196217
## 146 0.2080378 0.79196217
## 148 0.2080378 0.79196217
## 149 0.2080378 0.79196217
## 155 0.2080378 0.79196217
## 158 0.2080378 0.79196217
## 161 0.2080378 0.79196217
## 163 0.9565217 0.04347826
## 182 0.2080378 0.79196217
## 195 0.2080378 0.79196217
## 198 0.2080378 0.79196217
## 200 0.2080378 0.79196217
## 202 0.9565217 0.04347826
## 216 0.2080378 0.79196217
## 220 0.2080378 0.79196217
## 226 0.2080378 0.79196217
## 233 0.2080378 0.79196217
## 242 0.2080378 0.79196217
## 247 0.2080378 0.79196217
## 249 0.2080378 0.79196217
## 255 0.9565217 0.04347826
## 256 0.2080378 0.79196217
## 261 0.2080378 0.79196217
## 268 0.9565217 0.04347826
## 270 0.2080378 0.79196217
## 273 0.2080378 0.79196217
## 275 0.2080378 0.79196217
## 280 0.2080378 0.79196217
## 281 0.9565217 0.04347826
## 292 0.9565217 0.04347826
## 299 0.2080378 0.79196217
## 311 0.2080378 0.79196217
## 312 0.2080378 0.79196217
## 317 0.2080378 0.79196217
## 325 0.2080378 0.79196217
## 327 0.9565217 0.04347826
## 328 0.2080378 0.79196217
## 332 0.2080378 0.79196217
## 336 0.2080378 0.79196217
## 344 0.2080378 0.79196217
## 347 0.9565217 0.04347826
## 348 0.2080378 0.79196217
## 351 0.2080378 0.79196217
## 352 0.2080378 0.79196217
## 355 0.2080378 0.79196217
## 356 0.2080378 0.79196217
## 367 0.2080378 0.79196217
## 371 0.2080378 0.79196217
## 374 0.9565217 0.04347826
## 375 0.2080378 0.79196217
## 376 0.2080378 0.79196217
## 385 0.2080378 0.79196217
## 390 0.2080378 0.79196217
## 396 0.2080378 0.79196217
## 397 0.9565217 0.04347826
## 399 0.2080378 0.79196217
## 401 0.9565217 0.04347826
## 404 0.2080378 0.79196217
## 412 0.2080378 0.79196217
## 419 0.2080378 0.79196217
## 422 0.9565217 0.04347826
## 434 0.2080378 0.79196217
## 436 0.2080378 0.79196217
## 460 0.2080378 0.79196217
## 463 0.2080378 0.79196217
## 468 0.2080378 0.79196217
## 487 0.9565217 0.04347826
## 490 0.2080378 0.79196217
## 495 0.9565217 0.04347826
## 496 0.2080378 0.79196217
## 500 0.9565217 0.04347826
## 503 0.2080378 0.79196217
## 530 0.2080378 0.79196217
## 532 0.2080378 0.79196217
## 536 0.2080378 0.79196217
## 544 0.2080378 0.79196217
## 546 0.2080378 0.79196217
## 547 0.2080378 0.79196217
## 549 0.9565217 0.04347826
## 556 0.2080378 0.79196217
## 559 0.2080378 0.79196217
## 561 0.2080378 0.79196217
## 562 0.2080378 0.79196217
## 566 0.2080378 0.79196217
## 572 0.9565217 0.04347826
## 575 0.2080378 0.79196217
## 579 0.2080378 0.79196217
## 580 0.2080378 0.79196217
## 582 0.2080378 0.79196217
## 593 0.2080378 0.79196217
## 599 0.2080378 0.79196217
## 603 0.2080378 0.79196217
## 610 0.2080378 0.79196217
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 66 3
## Y 88 335
##
## Accuracy : 0.815
## 95% CI : (0.7779, 0.8484)
## No Information Rate : 0.687
## P-Value [Acc > NIR] : 9.652e-11
##
## Kappa : 0.4939
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4286
## Specificity : 0.9911
## Pos Pred Value : 0.9565
## Neg Pred Value : 0.7920
## Prevalence : 0.3130
## Detection Rate : 0.1341
## Detection Prevalence : 0.1402
## Balanced Accuracy : 0.7098
##
## 'Positive' Class : N
##
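The model is 81.5% accurate. However, the confusion matrix above contains 492 observations, which is the size of the training set (n = 492), and its cells match the class counts of nodes 2 and 3 in the tree, so this figure describes the training data rather than the 122 held-out test rows. With N as the positive class, sensitivity is only 42.9%: the tree identifies fewer than half of the rejected loans.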
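The main statistics can be recomputed from the four cell counts, which makes it easier to see what each one measures. This is a sketch in base R with the counts copied from the confusion matrix above and N treated as the positive class, as in that output.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference).
conf <- matrix(c(66, 88, 3, 335), nrow = 2,
               dimnames = list(Prediction = c("N", "Y"), Reference = c("N", "Y")))

accuracy    <- sum(diag(conf)) / sum(conf)          # correct predictions / all rows
sensitivity <- conf["N", "N"] / sum(conf[, "N"])    # actual N predicted as N
specificity <- conf["Y", "Y"] / sum(conf[, "Y"])    # actual Y predicted as Y
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
# accuracy 0.8150, sensitivity 0.4286, specificity 0.9911, balanced 0.7098
```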
Random Forest reduces the overfitting problem of single decision trees by averaging many trees grown on bootstrap samples, which lowers the variance and usually improves accuracy.
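A sketch of fitting such a forest, assuming the randomForest package is installed; the helper name `fit_forest` and the toy data are made up. As far as I know, `randomForest()` has no `method` argument: it fits a classification forest whenever the response is a factor, so the sketch leaves that argument out.

```r
# Fit a classification forest (hypothetical helper).
# Assumes the randomForest package is installed.
fit_forest <- function(train, ntree = 500, seed = 123) {
  set.seed(seed)  # bootstrap samples and variable subsets are random
  randomForest::randomForest(Loan_Status ~ ., data = train,
                             ntree = ntree, importance = TRUE)
}

# Toy data (made up) so the sketch can run end to end.
toy <- data.frame(
  Credit_History = factor(rep(c("0", "1"), times = c(20, 80))),
  Income         = rep(1:5 * 1000, times = 20),
  Loan_Status    = factor(c(rep("N", 20), rep("N", 15), rep("Y", 65)))
)

# Only run the fit when the package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- fit_forest(toy)
  print(rf$confusion)                  # out-of-bag confusion matrix
  print(randomForest::importance(rf))  # per-variable importance
}
```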
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = train_factored, method = "class")
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.51%
## Confusion matrix:
## N Y class.error
## N 70 84 0.54545455
## Y 12 326 0.03550296
## 1 8 9 10 21 34 39 41 42 45 47 48 50 52 54 57 59 70 74 77
## Y N Y Y N Y Y Y Y Y Y Y Y Y Y Y Y N N Y
## 80 86 88 98 114 118 123 136 139 144 146 148 149 155 158 161 163 182 195 198
## Y Y Y Y Y Y N Y N Y Y Y Y Y Y Y N Y Y Y
## 200 202 216 220 226 233 242 247 249 255 256 261 268 270 273 275 280 281 292 299
## Y N Y Y Y Y Y Y Y N N Y N Y Y Y Y N N Y
## 311 312 317 325 327 328 332 336 344 347 348 351 352 355 356 367 371 374 375 376
## Y Y Y Y N Y Y Y Y N Y Y Y Y Y Y Y N Y Y
## 385 390 396 397 399 401 404 412 419 422 434 436 460 463 468 487 490 495 496 500
## Y Y Y N Y N Y Y Y N Y Y Y Y Y N Y N Y N
## 503 530 532 536 544 546 547 549 556 559 561 562 566 572 575 579 580 582 593 599
## Y Y Y Y Y Y Y N Y Y Y Y N N Y Y Y Y Y Y
## 603 610
## Y Y
## Levels: N Y
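The out-of-bag (OOB) figures in the random forest output can be recomputed from its confusion matrix. Each row is scored only by trees that did not see it during training, so the OOB error serves as an estimate of out-of-sample error. This is a sketch in base R with the counts copied from the output above.

```r
# OOB confusion matrix copied from the randomForest output
# (rows = actual class, columns = predicted class).
oob <- matrix(c(70, 12, 84, 326), nrow = 2,
              dimnames = list(Actual = c("N", "Y"), Predicted = c("N", "Y")))

oob_error   <- 1 - sum(diag(oob)) / sum(oob)  # share of misclassified rows
class_error <- 1 - diag(oob) / rowSums(oob)   # error within each actual class

round(100 * oob_error, 2)  # 19.51
round(class_error, 4)
```

The per-class errors show the same pattern as the single tree: approved loans (Y) are almost always classified correctly, while more than half of the rejected loans (N) are missed.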
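In my opinion, a single decision tree offers better performance when analyzing one feature of a dataset: here it reaches 81.5% accuracy, against an OOB accuracy of about 80.5% (19.51% error) for the random forest, although the two figures are not computed the same way. Random Forest copes well with low-quality data, but because each split only considers a random subset of the variables, some of its trees might not take into account the significance that a key feature has in the final decision.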