Homework #2

Based on the latest topics presented, bring a dataset of your choice and create a decision tree with which you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.

Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.

Based on real cases where decision trees went wrong, and 'the bad & ugly' aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

Format: document with screen captures & analysis.

Data Exploration

I will use the dataset on loan approval status that I uploaded to GitHub.

Loading Data
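A minimal sketch of the loading step, assuming the file is read straight from a raw GitHub URL (the path below is a placeholder, not the real repository):

```r
# Load the loan approval dataset from GitHub (placeholder URL)
loan_data <- read.csv(
  "https://raw.githubusercontent.com/<user>/<repo>/main/loan_data.csv",
  stringsAsFactors = FALSE
)

dim(loan_data)    # number of rows and columns
head(loan_data)   # first six records
```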

## [1] 614  13
| Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---------|--------|---------|------------|-----------|---------------|-----------------|-------------------|------------|------------------|----------------|---------------|-------------|
| LP001002 | Male | No | 0 | Graduate | No | 5849 | 0 | NA | 360 | 1 | Urban | Y |
| LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508 | 128 | 360 | 1 | Rural | N |
| LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0 | 66 | 360 | 1 | Urban | Y |
| LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358 | 120 | 360 | 1 | Urban | Y |
| LP001008 | Male | No | 0 | Graduate | No | 6000 | 0 | 141 | 360 | 1 | Urban | Y |
| LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196 | 267 | 360 | 1 | Urban | Y |

For this dataset, the target variable for my analysis is Loan_Status.

Data Processing

Missing Values

Below is the summary of the missing values per feature.
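A minimal sketch of how these counts can be computed, assuming the data frame is named loan_data (blank strings are counted as missing alongside NA):

```r
# Missing (NA) or blank ("") entries per column
colSums(is.na(loan_data) | loan_data == "")
```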

Imputing Data

I will remove the Loan_ID variable, convert the categorical variables into factors, and then impute the missing values using the mice package with the random forest method.
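A minimal sketch of this preprocessing and imputation step (the seed and object names are my assumptions; m = 5 imputations over 5 iterations matches the log below):

```r
library(mice)

loan_data$Loan_ID <- NULL                             # drop the identifier
loan_data[] <- lapply(loan_data, function(x)          # strings -> factors
  if (is.character(x)) factor(x) else x)

# Impute LoanAmount, Loan_Amount_Term and Credit_History with random forests
imputed <- mice(loan_data, method = "rf", m = 5, maxit = 5, seed = 123)
loan_imputed <- complete(imputed, 1)                  # first completed dataset
summary(loan_imputed)
```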

## 
##  iter imp variable
##   1   1  LoanAmount  Loan_Amount_Term  Credit_History
##   1   2  LoanAmount  Loan_Amount_Term  Credit_History
##   1   3  LoanAmount  Loan_Amount_Term  Credit_History
##   1   4  LoanAmount  Loan_Amount_Term  Credit_History
##   1   5  LoanAmount  Loan_Amount_Term  Credit_History
##   2   1  LoanAmount  Loan_Amount_Term  Credit_History
##   2   2  LoanAmount  Loan_Amount_Term  Credit_History
##   2   3  LoanAmount  Loan_Amount_Term  Credit_History
##   2   4  LoanAmount  Loan_Amount_Term  Credit_History
##   2   5  LoanAmount  Loan_Amount_Term  Credit_History
##   3   1  LoanAmount  Loan_Amount_Term  Credit_History
##   3   2  LoanAmount  Loan_Amount_Term  Credit_History
##   3   3  LoanAmount  Loan_Amount_Term  Credit_History
##   3   4  LoanAmount  Loan_Amount_Term  Credit_History
##   3   5  LoanAmount  Loan_Amount_Term  Credit_History
##   4   1  LoanAmount  Loan_Amount_Term  Credit_History
##   4   2  LoanAmount  Loan_Amount_Term  Credit_History
##   4   3  LoanAmount  Loan_Amount_Term  Credit_History
##   4   4  LoanAmount  Loan_Amount_Term  Credit_History
##   4   5  LoanAmount  Loan_Amount_Term  Credit_History
##   5   1  LoanAmount  Loan_Amount_Term  Credit_History
##   5   2  LoanAmount  Loan_Amount_Term  Credit_History
##   5   3  LoanAmount  Loan_Amount_Term  Credit_History
##   5   4  LoanAmount  Loan_Amount_Term  Credit_History
##   5   5  LoanAmount  Loan_Amount_Term  Credit_History
| Variable | Summary |
|----------|---------|
| Gender | (blank): 13, Female: 112, Male: 489 |
| Married | (blank): 3, No: 213, Yes: 398 |
| Dependents | (blank): 15, 0: 345, 1: 102, 2: 101, 3+: 51 |
| Education | Graduate: 480, Not Graduate: 134 |
| Self_Employed | (blank): 32, No: 500, Yes: 82 |
| ApplicantIncome | Min: 150, 1st Qu: 2878, Median: 3812, Mean: 5403, 3rd Qu: 5795, Max: 81000 |
| CoapplicantIncome | Min: 0, 1st Qu: 0, Median: 1188, Mean: 1621, 3rd Qu: 2297, Max: 41667 |
| LoanAmount | Min: 9.0, 1st Qu: 100.0, Median: 128.0, Mean: 146.1, 3rd Qu: 167.8, Max: 700.0 |
| Loan_Amount_Term | Min: 12, 1st Qu: 360, Median: 360, Mean: 342, 3rd Qu: 360, Max: 480 |
| Credit_History | 0: 92, 1: 522 |
| Property_Area | Rural: 179, Semiurban: 233, Urban: 202 |
| Loan_Status | N: 192, Y: 422 |

Model Building

Decision Tree

Data Splitting

I will split the data into training and testing datasets in an 80:20 ratio.
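A minimal sketch of the split and the tree fit, assuming a simple random 80% sample (the seed is an assumption; the object name train_factored matches the rpart call in the output below):

```r
library(rpart)

set.seed(123)                                        # assumed seed
train_idx <- sample(nrow(loan_imputed), floor(0.8 * nrow(loan_imputed)))
train_factored <- loan_imputed[train_idx, ]          # ~80% for training
test_factored  <- loan_imputed[-train_idx, ]         # ~20% held out

tree_model <- rpart(Loan_Status ~ ., data = train_factored, method = "class")
summary(tree_model)
```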

## Call:
## rpart(formula = Loan_Status ~ ., data = train_factored, method = "class")
##   n= 492 
## 
##          CP nsplit rel error    xerror       xstd
## 1 0.4090909      0 1.0000000 1.0000000 0.06679061
## 2 0.0100000      1 0.5909091 0.5909091 0.05592289
## 
## Variable importance
## Credit_History 
##            100 
## 
## Node number 1: 492 observations,    complexity param=0.4090909
##   predicted class=Y  expected loss=0.3130081  P(node) =1
##     class counts:   154   338
##    probabilities: 0.313 0.687 
##   left son=2 (69 obs) right son=3 (423 obs)
##   Primary splits:
##       Credit_History   splits as  LR,       improve=66.469020, (0 missing)
##       Property_Area    splits as  LRL,      improve= 3.966673, (0 missing)
##       Education        splits as  RL,       improve= 2.301810, (0 missing)
##       ApplicantIncome  < 1903 to the left,  improve= 2.256965, (0 missing)
##       Loan_Amount_Term < 420  to the right, improve= 1.797663, (0 missing)
## 
## Node number 2: 69 observations
##   predicted class=N  expected loss=0.04347826  P(node) =0.1402439
##     class counts:    66     3
##    probabilities: 0.957 0.043 
## 
## Node number 3: 423 observations
##   predicted class=Y  expected loss=0.2080378  P(node) =0.8597561
##     class counts:    88   335
##    probabilities: 0.208 0.792

Credit_History stands out as the dominant predictor of loan approval: it is the only variable the tree splits on, and it accounts for all of the variable importance.
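The fitted tree is small enough to show in full; a sketch of the plot, assuming the rpart.plot package:

```r
library(rpart.plot)

# One split on Credit_History into two leaves, with class
# probabilities and the share of observations in each node
rpart.plot(tree_model, type = 2, extra = 104)
```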

I will now apply the decision tree to the test data and build a confusion matrix to evaluate the accuracy of the classifications.
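A minimal sketch of the evaluation step, assuming the caret package for the confusion matrix:

```r
library(caret)

# Class probabilities for each test case (printed below)
predict(tree_model, newdata = test_factored, type = "prob")

# Hard class predictions and the confusion matrix
tree_preds <- predict(tree_model, newdata = test_factored, type = "class")
confusionMatrix(tree_preds, test_factored$Loan_Status)
```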

##             N          Y
## 1   0.2080378 0.79196217
## 8   0.9565217 0.04347826
## 9   0.2080378 0.79196217
## 10  0.2080378 0.79196217
## 21  0.9565217 0.04347826
## 34  0.2080378 0.79196217
## 39  0.2080378 0.79196217
## 41  0.2080378 0.79196217
## 42  0.2080378 0.79196217
## 45  0.2080378 0.79196217
## 47  0.2080378 0.79196217
## 48  0.2080378 0.79196217
## 50  0.2080378 0.79196217
## 52  0.2080378 0.79196217
## 54  0.2080378 0.79196217
## 57  0.2080378 0.79196217
## 59  0.2080378 0.79196217
## 70  0.9565217 0.04347826
## 74  0.9565217 0.04347826
## 77  0.2080378 0.79196217
## 80  0.2080378 0.79196217
## 86  0.2080378 0.79196217
## 88  0.2080378 0.79196217
## 98  0.2080378 0.79196217
## 114 0.2080378 0.79196217
## 118 0.2080378 0.79196217
## 123 0.9565217 0.04347826
## 136 0.2080378 0.79196217
## 139 0.9565217 0.04347826
## 144 0.2080378 0.79196217
## 146 0.2080378 0.79196217
## 148 0.2080378 0.79196217
## 149 0.2080378 0.79196217
## 155 0.2080378 0.79196217
## 158 0.2080378 0.79196217
## 161 0.2080378 0.79196217
## 163 0.9565217 0.04347826
## 182 0.2080378 0.79196217
## 195 0.2080378 0.79196217
## 198 0.2080378 0.79196217
## 200 0.2080378 0.79196217
## 202 0.9565217 0.04347826
## 216 0.2080378 0.79196217
## 220 0.2080378 0.79196217
## 226 0.2080378 0.79196217
## 233 0.2080378 0.79196217
## 242 0.2080378 0.79196217
## 247 0.2080378 0.79196217
## 249 0.2080378 0.79196217
## 255 0.9565217 0.04347826
## 256 0.2080378 0.79196217
## 261 0.2080378 0.79196217
## 268 0.9565217 0.04347826
## 270 0.2080378 0.79196217
## 273 0.2080378 0.79196217
## 275 0.2080378 0.79196217
## 280 0.2080378 0.79196217
## 281 0.9565217 0.04347826
## 292 0.9565217 0.04347826
## 299 0.2080378 0.79196217
## 311 0.2080378 0.79196217
## 312 0.2080378 0.79196217
## 317 0.2080378 0.79196217
## 325 0.2080378 0.79196217
## 327 0.9565217 0.04347826
## 328 0.2080378 0.79196217
## 332 0.2080378 0.79196217
## 336 0.2080378 0.79196217
## 344 0.2080378 0.79196217
## 347 0.9565217 0.04347826
## 348 0.2080378 0.79196217
## 351 0.2080378 0.79196217
## 352 0.2080378 0.79196217
## 355 0.2080378 0.79196217
## 356 0.2080378 0.79196217
## 367 0.2080378 0.79196217
## 371 0.2080378 0.79196217
## 374 0.9565217 0.04347826
## 375 0.2080378 0.79196217
## 376 0.2080378 0.79196217
## 385 0.2080378 0.79196217
## 390 0.2080378 0.79196217
## 396 0.2080378 0.79196217
## 397 0.9565217 0.04347826
## 399 0.2080378 0.79196217
## 401 0.9565217 0.04347826
## 404 0.2080378 0.79196217
## 412 0.2080378 0.79196217
## 419 0.2080378 0.79196217
## 422 0.9565217 0.04347826
## 434 0.2080378 0.79196217
## 436 0.2080378 0.79196217
## 460 0.2080378 0.79196217
## 463 0.2080378 0.79196217
## 468 0.2080378 0.79196217
## 487 0.9565217 0.04347826
## 490 0.2080378 0.79196217
## 495 0.9565217 0.04347826
## 496 0.2080378 0.79196217
## 500 0.9565217 0.04347826
## 503 0.2080378 0.79196217
## 530 0.2080378 0.79196217
## 532 0.2080378 0.79196217
## 536 0.2080378 0.79196217
## 544 0.2080378 0.79196217
## 546 0.2080378 0.79196217
## 547 0.2080378 0.79196217
## 549 0.9565217 0.04347826
## 556 0.2080378 0.79196217
## 559 0.2080378 0.79196217
## 561 0.2080378 0.79196217
## 562 0.2080378 0.79196217
## 566 0.2080378 0.79196217
## 572 0.9565217 0.04347826
## 575 0.2080378 0.79196217
## 579 0.2080378 0.79196217
## 580 0.2080378 0.79196217
## 582 0.2080378 0.79196217
## 593 0.2080378 0.79196217
## 599 0.2080378 0.79196217
## 603 0.2080378 0.79196217
## 610 0.2080378 0.79196217
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  66   3
##          Y  88 335
##                                           
##                Accuracy : 0.815           
##                  95% CI : (0.7779, 0.8484)
##     No Information Rate : 0.687           
##     P-Value [Acc > NIR] : 9.652e-11       
##                                           
##                   Kappa : 0.4939          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4286          
##             Specificity : 0.9911          
##          Pos Pred Value : 0.9565          
##          Neg Pred Value : 0.7920          
##              Prevalence : 0.3130          
##          Detection Rate : 0.1341          
##    Detection Prevalence : 0.1402          
##       Balanced Accuracy : 0.7098          
##                                           
##        'Positive' Class : N               
## 

The model is 81.5% accurate (95% CI: 0.778 to 0.848), well above the no-information rate of 68.7%. Note, however, that sensitivity for the positive class 'N' is only 42.9%: the tree rarely misclassifies approved loans (specificity 99.1%) but misses more than half of the rejections.

Random Forest

Random Forest reduces the overfitting problem of a single decision tree: averaging many trees grown on bootstrap samples lowers the variance, which typically improves accuracy.
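A minimal sketch of the forest fit and its test-set predictions (the seed is an assumption; the call mirrors the output below, where method = "class" is silently ignored because randomForest infers classification from the factor response):

```r
library(randomForest)

set.seed(123)                                   # assumed seed
rf_model <- randomForest(Loan_Status ~ ., data = train_factored,
                         method = "class")      # ignored; response is a factor
print(rf_model)                                 # OOB error and confusion matrix

# Predict loan status for the held-out test cases
predict(rf_model, newdata = test_factored)
```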

## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = train_factored,      method = "class") 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.51%
## Confusion matrix:
##    N   Y class.error
## N 70  84  0.54545455
## Y 12 326  0.03550296
##   1   8   9  10  21  34  39  41  42  45  47  48  50  52  54  57  59  70  74  77 
##   Y   N   Y   Y   N   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   N   N   Y 
##  80  86  88  98 114 118 123 136 139 144 146 148 149 155 158 161 163 182 195 198 
##   Y   Y   Y   Y   Y   Y   N   Y   N   Y   Y   Y   Y   Y   Y   Y   N   Y   Y   Y 
## 200 202 216 220 226 233 242 247 249 255 256 261 268 270 273 275 280 281 292 299 
##   Y   N   Y   Y   Y   Y   Y   Y   Y   N   N   Y   N   Y   Y   Y   Y   N   N   Y 
## 311 312 317 325 327 328 332 336 344 347 348 351 352 355 356 367 371 374 375 376 
##   Y   Y   Y   Y   N   Y   Y   Y   Y   N   Y   Y   Y   Y   Y   Y   Y   N   Y   Y 
## 385 390 396 397 399 401 404 412 419 422 434 436 460 463 468 487 490 495 496 500 
##   Y   Y   Y   N   Y   N   Y   Y   Y   N   Y   Y   Y   Y   Y   N   Y   N   Y   N 
## 503 530 532 536 544 546 547 549 556 559 561 562 566 572 575 579 580 582 593 599 
##   Y   Y   Y   Y   Y   Y   Y   N   Y   Y   Y   Y   N   N   Y   Y   Y   Y   Y   Y 
## 603 610 
##   Y   Y 
## Levels: N Y

Conclusion

In my opinion, the single decision tree offers better performance when the goal is to understand how one feature drives the outcome of a dataset: here it reached 81.5% accuracy and made the dominant role of Credit_History explicit. Random Forest is good at guarding against overfitting and low-quality data (its out-of-bag error estimate here was 19.51%), but because it averages hundreds of trees it obscures the significance that any single feature has in the final decision. Keeping the tree's logic this transparent, validating it on held-out data, and openly reporting its weak spots (here, a tendency to approve loans that should be rejected) is how I would counter the 'bad & ugly' perception of decision trees when applying this model to a real problem.