1 Decision Tree Model

This section uses the tidymodels framework to implement the CART decision tree model. It is one part of the submission for HW3 Group 6 but is rendered separately to avoid variable name collisions in the code.

Now we create our model recipe. This recipe object describes the dependent and independent variables we wish to use and the data source.

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         12
## 
## Training data contained 480 data points and no missing data.

Next we define the statistical model we wish to run on our recipe object. In this case, we are running a decision tree from the rpart package in classification mode. For classification, rpart by default chooses each split to maximize the reduction in Gini impurity. Minimizing the sum of squared errors applies to regression trees (rpart's anova method), not to this model.

## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Now we create our training and test splits using an 80/20 split. The training set will be used to train our model while the test split will be used to evaluate our final model.

## <Analysis/Assess/Total>
## <385/95/480>

Now we use our model engine and our training data to create a classification model.

## parsnip model object
## 
## Fit time:  16ms 
## n= 385 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 385 117 Y (0.3038961 0.6961039)  
##    2) Credit_History=0 58   6 N (0.8965517 0.1034483) *
##    3) Credit_History=1 327  65 Y (0.1987768 0.8012232)  
##      6) Loan_Amount_Term=36,240,300,480 16   7 N (0.5625000 0.4375000) *
##      7) Loan_Amount_Term=60,84,120,180,360 311  56 Y (0.1800643 0.8199357)  
##       14) Total_Income>=18249 9   3 N (0.6666667 0.3333333) *
##       15) Total_Income< 18249 302  50 Y (0.1655629 0.8344371)  
##         30) Total_Income< 2381.5 7   2 N (0.7142857 0.2857143) *
##         31) Total_Income>=2381.5 295  45 Y (0.1525424 0.8474576) *

Our top-level splits are credit history, loan amount term, then total income (which is split twice). Our first branch is also a terminal node: when credit history = 0, you are very unlikely to be approved for a loan (about 90% of those 58 applicants were denied).
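To make the first split concrete, here is the impurity arithmetic for the root node. This is a sketch that assumes rpart's default Gini criterion for classification, which the fit above does not override. The symbols $G$, $p_N$, $p_Y$, and $\Delta G$ are our own notation; the class counts are read from nodes 1, 2, and 3 of the printed tree.

$$
G = 1 - p_N^2 - p_Y^2 = 2\,p_N\,p_Y
$$

$$
\begin{aligned}
G_{\text{root}} &= 2 \cdot \tfrac{117}{385} \cdot \tfrac{268}{385} \approx 0.423 \\
G_{2} &= 2 \cdot \tfrac{52}{58} \cdot \tfrac{6}{58} \approx 0.185 \\
G_{3} &= 2 \cdot \tfrac{65}{327} \cdot \tfrac{262}{327} \approx 0.319 \\
\Delta G &= G_{\text{root}} - \left(\tfrac{58}{385}\,G_{2} + \tfrac{327}{385}\,G_{3}\right) \approx 0.423 - 0.298 = 0.125
\end{aligned}
$$

Under this criterion, splitting on Credit_History alone removes a bit under a third of the root node's impurity.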

Now we resample our training data with 10-fold cross-validation. Each of the 10 resamples fits the model on roughly 90% of the training data and assesses it on the held-out 10%, which lets us check the variability of our estimates.
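As a rough illustration of what 10-fold resampling does to a training set of 385 rows, the base R sketch below assigns each row to one of 10 folds and reports how many rows each resample is fit on. It is a simplified, unstratified stand-in for `vfold_cv()`, and the names `n_train`, `fold_id`, and `fold_sizes` are made up for this example.

```r
# Simplified, unstratified sketch of 10-fold cross-validation in base R.
# rsample::vfold_cv() (used in this report) additionally stratifies by
# Loan_Status, so its folds will not match these exactly.
set.seed(1234)

n_train <- 385   # rows in the training split reported above
v <- 10          # number of folds

# Randomly assign every row to exactly one of the v folds.
fold_id <- sample(rep(seq_len(v), length.out = n_train))

# Rows held out (assessment) and rows used for fitting (analysis) per fold.
held_out <- as.integer(table(fold_id))
fold_sizes <- data.frame(
  fold = seq_len(v),
  assessment = held_out,
  analysis = n_train - held_out
)

fold_sizes

# Share of the training data each resample is fit on (roughly 0.90).
round(fold_sizes$analysis / n_train, 3)
```

Every row is held out exactly once, so the 10 assessment sets together cover the whole training split.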

Here we fit our model to each resample and define the metrics of interest.

Our cross-validated prediction accuracy is 73%. We then select the configuration with the best accuracy for our final workflow. Because no hyperparameters were tuned, there is only one candidate, so this step leaves the model specification unchanged. We fit our full training data one last time and view our splits.

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 385 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 385 117 Y (0.3038961 0.6961039)  
##    2) Credit_History=0 58   6 N (0.8965517 0.1034483) *
##    3) Credit_History=1 327  65 Y (0.1987768 0.8012232)  
##      6) Loan_Amount_Term=36,240,300,480 16   7 N (0.5625000 0.4375000) *
##      7) Loan_Amount_Term=60,84,120,180,360 311  56 Y (0.1800643 0.8199357)  
##       14) Total_Income>=18249 9   3 N (0.6666667 0.3333333) *
##       15) Total_Income< 18249 302  50 Y (0.1655629 0.8344371)  
##         30) Total_Income< 2381.5 7   2 N (0.7142857 0.2857143) *
##         31) Total_Income>=2381.5 295  45 Y (0.1525424 0.8474576) *

Our branches and nodes have not changed. Let’s look at variable importance. In rpart, a variable's importance is the sum of the goodness-of-split measures for every split in which it is the primary variable, plus a down-weighted contribution for every split in which it serves as a surrogate.

Credit History is by far our most important variable for predicting loan status.
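For readers who want to see the raw scores behind an importance plot, the sketch below reads them directly from an rpart fit. The loan data is not bundled with this report, so the example uses the `kyphosis` dataset that ships with rpart as a stand-in; `demo_fit` and `demo_importance` are hypothetical names, and the numbers it prints say nothing about the loan model.

```r
# Sketch: reading variable importance straight from an rpart fit.
# kyphosis ships with rpart and stands in for the loan data here.
library(rpart)

demo_fit <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis, method = "class")

# Named numeric vector: one score per variable that was used as a
# primary or surrogate split.
demo_importance <- sort(demo_fit$variable.importance, decreasing = TRUE)
demo_importance

# Rescale so the scores sum to 100, which makes them easier to compare.
round(100 * demo_importance / sum(demo_importance), 1)
```

Applied to the loan tree, the same `variable.importance` component of the underlying rpart object should give the scores for Credit_History and the other predictors.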

Lastly, we test our model and collect our metrics.

Our accuracy on the test set is about 75%. Below is our confusion matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  Y
##          N 13  6
##          Y 18 58
##                                           
##                Accuracy : 0.7474          
##                  95% CI : (0.6478, 0.8309)
##     No Information Rate : 0.6737          
##     P-Value [Acc > NIR] : 0.07526         
##                                           
##                   Kappa : 0.3617          
##                                           
##  Mcnemar's Test P-Value : 0.02474         
##                                           
##             Sensitivity : 0.4194          
##             Specificity : 0.9062          
##          Pos Pred Value : 0.6842          
##          Neg Pred Value : 0.7632          
##              Prevalence : 0.3263          
##          Detection Rate : 0.1368          
##    Detection Prevalence : 0.2000          
##       Balanced Accuracy : 0.6628          
##                                           
##        'Positive' Class : N               
##
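The headline statistics in this output can be recomputed by hand from the four cell counts, which is a useful check on how caret defines them when N is the positive class. The base R sketch below does that; the object names (`cm`, `nir`, and so on) are our own.

```r
# Recompute the headline statistics from the 2x2 counts reported above.
# Rows are predictions, columns are the reference (true) classes.
cm <- matrix(c(13, 18, 6, 58), nrow = 2,
             dimnames = list(Prediction = c("N", "Y"),
                             Reference  = c("N", "Y")))

accuracy    <- sum(diag(cm)) / sum(cm)        # (13 + 58) / 95
sensitivity <- cm["N", "N"] / sum(cm[, "N"])  # true N predicted as N
specificity <- cm["Y", "Y"] / sum(cm[, "Y"])  # true Y predicted as Y
balanced    <- (sensitivity + specificity) / 2
nir         <- max(colSums(cm)) / sum(cm)     # always predict the majority class

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity,
        balanced_accuracy = balanced,
        no_information_rate = nir), 4)
```

Only 13 of the 31 denied applications in the test set are predicted as denied, so the model is much better at recognizing approvals than denials.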

2 Code

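# ---------------------------------
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(tidyverse)
library(tidymodels)
library(caret)

cla <- read_csv("cla.csv")

# Convert the categorical columns to factors; Dependents is ordered.
cla$Loan_Amount_Term <- factor(cla$Loan_Amount_Term)
cla$Loan_Status <- factor(cla$Loan_Status)  # Convert the response to factor
cla$Credit_History <- factor(cla$Credit_History)
cla$Property_Area <- factor(cla$Property_Area)
cla$Gender <- factor(cla$Gender)
cla$Married <- factor(cla$Married)
cla$Dependents <- ordered(cla$Dependents, levels = c("0", "1", "2", "3+"))
cla$Education <- factor(cla$Education)
cla$Self_Employed <- factor(cla$Self_Employed)

# Keep the first 13 columns: the outcome plus 12 predictors.
ncla <- cla[, 1:13]

# Recipe: Loan_Status is the outcome, every other column is a predictor.
# The recipe is stored unprepped because workflow() and fit_resamples()
# prep it themselves on each training set; prep() is called here only
# to print the trained summary.
loan_rec <- recipe(Loan_Status ~ ., data = ncla)

prep(loan_rec)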
# Model specification: CART classification tree via the rpart engine.
tree <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree

# 80/20 train/test split.
set.seed(1234)
loan_split <- initial_split(ncla, prop = 0.8)

loan_train <- training(loan_split)
loan_test <- testing(loan_split)

loan_split

# Fit the tree on the training set.
tree_fit <- tree %>% fit(Loan_Status ~ ., data = loan_train)

tree_fit

# 10-fold cross-validation (the vfold_cv default), stratified by outcome.
set.seed(1234)
loan_xval <- vfold_cv(data = loan_train, strata = Loan_Status)

tree_res <- fit_resamples(
  tree,
  loan_rec,
  resamples = loan_xval,
  metrics = metric_set(accuracy, kap, roc_auc, sens, spec),
  control = control_resamples(save_pred = TRUE)
)

tree_res %>% collect_metrics(summarize = TRUE)
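# Select by cross-validated accuracy. Nothing was tuned, so there is a
# single candidate configuration. The metric is passed by name because
# recent versions of tune require it.
best_tree <- tree_res %>%
  select_best(metric = "accuracy")

final_wf <-
  workflow() %>%
  add_model(tree) %>%
  add_recipe(loan_rec) %>%
  finalize_workflow(best_tree)

# Refit on the full training set.
final_tree <-
  final_wf %>%
  fit(data = loan_train)

final_tree

# Variable importance plot. extract_fit_parsnip() replaces the deprecated
# pull_workflow_fit().
final_tree %>%
  extract_fit_parsnip() %>%
  vip::vip()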
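# Fit on the training set and evaluate once on the held-out test set.
final_fit <-
  final_wf %>%
  last_fit(split = loan_split, metrics = metric_set(accuracy, kap, roc_auc, sens, spec))

final_fit %>%
  collect_metrics()

pred <- final_fit %>%
  collect_predictions()

# Confusion matrix: predictions first, reference (true) classes second.
confusionMatrix(pred$.pred_class, pred$Loan_Status)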

Data 622 Homework 3: Decision Tree Modeling of Loan Approval Data

Randall Thompson - Group 6

Submitted by 04/09/2021
