Load Packages and Dataset
Packages used to import and analyse the data include class, rpart, rpart.plot, glue, dplyr, and googlesheets4.
class - various functions for classification, including k-nearest neighbor, learning vector quantization and self-organizing maps.
rpart - for recursive partitioning (aka divide and conquer)
rpart.plot - for plotting classification trees
loans - the original dataset provided (not a package); it needs more work to make it consistent with the version used in the exercise
glue - for referencing results in markdown text
dplyr - for manipulating data
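For reference, the loading calls for the packages above would look like this (googlesheets4 is listed on the assumption that it was used to import the raw loans data):
library(class)          # kNN, LVQ, SOM and friends
library(rpart)          # recursive partitioning
library(rpart.plot)     # plotting classification trees
library(glue)           # referencing results in markdown text
library(dplyr)          # data manipulation
library(googlesheets4)  # importing the data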
Tweaking the dataset
First step: check the original dataset's dimensions and see whether they match the version used in the exercise. A fun part (additional, not strictly necessary here) is that the number of rows and the number of columns are stored separately in two variables - loan_dim_row and loan_dim_col - for use in the glue function.
Here I just wanted to try the glue package, which makes it possible to reference code results in in-line markdown. While the dataset used in the exercise has a dimension of 11312 x 14 (the heading of each column was checked too), we have many more rows and two more columns.
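The creation of the two helper variables is not shown above; a minimal sketch, assuming the raw data has already been imported as loans:
loan_dim_row <- nrow(loans)   # row count of the raw dataset
loan_dim_col <- ncol(loans)   # column count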
glue('the loans has {loan_dim_row} rows and {loan_dim_col} columns.')
## the loans has 39732 rows and 16 columns.
A bit more effort (2 mins?) went into extracting information on the differences between the original dataset and the one used in the exercise. Below are the not-so-convenient conclusions.
there is a column named keep - it seems only the rows with keep == 1 are included - use filter to tackle this
rand - random numbers - seemingly used to assign the "keep" status but not included in the clean dataset - use select to tackle this, i.e. to remove the column
the original dataset has a default column (0/1 indicator) where the exercise dataset has an outcome column - use mutate
loans_revised <- loans %>%
filter(keep == 1) %>%
select(-rand) %>%
mutate(outcome = if_else(default == 0, "repaid", "default"))
Two more columns not needed in the dataset still have to be removed - they seem to be for marking only. The tricky part is not to run the select below together with the code above, because the default column is used in the mutate function.
loans_clean <- loans_revised %>%
select(-keep, -default)
Check the dimensions - the same as the dataset used in the exercise: 11312 x 14.
dim(loans_clean)
## [1] 11312 14
Final cleaned dataset
The loans_clean dataset contains 11,312 randomly selected people who applied for and later received loans from Lending Club.
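To get a quick look at the cleaned columns, dplyr's glimpse works well (output omitted here):
glimpse(loans_clean)   # prints each column's name, type and first few values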
Theories
classification trees are also known as decision trees - used to find a set of if/else conditions that are helpful for taking action
useful for business strategy, especially in areas where transparency is needed.
decision trees - the goal is to model the relationship between predictors and an outcome of interest
Beginning at the root node, data flows through if/else decision nodes that split the data according to its attributes
The branches indicate the potential choices, and the leaf nodes denote the final decisions. These are also known as terminal nodes because they terminate the decision making process.
Trees - divide and conquer. The algorithm attempts to divide the dataset into partitions with similar values for the outcome of interest.
Case example - loan applications
Let's consider a business process such as deciding whether or not to provide someone a loan.
After an applicant fills out a form with personal information like income, credit history, and loan purpose, the bank must quickly decide whether or not the individual is likely to repay the debt.
Using historical applicant data and loan outcomes, a classification tree can be built to learn the criteria that were most predictive of future loan repayment.
Trees
to divide-and-conquer, the decision tree algorithm looks for an initial split that creates the two most homogeneous groups
then it divides and conquers again with another split on each resulting group
each one of these splits results in an if/else decision in the tree structure.
Some coding
First tree classification model
Tasks
use a decision tree to try to learn patterns in the outcome of these loans (default or repaid)
based on the requested loan amount and credit score at the time of application
see how the tree’s predictions differ for an applicant with good credit versus one with bad credit
Generate two variables - good_credit and bad_credit. These two variables are for testing the prediction function on the classifications. Here the slice function is used to extract two rows. In theory, the data.frame function should have been used to create these two variables, but there are quite a few attributes to type out. After a few checks, these two rows appear to be straight extractions from the loans_clean dataset, hence the method below (see also the alternative sketched after the code).
good_credit <- slice(loans_clean, 8)
bad_credit <- slice(loans_clean, 3)
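As an aside, since the model below uses only the loan_amount and credit_score predictors, a hand-built data.frame with just those two columns should also work for predict(); a sketch with hypothetical values (the levels come from the dataset):
good_credit_alt <- data.frame(loan_amount = "LOW", credit_score = "HIGH")   # hypothetical applicant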
- build a lending model predicting loan outcome versus loan amount and credit score. Setting cp = 0 turns off the complexity penalty, so no split is rejected for offering too small an improvement.
loan_model <- rpart(outcome ~ loan_amount + credit_score,
data = loans_clean,
method = "class",
control = rpart.control(cp = 0))
- make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")
## 1
## repaid
## Levels: default repaid
- make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
## 1
## default
## Levels: default repaid
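For the class probabilities behind these labels rather than just the winning class, predict() also accepts type = "prob":
predict(loan_model, good_credit, type = "prob")   # one row per observation, one column per class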
Visualizing classification trees
The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions.
- examine the loan_model object
loan_model
## n= 11312
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 11312 5654 repaid (0.4998232 0.5001768)
## 2) credit_score=AVERAGE,LOW 9490 4437 default (0.5324552 0.4675448)
## 4) credit_score=LOW 1667 631 default (0.6214757 0.3785243) *
## 5) credit_score=AVERAGE 7823 3806 default (0.5134859 0.4865141)
## 10) loan_amount=HIGH 2472 1079 default (0.5635113 0.4364887) *
## 11) loan_amount=LOW,MEDIUM 5351 2624 repaid (0.4903756 0.5096244)
## 22) loan_amount=LOW 1810 874 default (0.5171271 0.4828729) *
## 23) loan_amount=MEDIUM 3541 1688 repaid (0.4767015 0.5232985) *
## 3) credit_score=HIGH 1822 601 repaid (0.3298573 0.6701427) *
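The printed splits can be read as nested if/else rules. A hand transcription of the tree above, for illustration only - in practice the fitted model should do the predicting:
predict_by_hand <- function(credit_score, loan_amount) {
  if (credit_score == "HIGH") {
    "repaid"                                               # node 3
  } else if (credit_score == "AVERAGE" && loan_amount == "MEDIUM") {
    "repaid"                                               # node 23
  } else {
    "default"                                              # nodes 4, 10 and 22
  }
}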
- Visualization - plot the loan_model with default settings. Note that rpart.plot has been loaded.
rpart.plot(loan_model)
Plot the loan_model with customized settings. A few notes on the settings/options:
type - type of plot. 2 is the default (split labels are below the node labels); 3 is to draw separate split labels for the left and right directions
fallen.leaves - default is TRUE to position the leaf nodes at the bottom of the graph
box.palette - palette for coloring the node boxes based on the fitted value. It is a vector of colors.
shadow.col - color of the shadow under the boxes. Default is 0, meaning no shadow.
snip - default is FALSE. Set to TRUE to interactively trim the tree with mouse clicks.
rpart.plot(loan_model, type = 3, box.palette = c("red3", "skyblue"), fallen.leaves = TRUE, shadow.col = "darkgray")
- from the tree we can see, for example, that an applicant with a low requested loan amount and a high credit score would be predicted to repay the loan.
Larger classification trees
choosing where to split
the split that produces the purest partitions is used first
A is purer than B, i.e. the split based on credit score is preferred over the one based on requested amount (purity is made concrete in the Gini sketch below)
a diagonal split line (combining two features at once) is not possible in the tree's divide-and-conquer process
a decision tree always creates axis-parallel splits - this limitation is a potential weakness of decision trees.
decision trees can become overly complex quickly, which causes overfitting - the tree then tends to model the noise instead of the most important trends in the data
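On purity: for classification, rpart uses the Gini impurity by default, 1 - sum(p^2) over the class proportions p, where lower means purer. A quick sketch:
gini <- function(p) 1 - sum(p^2)
gini(c(0.5, 0.5))   # 0.50 - a maximally impure 50/50 partition
gini(c(0.9, 0.1))   # 0.18 - much purer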
consider training and test datasets
Training datasets
Creating random test datasets
- determine the number of rows for training. Here we use 75% of the observations for training and 25% for testing the model
a <- 0.75 * nrow(loans_clean)
b <- nrow(loans_clean)
- create a random sample of row IDs, i.e. sample a of the b rows in the dataset
sample_rows <- sample(b, a)
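Note that sample() draws a different split on every run; seeding the random number generator first makes the split reproducible (the seed value here is arbitrary):
set.seed(42)                  # arbitrary seed, for reproducibility
sample_rows <- sample(b, a)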
- create the training dataset by subsetting the loans_clean dataset using the sample_rows just created, and the test dataset accordingly.
loans_train <- loans_clean[sample_rows, ]
loans_test <- loans_clean[-sample_rows, ]
Building and evaluating a larger tree
- build a loan model using the training dataset and all of the available predictors to predict the outcome.
loan_model <- rpart(outcome ~ ., data = loans_train,
method = "class",
control = rpart.control(cp = 0))
- Make predictions on the test dataset - create a vector of the predicted outcomes. Make sure to include the type argument in the code.
loans_test$pred <- predict(loan_model, loans_test, type = "class")
- Examine the confusion matrix to compare the predicted values to the actual outcome values. Rows are the actual outcomes and columns are the predictions.
table(loans_test$outcome, loans_test$pred)
##
## default repaid
## default 751 647
## repaid 632 798
- Compute the accuracy of the predictions.
mean(loans_test$outcome == loans_test$pred)
## [1] 0.5477369
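The same figure can be read off the confusion matrix, with the correct predictions on the diagonal:
(751 + 798) / (751 + 647 + 632 + 798)   # 1549 / 2828 = 0.5477369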
Tending to classification trees - pruning
pre-pruning - set a maximum depth or a minimum number of observations required to split, to prevent the tree from growing too large
post-pruning - grow a large tree first, then prune back the overly complex branches; decide where to prune by examining the error rate versus the complexity of the tree
Simply look for the point at which the curve flattens. The horizontal dotted line identifies the point at which the error rate becomes statistically similar to that of the most complex model. Typically, you should prune the tree at the complexity level that results in a classification error rate just under this line.
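A sketch of both approaches with rpart; the maxdepth, minsplit and cp values below are illustrative placeholders rather than tuned choices:
# pre-pruning: stop growth early via the control parameters
loan_model_pre <- rpart(outcome ~ ., data = loans_train, method = "class",
                        control = rpart.control(maxdepth = 6, minsplit = 20))
# post-pruning: grow the full tree, inspect error against complexity, then cut back
loan_model_full <- rpart(outcome ~ ., data = loans_train, method = "class",
                         control = rpart.control(cp = 0))
plotcp(loan_model_full)                   # error rate against cp, with the dotted line described above
loan_model_pruned <- prune(loan_model_full, cp = 0.0014)   # hypothetical cp value read off the plot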