Introduction

Classification is a big part of data science, and decision trees are among the tools that have proved most useful for it. They are effective predictors and, in addition, easy to understand and interpret. Whatever framework or programming language you choose, you will find plenty of ready-made decision tree tools. Still, it is important to understand how they actually work.

What we are going to do

As a case study, we will do a simple analysis of Lending Club loans data with the rpart package, and then we will write our own implementation to understand how it works. All examples in this article are in R. If you are not familiar with this language, you may want to try R first.

Exploring the data

The dataset contains complete loan data, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information.

library(dplyr)       # select()
library(caret)       # createDataPartition(), confusionMatrix()
library(rpart)       # reference decision tree implementation
library(rpart.plot)  # tree visualization

all_loans <- read.csv('LoanStats3a.csv', stringsAsFactors = F, skip = 1)

For our analysis we will focus only on a few attributes of the dataset:

all_loans <- select(all_loans, grade, term, home_ownership, emp_length, loan_status)

For those who are interested, the definitions of all attributes can be found in the Data Dictionary.

We will consider only Fully Paid (safe) and Charged Off (risky) loans.

safe_loans  <- all_loans[all_loans$loan_status == 'Fully Paid',]
risky_loans <- all_loans[all_loans$loan_status == 'Charged Off',]

What we are going to do now is predict whether a loan turns out to be safe or risky. For this purpose, we will use 1 as the value for safe loans and 0 for risky loans.

safe_loans$safe  <- 1
risky_loans$safe <- 0

Let’s check how many loans of each type we have.

nrow(safe_loans)
## [1] 33136
nrow(risky_loans)
## [1] 5634

In our dataset, we have many more safe loans than risky ones. Imbalanced data might be a problem, so we will take the same number of loans of each type.

# keep only as many safe loans as we have risky ones
safe_loans <- safe_loans[1:nrow(risky_loans),]

Now let’s combine the loans into one dataset.

loans <- rbind(safe_loans, risky_loans)
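
As a quick sanity check (an extra step, not in the original write-up), we can verify that both classes are now the same size:

# both classes should now contain 5634 loans
table(loans$safe)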

Data transformations

To avoid getting lost in lots of details, we will focus on a simple binary decision tree and convert our factor variables into binary features. Let’s see what this transformation looks like for the grade attribute.

knitr::kable(head(loans[1]))
  grade
1 B
3 C
4 C
6 A
7 C
8 E
knitr::kable(head(model.matrix(~loans$grade - 1)))
loans$gradeA loans$gradeB loans$gradeC loans$gradeD loans$gradeE loans$gradeF loans$gradeG
           0            1            0            0            0            0            0
           0            0            1            0            0            0            0
           0            0            1            0            0            0            0
           1            0            0            0            0            0            0
           0            0            1            0            0            0            0
           0            0            0            0            1            0            0

Now let’s apply the transformation to all attributes.

loans.data <- data.frame(safe = loans$safe)
loans.data <- cbind(loans.data, model.matrix(~loans$grade - 1))
loans.data <- cbind(loans.data, model.matrix(~loans$term - 1))
loans.data <- cbind(loans.data, model.matrix(~loans$home_ownership - 1))
loans.data <- cbind(loans.data, model.matrix(~loans$emp_length - 1))
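
A side note on the - 1 term in these formulas: without it, model.matrix would add an intercept column and drop the first factor level; with it, every level gets its own indicator column. A tiny example with a made-up factor:

f <- factor(c('a', 'b', 'a'))
model.matrix(~f)      # columns: (Intercept), fb
model.matrix(~f - 1)  # columns: fa, fb - one indicator per level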

To check the performance of the predictions, we will split our data into training and testing subsets - caret’s createDataPartition is quite handy for such operations.

inTrain <- createDataPartition(loans.data$safe, p = 0.5, list = F)
train_data <- loans.data[inTrain,] 
test_data  <- loans.data[-inTrain,]

We will use the training subset to train the tree, and the testing one to check its performance.

Using rpart

Let’s check what rpart will give us.

# train rpart tree on training data 
rpart_tree <- rpart(safe ~ ., train_data, method = 'class')

# predict labels for test data using rpart tree
rpart_pred <- predict(rpart_tree, test_data, type = 'class')

# compare labels predicted by rpart to the real ones
# (confusionMatrix expects factors, so we convert the reference labels)
confusionMatrix(rpart_pred, factor(test_data$safe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2033 1142
##          1  784 1675
##                                           
##                Accuracy : 0.6581          
##                  95% CI : (0.6456, 0.6705)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3163          
##  Mcnemar's Test P-Value : 4.131e-16       
##                                           
##             Sensitivity : 0.7217          
##             Specificity : 0.5946          
##          Pos Pred Value : 0.6403          
##          Neg Pred Value : 0.6812          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3608          
##    Detection Prevalence : 0.5635          
##       Balanced Accuracy : 0.6581          
##                                           
##        'Positive' Class : 0               
## 

The metric we are looking for is Accuracy (for more details, read the help page for confusionMatrix).
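
Since accuracy is simply the share of correct predictions, we can double-check the reported number by hand (a quick sanity check, not part of the original analysis):

# rpart_pred is a factor with levels '0' and '1', so convert it before comparing
mean(as.integer(as.character(rpart_pred)) == test_data$safe)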

So, about 66% is better than random guessing (which would give 50% in our case). Not bad, considering how many simplifications we made.

Interpreting the decision tree

But what are those decisions based on? Let’s take a look: we can use the rpart.plot package to plot the decision tree for our model.

rpart.plot(rpart_tree)

We can see that the model first checks whether the loan has an A grade - all A loans go to the safe category. Then the model checks whether it is a B grade; if it is not even a B, the loan is considered risky. Finally, for the B loans, the model checks whether it is a short-term or long-term loan: short-term loans are considered risky and long-term loans are considered safe.
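
Written out by hand (just an illustration of the plotted tree with shortened variable names, not code produced by rpart), this logic amounts to:

classify <- function (gradeA, gradeB, long_term) {
  if (gradeA == 1) return(1)  # grade A -> safe
  if (gradeB == 0) return(0)  # below grade B -> risky
  if (long_term)   return(1)  # long-term B loan -> safe
  return(0)                   # short-term B loan -> risky
}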

Implementation

As we saw above, the idea is to split the data on different features, building a tree of decisions that eventually leads to a classification for each data point.

So, we use features to split our data, but how do we choose which feature? Usually, the dataset won’t split flawlessly: there will be data points with different labels in the same subset. In that case, we can label the subset based on the majority and count the data points with the minority label as the error of that subset. For each feature we calculate the errors of the subsets we get after a split and choose the feature with the smallest total error.

calculate_error <- function (labels) {
  # if we use the majority label as the label of the leaf,
  # the size of the minority is the error of the leaf
  min(sum(labels == 1), sum(labels == 0))
}

splitting_feature <- function (data, features, target) {
  # we calculate the classification error of splitting on every feature
  errors <- sapply(
    features, 
    function (feature) {
      left_labels  <- data[data[feature] == 0,][[target]]
      right_labels <- data[data[feature] == 1,][[target]]
      left_error  <- calculate_error(left_labels)            
      right_error <- calculate_error(right_labels)
      # normalize by the number of rows: nrow(data), not length(data),
      # which would count the columns of the data frame
      (left_error + right_error) / nrow(data)
    }
  )
  # and take the feature with the lowest error
  names(which.min(errors))
}
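
A quick check of these helpers on toy data (made-up values, just to see the mechanics):

calculate_error(c(1, 1, 0, 1))  # majority label is 1, one point disagrees -> 1

toy <- data.frame(f1 = c(0, 0, 1, 1), f2 = c(0, 1, 0, 1), safe = c(0, 0, 1, 1))
splitting_feature(toy, c('f1', 'f2'), 'safe')  # 'f1' separates the labels perfectly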

Another important consideration is generalization: if we only need to describe our existing data, we can build a really deep decision tree that works perfectly on the training dataset. Such a tree most likely won’t work well on new data - this is what overfitting looks like for decision trees.

So, we take our data, split it into subsets, check the stop criteria, and keep splitting the subsets as long as those criteria are not met. We will need to stop splitting in the following situations:

- all data points in the subset have the same label, so there is nothing left to separate;
- there are no features left to split on;
- the tree has reached the maximum allowed depth (the max_depth parameter you will see below).
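
The full tree-building code lives in the linked file; here is a minimal sketch of what mytree_create can look like, assuming the node structure used by mytree_predict below (is_leaf, prediction, splitting_feature, left, right) - a simplified version, not the exact original:

make_leaf <- function (labels) {
  list(
    is_leaf    = TRUE,
    # the leaf predicts the majority label of its data points
    prediction = if (sum(labels == 1) >= sum(labels == 0)) 1 else 0
  )
}

mytree_create <- function (data, features, target, max_depth, depth = 0) {
  labels <- data[[target]]

  # stop criteria: the node is pure, we ran out of features,
  # or the maximum depth is reached
  if (calculate_error(labels) == 0 ||
      length(features) == 0 ||
      depth >= max_depth) {
    return(make_leaf(labels))
  }

  # pick the feature with the smallest classification error and split on it
  feature <- splitting_feature(data, features, target)
  left_data  <- data[data[feature] == 0,]
  right_data <- data[data[feature] == 1,]

  # a degenerate split puts everything on one side - stop here as well
  if (nrow(left_data) == 0 || nrow(right_data) == 0) {
    return(make_leaf(labels))
  }

  remaining <- setdiff(features, feature)
  list(
    is_leaf           = FALSE,
    splitting_feature = feature,
    left  = mytree_create(left_data,  remaining, target, max_depth, depth + 1),
    right = mytree_create(right_data, remaining, target, max_depth, depth + 1)
  )
}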

Ok, we’ve built a tree - what’s next? To predict the label of a new data point, we evaluate it against the split criteria and see which leaf it ends up in. Then we simply use the label of that leaf as our prediction.

mytree_predict <- function (tree, data) {   
  # return the prediction if the node is a leaf node
  if (tree$is_leaf) {
    return(tree$prediction)
  }
  
  # value(s) of the splitting feature for the data
  split_feature_value = data[[tree$splitting_feature]]
  
  # keep going down the tree; since ifelse is vectorized,
  # this predicts a whole data frame of points at once
  ifelse(
    split_feature_value == 0,
    mytree_predict(tree$left, data),
    mytree_predict(tree$right, data)
  )
}

You can find the rest of the code in this file.

Testing and comparison

Now we have our decision tree implementation and can check how well it performs!

# train the tree on training data
my_decision_tree <- mytree_create(train_data, colnames(train_data[-1]), 'safe', max_depth = 6)

# predict labels for test data
pred <- mytree_predict(my_decision_tree, test_data)

# compare predicted and real labels
# (our predictions are numeric, so we convert both vectors to factors)
confusionMatrix(factor(pred), factor(test_data$safe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1953 1122
##          1  864 1695
##                                         
##                Accuracy : 0.6475        
##                  95% CI : (0.6349, 0.66)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.295         
##  Mcnemar's Test P-Value : 8.074e-09     
##                                         
##             Sensitivity : 0.6933        
##             Specificity : 0.6017        
##          Pos Pred Value : 0.6351        
##          Neg Pred Value : 0.6624        
##              Prevalence : 0.5000        
##          Detection Rate : 0.3466        
##    Detection Prevalence : 0.5458        
##       Balanced Accuracy : 0.6475        
##                                         
##        'Positive' Class : 0             
## 

Almost 65% accuracy! Quite close to the rpart result.

PS

Even though for this simple example our implementation is not significantly worse, remember that it is just a demonstration of the basic principles of decision trees; for more complex data, it’s better to rely on mature packages like rpart.