- Decision Trees
- Collecting/Importing Data
- Exploring and Preparing the Data
- Partition data into training and test datasets
- Training a Decision Tree Model
- Evaluating Model Performance
- Improving Decision Tree Accuracy
- Regression Trees
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured for each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, we can develop a prediction model to distinguish the species from one another.
library(tidyverse)
iris <- read_csv("iris.csv")
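Before splitting, we can take a quick look at the data. This is a small sketch; it assumes the class column in iris.csv is named species, matching the column used later in this tutorial.
# examine the structure of the data
str(iris)
# counts of each species
table(iris$species)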
We will split our data into two portions:
- a training dataset to build the decision tree (130 records)
- a test dataset to evaluate the model performance (20 records)
RNGversion("3.5.2")
set.seed(111) # use set.seed to use the same random number sequence as the tutorial
train_sample <- sample(150, 130)
# split the data frames
iris_train <- iris[train_sample, ]
iris_test <- iris[-train_sample, ]
Then we can check whether the split training and testing datasets are balanced.
# check the proportion of the class variable
prop.table(table(iris_train$species))
##
##     Iris-setosa Iris-versicolor  Iris-virginica
##       0.3461538       0.3384615       0.3153846
prop.table(table(iris_test$species))
##
##     Iris-setosa Iris-versicolor  Iris-virginica
##            0.25            0.30            0.45
# load the decision tree algorithm package
# install.packages("C50")
library(C50)
# C5.0 requires the outcome variable to be a factor
iris_train$species <- as.factor(iris_train$species)
iris_test$species <- as.factor(iris_test$species)
We will use the Training dataset to build a decision tree model.
# train a decision tree model
# use only the first 4 columns as predictors (column 5 is the outcome)
# use the species variable as the dependent variable
iris_model <- C5.0(iris_train[-5], iris_train$species)
summary(iris_model)
plot(iris_model)
Node 5: If petal length is between 1.9 and 4.9 and petal width is less than 1.7, the flower is predicted to be Iris-versicolor. The training misclassification rate for this node is 1/37.
Node 6: If petal length is greater than 4.9 and petal width is less than 1.7, the flower is predicted to be Iris-virginica. The training misclassification rate for this node is 2/6.
The overall misclassification rate on the training data is 3.1%.
To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:
iris_pred <- predict(iris_model, iris_test)
Then we can see how well the model did on the testing data. That is, what is the misclassification rate for the test dataset?
# cross tabulation of predicted versus actual classes
# install.packages("gmodels")
library(gmodels)
CrossTable(iris_test$species, iris_pred,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual species', 'predicted species'))
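To answer that question directly, we can also compute the test misclassification rate in a single line; a quick sketch using the objects created above:
# proportion of test records where the prediction disagrees with the actual species
mean(iris_pred != iris_test$species)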
Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while the others were approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory.
In this section, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.
The idea behind our credit model is to identify factors that are predictive of a higher risk of default.
We will first download and import the credit data, which contains information on loans obtained from a credit agency in Germany.
The credit dataset includes 1,000 examples of loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default (a failure to pay back the principal and interest). Let’s see whether we can determine any patterns that predict this outcome.
# load data
library(tidyverse)
credit <- read_csv("credit.csv")
We can first take a quick look at the dataset.
str(credit)
## spc_tbl_ [1,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ checking_balance    : chr [1:1000] "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
##  $ months_loan_duration: num [1:1000] 6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : chr [1:1000] "critical" "good" "critical" "good" ...
##  $ purpose             : chr [1:1000] "furniture/appliances" "furniture/appliances" "education" "furniture/appliances" ...
##  $ amount              : num [1:1000] 1169 5951 2096 7882 4870 ...
##  $ savings_balance     : chr [1:1000] "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
##  $ employment_duration : chr [1:1000] "> 7 years" "1 - 4 years" "4 - 7 years" "4 - 7 years" ...
##  $ percent_of_income   : num [1:1000] 4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : num [1:1000] 4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : num [1:1000] 67 22 49 45 53 35 53 35 61 28 ...
##  $ other_credit        : chr [1:1000] "none" "none" "none" "none" ...
##  $ housing             : chr [1:1000] "own" "own" "own" "other" ...
##  $ existing_loans_count: num [1:1000] 2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : chr [1:1000] "skilled" "skilled" "unskilled" "skilled" ...
##  $ dependents          : num [1:1000] 1 1 2 2 2 2 1 1 1 1 ...
##  $ phone               : chr [1:1000] "yes" "no" "no" "no" ...
##  $ default             : chr [1:1000] "no" "yes" "no" "no" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   checking_balance = col_character(),
##   ..   months_loan_duration = col_double(),
##   ..   credit_history = col_character(),
##   ..   purpose = col_character(),
##   ..   amount = col_double(),
##   ..   savings_balance = col_character(),
##   ..   employment_duration = col_character(),
##   ..   percent_of_income = col_double(),
##   ..   years_at_residence = col_double(),
##   ..   age = col_double(),
##   ..   other_credit = col_character(),
##   ..   housing = col_character(),
##   ..   existing_loans_count = col_double(),
##   ..   job = col_character(),
##   ..   dependents = col_double(),
##   ..   phone = col_character(),
##   ..   default = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default.
# look at two characteristics of the applicant.
#Note that since the loan data was obtained from Germany, the currency is recorded in Deutsche Marks (DM).
table(credit$checking_balance)
##
##     < 0 DM   > 200 DM 1 - 200 DM    unknown
##        274         63        269        394
table(credit$savings_balance)
##
##      < 100 DM     > 1000 DM  100 - 500 DM 500 - 1000 DM       unknown
##           603            48           103            63           183
Some of the loan’s features are numeric, such as its duration and the amount of credit requested:
summary(credit$months_loan_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     4.0    12.0    18.0    20.9    24.0    72.0
summary(credit$amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     250    1366    2320    3271    3972   18424
The default variable indicates the outcome: whether the loan applicant was unable to meet the agreed payment terms.
table(credit$default) #30% went into default
##
##  no yes
## 700 300
We will split our data into two portions:
- a training dataset to build the decision tree (70%)
- a test dataset to evaluate the model performance (30%)
RNGversion("3.5.2") # use an older random number generator to match the book
set.seed(230) # use set.seed to use the same random number sequence as the tutorial
train_sample <- sample(1000, 700)
# split the data frames
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]
Then we can check whether the split training and testing datasets are balanced.
# check the proportion of the class variable
prop.table(table(credit_train$default))
##
##        no       yes
## 0.6971429 0.3028571
prop.table(table(credit_test$default))
##
##        no       yes
## 0.7066667 0.2933333
This appears to be a fairly even split, so we can now build our decision tree.
#install.packages("C50")
library(C50)
The 17th column in credit_train is the default class variable, so we need to exclude it from the training data frame.
# the algorithm requires the outcome variable to be a factor
credit_train$default <- as.factor(credit_train$default)
credit_test$default <- as.factor(credit_test$default)
# train a decision tree model
credit_model <- C5.0(credit_train[-17], credit_train$default)
Then we can see the detailed branches of the decision tree with summary().
# display detailed information about the tree
summary(credit_model)
From the summary results, the first lines of the tree can be expressed in plain language as follows:
If the checking account balance is unknown or greater than 200 DM, then classify as “not likely to default.”
The misclassification rate for this node is 44/320.
The overall misclassification rate on the training data is 13.6%.
The key predictor is checking_balance.
We definitely need to prune the tree!
plot(credit_model)
To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:
# create a factor vector of predictions on the test data
credit_pred <- predict(credit_model, credit_test)
Then we can see how well the model did on the testing data.
# cross tabulation of predicted versus actual classes
#install.packages("gmodels")
library(gmodels)
CrossTable(credit_test$default, credit_pred,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
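As with the iris example, the cross table can be summarized with a single number; a quick sketch:
# overall misclassification rate on the test set
mean(credit_pred != credit_test$default)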
We can simplify the tree by increasing the minimum number of cases required in the leaf nodes:
credit_model2 <- C5.0(credit_train[-17], credit_train$default,
control = C5.0Control(minCases = 20))
plot(credit_model2)
credit_pred2 <- predict(credit_model2, credit_test)
CrossTable(credit_test$default, credit_pred2,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
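To compare the two trees numerically, we can compute each model's test error rate side by side; a small sketch reusing the predictions computed above:
# test-set misclassification rates for the two trees
mean(credit_pred != credit_test$default)   # original tree
mean(credit_pred2 != credit_test$default)  # simplified tree (minCases = 20)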
When comparing models, we mainly refer to the errors on the testing data.
Winemaking is a challenging and competitive business that offers the potential for great profit. However, numerous factors contribute to the profitability of a winery. Variables such as the weather, the growing environment, the manufacturing and bottling process, the bottle design, and even the price point can affect the customer’s perception of taste.
More recently, machine learning has been employed to assist with rating the quality of wine—a notoriously difficult task. A review written by a renowned wine critic often determines whether the product ends up on the top or bottom shelf.
In this case study, we will use regression trees and model trees to create a system capable of mimicking expert ratings of wine. Computer-aided wine testing may therefore result in a better product as well as more objective, consistent, and fair ratings.
We will first download and import the whitewine data, which includes examples of white Vinho Verde wines from Portugal—one of the world’s leading wine-producing countries.
wine <- read.csv("winequality.csv")
The white wine data includes information on 11 chemical properties of 4,898 wine samples.
# examine the wine data
str(wine)
# the distribution of quality ratings
hist(wine$quality)
Our last step is to divide the data into training and testing datasets. Since the wine dataset was already sorted into random order, we can simply partition it into two contiguous sets: a 75% training dataset and a 25% test dataset.
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
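As with the earlier examples, it is worth verifying that the two partitions look similar; a quick sketch:
# sanity check: the partitions should have similar quality distributions
summary(wine_train$quality)
summary(wine_test$quality)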
We will use the rpart (recursive partitioning) package, which offers the most faithful implementation of regression trees as they were described by the CART team.
#install.packages("rpart")
library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)
summary(m.rpart)
## Call:
## rpart(formula = quality ~ ., data = wine_train)
##   n= 3750
##
##           CP nsplit rel error    xerror       xstd
## 1 0.17816211      0 1.0000000 1.0005363 0.02388663
## 2 0.04439109      1 0.8218379 0.8227527 0.02238832
## 3 0.02890893      2 0.7774468 0.7879479 0.02210669
## 4 0.01655575      3 0.7485379 0.7615509 0.02098066
## 5 0.01108600      4 0.7319821 0.7490110 0.02062154
## 6 0.01000000      5 0.7208961 0.7464506 0.02056677
##
## Variable importance
##              alcohol              density            chlorides
##                   38                   23                   12
##     volatile.acidity total.sulfur.dioxide  free.sulfur.dioxide
##                   12                    7                    6
##            sulphates                   pH       residual.sugar
##                    1                    1                    1
##
## Node number 1: 3750 observations,    complexity param=0.1781621
##   mean=5.886933, MSE=0.8373493
##   left son=2 (2473 obs) right son=3 (1277 obs)
##   Primary splits:
##       alcohol              < 10.85    to the left,  improve=0.17816210, (0 missing)
##       density              < 0.992385 to the right, improve=0.11980970, (0 missing)
##       chlorides            < 0.0395   to the right, improve=0.08199995, (0 missing)
##       total.sulfur.dioxide < 153.5    to the right, improve=0.03875440, (0 missing)
##       free.sulfur.dioxide  < 11.75    to the left,  improve=0.03632119, (0 missing)
##   Surrogate splits:
##       density              < 0.99201  to the right, agree=0.869, adj=0.614, (0 split)
##       chlorides            < 0.0375   to the right, agree=0.773, adj=0.334, (0 split)
##       total.sulfur.dioxide < 102.5    to the right, agree=0.705, adj=0.132, (0 split)
##       sulphates            < 0.345    to the right, agree=0.670, adj=0.031, (0 split)
##       fixed.acidity        < 5.25     to the right, agree=0.662, adj=0.009, (0 split)
##
## Node number 2: 2473 observations,    complexity param=0.04439109
##   mean=5.609381, MSE=0.6108623
##   left son=4 (1406 obs) right son=5 (1067 obs)
##   Primary splits:
##       volatile.acidity    < 0.2425 to the right, improve=0.09227123, (0 missing)
##       free.sulfur.dioxide < 13.5   to the left,  improve=0.04177240, (0 missing)
##       alcohol             < 10.15  to the left,  improve=0.03313802, (0 missing)
##       citric.acid         < 0.205  to the left,  improve=0.02721200, (0 missing)
##       pH                  < 3.325  to the left,  improve=0.01860335, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 111.5 to the right, agree=0.610, adj=0.097, (0 split)
##       pH                   < 3.295 to the left,  agree=0.598, adj=0.067, (0 split)
##       alcohol              < 10.05 to the left,  agree=0.590, adj=0.049, (0 split)
##       sulphates            < 0.715 to the left,  agree=0.584, adj=0.037, (0 split)
##       residual.sugar       < 1.85  to the right, agree=0.581, adj=0.029, (0 split)
##
## Node number 3: 1277 observations,    complexity param=0.02890893
##   mean=6.424432, MSE=0.8378682
##   left son=6 (93 obs) right son=7 (1184 obs)
##   Primary splits:
##       free.sulfur.dioxide  < 11.5  to the left,  improve=0.08484051, (0 missing)
##       alcohol              < 11.85 to the left,  improve=0.06149941, (0 missing)
##       fixed.acidity        < 7.35  to the right, improve=0.04259695, (0 missing)
##       residual.sugar       < 1.275 to the left,  improve=0.02795662, (0 missing)
##       total.sulfur.dioxide < 67.5  to the left,  improve=0.02541719, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 48.5 to the left, agree=0.937, adj=0.14, (0 split)
##
## Node number 4: 1406 observations,    complexity param=0.011086
##   mean=5.40256, MSE=0.526423
##   left son=8 (182 obs) right son=9 (1224 obs)
##   Primary splits:
##       volatile.acidity     < 0.4225 to the right, improve=0.04703189, (0 missing)
##       free.sulfur.dioxide  < 17.5   to the left,  improve=0.04607770, (0 missing)
##       total.sulfur.dioxide < 86.5   to the left,  improve=0.02894310, (0 missing)
##       alcohol              < 10.25  to the left,  improve=0.02890077, (0 missing)
##       chlorides            < 0.0455 to the right, improve=0.02096635, (0 missing)
##   Surrogate splits:
##       density       < 0.99107 to the left,  agree=0.874, adj=0.027, (0 split)
##       citric.acid   < 0.11    to the left,  agree=0.873, adj=0.022, (0 split)
##       fixed.acidity < 9.85    to the right, agree=0.873, adj=0.016, (0 split)
##       chlorides     < 0.206   to the right, agree=0.871, adj=0.005, (0 split)
##
## Node number 5: 1067 observations
##   mean=5.881912, MSE=0.591491
##
## Node number 6: 93 observations
##   mean=5.473118, MSE=1.066482
##
## Node number 7: 1184 observations,    complexity param=0.01655575
##   mean=6.499155, MSE=0.7432425
##   left son=14 (611 obs) right son=15 (573 obs)
##   Primary splits:
##       alcohol        < 11.85    to the left,  improve=0.05907511, (0 missing)
##       fixed.acidity  < 7.35     to the right, improve=0.04400660, (0 missing)
##       density        < 0.991395 to the right, improve=0.02522410, (0 missing)
##       residual.sugar < 1.225    to the left,  improve=0.02503936, (0 missing)
##       pH             < 3.245    to the left,  improve=0.02417936, (0 missing)
##   Surrogate splits:
##       density              < 0.991115 to the right, agree=0.710, adj=0.401, (0 split)
##       volatile.acidity     < 0.2675   to the left,  agree=0.665, adj=0.307, (0 split)
##       chlorides            < 0.0365   to the right, agree=0.631, adj=0.237, (0 split)
##       total.sulfur.dioxide < 126.5    to the right, agree=0.566, adj=0.103, (0 split)
##       residual.sugar       < 1.525    to the left,  agree=0.560, adj=0.091, (0 split)
##
## Node number 8: 182 observations
##   mean=4.994505, MSE=0.5109588
##
## Node number 9: 1224 observations
##   mean=5.463235, MSE=0.5002823
##
## Node number 14: 611 observations
##   mean=6.296236, MSE=0.7322117
##
## Node number 15: 573 observations
##   mean=6.715532, MSE=0.6642788
Although the tree can be understood using only the preceding output, it is often more readily understood with a visualization. The rpart.plot package by Stephen Milborrow provides an easy-to-use function for visualizing the tree.
#install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(m.rpart, digits = 3)
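The rpart.plot() function accepts several display options; the settings below are one illustrative combination, not the only sensible choice:
# an alternative rendering: fallen.leaves aligns the leaf nodes at the bottom,
# type = 3 labels both branches of each split, and extra = 101 adds the
# number and percentage of observations in each node
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)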
To use the regression tree model to make predictions on the test data, we use the predict() function.
# generate predictions for the testing dataset
p.rpart <- predict(m.rpart, wine_test)
# compare the distribution of predicted values vs. actual values
summary(p.rpart)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   4.995   5.463   5.882   5.999   6.296   6.716
summary(wine_test$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   5.000   6.000   5.848   6.000   8.000
This finding suggests that the model is not correctly identifying the extreme cases, in particular the best and worst wines. Between the first and third quartiles, however, the model appears to be doing well.
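A quick way to quantify the overall agreement between predicted and actual values is their correlation; a small sketch:
# correlation between predicted and actual quality ratings
cor(p.rpart, wine_test$quality)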
Another way to think about the model’s performance is to consider how far, on average, its prediction was from the true value. This measurement is called the mean absolute error (MAE): the mean of |actual - predicted| across all predictions.
# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
# mean absolute error between predicted and actual values
MAE(p.rpart, wine_test$quality)
## [1] 0.5732104
We can compare MAE values across multiple models (decision trees, linear regression, and other prediction methods) to choose the best performer.
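For instance, a useful reference point is a naive baseline that predicts the mean training-set quality for every wine; a small sketch reusing the MAE() function defined above (mean_quality is an illustrative variable name):
# naive baseline: predict the mean training-set quality for every wine
mean_quality <- mean(wine_train$quality)
MAE(wine_test$quality, mean_quality)
A regression tree that cannot beat this constant-prediction baseline adds little value.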