- Decision Trees
- Collecting/Importing Data
- Exploring and Preparing the Data
- Partition data into training and test datasets
- Training a Decision Tree Model
- Evaluating Model Performance
- Improving Decision Tree Accuracy
- Regression Trees
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured for each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, we can develop a prediction model to distinguish the species from one another.
library(tidyverse)
iris <- read_csv("iris.csv")
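Before splitting, we can take a quick look at the data. This is a small sketch; it assumes the class column in iris.csv is named species, matching the column used later in this tutorial.
# examine the structure of the data
str(iris)
# counts of each species
table(iris$species)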
We will split our data into two portions:
- a training dataset to build the decision tree (130 records)
- a test dataset to evaluate the model performance (20 records)
RNGversion("3.5.2")
set.seed(111) # use set.seed to use the same random number sequence as the tutorial
train_sample <- sample(150, 130)
# split the data frames
iris_train <- iris[train_sample, ]
iris_test <- iris[-train_sample, ]
Then we can check whether the split training and testing datasets are balanced.
# check the proportion of the class variable
prop.table(table(iris_train$species))
##
##     Iris-setosa Iris-versicolor  Iris-virginica
##       0.3461538       0.3384615       0.3153846
prop.table(table(iris_test$species))
##
##     Iris-setosa Iris-versicolor  Iris-virginica
##            0.25            0.30            0.45
# load the decision tree algorithm package
# install.packages("C50")
library(C50)
# C5.0 requires the outcome variable to be a factor
iris_train$species <- as.factor(iris_train$species)
iris_test$species <- as.factor(iris_test$species)
We will use the Training dataset to build a decision tree model.
# train a decision tree model
# use only the first 4 columns as predictors (column 5 is the outcome)
# use the species variable as the dependent variable
iris_model <- C5.0(iris_train[-5], iris_train$species)
summary(iris_model)
plot(iris_model)
Node 5: If petal length is between 1.9 and 4.9 and petal width is less than 1.7, the flower is predicted to be Iris-versicolor. The training misclassification rate for this node is 1/37.
Node 6: If petal length is greater than 4.9 and petal width is less than 1.7, the flower is predicted to be Iris-virginica. The training misclassification rate for this node is 2/6.
The overall misclassification rate on the training data is 3.1%.
To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:
iris_pred <- predict(iris_model, iris_test)
Then we can see how well the model did on the testing data. That is, what is the misclassification rate for the test dataset?
# cross tabulation of predicted versus actual classes
# install.packages("gmodels")
library(gmodels)
CrossTable(iris_test$species, iris_pred,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual species', 'predicted species'))
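To answer that question directly, we can also compute the test misclassification rate in a single line; a quick sketch using the objects created above:
# proportion of test records where the prediction disagrees with the actual species
mean(iris_pred != iris_test$species)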
Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while the others were approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory.
In this section, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.
The idea behind our credit model is to identify factors that are predictive of a higher risk of default.
We will first download and import the credit data, which contains information on loans obtained from a credit agency in Germany.
The credit dataset includes 1,000 examples of loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default (a failure to pay back the principal and interest). Let’s see whether we can determine any patterns that predict this outcome.
# load data
library(tidyverse)
credit <- read_csv("credit.csv")
We can first take a quick look at the dataset.
str(credit)
## spc_tbl_ [1,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ checking_balance    : chr [1:1000] "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
##  $ months_loan_duration: num [1:1000] 6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : chr [1:1000] "critical" "good" "critical" "good" ...
##  $ purpose             : chr [1:1000] "furniture/appliances" "furniture/appliances" "education" "furniture/appliances" ...
##  $ amount              : num [1:1000] 1169 5951 2096 7882 4870 ...
##  $ savings_balance     : chr [1:1000] "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
##  $ employment_duration : chr [1:1000] "> 7 years" "1 - 4 years" "4 - 7 years" "4 - 7 years" ...
##  $ percent_of_income   : num [1:1000] 4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : num [1:1000] 4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : num [1:1000] 67 22 49 45 53 35 53 35 61 28 ...
##  $ other_credit        : chr [1:1000] "none" "none" "none" "none" ...
##  $ housing             : chr [1:1000] "own" "own" "own" "other" ...
##  $ existing_loans_count: num [1:1000] 2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : chr [1:1000] "skilled" "skilled" "unskilled" "skilled" ...
##  $ dependents          : num [1:1000] 1 1 2 2 2 2 1 1 1 1 ...
##  $ phone               : chr [1:1000] "yes" "no" "no" "no" ...
##  $ default             : chr [1:1000] "no" "yes" "no" "no" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   checking_balance = col_character(),
##   ..   months_loan_duration = col_double(),
##   ..   credit_history = col_character(),
##   ..   purpose = col_character(),
##   ..   amount = col_double(),
##   ..   savings_balance = col_character(),
##   ..   employment_duration = col_character(),
##   ..   percent_of_income = col_double(),
##   ..   years_at_residence = col_double(),
##   ..   age = col_double(),
##   ..   other_credit = col_character(),
##   ..   housing = col_character(),
##   ..   existing_loans_count = col_double(),
##   ..   job = col_character(),
##   ..   dependents = col_double(),
##   ..   phone = col_character(),
##   ..   default = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default.
# look at two characteristics of the applicant.
#Note that since the loan data was obtained from Germany, the currency is recorded in Deutsche Marks (DM).
table(credit$checking_balance)
##
##     < 0 DM   > 200 DM 1 - 200 DM    unknown
##        274         63        269        394
table(credit$savings_balance)
##
##      < 100 DM     > 1000 DM  100 - 500 DM 500 - 1000 DM       unknown
##           603            48           103            63           183
Some of the loan’s features are numeric, such as its duration and the amount of credit requested:
summary(credit$months_loan_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     4.0    12.0    18.0    20.9    24.0    72.0
summary(credit$amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     250    1366    2320    3271    3972   18424
The default variable indicates the outcome: whether the loan applicant was unable to meet the agreed payment terms.
table(credit$default) #30% went into default
##
##  no yes
## 700 300
We will split our data into two portions:
- a training dataset to build the decision tree (70%)
- a test dataset to evaluate the model performance (30%)
RNGversion("3.5.2") # use an older random number generator to match the book
set.seed(230) # use set.seed to use the same random number sequence as the tutorial
train_sample <- sample(1000, 700)
# split the data frames
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]
Then we can check whether the split training and testing datasets are balanced.
# check the proportion of the class variable
prop.table(table(credit_train$default))
##
##        no       yes
## 0.6971429 0.3028571
prop.table(table(credit_test$default))
##
##        no       yes
## 0.7066667 0.2933333
This appears to be a fairly even split, so we can now build our decision tree.
#install.packages("C50")
library(C50)
The 17th column in credit_train is the default class variable, so we need to exclude it from the training data frame.
# the algorithm requires the outcome variable to be a factor
credit_train$default <- as.factor(credit_train$default)
credit_test$default <- as.factor(credit_test$default)
# train a decision tree model
credit_model <- C5.0(credit_train[-17], credit_train$default)
Then we can see the detailed branches of the decision tree with summary().
# display detailed information about the tree
summary(credit_model)
From the summary results, the first lines of the tree can be expressed in plain language as follows:
If the checking account balance is unknown or greater than 200 DM, then classify as “not likely to default.”
The misclassification rate for this node is 44/320.
The overall misclassification rate on the training data is 13.6%.
The key predictor is checking_balance.
We definitely need to prune the tree!
plot(credit_model)
To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:
# create a factor vector of predictions on the test data
credit_pred <- predict(credit_model, credit_test)
Then we can see how well the model did on the testing data.
# cross tabulation of predicted versus actual classes
#install.packages("gmodels")
library(gmodels)
CrossTable(credit_test$default, credit_pred,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
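As with the iris example, the cross table can be summarized with a single number; a quick sketch:
# overall misclassification rate on the test set
mean(credit_pred != credit_test$default)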
We can simplify the tree by increasing the minimum number of cases required in the leaf nodes:
credit_model2 <- C5.0(credit_train[-17], credit_train$default,
control = C5.0Control(minCases = 20))
plot(credit_model2)
credit_pred2 <- predict(credit_model2, credit_test)
CrossTable(credit_test$default, credit_pred2,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
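To compare the two trees numerically, we can compute each model's test error rate side by side; a small sketch reusing the predictions computed above:
# test-set misclassification rates for the two trees
mean(credit_pred != credit_test$default)   # original tree
mean(credit_pred2 != credit_test$default)  # simplified tree (minCases = 20)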
When comparing models, we mainly refer to the errors on the testing data.
Winemaking is a challenging and competitive business that offers the potential for great profit. However, numerous factors contribute to the profitability of a winery. Variables such as the weather, the growing environment, the manufacturing and bottling process, the bottle design, and even the price point can affect the customer’s perception of taste.
More recently, machine learning has been employed to assist with rating the quality of wine—a notoriously difficult task. A review written by a renowned wine critic often determines whether the product ends up on the top or bottom shelf.
In this case study, we will use regression trees and model trees to create a system capable of mimicking expert ratings of wine. Computer-aided wine testing may therefore result in a better product as well as more objective, consistent, and fair ratings.
We will first download and import the whitewine data, which includes examples of white Vinho Verde wines from Portugal—one of the world’s leading wine-producing countries.
wine <- read.csv("winequality.csv")
The white wine data includes information on 11 chemical properties of 4,898 wine samples.
# examine the wine data
str(wine)
# the distribution of quality ratings
hist(wine$quality)
Our last step is to divide the data into training and testing datasets. Since the wine dataset was already sorted into random order, we can simply partition it into two contiguous sets: a 75% training dataset and a 25% test dataset.
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
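As with the earlier examples, it is worth verifying that the two partitions look similar; a quick sketch:
# sanity check: the partitions should have similar quality distributions
summary(wine_train$quality)
summary(wine_test$quality)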
We will use the rpart (recursive partitioning) package, which offers the most faithful implementation of regression trees as they were described by the CART team.
#install.packages("rpart")
library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)
summary(m.rpart)
## Call:
## rpart(formula = quality ~ ., data = wine_train)
##   n= 3750
##
##           CP nsplit rel error    xerror       xstd
## 1 0.17816211      0 1.0000000 1.0005363 0.02388663
## 2 0.04439109      1 0.8218379 0.8227527 0.02238832
## 3 0.02890893      2 0.7774468 0.7879479 0.02210669
## 4 0.01655575      3 0.7485379 0.7615509 0.02098066
## 5 0.01108600      4 0.7319821 0.7490110 0.02062154
## 6 0.01000000      5 0.7208961 0.7464506 0.02056677
##
## Variable importance
##              alcohol              density            chlorides
##                   38                   23                   12
##     volatile.acidity total.sulfur.dioxide  free.sulfur.dioxide
##                   12                    7                    6
##            sulphates                   pH       residual.sugar
##                    1                    1                    1
##
## Node number 1: 3750 observations,    complexity param=0.1781621
##   mean=5.886933, MSE=0.8373493
##   left son=2 (2473 obs) right son=3 (1277 obs)
##   Primary splits:
##       alcohol              < 10.85    to the left,  improve=0.17816210, (0 missing)
##       density              < 0.992385 to the right, improve=0.11980970, (0 missing)
##       chlorides            < 0.0395   to the right, improve=0.08199995, (0 missing)
##       total.sulfur.dioxide < 153.5    to the right, improve=0.03875440, (0 missing)
##       free.sulfur.dioxide  < 11.75    to the left,  improve=0.03632119, (0 missing)
##   Surrogate splits:
##       density              < 0.99201  to the right, agree=0.869, adj=0.614, (0 split)
##       chlorides            < 0.0375   to the right, agree=0.773, adj=0.334, (0 split)
##       total.sulfur.dioxide < 102.5    to the right, agree=0.705, adj=0.132, (0 split)
##       sulphates            < 0.345    to the right, agree=0.670, adj=0.031, (0 split)
##       fixed.acidity        < 5.25     to the right, agree=0.662, adj=0.009, (0 split)
##
## Node number 2: 2473 observations,    complexity param=0.04439109
##   mean=5.609381, MSE=0.6108623
##   left son=4 (1406 obs) right son=5 (1067 obs)
##   Primary splits:
##       volatile.acidity    < 0.2425 to the right, improve=0.09227123, (0 missing)
##       free.sulfur.dioxide < 13.5   to the left,  improve=0.04177240, (0 missing)
##       alcohol             < 10.15  to the left,  improve=0.03313802, (0 missing)
##       citric.acid         < 0.205  to the left,  improve=0.02721200, (0 missing)
##       pH                  < 3.325  to the left,  improve=0.01860335, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 111.5 to the right, agree=0.610, adj=0.097, (0 split)
##       pH                   < 3.295 to the left,  agree=0.598, adj=0.067, (0 split)
##       alcohol              < 10.05 to the left,  agree=0.590, adj=0.049, (0 split)
##       sulphates            < 0.715 to the left,  agree=0.584, adj=0.037, (0 split)
##       residual.sugar       < 1.85  to the right, agree=0.581, adj=0.029, (0 split)
##
## Node number 3: 1277 observations,    complexity param=0.02890893
##   mean=6.424432, MSE=0.8378682
##   left son=6 (93 obs) right son=7 (1184 obs)
##   Primary splits:
##       free.sulfur.dioxide  < 11.5  to the left,  improve=0.08484051, (0 missing)
##       alcohol              < 11.85 to the left,  improve=0.06149941, (0 missing)
##       fixed.acidity        < 7.35  to the right, improve=0.04259695, (0 missing)
##       residual.sugar       < 1.275 to the left,  improve=0.02795662, (0 missing)
##       total.sulfur.dioxide < 67.5  to the left,  improve=0.02541719, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 48.5 to the left, agree=0.937, adj=0.14, (0 split)
##
## Node number 4: 1406 observations,    complexity param=0.011086
##   mean=5.40256, MSE=0.526423
##   left son=8 (182 obs) right son=9 (1224 obs)
##   Primary splits:
##       volatile.acidity     < 0.4225 to the right, improve=0.04703189, (0 missing)
##       free.sulfur.dioxide  < 17.5   to the left,  improve=0.04607770, (0 missing)
##       total.sulfur.dioxide < 86.5   to the left,  improve=0.02894310, (0 missing)
##       alcohol              < 10.25  to the left,  improve=0.02890077, (0 missing)
##       chlorides            < 0.0455 to the right, improve=0.02096635, (0 missing)
##   Surrogate splits:
##       density       < 0.99107 to the left,  agree=0.874, adj=0.027, (0 split)
##       citric.acid   < 0.11    to the left,  agree=0.873, adj=0.022, (0 split)
##       fixed.acidity < 9.85    to the right, agree=0.873, adj=0.016, (0 split)
##       chlorides     < 0.206   to the right, agree=0.871, adj=0.005, (0 split)
##
## Node number 5: 1067 observations
##   mean=5.881912, MSE=0.591491
##
## Node number 6: 93 observations
##   mean=5.473118, MSE=1.066482
##
## Node number 7: 1184 observations,    complexity param=0.01655575
##   mean=6.499155, MSE=0.7432425
##   left son=14 (611 obs) right son=15 (573 obs)
##   Primary splits:
##       alcohol        < 11.85    to the left,  improve=0.05907511, (0 missing)
##       fixed.acidity  < 7.35     to the right, improve=0.04400660, (0 missing)
##       density        < 0.991395 to the right, improve=0.02522410, (0 missing)
##       residual.sugar < 1.225    to the left,  improve=0.02503936, (0 missing)
##       pH             < 3.245    to the left,  improve=0.02417936, (0 missing)
##   Surrogate splits:
##       density              < 0.991115 to the right, agree=0.710, adj=0.401, (0 split)
##       volatile.acidity     < 0.2675   to the left,  agree=0.665, adj=0.307, (0 split)
##       chlorides            < 0.0365   to the right, agree=0.631, adj=0.237, (0 split)
##       total.sulfur.dioxide < 126.5    to the right, agree=0.566, adj=0.103, (0 split)
##       residual.sugar       < 1.525    to the left,  agree=0.560, adj=0.091, (0 split)
##
## Node number 8: 182 observations
##   mean=4.994505, MSE=0.5109588
##
## Node number 9: 1224 observations
##   mean=5.463235, MSE=0.5002823
##
## Node number 14: 611 observations
##   mean=6.296236, MSE=0.7322117
##
## Node number 15: 573 observations
##   mean=6.715532, MSE=0.6642788
Although the tree can be understood using only the preceding output, it is often more readily understood with a visualization. The rpart.plot package by Stephen Milborrow provides an easy-to-use function for visualizing the tree.
#install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(m.rpart, digits = 3)
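The rpart.plot() function accepts several display options; the settings below are one illustrative combination, not the only sensible choice:
# an alternative rendering: fallen.leaves aligns the leaf nodes at the bottom,
# type = 3 labels both branches of each split, and extra = 101 adds the
# number and percentage of observations in each node
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)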
To use the regression tree model to make predictions on the test data, we use the predict() function.
# generate predictions for the testing dataset
p.rpart <- predict(m.rpart, wine_test)
# compare the distribution of predicted values vs. actual values
summary(p.rpart)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   4.995   5.463   5.882   5.999   6.296   6.716
summary(wine_test$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   5.000   6.000   5.848   6.000   8.000
This finding suggests that the model is not correctly identifying the extreme cases, in particular the best and worst wines. Between the first and third quartiles, however, the model appears to be doing well.
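A quick way to quantify the overall agreement between predicted and actual values is their correlation; a small sketch:
# correlation between predicted and actual quality ratings
cor(p.rpart, wine_test$quality)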
Another way to think about the model’s performance is to consider how far, on average, its prediction was from the true value. This measurement is called the mean absolute error (MAE): the mean of |actual - predicted| across all predictions.
# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
# mean absolute error between predicted and actual values
MAE(p.rpart, wine_test$quality)
## [1] 0.5732104
We can compare MAE values across multiple models (decision trees, linear regression, and other prediction methods) to choose the best performer.
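For instance, a useful reference point is a naive baseline that predicts the mean training-set quality for every wine; a small sketch reusing the MAE() function defined above (mean_quality is an illustrative variable name):
# naive baseline: predict the mean training-set quality for every wine
mean_quality <- mean(wine_train$quality)
MAE(wine_test$quality, mean_quality)
A regression tree that cannot beat this constant-prediction baseline adds little value.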