Agenda

  1. Decision Trees
  2. Collecting/Importing Data
  3. Exploring and Preparing the Data
    • Partition data into training and test datasets
  4. Training a Decision Tree Model
  5. Evaluating Model Performance
  6. Improving Decision Tree Accuracy
  7. Regression Trees

The C5.0 Decision Tree Algorithm

There are numerous implementations of decision trees, but one of the best known is the C5.0 algorithm.

The C5.0 algorithm has become the industry standard for producing decision trees because it performs well on most types of problems directly out of the box.

Case Study - Identify Risky Bank Loans with Decision Trees

  • Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while others were approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory.

  • In this section, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.

Step 1: Load data

The idea behind our credit model is to identify factors that predict a higher risk of default.

We will first download the credit data and import it into a data frame named credit. It contains information on loans obtained from a credit agency in Germany.

The credit dataset includes 1,000 examples of loans, plus a set of numeric and nominal features describing the loan and the loan applicant. A class variable indicates whether the loan went into default (that is, the borrower failed to repay the principal and interest). Let’s see whether we can determine any patterns that predict this outcome.

    # load data
    library(tidyverse)
    credit <- read_csv("credit.csv")

Step 2: Exploring and Preparing Data

We can first take a quick look at the dataset.

    str(credit)
## spc_tbl_ [1,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ checking_balance    : chr [1:1000] "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
##  $ months_loan_duration: num [1:1000] 6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : chr [1:1000] "critical" "good" "critical" "good" ...
##  $ purpose             : chr [1:1000] "furniture/appliances" "furniture/appliances" "education" "furniture/appliances" ...
##  $ amount              : num [1:1000] 1169 5951 2096 7882 4870 ...
##  $ savings_balance     : chr [1:1000] "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
##  $ employment_duration : chr [1:1000] "> 7 years" "1 - 4 years" "4 - 7 years" "4 - 7 years" ...
##  $ percent_of_income   : num [1:1000] 4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : num [1:1000] 4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : num [1:1000] 67 22 49 45 53 35 53 35 61 28 ...
##  $ other_credit        : chr [1:1000] "none" "none" "none" "none" ...
##  $ housing             : chr [1:1000] "own" "own" "own" "other" ...
##  $ existing_loans_count: num [1:1000] 2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : chr [1:1000] "skilled" "skilled" "unskilled" "skilled" ...
##  $ dependents          : num [1:1000] 1 1 2 2 2 2 1 1 1 1 ...
##  $ phone               : chr [1:1000] "yes" "no" "no" "no" ...
##  $ default             : chr [1:1000] "no" "yes" "no" "no" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   checking_balance = col_character(),
##   ..   months_loan_duration = col_double(),
##   ..   credit_history = col_character(),
##   ..   purpose = col_character(),
##   ..   amount = col_double(),
##   ..   savings_balance = col_character(),
##   ..   employment_duration = col_character(),
##   ..   percent_of_income = col_double(),
##   ..   years_at_residence = col_double(),
##   ..   age = col_double(),
##   ..   other_credit = col_character(),
##   ..   housing = col_character(),
##   ..   existing_loans_count = col_double(),
##   ..   job = col_character(),
##   ..   dependents = col_double(),
##   ..   phone = col_character(),
##   ..   default = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Step 2: Exploring and Preparing Data

Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default.

    # look at two characteristics of the applicant. 
    table(credit$checking_balance)
## 
##     < 0 DM   > 200 DM 1 - 200 DM    unknown 
##        274         63        269        394
    table(credit$savings_balance) 
## 
##      < 100 DM     > 1000 DM  100 - 500 DM 500 - 1000 DM       unknown 
##           603            48           103            63           183
    # Note: the data are from Germany, so the currency unit is the Deutsche Mark (DM).

Step 2: Exploring and Preparing Data

Some of the loan’s features are numeric, such as its duration and the amount of credit requested:

  summary(credit$months_loan_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    18.0    20.9    24.0    72.0
  summary(credit$amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424

The default variable records the outcome: whether the loan applicant was unable to meet the agreed payment terms and the loan went into default.

table(credit$default) #30% went into default
## 
##  no yes 
## 700 300

Splitting data into training and testing datasets

We will split our data into two portions:

  • a training dataset to build the decision tree (90%)

  • a test dataset to evaluate the model performance (10%)

RNGversion("3.5.2") # use an older random number generator to match the book
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
set.seed(123) # use set.seed to use the same random number sequence as the tutorial
train_sample <- sample(1000, 900)

The resulting train_sample is a vector of 900 random integers, which we can use as row indices to split the data into training and test datasets.

str(train_sample)
##  int [1:900] 288 788 409 881 937 46 525 887 548 453 ...
# split the data frames
credit_train <- credit[train_sample, ]
credit_test  <- credit[-train_sample, ]
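
The indexing trick above can be checked on a small scale. The following is a minimal sketch using a made-up 10-row data frame (the names toy, train_idx, toy_train, and toy_test are illustrative and not part of the original analysis). It shows that positive indices keep the sampled rows, negative indices drop them, and the two pieces never overlap.

```r
# Toy stand-in for the credit data: 10 rows with an id and a class label
toy <- data.frame(id = 1:10,
                  default = rep(c("no", "yes"), times = c(7, 3)))

set.seed(123)                                   # make the random draw reproducible
n_train   <- round(0.9 * nrow(toy))             # 90% of the rows, here 9
train_idx <- sample(nrow(toy), n_train)         # 9 distinct row numbers

toy_train <- toy[train_idx, ]    # keep the rows whose numbers were drawn
toy_test  <- toy[-train_idx, ]   # negative indices drop those rows, keeping the rest

# The two pieces are disjoint and together cover every row
nrow(toy_train)                               # 9
nrow(toy_test)                                # 1
length(intersect(toy_train$id, toy_test$id))  # 0
```

Because sample() draws without replacement by default, no row can land in both datasets.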

Splitting data into training and testing datasets

Then we can check whether the training and test datasets have similar proportions of defaulted loans.

# check the proportion of class variable
prop.table(table(credit_train$default))
## 
##        no       yes 
## 0.7033333 0.2966667
prop.table(table(credit_test$default))
## 
##   no  yes 
## 0.67 0.33

Both datasets contain roughly 30 percent defaulted loans, so this appears to be a fairly even split, and we can now build our decision tree.

Step 3: Training a Decision Tree Model

We will use the C5.0 algorithm in the C50 package to train our decision tree model. If the package is not yet installed, install it first, then load it.

#install.packages("C50")
library(C50)
## Warning: package 'C50' was built under R version 4.4.3

Train a Decision Tree Model

For the first iteration of our credit approval model, we’ll use the default C5.0 configuration, as shown in the following code.

The 17th column in credit_train is the default class variable, so we exclude it from the data frame of predictors passed as the first argument and supply it separately as the target.

# C5.0() requires the outcome (class) variable to be a factor
credit_train$default <- as.factor(credit_train$default)
credit_test$default <- as.factor(credit_test$default)
# train a decision tree model
credit_model <- C5.0(credit_train[-17], credit_train$default)
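
Excluding the class column by position (-17) works only as long as the column order never changes. As a hedged alternative, the sketch below drops the column by name on a small made-up data frame (the toy values and the object name predictors are assumptions for illustration only). The same pattern could be applied to credit_train.

```r
# Toy stand-in for credit_train; column names mirror the real data
toy_train <- data.frame(
  amount  = c(1169, 5951, 2096, 7882),
  housing = c("own", "own", "own", "other"),
  default = c("no", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Convert the outcome to a factor, as required for the class variable
toy_train$default <- as.factor(toy_train$default)

# Select the predictors by name instead of by column position
predictors <- toy_train[, setdiff(names(toy_train), "default"), drop = FALSE]

names(predictors)          # "amount" "housing"
levels(toy_train$default)  # "no" "yes"
```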

Interpreting a Decision Tree Model

Printing the model object displays some basic facts about the tree, and summary() shows its detailed branches.

# display simple facts about the tree
credit_model
## 
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 16 
## 
## Tree size: 57 
## 
## Non-standard options: attempt to group attributes
# display detailed information about the tree
summary(credit_model)
## 
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr  1 18:24:23 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 900 cases (17 attributes) from undefined.data
## 
## Decision tree:
## 
## checking_balance in {unknown,> 200 DM}: no (412/50)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history in {perfect,very good}: yes (59/18)
##     credit_history in {poor,critical,good}:
##     :...months_loan_duration <= 22:
##         :...credit_history = critical: no (72/14)
##         :   credit_history = poor:
##         :   :...dependents > 1: no (5)
##         :   :   dependents <= 1:
##         :   :   :...years_at_residence <= 3: yes (4/1)
##         :   :       years_at_residence > 3: no (5/1)
##         :   credit_history = good:
##         :   :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (15/1)
##         :       savings_balance = 100 - 500 DM:
##         :       :...other_credit in {none,store}: no (9/2)
##         :       :   other_credit = bank: yes (3)
##         :       savings_balance = unknown:
##         :       :...other_credit in {none,store}: no (21/8)
##         :       :   other_credit = bank: yes (1)
##         :       savings_balance = < 100 DM:
##         :       :...purpose in {car0,business,renovations}: no (8/2)
##         :           purpose = education:
##         :           :...checking_balance = 1 - 200 DM: no (1)
##         :           :   checking_balance = < 0 DM: yes (4)
##         :           purpose = car:
##         :           :...employment_duration = unemployed: no (4/1)
##         :           :   employment_duration = > 7 years: yes (5)
##         :           :   employment_duration = 4 - 7 years:
##         :           :   :...amount <= 1680: yes (2)
##         :           :   :   amount > 1680: no (3)
##         :           :   employment_duration = 1 - 4 years:
##         :           :   :...years_at_residence <= 2: yes (2)
##         :           :   :   years_at_residence > 2: no (6/1)
##         :           :   employment_duration = < 1 year:
##         :           :   :...years_at_residence <= 2: yes (5)
##         :           :       years_at_residence > 2: no (3/1)
##         :           purpose = furniture/appliances:
##         :           :...job in {management,unskilled}: no (23/3)
##         :               job = unemployed: yes (1)
##         :               job = skilled:
##         :               :...months_loan_duration > 13: [S1]
##         :                   months_loan_duration <= 13:
##         :                   :...housing in {other,own}: no (23/4)
##         :                       housing = rent:
##         :                       :...percent_of_income <= 3: yes (3)
##         :                           percent_of_income > 3: no (2)
##         months_loan_duration > 22:
##         :...savings_balance = 500 - 1000 DM: yes (4/1)
##             savings_balance = > 1000 DM: no (2)
##             savings_balance = 100 - 500 DM:
##             :...credit_history in {poor,critical}: no (14/3)
##             :   credit_history = good:
##             :   :...other_credit in {none,store}: yes (12/2)
##             :       other_credit = bank: no (1)
##             savings_balance = unknown:
##             :...checking_balance = 1 - 200 DM: no (17)
##             :   checking_balance = < 0 DM:
##             :   :...credit_history = critical: no (1)
##             :       credit_history in {poor,good}: yes (12/3)
##             savings_balance = < 100 DM:
##             :...months_loan_duration > 47: yes (21/2)
##                 months_loan_duration <= 47:
##                 :...housing = other:
##                     :...percent_of_income <= 2: no (6)
##                     :   percent_of_income > 2: yes (9/3)
##                     housing = rent:
##                     :...other_credit in {none,store}: yes (16/3)
##                     :   other_credit = bank: no (1)
##                     housing = own:
##                     :...employment_duration = > 7 years: no (13/4)
##                         employment_duration = unemployed:
##                         :...years_at_residence <= 2: yes (4)
##                         :   years_at_residence > 2: no (3)
##                         employment_duration = 4 - 7 years:
##                         :...job in {management,skilled,
##                         :   :       unemployed}: yes (9/1)
##                         :   job = unskilled: no (1)
##                         employment_duration = 1 - 4 years:
##                         :...purpose in {car0,business,education}: yes (7/1)
##                         :   purpose in {furniture/appliances,
##                         :   :           renovations}: no (7)
##                         :   purpose = car:
##                         :   :...years_at_residence <= 3: yes (3)
##                         :       years_at_residence > 3: no (3)
##                         employment_duration = < 1 year:
##                         :...years_at_residence > 3: yes (5)
##                             years_at_residence <= 3:
##                             :...other_credit = bank: no (0)
##                                 other_credit = store: yes (1)
##                                 other_credit = none:
##                                 :...checking_balance = 1 - 200 DM: no (8/2)
##                                     checking_balance = < 0 DM:
##                                     :...job in {management,skilled,
##                                         :       unemployed}: yes (2)
##                                         job = unskilled: no (3/1)
## 
## SubTree [S1]
## 
## employment_duration in {unemployed,> 7 years,1 - 4 years}: yes (10)
## employment_duration in {4 - 7 years,< 1 year}: no (4)
## 
## 
## Evaluation on training data (900 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      56  133(14.8%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     598    35    (a): class no
##      98   169    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% checking_balance
##   54.22% credit_history
##   47.67% months_loan_duration
##   38.11% savings_balance
##   14.33% purpose
##   14.33% housing
##   12.56% employment_duration
##    9.00% job
##    8.67% other_credit
##    6.33% years_at_residence
##    2.22% percent_of_income
##    1.56% dependents
##    0.56% amount
## 
## 
## Time: 0.0 secs

Interpreting a Decision Tree Model

The preceding output shows some of the first branches in the decision tree. The first three lines could be represented in plain language as:

  1. If the checking account balance is unknown or greater than 200 DM, then classify as “not likely to default.”

  2. Otherwise, if the checking account balance is less than zero DM or between one and 200 DM...

  3. ...and the credit history is perfect or very good, then classify as “likely to default.”

Interpreting a Decision Tree Model

  • The numbers in parentheses indicate the number of examples meeting the criteria for that decision, and the number incorrectly classified by the decision.

  • For instance, on the first line, 412/50 indicates that of the 412 examples reaching the decision, 50 were incorrectly classified as not likely to default. In other words, 50 of these applicants actually defaulted.

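  • Sometimes a tree results in decisions that make little logical sense. For example, applicants with a perfect or very good credit history are classified here as likely to default. Such decisions might reflect a real pattern in the data, or they may be a statistical anomaly.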
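
The counts in parentheses can be turned into a per-leaf error rate. The following minimal sketch parses the annotations of the first three leaves shown in the summary() output. The object names (leaves, counts, n_cases, n_errors) are illustrative, and the parsing applies only to leaves printed in the n/m form; leaves with no errors, such as (5), are printed without a slash and would need separate handling.

```r
# Leaf annotations copied from the first three leaves of the tree above
leaves <- c("no (412/50)", "yes (59/18)", "no (72/14)")

# Extract "cases reaching the leaf / cases misclassified" from each annotation
counts <- regmatches(leaves, regexpr("[0-9]+/[0-9]+", leaves))
parts  <- do.call(rbind, strsplit(counts, "/", fixed = TRUE))

n_cases  <- as.numeric(parts[, 1])
n_errors <- as.numeric(parts[, 2])

# Share of training cases in each leaf that the leaf's label gets wrong
round(n_errors / n_cases, 3)   # 0.121 0.305 0.194
```

Note that the leaf for perfect or very good credit history is wrong for about 30 percent of the training cases that reach it, which is one reason to treat that rule with caution.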

Evaluating Model Performance

Decision trees have a tendency to overfit the training data. For this reason, the error rate reported on the training data (14.8 percent here) may be overly optimistic, and it is especially important to evaluate decision trees on a test dataset.

To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:

# create a factor vector of predictions on test data
credit_pred <- predict(credit_model, credit_test)

Evaluating Testing Performance

Then we can see how well the model did on the test data. A model’s performance is often worse on unseen data.

# cross tabulation of predicted versus actual classes
#install.packages("gmodels")
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.4.3
CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, 
           prop.c = FALSE, 
           prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        59 |         8 |        67 | 
##                |     0.590 |     0.080 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |        19 |        14 |        33 | 
##                |     0.190 |     0.140 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        78 |        22 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 
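
The cell counts in the table can be summarized as a few rates. The sketch below re-enters the four counts by hand (the object names conf, accuracy, error_rate, and default_recall are illustrative) and computes the overall accuracy, the error rate, and the share of actual defaults that the model flagged.

```r
# Counts copied from the CrossTable output above (rows = actual, columns = predicted)
conf <- matrix(c(59,  8,
                 19, 14),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual    = c("no", "yes"),
                               predicted = c("no", "yes")))

accuracy   <- sum(diag(conf)) / sum(conf)   # correct predictions / all test cases
error_rate <- 1 - accuracy

# Of the loans that actually defaulted, what share did the model predict as "yes"?
default_recall <- conf["yes", "yes"] / sum(conf["yes", ])

accuracy                  # 0.73
error_rate                # 0.27
round(default_recall, 3)  # 0.424
```

The test error rate is noticeably higher than the 14.8 percent reported on the training data, which is consistent with the overfitting concern raised earlier.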

Improving Model Performance

On the test data, the model misclassified 27 of the 100 loans and identified only 14 of the 33 actual defaults. An error rate this high is likely unacceptable for a real-time credit scoring application.

Boosting: One way the C5.0 algorithm can improve accuracy is through adaptive boosting. This is a process in which many decision trees are built and the trees vote on the best class for each example.

The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add a trials parameter indicating the number of separate decision trees to use in the boosted team.
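
To make the voting idea concrete, here is a deliberately simplified sketch of plain majority voting over made-up predictions from five trees. It is an illustration only: the votes are hypothetical, and the real C5.0 boosting procedure may weight the trees rather than counting each vote equally.

```r
# Hypothetical class votes from 5 trees (rows) for 3 applicants (columns)
votes <- matrix(c("no",  "yes", "no",
                  "no",  "yes", "yes",
                  "yes", "yes", "no",
                  "no",  "no",  "no",
                  "no",  "yes", "yes"),
                nrow = 5, byrow = TRUE,
                dimnames = list(paste0("tree", 1:5),
                                paste0("applicant", 1:3)))

# For each applicant, pick the class that receives the most votes
majority <- apply(votes, 2, function(v) names(which.max(table(v))))
majority   # applicant1: "no", applicant2: "yes", applicant3: "no"
```

Using an odd number of trees in this toy example avoids ties between the two classes.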

# boosted decision tree with 10 trials
credit_boost10 <- C5.0(credit_train[-17], 
                       credit_train$default,
                       trials = 10)
credit_boost10
## 
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 16 
## 
## Number of boosting iterations: 10 
## Average tree size: 47.5 
## 
## Non-standard options: attempt to group attributes
summary(credit_boost10)
## 
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Apr  1 18:24:23 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 900 cases (17 attributes) from undefined.data
## 
## -----  Trial 0:  -----
## 
## Decision tree:
## 
## checking_balance in {unknown,> 200 DM}: no (412/50)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history in {perfect,very good}: yes (59/18)
##     credit_history in {poor,critical,good}:
##     :...months_loan_duration <= 22:
##         :...credit_history = critical: no (72/14)
##         :   credit_history = poor:
##         :   :...dependents > 1: no (5)
##         :   :   dependents <= 1:
##         :   :   :...years_at_residence <= 3: yes (4/1)
##         :   :       years_at_residence > 3: no (5/1)
##         :   credit_history = good:
##         :   :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (15/1)
##         :       savings_balance = 100 - 500 DM:
##         :       :...other_credit in {none,store}: no (9/2)
##         :       :   other_credit = bank: yes (3)
##         :       savings_balance = unknown:
##         :       :...other_credit in {none,store}: no (21/8)
##         :       :   other_credit = bank: yes (1)
##         :       savings_balance = < 100 DM:
##         :       :...purpose in {car0,business,renovations}: no (8/2)
##         :           purpose = education:
##         :           :...checking_balance = 1 - 200 DM: no (1)
##         :           :   checking_balance = < 0 DM: yes (4)
##         :           purpose = car:
##         :           :...employment_duration = unemployed: no (4/1)
##         :           :   employment_duration = > 7 years: yes (5)
##         :           :   employment_duration = 4 - 7 years:
##         :           :   :...amount <= 1680: yes (2)
##         :           :   :   amount > 1680: no (3)
##         :           :   employment_duration = 1 - 4 years:
##         :           :   :...years_at_residence <= 2: yes (2)
##         :           :   :   years_at_residence > 2: no (6/1)
##         :           :   employment_duration = < 1 year:
##         :           :   :...years_at_residence <= 2: yes (5)
##         :           :       years_at_residence > 2: no (3/1)
##         :           purpose = furniture/appliances:
##         :           :...job in {management,unskilled}: no (23/3)
##         :               job = unemployed: yes (1)
##         :               job = skilled:
##         :               :...months_loan_duration > 13: [S1]
##         :                   months_loan_duration <= 13:
##         :                   :...housing in {other,own}: no (23/4)
##         :                       housing = rent:
##         :                       :...percent_of_income <= 3: yes (3)
##         :                           percent_of_income > 3: no (2)
##         months_loan_duration > 22:
##         :...savings_balance = 500 - 1000 DM: yes (4/1)
##             savings_balance = > 1000 DM: no (2)
##             savings_balance = 100 - 500 DM:
##             :...credit_history in {poor,critical}: no (14/3)
##             :   credit_history = good:
##             :   :...other_credit in {none,store}: yes (12/2)
##             :       other_credit = bank: no (1)
##             savings_balance = unknown:
##             :...checking_balance = 1 - 200 DM: no (17)
##             :   checking_balance = < 0 DM:
##             :   :...credit_history = critical: no (1)
##             :       credit_history in {poor,good}: yes (12/3)
##             savings_balance = < 100 DM:
##             :...months_loan_duration > 47: yes (21/2)
##                 months_loan_duration <= 47:
##                 :...housing = other:
##                     :...percent_of_income <= 2: no (6)
##                     :   percent_of_income > 2: yes (9/3)
##                     housing = rent:
##                     :...other_credit in {none,store}: yes (16/3)
##                     :   other_credit = bank: no (1)
##                     housing = own:
##                     :...employment_duration = > 7 years: no (13/4)
##                         employment_duration = unemployed:
##                         :...years_at_residence <= 2: yes (4)
##                         :   years_at_residence > 2: no (3)
##                         employment_duration = 4 - 7 years:
##                         :...job in {management,skilled,
##                         :   :       unemployed}: yes (9/1)
##                         :   job = unskilled: no (1)
##                         employment_duration = 1 - 4 years:
##                         :...purpose in {car0,business,education}: yes (7/1)
##                         :   purpose in {furniture/appliances,
##                         :   :           renovations}: no (7)
##                         :   purpose = car:
##                         :   :...years_at_residence <= 3: yes (3)
##                         :       years_at_residence > 3: no (3)
##                         employment_duration = < 1 year:
##                         :...years_at_residence > 3: yes (5)
##                             years_at_residence <= 3:
##                             :...other_credit = bank: no (0)
##                                 other_credit = store: yes (1)
##                                 other_credit = none:
##                                 :...checking_balance = 1 - 200 DM: no (8/2)
##                                     checking_balance = < 0 DM:
##                                     :...job in {management,skilled,
##                                         :       unemployed}: yes (2)
##                                         job = unskilled: no (3/1)
## 
## SubTree [S1]
## 
## employment_duration in {unemployed,> 7 years,1 - 4 years}: yes (10)
## employment_duration in {4 - 7 years,< 1 year}: no (4)
## 
## -----  Trial 1:  -----
## 
## Decision tree:
## 
## checking_balance = unknown:
## :...other_credit in {bank,store}:
## :   :...purpose in {car0,furniture/appliances}: no (24.8/6.6)
## :   :   purpose in {business,education,renovations}: yes (19.5/6.3)
## :   :   purpose = car:
## :   :   :...dependents <= 1: yes (20.1/4.8)
## :   :       dependents > 1: no (2.4)
## :   other_credit = none:
## :   :...credit_history in {critical,perfect,very good}: no (102.8/4.4)
## :       credit_history = good:
## :       :...existing_loans_count <= 1: no (112.7/17.5)
## :       :   existing_loans_count > 1: yes (18.9/7.9)
## :       credit_history = poor:
## :       :...years_at_residence <= 1: yes (4.4)
## :           years_at_residence > 1:
## :           :...percent_of_income <= 3: no (11.9)
## :               percent_of_income > 3: yes (14.3/5.6)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (42.9/11.3)
##     savings_balance = unknown:
##     :...credit_history in {poor,perfect}: no (8.5)
##     :   credit_history in {critical,good,very good}:
##     :   :...employment_duration in {unemployed,> 7 years,4 - 7 years,
##     :       :                       < 1 year}: no (52.3/17.3)
##     :       employment_duration = 1 - 4 years: yes (19.7/5.6)
##     savings_balance = 100 - 500 DM:
##     :...existing_loans_count > 3: yes (3)
##     :   existing_loans_count <= 3:
##     :   :...credit_history in {poor,critical,very good}: no (24.6/7.6)
##     :       credit_history = perfect: yes (2.4)
##     :       credit_history = good:
##     :       :...months_loan_duration <= 27: no (23.7/10.5)
##     :           months_loan_duration > 27: yes (5.6)
##     savings_balance = < 100 DM:
##     :...months_loan_duration > 42: yes (28/5.2)
##         months_loan_duration <= 42:
##         :...percent_of_income <= 2:
##             :...employment_duration in {unemployed,4 - 7 years,
##             :   :                       1 - 4 years}: no (86.2/23.8)
##             :   employment_duration in {> 7 years,< 1 year}:
##             :   :...housing = other: no (4.8/1.6)
##             :       housing = rent: yes (10.7/2.4)
##             :       housing = own:
##             :       :...phone = yes: yes (12.9/4)
##             :           phone = no:
##             :           :...percent_of_income <= 1: no (7.1/0.8)
##             :               percent_of_income > 1: yes (17.5/7.1)
##             percent_of_income > 2:
##             :...years_at_residence <= 1: no (31.6/8.5)
##                 years_at_residence > 1:
##                 :...credit_history in {poor,perfect}: yes (20.9/1.6)
##                     credit_history in {critical,good,very good}:
##                     :...job = skilled: yes (95/34.7)
##                         job = unemployed: no (1.6)
##                         job = management:
##                         :...amount <= 11590: no (23.8/7)
##                         :   amount > 11590: yes (3.8)
##                         job = unskilled:
##                         :...checking_balance = 1 - 200 DM: no (17.9/6.2)
##                             checking_balance in {> 200 DM,
##                                                  < 0 DM}: yes (23.8/9.5)
## 
## -----  Trial 2:  -----
## 
## Decision tree:
## 
## checking_balance = unknown:
## :...other_credit = bank:
## :   :...existing_loans_count > 2: no (3.3)
## :   :   existing_loans_count <= 2:
## :   :   :...months_loan_duration <= 8: no (4)
## :   :       months_loan_duration > 8: yes (43/16.6)
## :   other_credit in {none,store}:
## :   :...employment_duration in {unemployed,< 1 year}:
## :       :...purpose in {business,renovations}: yes (6.4)
## :       :   purpose in {car0,car,education}: no (13.2)
## :       :   purpose = furniture/appliances:
## :       :   :...amount <= 4594: no (22.5/7.3)
## :       :       amount > 4594: yes (9.1)
## :       employment_duration in {> 7 years,4 - 7 years,1 - 4 years}:
## :       :...percent_of_income <= 3: no (92.7/3.6)
## :           percent_of_income > 3:
## :           :...age > 30: no (73.6/5.5)
## :               age <= 30:
## :               :...job in {management,unskilled,unemployed}: yes (14/4)
## :                   job = skilled:
## :                   :...credit_history = very good: no (0)
## :                       credit_history = poor: yes (3.6)
## :                       credit_history in {critical,good,perfect}:
## :                       :...age <= 29: no (20.4/4.6)
## :                           age > 29: yes (2.7)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...housing = other:
##     :...dependents > 1: yes (28.3/7.6)
##     :   dependents <= 1:
##     :   :...employment_duration in {unemployed,4 - 7 years,
##     :       :                       < 1 year}: no (22.9/4.5)
##     :       employment_duration in {> 7 years,1 - 4 years}: yes (29.6/10.5)
##     housing = rent:
##     :...credit_history = poor: no (7.1/0.7)
##     :   credit_history = perfect: yes (5.3)
##     :   credit_history in {critical,good,very good}:
##     :   :...employment_duration in {unemployed,> 7 years,
##     :       :                       4 - 7 years}: no (33.9/12.3)
##     :       employment_duration = < 1 year: yes (28.3/9.3)
##     :       employment_duration = 1 - 4 years:
##     :       :...checking_balance = > 200 DM: no (2)
##     :           checking_balance in {1 - 200 DM,< 0 DM}:
##     :           :...years_at_residence <= 3: no (10.3/3.8)
##     :               years_at_residence > 3: yes (20.4/3.1)
##     housing = own:
##     :...job in {management,unemployed}: yes (55.8/19.8)
##         job in {skilled,unskilled}:
##         :...months_loan_duration <= 7: no (25.3/2)
##             months_loan_duration > 7:
##             :...years_at_residence > 3: no (92.2/29.6)
##                 years_at_residence <= 3:
##                 :...purpose = renovations: yes (7/1.3)
##                     purpose in {car0,business,education}: no (32.2/5.3)
##                     purpose = car:
##                     :...months_loan_duration > 40: no (7.2/0.7)
##                     :   months_loan_duration <= 40:
##                     :   :...amount <= 947: yes (12.9)
##                     :       amount > 947:
##                     :       :...months_loan_duration <= 16: no (23.2/8.5)
##                     :           months_loan_duration > 16: [S1]
##                     purpose = furniture/appliances:
##                     :...savings_balance in {100 - 500 DM,
##                         :                   500 - 1000 DM}: yes (14.6/4.5)
##                         savings_balance in {unknown,> 1000 DM}: no (15.4/3.2)
##                         savings_balance = < 100 DM:
##                         :...months_loan_duration > 36: yes (7.1)
##                             months_loan_duration <= 36:
##                             :...existing_loans_count > 1: no (14.1/4.3)
##                                 existing_loans_count <= 1: [S2]
## 
## SubTree [S1]
## 
## savings_balance = 100 - 500 DM: no (4.5/0.7)
## savings_balance in {unknown,500 - 1000 DM,< 100 DM,> 1000 DM}: yes (22.5/2.7)
## 
## SubTree [S2]
## 
## checking_balance in {1 - 200 DM,> 200 DM}: yes (46.7/20)
## checking_balance = < 0 DM: no (22.4/9.1)
## 
## -----  Trial 3:  -----
## 
## Decision tree:
## 
## checking_balance in {unknown,> 200 DM}:
## :...employment_duration = unemployed: yes (16/6.7)
## :   employment_duration = > 7 years: no (98.9/17.1)
## :   employment_duration = 4 - 7 years:
## :   :...checking_balance = > 200 DM: yes (9.6/3.6)
## :   :   checking_balance = unknown:
## :   :   :...age <= 22: yes (6.5/1.6)
## :   :       age > 22: no (42.6/1.5)
## :   employment_duration = < 1 year:
## :   :...amount <= 1333: no (11.7)
## :   :   amount > 1333:
## :   :   :...amount <= 6681: no (38.2/16.3)
## :   :       amount > 6681: yes (5.3)
## :   employment_duration = 1 - 4 years:
## :   :...percent_of_income <= 1: no (20.6/1.5)
## :       percent_of_income > 1:
## :       :...job in {skilled,unemployed}: no (64.9/17.6)
## :           job in {management,unskilled}:
## :           :...existing_loans_count > 2: yes (2.4)
## :               existing_loans_count <= 2:
## :               :...age <= 34: yes (26.4/10.7)
## :                   age > 34: no (10.5)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (35.8/12)
##     savings_balance = 100 - 500 DM:
##     :...amount <= 1285: yes (12.8/0.5)
##     :   amount > 1285:
##     :   :...existing_loans_count <= 1: no (27/9.2)
##     :       existing_loans_count > 1: yes (15.8/4.9)
##     savings_balance = unknown:
##     :...credit_history in {poor,critical,perfect}: no (15.5)
##     :   credit_history in {good,very good}:
##     :   :...age > 56: no (4.5)
##     :       age <= 56:
##     :       :...months_loan_duration <= 18: yes (24.5/5.6)
##     :           months_loan_duration > 18: no (28.4/12.3)
##     savings_balance = < 100 DM:
##     :...months_loan_duration <= 11:
##         :...job = management: yes (13.7/4.9)
##         :   job in {skilled,unskilled,unemployed}: no (45.9/10)
##         months_loan_duration > 11:
##         :...percent_of_income <= 1:
##             :...credit_history in {poor,critical,very good}: no (11.1)
##             :   credit_history in {good,perfect}: yes (24.4/11)
##             percent_of_income > 1:
##             :...job = unemployed: yes (7/3.1)
##                 job = management:
##                 :...years_at_residence <= 1: no (6.6)
##                 :   years_at_residence > 1:
##                 :   :...checking_balance = 1 - 200 DM: yes (15.8/4)
##                 :       checking_balance = < 0 DM: no (23.1/7)
##                 job = unskilled:
##                 :...housing in {other,rent}: yes (12.2/2.2)
##                 :   housing = own:
##                 :   :...purpose = car: yes (18.1/3.9)
##                 :       purpose in {car0,furniture/appliances,business,
##                 :                   education,renovations}: no (32.1/11.1)
##                 job = skilled:
##                 :...checking_balance = 1 - 200 DM:
##                     :...months_loan_duration > 36: yes (6.5)
##                     :   months_loan_duration <= 36:
##                     :   :...other_credit in {bank,store}: yes (8/1.5)
##                     :       other_credit = none:
##                     :       :...dependents > 1: yes (7.4/3.1)
##                     :           dependents <= 1:
##                     :           :...percent_of_income <= 2: no (12.7/1.1)
##                     :               percent_of_income > 2: [S1]
##                     checking_balance = < 0 DM:
##                     :...credit_history in {poor,very good}: yes (16.6)
##                         credit_history in {critical,good,perfect}:
##                         :...purpose in {car0,business,education,
##                             :           renovations}: yes (10.2/1.5)
##                             purpose = car:
##                             :...age <= 51: yes (34.6/8.1)
##                             :   age > 51: no (4.4)
##                             purpose = furniture/appliances:
##                             :...years_at_residence <= 1: no (4.4)
##                                 years_at_residence > 1:
##                                 :...other_credit = bank: yes (2.4)
##                                     other_credit = store: no (0.5)
##                                     other_credit = none:
##                                     :...amount <= 1743: no (11.5/2.4)
##                                         amount > 1743: yes (29/6.6)
## 
## SubTree [S1]
## 
## purpose in {car0,car,furniture/appliances,education}: no (19.8/6.1)
## purpose in {business,renovations}: yes (3.9)
## 
## -----  Trial 4:  -----
## 
## Decision tree:
## 
## checking_balance in {unknown,> 200 DM}:
## :...other_credit = store: no (20.6/9.6)
## :   other_credit = none:
## :   :...employment_duration in {unemployed,> 7 years,4 - 7 years,
## :   :   :                       1 - 4 years}: no (211.3/45.7)
## :   :   employment_duration = < 1 year:
## :   :   :...amount <= 1333: no (8.8)
## :   :       amount > 1333:
## :   :       :...purpose = car: no (4.9)
## :   :           purpose in {car0,furniture/appliances,business,education,
## :   :                       renovations}: yes (32.9/8.1)
## :   other_credit = bank:
## :   :...age > 44: no (14.4/1.2)
## :       age <= 44:
## :       :...years_at_residence <= 1: no (5)
## :           years_at_residence > 1:
## :           :...housing = rent: yes (4.3)
## :               housing in {other,own}:
## :               :...job = unemployed: yes (0)
## :                   job = management: no (4)
## :                   job in {skilled,unskilled}:
## :                   :...age <= 26: no (3.7)
## :                       age > 26:
## :                       :...savings_balance in {100 - 500 DM,
## :                           :                   > 1000 DM}: no (4)
## :                           savings_balance in {unknown,500 - 1000 DM,
## :                                               < 100 DM}: yes (30.6/7.4)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history = perfect:
##     :...housing in {other,rent}: yes (7.8)
##     :   housing = own: no (20.5/9)
##     credit_history = poor:
##     :...checking_balance = < 0 DM: yes (10.4/2.2)
##     :   checking_balance = 1 - 200 DM:
##     :   :...other_credit in {none,bank}: no (24/4.3)
##     :       other_credit = store: yes (5.8/1.2)
##     credit_history = very good:
##     :...age <= 23: no (5.7)
##     :   age > 23:
##     :   :...months_loan_duration <= 27: yes (28.4/3.7)
##     :       months_loan_duration > 27: no (6.9/2)
##     credit_history = critical:
##     :...years_at_residence <= 1: no (6.7)
##     :   years_at_residence > 1:
##     :   :...purpose in {car0,car,business,renovations}: no (62.2/21.9)
##     :       purpose = education: yes (7.9/0.9)
##     :       purpose = furniture/appliances:
##     :       :...phone = yes: no (14.5/2.8)
##     :           phone = no:
##     :           :...amount <= 1175: no (5.2)
##     :               amount > 1175: yes (30.1/7.6)
##     credit_history = good:
##     :...savings_balance = 100 - 500 DM: yes (32.1/11.7)
##         savings_balance in {500 - 1000 DM,> 1000 DM}: no (15.7/4.7)
##         savings_balance = unknown:
##         :...job = unskilled: no (4.4)
##         :   job in {management,skilled,unemployed}:
##         :   :...checking_balance = 1 - 200 DM: no (26.8/10.4)
##         :       checking_balance = < 0 DM: yes (27.8/6)
##         savings_balance = < 100 DM:
##         :...dependents > 1:
##             :...existing_loans_count > 1: no (2.6/0.4)
##             :   existing_loans_count <= 1:
##             :   :...years_at_residence <= 2: yes (10.2/2.9)
##             :       years_at_residence > 2: no (20.4/5.9)
##             dependents <= 1:
##             :...purpose in {car0,business}: no (9.7/2.5)
##                 purpose in {education,renovations}: yes (13/5.1)
##                 purpose = car:
##                 :...employment_duration in {unemployed,
##                 :   :                       1 - 4 years}: no (24.9/9)
##                 :   employment_duration in {> 7 years,4 - 7 years,
##                 :                           < 1 year}: yes (32/8.3)
##                 purpose = furniture/appliances:
##                 :...months_loan_duration > 39: yes (4.8)
##                     months_loan_duration <= 39:
##                     :...phone = yes: yes (21.9/9.2)
##                         phone = no:
##                         :...employment_duration = unemployed: yes (3.3/0.4)
##                             employment_duration in {> 7 years,4 - 7 years,
##                             :                       < 1 year}: no (34.1/8.1)
##                             employment_duration = 1 - 4 years:
##                             :...percent_of_income <= 1: yes (3.8)
##                                 percent_of_income > 1:
##                                 :...months_loan_duration > 21: no (4.9/0.4)
##                                     months_loan_duration <= 21:
##                                     :...years_at_residence <= 3: no (20.9/8.8)
##                                         years_at_residence > 3: yes (5.8)
## 
## -----  Trial 5:  -----
## 
## Decision tree:
## 
## checking_balance = unknown:
## :...other_credit = store: yes (16.9/7.5)
## :   other_credit = bank:
## :   :...housing = other: no (8.3/1.8)
## :   :   housing = rent: yes (4.4/0.8)
## :   :   housing = own:
## :   :   :...phone = yes: yes (12.1/5)
## :   :       phone = no: no (26.9/9.7)
## :   other_credit = none:
## :   :...credit_history in {critical,perfect,very good}: no (60.4/5.1)
## :       credit_history in {poor,good}:
## :       :...purpose in {car0,car,business,education}: no (53.6/12.8)
## :           purpose = renovations: yes (7.3/1.1)
## :           purpose = furniture/appliances:
## :           :...job = unemployed: no (0)
## :               job in {management,unskilled}: yes (19.2/7)
## :               job = skilled:
## :               :...phone = yes: no (14.6/1.8)
## :                   phone = no:
## :                   :...age > 32: no (9.2)
## :                       age <= 32:
## :                       :...employment_duration = 1 - 4 years: no (4.1)
## :                           employment_duration in {unemployed,> 7 years,
## :                           :                       4 - 7 years,< 1 year}:
## :                           :...savings_balance in {100 - 500 DM,
## :                               :                   < 100 DM}: yes (20.5/3)
## :                               savings_balance in {unknown,500 - 1000 DM,
## :                                                   > 1000 DM}: no (3.4)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...percent_of_income <= 2:
##     :...amount > 11054: yes (14.2/1.2)
##     :   amount <= 11054:
##     :   :...other_credit = bank: no (32.3/9.7)
##     :       other_credit = store: yes (8.9/2.6)
##     :       other_credit = none:
##     :       :...purpose in {car0,education}: no (8.4/3.7)
##     :           purpose in {business,renovations}: yes (20.3/9.1)
##     :           purpose = car:
##     :           :...savings_balance = 100 - 500 DM: yes (13.8/3.3)
##     :           :   savings_balance in {unknown,500 - 1000 DM,< 100 DM,
##     :           :                       > 1000 DM}: no (46.6/7.9)
##     :           purpose = furniture/appliances:
##     :           :...employment_duration in {unemployed,
##     :               :                       1 - 4 years}: yes (50.8/19.5)
##     :               employment_duration in {> 7 years,
##     :               :                       4 - 7 years}: no (18.2/2.6)
##     :               employment_duration = < 1 year:
##     :               :...job in {management,skilled,unemployed}: no (16.3/2.9)
##     :                   job = unskilled: yes (6/1.6)
##     percent_of_income > 2:
##     :...years_at_residence <= 1:
##         :...other_credit in {bank,store}: no (7.6)
##         :   other_credit = none:
##         :   :...months_loan_duration > 42: no (2.9)
##         :       months_loan_duration <= 42:
##         :       :...age <= 36: no (26.6/8.4)
##         :           age > 36: yes (5.3)
##         years_at_residence > 1:
##         :...job = unemployed: no (5.2)
##             job in {management,skilled,unskilled}:
##             :...credit_history = perfect: yes (10.9)
##                 credit_history in {poor,critical,good,very good}:
##                 :...employment_duration = < 1 year:
##                     :...checking_balance = > 200 DM: no (2.7)
##                     :   checking_balance in {1 - 200 DM,< 0 DM}:
##                     :   :...months_loan_duration > 21: yes (23.4/0.7)
##                     :       months_loan_duration <= 21:
##                     :       :...amount <= 1928: yes (18.4/4.4)
##                     :           amount > 1928: no (4.5)
##                     employment_duration in {unemployed,> 7 years,4 - 7 years,
##                     :                       1 - 4 years}:
##                     :...months_loan_duration <= 11:
##                         :...age > 47: no (12.2)
##                         :   age <= 47:
##                         :   :...purpose in {car0,car,furniture/appliances,
##                         :       :           business,renovations}: no (25/9.2)
##                         :       purpose = education: yes (3.5)
##                         months_loan_duration > 11:
##                         :...savings_balance in {100 - 500 DM,> 1000 DM}:
##                             :...age <= 58: no (22.7/3.4)
##                             :   age > 58: yes (4.4)
##                             savings_balance in {unknown,500 - 1000 DM,< 100 DM}:
##                             :...years_at_residence <= 2: yes (76.1/22.8)
##                                 years_at_residence > 2:
##                                 :...purpose in {car0,business,
##                                     :           education}: yes (24.7/7.1)
##                                     purpose = renovations: no (1.1)
##                                     purpose = furniture/appliances: [S1]
##                                     purpose = car:
##                                     :...amount <= 1388: yes (17.8/2.2)
##                                         amount > 1388:
##                                         :...housing = own: no (10.9)
##                                             housing in {other,rent}: [S2]
## 
## SubTree [S1]
## 
## employment_duration = unemployed: no (4.4)
## employment_duration in {> 7 years,4 - 7 years,1 - 4 years}:
## :...checking_balance in {1 - 200 DM,> 200 DM}: no (29/10.5)
##     checking_balance = < 0 DM: yes (35.6/12.4)
## 
## SubTree [S2]
## 
## savings_balance = unknown: no (6.8/1.5)
## savings_balance in {500 - 1000 DM,< 100 DM}: yes (21.4/6.4)
## 
## -----  Trial 6:  -----
## 
## Decision tree:
## 
## checking_balance in {unknown,> 200 DM}:
## :...purpose = car0: no (2.2)
## :   purpose = renovations: yes (8.4/3.3)
## :   purpose = education:
## :   :...age <= 44: yes (19.8/7.7)
## :   :   age > 44: no (4.4)
## :   purpose = car:
## :   :...job in {management,unemployed}: no (20.8/1.6)
## :   :   job = unskilled:
## :   :   :...years_at_residence <= 3: no (11/1.3)
## :   :   :   years_at_residence > 3: yes (14.5/3.2)
## :   :   job = skilled:
## :   :   :...other_credit in {bank,store}: yes (17.6/4.9)
## :   :       other_credit = none:
## :   :       :...existing_loans_count <= 2: no (24.6)
## :   :           existing_loans_count > 2: yes (2.4/0.3)
## :   purpose = business:
## :   :...existing_loans_count > 2: yes (3.3)
## :   :   existing_loans_count <= 2:
## :   :   :...amount <= 1823: no (8.1)
## :   :       amount > 1823:
## :   :       :...percent_of_income <= 3: no (12.1/3.3)
## :   :           percent_of_income > 3: yes (13.2/3.4)
## :   purpose = furniture/appliances:
## :   :...age > 44: no (22.7)
## :       age <= 44:
## :       :...job = unemployed: no (0)
## :           job = unskilled:
## :           :...existing_loans_count <= 1: yes (20.9/5.6)
## :           :   existing_loans_count > 1: no (4.5)
## :           job in {management,skilled}:
## :           :...dependents > 1: no (6.6)
## :               dependents <= 1:
## :               :...existing_loans_count <= 1:
## :                   :...savings_balance in {100 - 500 DM,unknown,500 - 1000 DM,
## :                   :   :                   > 1000 DM}: no (16.9)
## :                   :   savings_balance = < 100 DM:
## :                   :   :...age <= 22: yes (8.5/1.3)
## :                   :       age > 22: no (43.1/8.8)
## :                   existing_loans_count > 1:
## :                   :...housing in {other,rent}: yes (9.9/2.1)
## :                       housing = own:
## :                       :...credit_history in {poor,critical,
## :                           :                  very good}: no (18.6/1.6)
## :                           credit_history in {good,perfect}: yes (14.9/4.3)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history = perfect: yes (28.1/9.6)
##     credit_history = very good:
##     :...age <= 23: no (5.5)
##     :   age > 23: yes (30/8.1)
##     credit_history = poor:
##     :...percent_of_income <= 1: no (6.5)
##     :   percent_of_income > 1:
##     :   :...savings_balance in {unknown,500 - 1000 DM}: no (6.4)
##     :       savings_balance in {100 - 500 DM,< 100 DM,> 1000 DM}:
##     :       :...dependents <= 1: yes (25.1/8)
##     :           dependents > 1: no (5/0.9)
##     credit_history = critical:
##     :...savings_balance = unknown: no (8.4)
##     :   savings_balance in {100 - 500 DM,500 - 1000 DM,< 100 DM,> 1000 DM}:
##     :   :...other_credit = bank: yes (16.2/4.3)
##     :       other_credit = store: no (3.7/0.9)
##     :       other_credit = none:
##     :       :...savings_balance = 100 - 500 DM: no (5.9)
##     :           savings_balance in {500 - 1000 DM,> 1000 DM}: yes (7.3/2.3)
##     :           savings_balance = < 100 DM:
##     :           :...purpose in {car0,education,renovations}: yes (8.5/2.2)
##     :               purpose = business: no (4.5/2.2)
##     :               purpose = car:
##     :               :...age <= 29: yes (6.9)
##     :               :   age > 29: no (25.6/6.9)
##     :               purpose = furniture/appliances:
##     :               :...months_loan_duration <= 36: no (38.4/10.9)
##     :                   months_loan_duration > 36: yes (3.8)
##     credit_history = good:
##     :...amount > 8086: yes (24/3.8)
##         amount <= 8086:
##         :...phone = yes:
##             :...age <= 28: yes (23.9/7.5)
##             :   age > 28: no (69.4/17.9)
##             phone = no:
##             :...other_credit in {bank,store}: yes (25.1/7.2)
##                 other_credit = none:
##                 :...percent_of_income <= 2:
##                     :...job in {management,unskilled,unemployed}: no (15.6/2.7)
##                     :   job = skilled:
##                     :   :...amount <= 1386: yes (9.9/1)
##                     :       amount > 1386:
##                     :       :...age <= 24: yes (13.4/4.6)
##                     :           age > 24: no (27.8/3.1)
##                     percent_of_income > 2:
##                     :...checking_balance = < 0 DM: yes (62.5/21.4)
##                         checking_balance = 1 - 200 DM:
##                         :...months_loan_duration > 42: yes (4.9)
##                             months_loan_duration <= 42:
##                             :...existing_loans_count > 1: no (5)
##                                 existing_loans_count <= 1:
##                                 :...age <= 35: no (39.4/13.2)
##                                     age > 35: yes (14.7/4.2)
## 
## -----  Trial 7:  -----
## 
## Decision tree:
## 
## checking_balance = unknown:
## :...employment_duration = unemployed: yes (16.6/8)
## :   employment_duration in {> 7 years,4 - 7 years}: no (101.1/20.4)
## :   employment_duration = < 1 year:
## :   :...amount <= 4594: no (30/5.7)
## :   :   amount > 4594: yes (10.6/0.3)
## :   employment_duration = 1 - 4 years:
## :   :...dependents > 1: no (8)
## :       dependents <= 1:
## :       :...months_loan_duration <= 16: no (32.8/5.3)
## :           months_loan_duration > 16:
## :           :...existing_loans_count > 2: yes (2.7)
## :               existing_loans_count <= 2:
## :               :...percent_of_income <= 3: no (20.9/5.9)
## :                   percent_of_income > 3:
## :                   :...purpose in {car,furniture/appliances,
## :                       :           renovations}: no (19.7/7.5)
## :                       purpose in {car0,business,education}: yes (10.8)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...purpose in {car0,education,renovations}: no (67.2/29.2)
##     purpose = car:
##     :...amount <= 1297: yes (52.4/12.9)
##     :   amount > 1297:
##     :   :...percent_of_income <= 2:
##     :       :...phone = no: no (32.7/6.1)
##     :       :   phone = yes:
##     :       :   :...years_at_residence <= 3: no (20/4.9)
##     :       :       years_at_residence > 3: yes (14.7/3.8)
##     :       percent_of_income > 2:
##     :       :...percent_of_income <= 3: yes (33.1/11.3)
##     :           percent_of_income > 3:
##     :           :...months_loan_duration <= 18: no (18.2/1.6)
##     :               months_loan_duration > 18:
##     :               :...existing_loans_count <= 1: no (19.5/7.2)
##     :                   existing_loans_count > 1: yes (13.8/1)
##     purpose = business:
##     :...age > 46: yes (5.2)
##     :   age <= 46:
##     :   :...months_loan_duration <= 18: no (17.5)
##     :       months_loan_duration > 18:
##     :       :...other_credit in {bank,store}: no (10/0.5)
##     :           other_credit = none:
##     :           :...employment_duration in {unemployed,
##     :               :                       > 7 years}: yes (6.6)
##     :               employment_duration in {4 - 7 years,1 - 4 years,< 1 year}:
##     :               :...age <= 25: yes (4)
##     :                   age > 25: no (19.2/5.6)
##     purpose = furniture/appliances:
##     :...savings_balance = 100 - 500 DM: yes (18.6/6)
##         savings_balance = > 1000 DM: no (5.2)
##         savings_balance in {unknown,500 - 1000 DM,< 100 DM}:
##         :...existing_loans_count > 1:
##             :...existing_loans_count > 2: no (3.6)
##             :   existing_loans_count <= 2:
##             :   :...housing = other: yes (3.3)
##             :       housing in {own,rent}:
##             :       :...savings_balance = unknown: no (6.9)
##             :           savings_balance = 500 - 1000 DM: yes (3.5/1)
##             :           savings_balance = < 100 DM:
##             :           :...age > 54: yes (2.1)
##             :               age <= 54: [S1]
##             existing_loans_count <= 1:
##             :...credit_history in {poor,very good}: no (20.8/9.5)
##                 credit_history in {critical,perfect}: yes (20.3/7.6)
##                 credit_history = good:
##                 :...months_loan_duration <= 7: no (11.4)
##                     months_loan_duration > 7:
##                     :...other_credit = bank: no (14.2/4.6)
##                         other_credit = store: yes (11.7/3.9)
##                         other_credit = none:
##                         :...percent_of_income <= 1: no (20.5/5.2)
##                             percent_of_income > 1:
##                             :...amount > 6078: yes (10.9/1.1)
##                                 amount <= 6078:
##                                 :...dependents > 1: yes (8.7/2.5)
##                                     dependents <= 1: [S2]
## 
## SubTree [S1]
## 
## employment_duration in {unemployed,> 7 years,1 - 4 years}: no (25.7/2.9)
## employment_duration in {4 - 7 years,< 1 year}: yes (15/2.5)
## 
## SubTree [S2]
## 
## employment_duration = > 7 years: no (17.9/2.5)
## employment_duration in {unemployed,4 - 7 years,1 - 4 years,< 1 year}:
## :...job = management: no (6.6)
##     job = unemployed: yes (1.1)
##     job in {skilled,unskilled}:
##     :...years_at_residence <= 1: no (11.8/1.8)
##         years_at_residence > 1:
##         :...checking_balance = 1 - 200 DM: yes (25.1/8.8)
##             checking_balance = > 200 DM: no (14.7/6.3)
##             checking_balance = < 0 DM:
##             :...months_loan_duration <= 16: no (13.8/3.4)
##                 months_loan_duration > 16: yes (19.1/5.5)
## 
## -----  Trial 8:  -----
## 
## Decision tree:
## 
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history = perfect:
## :   :...housing in {other,rent}: yes (8.3)
## :   :   housing = own:
## :   :   :...age <= 34: no (16.6/4.7)
## :   :       age > 34: yes (5.8)
## :   credit_history = poor:
## :   :...checking_balance = < 0 DM: yes (12/2.7)
## :   :   checking_balance = 1 - 200 DM:
## :   :   :...housing = rent: no (8.6)
## :   :       housing in {other,own}:
## :   :       :...amount <= 2279: yes (6.8/0.6)
## :   :           amount > 2279: no (20/5.7)
## :   credit_history = very good:
## :   :...existing_loans_count > 1: yes (2.5)
## :   :   existing_loans_count <= 1:
## :   :   :...age <= 23: no (3.7)
## :   :       age > 23:
## :   :       :...amount <= 8386: yes (32.9/8.1)
## :   :           amount > 8386: no (2.5)
## :   credit_history = critical:
## :   :...years_at_residence <= 1: no (8)
## :   :   years_at_residence > 1:
## :   :   :...savings_balance in {100 - 500 DM,unknown,500 - 1000 DM,
## :   :       :                   > 1000 DM}: no (25.5/5.7)
## :   :       savings_balance = < 100 DM:
## :   :       :...age > 61: no (6)
## :   :           age <= 61:
## :   :           :...existing_loans_count > 2: no (10.7/2.4)
## :   :               existing_loans_count <= 2:
## :   :               :...age > 56: yes (5.4)
## :   :                   age <= 56:
## :   :                   :...amount > 2483: yes (34.1/8.9)
## :   :                       amount <= 2483:
## :   :                       :...purpose in {car0,car,furniture/appliances,
## :   :                           :           renovations}: no (41.4/10.8)
## :   :                           purpose in {business,education}: yes (4.4)
## :   credit_history = good:
## :   :...amount > 8086: yes (26.6/4.8)
## :       amount <= 8086:
## :       :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (17.5/5.1)
## :           savings_balance = 100 - 500 DM:
## :           :...months_loan_duration <= 27: no (21.3/7.1)
## :           :   months_loan_duration > 27: yes (5.1)
## :           savings_balance = unknown:
## :           :...age <= 56: yes (44.7/16.9)
## :           :   age > 56: no (4.4)
## :           savings_balance = < 100 DM:
## :           :...job = unemployed: yes (0.9)
## :               job = management:
## :               :...employment_duration in {unemployed,4 - 7 years,1 - 4 years,
## :               :   :                       < 1 year}: no (17.3/1.6)
## :               :   employment_duration = > 7 years: yes (8/1.2)
## :               job = unskilled:
## :               :...months_loan_duration <= 26: no (59/19.7)
## :               :   months_loan_duration > 26: yes (3.3)
## :               job = skilled:
## :               :...purpose in {car0,business,education,
## :                   :           renovations}: yes (16.6/4.1)
## :                   purpose = car:
## :                   :...dependents <= 1: yes (27.7/10.6)
## :                   :   dependents > 1: no (8.1/1.4)
## :                   purpose = furniture/appliances:
## :                   :...years_at_residence <= 1: no (18.7/6.5)
## :                       years_at_residence > 1:
## :                       :...other_credit = bank: yes (4.5)
## :                           other_credit = store: no (2.3)
## :                           other_credit = none:
## :                           :...percent_of_income <= 3: yes (33.5/15)
## :                               percent_of_income > 3: no (27.3/9.3)
## checking_balance in {unknown,> 200 DM}:
## :...years_at_residence > 2: no (135.6/32.2)
##     years_at_residence <= 2:
##     :...months_loan_duration <= 8: no (12.9)
##         months_loan_duration > 8:
##         :...months_loan_duration <= 9: yes (10.4/1.3)
##             months_loan_duration > 9:
##             :...months_loan_duration <= 16: no (31.3/4.2)
##                 months_loan_duration > 16:
##                 :...purpose in {car0,business,renovations}: no (21.3/8.4)
##                     purpose = education: yes (6.3/0.8)
##                     purpose = car:
##                     :...credit_history in {poor,good,perfect}: no (9.6)
##                     :   credit_history in {critical,very good}: yes (17.3/2.6)
##                     purpose = furniture/appliances:
##                     :...credit_history = poor: yes (4.9)
##                         credit_history in {critical,perfect,
##                         :                  very good}: no (5.6)
##                         credit_history = good:
##                         :...housing in {other,rent}: no (2.6)
##                             housing = own:
##                             :...age <= 25: no (6.8)
##                                 age > 25: yes (29.2/10.2)
## 
## -----  Trial 9:  -----
## 
## Decision tree:
## 
## checking_balance = unknown:
## :...dependents > 1: no (26)
## :   dependents <= 1:
## :   :...amount <= 1474: no (39.7)
## :       amount > 1474:
## :       :...employment_duration in {> 7 years,4 - 7 years}:
## :           :...years_at_residence > 2: no (21.8)
## :           :   years_at_residence <= 2:
## :           :   :...age <= 23: yes (4.1)
## :           :       age > 23: no (19.7/4.2)
## :           employment_duration in {unemployed,1 - 4 years,< 1 year}:
## :           :...purpose in {business,renovations}: yes (23.2/3.6)
## :               purpose in {car0,car,furniture/appliances,education}:
## :               :...other_credit in {bank,store}: yes (29.1/10.5)
## :                   other_credit = none:
## :                   :...purpose in {car0,car}: no (12.3)
## :                       purpose in {furniture/appliances,education}:
## :                       :...amount <= 4455: no (23.7/4.4)
## :                           amount > 4455: yes (11.1/1.3)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...percent_of_income <= 2:
##     :...amount > 11054: yes (15.7/3.6)
##     :   amount <= 11054:
##     :   :...savings_balance in {unknown,500 - 1000 DM,
##     :       :                   > 1000 DM}: no (41.5/11.2)
##     :       savings_balance = 100 - 500 DM:
##     :       :...other_credit in {none,store}: yes (21.7/9.4)
##     :       :   other_credit = bank: no (5.1)
##     :       savings_balance = < 100 DM:
##     :       :...employment_duration in {unemployed,> 7 years}: no (34.6/11.5)
##     :           employment_duration = 1 - 4 years:
##     :           :...job = management: yes (5.1/0.8)
##     :           :   job in {skilled,unskilled,unemployed}: no (65.4/15.8)
##     :           employment_duration = 4 - 7 years:
##     :           :...dependents > 1: no (4.6)
##     :           :   dependents <= 1:
##     :           :   :...amount <= 6527: no (16.8/7.2)
##     :           :       amount > 6527: yes (7)
##     :           employment_duration = < 1 year:
##     :           :...amount <= 2327:
##     :               :...age <= 34: yes (20.5/1.9)
##     :               :   age > 34: no (3)
##     :               amount > 2327:
##     :               :...other_credit in {none,store}: no (20.1/3.9)
##     :                   other_credit = bank: yes (2.8)
##     percent_of_income > 2:
##     :...housing = rent:
##         :...checking_balance in {1 - 200 DM,< 0 DM}: yes (69/22.1)
##         :   checking_balance = > 200 DM: no (3.4)
##         housing = other:
##         :...existing_loans_count > 1: yes (18.7/5.3)
##         :   existing_loans_count <= 1:
##         :   :...savings_balance in {100 - 500 DM,unknown}: no (15.3/3.2)
##         :       savings_balance in {500 - 1000 DM,< 100 DM,
##         :                           > 1000 DM}: yes (29.1/8.6)
##         housing = own:
##         :...credit_history in {poor,perfect}: yes (26.9/7.4)
##             credit_history = very good: no (14.9/5.6)
##             credit_history = critical:
##             :...other_credit in {none,store}: no (63/20.3)
##             :   other_credit = bank: yes (11.7/3.4)
##             credit_history = good:
##             :...other_credit = store: yes (8.9/1.4)
##                 other_credit in {none,bank}:
##                 :...age > 54: no (9.5)
##                     age <= 54:
##                     :...existing_loans_count > 1: no (10.2/2.7)
##                         existing_loans_count <= 1:
##                         :...purpose in {business,renovations}: no (10.1/3.6)
##                             purpose in {car0,education}: yes (4.7)
##                             purpose = car:
##                             :...other_credit = bank: yes (4.9)
##                             :   other_credit = none:
##                             :   :...years_at_residence > 2: no (14.8/4.5)
##                             :       years_at_residence <= 2:
##                             :       :...amount <= 2150: no (14.9/6.2)
##                             :           amount > 2150: yes (11.1)
##                             purpose = furniture/appliances:
##                             :...savings_balance = 100 - 500 DM: yes (3.8)
##                                 savings_balance in {500 - 1000 DM,
##                                 :                   > 1000 DM}: no (2.8)
##                                 savings_balance in {unknown,< 100 DM}:
##                                 :...months_loan_duration > 39: yes (3.3)
##                                     months_loan_duration <= 39:
##                                     :...dependents <= 1: no (57.6/19.4)
##                                         dependents > 1: yes (4.6/1.1)
## 
## 
## Evaluation on training data (900 cases):
## 
## Trial        Decision Tree   
## -----      ----------------  
##    Size      Errors  
## 
##    0     56  133(14.8%)
##    1     34  211(23.4%)
##    2     39  201(22.3%)
##    3     47  179(19.9%)
##    4     46  174(19.3%)
##    5     50  197(21.9%)
##    6     55  187(20.8%)
##    7     50  190(21.1%)
##    8     51  192(21.3%)
##    9     47  169(18.8%)
## boost             34( 3.8%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     629     4    (a): class no
##      30   237    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% checking_balance
##  100.00% purpose
##   97.11% years_at_residence
##   96.67% employment_duration
##   94.78% credit_history
##   94.67% other_credit
##   92.56% job
##   92.11% percent_of_income
##   90.33% amount
##   85.11% months_loan_duration
##   82.78% age
##   82.78% existing_loans_count
##   75.78% dependents
##   71.56% housing
##   70.78% savings_balance
##   49.22% phone
## 
## 
## Time: 0.0 secs

Evaluate the Boosted Decision Tree Model

Then we can use the boosted model on our test dataset to see its performance.

credit_boost_pred10 <- predict(credit_boost10, credit_test)
CrossTable(credit_test$default, credit_boost_pred10,
           prop.chisq = FALSE, 
           prop.c = FALSE, 
           prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        62 |         5 |        67 | 
##                |     0.620 |     0.050 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |        13 |        20 |        33 | 
##                |     0.130 |     0.200 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        75 |        25 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

Here, we reduced the total error rate from 27 percent prior to boosting down to 18 percent in the boosted model.
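
The 18 percent figure can be recomputed from the cross table above. The sketch below re-enters the four cell counts by hand; the object name boost_table is only illustrative and is not created by the earlier code.

```r
# Cell counts copied from the CrossTable output above
# (rows = actual default, columns = predicted default)
boost_table <- matrix(c(62, 13, 5, 20), nrow = 2,
                      dimnames = list(actual = c("no", "yes"),
                                      predicted = c("no", "yes")))

# Errors are the off-diagonal cells: 5 false positives + 13 false negatives
errors <- sum(boost_table) - sum(diag(boost_table))
error_rate <- errors / sum(boost_table)
error_rate
```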

Making some mistakes cost more than others

Giving a loan to an applicant who is likely to default can be an expensive mistake. One solution is to reduce the number of false negatives by specifying a cost matrix.

The C5.0 algorithm allows us to assign a penalty to different types of errors, in order to discourage a tree from making more costly mistakes. The penalties are designated in a cost matrix, which specifies how much costlier each error is, relative to any other prediction.

We start by specifying the dimensions of the cost matrix:

matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
names(matrix_dimensions) <- c("predicted", "actual")

Specifying cost matrix

matrix_dimensions
## $predicted
## [1] "no"  "yes"
## 
## $actual
## [1] "no"  "yes"

Next, we assign the penalty for each type of error by supplying four values to fill the matrix. Because R fills a matrix column by column, the values must be supplied in a specific order:

  • Predicted no, actual no

  • Predicted yes, actual no

  • Predicted no, actual yes

  • Predicted yes, actual yes
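
A quick way to check this ordering is to fill a matrix of the same shape with the positions 1 to 4. This is a minimal sketch; fill_order is an illustrative name, and the dimension list is rebuilt here so the block runs on its own.

```r
# Same dimension names as matrix_dimensions above
dims <- list(predicted = c("no", "yes"), actual = c("no", "yes"))

# matrix() fills column by column, so the cell values show the fill order
fill_order <- matrix(1:4, nrow = 2, dimnames = dims)
fill_order

# Position 2 is "predicted yes, actual no";
# position 3 is "predicted no, actual yes"
fill_order["yes", "no"]
fill_order["no", "yes"]
```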

Suppose we believe that a loan default costs the bank four times as much as a missed opportunity. Our penalty values could then be defined as:

# build the matrix
error_cost <- matrix(c(0, 1, 4, 0), 
                     nrow = 2, 
                     dimnames = matrix_dimensions)
error_cost
##          actual
## predicted no yes
##       no   0   4
##       yes  1   0

Applying the Cost Matrix

To see how this impacts classification, let’s apply it to our decision tree using the costs parameter of the C5.0() function.

credit_cost <- C5.0(credit_train[-17], credit_train$default,
                    costs = error_cost)
credit_cost_pred <- predict(credit_cost, credit_test)

CrossTable(credit_test$default, credit_cost_pred,
           prop.chisq = FALSE, 
           prop.c = FALSE, 
           prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        37 |        30 |        67 | 
##                |     0.370 |     0.300 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |         7 |        26 |        33 | 
##                |     0.070 |     0.260 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        44 |        56 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 
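
Compared with the boosted model, the cost-sensitive model makes fewer false negatives (7 instead of 13) but many more false positives (30 instead of 5). One way to compare the two is to weight each error by its penalty. The sketch below re-enters the counts from the two cross tables by hand; the helper total_cost() and the object names are illustrative assumptions, not part of the C50 or gmodels packages.

```r
# Counts copied from the two CrossTable outputs
# (rows = actual default, columns = predicted default)
dims <- list(actual = c("no", "yes"), predicted = c("no", "yes"))
conf_boost <- matrix(c(62, 13, 5, 20), nrow = 2, dimnames = dims)
conf_cost  <- matrix(c(37, 7, 30, 26), nrow = 2, dimnames = dims)

# Same cost matrix as above (rows = predicted, columns = actual)
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = list(predicted = c("no", "yes"),
                                     actual = c("no", "yes")))

# Hypothetical helper: transpose the table so its layout matches the
# cost matrix, then sum count * penalty over all cells
total_cost <- function(conf, cost) sum(t(conf) * cost)

# Share of actual defaults that each model missed
fn_rate_boost <- conf_boost["yes", "no"] / sum(conf_boost["yes", ])
fn_rate_cost  <- conf_cost["yes", "no"] / sum(conf_cost["yes", ])

total_cost(conf_boost, error_cost)  # 5 * 1 + 13 * 4
total_cost(conf_cost, error_cost)   # 30 * 1 + 7 * 4
```

Whether this trade-off is worthwhile depends on how well the 4-to-1 penalty reflects the bank's real costs.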

Regression Trees and Model Trees

Trees for numeric prediction fall into two categories.

  • Regression trees were introduced in the 1980s as part of the seminal Classification and Regression Tree (CART) algorithm.

    • Despite the name, regression trees do not use linear regression methods as described earlier in this chapter; instead, they make predictions based on the average value of the examples that reach a leaf.
  • Model trees are grown in much the same way as regression trees, but at each leaf, a multiple linear regression model is built from the examples reaching that node.
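
The difference between the two leaf types can be sketched with a handful of made-up examples that are assumed to have reached the same leaf. The data values and object names below are invented for illustration, and real model tree implementations add further steps that are not shown here.

```r
# Hypothetical examples that reached one leaf (values invented)
leaf <- data.frame(
  alcohol = c(9.0, 9.5, 10.0, 10.5),
  quality = c(5, 5, 6, 6)
)

# Regression tree: every case in the leaf receives the leaf average
regression_tree_pred <- mean(leaf$quality)

# Model tree: a linear model is fitted only to the leaf's examples,
# so predictions can still vary within the leaf
leaf_model <- lm(quality ~ alcohol, data = leaf)
model_tree_pred <- predict(leaf_model,
                           newdata = data.frame(alcohol = c(9.0, 10.5)))

regression_tree_pred
model_tree_pred
```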

Adding Regression to Trees

Advantages compared to Linear regression

  • Although regression methods are typically the first choice for numeric prediction tasks, regression trees or model trees may be a better fit for tasks with many features or with complex, non-linear relationships between the features and the outcome.

  • Regression also makes assumptions about how numeric data are distributed, and these assumptions are often violated in real-world data.

Case Study - Estimating Wine Quality

Winemaking is a challenging and competitive business that offers the potential for great profit. However, there are numerous factors that contribute to the profitability of a winery. Variables such as weather, the growing environment, the bottling, manufacturing, bottle design, or even price point, can affect the customer’s perception of taste.

More recently, machine learning has been employed to assist with rating the quality of wine—a notoriously difficult task. A review written by a renowned wine critic often determines whether the product ends up on the top or bottom shelf.

Import Data

In this case study, we will use regression trees and model trees to create a system capable of mimicking expert ratings of wine. Computer-aided wine testing may therefore result in a better product as well as more objective, consistent, and fair ratings.

We will first download and import the white wine data, which includes examples of white Vinho Verde wines from Portugal, one of the world’s leading wine-producing countries.

wine <- read.csv("winequality-white.csv")

The white wine data includes information on 11 chemical properties of 4,898 wine samples.

Exploring and Preparing the Data

# examine the wine data
str(wine)
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

# the distribution of quality ratings
hist(wine$quality)

# summary statistics of the wine data
summary(wine)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0        Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00      Median :134.0        Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4        Mean   :0.9940  
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000

Split into Train and Test Dataset

Our last step is to divide the data into training and testing datasets. Since the wine dataset was already sorted into random order, we can partition it by row position: the first 3,750 rows (roughly 75 percent) for training and the remaining 1,148 rows (roughly 25 percent) for testing.

wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
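
Splitting by row position is only safe because the rows are already in random order. If that could not be assumed, a random sample of row indices would be a safer sketch. The example below uses a stand-in data frame (wine_demo) so that it runs without the CSV file; the seed value is arbitrary.

```r
# Stand-in with the same number of rows as the wine data
wine_demo <- data.frame(id = 1:4898)

set.seed(123)  # arbitrary seed, only for reproducibility

# Draw 3,750 row indices at random without replacement
train_idx <- sample(nrow(wine_demo), size = 3750)

demo_train <- wine_demo[train_idx, , drop = FALSE]
demo_test  <- wine_demo[-train_idx, , drop = FALSE]

nrow(demo_train)
nrow(demo_test)
```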

Train a Regression Tree Model

We will use the rpart (recursive partitioning) package, which offers the most faithful implementation of regression trees as they were described by the CART team.

#install.packages("rpart")
library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)
summary(m.rpart)
## Call:
## rpart(formula = quality ~ ., data = wine_train)
##   n= 3750 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.17816211      0 1.0000000 1.0010391 0.02390494
## 2 0.04439109      1 0.8218379 0.8232523 0.02238387
## 3 0.02890893      2 0.7774468 0.7885622 0.02218078
## 4 0.01655575      3 0.7485379 0.7610867 0.02103390
## 5 0.01108600      4 0.7319821 0.7498581 0.02060005
## 6 0.01000000      5 0.7208961 0.7434623 0.02026718
## 
## Variable importance
##              alcohol              density            chlorides 
##                   38                   23                   12 
##     volatile.acidity total.sulfur.dioxide  free.sulfur.dioxide 
##                   12                    7                    6 
##            sulphates                   pH       residual.sugar 
##                    1                    1                    1 
## 
## Node number 1: 3750 observations,    complexity param=0.1781621
##   mean=5.886933, MSE=0.8373493 
##   left son=2 (2473 obs) right son=3 (1277 obs)
##   Primary splits:
##       alcohol              < 10.85    to the left,  improve=0.17816210, (0 missing)
##       density              < 0.992385 to the right, improve=0.11980970, (0 missing)
##       chlorides            < 0.0395   to the right, improve=0.08199995, (0 missing)
##       total.sulfur.dioxide < 153.5    to the right, improve=0.03875440, (0 missing)
##       free.sulfur.dioxide  < 11.75    to the left,  improve=0.03632119, (0 missing)
##   Surrogate splits:
##       density              < 0.99201  to the right, agree=0.869, adj=0.614, (0 split)
##       chlorides            < 0.0375   to the right, agree=0.773, adj=0.334, (0 split)
##       total.sulfur.dioxide < 102.5    to the right, agree=0.705, adj=0.132, (0 split)
##       sulphates            < 0.345    to the right, agree=0.670, adj=0.031, (0 split)
##       fixed.acidity        < 5.25     to the right, agree=0.662, adj=0.009, (0 split)
## 
## Node number 2: 2473 observations,    complexity param=0.04439109
##   mean=5.609381, MSE=0.6108623 
##   left son=4 (1406 obs) right son=5 (1067 obs)
##   Primary splits:
##       volatile.acidity    < 0.2425   to the right, improve=0.09227123, (0 missing)
##       free.sulfur.dioxide < 13.5     to the left,  improve=0.04177240, (0 missing)
##       alcohol             < 10.15    to the left,  improve=0.03313802, (0 missing)
##       citric.acid         < 0.205    to the left,  improve=0.02721200, (0 missing)
##       pH                  < 3.325    to the left,  improve=0.01860335, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 111.5    to the right, agree=0.610, adj=0.097, (0 split)
##       pH                   < 3.295    to the left,  agree=0.598, adj=0.067, (0 split)
##       alcohol              < 10.05    to the left,  agree=0.590, adj=0.049, (0 split)
##       sulphates            < 0.715    to the left,  agree=0.584, adj=0.037, (0 split)
##       residual.sugar       < 1.85     to the right, agree=0.581, adj=0.029, (0 split)
## 
## Node number 3: 1277 observations,    complexity param=0.02890893
##   mean=6.424432, MSE=0.8378682 
##   left son=6 (93 obs) right son=7 (1184 obs)
##   Primary splits:
##       free.sulfur.dioxide  < 11.5     to the left,  improve=0.08484051, (0 missing)
##       alcohol              < 11.85    to the left,  improve=0.06149941, (0 missing)
##       fixed.acidity        < 7.35     to the right, improve=0.04259695, (0 missing)
##       residual.sugar       < 1.275    to the left,  improve=0.02795662, (0 missing)
##       total.sulfur.dioxide < 67.5     to the left,  improve=0.02541719, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 48.5     to the left,  agree=0.937, adj=0.14, (0 split)
## 
## Node number 4: 1406 observations,    complexity param=0.011086
##   mean=5.40256, MSE=0.526423 
##   left son=8 (182 obs) right son=9 (1224 obs)
##   Primary splits:
##       volatile.acidity     < 0.4225   to the right, improve=0.04703189, (0 missing)
##       free.sulfur.dioxide  < 17.5     to the left,  improve=0.04607770, (0 missing)
##       total.sulfur.dioxide < 86.5     to the left,  improve=0.02894310, (0 missing)
##       alcohol              < 10.25    to the left,  improve=0.02890077, (0 missing)
##       chlorides            < 0.0455   to the right, improve=0.02096635, (0 missing)
##   Surrogate splits:
##       density       < 0.99107  to the left,  agree=0.874, adj=0.027, (0 split)
##       citric.acid   < 0.11     to the left,  agree=0.873, adj=0.022, (0 split)
##       fixed.acidity < 9.85     to the right, agree=0.873, adj=0.016, (0 split)
##       chlorides     < 0.206    to the right, agree=0.871, adj=0.005, (0 split)
## 
## Node number 5: 1067 observations
##   mean=5.881912, MSE=0.591491 
## 
## Node number 6: 93 observations
##   mean=5.473118, MSE=1.066482 
## 
## Node number 7: 1184 observations,    complexity param=0.01655575
##   mean=6.499155, MSE=0.7432425 
##   left son=14 (611 obs) right son=15 (573 obs)
##   Primary splits:
##       alcohol        < 11.85    to the left,  improve=0.05907511, (0 missing)
##       fixed.acidity  < 7.35     to the right, improve=0.04400660, (0 missing)
##       density        < 0.991395 to the right, improve=0.02522410, (0 missing)
##       residual.sugar < 1.225    to the left,  improve=0.02503936, (0 missing)
##       pH             < 3.245    to the left,  improve=0.02417936, (0 missing)
##   Surrogate splits:
##       density              < 0.991115 to the right, agree=0.710, adj=0.401, (0 split)
##       volatile.acidity     < 0.2675   to the left,  agree=0.665, adj=0.307, (0 split)
##       chlorides            < 0.0365   to the right, agree=0.631, adj=0.237, (0 split)
##       total.sulfur.dioxide < 126.5    to the right, agree=0.566, adj=0.103, (0 split)
##       residual.sugar       < 1.525    to the left,  agree=0.560, adj=0.091, (0 split)
## 
## Node number 8: 182 observations
##   mean=4.994505, MSE=0.5109588 
## 
## Node number 9: 1224 observations
##   mean=5.463235, MSE=0.5002823 
## 
## Node number 14: 611 observations
##   mean=6.296236, MSE=0.7322117 
## 
## Node number 15: 573 observations
##   mean=6.715532, MSE=0.6642788
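
To see how the summary translates into predictions, the primary splits and leaf means above can be written out as nested conditions. This is a hand transcription under the assumption that "to the left" sends values below the threshold to the left son and "to the right" sends them to the right son; predict_by_hand() is an illustrative name, not an rpart function.

```r
# Hand-coded version of the tree described in the summary output
predict_by_hand <- function(alcohol, volatile.acidity, free.sulfur.dioxide) {
  if (alcohol < 10.85) {                  # node 1 -> node 2
    if (volatile.acidity < 0.2425) {
      5.881912                            # node 5 (leaf)
    } else if (volatile.acidity < 0.4225) {
      5.463235                            # node 9 (leaf)
    } else {
      4.994505                            # node 8 (leaf)
    }
  } else {                                # node 1 -> node 3
    if (free.sulfur.dioxide < 11.5) {
      5.473118                            # node 6 (leaf)
    } else if (alcohol < 11.85) {
      6.296236                            # node 14 (leaf)
    } else {
      6.715532                            # node 15 (leaf)
    }
  }
}

# First wine shown by str(wine): alcohol 8.8, volatile acidity 0.27,
# free sulfur dioxide 45
predict_by_hand(8.8, 0.27, 45)
```

Because the tree has only six leaves, it can return only six distinct predicted values.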

Visualize the Decision Trees

Although the tree can be understood using only the preceding output, it is often more readily understood using visualization. The rpart.plot package by Stephen Milborrow provides an easy-to-use function for visualization.

#install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(m.rpart, digits = 3)

# a few adjustments to the diagram
rpart.plot(m.rpart, digits = 4, 
           fallen.leaves = TRUE, 
           type = 3, extra = 101)

Evaluate Model Performance

To use the regression tree model to make predictions on the test data, we use the predict() function.

# generate predictions for the testing dataset
p.rpart <- predict(m.rpart, wine_test)
# compare the distribution of predicted values vs. actual values
summary(p.rpart)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.995   5.463   5.882   5.999   6.296   6.716

summary(wine_test$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.848   6.000   8.000

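Comparing the two summaries, the predictions span a much narrower range (4.995 to 6.716) than the actual ratings (3 to 8). This suggests that the model is not correctly identifying the extreme cases, in particular the best and worst wines. Between the first and third quartiles, it may be doing well.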

Measuring Performance with Mean Absolute Error

Another way to think about the model’s performance is to consider how far, on average, its prediction was from the true value. This measurement is called the mean absolute error (MAE).

# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}
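
A small worked example shows what the function computes. The two vectors below are invented for illustration, and the function is repeated so the block runs on its own.

```r
# Same definition as above
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Hypothetical ratings and predictions
actual    <- c(6, 5, 7, 4)
predicted <- c(5.5, 5.0, 6.0, 5.5)

# Absolute errors are 0.5, 0, 1 and 1.5, so the mean is 0.75
toy_mae <- MAE(actual, predicted)
toy_mae
```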

On a quality scale from zero to 10, the mean absolute error computed below (about 0.57) suggests that the model is doing fairly well.

# mean absolute error between actual and predicted values
MAE(wine_test$quality, p.rpart)
## [1] 0.5732104
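
An MAE is easier to judge against a reference point, such as a model that always predicts the mean of the training ratings. The sketch below uses short made-up vectors as stand-ins; with the real data, wine_train$quality and wine_test$quality would take their place. No result for the wine data is claimed here.

```r
# Same definition as above
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Hypothetical stand-ins for wine_train$quality and wine_test$quality
train_quality <- c(5, 6, 6, 7, 6)
test_quality  <- c(5, 6, 7, 6)

# Baseline "model": predict the training mean for every test case
baseline_pred <- rep(mean(train_quality), length(test_quality))

baseline_mae <- MAE(test_quality, baseline_pred)
baseline_mae
```

A regression tree is only useful if its MAE is clearly below that of this baseline.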